Biological research frequently involves the study of phenotyping data. Many of these studies focus on rare-event categorical data; in functional genomics, they typically examine the presence or absence of an abnormal phenotype. With growing interest in the role of sex, there is also a need to assess phenotypes for sexual dimorphism. Identifying abnormal phenotypes for downstream research is challenging because sample sizes are small, the events are rare, and many variables are monitored simultaneously, which creates a multiple testing problem. Here we develop a statistical pipeline to assess statistical and biological significance whilst managing the multiple testing problem. We propose a two-step pipeline that first tests for a treatment effect (genotype, in our case example) and then tests for an interaction with sex. We compare multiple statistical methods and use simulations to investigate control of the type I error rate and power. To maximize power whilst addressing the multiple testing issue, we implement filters that remove datasets in which the hypotheses to be tested cannot achieve significance. A motivating case study using a large-scale, high-throughput mouse phenotyping dataset from the Wellcome Trust Sanger Institute Mouse Genetics Project, in which the treatment is a gene ablation, demonstrates the benefits of the new pipeline for the downstream biological calls.
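
To make the filtering idea concrete, the following minimal sketch shows one way a dataset could be screened before testing. It assumes a 2x2 table (group by abnormal/normal) analysed with a two-sided Fisher's exact test, which may differ from the tests and filters actually used in the pipeline. The function names (`fisher_two_sided`, `min_achievable_p`, `can_reach_significance`) and the threshold `alpha=0.05` are hypothetical and chosen only for illustration. With the margins fixed, the sketch enumerates every possible table and asks whether even the most extreme one could yield a p-value at or below the threshold.

```python
from math import comb


def hypergeom_pmf(k, n_treated, n_control, n_abnormal):
    """Probability of k abnormal animals in the treated group, margins fixed."""
    return (comb(n_treated, k) * comb(n_control, n_abnormal - k)
            / comb(n_treated + n_control, n_abnormal))


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows are treated/control, columns are abnormal/normal. The p-value sums
    the probabilities of all tables no more likely than the observed one.
    """
    n_treated, n_control, n_abnormal = a + b, c + d, a + c
    lo = max(0, n_abnormal - n_control)
    hi = min(n_treated, n_abnormal)
    probs = [hypergeom_pmf(k, n_treated, n_control, n_abnormal)
             for k in range(lo, hi + 1)]
    p_obs = probs[a - lo]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))


def min_achievable_p(n_treated, n_control, n_abnormal):
    """Smallest p-value any table with these margins could produce."""
    lo = max(0, n_abnormal - n_control)
    hi = min(n_treated, n_abnormal)
    return min(
        fisher_two_sided(k, n_treated - k,
                         n_abnormal - k, n_control - (n_abnormal - k))
        for k in range(lo, hi + 1)
    )


def can_reach_significance(n_treated, n_control, n_abnormal, alpha=0.05):
    """Filter rule (assumed): keep a dataset only if significance is attainable."""
    return min_achievable_p(n_treated, n_control, n_abnormal) <= alpha
```

For example, under these assumptions, 7 treated and 100 control animals with a single abnormal call in total give a minimum achievable p-value of 7/107, so `can_reach_significance(7, 100, 1)` is false and the dataset would be dropped before multiple testing correction; with two abnormal calls in total the same group sizes pass the filter.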