
Characterizing the effects of missing data and evaluating imputation methods for chemical prioritization applications using ToxPi

Overview of attention for article published in BioData Mining, June 2018

Mentioned by
1 X user

Readers on
218 Mendeley
Title
Characterizing the effects of missing data and evaluating imputation methods for chemical prioritization applications using ToxPi
Published in
BioData Mining, June 2018
DOI 10.1186/s13040-018-0169-5
Pubmed ID
Authors

Kimberly T. To, Rebecca C. Fry, David M. Reif

Abstract

The Toxicological Priority Index (ToxPi) is a method for prioritization and profiling of chemicals that integrates data from diverse sources. However, individual data sources ("assays"), such as in vitro bioassays or in vivo study endpoints, often feature sections of missing data, wherein subsets of chemicals have not been tested in all assays. In order to investigate the effects of missing data and recommend solutions, we designed simulation studies around high-throughput screening data generated by the ToxCast and Tox21 programs on chemicals highlighted by the Agency for Toxic Substances and Disease Registry's (ATSDR) Substance Priority List (SPL), which helps prioritize environmental research and remediation resources. Our simulations explored a wide range of scenarios concerning data (0-80% assay data missing per chemical), modeling (ToxPi models containing from 160-700 different assays), and imputation method (k-Nearest-Neighbor, Max, Mean, Min, Binomial, Local Least Squares, and Singular Value Decomposition). We find that most imputation methods result in significant changes to ToxPi score, except for datasets with a small number of assays. If we consider rank change conditional on these significant changes to ToxPi score, we find that ranks of chemicals in the minimum value imputation, SVD imputation, and kNN imputation sets are more sensitive to the score changes. We found that the choice of imputation strategy exerted significant influence over both scores and associated ranks, and the most sensitive scenarios were those involving fewer assays plus higher proportions of missing data. By characterizing the effects of missing data and the relative benefit of imputation approaches across real-world data scenarios, we can augment confidence in the robustness of decisions regarding the health and ecological effects of environmental chemicals.
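
As a rough illustration of the kind of comparison the abstract describes, the sketch below fills missing assay values with three of the simpler strategies it names (Mean, Min, Max) and checks how chemical ranks shift. Everything here is an assumption made for illustration: the data matrix is invented, imputation is done per assay column, and `toy_score` is a hypothetical stand-in (an unweighted mean of min-max scaled assays), not the actual ToxPi model or the authors' simulation code.

```python
import numpy as np

def impute(matrix, method):
    """Fill NaNs in each assay column with that column's mean, min, or max.

    Assumption: values are imputed per assay (column); the paper may differ.
    """
    stat = {"mean": np.nanmean, "min": np.nanmin, "max": np.nanmax}[method]
    col_values = stat(matrix, axis=0)          # one fill value per assay
    filled = matrix.copy()                     # leave the input untouched
    rows, cols = np.where(np.isnan(matrix))
    filled[rows, cols] = col_values[cols]
    return filled

def toy_score(matrix):
    """Hypothetical stand-in score: mean of min-max scaled assay columns."""
    lo = matrix.min(axis=0)
    span = matrix.max(axis=0) - lo
    span[span == 0] = 1.0                      # avoid division by zero
    return ((matrix - lo) / span).mean(axis=1)

def ranks(scores):
    """Rank chemicals so that rank 1 is the highest score."""
    order = np.argsort(-scores, kind="stable")
    result = np.empty_like(order)
    result[order] = np.arange(1, len(scores) + 1)
    return result

# Invented example: 4 chemicals (rows) x 3 assays (columns), NaN = not tested.
data = np.array([
    [0.9, np.nan, 0.2],
    [0.4, 0.5, np.nan],
    [0.1, 0.8, 0.7],
    [np.nan, 0.3, 0.6],
])

for method in ("mean", "min", "max"):
    print(method, ranks(toy_score(impute(data, method))))
```

Comparing the printed rank vectors across methods gives a small-scale sense of how the choice of imputation strategy can reorder chemicals, which is the sensitivity the study quantifies at much larger scale.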

X Demographics

The data shown below were collected from the profile of 1 X user who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 218 Mendeley readers of this research output.

Geographical breakdown

Country | Count | As %
Unknown | 218 | 100%

Demographic breakdown

Readers by professional status | Count | As %
Student > Ph. D. Student | 3 | 1%
Other | 2 | <1%
Student > Doctoral Student | 2 | <1%
Student > Bachelor | 2 | <1%
Researcher | 2 | <1%
Other | 3 | 1%
Unknown | 204 | 94%

Readers by discipline | Count | As %
Biochemistry, Genetics and Molecular Biology | 3 | 1%
Environmental Science | 2 | <1%
Pharmacology, Toxicology and Pharmaceutical Science | 2 | <1%
Mathematics | 1 | <1%
Nursing and Health Professions | 1 | <1%
Other | 2 | <1%
Unknown | 207 | 95%
Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 14 June 2018.
All research outputs: #16,479,026 of 24,250,928 outputs
Outputs from BioData Mining: #236 of 316 outputs
Outputs of similar age: #213,057 of 332,485 outputs
Outputs of similar age from BioData Mining: #6 of 6 outputs
Altmetric has tracked 24,250,928 research outputs across all sources so far. This one is in the 21st percentile – i.e., 21% of other outputs scored the same or lower than it.
So far Altmetric has tracked 316 research outputs from this source. They typically receive more attention than average, with a mean Attention Score of 7.6. This one is in the 17th percentile – i.e., 17% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 332,485 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 27th percentile – i.e., 27% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 6 others from the same source and published within six weeks on either side of this one.