Using machine learning to parse breast pathology reports

Overview of attention for article published in Breast Cancer Research and Treatment, November 2016

About this Attention Score

  • In the top 25% of all research outputs scored by Altmetric
  • High Attention Score compared to outputs of the same age (85th percentile)
  • High Attention Score compared to outputs of the same age and source (83rd percentile)

Mentioned by

  • 1 news outlet
  • 6 X users

Citations

  • 99 Dimensions

Readers on

  • 169 Mendeley
  • 1 CiteULike

Title: Using machine learning to parse breast pathology reports
Published in: Breast Cancer Research and Treatment, November 2016
DOI: 10.1007/s10549-016-4035-1
Pubmed ID
Authors

Adam Yala, Regina Barzilay, Laura Salama, Molly Griffin, Grace Sollender, Aditya Bardia, Constance Lehman, Julliette M. Buckley, Suzanne B. Coopey, Fernanda Polubriaginof, Judy E. Garber, Barbara L. Smith, Michele A. Gadd, Michelle C. Specht, Thomas M. Gudewicz, Anthony J. Guidi, Alphonse Taghian, Kevin S. Hughes

Abstract

Extracting information from electronic medical records is a time-consuming and expensive process when done manually. Rule-based and machine learning techniques are two approaches to solving this problem. In this study, we trained a machine learning model on pathology reports to extract pertinent tumor characteristics, which enabled us to create a large database of attribute-searchable pathology reports. This database can be used to identify cohorts of patients with characteristics of interest. We collected a total of 91,505 breast pathology reports from three Partners hospitals: Massachusetts General Hospital, Brigham and Women's Hospital, and Newton-Wellesley Hospital, covering the period from 1978 to 2016. We trained our system with annotations from two datasets, consisting of 6295 and 10,841 manually annotated reports. The system extracts 20 separate categories of information, including atypia types and various tumor characteristics such as receptors. We also report a learning curve analysis to show how much annotation our model needs to perform reasonably. The model accuracy was tested on 500 reports that did not overlap with the training set. The model achieved an accuracy of 90% for correctly parsing all carcinoma and atypia categories for a given patient. The average accuracy for individual categories was 97%. Using this classifier, we created a database of 91,505 parsed pathology reports. Our learning curve analysis shows that the model can achieve reasonable results even when trained on a few annotations. We developed a user-friendly interface to the database that allows physicians to easily identify patients with target characteristics and export the matching cohort. This model has the potential to reduce the effort required for analyzing large amounts of data from medical records, and to minimize the cost and time required to glean scientific insight from these data.
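
The abstract reports two different accuracy figures: 97% averaged over individual categories, and 90% for getting every carcinoma and atypia category right for a given patient. The sketch below illustrates, on made-up toy data, why the second number is typically lower than the first. The category names (`carcinoma`, `atypia`, `er_status`), the label values, and both helper functions are hypothetical and are not taken from the paper; this is not the authors' evaluation code.

```python
from typing import Dict, List

Labels = Dict[str, str]  # one report: category name -> extracted value


def per_category_accuracy(gold: List[Labels], pred: List[Labels]) -> Dict[str, float]:
    """Fraction of reports where each single category was extracted correctly."""
    categories = gold[0].keys()
    return {
        c: sum(g[c] == p[c] for g, p in zip(gold, pred)) / len(gold)
        for c in categories
    }


def all_correct_accuracy(gold: List[Labels], pred: List[Labels]) -> float:
    """Fraction of reports where every category was extracted correctly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy annotations (hypothetical categories and values, for illustration only).
gold = [
    {"carcinoma": "present", "atypia": "absent", "er_status": "positive"},
    {"carcinoma": "absent", "atypia": "present", "er_status": "n/a"},
    {"carcinoma": "present", "atypia": "absent", "er_status": "negative"},
    {"carcinoma": "absent", "atypia": "absent", "er_status": "n/a"},
]
pred = [
    {"carcinoma": "present", "atypia": "absent", "er_status": "positive"},
    {"carcinoma": "absent", "atypia": "absent", "er_status": "n/a"},        # atypia missed
    {"carcinoma": "present", "atypia": "absent", "er_status": "positive"},  # receptor wrong
    {"carcinoma": "absent", "atypia": "absent", "er_status": "n/a"},
]

per_cat = per_category_accuracy(gold, pred)
mean_acc = sum(per_cat.values()) / len(per_cat)
exact = all_correct_accuracy(gold, pred)

print("per-category:", per_cat)
print("mean per-category accuracy:", round(mean_acc, 3))
print("all-categories-correct accuracy:", exact)
```

In this toy set, two single-category mistakes land on two different reports, so the all-categories-correct figure drops to half while the per-category average stays above 80%. The same effect, on a much smaller scale, is consistent with the gap between the 97% and 90% figures quoted in the abstract.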

X Demographics

The data shown below were collected from the profiles of 6 X users who shared this research output.

Mendeley readers

The data shown below were compiled from readership statistics for 169 Mendeley readers of this research output.

Geographical breakdown

Country    Count    As %
Unknown    169      100%

Demographic breakdown

Readers by professional status    Count    As %
Student > Ph. D. Student          29       17%
Researcher                        26       15%
Student > Master                  16       9%
Other                             14       8%
Student > Bachelor                13       8%
Other                             30       18%
Unknown                           41       24%

Readers by discipline                           Count    As %
Medicine and Dentistry                          39       23%
Computer Science                                30       18%
Biochemistry, Genetics and Molecular Biology    14       8%
Engineering                                     11       7%
Agricultural and Biological Sciences            6        4%
Other                                           20       12%
Unknown                                         49       29%

Attention Score in Context

This research output has an Altmetric Attention Score of 12. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 07 May 2018.

  • All research outputs: #2,474,923 of 22,903,988 outputs
  • Outputs from Breast Cancer Research and Treatment: #367 of 4,662 outputs
  • Outputs of similar age: #45,263 of 312,900 outputs
  • Outputs of similar age from Breast Cancer Research and Treatment: #10 of 60 outputs

Altmetric has tracked 22,903,988 research outputs across all sources so far. Compared to these, this one has done well and is in the 88th percentile: it's in the top 25% of all research outputs ever tracked by Altmetric.
So far Altmetric has tracked 4,662 research outputs from this source. They typically receive a little more attention than average, with a mean Attention Score of 7.2. This one has done particularly well, scoring higher than 91% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 312,900 tracked outputs that were published within six weeks on either side of this one in any source. This one has done well, scoring higher than 85% of its contemporaries.
We're also able to compare this research output to 60 others from the same source and published within six weeks on either side of this one. This one has done well, scoring higher than 83% of its contemporaries.
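
The rank and percentile figures above can be related with a rough calculation. The sketch below assumes a simple definition: the share of tracked outputs that rank below this one, with rank 1 meaning the most attention. The function `percentile_from_rank` is hypothetical; the page does not state Altmetric's exact method, so this is only an approximation.

```python
def percentile_from_rank(rank: int, total: int) -> float:
    """Percent of outputs ranked below `rank` (rank 1 = highest attention).

    Assumed definition for illustration; not Altmetric's documented formula.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * (total - rank) / total


# (rank, total) pairs as listed on this page.
rankings = {
    "all research outputs": (2_474_923, 22_903_988),
    "same source": (367, 4_662),
    "similar age": (45_263, 312_900),
    "similar age, same source": (10, 60),
}

for label, (rank, total) in rankings.items():
    print(f"{label}: {percentile_from_rank(rank, total):.1f}")
```

Under this assumed definition, the two age-matched comparisons come out at roughly 85% and 83%, in line with the figures quoted above. The all-outputs and same-source estimates land about one point higher than the reported 88th percentile and 91%, so the actual calculation evidently differs in some detail that this page does not describe.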