Modeling Statistical Properties of Written Text

Overview of attention for article published in PLOS ONE, April 2009

Mentioned by

2 X users
1 Facebook page

Citations

90 Dimensions

Readers on

121 Mendeley
5 CiteULike
Title
Modeling Statistical Properties of Written Text
Published in
PLOS ONE, April 2009
DOI
10.1371/journal.pone.0005372
Pubmed ID
Authors

M. Ángeles Serrano, Alessandro Flammini, Filippo Menczer

Abstract

Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the non trivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
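The two best-known regularities named in the abstract can be measured directly on any tokenized text. The following is a minimal sketch, not the authors' generative model: it only computes the rank-ordered word frequencies that Zipf's law describes and the vocabulary growth curve that Heaps' law describes. The function names and the sample sentence are illustrative assumptions.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first).

    Zipf's law concerns how these counts decay as the rank increases.
    """
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def vocabulary_growth(tokens):
    """Return the number of distinct words seen after each token.

    Heaps' law concerns the sublinear growth of this curve with text length.
    """
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Hypothetical toy input; a real analysis would use a large document collection.
sample = "the cat sat on the mat and the dog sat on the rug".split()
freqs = rank_frequency(sample)
growth = vocabulary_growth(sample)
print(freqs)
print(growth)
```

On a corpus of realistic size, plotting `freqs` against rank and `growth` against token position on log-log axes is the usual way to inspect both patterns.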

X Demographics

The data in this section were collected from the profiles of the 2 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 121 Mendeley readers of this research output.

Geographical breakdown

Country          Count   As %
Germany              4     3%
United States        4     3%
Switzerland          3     2%
Philippines          2     2%
China                2     2%
Vietnam              1    <1%
Australia            1    <1%
Italy                1    <1%
Argentina            1    <1%
Other                3     2%
Unknown             99    82%

Demographic breakdown

Readers by professional status          Count   As %
Student > Ph. D. Student                   26    21%
Researcher                                 26    21%
Student > Master                           16    13%
Professor                                   9     7%
Professor > Associate Professor             9     7%
Other                                      25    21%
Unknown                                    10     8%

Readers by discipline                   Count   As %
Computer Science                           31    26%
Physics and Astronomy                      17    14%
Social Sciences                            11     9%
Agricultural and Biological Sciences        9     7%
Linguistics                                 8     7%
Other                                      34    28%
Unknown                                    11     9%

Attention Score in Context

This research output has an Altmetric Attention Score of 2. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 15 February 2016.

All research outputs: #14,170,039 of 22,710,079 outputs
Outputs from PLOS ONE: #115,901 of 193,906 outputs
Outputs of similar age: #76,824 of 92,749 outputs
Outputs of similar age from PLOS ONE: #423 of 512 outputs

Altmetric has tracked 22,710,079 research outputs across all sources so far. This one is in the 35th percentile – i.e., 35% of other outputs scored the same or lower than it.
So far Altmetric has tracked 193,906 research outputs from this source. They typically receive a lot more attention than average, with a mean Attention Score of 15.0. This one is in the 36th percentile – i.e., 36% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 92,749 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 16th percentile – i.e., 16% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 512 others from the same source and published within six weeks on either side of this one. This one is in the 16th percentile – i.e., 16% of its contemporaries scored the same or lower than it.
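
The percentile wording used above ("35% of other outputs scored the same or lower than it") can be read as a simple counting rule. The sketch below illustrates that reading only; it is an assumption about how such a figure could be computed, not Altmetric's documented method, and the function name and scores are hypothetical.

```python
def percentile_same_or_lower(score, other_scores):
    """Percentage of other outputs whose score is the same as or lower than `score`."""
    if not other_scores:
        raise ValueError("other_scores must not be empty")
    same_or_lower = sum(1 for s in other_scores if s <= score)
    return 100.0 * same_or_lower / len(other_scores)


# Hypothetical Attention Scores for ten other outputs (not real Altmetric data).
hypothetical_scores = [0, 1, 1, 2, 3, 5, 8, 15, 40, 120]
pct = percentile_same_or_lower(2, hypothetical_scores)
print(pct)
```

Because tied scores count as "the same or lower", a percentile defined this way need not match the value implied by the rank alone.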