Domain adaptation of statistical machine translation with domain-focused web crawling

Overview of attention for article published in Language Resources and Evaluation, December 2014

About this Attention Score

  • Above-average Attention Score compared to outputs of the same age (53rd percentile)

Mentioned by

2 X users
1 Google+ user

Citations

11 Dimensions

Readers on

37 Mendeley
Title: Domain adaptation of statistical machine translation with domain-focused web crawling
Published in: Language Resources and Evaluation, December 2014
DOI: 10.1007/s10579-014-9282-3
Authors: Pavel Pecina, Antonio Toral, Vassilis Papavassiliou, Prokopis Prokopidis, Aleš Tamchyna, Andy Way, Josef van Genabith

Abstract

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English-French and English-Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.

X Demographics

The data shown below were collected from the profiles of 2 X users who shared this research output.

Mendeley readers

The data shown below were compiled from readership statistics for 37 Mendeley readers of this research output.

Geographical breakdown

Country | Count | As %
Spain | 1 | 3%
Cyprus | 1 | 3%
Unknown | 35 | 95%

Demographic breakdown

Readers by professional status | Count | As %
Student > Ph. D. Student | 7 | 19%
Student > Master | 5 | 14%
Researcher | 4 | 11%
Student > Bachelor | 4 | 11%
Other | 3 | 8%
Other | 10 | 27%
Unknown | 4 | 11%

Readers by discipline | Count | As %
Computer Science | 17 | 46%
Linguistics | 8 | 22%
Business, Management and Accounting | 2 | 5%
Environmental Science | 1 | 3%
Mathematics | 1 | 3%
Other | 4 | 11%
Unknown | 4 | 11%
Attention Score in Context

This research output has an Altmetric Attention Score of 3. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 05 September 2015.
All research outputs: #7,917,073 of 23,854,458 outputs
Outputs from Language Resources and Evaluation: #92 of 331 outputs
Outputs of similar age: #110,751 of 367,388 outputs
Outputs of similar age from Language Resources and Evaluation: #2 of 2 outputs
Altmetric has tracked 23,854,458 research outputs across all sources so far. This one is in the 44th percentile – i.e., 44% of other outputs scored the same or lower than it.
So far Altmetric has tracked 331 research outputs from this source. They receive a mean Attention Score of 3.8. This one has gotten more attention than average, scoring higher than 54% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 367,388 tracked outputs that were published within six weeks on either side of this one in any source. This one has gotten more attention than average, scoring higher than 53% of its contemporaries.
We're also able to compare this research output to 2 others from the same source and published within six weeks on either side of this one.
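
The age-adjusted comparison described above can be sketched in a few lines. This is a minimal illustration, not Altmetric's actual implementation: the function name `percentile_in_window`, the `(publication_date, score)` tuple layout, and the sample data are all assumptions made for the example. It uses the "scored the same or lower" definition of percentile quoted earlier and the six-week window on either side of the publication date.

```python
from datetime import date, timedelta


def percentile_in_window(score, published, others, window_weeks=6):
    """Percentage of contemporaries whose score is the same or lower.

    `others` is a list of (publication_date, score) tuples for the other
    outputs (a hypothetical layout chosen for this sketch). Only outputs
    published within `window_weeks` on either side of `published` count.
    Returns None when no output falls inside the window.
    """
    window = timedelta(weeks=window_weeks)
    peers = [s for d, s in others if abs(d - published) <= window]
    if not peers:
        return None
    same_or_lower = sum(1 for s in peers if s <= score)
    return 100.0 * same_or_lower / len(peers)


# Invented sample data: three outputs inside the window, one far outside it.
sample_others = [
    (date(2014, 12, 1), 1.0),
    (date(2014, 12, 20), 3.0),
    (date(2015, 1, 5), 10.0),
    (date(2014, 6, 1), 0.5),
]
sample_percentile = percentile_in_window(3.0, date(2014, 12, 10), sample_others)
```

Restricting the peer set by publication date before ranking is what removes the advantage older outputs gain from having had more time to accumulate mentions.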