Report for: Domain adaptation of statistical machine translation with domain-focused web crawling

Title	Domain adaptation of statistical machine translation with domain-focused web crawling
Published in	Language Resources and Evaluation, December 2014
DOI	10.1007/s10579-014-9282-3
Pubmed ID	26120290
Authors	Pavel Pecina, Antonio Toral, Vassilis Papavassiliou, Prokopis Prokopidis, Aleš Tamchyna, Andy Way, Josef van Genabith
Abstract	In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English-French and English-Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.


X Demographics

The data shown below were collected from the profiles of 2 X users who shared this research output.

Geographical breakdown

Country	Count	As %
Netherlands	1	50%
Unknown	1	50%

Demographic breakdown

Type	Count	As %
Members of the public	2	100%

Mendeley readers

The data shown below were compiled from readership statistics for 37 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Spain	1	3%
Cyprus	1	3%
Unknown	35	95%

Demographic breakdown

Readers by professional status	Count	As %
Student > Ph. D. Student	7	19%
Student > Master	5	14%
Researcher	4	11%
Student > Bachelor	4	11%
Other	3	8%
Other	10	27%
Unknown	4	11%

Readers by discipline	Count	As %
Computer Science	17	46%
Linguistics	8	22%
Business, Management and Accounting	2	5%
Environmental Science	1	3%
Mathematics	1	3%
Other	4	11%
Unknown	4	11%

Attention Score in Context

This research output has an Altmetric Attention Score of 3. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 05 September 2015.

Comparison group	Rank	Out of
All research outputs	#7,917,073	23,854,458 outputs
Outputs from Language Resources and Evaluation	#92	331 outputs
Outputs of similar age	#110,751	367,388 outputs
Outputs of similar age from Language Resources and Evaluation	(not shown)	2 outputs

Altmetric has tracked 23,854,458 research outputs across all sources so far. This one is in the 44th percentile – i.e., 44% of other outputs scored the same or lower than it.

So far Altmetric has tracked 331 research outputs from this source. They receive a mean Attention Score of 3.8. This one has gotten more attention than average, scoring higher than 54% of its peers.

Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 367,388 tracked outputs that were published within six weeks on either side of this one in any source. This one has gotten more attention than average, scoring higher than 53% of its contemporaries.

We're also able to compare this research output to 2 others from the same source and published within six weeks on either side of this one.
