Common use of Evaluation Data Clause in Contracts

Evaluation Data. Two types of evaluation data were used:

• For effort, person hours were counted. Because many of the tools (crawling, ▇▇▇▇▇) require significant computation, effort was measured not in machine time but in the time during which people were involved; machine time is usually much higher.

• For quality, human evaluation was used in addition to automatic scores (BLEU, with one reference translation). As only comparative (COMP) evaluation was done, the test persons had to decide which of two translation outputs was better, if either. Quality change was calculated as the number of improvements minus the number of deteriorations, divided by the total number of sentences (see the sketch below). This is a standard measure used in industrial MT comparisons.

For the evaluation, the Sisyphos-II tools made by Linguatec were used, and two evaluators were engaged. For the quality comparison, a test set of 1500 sentences from the automotive domain was used. The test set was taken from crawled automotive texts, so it contains spelling errors, wrong segmentation, etc.; no cleaning was done. None of those 1500 sentences appears in the training corpus. This test set was used for KFZ and UNK. For DCU system adaptation, 500 sentences from the test set were used as a development set, reducing the test set to 1000 sentences; this reduced test set was used for DCU and GLO.

Appears in 1 contract

Sources: Grant Agreement
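
The quality-change measure defined in the clause is simply (improvements − deteriorations) / total sentences. The following is a minimal sketch of that computation, not part of the agreement itself; the function name and the "better"/"worse"/"equal" judgment labels are illustrative assumptions.

```python
# Minimal sketch of the comparative (COMP) quality-change measure described
# above: (improvements - deteriorations) / total sentences.
# The function and label names are illustrative, not taken from the agreement.

def quality_change(judgments: list[str]) -> float:
    """Each judgment is 'better', 'worse', or 'equal' for one test sentence,
    comparing the new system's output against the baseline's."""
    improvements = judgments.count("better")
    deteriorations = judgments.count("worse")
    return (improvements - deteriorations) / len(judgments)

# Example: of 1000 sentences, 300 improved, 120 deteriorated, 580 unchanged.
judgments = ["better"] * 300 + ["worse"] * 120 + ["equal"] * 580
print(f"Quality change: {quality_change(judgments):+.1%}")  # prints +18.0%
```

A positive value means the compared system improved more sentences than it degraded; ties ("equal") count only toward the denominator.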
