Future Directions. While the main findings of this thesis are an important first step in understanding the relationship between syntactic annotation quality and machine translation performance, the results presented here raise some additional questions that bear further investigation.

One slightly curious finding has to do with the disparity between tuning and test performance. As we predicted, the new syntactic annotations resulted in higher BLEU scores on both the tuning and test data. However, in general, the tuning set improvements were quite a bit higher than those on the test set. This in itself is a relatively banal finding (optimizer overfitting is hardly an uncommon phenomenon), but Table 4.3 shows a different pattern of results from experiments where we improved the word alignments and held the trees constant. In these word alignment experiments, which are otherwise identical to the other MT experiments in this thesis, we saw tuning and test set improvements that were much closer together in magnitude, suggesting that the overfitting effect is stronger when parses are improved than when word alignments are improved. These results alone are probably not enough to be conclusive, but given the large recent interest in methods for overcoming MT optimizer instability (see, for example, Chiang et al., 2008; Macherey et al., 2008; Pauls et al., 2009; Clark et al., 2011), it seems worth investigating the interaction between parameter optimization and syntactic MT specifically.

The final result of this thesis was the somewhat disappointing finding that the two basic approaches presented (statistical modeling to improve parser performance and tree transformations to improve agreement with alignments) do not automatically stack together to achieve even stronger MT performance. One possible explanation is that the agreement score metric we defined is designed specifically for isolating problems that appear in monolingual parses. When the starting point is parses that were generated from a joint model, the initial agreement is already much higher, so perhaps the remaining disagreements that are relevant to MT performance are not adequately captured by continuing to optimize agreement score. However, we do see in Table 5.7 that the annotations with the highest agreement score (Joint PA + Transformation) also yield the highest BLEU score on the tuning set (albeit with only a very slight advantage over the next-highest score, which could also be due to chance), so another possibility is that we are simply running up against the limits of the parameter optimization, as suggested above. A related possibility is that the parameterization of the MT system is not rich enough to fully capture the effects of higher agreement, so perhaps incorporating additional feature engineering à la Chiang et al. (2009) would be beneficial.

Finally, one simplifying assumption made throughout this work is that syntactic information is provided to the MT system via one-best syntactic parse trees. We have shown that by improving the ways in which parsing ambiguities are resolved, we can achieve better MT performance. As discussed in Chapter 1, though, there has also been a great deal of investigation into an alternative approach of simply maintaining a higher degree of ambiguity through the MT pipeline, especially by retaining entire parse forests rather than just one-best trees (Mi and Huang, 2008; Zhang et al., 2009).
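To make the notion of tree/alignment agreement discussed above concrete, here is a minimal illustrative sketch, not the metric actually defined in this thesis: it scores a source-side parse against a word alignment by the fraction of constituent spans whose aligned target words form a closed block. All names and the toy data are assumptions made for illustration.

    # Hypothetical sketch: score a source parse against a word alignment
    # by the fraction of constituent spans whose aligned target words form
    # a closed block. A simplified stand-in, not the agreement score
    # actually defined in this thesis.

    def consistent(span, alignment):
        """True if no source word outside [i, j) aligns into the target
        block covered by the span's own alignment links."""
        i, j = span
        inside = {t for s, t in alignment if i <= s < j}
        if not inside:
            return True  # unaligned spans are vacuously consistent
        lo, hi = min(inside), max(inside)
        return all(not (lo <= t <= hi)
                   for s, t in alignment if not (i <= s < j))

    def agreement_score(spans, alignment):
        """Fraction of constituent spans consistent with the alignment."""
        return sum(consistent(sp, alignment) for sp in spans) / len(spans)

    # Toy example: constituent spans of a 4-word source sentence and a
    # (source index, target index) alignment containing a crossing pair.
    spans = [(0, 4), (0, 2), (2, 4), (0, 1), (1, 2), (2, 3), (3, 4)]
    alignment = {(0, 0), (1, 2), (2, 1), (3, 3)}
    print(agreement_score(spans, alignment))  # 5/7: the crossing links
                                              # break (0, 2) and (2, 4)

Under a scoring scheme like this, the crossing alignment links in the toy example penalize exactly the two spans that a synchronous rule extractor could not use, which is the intuition behind rewarding trees that agree with the alignment.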
A natural extension of the current work is to see whether the techniques described here can be applied in a more highly ambiguous setting, for example, by learning to apply contextually appropriate transformations that operate on structures within a parse forest rather than on a single tree.
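One hypothetical shape such an extension could take is sketched below: a packed forest is represented as nodes carrying alternative child sequences (hyperedges), and a transformation is applied to each alternative independently, so that ambiguity is preserved rather than resolved up front. The data structure and the toy rule are illustrative assumptions, not an implementation from this thesis.

    # Hypothetical sketch of the proposed extension: a packed forest whose
    # nodes carry alternative child sequences (hyperedges), with a
    # transformation applied to each alternative independently, so the
    # forest's ambiguity is preserved.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        label: str                 # constituent label, e.g. "NP"
        span: tuple                # source span (i, j)
        edges: list = field(default_factory=list)  # list of child lists

    def transform_forest(root, rule):
        """Rewrite every hyperedge in place; rule(node, children) returns
        a possibly modified child list for that one derivation."""
        seen, stack = set(), [root]
        while stack:
            node = stack.pop()
            if id(node) in seen:
                continue
            seen.add(id(node))
            node.edges = [rule(node, kids) for kids in node.edges]
            for kids in node.edges:
                stack.extend(kids)
        return root

    # Toy context-sensitive rule: collapse a unary X -> X chain inside a
    # single derivation, leaving competing derivations of the node alone.
    def collapse_unary(node, kids):
        if len(kids) == 1 and kids[0].label == node.label and kids[0].edges:
            return kids[0].edges[0]
        return kids

Because the rule fires once per hyperedge, a contextually appropriate transformation can rewrite one derivation of a node while leaving a competing derivation untouched, which is precisely the flexibility that a one-best tree cannot provide.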