Common use of the User Requirements Clause in Contracts

User Requirements. Specifically, we focus on performance (processing speed), feedback provided about the crawl progress, error handling, and quality of the documentation (Table 3).

The BootCat toolkit (▇▇▇▇▇▇ et al., 2004) is a well-known suite of Perl scripts for bootstrapping specialized language corpora from the web. BootCat initially creates random tuples from a seed term list and runs a query for each tuple (on the Yahoo! search engine). After keeping the first 10 results from each query, it constructs a URL list, downloads the corresponding web pages and removes boilerplate. Two software applications that integrate the BootCat tools for constructing simple web corpora are WebBootCat (▇▇▇▇▇▇ et al., 2006) and the BootCat front-end, a web service front-end and a graphical user interface to the core tools, respectively. Although these applications are primarily designed for end users, they can be employed for certain initial (offline) tasks, e.g. the construction and testing of a seed URL list in a specific domain. In addition, a modified version of the BootCat toolkit can be used as an alternative tool for the acquisition of monolingual corpora in specific domains.

Heritrix (▇▇▇▇ et al., 2004) is an open-source, extensible web crawler. It is implemented in Java, and its main interface is accessible through a web browser. Heritrix is one of the most configurable tools for crawling. To the best of our knowledge, it does not include functions for focused crawling based on a predefined list of terms describing a specific topic.

Combine (Ardo, 2005) is an open system, implemented in Perl, for crawling Internet resources. It is based on the combination of a general web crawler with an automated topic classifier. Classification is provided by a focus filter that uses a topic definition implemented as a list of terms describing the topic. One critical issue is that Combine, in its current implementation, does not sort and follow the most promising URLs in the frontier (i.e. links within pages with high relevance to the topic). In other words, it is a breadth-first crawler followed by a topic classifier. We believe that a modification of Combine's strategy to include sorting of the URLs in the frontier would be beneficial for Panacea purposes. It is worth mentioning that Combine: a) is an active project, b) includes modules for language identification and topic classification, c) is modular and open-source, and d) allows monitoring of the crawl progress by logging its actions in a relational database.

HTTrack (Roche, 2007) is essentially a web site copier. It downloads web sites from the Internet to a local directory, recursively building all directories and storing HTML pages, images, and all other files from the server. It is fully configurable and supports filters and parameters that guide the harvesting. For example, the filter combination "-* +*/el/*.pdf +*/en/*.pdf" downloads only PDF files from URLs that contain the string "/el/" or "/en/".

HTTrack is a component of Bitextor (Esplà-Gomis, 2010), which, to the best of our knowledge, is the only open-source application for building bilingual comparable corpora from multilingual websites. It uses HTTrack as described above and makes two assumptions: i) candidate parallel pages should be under the same web domain, and ii) they should have similar HTML structure. Bitextor extracts four features from each downloaded page: file size, length of the plain text, tag structure, and the list of numbers in the web page. For every candidate pair of pages in different languages, it computes the relative differences of the first two features and the edit distances of the other two. Pairs are then classified as bitexts (3) or not, based on a comparison of the computed values against predefined thresholds. After comparing the positions of text blocks (i.e. text between HTML tags) in each bitext, the tool stores those pairs of text segments that are strong candidates for being translations of each other. The main shortcoming might be that the term "segment" does not correspond to any linguistic unit: our experiments showed that segments are actually text blocks between certain predefined HTML tags, so a segment could be a sentence, a paragraph, a part of a sentence, etc.

(3) In the Bitextor terminology, bitexts are pairs of files which contain approximately the same text in two different languages.

Input, output and functionality of each tool:

BootCat toolkit
  Input: list of terms in a specific topic.
  Output: XML file that contains a monolingual corpus and metadata (url, date, size in words, domain).
  Functionality: toolkit including scripts to bootstrap specialized corpora and terms from the web.

Heritrix
  Input: seed URL list.
  Output: HTML files, multiple log files.
  Functionality: multithreaded, breadth-first web crawler. All parameters can be configured via a web-based user interface.

Combine
  Input: 1) seed URL list; 2) list of terms in a specific topic.
  Output: 1) XML file including metadata (date, url, topic, language, etc.); 2) HTML files in UTF-8.
  Functionality: multithreaded, best-first web crawler. Configurations can be set via an XML file.

HTTrack
  Input: seed URL list.
  Output: a mirror directory of each downloaded web site.
  Functionality: multithreaded web copier. Filters and parameters can be set via a GUI.

Bitextor
  Input: seed URL list.
  Output: 1) log file in which generated bitexts are recorded; 2) HTML files in UTF-8; 3) TMX file containing pairs of text segments (i.e. parts of text between successive HTML tags) that are strong candidates for being translations of each other.
  Functionality: an automatic bitext generator integrating HTTrack. All parameters for filtering and comparing can be configured via an XML file.

Table 2: licensing, languages supported and availability as a web service (WS).

BootCat toolkit
  License: GPL.
  Languages: several (including all Panacea languages).
  Available as WS: WebBootCat.

Heritrix
  License: LGPL.
  Languages: not applicable.
  Available as WS: web GUI.

Combine
  License: GPL.
  Languages: uses the Lingua::Identify Perl module for language identification; 33 languages supported (not Greek).
  Available as WS: no.

HTTrack
  License: GPL.
  Languages: not applicable.
  Available as WS: no.

Bitextor
  License: GPL.
  Languages: the integrated LibTextCat library contains fingerprints for 69 languages.
  Available as WS: no.

Table 3: performance (4), feedback, error handling and documentation.

BootCat toolkit
  Performance: given 4 terms in the topic "Machine Translation", 41 pages were retrieved, resulting in a corpus of 144K words in 2.5 minutes.
  Feedback: progress bar (for WebBootCat).
  Error handling: fully integrated.
  Documentation: in progress.

Heritrix
  Performance: speed depends on configuration.
  Feedback: multiple log files (e.g. crawl path, filtering results), full report at the end.
  Error handling: fully integrated.
  Documentation: comprehensive (5). Large developer and user community.

Combine
  Performance: handles up to 200 URLs per minute. In an experiment on the topic "carnivorous plants", about 35% of all visited pages were judged relevant.
  Feedback: log table in an SQL database.
  Error handling: fully integrated.
  Documentation: comprehensive (6).

HTTrack
  Performance: not mentioned.
  Feedback: progress bar.
  Error handling: fully integrated.
  Documentation: comprehensive (7).

Bitextor
  Performance: results depend on the structure of each website. On the well-structured website of the Parliament of Canada, 99% precision and 85.33% recall were reported; the respective values on a heterogeneous web site were 86% and 61%.
  Feedback: log messages for each major step (e.g. downloading, comparing, generating bitexts); needs improvement.
  Error handling: needs improvement.
  Documentation: needs improvement.

(4) All performance reports are provided by developers or members of the developer groups for each toolkit.
(5) ▇▇▇▇://▇▇▇▇▇▇▇.▇▇▇▇▇▇▇.▇▇▇/articles/user_manual/index.html
(6) ▇▇▇▇://▇▇▇▇▇▇▇.▇▇.▇▇▇.▇▇/documentation/
(7) ▇▇▇▇://▇▇▇.▇▇▇▇▇▇▇.▇▇▇/html/index.html

Corpus clean-up and normalization involve removing irrelevant parts of downloaded web pages in order to produce clean monolingual and bilingual data in a uniform format suitable for training an MT system.
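BootCat's bootstrapping step described above (random tuples drawn from the seed terms, one query per tuple, first 10 hits kept) can be sketched as follows. This is a simplified illustration rather than BootCat's actual Perl code, and the `search` callable is a stand-in for the search-engine query:

```python
import itertools
import random

def make_tuples(seed_terms, tuple_size=3, n_tuples=10, rng=None):
    """Randomly sample term tuples from the seed list, as BootCat does."""
    rng = rng or random.Random(0)
    combos = list(itertools.combinations(seed_terms, tuple_size))
    rng.shuffle(combos)
    return combos[:n_tuples]

def bootstrap_url_list(seed_terms, search, top_n=10):
    """Query each tuple and keep the first `top_n` results per query."""
    urls = []
    for terms in make_tuples(seed_terms):
        query = " ".join(terms)
        urls.extend(search(query)[:top_n])
    # de-duplicate while preserving order
    return list(dict.fromkeys(urls))
```

The resulting URL list would then be fed to the downloading and boilerplate-removal stages.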
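The effect of an HTTrack filter combination such as "-* +*/el/*.pdf +*/en/*.pdf" above can be mimicked with glob matching. This sketch assumes last-matching-rule-wins semantics and only imitates the filters; it is not HTTrack's actual implementation:

```python
from fnmatch import fnmatch

def allowed(url, filters):
    """Apply HTTrack-style scan rules in order: '-' excludes, '+' re-includes.
    The last filter whose pattern matches the URL decides the verdict."""
    verdict = True
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if fnmatch(url, pattern):
            verdict = (sign == "+")
    return verdict

# Exclude everything, then re-include PDFs under /el/ or /en/ paths.
filters = ["-*", "+*/el/*.pdf", "+*/en/*.pdf"]
```

With these rules, only PDF files whose URLs contain "/el/" or "/en/" pass the filter.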
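Bitextor's threshold-based pair classification described above (relative differences of file size and text length, edit distances of tag structure and number lists) can be sketched as follows. The field names and threshold values here are illustrative assumptions; Bitextor reads its actual thresholds from an XML configuration file:

```python
from difflib import SequenceMatcher

def relative_diff(a, b):
    """Relative difference of two sizes, in [0, 1]."""
    return abs(a - b) / max(a, b, 1)

def edit_distance_ratio(xs, ys):
    """Normalized edit distance between two sequences (0 = identical)."""
    return 1.0 - SequenceMatcher(None, xs, ys).ratio()

def looks_like_bitext(page_a, page_b,
                      max_size_diff=0.2, max_struct_dist=0.2):
    """Classify a candidate page pair; thresholds are hypothetical."""
    if relative_diff(page_a["file_size"], page_b["file_size"]) > max_size_diff:
        return False
    if relative_diff(page_a["text_len"], page_b["text_len"]) > max_size_diff:
        return False
    if edit_distance_ratio(page_a["tags"], page_b["tags"]) > max_struct_dist:
        return False
    if edit_distance_ratio(page_a["numbers"], page_b["numbers"]) > max_struct_dist:
        return False
    return True
```

Pages with similar sizes and near-identical tag sequences are accepted as a bitext candidate; any single feature exceeding its threshold rejects the pair.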
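As a minimal illustration of such clean-up, the sketch below strips markup and drops script and style content; this is a naive heuristic for exposition, not the method used by any of the tools above:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def clean_page(html):
    """Return the visible text of a page, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

Real clean-up pipelines additionally detect boilerplate blocks (navigation, headers, footers), which this sketch does not attempt.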

Appears in 1 contract

Sources: Grant Agreement
