Common use of Exporter Clause in Contracts

Exporter. The Exporter module generates an XML file in the TO1/cesDoc format for each stored web document. The XML files contain the textual content converted into UTF-8 and segmented into paragraphs. Moreover, each XML file contains metadata about the corresponding document inside a <cesHeader> element.

The first element of the header, the <fileDesc> element, includes general information about the document. Specifically, the <titleStmt> sub-element contains the title of the document (<title> container) and the PANACEA partner responsible for these operations on this particular document. The <publicationStmt> sub-element holds information about the status (i.e. distributor and its e-address, availability and publication date) of the document. The <sourceDesc> sub-element groups bibliographical information for the document such as the title, the author, the publisher, the date downloaded and the URL it was downloaded from.

The second element of the header, the <profileDesc> element, includes information about the content of the document. In particular, the <langUsage> sub-element reports the language of the document, and the <textClass> sub-element holds the key terms of the document and the sub-domain as identified by the Topic Classifier (see 2.1.5). Note that the key terms included inside the <keywords> sub-element of <textClass> are the keywords extracted from the metadata of the web document. These terms should therefore not be confused with the terms detected in this particular document during comparison with the domain definition. The <annotations> sub-element of <profileDesc> is used for storing links to other documents relevant to this basic version. After the exporting phase, there is only one <annotation>, which points to the original HTML document.

The <body> element contains the content of the document segmented into paragraphs.
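The header layout described above can be checked mechanically. The following is a minimal sketch, not the Exporter's own code: it assumes the element nesting exactly as described in the text, assumes the file uses no XML namespace, and uses a hypothetical helper name (missing_header_parts) and a made-up sample header.

```python
import xml.etree.ElementTree as ET

# Containers named in the description, as paths relative to <cesHeader>.
# The nesting is taken from the prose; real files may differ (assumption).
EXPECTED_PATHS = [
    "fileDesc",
    "fileDesc/titleStmt",
    "fileDesc/titleStmt/title",
    "fileDesc/publicationStmt",
    "fileDesc/sourceDesc",
    "profileDesc",
    "profileDesc/langUsage",
    "profileDesc/textClass",
    "profileDesc/textClass/keywords",
    "profileDesc/annotations",
    "profileDesc/annotations/annotation",
]


def missing_header_parts(ces_header):
    """Return the expected paths that are absent from a <cesHeader> element."""
    return [path for path in EXPECTED_PATHS if ces_header.find(path) is None]


# Hypothetical sample header: every container is present except <annotation>.
SAMPLE_HEADER = """<cesHeader>
  <fileDesc>
    <titleStmt><title>Example page</title></titleStmt>
    <publicationStmt/>
    <sourceDesc/>
  </fileDesc>
  <profileDesc>
    <langUsage/>
    <textClass><keywords/></textClass>
    <annotations/>
  </profileDesc>
</cesHeader>"""

missing = missing_header_parts(ET.fromstring(SAMPLE_HEADER))
print(missing)
```

For the sample above, the only path reported as missing is the <annotation> link to the original HTML document.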
Besides the normalized text, each paragraph element <p> is enriched with attributes providing more information about the process outcome. Specifically, <p> elements in the XML files may contain the following attributes:

1. crawlinfo with possible values:

a. boilerplate, meaning that the paragraph has been considered boilerplate by the Cleaner module (see subsection 2.1.3), as shown in the following example:

<p id="p1" crawlinfo="boilerplate">Home</p>
<p id="p2" crawlinfo="boilerplate">Partners</p>
<p id="p3" crawlinfo="boilerplate">Main Menu</p>
<p id="p4" crawlinfo="boilerplate">Home</p>
<p id="p5" crawlinfo="boilerplate">Background</p>
<p id="p6" crawlinfo="boilerplate">The Theme for 2011</p>
<p id="p7" crawlinfo="boilerplate">How can you participate?</p>
<p id="p8" crawlinfo="boilerplate">Register your Activity</p>
<p id="p9" crawlinfo="boilerplate">WMBD Around the World</p>
<p id="p10" crawlinfo="boilerplate">WMBD Community</p>
<p id="p11" crawlinfo="boilerplate">Press / Materials</p>
<p id="p12" crawlinfo="boilerplate">Related Links</p>
<p id="p13" crawlinfo="boilerplate">Partners</p>
<p id="p14" crawlinfo="boilerplate">Translate this Site:</p>
<p id="p15" crawlinfo="boilerplate">Partners &amp; Sponsors</p>
<p id="p16" crawlinfo="ooi-length">WMBD Partners:</p>
<p id="p17" topic="sustainable development">United Nations Environment Programme (UNEP) is the voice for the environment in the United Nations system. It is an advocate, educator, catalyst and facilitator, promoting the wise use of the planet's natural assets for sustainable development.</p>

▇. ▇▇▇-▇▇▇▇, denoting that the paragraph is not in the targeted language. One result of the manual evaluation in the first evaluation cycle, reported in D7.2 (First evaluation report. Evaluation of PANACEA v1 and produced resources), was that about 5% of the acquired documents contained at least one paragraph not in the targeted language. Therefore, the Exporter applies the embedded language identifier (see subsection 2.1.4) at paragraph level as well. If a paragraph is not in the targeted language, the attribute crawlinfo takes the value ▇▇▇-▇▇▇▇. As an example, note paragraph p63 in the listing below.

<p id="p61" topic="delta;▇▇▇▇▇">The ▇▇▇▇▇▇ of the Danube, which flow into the Black Sea, form the largest and best preserved of Europe's deltas. The Danube delta hosts over 300 species of birds as well as 45 freshwater fish species in its numerous lakes and marshes.</p>
<p id="p62" crawlinfo="ooi-length">Delta du Danube</p>
<p id="p63" crawlinfo="▇▇▇-▇▇▇▇">Les eaux du Danube se jettent dans la mer Noire en formant le plus vaste et le mieux préservé des deltas européens. Ses innombrables lacs et marais abritent plus de 300 espèces d'oiseaux ainsi que 45 espèces de poissons d'eau douce.</p>

▇. ▇▇▇-length, denoting that the paragraph is so short that it is either not useful or likely to confuse the language identifier. Another finding from the first evaluation cycle was that a very large proportion of the documents (approx. 80%) contained at least one short paragraph of only limited or no use. To address this, the Exporter compares the length of each paragraph (in tokens) with a predefined threshold provided by the user (see parameter minimumLength in 2.1.10) and classifies short paragraphs as out of interest (i.e. adds the value ooi-length to the crawlinfo attribute). For an example, see paragraphs p41 and p43 in the listing below (and p62 in the listing above).

<p id="p40" type="listitem" topic="forest;nature reserve">National Trust membership gives you access to green space and helps fund conservation. The trust manages 250,000 hectares of land, including forest, ▇▇▇▇▇, nature reserves, farmland and moorland, as well as 707 miles of coastline in England, Wales and Northern Ireland.</p>
<p id="p41" crawlinfo="ooi-length">Plantlife</p>
<p id="p42">Plantlife works to protect wild plants and their habitats. Activities include rescuing wild plants from the brink of extinction, and ensuring that common plants don't become rare in the wild. It actively campaigns on a number of issues affecting wild plants and fungi. The Plantlife website has a wealth of downloadable information about wild plants and plant conservation. Find out how you can support the organisation here .</p>
<p id="p43" crawlinfo="ooi-length">Buglife - The Invertebrate Conservation Trust</p>

2. type with possible values: title, heading and listitem, as identified by the Cleaner module (see 2.1.3).

3. topic with a string value including all terms from the domain definition detected in this paragraph.
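The paragraph attributes described above lend themselves to simple downstream filtering. The sketch below is illustrative only and is not the Exporter's implementation. It assumes whitespace tokenization and a strict "fewer than" comparison against the minimumLength threshold (the text does not specify either), assumes no XML namespace, and uses hypothetical function names. The sample body reuses paragraphs from the listings above, with p42 shortened.

```python
import xml.etree.ElementTree as ET


def is_too_short(text, minimum_length):
    """True if the paragraph has fewer whitespace-separated tokens than the threshold.

    Both the tokenization and the strict comparison are assumptions.
    """
    return len(text.split()) < minimum_length


def usable_paragraphs(body):
    """Return (id, text) pairs for <p> elements that carry no crawlinfo attribute."""
    return [
        (p.get("id"), "".join(p.itertext()).strip())
        for p in body.iter("p")
        if p.get("crawlinfo") is None
    ]


SAMPLE_BODY = """<body>
<p id="p15" crawlinfo="boilerplate">Partners &amp; Sponsors</p>
<p id="p16" crawlinfo="ooi-length">WMBD Partners:</p>
<p id="p41" crawlinfo="ooi-length">Plantlife</p>
<p id="p42">Plantlife works to protect wild plants and their habitats.</p>
</body>"""

body = ET.fromstring(SAMPLE_BODY)
kept = usable_paragraphs(body)
print(kept)

# With a hypothetical threshold of 3 tokens, "WMBD Partners:" counts as short.
print(is_too_short("WMBD Partners:", 3))
```

Only p42 survives the filter here, since every other sample paragraph carries a crawlinfo value marking it as boilerplate or out of interest.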

Appears in 2 contracts

Sources: Grant Agreement, Grant Agreement