Obstacles and Risks. For academics seeking to undertake research in large-scale IR systems there are obvious risks, primarily in regard to achieving genuine scale. Many of the research questions that offer the greatest potential for improvement – and the greatest possibilities for economic savings – involve working with large volumes of data, and hence significant computational investment. Finding ways of collaborating across groups, for example to share hardware and software resources and to amortize development costs, is a clear area for improvement.

Current practice in academic research in this area tends to revolve around one-off software developments, often by graduate students who are not necessarily software engineers, built as convoluted extensions to previous code bases. At the end of each student’s project, their software artifacts may in turn be published to GitHub or the like, yet be no less a combination of string and glue (and awk and sed, perhaps) than what they started with. Agreeing across research groups on some common data formats, and on some common starting implementations, would be an investment that should pay off relatively quickly; one such convention, the TREC run-file format, is sketched at the end of this section. If nothing else, it would spare every starting graduate student the ever-increasing burden of spending multiple months acquiring, modifying, and extending a code base just to obtain baseline outcomes for their experimentation.

Harder to address is the question of data scale and hardware scale. Large compute installations are expensive, and while it remains possible, to at least some extent, for a single server to be regarded as a micro-unit of a large server farm, there are also interactions that cannot be adequately handled in this way, including issues arising between the different parts of what is, overall, a very complex system. Acquiring a large hardware resource that can be shared across groups might prove difficult. Perhaps a combined approach to a provider such as Amazon Web Services might succeed in securing a large allocation of storage and compute time for a genuinely collaborative and international research group.

Harder still is arranging access to large-scale data. Public web crawls such as the Common Crawl can be used as a source of input data, but query logs are inherently proprietary and difficult to share. Whether public logs can be used in a sensible way is an ongoing question. Several prior attempts to build large logs have not been successful: the logs of CiteSeer and DBLP are heavily skewed towards titles and authors, and academic groups have been unable to mobilize sufficiently large volumes of users to adopt instrumented toolbars and browser plugins. Attempts to use institutional proxy logs have shown that even with tens of thousands of users, the resulting log is relatively sparse.

While efficiency research does not automatically demand relevance judgments or similar “quality of retrieval” resources, at least some level of quality assurance must be provided, since there is often an efficiency/effectiveness trade-off to be quantified, as the second sketch below illustrates. Obtaining access to assessments at the required scale may also become a problem. To date, TREC resources have typically been used, noting that it is acceptable practice to measure effectiveness using one set of queries and documents, and then throughput using another set.
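As a concrete illustration of what a shared convention can look like, the following minimal sketch reads and writes the six-column TREC run format (qid, the literal Q0, docid, rank, score, tag), which is already a de facto exchange format for ranked results; the file paths and the “baseline” tag are placeholders chosen for illustration.

```python
from collections import defaultdict

def read_run(path):
    """Parse a TREC run file into {query_id: [(doc_id, rank, score), ...]}."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            qid, _q0, docid, rank, score, _tag = line.split()
            run[qid].append((docid, int(rank), float(score)))
    return run

def write_run(run, path, tag="baseline"):
    """Write results back out, re-ranking each query's list by score."""
    with open(path, "w") as f:
        for qid, docs in run.items():
            ranked = sorted(docs, key=lambda d: d[2], reverse=True)
            for rank, (docid, _old_rank, score) in enumerate(ranked, start=1):
                f.write(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}\n")
```

A pair of helpers of this kind, agreed once, lets every group's indexing and ranking code interoperate at the output level, whatever happens inside each system.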
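To make the efficiency/effectiveness trade-off concrete, the second sketch takes hypothetical per-configuration measurements (the system names, latencies, and nDCG@10 values below are invented for illustration) and reports the Pareto-optimal configurations, that is, those for which no other configuration is simultaneously at least as fast and at least as effective.

```python
# Hypothetical measurements: mean latency in ms/query, effectiveness as nDCG@10.
systems = {
    "exhaustive": (95.0, 0.52),
    "wand":       (31.0, 0.52),
    "maxscore":   (28.0, 0.51),
    "aggressive": ( 9.0, 0.44),
}

def pareto_frontier(systems):
    """Return the names of configurations not dominated on (latency, effectiveness)."""
    frontier = []
    for name, (lat, eff) in systems.items():
        dominated = any(
            o_lat <= lat and o_eff >= eff and (o_lat, o_eff) != (lat, eff)
            for o_lat, o_eff in systems.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(systems))  # ['wand', 'maxscore', 'aggressive']: 'exhaustive' is dominated
```

Reporting the frontier rather than a single “winner” makes explicit how much effectiveness is being traded for each unit of speed, which is exactly the quantity that needs quality assurance behind it.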