SPAM Detection. Spam detection is important for every blog crawling service, and especially so when the crawler uses ping servers or accepts an arbitrary set of weblogs rather than a predefined list. Spam filtering runs as a process separate from fetching; its job is to identify blog posts that should not be processed further or stored in the repository, and to stop them.

Spam blogs (splogs) are a growing problem when capturing blogs beyond a list of qualified weblogs. Splogs are generated with two, often overlapping, motives. The first is the creation of fake blogs containing gibberish or content hijacked from other blogs and news sources, with the sole purpose of hosting profitable context-based advertisements. The second, and better understood, is the creation of false blogs that form a link farm intended to unjustifiably increase the ranking of affiliated sites [7].

There are several techniques for detecting spam, and several freeware tools are available, such as ▇▇▇▇▇▇▇▇.▇▇▇. However, most of these are too simplistic for use in a weblog spider. An alternative is to implement our own splog detection; three different techniques are described in [7]. A blog profile consists of the blog's URL and a sequence of blog state tuples, each of the form (t, N, p.spam_score). For ease of discussion, each state tuple in a given blog profile b is denoted ST. Given a blog profile, we present three (obviously non-exhaustive) scoring functions, denoted SF1 to SF3, based on the heuristics stated below. Each of them independently attempts to estimate the likelihood that the blog is a splog.
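The blog profile and the independent application of scoring functions can be sketched as follows. This is a minimal illustration only, under stated assumptions: the clause does not define t, N, or p.spam_score, so reading t as an observation time, N as a count, and p.spam_score as a value in [0, 1] is a guess. The three scoring functions shown are hypothetical placeholders and are not SF1 to SF3 from [7]; all class and function names are invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class StateTuple:
    """One blog state tuple ST = (t, N, p.spam_score)."""
    t: float           # assumption: time at which the state was observed
    n: int             # assumption: a count attached to the state (N in the text)
    spam_score: float  # assumption: p.spam_score, taken to lie in [0, 1]


@dataclass
class BlogProfile:
    """A blog profile: the blog's URL plus a sequence of state tuples."""
    url: str
    states: List[StateTuple] = field(default_factory=list)


# A scoring function maps a profile to an estimated splog likelihood.
ScoringFunction = Callable[[BlogProfile], float]


def placeholder_mean(profile: BlogProfile) -> float:
    """Hypothetical placeholder: average spam score over all states."""
    if not profile.states:
        return 0.0
    return sum(s.spam_score for s in profile.states) / len(profile.states)


def placeholder_max(profile: BlogProfile) -> float:
    """Hypothetical placeholder: highest spam score seen in any state."""
    return max((s.spam_score for s in profile.states), default=0.0)


def placeholder_latest(profile: BlogProfile) -> float:
    """Hypothetical placeholder: spam score of the state with the largest t."""
    if not profile.states:
        return 0.0
    return max(profile.states, key=lambda s: s.t).spam_score


PLACEHOLDERS: Dict[str, ScoringFunction] = {
    "mean": placeholder_mean,
    "max": placeholder_max,
    "latest": placeholder_latest,
}


def score_profile(
    profile: BlogProfile, functions: Dict[str, ScoringFunction]
) -> Dict[str, float]:
    """Apply each scoring function independently; no function sees another's result."""
    return {name: fn(profile) for name, fn in functions.items()}
```

Keeping each scoring function as a separate callable mirrors the statement that each estimate is made independently; how the individual scores would be combined or thresholded is not specified in the clause and is left open here.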