
We are working on a kind of document search engine, primarily focused on indexing user-submitted MS Word documents.

We have noticed that there is keyword-stuffing abuse.

We have determined two main kinds of abuse:

  1. Repeating the same term again and again
  2. Many irrelevant terms added to the document en masse

These two forms of abuse are enabled by either adding text with the same font colour as the background colour of the document, or by setting the font size to something tiny, like 1px.

Determining whether the background colour is the same as the text colour is tricky, given the intricacies of MS Word layouts. The same goes for font size: any cut-off seems potentially arbitrary, and we may accidentally remove valid text if we set the cut-off too large.

My question is: are there any standardized pre-processing or statistical analysis techniques that could be used to reduce the impact of this kind of keyword stuffing?

Any guidance would be appreciated!

Dave Bish
    Why remove anything at all? Why not just attempt to detect (ultimately that's the best you can do, attempt) and inform the user or penalise their rank, so to speak. – Grant Thomas Jun 06 '13 at 11:47
  • Unless you have a very high volume, is it possible to take an educated guess and flag the documents for review by an administrator/moderator? That might help you avoid penalizing people with false-positives. – Chris Sinclair Jun 06 '13 at 11:47
  • @GrantThomas - How could we detect? You mean just look at font-size & colour? – Dave Bish Jun 06 '13 at 11:48
  • @ChrisSinclair We have to deal with about 50k documents /day - so too many – Dave Bish Jun 06 '13 at 11:49
  • Maybe you can automate one of those optical character recognition processes (like [this one](https://www.ocrtools.com/fi/Download.aspx)). "Print" the MSWord document to an image, run it through the OCR, and maybe use it for the words. Or if there's a significant difference between the OCR text and the Word text, flag it for review. Presumably if the text is visually hidden or super tiny, the OCR won't pick it up. – Chris Sinclair Jun 06 '13 at 11:54
  • @ChrisSinclair hah! I had the same idea! I think the processing load would be pretty insane, however... – Dave Bish Jun 06 '13 at 12:01
  • @DaveBish Time for testing and benchmarking! Perhaps the tasks can be offloaded to a separate machine(s) from your main server(s). If you still can't process all 50k daily documents, either add more processing machines or test randomly (maybe attempting at least 1 document per user). But maybe do some manual tests (I think the link I provided has a free executable you can use) to see how viable/accurate it is. It may be a dead end with too many false-positives anyway. – Chris Sinclair Jun 06 '13 at 12:04
  • Perhaps another option is to run a grammar check (either Word's built-in via interop, or perhaps an existing .NET library). If there are a tonne of spelling/grammar errors, maybe it's because they inserted a tonne of irrelevant terms (see your point #2). As for point #1, you might be able to check for excessively repeating words simply enough with some basic string checking. – Chris Sinclair Jun 06 '13 at 12:08

4 Answers


There's a surprisingly simple solution to your problem using the notion of compressibility.

If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example, using the zlib library, which is free) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any significant deviation would suggest that they have been "stuffed". The analysis process is extremely easy; I have analyzed around 100k texts and it takes only about a minute using Python.
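A minimal sketch of the idea (the 2.0 baseline and 1.0 tolerance are illustrative assumptions, not established constants):

```python
import zlib

def compression_ratio(text: str) -> float:
    """Ratio of original size to zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def looks_stuffed(text: str, normal_ratio: float = 2.0, tolerance: float = 1.0) -> bool:
    """Flag texts whose ratio deviates a lot from the ~2 typical of prose."""
    return abs(compression_ratio(text) - normal_ratio) > tolerance

# Repetitive (stuffed) text compresses far better than normal prose.
stuffed = "buy cheap widgets " * 200
normal = "The quarterly report covers revenue, staffing changes, and the outlook for next year."
```

You would calibrate `normal_ratio` and `tolerance` against your own corpus rather than using these defaults.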

Another option is to look at the statistical properties of the documents/words. In order to do that, you need to have a sample of "clean" documents and calculate the mean frequency of the distinct words as well as their standard deviations.

Once you have done that, you can take a new document and compare it against the mean and the deviation. Stuffed documents will show up as those with a few words deviating very strongly from the mean for that word (documents where one or two words are repeated many times), or as those with many words with high deviations (documents with blocks of text repeated).
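A rough sketch of that comparison, assuming you have already built per-word mean relative frequencies and standard deviations from a clean corpus (all names and the z-score cut-off are illustrative):

```python
from collections import Counter

def suspicious_words(text, mean_freq, std_freq, z_cutoff=3.0):
    """Return words whose relative frequency is far above the corpus mean.

    mean_freq / std_freq: dicts of per-word relative-frequency statistics
    built beforehand from a sample of known-clean documents.
    """
    words = text.lower().split()
    total = len(words)
    flagged = []
    for word, count in Counter(words).items():
        freq = count / total
        mu = mean_freq.get(word, 0.0)
        sigma = std_freq.get(word, 0.01)  # floor to avoid division by zero
        if (freq - mu) / sigma > z_cutoff:
            flagged.append(word)
    return flagged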

Here are some useful links about compressibility:

http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf

http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf

You could also use the concept of entropy, for example the Shannon entropy calculation at http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
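As a sketch, character-level Shannon entropy is a few lines of standard-library Python (repetitive stuffed text tends to score lower than natural prose, though any threshold you pick is an assumption to calibrate):

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character of the text."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```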

Another possible solution would be to use part-of-speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37% according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true). If the percentage were significantly higher or lower for some POS tags, then you could possibly detect "stuffed" documents.

FRiverai

As Chris Sinclair commented on your question, unless you have Google-level algorithms (and even they get it wrong, and thereby have an appeal process) it's best to flag likely keyword-stuffed documents for further human review...

If a page has 100 words, you can search through the page counting the occurrences of each keyword (which renders stuffing via 1px fonts or matching background colours irrelevant), thereby obtaining a keyword density. There is really no hard and fast rule for a certain percentage 'always' being keyword stuffing; generally 3-7% is normal. Perhaps if you detect 10% or more, you flag the document as 'potentially stuffed' and set it aside for human review.
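The density check could be sketched like this (the 10% threshold is the arbitrary cut-off discussed above, not an established constant):

```python
from collections import Counter

def keyword_density(text: str) -> float:
    """Fraction of the document taken up by its single most frequent word."""
    words = text.lower().split()
    if not words:
        return 0.0
    _, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words)

def flag_for_review(text: str, threshold: float = 0.10) -> bool:
    """True when the top word exceeds the (assumed) density threshold."""
    return keyword_density(text) > threshold
```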

Furthermore consider these scenarios (taken from here):

  • Lists of phone numbers without substantial added value
  • Blocks of text listing cities and states a webpage is trying to rank for

and what the context of a keyword is.

Pretty damn difficult to do correctly.

Paul Zahra
  • But OP has suggested 50k docs per day. That's untenable. Hence, his request for suggestions regarding automating the quarantine. – DonBoitnott Jun 06 '13 at 12:11
  • In which case you could do something like has already been suggested, i.e. flag the document, penalise / inform document author and as google does, provide an appeal process for the times you screw it up. – Paul Zahra Jun 06 '13 at 12:24
  • Which also begs the question, if 50,000 docs per day are to be checked, on average what percentage is found to be 'stuffed'. – Paul Zahra Jun 07 '13 at 10:40

Detect tag abuse with forecolor/backcolor detection like you already do. For size detection, calculate the average text size and remove the outliers. Also set predefined limits on the text size (like you already do).
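A sketch of the size-outlier idea, assuming you can already extract (run_text, font_size) pairs from the document; the 2-standard-deviation cut-off and the 5pt hard minimum are arbitrary illustrative choices:

```python
import statistics

def drop_size_outliers(runs, z_cutoff=2.0, hard_minimum=5.0):
    """Keep text runs whose font size is neither a statistical outlier
    nor below a hard predefined minimum.

    runs: list of (text, font_size_in_points) tuples.
    """
    sizes = [size for _, size in runs]
    mean = statistics.mean(sizes)
    stdev = statistics.pstdev(sizes) or 1.0  # avoid division by zero
    return [
        text
        for text, size in runs
        if size >= hard_minimum and abs(size - mean) / stdev <= z_cutoff
    ]
```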

Next up is the structure of the tag "blobs". For your first point, you can just count the words; if one occurs too often (say, 5x more often than the second most frequent word), you can flag it as a repeated tag.
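That repetition check might look like this (the 5x ratio is the illustrative figure from above, to be tuned on real data):

```python
from collections import Counter

def repeated_tag(text, ratio=5):
    """Return the top word if it occurs `ratio`x as often as the runner-up,
    otherwise None."""
    counts = Counter(text.lower().split()).most_common(2)
    if len(counts) < 2:
        return None
    (top, n1), (_, n2) = counts
    return top if n1 >= ratio * n2 else None
```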

When adding tags en masse, the user often adds them all in one place, so you can check whether known "fraud tags" appear next to each other (maybe with one or two words in between).

If you could identify at least some common "fraud tags" and want to get a bit more advanced then you could do the following:

  • Split the document into parts with the same textsize / font and analyze each part separately. For better results group parts that use nearly the same font/size, not only those that have EXACTLY the same font/size.
  • Count the occurrences of each known tag, and when some limit set by you is exceeded, that part of the document is removed or the document is flagged as "bad" (as in "uses excessive tags")
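The per-part counting step might look like this, assuming the document has already been split into (text, font, size) parts and that `KNOWN_FRAUD_TAGS` is a list you maintain yourself (both are illustrative assumptions):

```python
from collections import Counter

KNOWN_FRAUD_TAGS = {"cheap", "viagra", "casino"}  # example list; maintain your own

def flag_bad_parts(parts, limit=5):
    """parts: list of (text, font_name, font_size) tuples.
    Returns indices of parts where known fraud tags exceed the limit."""
    bad = []
    for i, (text, _font, _size) in enumerate(parts):
        counts = Counter(w for w in text.lower().split() if w in KNOWN_FRAUD_TAGS)
        if sum(counts.values()) > limit:
            bad.append(i)
    return bad
```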

No matter how advanced your detection is, as soon as people know it's there and more or less know how it works, they will find ways to circumvent it.

When that happens, you should just flag the offending documents and look through them yourself. Then, if you notice that your detection algorithm produced a false positive, you improve it.

Riki

If you notice a pattern whereby the common stuffers always use a font size below a certain threshold (e.g. 1-5pt, which is not really readable), then you could assume that that is the "stuffed" part.

You can then go on to check whether the font colour is also the same as the background colour and remove that section.

kyri