
I've been searching for a while now, but have found nothing that suits my needs so far. (This was helpful, but not convincing.)

From two different sources, I get two different strings. I want to check whether the shorter one is contained within the larger one. However, since both strings originate from an OCR document, there may be obvious differences.

Example:

String textToSearch = "Recognized Headline";
String documentText = "This is the document text, spanning multiple pages" +
                      "..." +
                      "..." +
                      "This the row with my Recognizect Head1ine embedded" +
                      //                    ^^^^^^^^^^^^^^^^^^^^
                      "..." +
                      "..." +
                      "End of the document";

How can I reliably find my string in the document without using a standalone Lucene/Solr installation? (Or maybe I just haven't found the right tutorial/manual.) There must be some library out there that can do this, right?

Dan Soap
  • Lucene can be used in an in-memory "mode" (if you meant standalone = indexed on disk). This might also be useful: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java – Bela Vizer Mar 02 '12 at 21:40
  • I know the Levenshtein algorithm, but wherever I've used it before, I used it to check two strings for similarity, not whether one contains the other. – Dan Soap Mar 02 '12 at 21:44
  • Can you exploit the fact that you can split your document (and your headline) into a list of words? Or is that not always the case? – biziclop Mar 02 '12 at 21:59
  • @biziclop What do you mean exactly? How would that help me? – Dan Soap Mar 02 '12 at 22:01
  • @Cassy just thinking aloud. In a naive implementation for example you'd split your document into word tokens, then calculate the Levenshtein distance of each word to the first word of your pattern. If the distance is close enough, you can test the next word against the pattern's next word and so on. – biziclop Mar 02 '12 at 22:06
  • @biziclop That would be possible, yes. But that's exactly what I would want the library I'm searching for to do for me. I'm pretty sure that I'm not the first with this problem ;-) – Dan Soap Mar 02 '12 at 22:09
  • It doesn't look hard to code: given a method that returns the L-distance between two strings, you could give it a try just to see how it performs. If it's sort of acceptable, you know this is the right direction; if it's horrible, you probably need to look for an entirely different kind of solution (a sketch of this approach follows these comments). – biziclop Mar 02 '12 at 22:21
  • These links might help: http://en.wikipedia.org/wiki/Bitap_algorithm (this is the algorithm commonly used for these types of searches) and http://code.google.com/p/google-diff-match-patch/ (an implementation of something pretty similar to what you need). – biziclop Mar 02 '12 at 22:32
  • @biziclop the google-diff-match-patch looks promising. Will give it a shot – Dan Soap Mar 02 '12 at 23:09
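
For reference, a minimal sketch of the word-window approach biziclop outlines above, assuming a plain dynamic-programming Levenshtein distance. The class name, the whole-window scoring, and the threshold of 3 edits are illustrative choices, not a definitive implementation; the threshold would need tuning against real OCR output:

public class FuzzyContains {

    // Standard dynamic-programming Levenshtein distance, two-row variant.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Slides a window of as many words as the pattern has over the document
    // and accepts the first window within maxDistance edits of the pattern.
    static boolean fuzzyContains(String document, String pattern, int maxDistance) {
        String[] docWords = document.split("\\s+");
        String[] patWords = pattern.split("\\s+");
        for (int i = 0; i + patWords.length <= docWords.length; i++) {
            StringBuilder window = new StringBuilder(docWords[i]);
            for (int k = 1; k < patWords.length; k++) {
                window.append(' ').append(docWords[i + k]);
            }
            if (levenshtein(window.toString(), pattern) <= maxDistance) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String textToSearch = "Recognized Headline";
        String documentText = "This the row with my Recognizect Head1ine embedded";
        // "Recognizect Head1ine" is 3 edits away from "Recognized Headline"
        System.out.println(fuzzyContains(documentText, textToSearch, 3)); // true
    }
}

Scoring whole windows keeps the sketch short at the cost of re-scanning overlapping text; biziclop's word-by-word variant with early exit would avoid much of that work, and the google-diff-match-patch library mentioned above implements a similar fuzzy search on top of Bitap.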

1 Answer


First of all, you need to deal with your input source. A web page has a DOM tree that can be parsed in two ways: SAX (an event-driven model without context) or DOM (a tree-based model with context). SAX is ideal here because you don't need contextual information to retrieve a stream of tokenized text nodes; simply convert all the text nodes into a stream of tokens.
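
As an illustration of this step, here is a minimal sketch assuming XHTML input; TextTokenHandler and its whitespace tokenization are made-up illustrative choices, not a canonical recipe:

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TokenStreamExample {

    // Collects the text content of every element as a flat word-token stream.
    // Note: a parser may deliver one text node across several characters()
    // calls, so a robust version would buffer until the enclosing element ends.
    static class TextTokenHandler extends DefaultHandler {
        final List<String> tokens = new ArrayList<String>();

        @Override
        public void characters(char[] ch, int start, int length) {
            for (String word : new String(ch, start, length).trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    tokens.add(word);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>This the row with my Recognizect Head1ine embedded</p></body></html>";
        TextTokenHandler handler = new TextTokenHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        System.out.println(handler.tokens);
        // [This, the, row, with, my, Recognizect, Head1ine, embedded]
    }
}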

Once you have a stream of tokens, you can do your processing on them. For large amounts of input, algorithms like Levenshtein string matching become inadequate. Instead, look into Markov Chains; they can match a set of inputs against a set of outputs fairly reliably and efficiently.
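
The Markov Chain suggestion is left abstract here; one possible, heavily simplified reading is to learn the character transitions (bigrams) of the search string and score equally long document windows by the fraction of their transitions the model has seen. A purely illustrative sketch with made-up names:

import java.util.HashSet;
import java.util.Set;

public class TransitionScorer {

    // Character-bigram (first-order transition) table learned from the pattern.
    private final Set<String> transitions = new HashSet<String>();
    private final int patternLength;

    TransitionScorer(String pattern) {
        patternLength = pattern.length();
        for (int i = 0; i < pattern.length() - 1; i++) {
            transitions.add(pattern.substring(i, i + 2));
        }
    }

    // Fraction of the window's transitions that the pattern model contains:
    // 1.0 for an exact copy, degrading gradually as OCR noise accumulates.
    double score(String window) {
        if (window.length() < 2) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < window.length() - 1; i++) {
            if (transitions.contains(window.substring(i, i + 2))) {
                hits++;
            }
        }
        return hits / (double) (window.length() - 1);
    }

    // Offset of the best-scoring pattern-sized window, or -1 if none scored.
    int bestMatch(String document) {
        int bestPos = -1;
        double best = 0.0;
        for (int i = 0; i + patternLength <= document.length(); i++) {
            double s = score(document.substring(i, i + patternLength));
            if (s > best) {
                best = s;
                bestPos = i;
            }
        }
        return bestPos;
    }

    public static void main(String[] args) {
        TransitionScorer scorer = new TransitionScorer("Recognized Headline");
        String doc = "This the row with my Recognizect Head1ine embedded";
        int pos = scorer.bestMatch(doc);
        if (pos >= 0) {
            // prints the noisy headline region of the document
            System.out.println(doc.substring(pos, pos + "Recognized Headline".length()));
        }
    }
}

Whether this degrades more gracefully than plain edit distance on real OCR noise would have to be measured; it is only meant to make the suggestion concrete.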

jmkeyes
  • You can also use StAX to parse XML, but I don't think obtaining the raw text is part of the problem. – biziclop Mar 02 '12 at 22:19
  • You're right. I saw Lucene/Solr mentioned and immediately thought he was trying to parse out a web page. – jmkeyes Mar 02 '12 at 22:24