1

I have a predefined set of words like murder, crime, officer, robbery, culprits, mishap, accident, crash, killed, ....(around 5000 words)

I want to match this words in a news article (approx. 1kb-5kb text) and if found then categorize those words accordingly. Initially I just used spaces before and after words i.e.

if(article.contains(" "+word+" ")) { \*do something*\ }

But this do not work when the word is followed by full-stop, comma or other symbol, same goes for beginning of word

So i switched to regex with word boundaries, but the code now runs 20x slower and CPU usage goes to 100% in 5 threads.

Does anybody have better solution in java? all help is appreciated :)

  • it will match containin word also then i.e. 'con' and 'constable' are different word 'con' should not match 'constable' in text. – Kiran Phadatare Jan 26 '17 at 16:40
  • You first need to decide on the algorithm, **then** look for an implementation in Java, e.g. see [here](http://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-string-matches) – Maarten Bodewes Jan 26 '17 at 16:42
  • The answers of this question http://stackoverflow.com/questions/225337/how-do-i-split-a-string-with-any-whitespace-chars-as-delimiters might be useful. – Kh.Taheri Jan 26 '17 at 16:46
  • There are algorithms designed for [matching multiple patterns at the same time](https://en.wikipedia.org/wiki/String_searching_algorithm#Algorithms_using_a_finite_set_of_patterns). I'd choose one of them if I were you. Please bear in mind though that (at least in English), finding the boundaries of a word is non-trivial, as some words may legitimately end (and semi-legitimately, begin) with an apostrophe. – biziclop Jan 26 '17 at 16:59

2 Answers2

0

You will always have to check if the word contains special symbols inside it so i suggest calling the replaceAll function using \W to get rid of any of the extra fluff/symbols inside the word if it does exist.

String wordToLookup = " " + word.replaceAll("\\W", "") + " ";
if(article.contains(wordToLookup))
{
    //do something
}
RAZ_Muh_Taz
  • 4,059
  • 1
  • 13
  • 26
0

I don't think regex is the best tool to process that search, but if you don't find a better tool you could already win a lot of time by crafting an optimized regex. If you check that test I did with just a handful of token and a small search string, the search with a single pattern is already 4 times faster than the search with multiple patterns.

Now obviously with 5000 tokens I don't expect you to produce and maintain that regex by hand, but it would be possible to transform the token list into a prefix tree which would then be used to craft the regex :

tokens : con, conman, constitution, correct, exact

tree :     ^
        c     e
        o     x
     n    r   a
  s  $ m  r   c
  t    a  e   t
  i    n  c   $
  t    $  t
[...]     $

regex : \\b(co(n(stitution|man)?|rrect)|exact)\\b

Anyway I think your first step should be to research existing full-text search libraries, which could probably solve your problem much more efficiently without much effort.

Aaron
  • 24,009
  • 2
  • 33
  • 57
  • Thank for the example, it will be great idea to create a prefix tree, however i don't have that much time for it now. however i'll first check if some full text match engines do the trick. – Kiran Phadatare Jan 26 '17 at 17:30
  • Creating the prefix tree isn't hard as I'm sure there are libraries for that. I think that creating the regex from that would be the hard part, although you could obtain basic regexs quite easily if the tokens are only text and that you don't mind achieving perfect optimization directly. I may take a crack at it :) – Aaron Jan 26 '17 at 17:36