
I have a large number of text documents on the one hand and a huge list of keywords (strings) on the other. Now I want to know which of these keywords are contained in the documents.

At the moment I'm using a monstrous auto-generated regex:

keywords = %w(Key1 Key2 Key3) # %w splits on whitespace, so no commas needed
regx = Regexp.new('\b(' + keywords.join('|') + ')\b', Regexp::IGNORECASE)
documents.each do |d|
  d.scan(regx)
end

This worked great for a list of a few hundred keywords, but now I'm using about 50,000 keywords and it has become far too slow.

Is there a better way to do such an operation in Ruby?

EDIT:

  • The documents are typical news articles, e.g. reports about recent sports events as you can find via Google News, for example. In my test set each article contains about 1,000 words.
  • The keywords can be single words, but they can also be phrases containing multiple words, like 'Franz Beckenbauer' or 'Russel Wilson'.
  • I'm interested only in complete matches - so searching for 'diction' should match only 'diction', not 'dictionary'.
PascalTurbo
  • Use `StringScanner`. – sawa Mar 24 '16 at 16:48
  • Is the slowdown linear with respect to the number of keywords? If so, I don't think you can improve from that. – sawa Mar 24 '16 at 16:49
  • Do you only want whole words matched from the documents, and not partial strings? Do you want a search for "diction" to return "dictionary"? – JLB Mar 24 '16 at 16:53
  • Roughly how many words are in each document? – Cary Swoveland Mar 24 '16 at 16:58
  • You may wish to consider creating, for each document, a set of words contained in the document. You can then quickly determine if each key word is in the set. – Cary Swoveland Mar 24 '16 at 17:05
  • Thinking the same thing --- using .to_set and then getting the intersection of the word set and file set – JLB Mar 24 '16 at 17:19
  • Take a look at http://stackoverflow.com/questions/5395236/is-there-an-efficient-way-to-perform-hundreds-of-text-substitutions-in-ruby/5395777#5395777. Not all regular expressions are created alike, and using `Regexp.union` or concatenating strings using `|` won't necessarily create efficient patterns. Without a better idea of the words you're using, the regex being generated and the files you're scanning it's difficult to help with detailed answers. There are tricks to speed up regex, and poorly written regex can be very slow, much slower than a simple in-string search. – the Tin Man Mar 24 '16 at 17:35
  • @JLB, after a bit more thought I doubt that creating a set of words in a file and then checking each keyword against that set is faster than the reverse, creating a set of keywords and then checking each word in each file against that set. – Cary Swoveland Mar 24 '16 at 17:41
  • @theTinMan There's at least one gem that does something similar: [regexp_trie](https://github.com/gfx/ruby-regexp_trie). It looks like it's based on Perl's [Regexp::Trie](https://metacpan.org/pod/Regexp::Trie), which claims to be "a faster but simpler version of Regexp::Assemble or Regexp::Optimizer." It works pretty well—see my answer below. – Jordan Running Mar 24 '16 at 19:10
  • Actually, scratch that (for now). I unfortunately came across [a pretty significant bug](https://github.com/gfx/ruby-regexp_trie/issues/1) in regexp_trie. If the bug gets fixed I'll undelete my answer. – Jordan Running Mar 24 '16 at 19:45
  • Trie is a building block. Dig into the code of Regexp::Assemble and you'll see there's a lot more to it. – the Tin Man Mar 24 '16 at 21:16
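
To illustrate the word-set idea suggested in the comments above, here is a minimal sketch (the `keywords` and `documents` variables are assumed from the question; multi-word phrases such as 'Franz Beckenbauer' would need separate handling):

require 'set'

keyword_set = keywords.map(&:downcase).to_set   # one-time setup

documents.each do |d|
  # Split the document into words and keep only those that are keywords.
  found = d.downcase.scan(/[\w']+/).select { |word| keyword_set.include?(word) }
  # `found` lists the single-word keywords occurring in this document.
end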

2 Answers


Convert the list of keywords to a hash:

h = {
  "foo" => true,
  "bar" => true,
  ...
  "baz" => true,
}
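
Such a hash need not be written by hand; assuming the `keywords` array from the question, it could, for example, be built in one pass (downcasing here only mirrors the question's case-insensitive regex):

# Build the lookup hash from the keyword list.
h = keywords.each_with_object({}) { |k, memo| memo[k.downcase] = true }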

Then, read the document chunk by chunk (separated by space):

# Read the file in space-separated chunks and test each word against the hash.
File.new("/path/to/file").each(" ") do |chunk|
  chunk.scan(/[\w']+/) do |word|
    if h.key?(word)
      # Found.
    end
  end
end
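
To collect the hits rather than merely detect them, a variation of the same loop could push matches into a set, for example (a sketch; `h` is the hash from above, and `File.foreach` closes the file automatically):

require 'set'

found = Set.new
File.foreach("/path/to/file", " ") do |chunk|
  chunk.scan(/[\w']+/) { |word| found << word if h.key?(word) }
end
# `found` now contains the keywords that occur in the file.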
sawa
  • Why a hash rather than a set? – Cary Swoveland Mar 24 '16 at 17:27
  • A set is a hash in disguise with some icing. – the Tin Man Mar 24 '16 at 17:36
  • @theTinMan, yes, but my question stands. – Cary Swoveland Mar 24 '16 at 17:46
  • If I only want to know if a single element exists in a Set, `set.include?(o)` results in one extra operation compared to looking directly at a Hash, `hash[o]`, since that's what `include?` does internally. In that case I use a hash. If I want to do Set operations beyond a lookup/test for inclusion of a single element then I use Set. @sawa might have other reasons. – the Tin Man Mar 24 '16 at 18:02

I would start by using the phrasie gem. It gives you an array of the words and phrases in each document, which you can easily match against your keywords.

Have a look: https://github.com/ashleyw/phrasie
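
As a rough sketch of how that might look, assuming the `Phrasie::Extractor#phrases` interface shown in the gem's README (it returns arrays whose first element is the extracted phrase) and the `keywords`/`documents` variables from the question:

require 'phrasie'
require 'set'

keyword_set = keywords.map(&:downcase).to_set
extractor   = Phrasie::Extractor.new

documents.each do |doc|
  # Each entry returned by #phrases is assumed to start with the phrase itself.
  phrases = extractor.phrases(doc).map { |entry| entry.first.downcase }
  found   = phrases.select { |phrase| keyword_set.include?(phrase) }
  # `found` holds the keywords/phrases detected in this document.
end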

Alphons