How to check for multiple words inside a folder

Question

I have a list of words in a text file called words.txt, and I need to check if any of those words are in my Source folder, which also contains sub-folders and files.

I was able to get all of the words into an array using this code:

array_of_words = [] 

File.readlines('words.txt').map do |word|
  array_of_words << word
end

And I also have (kinda) figured out how to search through the whole Source folder including the sub-folders and sub-files for a specific word using:

Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
  puts filepath
  puts File.readlines(filepath).any?{ |l| l['api'] } 
end

Instead of searching for one word like api, I want to search the Source folder for the whole array of words (if that is possible).

Do you have to do this in ruby? The command-line tool `egrep` could do this much easier via something like `egrep -r "(api|function|method)" *`... — Brian, May 03 '17 at 21:31

score 2 · Answer 1 · edited May 23 '17 at 12:18

Consider this:

File.readlines('words.txt').map do |word|
  array_of_words << word
end

will read the entire file into memory, then convert it into individual elements in an array. You could accomplish the same thing using:

array_of_words = File.readlines('words.txt')

A potential problem is that it's not scalable. If "words.txt" is larger than the available memory, your code will have problems, so be careful.
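One detail worth seeing in practice: readlines keeps the line terminator on every element, which matters once the words are turned into a pattern. This is a minimal sketch, assuming words.txt holds one word per line; the temporary file and its contents are made up for illustration, and the chomp: keyword needs Ruby 2.4 or later.

```ruby
require 'tempfile'

# A temporary file stands in for words.txt (hypothetical contents).
words_file = Tempfile.new('words')
words_file.write("api\nfunction\nmethod\n")
words_file.close

# readlines keeps the trailing newline on every element...
raw_words = File.readlines(words_file.path)
# => ["api\n", "function\n", "method\n"]

# ...so strip it, here with the chomp: keyword (map(&:chomp) also works).
clean_words = File.readlines(words_file.path, chomp: true)
# => ["api", "function", "method"]
```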

Searching a file for an array of words can be done a number of ways, but I've always found it easiest to use a regular expression. Perl has a great module called Regexp::Assemble that makes it easy to convert a list of words into a very efficient pattern, but Ruby is missing that sort of functionality. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for one solution I put together in the past to help with that.

Ruby does have Regexp.union; however, it's only a partial help.

words = %w(foo bar)
re = Regexp.union(words) # => /foo|bar/

The pattern generated has flags for the expression so you have to be careful with interpolating it into another pattern:

/#{re}/ # => /(?-mix:foo|bar)/

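The embedded (?-mix: group carries the inner pattern's own flags, so options set on the outer pattern, such as /i, won't apply inside it. That will cause you problems, so don't do that. Instead use: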

/#{re.source}/ # => /foo|bar/

which will generate the pattern and behave like we expect.
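To make the flag issue concrete, here is a small sketch comparing the two forms of interpolation with a case-insensitive outer pattern. The variable names are only for illustration.

```ruby
words = %w(foo bar)
re = Regexp.union(words)

# Interpolating the Regexp object embeds its flags, so the outer /i is
# switched off inside the (?-mix:...) group.
with_flags = /#{re}/i

# Interpolating the source string leaves the outer flags in effect.
source_only = /#{re.source}/i

with_flags.match?('FOO')   # => false
source_only.match?('FOO')  # => true
```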

Unfortunately, that's not a complete solution either, because the words could be found as sub-strings in other words:

'foolish'[/#{re.source}/] # => "foo"

The way to work around that is to set word-boundaries around the pattern:

/\b(?:#{re.source})\b/ # => /\b(?:foo|bar)\b/

which then looks for whole words:

'foolish'[/\b(?:#{re.source})\b/] # => nil

More information is available in Ruby's Regexp documentation.
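As a rough sketch, the boundary-wrapped pattern can be built by a small helper and used with scan to see which words actually hit. The helper name whole_word_pattern is hypothetical. Note that \b only behaves this way when each word starts and ends with a word character.

```ruby
# Hypothetical helper: wrap the alternation in word boundaries.
def whole_word_pattern(words)
  /\b(?:#{Regexp.union(words).source})\b/
end

pattern = whole_word_pattern(%w(foo bar))

line = 'a foolish foo met a bar at the sandbar'
hits = line.scan(pattern)  # => ["foo", "bar"]
```

"foolish" and "sandbar" are skipped because the match would start or end in the middle of a word.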

Once you have a pattern you want to use, searching becomes a simpler matter. Ruby has the Find module, which makes it easy to recursively search directories for files. The documentation covers how to use it.

Alternately, you can cobble together your own method using the Dir class. Again, the documentation has examples showing how to use it, but I usually go with Find.
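A minimal sketch of the Find approach, assuming you want every regular file below a root directory while skipping hidden directories such as .git. The helper name files_under and the temporary directory layout are made up for the example.

```ruby
require 'find'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: list every regular file below +root+,
# skipping hidden directories entirely.
def files_under(root)
  files = []
  Find.find(root) do |path|
    if File.directory?(path)
      # Find.prune stops Find from descending into this directory.
      Find.prune if path != root && File.basename(path).start_with?('.')
    elsif File.file?(path)
      files << path
    end
  end
  files.sort
end

# Build a throwaway tree standing in for the Source folder.
root = Dir.mktmpdir('source')
FileUtils.mkdir_p(File.join(root, 'sub', '.git'))
File.write(File.join(root, 'top.txt'), "top\n")
File.write(File.join(root, 'sub', 'nested.rb'), "nested\n")
File.write(File.join(root, 'sub', '.git', 'config'), "ignored\n")

found = files_under(root).map { |path| path.sub("#{root}/", '') }
# => ["sub/nested.rb", "top.txt"]
FileUtils.remove_entry(root)
```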

When reading the files you're scanning I'd recommend using foreach to read the files line-by-line. File.read and File.readlines are not scalable and can make your program behave erratically as Ruby tries to read a big file into memory. Instead, foreach will result in very scalable code that runs more quickly. See "Why is "slurping" a file not a good practice?" for more information.
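Here is a short sketch of line-by-line reading with foreach, assuming you also want to know where the hits are. The helper name matching_lines and the sample file are hypothetical.

```ruby
require 'tempfile'

# Hypothetical helper: read the file one line at a time and collect
# [line_number, text] pairs for the lines that match.
def matching_lines(filepath, pattern)
  hits = []
  File.foreach(filepath).with_index(1) do |line, number|
    hits << [number, line.chomp] if line =~ pattern
  end
  hits
end

# A temporary file stands in for one of the files under Source.
sample = Tempfile.new('sample')
sample.write("nothing here\ncall the api\nfoolish text\n")
sample.close

result = matching_lines(sample.path, /\b(?:api|foo)\b/)
# => [[2, "call the api"]]
```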

Using the links above you should be able to put something together quickly that'll run efficiently and be flexible.

This untested code should get you started:

WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
# The closing \b goes outside the group so it applies to every word.
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/

Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
  puts "#{filepath}: #{!!File.read(filepath)[WORD_RE]}"
end

It will output the file it's reading, and "true" or "false" whether there is a hit finding one of the words in the list.

It's not scalable because of readlines and read and could suffer serious slowdown if any of the files are huge. Again, see the caveats in the "slurp" link above.
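For comparison, a variant that avoids slurping each source file might look like the sketch below. It assumes the files are valid text in the default encoding. The helper name scan_tree and the temporary directory are made up for the example, and the word list is passed in directly rather than read from words.txt.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: map each file under +root+ to true/false,
# reading line by line and stopping at the first hit in each file.
def scan_tree(root, words)
  pattern = /\b(?:#{Regexp.union(words).source})\b/
  report = {}
  Dir[File.join(root, '**', '*')].select { |f| File.file?(f) }.sort.each do |filepath|
    report[filepath] = File.foreach(filepath).any? { |line| line =~ pattern }
  end
  report
end

# Build a throwaway tree standing in for the Source folder.
root = Dir.mktmpdir('source')
FileUtils.mkdir_p(File.join(root, 'sub'))
File.write(File.join(root, 'a.rb'), "first line\nuses the api here\n")
File.write(File.join(root, 'sub', 'b.txt'), "a foolish line\n")

report = scan_tree(root, %w(api foo))
report.each { |filepath, hit| puts "#{filepath}: #{hit}" }
FileUtils.remove_entry(root)
```

Only the word list is held in memory; each source file is streamed, and any? stops reading a file once a match is found.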

Hello, thank you for this amazing/helpful information I will come up with a better solution, but this def. helps! — Hamel Desai, May 03 '17 at 22:38

Jim U · Answer 2 · 2017-05-04T13:53:58.813

0

Recursively searches directory for any of the words contained in words.txt

re = /#{File.readlines('words.txt').map { |word| Regexp.quote(word.strip) }.join('|')}/

Dir['Source/**/*.{cpp,txt,html}'].select{|f| File.file?(f) }.each do |filepath|
  puts filepath
  # The encoding must be a keyword; a bare second string is taken as the line separator.
  puts File.readlines(filepath, encoding: 'ascii').grep(re).any?
end
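If some files contain bytes that are not valid in the encoding they are read with, the match itself raises "invalid byte sequence". One way around that, sketched here as an assumption rather than as part of the answer above, is to scrub each line before matching. The helper name any_line_matches? and the sample file are hypothetical, and String#scrub needs Ruby 2.1 or later.

```ruby
require 'tempfile'

# Hypothetical helper: scan a file whose bytes may not be valid UTF-8.
def any_line_matches?(filepath, pattern)
  File.foreach(filepath, encoding: 'UTF-8').any? do |line|
    # scrub replaces invalid byte sequences so the match cannot raise
    # "invalid byte sequence in UTF-8".
    line.scrub('?') =~ pattern
  end
end

# A temporary file with two invalid bytes on its first line.
binary_ish = Tempfile.new('binaryish')
binary_ish.binmode
binary_ish.write("\xFF\xFE garbage\nreal api call\n".b)
binary_ish.close

found_api = any_line_matches?(binary_ish.path, /\bapi\b/)    # => true
found_other = any_line_matches?(binary_ish.path, /\bzzz\b/)  # => false
```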

edited May 04 '17 at 13:53

answered May 03 '17 at 21:35


I updated answer to escape the contents of words.txt – Jim U May 03 '17 at 22:03
Hey so I got the same exact error. `===': invalid byte sequence in UTF-8 (ArgumentError) – Hamel Desai May 03 '17 at 22:06
`Regexp.quote(word.strip) }.join('|')` isn't a good idea as it can generate false-positive sub-string hits. – the Tin Man May 03 '17 at 22:13
@Hamel Desai - I updated the answer to try to avoid searching binary files – Jim U May 03 '17 at 22:19
The error message suggests there's something in your files (either in source or words.txt) that have non-UTF-8 characters in them. So, maybe try opening the files as something other than UTF-8. So, **I updated my answer to open the files as `ascii`.** Maybe that will help. Then again, maybe it's words.txt that's non-UTF-8. If you know the encoding of your files, maybe use that instead. – Jim U May 04 '17 at 13:58
I would try using a divide and conquer approach to diagnose the cause of the error. Run the program on a `words.txt` file that contains only a couple simple words. And have only a file or two in the Source directory. It should work. Then try again with your original words.txt. If it breaks, you know words.txt is contributing to the problem. Try again with half the words in words.txt. Basically, do a binary search for the problem line in words.txt. If the problem is not with words.txt, use the same strategy with the files in Source. Eventually you should be able to isolate the problem. – Jim U May 04 '17 at 14:07
