This is probably a common task, but I want to figure out the best way to do it.

  • I have a list of files (say n) in a directory, all categorized by extension.
  • I have a CSV file containing regex patterns (say m) that I want to search for in all files of a particular type.
  • I want a final output that lists, for each match, the regex pattern, the file name, the line, and the line number.

Here are a few questions I have about how to approach this:

  1. Is there a way to avoid m*n operations?
  2. Which is faster: reading each file, buffering its content, and storing each line in, say, an array before searching for all the regex patterns, or taking one regex pattern at a time, reading the file line by line, and searching as I parse without using up memory?
  3. I figure that read/write operations are the most taxing, so I want 'n+1' reads (the n files plus the CSV) and a single write at the very end. Are my assumption and approach correct?
  4. Arrays, Lists, HashMaps, or something else: any suggestion on the best data structure for this task? I think parsing the files would be the key to efficiency?
  5. Are there any 'uncommon' Java APIs I could use that would reduce the code significantly?

I appreciate any insight/help with respect to this question.

Prasoon

1 Answer

Write a simple working solution first, then optimize it. That said, I think you might be able to do something like:

  • Construct a composite regex from the individual regexes that you're searching for. If they don't use capturing groups, I suspect you could just do something like "(regex1)|(regex2)|(regex3)" and that'd be valid. I'm not positive, though -- I've never been clear on how regex capturing groups work when they're in different | branches.
  • Use Pattern.compile(regexString) to precompile the regex so that it is compiled only once rather than on every use.
  • Use Guava's Files.toString(File, Charset) to just slurp each file all at once. If you're that keen on doing it line-by-line, use Files.readLines(File, Charset) to get a List<String>. You might even use the full-blown callback-based Files.readLines(File, Charset, LineProcessor) to avoid having the whole file in memory at once.
  • Use the compiled Pattern to match against the target file -- you'll probably need to use the Matcher to identify where exactly the match was, and which pattern was matched.
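One way the composite-regex idea could be sketched, under some assumptions: each regex is wrapped in a named group (`p0`, `p1`, ...), so that the `Matcher` can report which branch matched. The class name `CompositeRegex`, the `Hit` holder, and the `p<i>` group names are hypothetical and chosen only for illustration. The sketch also assumes the individual regexes contain no numbered backreferences and no group names of their own that clash with `p<i>`, because wrapping shifts group numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompositeRegex {
    /** One match: index of the original regex, 1-based line number, and the line text. */
    static final class Hit {
        final int patternIndex;
        final int lineNumber;
        final String line;

        Hit(int patternIndex, int lineNumber, String line) {
            this.patternIndex = patternIndex;
            this.lineNumber = lineNumber;
            this.line = line;
        }
    }

    /** Builds "(?<p0>regex0)|(?<p1>regex1)|..." and compiles it once. */
    static Pattern combine(List<String> regexes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < regexes.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append("(?<p").append(i).append('>').append(regexes.get(i)).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    /** Scans the lines once and records which named branch produced each match. */
    static List<Hit> search(Pattern combined, int patternCount, List<String> lines) {
        List<Hit> hits = new ArrayList<>();
        for (int n = 0; n < lines.size(); n++) {
            Matcher m = combined.matcher(lines.get(n));
            while (m.find()) {
                for (int i = 0; i < patternCount; i++) {
                    // A named group that did not take part in the match returns null.
                    if (m.group("p" + i) != null) {
                        hits.add(new Hit(i, n + 1, lines.get(n)));
                        break;
                    }
                }
            }
        }
        return hits;
    }
}
```

Note one limitation of this approach: when two patterns could match at the same position, the alternation reports only the leftmost branch, so overlapping matches from different patterns are not all listed. If every pattern must be reported for every line, matching each pattern separately is the safer choice.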
Louis Wasserman
  • A composite regex would not work for me, since I want the output to also record which regex pattern matched. Or is there a way? – Prasoon Feb 07 '12 at 01:13
  • ...There might be, but it'll be tricky. Super tricky. At this point, I'd recommend going ahead with the `n * m` solution that matches each pattern independently, and then see if that's fast enough for your needs. If not, go ahead and attempt the deep hackery -- probably by figuring out which pattern is the "outer pattern" corresponding to "regex number ___." – Louis Wasserman Feb 07 '12 at 01:14
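For reference, the straightforward `n * m` approach could look roughly like the sketch below. It uses only the standard library, reads each file once line by line, and tests every precompiled pattern against each line. The class name `PatternScan`, its method names, and the tab-separated row layout are assumptions made for illustration, as is the UTF-8 encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PatternScan {
    /** Compiles every regex once, up front, so no pattern is rebuilt per line. */
    static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        return patterns;
    }

    /** Reads the source once; returns rows of "pattern TAB file TAB lineNumber TAB line". */
    static List<String> scan(String fileName, BufferedReader reader, List<Pattern> patterns) {
        List<String> rows = new ArrayList<>();
        try {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                for (Pattern p : patterns) {
                    if (p.matcher(line).find()) {
                        rows.add(p.pattern() + "\t" + fileName + "\t" + lineNumber + "\t" + line);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    /** Convenience wrapper for a file on disk (UTF-8 is an assumption here). */
    static List<String> scanFile(Path file, List<Pattern> patterns) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return scan(file.getFileName().toString(), reader, patterns);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Collecting the rows in memory and writing them out once at the end keeps the I/O at one read per file plus one write, while only a single line of each file is held at a time.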