
I have a script that iterates over the contents of hundreds of thousands of files to find specific matches. For convenience I am using Python's "in" operator on strings. What are the performance differences between the two approaches? I'm looking for more of a conceptual understanding here.

list_of_file_contents = [...]  # ~1GB of file contents
key = 'd89fns;3ofll'
matches = []
for item in list_of_file_contents:
    if key in item:
        matches.append(item)  # collect the matching content, not the key

--vs--

grep -r 'd89fns;3ofll' my_files/
David542

2 Answers


The biggest conceptual difference is that grep does regular expression matching; in Python you'd need to explicitly write code using the re module to get that. The search expression in your example doesn't exploit any of the richness of regular expressions, so the search behaves just like a plain string match in Python, and should consume only a tiny bit more resources than fgrep would. Your Python script is effectively doing what fgrep does, and should hopefully perform on par with it.
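
Since the pattern here is a plain literal, a substring test and a literal regex should find the same matches. A minimal sketch of the two styles (the sample text is made up for illustration):

import re

text = 'header d89fns;3ofll trailer'
key = 'd89fns;3ofll'

# Plain substring test: what the question's loop does, analogous to fgrep.
found_plain = key in text

# Regex equivalent of what grep does by default. re.escape() keeps the
# pattern literal in case it contains regex metacharacters.
found_regex = re.search(re.escape(key), text) is not None

assert found_plain == found_regex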

If the files are encoded, say in UTF-16, then depending on the versions of the various programs there could be a big difference in whether matches are found at all, and a small difference in how long the search takes.
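
To make the encoding point concrete, here is a small sketch (the surrounding text is invented): a byte-level search, roughly what happens when a tool doesn't know the encoding, misses the key in UTF-16 data because every character is padded with NUL bytes, while decoding first restores the match.

key = 'd89fns;3ofll'
utf8_data = 'prefix d89fns;3ofll suffix'.encode('utf-8')
utf16_data = 'prefix d89fns;3ofll suffix'.encode('utf-16')

print(key.encode('utf-8') in utf8_data)    # True
print(key.encode('utf-8') in utf16_data)   # False: NUL bytes between chars
print(key in utf16_data.decode('utf-16'))  # True again, after decoding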

And that's assuming that the actual Python code deals with input and output efficiently, i.e. list_of_file_contents isn't an actual in-memory list of all the data, but for instance a lazy generator around fileinput; and that there is not a huge number of matches to accumulate.
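
For instance, a lazy scan along those lines might look like the sketch below; the my_files/ directory is borrowed from the grep example, and the glob pattern is an assumption about the layout:

import fileinput
import glob
import os

key = 'd89fns;3ofll'

# Collect regular files under my_files/, skipping directories.
paths = [p for p in glob.glob('my_files/**/*', recursive=True)
         if os.path.isfile(p)]

# fileinput streams the files line by line, so memory use stays small
# instead of holding ~1GB of contents in a list all at once.
hook = fileinput.hook_encoded('utf-8', errors='replace')
matches = [line for line in fileinput.input(paths, openhook=hook)
           if key in line]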

Stein

I suggest you try it out for yourself. Profiling Python code is really easy: https://stackoverflow.com/a/582337/970247. For a more conceptual answer: regex is a powerful string-parsing engine full of features, whereas Python's "in" does just one thing in a really straightforward way. I would expect the latter to be more efficient for a plain literal search, but again, trying it for yourself is the way to go.
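
As a starting point, a quick comparison with timeit might look like this (the haystack below is synthetic, just to have something to search):

import re
import timeit

haystack = ('x' * 100 + 'd89fns;3ofll') * 1000
key = 'd89fns;3ofll'
pattern = re.compile(re.escape(key))

# Compare the plain substring operator against a precompiled literal regex.
t_in = timeit.timeit(lambda: key in haystack, number=1000)
t_re = timeit.timeit(lambda: pattern.search(haystack) is not None, number=1000)

print(f'in operator: {t_in:.4f}s')
print(f're.search:   {t_re:.4f}s')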

Laurent Jalbert Simard