
Right now I have two large files, a pattern file and a log file, each with over 300,000 lines. The pattern file has this format:

Line 1 : <ID>   <Dialog1>    <ReplyStr>    <Dialog2>    
// the ReplyStr is needed as a pattern

The log file has this format:

Line 1 : <LogData>    <ReplyStr>    <CommentOfReply>   
// I need every CommentOfReply whose ReplyStr appears in the pattern file

My task is to collect all the comments for a set of given replies, in order to analyze the users' emotional reactions to those replies. This is what I do, step by step (a sketch of this approach follows the list):

  1. pick out all patterns and log fields, both using regexes,
  2. then match them against each other with string comparisons.
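
A plausible reconstruction of this approach, inferred from the description and the profile below (one re.search() per log-line/pattern pair); the field positions, file names, and variable names are assumptions, not the actual code:

import re

# Collect every reply string from the pattern file.
reply_patterns = []
with open('pattern_file') as f:
    for line in f:
        fields = line.strip().split()
        reply_patterns.append(fields[2])      # <ReplyStr> assumed to be the 3rd field

# For each log line, try every pattern until one matches.
comments = []
with open('log_file') as f:
    for line in f:                            # ~300,000 log lines ...
        for reply in reply_patterns:          # ... times ~300,000 patterns
            if re.search(reply, line):        # roughly 10^11 searches in total
                comments.append(line.strip().split()[-1])   # <CommentOfReply>
                break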

I need to optimize the code; for now it takes 8 hours to finish.

The profile follows (using cProfile on the first 10 loops):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   19.345   19.345 <string>:1(<module>)
        1    7.275    7.275   19.345   19.345 get_candidate2.py:12(foo)
  3331494    2.239    0.000   10.772    0.000 re.py:139(search)
  3331496    4.314    0.000    5.293    0.000 re.py:226(_compile)
      7/2    0.000    0.000    0.000    0.000 sre_compile.py:32(_compile)
                            ......
  3331507    0.632    0.000    0.632    0.000 {method 'get' of 'dict' objects}
  3331260    0.560    0.000    0.560    0.000 {method 'group' of '_sre.SRE_Match' objects}
        2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'remove' of 'list' objects}
  3331494    3.241    0.000    3.241    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
        9    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
  6662529    0.737    0.000    0.737    0.000 {method 'strip' of 'str' objects}

From the profile, it seems that almost all the time is spent in re.search(), and I have no idea how to reduce it.
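
One cost that does stand out in the profile is re.py:226(_compile): about 5.3 of the 19.3 seconds go into re's internal pattern-cache lookup, because re.search(pattern, line) re-resolves the pattern on every call. Compiling each pattern once up front removes that overhead. A minimal sketch, reusing the hypothetical reply_patterns list from the sketch above; note this only trims constant factors and does not fix the O(N × M) algorithm:

import re

# Compile each literal reply string once, instead of paying re's cache
# lookup on every one of the ~10^11 re.search() calls.
compiled = [re.compile(re.escape(reply)) for reply in reply_patterns]

with open('log_file') as f:
    for line in f:
        for pattern in compiled:
            if pattern.search(line):
                # handle the match; for literal strings, a plain
                # `reply in line` test would be faster still
                break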

  • So you are attempting to perform something like 300,000 separate searches on 300,000 lines? That sounds like 10^11 or so operations, which would indeed take a long time. Perhaps you need to rethink what you're trying to achieve. But it might be the makings of an interesting problem if you flesh it out a bit. Can you share with us some more about the specific patterns and text to be searched? Why is it that you need so many patterns? What is the overall goal of this project? – Mike Satteson Feb 06 '15 at 14:42
  • @MikeSatteson I edited the question, adding more details now, thanks for your advice. It seems impossible to reduce the `ncall` of the `re.search()` in this case. – stanleyerror Feb 06 '15 at 14:54
  • Can't you just iterate the log file once and create one big dictionary `{replyStr: [CommentOfReply, ...]}`? Then iterate the pattern file and get all the comments for the replies at once. – tobias_k Feb 06 '15 at 14:56
  • So is it the case that `<ReplyStr>` in each file is exactly the same? If so, you have lots of potential for improvement. @tobias_k's suggestion has a great deal of merit. My inclination would be to set up a database where the reply string (or a hash thereof) serves as a shared key between tables derived from the `pattern` and `log` data. One question: what's the relationship between occurrences of `<ReplyStr>` in these files (e.g. one-to-one, many-to-one, one-to-some)? That relationship might drive the sort of data structure that you'd choose for this problem. – Mike Satteson Feb 06 '15 at 16:07
  • @tobias_k suggests a good solution. Since I need to get all the comments for the given reply strings, a dict is totally enough for me. Thanks to all you guys. – stanleyerror Feb 06 '15 at 16:14
  • One very simple-minded, but possibly effective, solution would be to [sort](http://linux.die.net/man/1/sort) each file on `<ReplyStr>`. You could then open each sorted file, reading line by line, and the matching data would "fall into your lap" (see the sketch after these comments). Depending upon your needs, this simple approach might get the job done for you with minimal extra coding. Even if you need to develop some fancy data structure, this technique will at least make it easier to work with these data as you do so. – Mike Satteson Feb 06 '15 at 16:15
  • @MikeSatteson this would be a more general solution in both the one-to-one and many-to-one cases. Thanks for all your suggestions. – stanleyerror Feb 06 '15 at 16:21
  • @stanleyerror If you provide a small representative excerpt from both files (some ten lines maybe) then I might write up some sample code to help you get started, if needed. – tobias_k Feb 06 '15 at 16:21
  • @tobias_k haha, so kind of you. Well, I should work it out independently, since it is my assignment. Anyway, your solution helps a lot. – stanleyerror Feb 06 '15 at 16:26
  • Very good question. Thanks for responding so promptly and nicely to our comments. I'd take @tobias_k up on his suggestion, I'm sure that he'll create an excellent answer. Alas, I have to leave this for now. I'll check back later to see what has developed. – Mike Satteson Feb 06 '15 at 16:30
  • Your desire to finish on your own is admirable. Don't forget that you can answer your own question, should you like, once you have developed a solution. – Mike Satteson Feb 06 '15 at 16:32
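
A minimal sketch of the sort-based idea from the comment above: sort both files on the reply-string field first (e.g. with the Unix sort utility, using the same collation, such as LC_ALL=C), then merge them in a single pass. The field positions are assumptions, and the sketch assumes each reply string occurs at most once in the pattern file:

# Merge-join two files that have both been sorted on their reply-string field.
def pattern_key(line):
    return line.split()[2]               # <ReplyStr> assumed to be the 3rd field

def log_key(line):
    return line.split()[1]               # <ReplyStr> assumed to be the 2nd field

with open('pattern_sorted') as pat, open('log_sorted') as log:
    pat_line = next(pat, None)
    log_line = next(log, None)
    while pat_line and log_line:
        p, l = pattern_key(pat_line), log_key(log_line)
        if p < l:
            pat_line = next(pat, None)   # this reply never appears in the log
        elif p > l:
            log_line = next(log, None)   # this log reply is not in the pattern file
        else:
            print(log_line.split()[-1])  # <CommentOfReply>
            log_line = next(log, None)   # keep the pattern; log replies may repeat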

1 Answer


Thanks to the help from @MikeSatteson and @tobias_k, I figured it out.

To pick out all the comment strings (from the log file) corresponding to the given reply strings (from the pattern file), the solution is:

  1. a dict is needed, whose keys are reply strings and whose values are lists of comment strings.
  2. pick out all the reply strings from the pattern file; these form the dict's key set.
  3. pick out every reply-comment pair from the log file; if the dict's key set contains the reply, append the comment to that reply's comment list.

Here is the code:

my_dict = {}

# Pass 1: every reply string from the pattern file becomes a dict key.
with open('pattern file', 'r') as pattern_file:
    for line in pattern_file:
        reply = get_reply(line)                  # regex-based parsing helper
        my_dict[reply] = []

# Pass 2: one scan of the log file; the membership test is O(1) on average.
with open('log file', 'r') as log_file:
    for line in log_file:
        pair = get_comment_reply_pair(line)      # regex-based parsing helper
        reply = pair.reply
        comment = pair.comment
        if reply in my_dict:
            my_dict[reply].append(comment)
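
This turns the job into two linear passes with constant-time dict lookups, i.e. O(N + M) instead of O(N × M). A small, hypothetical example of consuming the result:

# e.g. see how many comments each given reply attracted
for reply, comments in my_dict.items():
    print(reply, len(comments))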
  • Good to see you figured it out! Two small suggestions: You can use `with open('pattern file') as pattern_file:` to have the file closed automatically, and you could just do `if reply in my_dict`. – tobias_k Feb 07 '15 at 16:37
  • Nice job @stanleyerror! I'm glad that you found my comments helpful. – Mike Satteson Feb 08 '15 at 17:19
  • @tobias_k I have improved the answer with your advice. Is `with open() as` a buffered reader? I mean, does it load lines into memory only when needed? – stanleyerror Feb 09 '15 at 05:45
  • @stanleyerror That's independent of whether you use `with` or not. If you use `for line in pattern_file`, the file is read line by line, while if you do `lines = pattern_file.readlines()`, you read the entire file at once. What `with` does is just that it makes sure that the file is closed, even if some kind of error occurs. For more on `with`, see [here](http://stackoverflow.com/q/3012488/1639625). – tobias_k Feb 09 '15 at 08:52
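
A small illustration of the difference tobias_k describes; the file name and process() are hypothetical:

# Lazy: the file object yields one line at a time; memory use stays
# roughly constant no matter how large the file is.
with open('log file') as log_file:
    for line in log_file:
        process(line)

# Eager: readlines() loads the entire file into a list of strings at once.
with open('log file') as log_file:
    lines = log_file.readlines()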