Right now I have two large files, a pattern file and a log file, each with over 300,000 lines. The pattern file has this format:
Line 1 : <ID> <Dialog1> <ReplyStr> <Dialog2>
// the ReplyStr is needed as a pattern
The log file is of this format:
Line 1 : <LogData> <ReplyStr> <CommentOfReply>
// I need every CommentOfReply whose ReplyStr appears in the pattern file
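For illustration only, extracting the fields might look roughly like this (the regexes below are hypothetical and assume whitespace-separated fields; the real ones are more complicated):

import re

# Hypothetical regexes: whitespace-separated fields are assumed here purely
# for illustration; the real patterns are more involved.
PATTERN_LINE = re.compile(r'^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$')   # <ID> <Dialog1> <ReplyStr> <Dialog2>
LOG_LINE = re.compile(r'^(\S+)\s+(\S+)\s+(.+)$')                # <LogData> <ReplyStr> <CommentOfReply>

def parse_pattern_line(line):
    # Return the ReplyStr of one pattern-file line, or None if it does not match.
    m = PATTERN_LINE.search(line.strip())
    return m.group(3) if m else None

def parse_log_line(line):
    # Return (ReplyStr, CommentOfReply) of one log-file line, or None.
    m = LOG_LINE.search(line.strip())
    return (m.group(2), m.group(3)) if m else None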
My task is to collect all the comments for these specific replies, in order to analyze users' emotions toward them. So this is what I do, step by step:
- pick out all the patterns and log entries, both using regexes,
- then match them against each other with string comparison (a rough sketch of this approach is below).
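In code, the two steps are roughly the following (a simplified sketch of foo from get_candidate2.py, reusing the hypothetical parse_pattern_line / parse_log_line helpers above; the real function is longer):

def foo(pattern_path, log_path):
    # Step 1: pick out the ReplyStr of every pattern line with a regex.
    replies = []
    with open(pattern_path) as f:
        for line in f:
            reply = parse_pattern_line(line)
            if reply is not None:
                replies.append(reply)

    with open(log_path) as f:
        log_lines = f.readlines()

    # Step 2: for every reply, scan all log lines, pull out the ReplyStr with a
    # regex and keep the comment when the two strings compare equal.
    comments = {}
    for reply in replies:                      # one pass over the log per reply
        for line in log_lines:
            parsed = parse_log_line(line)
            if parsed is not None and parsed[0] == reply:
                comments.setdefault(reply, []).append(parsed[1])
    return comments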
I need to optimize the code; right now it takes 8 hours to finish.
Here is the profile (from cProfile, on the first 10 loops):
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 19.345 19.345 <string>:1(<module>)
1 7.275 7.275 19.345 19.345 get_candidate2.py:12(foo)
3331494 2.239 0.000 10.772 0.000 re.py:139(search)
3331496 4.314 0.000 5.293 0.000 re.py:226(_compile)
7/2 0.000 0.000 0.000 0.000 sre_compile.py:32(_compile)
......
3331507 0.632 0.000 0.632 0.000 {method 'get' of 'dict' objects}
3331260 0.560 0.000 0.560 0.000 {method 'group' of '_sre.SRE_Match' objects}
2 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'remove' of 'list' objects}
3331494 3.241 0.000 3.241 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
9 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects}
6662529 0.737 0.000 0.737 0.000 {method 'strip' of 'str' objects}
From the profile, it seems almost all of the time is spent in re.search(). I have no idea how to reduce it.
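One detail I do notice in the profile: the re.py:139(search) and re.py:226(_compile) rows come from calling the module-level re.search() with the pattern string each time, which makes the module look the compiled pattern up in its cache on every call, while the {method 'search' of '_sre.SRE_Pattern' objects} row is the matching itself. A minimal illustration of the difference (reply_pattern below is just a hypothetical stand-in):

import re

reply_pattern = r'\S+'               # hypothetical stand-in for the real regex
compiled = re.compile(reply_pattern)
line = 'some log line'

# Module-level call: accounts for the re.py:139(search) and re.py:226(_compile)
# rows, because re looks up the cached compiled pattern on every call before
# delegating to the same underlying search.
m1 = re.search(reply_pattern, line)

# Call on a pre-compiled pattern: only the
# {method 'search' of '_sre.SRE_Pattern' objects} work remains.
m2 = compiled.search(line)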