I am using RAKE (Rapid Automatic Keyword Extraction algorithm) to generate keywords. The input is a ~4.5 GB CSV file containing about 53 million records, and I want to know the best possible way to process it. I have RAKE nicely wrapped up in a class. Below are some of the approaches I have tried.
Approach #1:
with open("~inputfile.csv") as fd:
    for line in fd:
        keywords = rake.run(line)
        write(keywords)  # write() is my helper that appends to the output file
This is a basic brute-force way. Assuming that writing to the file takes time, invoking it 53 million times would be costly, so I used the approach below, writing 100K lines to the file in one go.
Approach #2:
with open("~inputfile.csv") as fd:
    temp_string = ''
    counter = 0
    for line in fd:
        keywords = rake.run(line)
        temp_string = temp_string + keywords + '\n'
        counter += 1
        if counter == 100000:
            write(temp_string)
            temp_string = ''
            counter = 0
    if temp_string:  # flush whatever is left after the last full batch
        write(temp_string)
To my surprise, Approach #2 took more time than Approach #1. I don't get it! How is that possible? Also, can you suggest a better approach?
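My only guess so far (unverified) is the repeated concatenation: temp_string = temp_string + keywords + '\n' copies the entire accumulated string on every line, which is quadratic in the batch size. A quick timeit sketch I could use to check this, with made-up sizes:

import timeit

# Made-up sizes, just to compare the two accumulation strategies.
setup = "words = ['keyword'] * 10000"
concat = "s = ''\nfor w in words:\n    s = s + w + '\\n'"  # copies s every pass
joined = "s = '\\n'.join(words) + '\\n'"                   # single pass

print(timeit.timeit(concat, setup, number=10))
print(timeit.timeit(joined, setup, number=10))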
Approach #3 (thanks to cefstat):
with open("~inputfile.csv") as fd:
    strings = []
    counter = 0
    for line in fd:
        strings.append(rake.run(line))
        counter += 1
        if counter == 100000:
            write("\n".join(strings))
            write("\n")
            strings = []
            counter = 0
    if strings:  # flush the remaining lines
        write("\n".join(strings))
        write("\n")
It runs faster than Approaches #1 and #2.
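One more variant I plan to compare against: instead of batching by hand, open the output once with a large buffer and let Python's buffered file object batch the writes itself (the output path below is just a placeholder):

# Sketch only: relies on the file object's own buffering (1 MB here)
# instead of manual batching. "keywords.txt" is a placeholder path.
with open("~inputfile.csv") as fd, \
        open("keywords.txt", "w", buffering=1024 * 1024) as out:
    for line in fd:
        out.write(rake.run(line) + "\n")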
Thanks in advance!