Here is my code for reading a huge file (more than 15 GiB) called interactions.csv. It runs a check on each row and, based on the result, splits the interactions into two separate files: test.csv and train.csv.
It takes more than two days to finish on my machine. Is there any way I can make this code faster, perhaps using some kind of parallelism?
target_items is a list containing some item IDs.
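For illustration, it looks something like this (the IDs here are made up; the real list is much longer):

target_items = ["25982", "112233", "40011"]  # hypothetical item IDs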
The current program:
with open(interactions) as interactionFile, open("train.csv", "w") as train, open("test.csv", "w") as test:
    # next() returns the header line, which already ends with '\n'
    header = next(interactionFile)
    train.write(header)
    test.write(header)
    i = 0
    for row in interactionFile:
        # the file is tab-separated; column 1 holds the item ID
        l = row.split('\t')
        if l[1] in target_items:
            test.write(row)   # row keeps its trailing newline, so no extra '\n' is needed
        else:
            train.write(row)
        print(i)   # progress counter
        i += 1
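
By parallelism I mean something along these lines. This is only a rough, untested sketch: it assumes target_items can be a module-level global that the worker processes inherit (e.g. via fork on Linux), and the placeholder values are made up:

from multiprocessing import Pool

interactions = "interactions.csv"
target_items = ["25982", "112233", "40011"]  # made-up IDs for illustration

def classify(row):
    # Runs in a worker process; target_items is inherited from the parent.
    # Returns (belongs_in_test, row) so the parent can do all the writing.
    return row.split('\t')[1] in target_items, row

if __name__ == "__main__":
    with open(interactions) as interactionFile, \
         open("train.csv", "w") as train, open("test.csv", "w") as test:
        header = next(interactionFile)   # header line keeps its '\n'
        train.write(header)
        test.write(header)
        with Pool() as pool:
            # chunksize batches rows per task to keep inter-process overhead low
            for is_test, row in pool.imap(classify, interactionFile, chunksize=10000):
                (test if is_test else train).write(row)

Would an approach like this actually help here, or is the job mostly I/O-bound?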