
Here is my code for reading a huge file (more than 15 GiB) called interactions.csv. It performs a check on each row and, based on that check, splits the interactions file into two separate files: test.csv and train.csv.

It takes more than two days on my machine to finish. Is there any way I can make this code faster, maybe using some kind of parallelism?

target_items: a list containing some item IDs

The current program:

with open(interactions) as interactionFile, open("train.csv", "wb") as train, open("test.csv", "wb") as test:
    header=interactionFile.next();
    train.write(header+'\n')
    test.write(header+'\n')
    i=0
    for row in interactionFile:
        # process each row
        l = row.split('\t')
        if l[1] in target_items:
            test.write(row+'\n')
        else:
            train.write(row+'\n')
        print(i)
        i+=1
  • Is `target_items` a long list? If so, you can get a significant speedup by converting it to a set. – roganjosh Mar 09 '17 at 17:09
  • Try learning Hadoop. Parallel processing is one of its core features. – Muhammad Haseeb Khan Mar 09 '17 at 17:09
  • It contains 15000 elements. – HimanAB Mar 09 '17 at 17:10
  • Don't know how many lines you have, but printing the line number for each of them to your screen is going to have a really unpleasant effect on your performance. – Tom Tanner Mar 09 '17 at 17:10
  • @HimanAB in that case, you should really consider converting it to a set before you start looping through the file. The speedup will be gigantic from just doing `target_items = set(target_items)`. – roganjosh Mar 09 '17 at 17:12
  • The file has 322776003 lines. I added the `print(i)` to see if the program was still running, as I waited two days and nothing happened! – HimanAB Mar 09 '17 at 17:12
  • @HimanAB in that case, consider `if i % 10000 == 0: print(i)` so that you remove most of the statements but also get regular printouts. – roganjosh Mar 09 '17 at 17:13
  • Wow. By changing it to a set and using `if i % 10000 == 0: print(i)`, it is already a lot faster. Is there any way to make this code parallel using multiprocessing? – HimanAB Mar 09 '17 at 17:16
  • See also http://stackoverflow.com/questions/14863224/efficient-reading-of-800-gb-xml-file-in-python-2-7 – Tom Tanner Mar 09 '17 at 17:16
  • You can also avoid creating `l`. Just do `if row.split('\t')[1] in target_items:`. Should not have a significant effect but you don't need it anyway. **How long are the rows btw? There might be a better way to reach this *identifier*** – Ma0 Mar 09 '17 at 17:18
  • Re multiprocessing - that depends somewhat on where your bottleneck is (if it's the file I/O it won't help much), and whether or not you need to maintain the order of lines in your output files. But have you tried converting `target_items` to a set as suggested? – Tom Tanner Mar 09 '17 at 17:18
  • I think rather than doing `row.split('\t')`, since you don't use the full result of the split, `row.split('\t', 2)` would save some processing. – Tom Tanner Mar 09 '17 at 17:21
  • 1) Use the 2nd arg to `str.split()`: `row.split('\t', 2)`. 2) Stop adding the `\n` to your output. There is already a newline in `row`. – Robᵩ Mar 09 '17 at 17:21
  • @Robᵩ How does row.split('\t', 2) help? – HimanAB Mar 09 '17 at 17:24
  • Maybe this [old question](http://stackoverflow.com/questions/28108972/fastest-way-to-re-read-a-file-in-python), [this](http://stackoverflow.com/questions/30294146/python-fastest-way-to-process-large-file) and [this](http://stackoverflow.com/questions/14944183/python-fastest-way-to-read-a-large-text-file-several-gb) could help you. Have you read them? This isn't a new problem. – Juan Antonio Mar 09 '17 at 17:25
  • Because if you do `row.split('\t')` it has to find every tab and create a list element for each one it finds. If you allow a maximum of 2 splits, it finds the first two columns, puts all the rest of the string into the 3rd element, and doesn't have to look for any more tabs. – Tom Tanner Mar 09 '17 at 17:26
  • You might also find an improvement with `a1 = row.find('\t')` `a2 = row.find('\t', a1 + 1)` `l = row[a1 + 1 : a2]` (assuming you can rely on there being at least 2 tabs every line). – Tom Tanner Mar 09 '17 at 17:29
  • Whatever you end up implementing, let us know how much faster it has become. You can also answer your own question to do so! And we would be able to help much more if you simply post a row of this `csv` file. + there is a `csv` module and stuff like `pandas` that are pretty fast. – Ma0 Mar 09 '17 at 17:30
  • Thanks, all. After using a set instead of a list, printing only every 10000 lines, and also using `split('\t', 2)`, the program is around 10 times faster. Before, it used to take 48 hours, so now it should finish in less than 5 hours. – HimanAB Mar 09 '17 at 17:35
  • @HimanAB I just want to check, you converted to `set` _prior_ to doing anything with the file? Like, you _don't_ have `if l[1] in set(target_items):`, right? I thought the speedup would be greater overall but I guess the main limiting aspect is I/O here. – roganjosh Mar 09 '17 at 17:40
  • Ah yes, I convert it even earlier than that: when I build `target_items`, it is now built as a set instead of a list in the first place. – HimanAB Mar 09 '17 at 17:42
  • It may be better to use awk or grep; Python is too heavy in this situation :) – linpingta Mar 10 '17 at 09:12
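
Putting together the suggestions from the comments above (set membership, `split('\t', 2)`, no extra newlines, and only occasional progress printing), a minimal sketch of the revised loop might look like the following. It keeps the assumptions of the original code: `interactions` holds the input path, the item ID is the second tab-separated column, and `target_items` is already populated.

target_items = set(target_items)   # O(1) membership tests instead of scanning a list

with open(interactions) as interactionFile, open("train.csv", "w") as train, open("test.csv", "w") as test:
    header = next(interactionFile)
    train.write(header)                # header already ends with a newline
    test.write(header)
    for i, row in enumerate(interactionFile):
        # split off only the first two fields; the rest of the row stays in one piece
        if row.split('\t', 2)[1] in target_items:
            test.write(row)            # row already ends with '\n', so nothing is appended
        else:
            train.write(row)
        if i % 10000 == 0:             # occasional progress report instead of printing every line
            print(i)

This keeps the single pass over the file; the savings come from the set lookup, the limited split, and doing far less printing and string concatenation.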
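
Tom Tanner's `str.find` idea, written out as a small helper (the name `second_column` is mine, and it assumes every line really does contain at least two tabs):

def second_column(row):
    # locate the first two tabs and slice out the text between them,
    # avoiding the list that row.split('\t') would have to build
    first_tab = row.find('\t')
    second_tab = row.find('\t', first_tab + 1)
    return row[first_tab + 1:second_tab]

Inside the loop, the test then becomes `if second_column(row) in target_items:` instead of the split-based check.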
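
On the multiprocessing question: as noted in the comments, if the bottleneck is disk I/O a worker pool will not help much, and pickling every row to a worker just for a set lookup adds its own overhead. One pattern that at least preserves line order is to read and write in the main process and farm out only the membership tests in batches. A rough sketch under those assumptions (the helper names are mine; inheriting `target_items` in the workers relies on fork, i.e. Linux):

from multiprocessing import Pool

target_items = set(target_items)       # must exist before the pool is created

def is_test_row(row):
    # runs in a worker process; workers inherit target_items via fork
    return row.split('\t', 2)[1] in target_items

def flush(pool, batch, train, test):
    # classify a batch in parallel, then write the rows back in their original order
    for row, goes_to_test in zip(batch, pool.map(is_test_row, batch)):
        (test if goes_to_test else train).write(row)

pool = Pool()                          # one worker per CPU core by default
with open(interactions) as interactionFile, open("train.csv", "w") as train, open("test.csv", "w") as test:
    header = next(interactionFile)
    train.write(header)
    test.write(header)
    batch = []
    for row in interactionFile:
        batch.append(row)
        if len(batch) == 100000:       # batch size is a tunable guess
            flush(pool, batch, train, test)
            batch = []
    if batch:
        flush(pool, batch, train, test)
pool.close()
pool.join()

Whether this beats the plain single-process loop is something to measure; if the drive is the limiting factor it probably will not.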

0 Answers