
Here is my code for reading a huge file (more than 15 GiB) called interactions.csv. It performs a check on each row and, based on that check, splits the interactions file into two separate files: test.csv and train.csv.

It takes more than two days on my machine to finish. Is there any way I can make this code faster, maybe using some kind of parallelism?

target_items: a list containing some item IDs

The current program:

with open(interactions) as interactionFile, open("train.csv", "wb") as train, open("test.csv", "wb") as test:
    header=interactionFile.next();
    train.write(header+'\n')
    test.write(header+'\n')
    i=0
    for row in interactionFile:
        # process each row
        l = row.split('\t')
        if l[1] in target_items:
            test.write(row+'\n')
        else:
            train.write(row+'\n')
        print(i)
        i+=1
  • Is `target_items` a long list? If so, you can get a significant speedup by converting it to a set. – roganjosh Mar 09 '17 at 17:09
  • Try learning Hadoop. Parallel processing is one of its core features. – Muhammad Haseeb Khan Mar 09 '17 at 17:09
  • It contains 15000 elements. – HimanAB Mar 09 '17 at 17:10
  • Don't know how many lines you have, but printing the line number for each of them to your screen is going to have a really unpleasant effect on your performance. – Tom Tanner Mar 09 '17 at 17:10
  • @HimanAB in that case, you should really consider converting it to a set before you start looping through the file. The speedup will be gigantic from just doing `target_items = set(target_items)`. – roganjosh Mar 09 '17 at 17:12
  • The file has 322776003 lines. I added the `print(i)` to see if the program was still running, as I waited two days and nothing happened! – HimanAB Mar 09 '17 at 17:12
  • @HimanAB in that case, consider `if i % 10000 == 0: print(i)` so that you remove most of the statements but also get regular printouts. – roganjosh Mar 09 '17 at 17:13
  • Wow. By changing it to a set and using `if i % 10000 == 0: print(i)`, it is already a lot faster. Is there any way to make this code parallel using multiprocessing? – HimanAB Mar 09 '17 at 17:16
  • See also http://stackoverflow.com/questions/14863224/efficient-reading-of-800-gb-xml-file-in-python-2-7 – Tom Tanner Mar 09 '17 at 17:16
  • You can also avoid creating `l`. Just do `if row.split('\t')[1] in target_items:`. Should not have a significant effect but you don't need it anyway. **How long are the rows btw? There might be a better way to reach this *identifier*** – Ma0 Mar 09 '17 at 17:18
  • Re multiprocessing - that depends somewhat on where your bottleneck is (if it's the file I/O it won't help much), and whether or not you need to maintain the order of lines in your output files. But have you tried converting `target_items` to a set as suggested? – Tom Tanner Mar 09 '17 at 17:18
  • I think rather than doing `row.split('\t')`, since you don't use the full result of the split, `row.split('\t', 2)` would save some processing. – Tom Tanner Mar 09 '17 at 17:21
  • 1) Use the 2nd arg to `str.split()`: `row.split('\t', 2)`. 2) Stop adding the `\n` to your output. There is already a newline in `row`. – Robᵩ Mar 09 '17 at 17:21
  • @Robᵩ How does row.split('\t', 2) help? – HimanAB Mar 09 '17 at 17:24
  • Maybe this [old question](http://stackoverflow.com/questions/28108972/fastest-way-to-re-read-a-file-in-python), [this](http://stackoverflow.com/questions/30294146/python-fastest-way-to-process-large-file) and [this](http://stackoverflow.com/questions/14944183/python-fastest-way-to-read-a-large-text-file-several-gb) could help you. Have you read them? This isn't a new problem. – Juan Antonio Mar 09 '17 at 17:25
  • Because if you do `row.split('\t')` it has to find every tab and create a list element for each one it finds. If you allow a maximum of 2 splits, it finds the first two columns, puts all the rest of the string into the 3rd element, and doesn't have to look for any more tabs. – Tom Tanner Mar 09 '17 at 17:26
  • You might also find an improvement with `a1 = row.find('\t')` `a2 = row.find('\t', a1 + 1)` `l = row[a1 + 1 : a2]` (assuming you can rely on there being at least 2 tabs every line). – Tom Tanner Mar 09 '17 at 17:29
  • Whatever you end up implementing, let us know how much faster it has become. You can also answer your own question to do so! And we would be able to help much more if you simply post a row of this `csv` file. + there is a `csv` module and stuff like `pandas` that are pretty fast. – Ma0 Mar 09 '17 at 17:30
  • Thanks, all. After using a set instead of a list, printing only every 10000 lines, and also using `split('\t', 2)`, the program is around 10 times faster. Before, it used to take 48 hours, so now it should finish in less than 5 hours. – HimanAB Mar 09 '17 at 17:35
  • @HimanAB I just want to check, you converted to `set` _prior_ to doing anything with the file? Like, you _don't_ have `if l[1] in set(target_items):`, right? I thought the speedup would be greater overall but I guess the main limiting aspect is I/O here. – roganjosh Mar 09 '17 at 17:40
  • Ah yes, I convert it even earlier than that: when I build `target_items`, it is now built as a set instead of a list in the first place. – HimanAB Mar 09 '17 at 17:42
  • It may be better to use awk or grep; Python is too heavy in this situation :) – linpingta Mar 10 '17 at 09:12
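
Putting together the suggestions from the comments above (set membership, `split('\t', 2)`, no extra newlines, and only occasional progress printing), a minimal sketch of the revised loop might look like the following. It keeps the assumptions of the original code: `interactions` holds the input path, the item ID is the second tab-separated column, and `target_items` is already populated.

target_items = set(target_items)   # O(1) membership tests instead of scanning a list

with open(interactions) as interactionFile, open("train.csv", "w") as train, open("test.csv", "w") as test:
    header = next(interactionFile)
    train.write(header)                # header already ends with a newline
    test.write(header)
    for i, row in enumerate(interactionFile):
        # split off only the first two fields; the rest of the row stays in one piece
        if row.split('\t', 2)[1] in target_items:
            test.write(row)            # row already ends with '\n', so nothing is appended
        else:
            train.write(row)
        if i % 10000 == 0:             # occasional progress report instead of printing every line
            print(i)

This keeps the single pass over the file; the savings come from the set lookup, the limited split, and doing far less printing and string concatenation.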
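
Tom Tanner's `str.find` idea, written out as a small helper (the name `second_column` is mine, and it assumes every line really does contain at least two tabs):

def second_column(row):
    # locate the first two tabs and slice out the text between them,
    # avoiding the list that row.split('\t') would have to build
    first_tab = row.find('\t')
    second_tab = row.find('\t', first_tab + 1)
    return row[first_tab + 1:second_tab]

Inside the loop, the test then becomes `if second_column(row) in target_items:` instead of the split-based check.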
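
On the multiprocessing question: as noted in the comments, if the bottleneck is disk I/O a worker pool will not help much, and pickling every row to a worker just for a set lookup adds its own overhead. One pattern that at least preserves line order is to read and write in the main process and farm out only the membership tests in batches. A rough sketch under those assumptions (the helper names are mine; inheriting `target_items` in the workers relies on fork, i.e. Linux):

from multiprocessing import Pool

target_items = set(target_items)       # must exist before the pool is created

def is_test_row(row):
    # runs in a worker process; workers inherit target_items via fork
    return row.split('\t', 2)[1] in target_items

def flush(pool, batch, train, test):
    # classify a batch in parallel, then write the rows back in their original order
    for row, goes_to_test in zip(batch, pool.map(is_test_row, batch)):
        (test if goes_to_test else train).write(row)

pool = Pool()                          # one worker per CPU core by default
with open(interactions) as interactionFile, open("train.csv", "w") as train, open("test.csv", "w") as test:
    header = next(interactionFile)
    train.write(header)
    test.write(header)
    batch = []
    for row in interactionFile:
        batch.append(row)
        if len(batch) == 100000:       # batch size is a tunable guess
            flush(pool, batch, train, test)
            batch = []
    if batch:
        flush(pool, batch, train, test)
pool.close()
pool.join()

Whether this beats the plain single-process loop is something to measure; if the drive is the limiting factor it probably will not.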

0 Answers