
I have a large file (>= 1GB) that I'm trying to read and then load the contents into a dictionary. With simple code that reads one line at a time, it takes about 8 minutes just to read the file and populate the dictionary. The code snippet I'm using is below:

my_dict = {}
with open(filename, 'r') as f:
    for line in f:
        toks = line.rstrip().split()
        id1 = toks[0]
        id2 = toks[1]
        start = int(toks[4])
        end = int(toks[5])

        if id1 not in my_dict:
            my_dict[id1] = [[start, end]]
        else:
            if [start, end] not in my_dict[id1]:
                my_dict[id1].append([start, end])

        if id2 not in my_dict:
            my_dict[id2] = [[start, end]]
        else:
            if [start, end] not in my_dict[id2]:
                my_dict[id2].append([start, end])

Now, just running this block of code alone is taking a long time, and I'm wondering if I can speed up this process using multiprocessing, maybe? I've looked into some material that's close to what I want to do, like here and here and many others. But given that I'm very new to this, I'm struggling to even decide whether multiprocessing is the right way to go. Also, most of the multiprocessing-related resources available don't explain how to update the dictionary. I think I read somewhere that sharing the same dictionary across processes is a bad idea. I wish I could be more specific with my question, and hopefully it won't get flagged as not suitable for SO, but I just want to speed up this process of building the dictionary.
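For what it's worth, the pattern I keep seeing in those resources is that each worker builds its own local dictionary from a chunk of lines and the main process merges the partial results afterwards, so nothing is actually shared between processes. A rough sketch of what I have in mind is below (the chunk size, pool size, and the `parse_chunk` / `build_dict_parallel` names are just placeholders I made up, and I have no idea yet whether this would beat plain sequential reading):

import collections
import itertools
import multiprocessing

def parse_chunk(lines):
    # Each worker fills its own local dict; nothing is shared across processes.
    local = collections.defaultdict(set)
    for line in lines:
        toks = line.split()
        start, end = int(toks[4]), int(toks[5])
        local[toks[0]].add((start, end))
        local[toks[1]].add((start, end))
    return local

def build_dict_parallel(filename, chunk_size=100000, workers=4):
    merged = collections.defaultdict(set)
    with open(filename, 'r') as f, multiprocessing.Pool(workers) as pool:
        # Hand the file to the workers in fixed-size batches of lines.
        chunks = iter(lambda: list(itertools.islice(f, chunk_size)), [])
        for partial in pool.imap_unordered(parse_chunk, chunks):
            # Merge each worker's partial dict back into the final result.
            for key, ranges in partial.items():
                merged[key] |= ranges
    return merged

But from what I've read, shipping the chunks to the workers and the partial dicts back may cost as much as the parsing itself, which is part of why I'm unsure this is the right direction.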

EDIT

After the suggestions made by @juanpa.arrivillaga, my code looks like:

import collections

my_dict = collections.defaultdict(set)
with open(filename, 'r') as f:
    for line in f:
        toks = line.rstrip().split()
        id1 = toks[0]
        id2 = toks[1]
        start = int(toks[4])
        end = int(toks[5])

        my_dict[id1].add((start,end))
        my_dict[id2].add((start,end))

This reduced my time to ~21 seconds when run on a ~500MB file with 11mil lines.
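Folding in the remaining micro-optimizations from the comments below (moving the loop into a function so the name lookups are local, and dropping `rstrip()` since `split()` already discards trailing whitespace), a rough sketch of what I'm running now is (the `load_ranges` name is arbitrary):

import collections

def load_ranges(filename):
    # Locals inside a function are cheaper to look up than module-level globals.
    my_dict = collections.defaultdict(set)
    with open(filename, 'r') as f:
        for line in f:
            toks = line.split()   # split() drops the trailing newline on its own
            start, end = int(toks[4]), int(toks[5])
            my_dict[toks[0]].add((start, end))
            my_dict[toks[1]].add((start, end))
    return my_dict

my_dict = load_ranges(filename)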

pramesh shakya
  • Use sets, not lists. – Stefan Pochmann Dec 18 '19 at 18:29
  • I understand that it would make me avoid checking if the range already exists in the dictionary, but would it substantially improve the speed of populating the dictionary? The file usually has about 30mil lines. – pramesh shakya Dec 18 '19 at 18:31
  • That check is the slow part, so yes. – Stefan Pochmann Dec 18 '19 at 18:34
  • "I'm wondering if I can speed up this process using multiprocessing, maybe?" You should always consider your algorithm before even attempting to consider multiprocessing. As stated above, you are inefficiently doing a membership test on lists. Fix that first. – juanpa.arrivillaga Dec 18 '19 at 18:46
  • @CodeDifferent **absolutely not**. In Python, `.append` is an amortized constant-time operation. `.append`ing in a loop is idiomatic and very fast. The problem is the membership test. – juanpa.arrivillaga Dec 18 '19 at 18:47
  • Yes, it was indeed the membership test that was the bottleneck. I tried it on a file of about 500MB with 11mil lines; before it took about 4 minutes and now it's down to 26 seconds. – pramesh shakya Dec 18 '19 at 18:48
  • Still seems slow, honestly. What exactly are you doing now? There are various micro-optimizations you could do at this point if that is fixed. What sort of hard disk do you have? SSD? – juanpa.arrivillaga Dec 18 '19 at 18:52
  • @juanpa.arrivillaga I'm exactly running the above code and nothing else. There are plenty of other things I have yet to do after I create the dictionary, but for now I'm just trying to speed up portions of my code. My computer's configuration is 8 cores, 1TB HDD, 16GB RAM. I changed the list of lists to a list of sets. What other micro-optimizations would you propose? – pramesh shakya Dec 18 '19 at 18:56
  • This looks like a good map-reduce problem. Here is a simple Python example that may apply almost directly (https://pymotw.com/2/multiprocessing/mapreduce.html). – KDecker Dec 18 '19 at 18:58
  • @prameshshakya Sounds like you're still doing it wrong. Shouldn't be list of set but set of tuple. – Stefan Pochmann Dec 18 '19 at 18:59
  • @prameshshakya That doesn't sound right. You don't want a list of sets, then you are *still doing the membership test on a list*. **How exactly** are you handling the sets? Note, you can use `import collections; my_dict = collections.defaultdict(set)` and that should be as speedy as possible; then you don't need to do the check, and your loop body simply becomes `my_dict[id1].add((start, end)); my_dict[id2].add((start, end))` – juanpa.arrivillaga Dec 18 '19 at 19:00
  • @juanpa.arrivillaga Even after I made the changes as you said, I'm still getting around 21 seconds. – pramesh shakya Dec 18 '19 at 19:10
  • Again, **please post your code as it currently exists**. In any case, if you are using an HDD, it may be I/O-bottlenecked. – juanpa.arrivillaga Dec 18 '19 at 19:11
  • @juanpa.arrivillaga Apologies, I've edited the question to include the edited code. Is there any way to improve the I/O bottleneck from a software standpoint? – pramesh shakya Dec 18 '19 at 19:18
  • @prameshshakya unlikely, but maybe someone has a suggestion. And if it is all in one large file it will not be amenable to multiprocessing. – juanpa.arrivillaga Dec 18 '19 at 19:19
  • I would suggest putting everything in a function. Code in the global namespace is going to be slowed down by global lookups. That is one obvious micro-optimization – juanpa.arrivillaga Dec 18 '19 at 19:21
  • @juanpa.arrivillaga Yes, I already did that. If someone else is interested, here's the link: https://stackoverflow.com/questions/11241523/why-does-python-code-run-faster-in-a-function – pramesh shakya Dec 18 '19 at 19:25
  • @juanpa.arrivillaga With `a = [str(i) for i in range(11000000)]` it takes my decent laptop about 2 seconds to do just `[int(s) for s in a]`. So sounds reasonable now. – Stefan Pochmann Dec 18 '19 at 19:27
  • @StefanPochmann yeah, you are right. I think the IO is what is dominating the runtime at this point. – juanpa.arrivillaga Dec 18 '19 at 19:28
  • @juanpa.arrivillaga Reading one large file should be fast (no matter SSD or HDD). Plus with 16 GB RAM, their 500 MB file might stay in cache between their attempts. I think it's mostly that all those operations take time. – Stefan Pochmann Dec 18 '19 at 19:31
  • @juanpa.arrivillaga Do you think the lines where I'm converting string to int could be rewritten to use map(), which would speed it up? – pramesh shakya Dec 18 '19 at 19:31
  • @prameshshakya no. That isn't going to speed anything up appreciably, unless you were doing one giant conversion. Even then, I would only expect a marginal improvement. – juanpa.arrivillaga Dec 18 '19 at 19:33
  • Thank you, all of you guys. I guess that's all the performance I can squeeze out of it. I'll keep my eyes open for anyone else's suggestions. – pramesh shakya Dec 18 '19 at 19:40
  • You could remove the `rstrip()`. The `split()` discards that whitespace anyway. – Stefan Pochmann Dec 18 '19 at 19:50

0 Answers