I am running the code below to read a Pandas DataFrame row by row and build "chains" of rows under keys in a dictionary.

  flows = dict()

  # Group rows by their 5-tuple flow key; each value is a list of [timestamp, length] pairs.
  for row in df.itertuples():
    key = (row.src_ip, row.dst_ip, row.src_port, row.dst_port, row.protocol)

    if key in flows:
      flows[key] += [[row.timestamp, row.length]]
    else:
      flows[key] = [[row.timestamp, row.length]]

The context is that I am collecting lists of packets that belong to the same flow (the 5-tuple of source/destination addresses, ports, and protocol). I hope to process 70 million packets.
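For reference, the same grouping written with `collections.defaultdict`, which removes the explicit key check, would look roughly like this (a sketch; the timings below are for the original version):

  from collections import defaultdict

  # defaultdict(list) creates the empty list the first time a new key is
  # touched, so the `if key in flows` / `else` branch is no longer needed.
  flows = defaultdict(list)

  for row in df.itertuples():
    key = (row.src_ip, row.dst_ip, row.src_port, row.dst_port, row.protocol)
    flows[key].append([row.timestamp, row.length])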

Below is a comparison of my line-by-line profiling for 1,000,000 and 2,000,000 rows in the DataFrame. The lines `if key in flows:` and `flows[key] = [[row.timestamp, row.length]]` do not scale with n the way I would expect; I thought both insert and search were O(1) on average and O(n) at worst?

[Screenshots: line-by-line profiler output for 1,000,000 and 2,000,000 rows]
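To separate dict behaviour from everything else in the loop, a hypothetical micro-benchmark along these lines could be used (this is not what produced the numbers above):

  import timeit

  # Hypothetical micro-benchmark (separate from the profiler output above):
  # time a membership test against dicts of increasing size to see whether
  # lookup cost itself grows with n.
  for n in (1_000_000, 2_000_000, 4_000_000):
    d = {i: None for i in range(n)}
    t = timeit.timeit('12345 in d', globals={'d': d}, number=1_000_000)
    print(n, 'keys:', round(t, 3), 's per 1,000,000 lookups')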

Why does it take so long to check for the key and insert a new key:list pair?

Can anyone offer advice on how to speed up this code, or suggest a better data structure for this?

Below is an example of a key:value pair. I expect to hold around 35 million of these in a dictionary.

  ('74.178.232.151', '163.85.67.184', 443.0, 49601.0, 'UDP'): [[1562475600.3387961, 1392], [1562475600.338807, 1392], [1562475600.3388178, 1392], [1562475600.3388348, 1392], [1562475600.338841, 1392]]
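A rough, hypothetical sanity check of the per-entry memory footprint (shallow `sys.getsizeof` sizes only, so it underestimates the true cost):

  import sys

  # Shallow size of one key/value pair shaped like the example above. This
  # ignores the per-element str/float objects and the dict's own hash table,
  # so treat it as a lower bound per entry.
  key = ('74.178.232.151', '163.85.67.184', 443.0, 49601.0, 'UDP')
  value = [[1562475600.3387961, 1392], [1562475600.338807, 1392]]

  per_entry = (sys.getsizeof(key) + sys.getsizeof(value)
               + sum(sys.getsizeof(p) for p in value))
  print(per_entry, 'bytes (shallow) per entry;',
        round(per_entry * 35_000_000 / 1e9, 1), 'GB for 35 million entries')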
gregory
  • Iterating over a pandas dataframe is generally a [bad idea](https://stackoverflow.com/a/55557758/9081267). It's not very clear what you are trying to achieve, but I think there should be ways to speed up your code using optimized `pandas` native methods (see the sketch after these comments). – Erfan Jan 27 '20 at 18:30
  • Are you using up all your memory? Are you actually using the dataframe for anything? Why is pandas involved here? – juanpa.arrivillaga Jan 27 '20 at 18:30
  • Thanks for your comments. @Erfan iterating over pandas only takes up 4% of the processing time for 2 million records so I don't think that's the problem. – gregory Jan 27 '20 at 18:33
  • @juanpa.arrivillaga This is using surprisingly little memory. You are correct, I am not really using pandas for anything and plan on removing that. But I still need a dictionary for further processing. – gregory Jan 27 '20 at 18:36
  • @gregory um, I doubt it is using "little memory". What is "surprising little" *exactly*? Your dict would take up tons of memory, unless this is being done on some beefy server. How much RAM do you have available? – juanpa.arrivillaga Jan 27 '20 at 18:37
  • For example, how many unique keys do you expect to create? – juanpa.arrivillaga Jan 27 '20 at 18:39
  • @juanpa.arrivillaga I am using Google Colab with 35GB of memory. The 2 million dataset creates a dictionary of approx. 1 million unique keys. Assuming this trend, I would like to process 70 million rows into a dictionary with 35 million unique keys. Am I running into low performance because of too many hash collisions? – gregory Jan 27 '20 at 18:42
  • Maybe, but I still think you are running out of RAM. Can you give me an example of a key and value pair? – juanpa.arrivillaga Jan 27 '20 at 18:50
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206741/discussion-between-gregory-and-juanpa-arrivillaga). – gregory Jan 27 '20 at 18:59
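Below is a minimal sketch of the pandas-native grouping mentioned in the comments, assuming the same column names as in the question (illustrative only, not benchmarked):

  # Group by the 5-tuple directly in pandas; each group's rows supply the
  # [timestamp, length] pairs for one flow. Column names are assumed to match
  # those in the question; this is an illustrative sketch, not a benchmarked fix.
  grouped = df.groupby(['src_ip', 'dst_ip', 'src_port', 'dst_port', 'protocol'])

  flows = {key: group[['timestamp', 'length']].values.tolist()
           for key, group in grouped}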

0 Answers