I am running the code below to read a Pandas DataFrame row by row and build "chains" of rows under keys in a dictionary.

  flows = dict()

  # Group rows by their 5-tuple flow key; each value is a list of [timestamp, length] pairs.
  for row in df.itertuples():
    key = (row.src_ip, row.dst_ip, row.src_port, row.dst_port, row.protocol)

    if key in flows:
      flows[key] += [[row.timestamp, row.length]]
    else:
      flows[key] = [[row.timestamp, row.length]]

The context is that I am collecting lists of packets that belong to the same flow (the 5-tuple of source/destination addresses, ports, and protocol). I hope to process 70 million packets.
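For reference, the same grouping written with `collections.defaultdict`, which removes the explicit key check, would look roughly like this (a sketch; the timings below are for the original version):

  from collections import defaultdict

  # defaultdict(list) creates the empty list the first time a new key is
  # touched, so the `if key in flows` / `else` branch is no longer needed.
  flows = defaultdict(list)

  for row in df.itertuples():
    key = (row.src_ip, row.dst_ip, row.src_port, row.dst_port, row.protocol)
    flows[key].append([row.timestamp, row.length])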

Below is a comparison of my line-by-line profiling for 1,000,000 and 2,000,000 rows in the DataFrame. The lines `if key in flows:` and `flows[key] = [[row.timestamp, row.length]]` do not scale with n the way I would expect; I thought both insert and search were O(1) on average and O(n) at worst?

[Screenshots: line-by-line profiler output for 1,000,000 and 2,000,000 rows]
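To separate dict behaviour from everything else in the loop, a hypothetical micro-benchmark along these lines could be used (this is not what produced the numbers above):

  import timeit

  # Hypothetical micro-benchmark (separate from the profiler output above):
  # time a membership test against dicts of increasing size to see whether
  # lookup cost itself grows with n.
  for n in (1_000_000, 2_000_000, 4_000_000):
    d = {i: None for i in range(n)}
    t = timeit.timeit('12345 in d', globals={'d': d}, number=1_000_000)
    print(n, 'keys:', round(t, 3), 's per 1,000,000 lookups')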

Why does it take so long to check for the key and insert a new key:list pair?

Can anyone offer advice on how to speed up this code, or suggest a better data structure for this?

Below is an example of a key:value pair. I expect to hold around 35 million of these in a dictionary.

  ('74.178.232.151', '163.85.67.184', 443.0, 49601.0, 'UDP'): [[1562475600.3387961, 1392], [1562475600.338807, 1392], [1562475600.3388178, 1392], [1562475600.3388348, 1392], [1562475600.338841, 1392]]
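A rough, hypothetical sanity check of the per-entry memory footprint (shallow `sys.getsizeof` sizes only, so it underestimates the true cost):

  import sys

  # Shallow size of one key/value pair shaped like the example above. This
  # ignores the per-element str/float objects and the dict's own hash table,
  # so treat it as a lower bound per entry.
  key = ('74.178.232.151', '163.85.67.184', 443.0, 49601.0, 'UDP')
  value = [[1562475600.3387961, 1392], [1562475600.338807, 1392]]

  per_entry = (sys.getsizeof(key) + sys.getsizeof(value)
               + sum(sys.getsizeof(p) for p in value))
  print(per_entry, 'bytes (shallow) per entry;',
        round(per_entry * 35_000_000 / 1e9, 1), 'GB for 35 million entries')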
gregory
  • Iterating over a pandas dataframe is generally a [bad idea](https://stackoverflow.com/a/55557758/9081267). It's not very clear what you are trying to achieve, but I think there should be ways to speed up your code using optimized `pandas` native methods (see the sketch after these comments). – Erfan Jan 27 '20 at 18:30
  • Are you using up all your memory? Are you actually using the dataframe for anything? Why is pandas involved here? – juanpa.arrivillaga Jan 27 '20 at 18:30
  • Thanks for your comments. @Erfan iterating over pandas only takes up 4% of the processing time for 2 million records so I don't think that's the problem. – gregory Jan 27 '20 at 18:33
  • @juanpa.arrivillaga This is using surprisingly little memory. You are correct, I am not really using pandas for anything and plan on removing that. But I still need a dictionary for further processing. – gregory Jan 27 '20 at 18:36
  • @gregory um, I doubt it is using "little memory". What is "surprising little" *exactly*? Your dict would take up tons of memory, unless this is being done on some beefy server. How much RAM do you have available? – juanpa.arrivillaga Jan 27 '20 at 18:37
  • For example, how many unique keys do you expect to create? – juanpa.arrivillaga Jan 27 '20 at 18:39
  • @juanpa.arrivillaga I am using Google Colab with 35GB of memory. The 2 million dataset creates a dictionary of approx. 1 million unique keys. Assuming this trend, I would like to process 70 million rows into a dictionary with 35 million unique keys. Am I running into low performance because of too many hash collisions? – gregory Jan 27 '20 at 18:42
  • Maybe, but I still think you are running out of RAM. Can you give me an example of a key and value pair? – juanpa.arrivillaga Jan 27 '20 at 18:50
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206741/discussion-between-gregory-and-juanpa-arrivillaga). – gregory Jan 27 '20 at 18:59
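Below is a minimal sketch of the pandas-native grouping mentioned in the comments, assuming the same column names as in the question (illustrative only, not benchmarked):

  # Group by the 5-tuple directly in pandas; each group's rows supply the
  # [timestamp, length] pairs for one flow. Column names are assumed to match
  # those in the question; this is an illustrative sketch, not a benchmarked fix.
  grouped = df.groupby(['src_ip', 'dst_ip', 'src_port', 'dst_port', 'protocol'])

  flows = {key: group[['timestamp', 'length']].values.tolist()
           for key, group in grouped}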

0 Answers