I am running the code below to iterate row by row over a pandas DataFrame and build "chains" of rows under keys in a dictionary.
flows = dict()
for row in df.itertuples():
    key = (row.src_ip, row.dst_ip, row.src_port, row.dst_port, row.protocol)
    if key in flows:
        flows[key] += [[row.timestamp, row.length]]
    else:
        flows[key] = [[row.timestamp, row.length]]
The context is that I am grouping packets that belong to the same flow (the 5-tuple of source/destination addresses, ports, and protocol). I hope to process 70 million packets.
Comparing my line-by-line performance for 1,000,000 and 2,000,000 rows in the DataFrame, the lines `if key in flows` and `flows[key] = [[row.timestamp, row.length]]` do not scale with n the way I would expect. I thought both lookup and insertion were O(1) on average and O(n) at worst?
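For anyone who wants to reproduce the shape of the test, here is a minimal, self-contained sketch of the kind of scaling run I am describing. The synthetic `make_df` helper and its constants are stand-ins I made up for this post, not my real capture; it builds roughly two packets per flow, matching the ratio in my data.

import random
import time

import pandas as pd

def make_df(n):
    """Synthetic stand-in for my capture: ~2 packets per flow."""
    rng = random.Random(42)
    # Pool of n // 2 distinct 5-tuples, so the dict ends up about half the frame size
    keys = [
        ("10.%d.%d.%d" % (rng.randrange(256), rng.randrange(256), rng.randrange(256)),
         "192.168.%d.%d" % (rng.randrange(256), rng.randrange(256)),
         float(rng.randrange(1024, 65536)), 443.0, "TCP")
        for _ in range(n // 2)
    ]
    rows = [(*rng.choice(keys), 1562475600.0 + i * 1e-5, rng.randrange(64, 1501))
            for i in range(n)]
    return pd.DataFrame(rows, columns=["src_ip", "dst_ip", "src_port",
                                       "dst_port", "protocol", "timestamp", "length"])

for n in (1_000_000, 2_000_000):
    df = make_df(n)
    flows = dict()
    start = time.perf_counter()
    for row in df.itertuples():
        key = (row.src_ip, row.dst_ip, row.src_port, row.dst_port, row.protocol)
        if key in flows:
            flows[key] += [[row.timestamp, row.length]]
        else:
            flows[key] = [[row.timestamp, row.length]]
    print(f"{n:>9} rows: {time.perf_counter() - start:.2f}s, {len(flows)} flows")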
Why does it take so long to check for the key and insert a new key:list pair?
Can anyone offer advice on how to speed up this code, or suggest a better data structure for this task?
Below is an example of a key:value pair. I expect to hold around 35 million of these in a dictionary.
('74.178.232.151', '163.85.67.184', 443.0, 49601.0, 'UDP'): [[1562475600.3387961, 1392], [1562475600.338807, 1392], [1562475600.3388178, 1392], [1562475600.3388348, 1392], [1562475600.338841, 1392]]
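For what it's worth, one variation I have considered but not yet profiled is `collections.defaultdict`, which folds the membership test into the insert. I would still welcome advice on whether this, or a different structure entirely, is the better approach:

from collections import defaultdict

flows = defaultdict(list)
for row in df.itertuples():
    key = (row.src_ip, row.dst_ip, row.src_port, row.dst_port, row.protocol)
    # Missing keys are created as empty lists automatically, so there is
    # no separate `if key in flows` check before each append.
    flows[key].append([row.timestamp, row.length])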