New to Pandas and have a question that I cannot answer on my own. For context, this is output from a firewall. it generates millions of packets and I am trying to aggregate that data into a firewall ruleset. The best way I've come up with is to identify traffic based on the destination IP.
The source/dest ports will change if they are ephemeral so it's important that I aggregate them into the same row. That way I can determine the range of ports for the ruleset.
RAW CSV:
dvc,"src_interface",transport,"src_ip","src_port","dest_ip","dest_port",direction,action,cause,count "Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1025,outbound,allowed,"",2 "Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1026,outbound,allowed,"",2 "Firewall-1",outside,tcp,"4.4.4.4",22,"1.1.1.1",1028,outbound,allowed,"",2 "Firewall-1",outside,tcp,"3.3.3.3",22,"2.2.2.2",2200,outbound,allowed,"",2
Dataframe:
dvc src_interface transport src_ip src_port dest_ip dest_port direction action cause count
0 Firewall-1 outside tcp 4.4.4.4 53 1.1.1.1 1025 outbound allowed NaN 2
1 Firewall-1 outside tcp 4.4.4.4 53 1.1.1.1 1026 outbound allowed NaN 2
2 Firewall-1 outside tcp 4.4.4.4 53 1.1.1.1 1028 outbound allowed NaN 2
3 Firewall-1 outside tcp 3.3.3.3 22 2.2.2.2 2200 outbound allowed NaN 2
How would I go about merging rows with the same dest_ip?
CODE:
df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index = True)
index_cols = df.columns.tolist()
index_cols.remove('dest_ip')
df = df.groupby(index_cols, as_index=False)['dest_ip'].apply(list)
print(df)
Expected Output:
Firewall-1 outside tcp 4.4.4.4 53 1.1.1.1 1025-1026,1028 outbound allowed nan 2
Firewall-1 outside tcp 3.3.3.3 22 2.2.2.2 2200 outbound allowed nan 2
Most examples I've found online involve joining two dataframes whereas I only have the one. Any help would be appreciated. Thanks in advance!