
New to Pandas and have a question that I cannot answer on my own. For context, this is output from a firewall: it generates millions of packets, and I am trying to aggregate that data into a firewall ruleset. The best way I've come up with is to identify traffic based on the destination IP.

The source/dest ports will change if they are ephemeral, so it's important that I aggregate them into the same row. That way I can determine the range of ports for the ruleset.

RAW CSV:

dvc,"src_interface",transport,"src_ip","src_port","dest_ip","dest_port",direction,action,cause,count
"Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1025,outbound,allowed,"",2
"Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1026,outbound,allowed,"",2
"Firewall-1",outside,tcp,"4.4.4.4",22,"1.1.1.1",1028,outbound,allowed,"",2
"Firewall-1",outside,tcp,"3.3.3.3",22,"2.2.2.2",2200,outbound,allowed,"",2

Dataframe:

dvc src_interface transport   src_ip  src_port        dest_ip  dest_port direction   action  cause  count
0  Firewall-1       outside       tcp  4.4.4.4       53  1.1.1.1       1025  outbound  allowed    NaN      2
1  Firewall-1       outside       tcp  4.4.4.4       53  1.1.1.1       1026  outbound  allowed    NaN      2
2  Firewall-1       outside       tcp  4.4.4.4       53  1.1.1.1       1028  outbound  allowed    NaN      2
3  Firewall-1       outside       tcp  3.3.3.3       22  2.2.2.2       2200  outbound  allowed    NaN      2

How would I go about merging rows with the same dest_ip?

CODE:

df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index = True)
index_cols = df.columns.tolist()
index_cols.remove('dest_ip')
df = df.groupby(index_cols, as_index=False)['dest_ip'].apply(list)
print(df)

Expected Output:

Firewall-1 outside tcp 4.4.4.4 53 1.1.1.1 1025-1026,1028 outbound allowed nan 2
Firewall-1 outside tcp 3.3.3.3 22 2.2.2.2 2200 outbound allowed nan 2

Most examples I've found online involve joining two dataframes whereas I only have the one. Any help would be appreciated. Thanks in advance!

depark
  • Sounds like you’re looking for a ‘groupby’ operation not a merge. Check out the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html – robertwest Nov 24 '18 at 22:26
  • Please show us a [MCVE]. In particular, there are no dataframes in your post right now... – timgeb Nov 24 '18 at 22:33
  • Look at this answer: https://stackoverflow.com/q/22219004/3220100 – robertwest Nov 24 '18 at 23:11

2 Answers


Try this: group by all the columns where you expect the information to be duplicated, then aggregate the distinct "dest_port" values into a list:

df = pd.DataFrame([
            ["Firewall-1","outside","tcp","4.4.4.4",53,"1.1.1.1",1025,"outbound","allowed","",2], 
            ["Firewall-1","outside","tcp","4.4.4.4",53,"1.1.1.1",1026,"outbound","allowed","",2], 
            ["Firewall-1","outside","tcp","4.4.4.4",22,"1.1.1.1",1028,"outbound","allowed","",2], 
            ["Firewall-1","outside","tcp","3.3.3.3",22,"2.2.2.2",2200,"outbound", "allowed","",2]
        ], 
        columns=["dvc","src_interface","transport","src_ip","src_port","dest_ip","dest_port","direction", "action", "cause", "count"])

index_cols = df.columns.tolist()
index_cols.remove("dest_port") 
df = df.groupby(index_cols)["dest_port"].apply(list)
df = df.reset_index()

This results in 3 remaining rows, not 2 as in your desired output (the 1.1.1.1 rows split further because src_port differs, 53 vs 22):

   dvc              src_interface transport   src_ip         src_port  dest_ip direction   action cause  count     dest_port
0  Firewall-1       outside       tcp         3.3.3.3        22  2.2.2.2  outbound  allowed            2        [2200]
1  Firewall-1       outside       tcp         4.4.4.4        22  1.1.1.1  outbound  allowed            2        [1028]
2  Firewall-1       outside       tcp         4.4.4.4        53  1.1.1.1  outbound  allowed            2  [1025, 1026]
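To get from the list column to the range notation in the expected output (`1025-1026,1028`), you could post-process with a small helper. `ports_to_ranges` below is a hypothetical name, not a pandas function; it's a minimal sketch that collapses consecutive ports into dash-separated runs:

```python
def ports_to_ranges(ports):
    """Collapse a list of ports into range notation,
    e.g. [1025, 1026, 1028] -> '1025-1026,1028'."""
    ports = sorted(set(ports))
    ranges = []
    start = prev = ports[0]
    for p in ports[1:]:
        if p == prev + 1:          # still consecutive, extend the current run
            prev = p
        else:                      # gap found: close out the current run
            ranges.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = p
    ranges.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(ranges)
```

Applied after the groupby it would look like `df["dest_port"] = df["dest_port"].apply(ports_to_ranges)`.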
robertwest
  • I think that gets me a little closer however the dataframe is empty for some reason: $ python3.7 firewallData.py Empty DataFrame Columns: [] Index: [] – depark Nov 25 '18 at 01:03
  • df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index = True) index_cols = df.columns.tolist() index_cols.remove('dest_ip') df = df.groupby(index_cols, as_index=False)['dest_ip'].apply(list) print(df) – depark Nov 25 '18 at 01:04
  • Can you please add to your question a very simple dataframe constructor that can be used to make a test dataframe from the 3 rows of sample data, then I can check this code – robertwest Nov 25 '18 at 01:06
  • Thanks again Robert for looking into this. I've added more detail. – depark Nov 25 '18 at 01:23
  • Nearly there. I need this: df = pd.DataFrame({'colA': ["item1", "item2", "item3", "item4"], 'colB': [...], ...}). – robertwest Nov 25 '18 at 01:25
  • Forgive my ignorance but I don't think my dataframe looks like that: df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index = True) – depark Nov 25 '18 at 01:31
  • I don’t have your csv so I can’t run this line of code. You need to create a small dataframe to mimic the read from the csv. Use the example I gave with the first 4 rows of your dataset – robertwest Nov 25 '18 at 01:33
  • Understood. I've added the data above. Appreciate you! – depark Nov 25 '18 at 01:39
  • I've updated my answer. it works on python 3. Next time you post on stackoverflow make sure to write "df = pd.DataFrame([...... " like I have in this answer so that people can easily test it – robertwest Nov 25 '18 at 02:08
  • Understood. I'll adhere to that format going forward. What does the result of your code look like for you? I'm still seeing an "Empty DataFrame" message when I apply the above solution. – depark Nov 25 '18 at 02:17
  • It works. Run the code exactly as written including the dataframe construction. Perhaps the data you are reading from the csv is somehow different from this? Try executing 1 line of code at a time in the ipython console and observe the intermediate results to debug – robertwest Nov 25 '18 at 13:56
  • Thanks, Robert. I'm a lot closer but still not there. While the data is the same I believe that your dataframe is different from my csv. I suspect that yours returns each row as a list and mine does not. It keeps failing around this line of code if you import a csv into a dataframe: df = df.groupby(index_cols)["dest_port"].apply(list) – depark Nov 26 '18 at 02:41
  • The dataframe is constructed with a list of lists but each row is not a list – robertwest Nov 26 '18 at 14:36
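As a side note, the Empty DataFrame reported in the comments above is most likely caused by the NaN values in the cause column: groupby silently drops any row whose group key contains NaN, and read_csv parses the empty cause field as NaN, so every row gets dropped. A minimal sketch of the fix, assuming that diagnosis (fill the NaNs before grouping; pandas 1.1+ also accepts dropna=False on groupby):

```python
import numpy as np
import pandas as pd

# Mimic what read_csv produces: the empty cause field becomes NaN.
df = pd.DataFrame(
    [["Firewall-1", "outside", "tcp", "4.4.4.4", 53, "1.1.1.1", 1025, "outbound", "allowed", np.nan, 2],
     ["Firewall-1", "outside", "tcp", "4.4.4.4", 53, "1.1.1.1", 1026, "outbound", "allowed", np.nan, 2]],
    columns=["dvc", "src_interface", "transport", "src_ip", "src_port", "dest_ip",
             "dest_port", "direction", "action", "cause", "count"])

index_cols = [c for c in df.columns if c != "dest_port"]

# NaN in any group-key column silently drops the row, so this comes back empty:
empty = df.groupby(index_cols)["dest_port"].apply(list)

# Filling the NaNs first keeps the rows and the aggregation works:
df["cause"] = df["cause"].fillna("")
grouped = df.groupby(index_cols)["dest_port"].apply(list).reset_index()
```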

I think the following might do what you're looking for:

    import pandas as pd
    #create practice dataframe. will remove rows if values in 'key' are duplicate
    df = pd.DataFrame({'key':[1,1,3,4],'color':[1,2,3,2],'house':[1,2,3,7]})
    print(df.drop_duplicates(['key']))

Original dataframe:

    key  color  house
    1      1      1
    1      2      2
    3      3      3
    4      2      7

Output dataframe:

    key  color  house
    1      1      1
    3      3      3
    4      2      7
DeathbyGreen
  • Thanks for the reply :) It looks like your sample data is somewhat different. When I attempt to run your code I get this error: TypeError: 'method' object is not subscriptable – depark Nov 24 '18 at 23:03
  • You're getting that error because this example has a syntax error in it. It should be: drop_duplicates(subset=['key']). For you the subset parameter would have to be all columns except 'dest_ip' – robertwest Nov 24 '18 at 23:20
  • Robert is right, I missed a set of parenthesis. I've edited my answer – DeathbyGreen Nov 24 '18 at 23:23
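Putting the comments' fix into runnable form, here is a drop_duplicates sketch on a cut-down version of the firewall data (fewer columns, for brevity). Note this keeps only the first row of each duplicate group rather than collecting the ports, so it answers a narrower question than the groupby approach in the other answer:

```python
import pandas as pd

df = pd.DataFrame(
    [["Firewall-1", "tcp", "4.4.4.4", "1.1.1.1", 1025],
     ["Firewall-1", "tcp", "4.4.4.4", "1.1.1.1", 1026],
     ["Firewall-1", "tcp", "3.3.3.3", "2.2.2.2", 2200]],
    columns=["dvc", "transport", "src_ip", "dest_ip", "dest_port"])

# Deduplicate on every column except dest_port; the first row of each group wins.
subset = [c for c in df.columns if c != "dest_port"]
deduped = df.drop_duplicates(subset=subset)
```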