-1

I've got an issue with my Pandas data analysis running very slowly and wondered if someone could help please.

I have 2 data frames I'm trying to compare if the IP address in one frame is part of the network in another frame:

  • VFFrame contains 10 columns, 4798 rows: 'CIDR' column contains the network
  • AssetFrame contains 3368 columns, 115816 rows: 'IPv4s' contains the IP addresses to match in 'CIDR' (as 'IPv4s' can have 1 to 3 possible IP addresses, I've had to split it out and clean them up)

I'm adding the additional 7 columns to the AssetFrame ready to take the new values once matched.

for index, row in AssetFrame.iterrows():
    IPstore = AssetFrame.loc[index, 'ipv4s'].split(',')
    
    for index, row in VFFrame.iterrows():
        net = ipaddress.IPv4Network(VFFrame.loc[index, 'CIDR'])
    
        for i in range(len(IPstore)):
            IPstore[i] = IPstore[i].strip()
            IP = ipaddress.IPv4Address(IPstore[i])
        
            if IP in net:
                row = [IP, net]       # Used to check list of matches to export as CSV,
                TheList.append(row)   # to check my assumption below is correct.

                # All IPs will be in the same network 'CIDR' or there will be no match
                # The columns have already been added to the AssetFrame ready to take the new values
                AssetFrame.loc[index, 'comment'] = VFFrame.loc[index, 'comment']
                AssetFrame.loc[index, 'country'] = VFFrame.loc[index, 'country']
                AssetFrame.loc[index, 'city'] = VFFrame.loc[index, 'city']
                AssetFrame.loc[index, 'site-name'] = VFFrame.loc[index, 'site-name']
                AssetFrame.loc[index, 'site-id'] = VFFrame.loc[index, 'site-id']
                AssetFrame.loc[index, 'vf-device'] = VFFrame.loc[index, 'vf-device']
                AssetFrame.loc[index, 'vlan'] = VFFrame.loc[index, 'vlan']



AssetFrame:
    id            ipv4s             fqdn
0   b564a4        192.168.20.4      too.many@cats121.com
1   e454a4        192.168.20.74     too.many@dogs231.com
2   a454a4        192.168.20.84     too.many@worms456.com

VFFrame:
    subnet          mask                CIDR                Comment         vlan
0   192.168.20.0    255.255.255.224     192.168.20.0/26     Blah Blah       101
1   192.168.20.64   255.255.255.240     192.168.20.64/28    Ditto Blah      201

Result Should be:
AssetFrame:
    id        ipv4s             fqdn                    Comment         vlan
0   b564a4    192.168.20.4      too.many@cats121.com    Blah Blah       101
1   e454a4    192.168.20.74     too.many@dogs231.com    Ditto Blah      201
2   a454a4    192.168.20.84     too.many@Worms456.com   No Match        No Match

Example of the data frames and wanted output:

GarryU
  • 1
  • 1
  • Does the second for-loop need to be nested? Because you are running that loop for every row in AssetFrame. – zmike Mar 15 '21 at 21:01
  • 3 nested loops... Gonna say that's it. Without sample data and a description of the manipulation it will be difficult to show you how to remove those loops (assuming it's possible) – ALollz Mar 15 '21 at 21:02
  • Try something like this (and de-indent the rest of the code): `IPstore = [] for index, row in AssetFrame.iterrows(): IPstore += AssetFrame.loc[index, 'ipv4s'].split(',') ` – zmike Mar 15 '21 at 21:02
  • How long is IPStore? Your outer two loops will run 555,000,000 times. If IPStore is larger than about 5, you're talking about billions and billions of loops. – Tim Roberts Mar 15 '21 at 21:02
  • And do you recognize you are overwriting "row" and "index" here? The `index` in your innermost loop is the VFFrame `index`, not the `AssetFrame` index. – Tim Roberts Mar 15 '21 at 21:05
  • @TimRoberts Good catch. That's probably another major culprit (besides the nested loops). – zmike Mar 15 '21 at 21:08
  • I think this can all be done with just one merge operation, but I would need you to post an example of the two dataframes first. – Boskosnitch Mar 15 '21 at 21:19
  • Ok cool, yes understand there are quite a few iterations with the nests, I did think about a way to merge the data where IPv4 is in network CIDR, but head got in to spin on how to do that where there are more than 1 IP address. I suppose one way to do that is chop the data frames down to bare minimum to then re-build a preferred output at the end. – GarryU Mar 15 '21 at 22:04
  • added the example DFs and desired output in to link pic. – GarryU Mar 15 '21 at 22:41
  • Could you include the sample data as text / python code so we can test with it easily ? – Malo Mar 16 '21 at 08:12

1 Answers1

0

"divide and conquer" is your friends. First of all we have O(n³) complexity and this is no a big deal. The first step in these cases is to understand where time is lost, i can guess the .loc is the bottleneck. But for a clear analisys i suggest to use a performing tools. I wrote perf_tool that can guide you to the solution:

Some hits

  • Internal loop do scalar assignment when you can do it in vectorial mode. The AssetFrame.loc is very expensive and called too many times.

  • Middle loop do again elaboration for each record, this can be done in vectorial mode using masks

  • Top Loop again iter on each row.

  • Probably the Top and Middle loop can be removed doing a single merge operation between the DataFrames.

Glauco
  • 1,385
  • 2
  • 10
  • 20
  • I did think on that one on using merge, but can't quite get my head around it. The merge criteria would be: is 'IPv4' in the network 'CIDR'. Which then as the IPv4 can contain up to 3 addresses I would need to think on concatenating the results, unless I just go with the 1st hit and ignore the rest of the IPv4. – GarryU Mar 16 '21 at 12:27
  • Hi Garry, you can use concat to repeat dataframe information, or Series.repeat in order to prepare Dataframe and able the merge operation – Glauco Mar 16 '21 at 13:04
  • Yes, there is no reason why not. That shouldn't take long either – GarryU Mar 16 '21 at 13:11