Python: Program running Slooooow

Question

I've got an issue with my Pandas data analysis running very slowly and wondered if someone could help please.

I have 2 data frames I'm trying to compare if the IP address in one frame is part of the network in another frame:

VFFrame contains 10 columns, 4798 rows: 'CIDR' column contains the network
AssetFrame contains 3368 columns, 115816 rows: 'IPv4s' contains the IP addresses to match in 'CIDR' (as 'IPv4s' can have 1 to 3 possible IP addresses, I've had to split it out and clean them up)

I'm adding the additional 7 columns to the AssetFrame ready to take the new values once matched.

for index, row in AssetFrame.iterrows():
    IPstore = AssetFrame.loc[index, 'ipv4s'].split(',')
    
    for index, row in VFFrame.iterrows():
        net = ipaddress.IPv4Network(VFFrame.loc[index, 'CIDR'])
    
        for i in range(len(IPstore)):
            IPstore[i] = IPstore[i].strip()
            IP = ipaddress.IPv4Address(IPstore[i])
        
            if IP in net:
                row = [IP, net]       # Used to check list of matches to export as CSV,
                TheList.append(row)   # to check my assumption below is correct.

                # All IPs will be in the same network 'CIDR' or there will be no match
                # The columns have already been added to the AssetFrame ready to take the new values
                AssetFrame.loc[index, 'comment'] = VFFrame.loc[index, 'comment']
                AssetFrame.loc[index, 'country'] = VFFrame.loc[index, 'country']
                AssetFrame.loc[index, 'city'] = VFFrame.loc[index, 'city']
                AssetFrame.loc[index, 'site-name'] = VFFrame.loc[index, 'site-name']
                AssetFrame.loc[index, 'site-id'] = VFFrame.loc[index, 'site-id']
                AssetFrame.loc[index, 'vf-device'] = VFFrame.loc[index, 'vf-device']
                AssetFrame.loc[index, 'vlan'] = VFFrame.loc[index, 'vlan']



AssetFrame:
    id            ipv4s             fqdn
0   b564a4        192.168.20.4      too.many@cats121.com
1   e454a4        192.168.20.74     too.many@dogs231.com
2   a454a4        192.168.20.84     too.many@worms456.com

VFFrame:
    subnet          mask                CIDR                Comment         vlan
0   192.168.20.0    255.255.255.224     192.168.20.0/26     Blah Blah       101
1   192.168.20.64   255.255.255.240     192.168.20.64/28    Ditto Blah      201

Result Should be:
AssetFrame:
    id        ipv4s             fqdn                    Comment         vlan
0   b564a4    192.168.20.4      too.many@cats121.com    Blah Blah       101
1   e454a4    192.168.20.74     too.many@dogs231.com    Ditto Blah      201
2   a454a4    192.168.20.84     too.many@Worms456.com   No Match        No Match

Example of the data frames and wanted output:

Does the second for-loop need to be nested? Because you are running that loop for every row in AssetFrame. — zmike, Mar 15 '21 at 21:01
3 nested loops... Gonna say that's it. Without sample data and a description of the manipulation it will be difficult to show you how to remove those loops (assuming it's possible) — ALollz, Mar 15 '21 at 21:02
Try something like this (and de-indent the rest of the code): `IPstore = [] for index, row in AssetFrame.iterrows(): IPstore += AssetFrame.loc[index, 'ipv4s'].split(',') ` — zmike, Mar 15 '21 at 21:02
How long is IPStore? Your outer two loops will run 555,000,000 times. If IPStore is larger than about 5, you're talking about billions and billions of loops. — Tim Roberts, Mar 15 '21 at 21:02
And do you recognize you are overwriting "row" and "index" here? The `index` in your innermost loop is the VFFrame `index`, not the `AssetFrame` index. — Tim Roberts, Mar 15 '21 at 21:05
@TimRoberts Good catch. That's probably another major culprit (besides the nested loops). — zmike, Mar 15 '21 at 21:08
I think this can all be done with just one merge operation, but I would need you to post an example of the two dataframes first. — Boskosnitch, Mar 15 '21 at 21:19
Ok cool, yes understand there are quite a few iterations with the nests, I did think about a way to merge the data where IPv4 is in network CIDR, but head got in to spin on how to do that where there are more than 1 IP address. I suppose one way to do that is chop the data frames down to bare minimum to then re-build a preferred output at the end. — GarryU, Mar 15 '21 at 22:04
Could you include the sample data as text / python code so we can test with it easily ? — Malo, Mar 16 '21 at 08:12

score 0 · Answer 1 · answered Mar 16 '21 at 10:07

0

"divide and conquer" is your friends. First of all we have O(n³) complexity and this is no a big deal. The first step in these cases is to understand where time is lost, i can guess the .loc is the bottleneck. But for a clear analisys i suggest to use a performing tools. I wrote perf_tool that can guide you to the solution:

Some hits

Internal loop do scalar assignment when you can do it in vectorial mode. The AssetFrame.loc is very expensive and called too many times.
Middle loop do again elaboration for each record, this can be done in vectorial mode using masks
Top Loop again iter on each row.
Probably the Top and Middle loop can be removed doing a single merge operation between the DataFrames.

answered Mar 16 '21 at 10:07

Glauco

1,385
2
10
20

I did think on that one on using merge, but can't quite get my head around it. The merge criteria would be: is 'IPv4' in the network 'CIDR'. Which then as the IPv4 can contain up to 3 addresses I would need to think on concatenating the results, unless I just go with the 1st hit and ignore the rest of the IPv4. – GarryU Mar 16 '21 at 12:27
Hi Garry, you can use concat to repeat dataframe information, or Series.repeat in order to prepare Dataframe and able the merge operation – Glauco Mar 16 '21 at 13:04
Yes, there is no reason why not. That shouldn't take long either – GarryU Mar 16 '21 at 13:11

Python: Program running Slooooow

1 Answers1