
I need to train an ML model using a large dataset. For this I'm using the dask library.

My dataset contains IP addresses (column indexes 0 and 2). I'm trying to convert these IP addresses into integers using Python's ipaddress library. A sample of the dataset is given below:

IP Add Src   Port   IP Add Dest.   Port
9.166.0.5    1305   149.17.12.8    21
9.166.0.5    1305   149.17.12.8    21
9.166.0.5    1305   149.17.12.8    21
9.166.0.5    1305   149.17.12.8    21
9.166.0.5    1305   149.17.12.8    21
9.166.0.5    1305   149.17.12.8    21

Initially, when using a pandas dataframe, I used the following to convert the IP addresses:

df['IP Add Src'] = df['IP Add Src'].apply(lambda x: int(ipaddress.IPv4Address(x)))

From what I've read, dask provides the apply, map_partitions and map functions.

However, I'm still unsure how to use these functions to convert the IP addresses in place.

Any help on how to implement this would be appreciated.

kMg
  • have you tried your proposed solution? does it not work for some reason? if not, can you provide the traceback? apply should work the same way in dask.dataframe - you may need to provide the `meta` argument to [`dask.dataframe.Series.apply`](https://docs.dask.org/en/stable/generated/dask.dataframe.Series.apply.html), which could be as simple as `meta=("IP Add Src", int)` – Michael Delgado Sep 07 '22 at 18:10

2 Answers


With Dask, you can use dask.dataframe.Series.apply together with the conversion method proposed by treuss:

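import pandas as pd
import dask.dataframe as dd
from functools import reduce

df = pd.DataFrame({'ip': ['9.166.0.1', '9.166.0.2', '9.166.0.3', '9.166.0.4', '9.166.0.5'],
                   'port': [80, 81, 82, 83, 84]})
ddf = dd.from_pandas(df, npartitions=2)

def strip_to_int(str_ip):
    # Fold the four octets into one 32-bit integer, most significant octet first.
    arr_ip = str_ip.split('.')
    if len(arr_ip) == 4:
        return reduce(lambda x, y: x << 8 | int(y), arr_ip, 0)
    return None  # not a dotted quad; note that None does not fit an integer column

# meta describes the output as (name, dtype): the result is an integer, not a string
series_int_ip = ddf.ip.apply(strip_to_int, meta=('ip', 'int64'))
# assign returns a new dataframe, so keep the result; compute() evaluates it
ddf = ddf.assign(ip=series_int_ip)
print(ddf.compute())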

result:

          ip  port
0  161873921    80
1  161873922    81
2  161873923    82
3  161873924    83
4  161873925    84
Massifox

The correct way to convert an IPv4 address to a 32-bit integer value is to "left shift" each octet by the appropriate number of bits, i.e. 24 for the first, 16 for the second, 8 for the third and 0 for the fourth octet, and build the sum (or bitwise OR) of the four resulting numbers:

from functools import reduce
ipAddress = '149.17.12.8'
ipAddressAsInt = reduce(lambda x,y: x+y, [int(b)<<(8*(3-a)) for a,b in enumerate(ipAddress.split('.'))])

The expression in the code builds the 32-bit value for each octet by using the left-shift operator << on the octet value b with a shift amount that is built from the position a of the octet, as returned by enumerate. reduce then uses a simple lambda function to add the values. Note that lambda x,y: x|y also works and would technically be more correct.
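
To make the per-octet shifts concrete, here is a small standard-library sketch that spells out the list comprehension for the same address. The variable names `position`, `octet` and `parts` are only for illustration:

```python
ip_address = '149.17.12.8'

parts = []
for position, octet in enumerate(ip_address.split('.')):
    shift = 8 * (3 - position)         # 24, 16, 8, 0
    parts.append(int(octet) << shift)  # this octet's contribution to the 32-bit value

total = sum(parts)
print(parts)   # [2499805184, 1114112, 3072, 8]
print(total)   # 2500922376
```

Because each shifted octet occupies its own 8 bits, adding the parts and OR-ing them give the same result.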

Of course you could also do the left-shift directly in the lambda function, but imho this makes it less readable:

reduce(lambda x,y: x|int(y[1])<<(8*(3-y[0])), enumerate(ipAddress.split('.')), 0)

Update: A more readable version is this:

reduce(lambda x,y: x<<8|int(y), ipAddress.split('.'), 0)

Start with value 0, then for each octet in the IP-Address, left shift the current value by 8 and add the current octet.
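
As a quick sanity check, the shift-and-OR fold can be compared against the standard library's ipaddress module, which the question used originally. This is only a sketch; the function name `ip_to_int` and the sample addresses are chosen for illustration:

```python
import ipaddress
from functools import reduce


def ip_to_int(ip):
    # Shift the running value left by 8 bits, then OR in the next octet.
    return reduce(lambda acc, octet: acc << 8 | int(octet), ip.split('.'), 0)


samples = ['9.166.0.5', '149.17.12.8', '10.24.53.128']
for ip in samples:
    # Both approaches should agree on every well-formed address.
    assert ip_to_int(ip) == int(ipaddress.IPv4Address(ip))
    print(ip, ip_to_int(ip))
```

One difference worth knowing: ipaddress.IPv4Address validates its input and raises an error for something like '300.1.1.1', whereas the reduce version silently computes a number for it.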

To apply this to a dataframe column, use map as opposed to apply (check here).

Small test/proof-of-concept:

>>> import pandas as pd
>>> from functools import reduce
>>> df = pd.DataFrame({'sourceHost': ['luke', 'leia', 'bb8'],
...                    'sourceIp': ['10.24.53.128', '10.24.125.44', '10.24.133.253'],
...                    'destHost': ['vader', 'palpatin', 'keylo'],
...                    'destIp': ['10.25.88.124', '10.25.230.12', '10.25.240.1']})
>>> df
  sourceHost       sourceIp  destHost        destIp
0       luke   10.24.53.128     vader  10.25.88.124
1       leia   10.24.125.44  palpatin  10.25.230.12
2        bb8  10.24.133.253     keylo   10.25.240.1
>>> df['sourceIp'] = df['sourceIp'].map(lambda ip: reduce(lambda x,y: x<<8|int(y), ip.split('.'), 0))
>>> df
  sourceHost   sourceIp  destHost        destIp
0       luke  169358720     vader  10.25.88.124
1       leia  169377068  palpatin  10.25.230.12
2        bb8  169379325     keylo   10.25.240.1
>>> df['destIp'] = df['destIp'].map(lambda ip: reduce(lambda x,y: x<<8|int(y), ip.split('.'), 0))
>>> df
  sourceHost   sourceIp  destHost     destIp
0       luke  169358720     vader  169433212
1       leia  169377068  palpatin  169469452
2        bb8  169379325     keylo  169472001
treuss
  • Thanks for info. However the issue I'm facing is how to actually convert these IP addresses within the dask dataframe. – kMg Sep 07 '22 at 16:48
  • I don't know dask, but in pandas you would typically do `df['IP Add Src'] = df['IP Add Src'].map(lambda ip: reduce(lambda x,y: x<<8|int(y), ip.split('.'), 0))` – treuss Sep 07 '22 at 17:25