I have the following code for getting IP information:

import requests
import json
import pandas as pd
import swifter  

def get_ip(ip):
    response = requests.get ("http://ip-api.com/json/" + ip.rstrip())
    geo = response.json()
    location = {'lat': geo.get('lat', ''),
                'lon': geo.get('lon', ''),
                'region': geo.get('regionName', ''),
                'city': geo.get('city', ''),
                'org': geo.get('org', ''),
                'country': geo.get('countryCode', ''),
                'query': geo.get('query', '')
                }
    return(location)

To apply it to an entire dataframe of IPs (df), I am using the following:

df=pd.DataFrame(['85.56.19.4','188.85.165.103','81.61.223.131'])

for lab,row in df.iterrows():
    dip = get_ip(df.iloc[lab][0])
    try:
        ip.append(dip["query"])
        private.append('no')
        country.append(dip["country"])
        city.append(dip["city"])
        region.append(dip["region"])
        organization.append(dip["org"])
        latitude.append(dip["lat"])
        longitude.append(dip["lon"])
    except:
        ip.append(df.iloc[lab][0])
        private.append("yes")

However, since iterrows is very slow and I need better performance, I want to use swiftapply, which is an extension of the apply function. I have tried this:

def ip(x):
    dip = get_ip(x)
    if (dip['ip']=='private')==True:
        ip.append(x)
        private.append("yes")
    else:
        ip.append(dip["ip"])
        private.append('no')
        country.append(dip["country"])
        city.append(dip["city"])
        region.append(dip["region"])
        organization.append(dip["org"])
        latitude.append(dip["lat"])
        longitude.append(dip["lon"])

df.swifter.apply(ip)

And I get the following error: AttributeError: ("'Series' object has no attribute 'rstrip'", 'occurred at index 0')

How could I fix it?

Javier Lopez Tomas
  • `rstrip()` is a function that only works on strings. It seems you're using it with a non-string object, but I'm not sure where (`ip.rstrip()` is the only occurrence of `rstrip()`, and `ip` is likely a string) – rdimaio Sep 26 '18 at 14:12

1 Answer

rstrip is a string operation. To apply a string operation to a pandas Series, you first have to go through the Series' .str accessor, which allows vectorized string operations to be performed on every element of the Series.

Specifically, in your code changing ip.rstrip() to ip.str.rstrip() should resolve your AttributeError.
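As a minimal sketch of why the error appears, assuming swifter's apply follows the same calling convention as plain pandas apply here: calling apply on a DataFrame hands each column to the function as a Series, while calling apply on a single column hands over one element at a time. The sample strings with trailing whitespace are made up for illustration, and no network request is made.

```python
import pandas as pd

# Illustrative data; the trailing whitespace is an assumption added for the demo.
df = pd.DataFrame(['85.56.19.4 ', '188.85.165.103', '81.61.223.131\n'])

# DataFrame.apply (axis=0 by default) passes each *column* to the function as a Series,
# which is why a plain str method such as rstrip() is not available on the argument.
seen_by_frame_apply = df.apply(lambda col: type(col).__name__)

# Series.apply passes each *element* to the function, here a plain str.
seen_by_series_apply = df[0].apply(lambda value: type(value).__name__)

# The .str accessor strips every element at once and returns a new Series.
stripped = df[0].str.rstrip()

print(seen_by_frame_apply.tolist())
print(seen_by_series_apply.tolist())
print(stripped.tolist())
```

Note that ip.str.rstrip() yields a whole Series rather than a single string, so anything built from it (such as a URL) is also a Series.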

After digging around a little, it turns out that the requests.get operation you're trying to perform cannot be vectorized within pandas (see Using Python Requests for several URLS in a dataframe). I hacked up the following, which should be a little more efficient than using iterrows. It uses np.vectorize to run the function that fetches the information for each IP address. The location output is saved as new columns in a new DataFrame.

First, I altered your get_ip function to end with return location rather than return(location).

Next, I created a vectorization function using np.vectorize:

import numpy as np
vec_func = np.vectorize(lambda url: get_ip(url))
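One detail worth checking before relying on np.vectorize for API calls: it still calls the Python function once per element, and when otypes is not given it may make an extra call on the first element to infer the output type. The sketch below counts calls using a hypothetical stand-in for get_ip (make_counter and lookup are illustrative names, and no request is sent).

```python
import numpy as np

def make_counter():
    """Return a hypothetical lookup function plus the list that records its calls."""
    calls = []

    def lookup(ip):
        calls.append(ip)       # record every invocation
        return {'query': ip}   # stand-in for the real API response

    return lookup, calls

ips = np.array(['85.56.19.4', '188.85.165.103', '81.61.223.131'])

# Without otypes, NumPy may probe the function once to work out the output type.
lookup_default, calls_default = make_counter()
np.vectorize(lookup_default)(ips)

# With otypes given explicitly, the function runs exactly once per element.
lookup_typed, calls_typed = make_counter()
typed_result = np.vectorize(lookup_typed, otypes=[object])(ips)

print(len(calls_default), len(calls_typed))
```

If the extra probe call matters (for example because each call is a paid or rate-limited request), passing otypes=[object] is one way to avoid it.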

Finally, vec_func is applied to df to create a new DataFrame that merges df with the location output from vec_func, where df[0] is the column with your IP addresses:

new_df = pd.concat([df, pd.DataFrame(vec_func(df[0]), columns=["response"])["response"].apply(pd.Series)], axis=1)

The code above retrieves the API response in the form of a dictionary for each row in your DataFrame then maps the dictionary to columns in the DataFrame. In the end your new DataFrame would look like this:

                0      lat     lon     region      city             org country           query
0      85.56.19.4  37.3824 -5.9761  Andalusia   Seville   Orange Espana      ES      85.56.19.4
1  188.85.165.103  41.6561 -0.8773     Aragon  Zaragoza  Vodafone Spain      ES  188.85.165.103
2   81.61.223.131  40.3272 -3.7635     Madrid   Leganés    Vodafone Ono      ES   81.61.223.131

Hopefully this resolves the InvalidSchema error and gets you a little better performance than iterrows().
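Since each lookup spends most of its time waiting on the network, one option that may help more than any apply variant is to run several requests concurrently with a thread pool from the standard library. This is only a sketch under assumptions: slow_lookup is a hypothetical stand-in for get_ip that sleeps instead of calling the API, and max_workers=3 is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_lookup(ip):
    """Hypothetical stand-in for get_ip: sleeps instead of calling the API."""
    time.sleep(0.05)  # simulated network latency
    return {'query': ip.rstrip(), 'country': 'ES'}

ips = ['85.56.19.4', '188.85.165.103', '81.61.223.131']

# executor.map keeps the input order, so results line up with ips.
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(slow_lookup, ips))

print([r['query'] for r in results])
```

In real use, slow_lookup would be replaced by get_ip and the list of dictionaries could be passed to pd.DataFrame. The number of workers should respect whatever rate limits the API enforces.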

vielkind
  • After changing it, I get another error: `InvalidSchema: ("No connection adapters were found for 'ip_address http://ipapi..., dtype: object'", 'occurred at index 0')`. I have followed the solutions in this [link](https://stackoverflow.com/questions/15115328/python-requests-no-connection-adapters) and I am still getting the error. – Javier Lopez Tomas Sep 26 '18 at 14:22
  • Can you provide the full URL that is causing this new error? – vielkind Sep 26 '18 at 15:08
  • Sure. (_"No connection adapters were found for 'ip_address h ttp://ip-api.com/json/85.56.19.4\nName: 0, dtype: object'", 'occurred at index 0'_) PS: I put a space between h and t for not getting a link – Javier Lopez Tomas Sep 26 '18 at 15:58
  • @JavierLópezTomás- I've updated the answer with a workaround to the `InvalidSchema` error. Hopefully this helps. I learned a lot looking into this one! – vielkind Sep 28 '18 at 03:32
  • Thank you for your answer. I have measured the time both on my laptop and on a server for a dataframe of 100 IP addresses. On my laptop, your proposed method is 1 sec slower than iterrows (1min1sec vs 1 min), and on the server it is 0.4s slower (49.7s vs 49.3s) – Javier Lopez Tomas Sep 28 '18 at 08:34
  • I have checked and just the get_ip function takes also 1 minute, so maybe your proposed method is better but still not noticeable cause the main problem is the get_ip. – Javier Lopez Tomas Sep 28 '18 at 09:35
  • @JavierLópezTomás- Yes, and unfortunately as I noted `requests.get()` is not an operation you can vectorize. Each URL has to be submitted individually, which means there will be some latency in the API request and response that will compound with each individual API request. – vielkind Sep 28 '18 at 12:09