
I have a bunch of URLs that I am trying to write to files. I store the URLs in a pandas dataframe.

The dataframe has two columns: `url` and `id`. I am trying to request each URL in `url` and write the response to a file named after `id`.

Here is what I have so far:

import os
import requests

def get_link(url):
    file_name = os.path.join('/mypath/foo/bar', df.id)
    try:
        r = requests.get(url)
    except Exception as e:
        print("Failed to get " + url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)

df.url.apply(lambda l: get_link(l))

But when I call the function, it obviously fails, since `os.path.join` expects a string and not a Series. Hence I get the error `join() argument must be str or bytes, not 'Series'`.

Any ideas how I can simultaneously access `df.id` and `df.url`?

Thank you/R

Rachel
  • As an aside: do not use a catch-all `Exception`. Keep it specific; a catch-all inadvertently swallows other issues you may not want. – cs95 Aug 15 '17 at 07:49

2 Answers


I think you need `apply` with `axis=1` to process by rows. Each row is then passed to the function as a Series indexed by the column names, so you can get the values with `x.url` and `x.id`:

import os
import requests

def get_link(x):
    # x.id may not be a string, so convert it before joining the path
    file_name = os.path.join('/mypath/foo/bar', str(x.id))
    try:
        r = requests.get(x.url)
    except Exception as e:
        print("Failed to get " + x.url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)

df.apply(get_link, axis=1)

Sample:

df = pd.DataFrame({'url':['url1','url2'],
                   'id':[1,2]})

print (df)
   id   url
0   1  url1
1   2  url2

def get_link(x):
    print (x) 
    print ('url is: {}'.format(x.url))
    print ('id is: {}'.format(x.id))

df.apply(get_link, axis=1)

id        1
url    url1
Name: 0, dtype: object
url is: url1
id is: 1
id        2
url    url2
Name: 1, dtype: object
url is: url2
id is: 2
jezrael
  • Thank you! This is very illustrative and provides me with some new knowledge! – Rachel Aug 15 '17 at 08:02
  • Hmmm, I am curious, why do you think `iterrows` is better? I think the best is to [avoid](https://stackoverflow.com/a/24871316/2901002) it. – jezrael Aug 15 '17 at 08:04
  • `iterrows` is not recommended for operations that can be vectorized. I don't believe URL requests fall into this category, however. – Alexander Aug 15 '17 at 08:07
  • Sorry if I disappointed you! It is just that I understand `iterrows` - I used it before. I am new to the solution you provided and don't quite understand it yet. It still works well! Why would you try not to use `iterrows`? Hope that is OK! Thank you in any case! – Rachel Aug 15 '17 at 08:08
  • If you simply loop over the values, both solutions work. But generally it is best to avoid `iterrows`, because it is the slowest. – jezrael Aug 15 '17 at 08:10
  • Understood! Thank you! – Rachel Aug 15 '17 at 08:14

You can enhance your function to take an `id_` parameter in addition to `url`.

import os
import requests
from requests.exceptions import ConnectionError, MissingSchema

def get_link(url, id_):
    # id_ may not be a string, so convert it before joining the path
    file_name = os.path.join('/mypath/foo/bar', str(id_))
    try:
        r = requests.get(url)
    except (ConnectionError, MissingSchema) as e:
        print("Failed to get " + url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)

Then just iterate through your dataframe to call your function.

for idx, row in df.iterrows():
    get_link(url=row.url, id_=row.id)
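As the comment thread notes, `iterrows` is the slowest way to loop over rows. If all you need is the two columns, a plain `zip` over them avoids building a Series for every row. A minimal sketch with dummy data (no real requests are made; `pairs` just stands in for the `get_link` calls):

```python
import pandas as pd

df = pd.DataFrame({'url': ['url1', 'url2'],
                   'id': [1, 2]})

# zip the two columns directly instead of iterating over rows
pairs = list(zip(df['url'], df['id']))
```

In a real loop you would call `get_link(url=url, id_=id_)` for each pair instead of collecting them.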
Alexander