
I have a bunch of URLs that I am trying to write to files. I store the URLs in a pandas dataframe.

The dataframe has two columns: `url` and `id`. I am trying to request each URL in `url` and write the response to a file named after `id`.

Here is what I have so far:

import os
import requests

def get_link(url):
    file_name = os.path.join('/mypath/foo/bar', df.id)
    try:
        r = requests.get(url)
    except Exception as e:
        print("Failed to get " + url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)

df.url.apply(lambda l: get_link(l))

But when I call the function, it obviously fails, since `os.path.join` expects a string and not a Series. Hence I get the error `join() argument must be str or bytes, not 'Series'`.

Any ideas how I can simultaneously access `df.id` and `df.url`?

Thank you/R

Rachel
  • As an aside: do not use a catch-all `Exception`. Keep it specific; a catch-all inadvertently swallows other issues you may not want. – cs95 Aug 15 '17 at 07:49

2 Answers


I think you need `apply` with `axis=1` to process by rows. Each row is then passed to the function as a Series indexed by the column names, so you can get the values with `x.url` and `x.id`:

import os
import requests

def get_link(x):
    # x.id may not be a string, so convert it before joining the path
    file_name = os.path.join('/mypath/foo/bar', str(x.id))
    try:
        r = requests.get(x.url)
    except Exception as e:
        print("Failed to get " + x.url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)

df.apply(get_link, axis=1)

Sample:

df = pd.DataFrame({'url':['url1','url2'],
                   'id':[1,2]})

print (df)
   id   url
0   1  url1
1   2  url2

def get_link(x):
    print (x) 
    print ('url is: {}'.format(x.url))
    print ('id is: {}'.format(x.id))

df.apply(get_link, axis=1)

id        1
url    url1
Name: 0, dtype: object
url is: url1
id is: 1
id        2
url    url2
Name: 1, dtype: object
url is: url2
id is: 2
jezrael
  • Thank you! This is very illustrative and provides me with some new knowledge! – Rachel Aug 15 '17 at 08:02
  • Hmmm, I am curious, why do you think `iterrows` is better? I think the best is to [avoid](https://stackoverflow.com/a/24871316/2901002) it. – jezrael Aug 15 '17 at 08:04
  • `iterrows` is not recommended for operations that can be vectorized. I don't believe URL requests fall into this category, however. – Alexander Aug 15 '17 at 08:07
  • Sorry if I disappointed you! It is just that I understand `iterrows` - I used it before. I am new to the solution you provided and don't quite understand it yet. It still works well! Why would you try not to use `iterrows`? Hope that is OK! Thank you in any case! – Rachel Aug 15 '17 at 08:08
  • If you simply loop over the values, both solutions work. But generally it is best to avoid `iterrows`, because it is the slowest. – jezrael Aug 15 '17 at 08:10
  • Understood! Thank you! – Rachel Aug 15 '17 at 08:14

You can enhance your function to take an `id_` parameter in addition to `url`.

import os
import requests
from requests.exceptions import ConnectionError, MissingSchema

def get_link(url, id_):
    # id_ may not be a string, so convert it before joining the path
    file_name = os.path.join('/mypath/foo/bar', str(id_))
    try:
        r = requests.get(url)
    except (ConnectionError, MissingSchema) as e:
        print("Failed to get " + url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)

Then just iterate through your dataframe to call your function.

for idx, row in df.iterrows():
    get_link(url=row.url, id_=row.id)
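As the comment thread notes, `iterrows` is the slowest way to loop over rows. If all you need is the two columns, a plain `zip` over them avoids building a Series for every row. A minimal sketch with dummy data (no real requests are made; `pairs` just stands in for the `get_link` calls):

```python
import pandas as pd

df = pd.DataFrame({'url': ['url1', 'url2'],
                   'id': [1, 2]})

# zip the two columns directly instead of iterating over rows
pairs = list(zip(df['url'], df['id']))
```

In a real loop you would call `get_link(url=url, id_=id_)` for each pair instead of collecting them.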
Alexander