I am trying to grab geo locations for Twitter usernames listed in a CSV by scraping each user's Twitter profile page. The input file has more than 100K rows and a number of columns.
I am using Python 3.x (Anaconda, fully updated) and I am getting the following error:
Traceback (most recent call last):
File "__main__.py", line 21, in <module>
location = get_location(userid)
File "C:path\twitter_location.py", line 22, in get_location
location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()
IndexError: list index out of range
Here is the code:
#!/usr/bin/env python
import urllib.request
import urllib.error
from bs4 import BeautifulSoup


def get_location(userid):
    '''
    Get location as a string ('Paris', 'New York', ...) by scraping the
    Twitter profile page. Returns None if the location cannot be scraped.
    '''
    page_url = 'http://twitter.com/{0}'.format(userid)
    try:
        page = urllib.request.urlopen(page_url)
    except urllib.error.HTTPError:
        print('ERROR: user {} not found'.format(userid))
        return None
    content = page.read()
    html = BeautifulSoup(content, 'html.parser')
    location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()
    if location == '':
        return None
    return location
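The IndexError happens because `html.select(...)` returns an empty list whenever the page has no element with that class (for example a suspended or protected profile, or a page that isn't the expected profile markup), and `[0]` then fails. A minimal sketch of a guarded version of the parsing step, separated out so it can be tested on plain HTML strings (the helper name `extract_location` is mine, not from the original code):

```python
from bs4 import BeautifulSoup


def extract_location(html_text):
    """Return the profile location from raw HTML, or None if the
    location element is missing or empty."""
    html = BeautifulSoup(html_text, 'html.parser')
    matches = html.select('.ProfileHeaderCard-locationText')
    if not matches:          # empty list -> avoid IndexError entirely
        return None
    location = matches[0].text.strip()
    return location or None  # treat an empty string as "no location"


# Stripped-down, hypothetical snippets of profile HTML:
print(extract_location('<span class="ProfileHeaderCard-locationText"> Paris </span>'))  # Paris
print(extract_location('<div>no location element here</div>'))  # None
```

With this guard, a missing element yields None for that row instead of crashing the whole 100K-row run.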
I am looking for a quick fix so that I can process the whole input file of more than 100K rows.
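For driving the lookup over the whole CSV, a minimal sketch using only the stdlib `csv` module. It assumes the input has a header row and a username column (the column name `userid` and the function name `add_locations` are my assumptions, not from the original code); `lookup` is any function mapping a username to a location string or None, such as `get_location` above:

```python
import csv


def add_locations(in_path, out_path, lookup, id_column='userid'):
    """Copy the input CSV to out_path with an extra 'location' column,
    filled by calling lookup(username) for each row.

    id_column: name of the CSV column holding the Twitter username
               (assumed to be 'userid' here).
    """
    with open(in_path, newline='', encoding='utf-8') as src, \
         open(out_path, 'w', newline='', encoding='utf-8') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ['location'])
        writer.writeheader()
        for row in reader:
            # None (user not found / no location) becomes an empty cell
            row['location'] = lookup(row[id_column]) or ''
            writer.writerow(row)
```

Because each row is handled independently, one failed lookup only leaves one empty cell rather than stopping the run.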
Edit I:
As mentioned in the answer below, after including the try block the output stopped grabbing geo locations. Before the inclusion of the try block, a "list index out of range" error appeared after a certain count. After including the try block the error is gone, and so are the coordinates: I am getting all None values.
Here is the Dropbox link with the input, the before & after outputs, and the entire code bundle.
Edit II:
The entire code and inputs are in the Dropbox. I am looking for help eliminating the API approach entirely and finding an alternative way to pull geo locations for Twitter usernames.
I appreciate any help in fixing the problem. Thanks in advance.