
I am trying to grab geo locations for Twitter users taken from tweet URLs in a csv, by scraping each user's Twitter profile page. The input file has more than 100K rows with a bunch of columns.

I am using Python 3.x (Anaconda, all packages up to date) and I am getting the following error:

Traceback (most recent call last):
  File "__main__.py", line 21, in <module>
    location = get_location(userid)
  File "C:path\twitter_location.py", line 22, in get_location
    location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()
IndexError: list index out of range

The code is below:

#!/usr/bin/env python
import urllib.request
import urllib3
from bs4 import BeautifulSoup

def get_location(userid):
    '''
    Get the location as a string ('Paris', 'New York', ..) by scraping the Twitter profile page.
    Returns None if the location cannot be scraped.
    '''

    page_url = 'http://twitter.com/{0}'.format(userid)

    try:
        page = urllib.request.urlopen(page_url)
    except urllib.request.HTTPError:
        print ('ERROR: user {} not found'.format(userid))
        return None

    content = page.read()
    html = BeautifulSoup(content)
    location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()

    if location.strip() == '':
        return None
    return location.strip()

I am looking for a quick fix so that I can run the whole input file with more than 100k rows.
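
For reference, the calling loop in `__main__.py` is roughly as below (a simplified sketch; the file name and the `userid` column name are illustrative, the real csv has more columns):

import csv

from twitter_location import get_location

with open('input.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        userid = row['userid']           # illustrative column name
        location = get_location(userid)
        print(userid, location)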

Edit: I

As mentioned in the answer below, after including the try block the output has stopped grabbing geo locations.

Before the inclusion of the try block, I got the 'list index out of range' error after a certain count.

After including the try block the error is gone, and so are the coordinates. I am getting all None values.

Here is the DropBox link with the input, the before & after outputs, and the entire code bundle.

Edit: II

The entire code and inputs are in the Dropbox. I am looking for help to eliminate the API approach entirely and find an alternative way to pull geo locations for Twitter usernames.

Appreciate the help in fixing the problem. Thanks in advance.

Sitz Blogz
  • My guess is that some of the contents do not have '.ProfileHea...' in them and therefore html.select gives an empty list with no index 0 – Ezer K Apr 20 '17 at 23:29
  • @EzerK Thank you for the suggestion; in that case how can I skip such a row and move forward to the next? I am trying with `('.ProfileHeaderCard-locationText')[-1]` and executing. Let me see how that works in this case. – Sitz Blogz Apr 20 '17 at 23:35

2 Answers


Well, you have exception handling for HTTPError, but there is no handling for the case where .ProfileHeaderCard-locationText is missing. That's probably the issue. For reporting such cases you can use the logging module:

import logging
logging.warning('Watch out!')  # will print a message to the console
logging.info('I told you so')  # will not print anything (default level is WARNING)
# logging.exception('message') logs the message plus the current traceback; call it inside an except block

You can use this in all of your programs (and you should!). Just like you added a try/except block for the request:

try:
    page = urllib.request.urlopen(page_url)
except urllib.request.HTTPError:
    print('ERROR: user {} not found'.format(userid))
    return None

you can do the same for the location lookup:

try:
    location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()
except Exception:
    print("Error, hey dude, couldn't find Profile...")
    return None  # skip this user and move on to the next row

The main problem might be that Google limits the usage of their API the way you're using it. A much more convenient way is to use the Google-Maps-Python-API package (see its documentation for details). Usage example:

from geolocation.google_maps import GoogleMaps

address = "New York City Wall Street 12"

google_maps = GoogleMaps(api_key='your_google_maps_key') 

location = google_maps.search(location=address) # sends search to Google Maps.

print(location.all()) # returns all locations.

my_location = location.first() # returns only first location.

print(my_location.city)
print(my_location.route)
print(my_location.street_number)
print(my_location.postal_code)

EDIT:

    if location.strip() == '':
        return None
    return location.strip()

I think you meant:

if location.strip()==None:
    return None
else:
    return location.strip()
innicoder
  • Thank you so much for the suggestion. But in which part of this code do I use `logging`? And there needs to be a way to skip that particular row from the input and move forward with parsing the next row, isn't there? – Sitz Blogz Apr 20 '17 at 23:53
  • I've added code; you can use a try/except block anywhere in your code where you expect an error (for any reason). Just using `except Exception` is a pretty sloppy way of doing things, but I do it and it gets the job done. For this instance you can use `except IndexError: print("There's no .profileHeaderCard, moving on...")` – innicoder Apr 21 '17 at 00:04
  • Thank you so much .. Let me try the addition and execute and will inform you back . – Sitz Blogz Apr 21 '17 at 00:12
  • Of course, make sure you give feedback and explain your problem / edit your question, add code to make it easier for you and for us as well. Tip: If you think you don't know what you're looking for exactly ask for videos / docs explaining the subject. – innicoder Apr 21 '17 at 00:14
  • I included the `try:` block but that is giving me indentation error `try: ^ TabError: inconsistent use of tabs and spaces in indentation` – Sitz Blogz Apr 21 '17 at 03:39
  • Got it.. By mistake gave a tab instead of using space – Sitz Blogz Apr 21 '17 at 04:25
  • I did include the `try` block in the code as mentioned, and that did solve my error, but it is not returning the geo coordinates. I have put an `edit` section in the question with a link to `dropbox` where the entire code, inputs and outputs are available. Please can you have a look into it and help me fix the problem? – Sitz Blogz Apr 21 '17 at 18:51
  • I'll add code. I can't really debug your whole program now, but the main issue is that you're not using the Google API properly (look at the EDIT part for a more in-depth explanation). – innicoder Apr 21 '17 at 20:31
  • I do not want to use Google or any API because they have a limit of a maximum of 2000 search queries, and I have more than 500k usernames in about 100k csvs. The whole point is to eliminate APIs of any sort. – Sitz Blogz Apr 22 '17 at 02:29
  • That seems like a good idea but there's a problem: Google probably has limitations on their API usage (the way you're using it). I can't really tell you anything in particular unless you try, print the code's output, and see what's going on; you'll have to do some debugging and research, but let me know if you have a specific problem that we may have an answer to (you can also put a bounty on this post if you want it resolved quickly). I've also added a 3rd section in my code, take a look at something I've noticed. – innicoder Apr 22 '17 at 12:12
  • Can I request you to please take a little time to review the code and help me with a better working version of it? It's not that much code. It's a humble request, please. – Sitz Blogz Apr 22 '17 at 12:14
  • Yes I am going to put a bounty but will have to wait another 12 hrs for that. – Sitz Blogz Apr 22 '17 at 12:15
  • The edit isn't working, my friend. I need some way to eliminate the APIs completely and an alternative solution. – Sitz Blogz Apr 23 '17 at 07:15
  • I'm sorry, but you're talking about scraping Google (they invented captcha); they certainly know how to deal with things like this, and no one is going to reveal their secrets for getting past their bot protection. That's why they created the APIs. I can't help you further. – innicoder Apr 23 '17 at 13:07

While handling the 'index out of range' exception as suggested in Faulty Fuse's answer is important, this will only fix the symptom.

The root of the problem is that after a certain number of requests, Twitter blocks your IP(s) and stops sending any usable content (they don't like such mass queries).

Potential solutions:

  1. Go slower. This will delay being blocked by Twitter; ideally you go so slowly that you don't get blocked at all. While this might not be possible for 100k records, it would be an easy fix if you have the time to wait for results (see the sketch after this list).

  2. Use rotating proxies. Use many of them. ... or combine a handful of proxies with going somewhat slower.
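
As a rough sketch combining the two ideas (not tested against your data; the file name, the `userid` column and the proxy addresses are placeholders you would replace with your own):

import csv
import itertools
import time
import urllib.request

from twitter_location import get_location  # the function from the question

WAIT_SECONDS = 2                                  # experiment with this value
PROXIES = ['1.2.3.4:8080', '5.6.7.8:3128']        # placeholder proxy addresses
proxy_cycle = itertools.cycle(PROXIES)

with open('input.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # route the next request through the next proxy in the rotation
        proxy = 'http://' + next(proxy_cycle)
        handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
        urllib.request.install_opener(urllib.request.build_opener(handler))

        location = get_location(row['userid'])    # illustrative column name
        print(row['userid'], location)
        time.sleep(WAIT_SECONDS)                  # throttle between requests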

Done Data Solutions
  • Thank you so much. Can you please help me with writing the delay block in the code? It would be a huge help for me. Maybe something like 100 queries at a time, then halt and delay, then some more queries, until the end. Please try to help with code. – Sitz Blogz Apr 26 '17 at 14:07
  • Sure, if you want to go the slow route, I would put a single `time.sleep(wait_seconds)` into your loop that goes through all the userids and then experiment with different values for wait_seconds, starting with `wait_seconds=2` and adjusting up and down according to your test results. – Done Data Solutions Apr 27 '17 at 06:23
  • Please can you help with the code edit? I have put all the input, output and code in the question with a Dropbox link. It would be a huge help for me. – Sitz Blogz Apr 27 '17 at 06:25