
I was trying to define a new function and I found that it doesn't work as intended.

This is what I wrote:

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

def fbref(stats):
    base_url = "https://fbref.com/en/comps/12/"
    iterate = stats
    end_url = "/La-Liga-Stats"
    response = requests.get(base_url + iterate + end_url)

    soup = BeautifulSoup(response.text, 'html.parser')

    comments = soup.find_all(string=lambda text: isinstance(text, Comment))

    df = []
    for each in comments:
        if 'table' in each:
            try:
                df.append(pd.read_html(each, header=1)[0])
            except:
                continue

    df = df[0]
    df = df[df.Player != "Player"]
    df = df.fillna(0)
    df.iloc[:, 5:-1] = df.iloc[:, 5:-1].apply(pd.to_numeric, axis = 1)
    return df

It works fine, but when the function is called two or more times in succession it says "list index out of range." For example, if I run

gca = fbref("gca")
defense = fbref("defense")
possession = fbref("possession")
passing = fbref("passing")
stats = fbref("stats")
shooting = fbref("shooting")
misc = fbref("misc")

I get "gca", "defense" and sometimes also "possession", but after that it gives me the error. I tried several combinations with the same behaviour, so it's not about the order.

Does anyone have a clue of what may be happening? Thank you for reading this.

I use Spyder and Python 3.9.

Addoc
  • Can you explain in more detail what you mean by "if I do it whithout my fbref function and copy-pasting the url manually, it gives me no error." - can you share an example of the working code with no error for comparison? – Grismar Aug 01 '22 at 22:42
    If `comments` is empty or you don't find any tables in it, `df` will be empty, so `df[0]` will be out of range. – Barmar Aug 01 '22 at 22:42
  • I suspect when you do this by hand you're not clearing the `df` list between each URL, so it doesn't become empty when nothing is found the second time. – Barmar Aug 01 '22 at 22:44
  • @Grismar Yes. I basically delete "def fbref(stats):" and "return df" and I change "iterate = stats" to iterate = misc (for example) and it works just fine. – Addoc Aug 01 '22 at 22:46
  • @Barmar It could have been, but now I tried removing all variables before and it works fine. Besides, I checked the data inside and it's the right one. – Addoc Aug 01 '22 at 22:48
  • That's the only difference I can see between running the code at top-level and running it in a function. Add `print(df)` before `df = df[0]` in both cases. – Barmar Aug 01 '22 at 22:52
  • @Barmar Yes, I don't see the difference either. Now I found out that it's not about the values I enter, but about the order. The first two (sometimes three) commands work fine, but then I get the error no matter which value I insert. For example, if I write gca = fbref("gca"), defense = fbref("defense"), possession = fbref("possession"), misc = fbref("misc"), passing = fbref("passing"), stats = fbref("stats"), shooting = fbref("shooting"), I get gca, defense and possession before the error appears again. – Addoc Aug 01 '22 at 23:08
  • Please print the last `comments` value when you get the error. 99% chance it's a rate limiter on the API. – Bharel Aug 01 '22 at 23:35
  • @Bharel You might be right. In the last comment before the error I get this ['[if lt IE 7]> <![endif]', '[if IE 7]> <![endif]', '[if IE 8]> <![endif]', '[if gt IE 8]><!', '<![endif]', '[if lt IE 9]><![endif]', '[if gte IE 10]><!', '<![endif]', ' /.error-footer ']. Anything I can do? – Addoc Aug 01 '22 at 23:41
  • [A bare `except` is bad practice](/q/54948548/4518341). Instead, use the specific exception you're expecting like `except ValueError`. Or at least do `except Exception as e` and `print(e)` so you can actually see what's happening. – wjandrea Aug 01 '22 at 23:51
  • @Bharel Yes, you were right. I added time.sleep(3) and problem solved. Thank you. – Addoc Aug 02 '22 at 00:11
  • @wjandrea Ok. Thank you for the advice, I'll try to do it that way from now on. – Addoc Aug 02 '22 at 00:11

1 Answer


As I wrote in the comments, your issue is due to a rate limiting mechanism.

Since you're scraping a website rather than using an official API, the rate limit is probably not documented. If it is, read the target website's documentation and check what request rate it allows.

Sometimes websites return the rate limit, or other useful information, in the response headers, so check the headers and the status code.
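For example, a minimal sketch of checking the status code and backing off; this assumes the site answers with a standard 429 status and an optional Retry-After header, which fbref.com may or may not actually send:

```python
import time
import requests

def fetch_with_backoff(url, max_retries=3, default_wait=3):
    """Fetch a URL, waiting and retrying when the server signals rate limiting."""
    response = requests.get(url)
    for _ in range(max_retries):
        if response.status_code != 429:  # 429 = Too Many Requests
            break
        # Honour Retry-After if the server sends it, otherwise wait a few seconds
        wait = int(response.headers.get("Retry-After", default_wait))
        time.sleep(wait)
        response = requests.get(url)
    return response
```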

Sleep is not always the best solution. Once you know the rate, I suggest using a rate limiting library, maybe together with asyncio or threading.
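If you'd rather not pull in a library, the simplest fix (what the asker ended up doing) is pausing between calls. A sketch, where the 3-second pause is a guess rather than a documented limit:

```python
import time

def fetch_all(fetcher, stat_pages, pause=3.0):
    """Call fetcher() once per stats page, sleeping between requests
    so the site's rate limiter is not triggered."""
    results = {}
    for page in stat_pages:
        results[page] = fetcher(page)  # e.g. the fbref() function from the question
        time.sleep(pause)
    return results
```

Usage would then be something like `results = fetch_all(fbref, ["gca", "defense", "possession"])`.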

Oh, and print the exception in that `try`/`except` - it can only help. Good luck!
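A minimal sketch of the pattern: catch the specific exception (`pd.read_html` raises `ValueError` when it finds no table) and log it instead of silently swallowing everything. The `safe_parse` helper name is made up for illustration:

```python
def safe_parse(parse, raw):
    """Run a parser, logging failures instead of silently swallowing them."""
    try:
        return parse(raw)
    except ValueError as e:  # the error pd.read_html raises when no table is found
        print(f"Skipping: {e}")
        return None
```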

Bharel