I'm trying to extract data from HTML source using BeautifulSoup. This is the source:

<td class="advisor" colspan="">

Here is my code:

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')

for td in tds:
    if td["colspan"] == '':
        col = 0
    else:
        col = int(td["colspan"])

However, I get this error:

ValueError: invalid literal for int() with base 10: ''

I know this error means '' cannot be converted to an integer, but why doesn't my 'if' work? I would expect this case to reach

col = 0

rather than

col = int(td["colspan"])
chthonicdaemon
Mars Lee

3 Answers

2

I would suggest you use exception handling as follows:

from bs4 import BeautifulSoup

html = """
    <td class="advisor" colspan="2"></td>
    <td class="advisor" colspan=""></td>
    <td class="advisor" colspan="x"></td>
    """

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')

for td in tds:
    try:
        col = int(td["colspan"])
    except (ValueError, KeyError):  # non-numeric value or missing attribute
        col = 0

    print(col)

This would display the following:

2
0
0

Tested using Python 3.4.3

Martin Evans
  • I've tried the above script on the whole html for the url and it seems to work fine. It produces many 0s, with a couple of 2s. What version of Python and BeautifulSoup are you using? – Martin Evans Mar 02 '16 at 08:28
  • I also printed every colspan value and got the same result as you. It is really weird that the if did not work... My Python version is 3.5.1 with BS4 – Mars Lee Mar 02 '16 at 08:32
  • Use `import bs4; print(bs4.__version__)` to display the version, I am using `4.3.2`. – Martin Evans Mar 02 '16 at 08:33
  • I know why... This is totally my fault... I found another place that also tries to assign td["colspan"]... Deeply sorry for my stupidity! – Mars Lee Mar 02 '16 at 08:37
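The cause found in the comments (apparently a second, unguarded use of td["colspan"] elsewhere in the script) can be avoided by routing every conversion through one helper. The following is a minimal sketch, not the asker's actual fix. `colspan_of` is a hypothetical name, and the helper only assumes that each cell offers a dict-style `.get()`, which bs4 `Tag` objects do, so plain dicts stand in for the cells here:

```python
def colspan_of(cell, default=0):
    """Return the cell's colspan as an int, or `default` if missing or non-numeric.

    `colspan_of` is a hypothetical helper name, not part of BeautifulSoup.
    """
    raw = cell.get("colspan")  # None when the attribute is absent
    try:
        return int(raw)
    except (TypeError, ValueError):  # TypeError for None, ValueError for '' or 'x'
        return default


# Plain dicts stand in for the <td> tags in this sketch.
cells = [{"colspan": "2"}, {"colspan": ""}, {"colspan": "x"}, {}]
spans = [colspan_of(cell) for cell in cells]
print(spans)
```

With real markup, the same call would be `colspan_of(td)` inside the `for td in tds:` loop, so no other line needs to call `int()` on the attribute directly.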
1

To avoid an error caused by invalid input, you could check whether the value really is an integer before converting it:

def check_int(s):
    # Empty or missing values are not integers
    if s is None or s == '':
        return False
    st = str(s)
    # Allow a single leading sign
    if st[0] in ('-', '+'):
        return st[1:].isdigit()
    return st.isdigit()

for td in tds:
    if check_int(td["colspan"]):
        col = int(td["colspan"])
    else:
        col = 0

Or, using a conditional (ternary) expression:

for td in tds:
    col = int(td["colspan"]) if check_int(td["colspan"]) else 0

Edit: some good materials to do int checking without try-except.
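One possible way to do the check without try-except is a regular expression. The sketch below is only an illustration under that assumption, not necessarily what the material mentioned above describes. `looks_like_int` and `to_int` are hypothetical names:

```python
import re

# Optional sign followed by one or more ASCII digits.
_INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def looks_like_int(value):
    """Return True if `value` is a string holding a plain signed integer."""
    return isinstance(value, str) and _INT_PATTERN.fullmatch(value.strip()) is not None


def to_int(value, default=0):
    """Convert `value` to int when it looks like one, else return `default`."""
    return int(value) if looks_like_int(value) else default


for raw in ["2", "", "x", "-3", None]:
    print(repr(raw), to_int(raw))
```

Surrounding whitespace is stripped before matching because `int()` itself tolerates it. Anything else, including `None` and a lone sign, falls back to the default.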

Community
Ian
  • Thank you for your help! I finally found that it is my fault! I know why it didn't work!! Thank you : ) – Mars Lee Mar 02 '16 at 09:08
0

You can start by assuming that col is 0, then overwrite it only if the attribute value consists of digits.

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')

for td in tds:
    col = 0
    if td["colspan"].isdigit():
        col = int(td["colspan"])
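One caveat: `str.isdigit()` does not match exactly what `int()` accepts. It rejects signed values such as '-2', and it accepts a few characters, such as the superscript two, that `int()` cannot parse. The sketch below shows the difference and uses `str.isdecimal()` as a stricter guard, on the assumption that colspan values are unsigned:

```python
# "\u00b2" is the superscript two character.
samples = ["2", "", "-2", "\u00b2"]

results = []
for text in samples:
    # isdigit() is True for the superscript two, but int() raises ValueError on it;
    # isdecimal() is False for it, so the guard below stays safe.
    col = int(text) if text.isdecimal() else 0
    results.append(col)
    print(repr(text), text.isdigit(), text.isdecimal(), col)
```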
Mauro Baraldi