
I am trying to scrape data from a webpage that contains a table and then put it in a pandas DataFrame.

I believe I have done everything correctly, but I get repeated columns...

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

html_content = requests.get('http://timetables.itsligo.ie:81/reporting/individual;student+set;id;SG_KAPPL_H08%2FF%2FY1%2F1%2F%28A%29%0D%0A?t=student+set+individual&days=1-5&weeks=29&periods=3-20&template=student+set+individual').text

soup = BeautifulSoup(html_content, 'html.parser')

# The timetable is the seventh table on the page
all_tables = soup.find_all('table')
wanted_table = all_tables[6]

# Skip the first row and keep the rows that follow it
first_tr = wanted_table.find('tr')
following_tr = first_tr.find_next_siblings()

details = []

for tr in following_tr:
    prepare = []
    # Collect the text of every td found anywhere inside the row
    for td in tr.find_all('td'):
        prepare.append(td.text)
    details.append(prepare)

df = pd.DataFrame(details)
pd.set_option('display.max_columns', None)
display(df)  # display() is provided by Jupyter

This works, but as you can see in the picture below (columns 1 and 2 in row 0), I'm getting repeated td's, and one of them always contains repeated \n characters.

One thing I noticed is that the details list seems to come back doubled for some reason. Maybe there is a table nested in a table?

I'm doing this in Jupyter, by the way.

Thank you in advance!

King
  • Be careful not to edit out content that is required for solution. – QHarr Mar 12 '21 at 20:29
  • I second @QHarr, your original post was a [good example](https://stackoverflow.com/help/minimal-reproducible-example). The issue described was no longer clear or reproducible after the edits; I have rolled them back. – Tom Mar 12 '21 at 20:47

2 Answers


Your details list is nested because you are constructing it that way: if you append a list (prepare) to another list (details), you get a nested list. This is fine, since a list of lists reads cleanly into a DataFrame.

Still, you are correct that there is a nested table thing going on in the HTML. I won't try to format the HTML here, but each box in the schedule is a <td> within the overarching wanted_table. When there is a course in one of those cells, there is another <table> used to hold the details. So the class name, professor, etc. are more <td> elements within this nested <table>. So when finding all the cells (tr.find_all('td')), you encounter both the larger class box, as well as its nested elements. And when you get the .text on the outermost <td>, you also get the text from the innermost cells, hence the repetition.
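To see the duplication in isolation, here is a minimal sketch on a made-up snippet. The markup and cell text are assumptions for illustration, not the real timetable page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first outer cell wraps a nested table with two cells.
html = """
<table id="outer">
  <tr>
    <td><table><tr><td>Maths</td><td>Room 1</td></tr></table></td>
    <td>Free</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer_row = soup.find("table", id="outer").find("tr")

# find_all searches all descendants by default, so the nested cells are
# returned in addition to the outer cell that already contains their text.
all_cells = [td.text for td in outer_row.find_all("td")]
print(all_cells)
```

The outer cell's text already includes "Maths" and "Room 1", and then both nested cells show up again as separate entries, which matches the repeated columns in the question.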

I am not sure if this is the best way, but one option would be to prevent the search from entering the nested table, using the recursive parameter in find_all.

# all your other code above

for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
        prepare.append(td.text)
    details.append(prepare)

df = pd.DataFrame(details)

The above should prevent the repeated elements from appearing. However, the text still contains many \n characters, and the code does not account for cells that span multiple columns. You can start to fix the first by stripping whitespace from the text. For the second, you can read the colspan attribute and pad the prepare list accordingly:

# all your other code above

for tr in following_tr:
    prepare = []
    for td in tr.find_all('td', recursive=False):
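        # strip() with no arguments removes leading/trailing whitespace, including \n
        text = td.text.strip()
        # pad with None so a cell spanning n columns takes up n slots
        prepare += [text] + [None] * (int(td.get('colspan', 1)) - 1)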
    details.append(prepare)

df = pd.DataFrame(details)
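
As a rough, self-contained check of the padding idea, the sketch below runs the same logic on a made-up row. The HTML and the helper name row_to_list are hypothetical, and get_text(" ", strip=True) is swapped in for td.text.strip() so that whitespace inside the nested table is also dropped:

```python
from bs4 import BeautifulSoup

# Hypothetical row: the first cell spans two columns and wraps a nested table.
html = """
<table id="outer"><tr>
<td colspan="2">
  <table><tr><td>Maths</td></tr></table>
</td>
<td> Free
</td>
</tr></table>
"""

def row_to_list(tr):
    """Flatten one <tr> into a list, padding spanned columns with None."""
    cells = []
    # recursive=False keeps the search out of nested tables
    for td in tr.find_all("td", recursive=False):
        text = td.get_text(" ", strip=True)
        span = int(td.get("colspan", 1))  # a missing colspan means one column
        cells += [text] + [None] * (span - 1)
    return cells

soup = BeautifulSoup(html, "html.parser")
row = row_to_list(soup.find("table", id="outer").find("tr"))
print(row)
```

Each row then has one entry per visual column, so the rows line up when passed to pd.DataFrame.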

It's a little too unwieldy to post the output. And there is still formatting you will likely want to do, but that is getting beyond the scope of your original post. Hopefully something in here helps!

Tom
  • Nicely detailed explanation. It would be great if you added in the header row as well and converted to df at end – QHarr Mar 12 '21 at 20:28
  • @QHarr, thanks! I updated to include the `df` construction. As for the header row, OP's original post seemed to be explicitly omitting this, so I did as well. But I see it has changed a lot now.... – Tom Mar 12 '21 at 20:37
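
Alternatively, pandas can parse HTML tables directly with read_html:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
# read_html returns a list with one DataFrame per table it parses
df_list = pd.read_html(url)

len(df_list)

Output: 32

After specifying na_values, as below: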

pd.read_html(
    url, 
    na_values=["Forbes: The World's Billionaires website"]
    )[0].tail()
Divyessh