How to parse HTML tables using html5lib and Beautiful Soup in Jupyter?

Question

I'm getting a ValueError when trying to parse a page with BeautifulSoup and html5lib in Jupyter:

import pandas as pd
import requests
import html5lib

url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"

r = requests.get(url)
df_list = pd.read_html(r.text) # this parses all the tables in webpages to a list
df = df_list[0]
df.head()

ValueError                                Traceback (most recent call last)
Cell In[1], line 9
      6 url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
      8 r = requests.get(url)
----> 9 df_list = pd.read_html(r.text) # this parses all the tables in webpages to a list
     10 df = df_list[0]
     11 df.head()

File D:\Drivers\Anaconda\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:1205, in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only, extract_links)
   1201 validate_header_arg(header)
   1203 io = stringify_path(io)
-> 1205 return _parse(
   1206     flavor=flavor,
   1207     io=io,
   1208     match=match,
   1209     header=header,
   1210     index_col=index_col,
   1211     skiprows=skiprows,
   1212     parse_dates=parse_dates,
   1213     thousands=thousands,
   1214     attrs=attrs,
   1215     encoding=encoding,
   1216     decimal=decimal,
   1217     converters=converters,
   1218     na_values=na_values,
   1219     keep_default_na=keep_default_na,
   1220     displayed_only=displayed_only,
   1221     extract_links=extract_links,
   1222 )

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:1006, in _parse(flavor, io, match, attrs, encoding, displayed_only, extract_links, **kwargs)
   1004 else:
   1005     assert retained is not None  # for mypy
-> 1006     raise retained
   1008 ret = []
   1009 for table in tables:

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:986, in _parse(flavor, io, match, attrs, encoding, displayed_only, extract_links, **kwargs)
    983 p = parser(io, compiled_match, attrs, encoding, displayed_only, extract_links)
    985 try:
--> 986     tables = p.parse_tables()
    987 except ValueError as caught:
    988     # if `io` is an io-like object, check if it's seekable
    989     # and try to rewind it before trying the next parser
    990     if hasattr(io, "seekable") and io.seekable():

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:262, in _HtmlFrameParser.parse_tables(self)
    254 def parse_tables(self):
    255     """
    256     Parse and return all tables from the DOM.
    257 
   (...)
    260     list of parsed (header, body, footer) tuples from tables.
    261     """
--> 262     tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    263     return (self._parse_thead_tbody_tfoot(table) for table in tables)

File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:618, in _BeautifulSoupHtml5LibFrameParser._parse_tables(self, doc, match, attrs)
    615 tables = doc.find_all(element_name, attrs=attrs)
    617 if not tables:
--> 618     raise ValueError("No tables found")
    620 result = []
    621 unique_tables = set()

ValueError: No tables found

I also tried parsing the page in Jupyter with:

BeautifulSoup(html.text, 'html.parser')

But in this case the result doesn't contain the page content shown in a browser: the tables are missing.

I read that this is possible with Selenium or PyCharm.

Apparently it is also possible with pandas and html5lib, but I have never used html5lib and don't know what the approach should be.

Is there something specific to html5lib? Are there any mistakes in my very simple code? Are there other ways to parse tables in a web page, for example with lxml? Where should I look for a solution?

The table is added dynamically using JavaScript. `read_html()` doesn't execute JavaScript. — Barmar, May 23 '23 at 21:24
@Barmar - Thanks, I saw the question with such a solution. I tried but ended up with the error. So, could you please provide the syntax of using JavaScript in the question's context? — Eugene, May 23 '23 at 21:37
You need to use something like Selenium WebDriver. I have no idea how to connect that to pandas. — Barmar, May 23 '23 at 21:43
https://stackoverflow.com/questions/23377533/python-beautifulsoup-parsing-table — Eugene, May 23 '23 at 22:24
This is to confirm that the EASIEST way (as I referred above) DOES WORK! WITHOUT BeautifulSoup, html5lib, JavaScript and other tricks. Pandas and requests only! Cannot provide the working code - message is too short. The code turned out to be almost the same as was in my question. Without html5lib, though. Looks like the problem was with the URL. — Eugene, May 23 '23 at 23:07
@Barry the Platipus. Your answer works also. And it gives more info, with names of columns. Thank you Barry! — Eugene, May 23 '23 at 23:31
There's nothing in that question about dynamically-created tables. — Barmar, May 24 '23 at 14:46

score 1 · Answer 1 · answered May 23 '23 at 23:20

This is to confirm that the EASIEST way (I provided the link above) DOES WORK! WITHOUT BeautifulSoup, html5lib, JavaScript and other tricks. Pandas and requests only!

The working code:

import requests 
import pandas as pd

url = "https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies#:~:text=%20%20%20%20Film%20%20%20,%20%204%20%2025%20more%20rows%20"

r = requests.get(url)
df_list = pd.read_html(r.text) # this parses all the tables in webpages to a list
df = df_list[0]
df.head()

OUT:

    0   1
0   1998    100 Movies
1   1999    100 Stars
2   2000    100 Laughs
3   2001    100 Thrills
4   2002    100 Passions

print(len(df_list)) gives 14, the number of tables found. (len(df_list[0]) would instead count the rows of the first table.)

The code is almost the same as in my question, so it looks like the problem was with the URL. Strangely, the error message did not point to it.

None of the URL after `#` is sent to the server, it should have no impact. — Barmar, May 24 '23 at 14:40
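To illustrate the comment above: the fragment (everything after `#`) is handled by the client and is not part of the HTTP request, so it cannot change what the server returns. A minimal sketch using the standard library's `urllib.parse.urldefrag`; the variable names are illustrative only:

```python
from urllib.parse import urldefrag

url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"

# urldefrag splits a URL into the part that identifies the resource
# and the fragment, which stays on the client side.
base_url, fragment = urldefrag(url)

print(base_url)  # the URL without the fragment
print(fragment)  # the fragment on its own
```

This suggests that the two pages differ in how the table is delivered (present in the served HTML versus built later by JavaScript), not in the fragments of their URLs.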

score 0 · Accepted Answer · answered May 23 '23 at 23:11

The data is in the page, but it is turned into a table by JavaScript. Pandas cannot execute JavaScript, so it never sees that table. I notice you're also importing the requests package. Here is one way of obtaining that GDP data: use requests to retrieve the page, use BeautifulSoup to parse the HTML response and isolate the element holding the data, then use json to parse that element and get the actual data:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import json

url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"

r = requests.get(url)
soup = bs(r.text, 'html.parser')
elem_w_data = soup.select_one('script[id="__NEXT_DATA__"]').text

df = pd.json_normalize(json.loads(elem_w_data)['props']['pageProps']['data'])
print(df)

Result in terminal:

               pop   id        imfGDP           unGDP                   country   gdpPerCapita      continent
  0   3.399966e+05  840  2.669515e+13  18624475000000             United States   7.851594e+04  North America
  1   5.050000e-03  840  2.669515e+13  18624475000000             United States   5.286168e+12  North America
  2   1.425671e+06  156  2.186548e+13  11218281029298                     China   1.533697e+04           Asia
  3  -1.500000e-04  156  2.186548e+13  11218281029298                     China  -1.457699e+14           Asia
  4   1.232945e+05  392  5.291351e+12   4936211827875                     Japan   4.291635e+04           Asia
...            ...  ...           ...             ...                       ...            ...            ...
419   8.260000e-03  788  0.000000e+00     41703561397                   Tunisia   5.048857e+09         Africa
420   4.606200e+01  796  0.000000e+00       917550492  Turks and Caicos Islands   1.991990e+04  North America
421   7.860000e-03  796  0.000000e+00       917550492  Turks and Caicos Islands   1.167367e+08  North America
422   3.674463e+04  804  0.000000e+00     93270354852                   Ukraine   2.538339e+03         Europe
423  -7.448000e-02  804  0.000000e+00     93270354852                   Ukraine  -1.252287e+09         Europe

424 rows × 7 columns

Relevant documentation: pandas, requests, BeautifulSoup.
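The same idea can be tried offline on a small hand-written page. The sketch below is an illustration only: the `SAMPLE_HTML` string and its values are made up, and `NextDataExtractor` is a hypothetical helper built on the standard library's `html.parser` instead of BeautifulSoup, so it runs without network access or third-party packages.

```python
import json
from html.parser import HTMLParser

# Made-up stand-in for a page that ships its table data as JSON
# inside a script element and has no <table> in the served HTML.
SAMPLE_HTML = """
<html><body>
<div id="table-root"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"data": [
  {"country": "A", "gdp": 1},
  {"country": "B", "gdp": 2}
]}}}
</script>
</body></html>
"""


class NextDataExtractor(HTMLParser):
    """Hypothetical helper: collects the text of <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Start collecting when the script element with the expected id opens.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


parser = NextDataExtractor()
parser.feed(SAMPLE_HTML)
parser.close()

# Same key path as in the answer above, applied to the made-up JSON.
rows = json.loads("".join(parser.chunks))["props"]["pageProps"]["data"]
print(rows)
```

Here `rows` is a plain list of dicts, which `pd.json_normalize(rows)` or `pd.DataFrame(rows)` would turn into a DataFrame.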

@Barry the Platipus. Thank you once more, Barry! So here we see two ways of parsing tables: the simplest one, and yours, which gives more detailed info. — Eugene, May 23 '23 at 23:37
