3

Current code:

import requests
import pandas as pd
   
url = 'https://docs.anaconda.com/anaconda/user-guide/getting-started/'
html = requests.get(url, verify=False).content
df_list = pd.read_html(html, flavor='bs4')
df = df_list[0]

I'm trying to extract HTML tables from a page using the pandas.read_html() function with the 'flavor' argument set to 'bs4' or 'html5lib'. I get the error: ImportError: html5lib not found, please install it.

 C:\Users\...\Miniconda3\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'docs.anaconda.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
Traceback (most recent call last):
  File "C:\Users\...\Documents\...\data_scrape.py", line 11, in <module>
    df_list = pd.read_html(html, flavor='bs4')
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\io\html.py", line 1100, in read_html
    displayed_only=displayed_only,
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\io\html.py", line 891, in _parse
    parser = _parser_dispatch(flav)
  File "C:\Users\...\Miniconda3\lib\site-packages\pandas\io\html.py", line 840, in _parser_dispatch
    raise ImportError("html5lib not found, please install it")
ImportError: html5lib not found, please install it

But I definitely have bs4 and html5lib installed in the environment. Here is the output of the conda list command:

conda list
# packages in environment at C:\Users\...\Miniconda3\envs\web_scrape:
#
# Name                    Version                   Build  Channel
beautifulsoup4            4.9.1            py38h32f6830_0    conda-forge
bs4                       4.9.1                         0    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py38h32f6830_0    conda-forge
html5lib                  1.1                pyh9f0ad1d_0    conda-forge
intel-openmp              2020.1                      216
libblas                   3.8.0                    16_mkl    conda-forge
libcblas                  3.8.0                    16_mkl    conda-forge
libiconv                  1.15             vc14h29686d3_5  [vc14]  anaconda
liblapack                 3.8.0                    16_mkl    conda-forge
libxml2                   2.9.10               h464c3ec_1    anaconda
libxslt                   1.1.34               he774522_0    anaconda
lxml                      4.5.2            py38he3d0fc9_0    conda-forge
mkl                       2020.1                      216
numpy                     1.18.5           py38h72c728b_0    conda-forge
openssl                   1.1.1g               he774522_0    conda-forge
pandas                    1.0.5            py38he6e81aa_0    conda-forge
pip                       20.1.1                     py_1    conda-forge
python                    3.8.3           cpython_h5fd99cc_0    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytz                      2020.1             pyh9f0ad1d_0    conda-forge
setuptools                49.2.0           py38h32f6830_0    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
soupsieve                 2.0.1            py38h32f6830_0    conda-forge
sqlite                    3.32.3               he774522_1    conda-forge
vc                        14.1                 h869be7e_1    conda-forge
vs2015_runtime            14.16.27012          h30e32a0_2    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.34.2                     py_1    conda-forge
wincertstore              0.2                   py38_1003    conda-forge

I don't know why the packages aren't being recognized by the pandas function. There are multiple other posts that deal with the same problem, but none of the solutions have worked for me.

For example, posts such as: Python: ImportError: lxml not found, please install it

The answers to those posts suggest using pip3 to install the packages. When I run those commands, I get the following output:

pip3 install html5lib
Requirement already satisfied: html5lib in c:\users\...\miniconda3\envs\web_scrape\lib\site-packages (1.1)
Requirement already satisfied: six>=1.9 in c:\users\...\miniconda3\envs\web_scrape\lib\site-packages (from html5lib) (1.15.0)
Requirement already satisfied: webencodings in c:\users\...\miniconda3\envs\web_scrape\lib\site-packages (from html5lib) (0.5.1)

Any help or references to a similar problem are appreciated!

Thank you!

smci
GeosGeek
  • I don't know how you are running the first script, but the original (failing) script appears to be running in the system Miniconda, while your other two examples are clearly running in a conda env. The system site-packages does not have these packages, but your conda environment does. – Corley Brigman Jul 14 '20 at 20:23
  • If you want to debug why, run with `python -v` (per [this answer](https://stackoverflow.com/a/7334681/202229)); you'll then see that the ImportError is thrown from pandas/io/html.py line 864, which happens when the `if not _HAS_HTML5LIB` check passes, i.e. `_HAS_HTML5LIB` is false, because `html5lib = import_optional_dependency("html5lib", errors="ignore")` on line 66 failed. To debug further, look inside `import_optional_dependency()` in `pandas/compat/_optional.py`. – smci Mar 25 '23 at 06:56
  • Also, tell us what your versions of python, pandas, conda are. (I see your html5lib is 1.1). – smci Mar 25 '23 at 07:03
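
The comments above suggest that the failing script may be running under a different interpreter than the one whose packages were listed. A minimal sketch for checking this, using only the standard library: run it with the same command used to launch the failing script. The helper name `describe_environment` is hypothetical, not part of pandas or conda.

```python
import importlib.util
import sys


def describe_environment(modules=("pandas", "bs4", "html5lib", "lxml")):
    """Report which interpreter is running and which modules it can locate.

    Hypothetical helper for illustration; it only inspects the current process.
    """
    report = {
        "executable": sys.executable,  # path of the Python binary actually running
        "prefix": sys.prefix,          # root directory of the active environment
        "modules": {},
    }
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        spec = importlib.util.find_spec(name)
        report["modules"][name] = spec.origin if spec is not None else None
    return report


if __name__ == "__main__":
    info = describe_environment()
    print("interpreter:", info["executable"])
    print("prefix:     ", info["prefix"])
    for name, origin in info["modules"].items():
        print(f"{name:10s} ->", origin or "NOT FOUND")
```

If the printed interpreter path points at the base Miniconda install rather than the `web_scrape` environment, that would be consistent with the mismatch described in the first comment.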

2 Answers

2

Try

conda install -c anaconda html5lib 

I had the same issue, and I have no idea why this worked, but it worked just fine for me. I had the same trouble with the lxml library and applied the same solution. I copied the answer from a post on GitHub:

https://github.com/jupyter/notebook/issues/3623

0

For anyone coming here: this page was close to the top of my search results, and my resolution was different but simple. I had neglected to restart Jupyter after installing html5lib. Jupyter was still running, and once I restarted it, everything was fine.

Kevin Schroeder
  • This tells us almost nothing about why you figure it resolved the issue: had you a) updated to a new pandas version (which version?) that does the import correctly? b) had you updated conda? on which packages? which command? c) something else? Just because it fixed the issue on your setup, doesn't mean it'll work for others. You need to diagnose why. – smci Aug 30 '23 at 23:32
  • I updated my comment to note that after I had installed html5lib I had neglected to restart Jupyter. – Kevin Schroeder Sep 01 '23 at 09:48
  • Which version of html5lib did you install? Which version of pandas? of Jupyter? This information is still missing. – smci Sep 03 '23 at 00:58
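
For the restart case discussed in the answer above, a rough way to tell whether a long-running session (such as a Jupyter kernel) can see a newly installed package is to compare what is already loaded with what is currently findable. This is only a sketch using the standard library; the helper name `import_state` and its three labels are made up for illustration.

```python
import importlib
import importlib.util
import sys


def import_state(name):
    """Classify a top-level module as 'loaded', 'installed-not-loaded', or 'missing'.

    Hypothetical helper; the labels are illustrative, not a standard API.
    """
    if name in sys.modules:
        # Already imported earlier in this process
        return "loaded"
    # Refresh the import system's caches so packages installed after the
    # interpreter started can be discovered.
    importlib.invalidate_caches()
    if importlib.util.find_spec(name) is not None:
        return "installed-not-loaded"
    return "missing"


if __name__ == "__main__":
    for module_name in ("html5lib", "bs4", "lxml"):
        print(module_name, "->", import_state(module_name))
```

If `html5lib` is reported as installed but `read_html` still raises the ImportError in the same session, that would be consistent with the session holding stale state, and restarting it is the simplest thing to try.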