How do I use Python and lxml to parse a local html file?

Question

I am working with a local HTML file in Python and trying to parse it with lxml. For some reason I can't get the file to load properly, and I'm not sure whether the cause is the lack of an HTTP server on my local machine, my etree usage, or something else.

My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/

This could be a related problem: Requests : No connection adapters were found for, error in Python3

Here is my code:

from lxml import html
import requests

page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)

test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')

print test

The traceback that I'm getting reads:

C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'

Process finished with exit code 1

You can see that it has something to do with a "connection adapter" but I'm not sure what that means.

Why don't you start from a minimal example of your local HTML file? That makes it easier for you to learn, and you can post the contents here so everyone can follow along. – Midnighter, Sep 24 '15 at 16:02
Unfortunately the file is huge, and I fear that simplifying it could change the output of the program. – rdevn00b, Sep 24 '15 at 16:04

Bryan Oakley · Accepted Answer · score 32 · 2015-09-24T16:13:43.837

If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.

from lxml import html

# Read the local file directly; no HTTP request is needed
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
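For anyone who wants to try this without the original file, here is a minimal, self-contained sketch of the same approach. The sample markup, the temporary file, and the short XPath are all assumptions made for illustration; they are not the asker's page or query.

```python
import os
import tempfile

from lxml import html

# Hypothetical sample page standing in for the real local file
SAMPLE = """<html><body>
<form><div><p><strong>Hello</strong> world</p></div></form>
</body></html>"""

# Write the sample to a temporary .html file so the sketch runs anywhere
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(SAMPLE)
    path = tmp.name

try:
    # Read the file from disk; requests is not involved at all
    with open(path, "r", encoding="utf-8") as f:
        page = f.read()

    # fromstring() takes the markup itself, not a path or a response object
    tree = html.fromstring(page)

    # A short relative XPath; xpath() returns a list of matching text nodes
    result = tree.xpath("//p/strong/text()")
finally:
    os.remove(path)

print(result)  # ['Hello']
```

Passing `encoding` to `open()` is optional, but it avoids depending on the platform default when the file's encoding is known.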

edited Sep 24 '15 at 16:13

answered Sep 24 '15 at 16:06


Ok I'm trying this but it's telling me that the .text in page.text is unresolvable. – rdevn00b Sep 24 '15 at 16:10
@rdevn00b: my bad. Yes, just use `page`, not `page.text`. I'll update my answer. – Bryan Oakley Sep 24 '15 at 16:13

score 12 · Answer 2 · answered Jan 27 '19 at 13:52

There is a better way to do it: use the parse function instead of fromstring. parse accepts a file path directly, so you don't have to open and read the file yourself.

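from lxml import html
# Raw string, so the backslashes in the Windows path are not treated as escapes
tree = html.parse(r"C:\Users\...site_1.html")
print(html.tostring(tree))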
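One difference worth knowing is that `html.parse` returns an ElementTree rather than a single element, so `getroot()` is the way to reach the `<html>` element. The sketch below shows this on a hypothetical sample page written to a temporary file; the markup and the XPath are assumptions for illustration only.

```python
import os
import tempfile

from lxml import html

# Hypothetical sample page standing in for the real local file
SAMPLE = "<html><body><p><strong>Hello</strong></p></body></html>"

# Write the sample to a temporary .html file so the sketch runs anywhere
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(SAMPLE)
    path = tmp.name

try:
    # parse() reads the file itself and returns an ElementTree
    tree = html.parse(path)

    # getroot() gives the top-level <html> element
    root = tree.getroot()

    # XPath queries work the same way as with fromstring()
    texts = root.xpath("//strong/text()")

    # encoding="unicode" makes tostring() return str instead of bytes
    serialized = html.tostring(root, encoding="unicode")
finally:
    os.remove(path)

print(texts)
print(serialized)
```

Without `encoding="unicode"`, `html.tostring` returns bytes, which is why printing it in Python 3 shows a `b'...'` prefix.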

molhamaleh

Don't forget to do the import first: `from lxml import html` – didierCH Jan 01 '21 at 16:38

score 5 · Answer 3 · answered Jan 28 '20 at 12:09

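You can also try Beautiful Soup:

from bs4 import BeautifulSoup

# "filepath" is a placeholder for the path to the local HTML file
with open("filepath", encoding="utf8") as f:
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(f, "html.parser")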

product_nick
