
I am working with a local HTML file in Python and trying to parse it with lxml. For some reason I can't get the file to load properly, and I'm not sure whether that is because I don't have an HTTP server set up on my local machine, because of how I'm using etree, or because of something else.

My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/

This could be a related problem: "Requests: No connection adapters were found for" error in Python 3

Here is my code:

from lxml import html
import requests

page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)

test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')

print test

The traceback that I'm getting reads:

C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'

Process finished with exit code 1

You can see that it has something to do with a "connection adapter" but I'm not sure what that means.

rdevn00b
  • Why don't you start from a minimal example of your local HTML file? That makes it easier for you to learn, and you can post the contents here so everyone can follow along. – Midnighter Sep 24 '15 at 16:02
  • Unfortunately the file is huge and I fear that simplifying it could change the output of the program. – rdevn00b Sep 24 '15 at 16:04

3 Answers

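If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server: it only has connection adapters for URLs that start with http:// or https://, so a Windows path matches none of them and you get the InvalidSchema error.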

from lxml import html

# Read the local file directly; no HTTP request is involved.
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
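
To see what "connection adapter" means in the traceback, here is a minimal sketch that asks a requests Session which adapter it would use for a URL, without sending anything over the network. The helper name has_adapter and the path C:\example\page.html are made up for illustration; they are not from the question.

```python
import requests
from requests.exceptions import InvalidSchema

session = requests.Session()

# URL prefixes that a fresh Session has adapters mounted for.
mounted_prefixes = sorted(session.adapters.keys())


def has_adapter(url):
    """Return True if the session can find a connection adapter for url."""
    try:
        session.get_adapter(url)
        return True
    except InvalidSchema:
        return False


# Hypothetical path, for illustration only; nothing is opened or requested.
local_path = r"C:\example\page.html"

print(mounted_prefixes)
print(has_adapter("http://example.com/"))
print(has_adapter(local_path))
```

The lookup is a plain prefix match on the URL string, so anything that does not begin with one of the mounted prefixes fails in the same way as the path in the question.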
Bryan Oakley

There is a better way to do it: use the parse function instead of fromstring. parse takes a file path directly, so you don't have to open and read the file yourself.

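from lxml import html
# Raw string, so the backslashes in the Windows path are not read as escapes.
tree = html.parse(r"C:\Users\...site_1.html")
print(html.tostring(tree))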
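
As a rough end-to-end sketch of the parse approach, the snippet below writes a small sample page to a temporary file, parses it by path, and runs an XPath query on the result. The markup, the file name sample.html, and the XPath are invented for the example, and it assumes Python 3 (for tempfile.TemporaryDirectory).

```python
import os
import tempfile

from lxml import html

# Made-up markup standing in for the real page.
SAMPLE = "<html><body><div><p><strong>Hello</strong> world</p></div></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the sample to disk so there is a real path to parse.
    sample_path = os.path.join(tmp_dir, "sample.html")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)

    # html.parse accepts a filename and returns an ElementTree.
    tree = html.parse(sample_path)

    # XPath queries can be run directly on the parsed tree.
    strong_text = tree.xpath("//p/strong/text()")

print(strong_text)
```

A short relative XPath like this one tends to survive small layout changes better than a long absolute path copied from the browser.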
molhamaleh

You can also try Beautiful Soup:

from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" ships with Python.
with open("filepath", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")
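
For completeness, a small sketch of pulling text out once the soup is built. The sample markup and the file name sample.html are invented for the example, and it assumes Python 3 with bs4 installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up markup standing in for the real page.
SAMPLE = "<html><body><p><strong>Hello</strong> world</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.html")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)

    # The document is parsed when the soup is constructed,
    # so the file can be closed right afterwards.
    with open(sample_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

# Find the first <strong> tag and take its text.
strong_text = soup.find("strong").get_text()

print(strong_text)
```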
product_nick