Why doesn't urllib work with local website?

Question

I can't seem to scrape my own local website with urllib. I can get it to print out the full contents of the page, but the regex doesn't match anything: the output I get with the current code is just []. What am I doing wrong? I haven't used urllib in a while, so it is very possible I missed something obvious. Python file:

import urllib
import re

htmlfile=urllib.urlopen('IP of server')
htmltext=htmlfile.read()
regex="<body>(.+?)</body>"
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

HTML file:

<html>
    <body>
        This is a basic HTML file to try to get my python file to work...
    </body>
</html>

Thanks a bunch in advance!

`"I can get it to print out all the contents of the website"` Then `urllib` is working just fine. — Jonathon Reinhart, Jan 14 '15 at 01:47

hwnd · Accepted Answer · 2015-01-14T02:04:47.143

A few things are wrong here. You need to enable the dotall modifier, which makes the dot match newline characters as well. Without it, .+? cannot get from <body> to </body>, because in your HTML they sit on different lines. The lines that compile your regex and call findall should be:

regex = "<body>(.+?)</body>"
pattern = re.compile(regex, re.DOTALL)
price = pattern.findall(htmltext)

This can be simplified to a single call, as below. I would also recommend discarding the whitespace around the match result.

price = re.findall(r'(?s)<body>\s*(.+?)\s*</body>', htmltext)
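To see the difference without a server, here is a minimal sketch that runs the three patterns against a hard-coded copy of the HTML from the question. The variable names (`html`, `no_flag`, `with_flag`, `trimmed`) are made up for illustration, and the parenthesized `print` calls are only there so the snippet behaves the same on Python 2 and Python 3.

```python
import re

# Hard-coded copy of the HTML from the question, so no server is needed.
html = """<html>
    <body>
        This is a basic HTML file to try to get my python file to work...
    </body>
</html>
"""

# Without DOTALL, "." stops at every newline, so the pattern can never
# get from <body> to </body> when they are on different lines.
no_flag = re.findall(r"<body>(.+?)</body>", html)

# With DOTALL, "." also matches newlines; surrounding whitespace is kept.
with_flag = re.findall(r"<body>(.+?)</body>", html, re.DOTALL)

# Inline (?s) is the same flag; \s* on both sides trims the whitespace.
trimmed = re.findall(r"(?s)<body>\s*(.+?)\s*</body>", html)

print(no_flag)    # empty list, as in the question
print(with_flag)  # one match, including the newlines and indentation
print(trimmed)    # one match, text only
```

The first result reproduces the `[]` from the question; the other two differ only in whether the leading and trailing whitespace is part of the captured group.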

For future reference, use a parser such as BeautifulSoup to extract the data instead of regular expressions.

score 2 · Answer 2 · edited May 23 '17 at 10:25

Alternatively, and this should actually be preferred to a regex-based approach, use an HTML parser.

Example (using BeautifulSoup):

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <html>
...     <body>
...         This is a basic HTML file to try to get my python file to work...
...     </body>
... </html>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> print soup.body.get_text(strip=True)
This is a basic HTML file to try to get my python file to work...

Note how simple the code is: no "regex magic".

answered Jan 14 '15 at 02:03

alecxe

I was too lazy to write this +1 ;) – Padraic Cunningham Jan 14 '15 at 02:05
I tried BeautifulSoup and it didn't find my local IP. – user3818089 Jan 14 '15 at 02:07
@user3818089 No, `BeautifulSoup` is only a parser. To give it something to parse, you need to make the HTTP request yourself with `urllib`, `urllib2`, or `requests`. – alecxe Jan 14 '15 at 02:08
So I would still do `htmlfile=urllib.urlopen("IP of server")` and plug that into the `soup = BeautifulSoup(htmlfile)`? Thanks! – user3818089 Jan 14 '15 at 02:22
@user3818089 I would use `urllib2` instead of `urllib`, but yeah, correct. – alecxe Jan 14 '15 at 02:24
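
The split described in these comments (one piece fetches the HTML, another parses it) can be sketched as follows. This is only an illustration under several assumptions: it targets Python 3, where `urlopen` lives in `urllib.request`; it swaps in the standard library's `html.parser` for BeautifulSoup so that nothing needs to be installed; the names `BodyText`, `body_text`, and `fetch` are hypothetical; the page is assumed to be UTF-8; and the address in the last comment is a placeholder, not the asker's server.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class BodyText(HTMLParser):
    """Collects the text that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self.in_body:
            self.chunks.append(data)


def body_text(html):
    """Parsing step: return the stripped text inside <body>."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


def fetch(url):
    """Fetching step: download a page and decode it (UTF-8 assumed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Hard-coded copy of the page from the question, so this runs offline.
sample = (
    "<html>\n"
    "    <body>\n"
    "        This is a basic HTML file to try to get my python file to work...\n"
    "    </body>\n"
    "</html>\n"
)
print(body_text(sample))

# Against a running server (placeholder address, not from the question):
# print(body_text(fetch("http://127.0.0.1:8000/")))
```

The parser never touches the network; it only sees the string that `fetch` returns. With BeautifulSoup installed, the `body_text` step would be replaced by the `soup.body.get_text(strip=True)` call shown in the answer above.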

score 1 · Answer 3 · answered Jan 14 '15 at 01:47

The dot . does not match line breaks unless you set the dot-matches-all s modifier:

re.compile('<body>(.+?)</body>', re.DOTALL)

answered Jan 14 '15 at 01:47

Aran-Fey
