8

I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.

I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?

 data = urllib2.urlopen(url)
 print data

Only gives me about 1/3 of the source.

 data = urllib2.urlopen(url)
 for lines in data.readlines():
     print lines

This gives me the entire source.

Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.

octopusgrabbus
Rentafence
    possible duplicate of [Download html page and its content](http://stackoverflow.com/questions/1825438/download-html-page-and-its-content) – Michael Mrozek Jun 06 '12 at 04:57

5 Answers

5

You are probably looking for Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/ It's an open-source HTML parsing library for Python. Best of luck!

vaebnkehn
5

You should be able to use file.read() to read the entire file into a string. That will give you the entire source. Something like

data = urllib2.urlopen(url)
print data.read()

should give you the entire webpage.
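To keep the page in a variable rather than printing it, assign the result of read() and search that string. A minimal sketch, written for Python 3 (where urlopen lives in urllib.request rather than urllib2); the read_page helper, the UTF-8 encoding, and the BytesIO stand-in for the network response are assumptions for illustration, and the regex is only there to show that the variable holds the full text:

```python
import io
import re


def read_page(response, encoding="utf-8"):
    """Read a whole file-like response and return it as one string (hypothetical helper)."""
    raw = response.read()  # read() with no argument returns everything up to EOF
    return raw.decode(encoding, errors="replace")  # assumed encoding: UTF-8


# Stand-in for the object returned by urlopen(url); anything with read() works here.
fake_response = io.BytesIO(b"<html>\n<title>Example page</title>\n</html>\n")
html = read_page(fake_response)

# The whole source is now a single string that re can search.
match = re.search(r"<title>(.*?)</title>", html)
title = match.group(1) if match else None
```

With a real page, read_page(urllib.request.urlopen(url)) would take the place of the stand-in.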

From there, don't parse HTML with regex (well-worn post to this effect here), but use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
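As one possible illustration of the dedicated-parser route, here is a rough sketch using the standard library's html.parser module in Python 3. The LinkCollector class and the sample page string are made up for this example; a third-party parser may suit a real application better:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up page source; in practice this would be the string returned by read().
page = '<p><a href="/first">one</a> and <A HREF="/second">two</A></p>'

collector = LinkCollector()
collector.feed(page)
collector.close()
links = collector.links
```

The parser copes with the mixed-case tag here without any extra handling, which is the kind of detail that tends to trip up a hand-written regex.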

Community
Adam Mihalcin
1

Actually, print data should not give you any HTML content, because data is just a file-like object, not the page text. From the official documentation (https://docs.python.org/2/library/urllib2.html):

This function returns a file-like object

This is what I got:

print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>

readlines() returns a list of the lines of the HTML source, and you can store them in a string like this:

import urllib2
data = urllib2.urlopen(url)
l = []
for line in data.readlines():
    l.append(line)
# readlines() keeps each line's trailing newline, so join on the empty string
s = ''.join(l)

You can either use list l or string s, according to your need.
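One detail worth checking when rebuilding the string: readlines() keeps the newline at the end of each line, so joining on an empty string reproduces the source exactly, while joining on a newline doubles the line breaks. A small sketch in Python 3, using an in-memory BytesIO as an assumed stand-in for the object returned by urlopen:

```python
import io

# Made-up page source standing in for a downloaded document.
source = b"<html>\n<body>\nhello\n</body>\n</html>\n"

# readlines() splits on line boundaries but leaves each trailing newline in place.
lines = io.BytesIO(source).readlines()

joined_plain = b"".join(lines)      # identical to the original bytes
joined_newline = b"\n".join(lines)  # has an extra newline between every pair of lines
```

Since the plain join gives back exactly what read() would return, read() is the shorter route when only the string is needed.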

Niyojan
0

I would also recommend using an open-source web parsing library rather than regex for full HTML parsing; it makes the work much easier. You may still need regex for parsing URLs, though.

dilip kumbham
0

If you want to parse the variable afterwards, you might use gazpacho:

from gazpacho import Soup

url = "https://www.example.com"
soup = Soup.get(url)
str(soup)

That way you can perform finds to extract the information you're after!

emehex