Open a URL and store its contents in a string

Question

import urllib  # Python 2; in Python 3 the function is urllib.request.urlopen
filehandle = urllib.urlopen(myurl)

Because I want to run a regex over the result afterwards, I need to turn the file handle from an object into a string. How can I store the web page's source in a string?

Also, if you want to extract data, don't use regex; use a proper HTML parser like `lxml`. – Jakob Bowyer Oct 07 '12 at 15:45
@JakobBowyer Why should I use `lxml` or `BeautifulSoup`? Isn't it easier with `regex`? – george mano Oct 07 '12 at 15:47
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Jakob Bowyer Oct 07 '12 at 15:48
Why accept Anuj's answer? Mine is clearer and provides a documentation link. – Jakob Bowyer Oct 07 '12 at 16:02

Anuj Gupta · Accepted Answer · 2012-10-07T15:57:35.413

It's pretty simple:

page = filehandle.read()
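
For readers on Python 3, a minimal sketch of the same idea follows. It assumes the function now lives at `urllib.request.urlopen` and that `.read()` returns bytes, which need decoding to get a `str`. The helper name `fetch_text` and the `default_encoding` parameter are hypothetical, chosen only for illustration.

```python
from urllib.request import urlopen


def fetch_text(url, default_encoding="utf-8"):
    """Return the body of `url` as a str (hypothetical helper, Python 3)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset from the Content-Type header when one is declared;
        # falling back to default_encoding is an assumption of this sketch.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset, errors="replace")
```

Calling `fetch_text("http://www.python.org")` would then give a string that can be searched or parsed.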

You can also iterate over it, like:

lines = []
for line in filehandle:
    lines.append(line)

For extracting data, use BeautifulSoup or lxml.
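
As a rough illustration of parser-based extraction, here is a sketch that swaps in the standard library's `html.parser` (Python 3) instead of BeautifulSoup or lxml, purely so that it runs without installing anything. The names `LinkCollector` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

Passing the page string to `extract_links(page)` returns a list of link targets; BeautifulSoup and lxml offer richer querying for anything beyond this.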

edited Oct 07 '12 at 15:57

answered Oct 07 '12 at 15:43

Please don't name your variables after built-in types. – Jakob Bowyer Oct 07 '12 at 15:44
`string` is not a type. `str` is. – Anuj Gupta Oct 07 '12 at 15:45
Calling it `string` isn't descriptive, though. Someone who has to maintain this code might not know what `string` was or where it came from. – Jakob Bowyer Oct 07 '12 at 15:45
You don't need to call `.readlines()`, since it reads the whole file into memory anyway; you can just use `for x in file:`, which reads each line as it is needed. – Jakob Bowyer Oct 07 '12 at 15:55
If you are iterating directly over the file object, you don't need to append to a list, because then you are doing exactly what `readlines()` does. – Jakob Bowyer Oct 07 '12 at 15:58
Hmm... Yes, sir! But `fh` is a file instance. Read it once and you can't do that again. A list is a list. – Anuj Gupta Oct 07 '12 at 15:58
Just use it as you iterate over it! – Jakob Bowyer Oct 07 '12 at 16:00
What if you need to use it again, i.e. iterate twice? – Anuj Gupta Oct 07 '12 at 16:00
If you need a list: `lines = filehandle.readlines()` (though lines are not very useful for HTML text). – jfs Oct 07 '12 at 16:05
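
The read-once behaviour discussed in these comments can be sketched without a network connection. The example below uses `io.BytesIO` as a stand-in for the object returned by `urlopen`; that substitution is an assumption made for illustration, although a real, non-seekable response is likewise exhausted after one pass.

```python
import io

# Stand-in for a urlopen() response (assumption: used here only to avoid network access).
response = io.BytesIO(b"<html>\n<body>hello</body>\n</html>\n")

first_pass = [line for line in response]   # consumes the stream
second_pass = [line for line in response]  # nothing left to read

# Keep the list (or the result of .read()) if the data is needed more than once.
lines = first_pass
```

Here `second_pass` ends up empty, which is why storing the lines or the full page in a variable matters when the content is used more than once.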

Jakob Bowyer · Answer 2 · 2012-10-07T15:53:14.403

Because `urllib.urlopen` returns a file-like object, you can either call `.read()` on it or iterate over it directly.

See the docs for more.

Edit:

Okay, to explain what "directly iterate over it" means:

import urllib
request = urllib.urlopen("http://www.python.org")
for source_line in request:
    print source_line
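
The snippet above is Python 2. A hedged Python 3 equivalent might look like the sketch below, assuming `urllib.request.urlopen` and assuming the page is UTF-8 encoded; the generator name `iter_text_lines` and its `encoding` parameter are hypothetical.

```python
from urllib.request import urlopen


def iter_text_lines(url, encoding="utf-8"):
    """Yield each line of the response as text, without its line ending."""
    with urlopen(url) as response:
        # Iterating over the response yields one bytes line at a time.
        for raw_line in response:
            yield raw_line.decode(encoding, errors="replace").rstrip("\r\n")
```

A loop such as `for source_line in iter_text_lines("http://www.python.org"): print(source_line)` would then print the page line by line.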

edited Oct 07 '12 at 15:53

answered Oct 07 '12 at 15:44

The docs may have the worst notation ever (compared to cplusplus.com and MSDN). – george mano Oct 07 '12 at 15:45
@georgemano They just seem more descriptive and easier to follow; the MSDN docs, for example, seem to jump around all over the place. – Jakob Bowyer Oct 07 '12 at 15:46
What do you mean by `directly iterate over it`? – george mano Oct 07 '12 at 15:51
Is `request` a descriptive name? Because essentially, the variable contains a response. – Anuj Gupta Oct 07 '12 at 15:55
