How can I download and read a URL with universal newlines?

Question

I was using urllib.urlopen with Python 2.7, but I need to process the downloaded HTML document and its contained newlines (within a <pre> element).

The urllib docs indicate that urlopen will not use universal newlines. How can I do this?

score 3 · Accepted Answer · answered Nov 22 '11 at 10:35

Unless the HTML file is already on your disk, urlopen() will correctly handle all newline formats (\n, \r\n and \r) in the HTML file you want to parse; that is, it will convert them to \n. According to the urllib docs:

"If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines)"

E.g.

>>> from urllib import urlopen
>>> urlopen("http://****.com/win_new_lines.htm").read()
'line 1\nline 2\n\n\nline 3'
>>> urlopen("http://****.com/unix_new_lines.htm").read()   
'line 1\nline 2\n\n\nline 3'
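If the bytes returned by read() do still contain \r\n or \r, one way to apply universal-newline translation to them in memory is the io module, which is available in Python 2.7 as well. This is only a sketch: the helper name read_with_universal_newlines and the sample bytes are made up for illustration, and UTF-8 is an assumed encoding, so substitute whatever the page actually declares.

```python
import io

def read_with_universal_newlines(raw_bytes, encoding="utf-8"):
    # Hypothetical helper: raw_bytes stands in for urlopen(...).read().
    # newline=None enables universal newlines mode when reading, so
    # CRLF and lone CR line endings are both translated to LF.
    wrapper = io.TextIOWrapper(io.BytesIO(raw_bytes),
                               encoding=encoding, newline=None)
    return wrapper.read()

# Made-up payload that mixes all three line-ending styles.
sample = b"line 1\r\nline 2\rline 3\n"
text = read_with_universal_newlines(sample)
print(repr(text))
```

The result is decoded text rather than a byte string, so the encoding has to match the document being downloaded.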

You're right. After further diagnosing my bug, I realized this was not actually the problem. — Joe, Nov 23 '11 at 22:48

score 2 · Answer 2 · answered Nov 22 '11 at 04:06

When you process the contents of the pre tags, use splitlines to normalize the line-endings:

'\n'.join(contents.splitlines())
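A small sketch of that expression wrapped in a function, with a made-up sample string standing in for the text of a pre element (normalize_newlines and pre_contents are illustrative names, not part of any library):

```python
def normalize_newlines(contents):
    # splitlines() breaks the text at \n, \r\n and \r and drops the
    # terminators; joining with "\n" leaves one consistent line ending.
    return "\n".join(contents.splitlines())

# Made-up sample mixing Windows, old Mac and Unix line endings.
pre_contents = "line 1\r\nline 2\rline 3\n"
normalized = normalize_newlines(pre_contents)
print(repr(normalized))  # 'line 1\nline 2\nline 3'
```

Note that a trailing newline is dropped, because join() only puts "\n" between lines. Blank lines inside the text are kept.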

ekhumoro
