Remove newline in python with urllib

Question

I am using Python 3.x. When I download a webpage with urllib.request, the result contains a lot of \n sequences. I have tried the methods suggested in other threads on this forum, including strip() and replace(), but none of them removes the \n. I am running this code in Eclipse. Here is my code:

import urllib.request

#Downloading entire Web Document 
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return""  
raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

#Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)

I cannot work out why there are so many \n sequences in the raw_html variable.

Perhaps you're getting a `\r\n` instead of just `\n`? You would still see a new line if you just remove `\n`. Try to replace both. — orange, Dec 28 '14 at 06:10
I also tried `.replace('\n', '').replace('\r', '').replace('\t', '')` but it did not solve my problem! — hnvasa, Dec 28 '14 at 06:14
What happens if you print as hex? Is there a `char(13)` in the output? `' '.join(x.encode('hex') for x in raw_html2)`? — orange, Dec 28 '14 at 06:17

score 8 · Answer 1 · edited May 23 '17 at 12:14

Your download_page() function corrupts the HTML: calling str() on the bytes returned by read() produces the bytes object's repr, which is why you see \n as two characters (a backslash and an n) in the output instead of real newlines. Don't use .replace() or a similar workaround; fix download_page() instead:

from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()

At this point html_content contains a bytes object. To get it as text, you need to know its character encoding. For example, you can take it from the Content-Type HTTP header, falling back to utf-8 if the header does not specify a charset:

encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)

See "A good way to get the charset/encoding of an HTTP response in Python".

If the server doesn't pass a charset in the Content-Type header, there are complex rules for working out the character encoding of an HTML5 document. For example, it may be specified inside the HTML document itself: <meta charset="utf-8"> (you would need an HTML parser to get it).
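As a rough illustration of the `<meta charset>` case only, here is a minimal sketch using the standard library's html.parser. The names MetaCharsetParser and sniff_meta_charset are hypothetical, the 1024-byte limit and the utf-8 default are assumptions, and the sketch ignores the rest of the HTML5 rules (BOMs, `http-equiv` declarations, and so on):

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remembers the first <meta charset="..."> value it encounters."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Only the first declaration counts; ignore everything else.
        if tag != "meta" or self.charset is not None:
            return
        value = dict(attrs).get("charset")
        if value:
            self.charset = value.strip()


def sniff_meta_charset(html_bytes, default="utf-8"):
    """Hypothetical helper: look for <meta charset> near the top of the page."""
    parser = MetaCharsetParser()
    # latin-1 maps every byte to a character, so this decode never fails
    # and leaves the ASCII markup we are looking for intact.
    parser.feed(html_bytes[:1024].decode("latin-1"))
    return parser.charset or default


print(sniff_meta_charset(b'<html><head><meta charset="iso-8859-1"></head></html>'))
```

The returned name could then be passed to html_content.decode(), but only when the Content-Type header gave no charset.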

If you read the HTML correctly, you won't see a literal backslash followed by n in the page; line breaks will be real newline characters.
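The difference between str() and decode() can be seen without any network access. This is a small sketch on a made-up bytes value (the sample HTML is an assumption, not the content of the page in the question):

```python
# A made-up response body: two real newlines and one non-ASCII character.
raw = b"<html>\n<body>caf\xc3\xa9</body>\n</html>"

# str() on bytes gives the repr: a b'...' wrapper with escaped \n and \xc3\xa9.
as_repr = str(raw)

# decode() gives the actual text, with real newline characters.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)

# Real newlines can be removed with a plain replace, if that is still wanted.
one_line = as_text.replace("\n", "")
print(one_line)
```

In the str() result there is no newline character at all, which is why replace('\n', '') has nothing to remove.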

score 1 · Answer 2 · answered Dec 28 '14 at 06:18

If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:

import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page

I removed the try/except clause because a bare except that doesn't target a specific exception (or class of exceptions) is generally bad practice: it silently swallows every error, so if the download fails you have no idea why.
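If some error handling is still wanted, one option is to catch only the exceptions that urllib is expected to raise and report them. The following is a minimal sketch, not the code from this answer: it uses urlopen and decode() rather than str(), and the timeout value, the utf-8 fallback, and the choice to return "" on failure are all assumptions:

```python
from urllib.error import URLError
from urllib.request import urlopen


def download_page(url, timeout=10):
    """Return the page at `url` as text, or "" if it cannot be fetched."""
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Charset from the Content-Type header, or utf-8 if none is given.
            encoding = response.headers.get_content_charset("utf-8")
    except (URLError, ValueError) as exc:
        # URLError also covers HTTPError; ValueError is raised for
        # malformed URLs such as a missing scheme.
        print(f"Could not download {url!r}: {exc}")
        return ""
    return raw.decode(encoding, errors="replace")


print(download_page("not-a-url"))  # reports the ValueError, returns ""
```

Any other exception still propagates, so unexpected failures remain visible.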

MattDMo

thank you for your answer..! it worked for me! I also replaced the function as suggested by you! – hnvasa Dec 28 '14 at 06:24
it is a mistake to call `str()` here, see [the explanation](http://stackoverflow.com/a/27674228/4279). – jfs Dec 28 '14 at 06:40
@J.F.Sebastian I know, I was just trying to keep the OP's code as close to the original as possible. (Yes, I did remove the try/except clauses, but that really bugged me). Personally, I'd rewrite the whole thing to use `requests`, and make a 1-liner out of it, but you deal with what you're dealt :) – MattDMo Dec 28 '14 at 06:46

score 0 · Accepted Answer · answered Dec 28 '14 at 06:18

These seem to be literal \n characters (a backslash followed by an n), so I suggest replacing the escaped form instead:

raw_html2 = raw_html.replace('\\n', '')

Avinash Raj

This cures only the symptoms, and not even all of them: e.g., I bet there is a spurious `b'` in the string, see [the explanation](http://stackoverflow.com/a/27674228/4279). – jfs Dec 28 '14 at 06:42
