Python raw HTML contains "\n" characters that I cannot remove with the replace command

Question

I am fetching HTML with a Python get(url) call, which returns raw HTML data that contains "\n" characters. When I run replace("\n", "") against it, the characters are not removed. Could someone explain how to remove them, either at the "simple_get" stage or at the "raw_htmlB" stage? Code below.

from CodeB import simple_get

htmlPath = "https://en.wikipedia.org/wiki/Terminalia_nigrovenulosa"        
raw_html = simple_get(htmlPath)
if raw_html is None:
    print("not found")
else:
    tmpHtml = str(raw_html)
    tmpHtmlB = tmpHtml.replace("\n","")    
    print("tmpHtmlB:=", tmpHtmlB)


from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200 
        and content_type is not None 
        and content_type.find('html') > -1)

def log_error(e):
    print(e)

Python String literals support backslash escaped chars. Many answers already on SO, such as https://stackoverflow.com/a/4369166/1531971 — , Sep 18 '18 at 17:45
Thanks for all the replies. As you may have guessed, I am new to the wacky world of Python. This question had been driving me up the wall, and in the end the answer turned out to be so simple. Thanks again. — Shaun, Sep 19 '18 at 09:49

score 0 · Answer 1 · answered Sep 18 '18 at 16:40

I think simply adding a space between your double quotes should do it.

shivam thakur


score 0 · Answer 2 · answered Sep 18 '18 at 16:41

Use a raw string, r'\n', or remember that \n stands for a newline, so you need to escape the backslash: .replace('\\n', '')
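To see why the escaped form is needed here, the sketch below uses a short hard-coded bytes value as a stand-in for what simple_get returns (resp.content is bytes). The sample HTML is an assumption made up for illustration, not the real page. Calling str() on bytes produces its repr, so every newline turns into the two characters backslash and n:

```python
# Hypothetical stand-in for resp.content; requests returns the body as bytes.
raw_html = b"<html>\n<body>Hi</body>\n</html>"

# str() on a bytes object gives its repr, e.g. "b'<html>\\n...'",
# so the text holds a backslash followed by 'n', not a real newline.
tmp_html = str(raw_html)
assert "\n" not in tmp_html   # no real newline characters left to replace
assert r"\n" in tmp_html      # but the two-character sequence is there

# The raw string and the escaped string are the same two characters.
assert r"\n" == "\\n"

cleaned = tmp_html.replace("\\n", "")
print(cleaned)  # b'<html><body>Hi</body></html>'
```

Note that the result still carries the b'...' wrapper from the repr, which is a hint that converting with str() may not be what you want in the first place.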

Huang_d


Márcio Coelho · Answer 3 · 2018-09-18T16:47:28.563

I believe you need to add another backslash "\" before \n to escape the backslash, so that you search for the literal string \n rather than a newline.

Quick example:

string = '\\n foo'
print(string.replace('\n', ''))

Returns:

\n foo

While:

print(string.replace('\\n', ''))

Returns just (the leading space is kept):

 foo
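An alternative that may fit the question better is to decode the bytes instead of calling str() on them, so that real newline characters survive and replace("\n", "") works as expected. This is only a sketch: the sample bytes are made up, and it assumes the page is UTF-8 encoded (requests also exposes an already-decoded body as resp.text).

```python
# Hypothetical stand-in for resp.content, including a non-ASCII character.
raw_html = b"<p>caf\xc3\xa9</p>\n<p>two</p>\r\n"

# Decode bytes to text; UTF-8 is an assumption about the page encoding.
text = raw_html.decode("utf-8")

# Now the newlines are real characters, so a plain replace removes them.
no_newlines = text.replace("\r", "").replace("\n", "")
print(no_newlines)  # <p>café</p><p>two</p>
```

Decoding also avoids the b'...' wrapper and keeps non-ASCII characters readable instead of showing them as \x escapes.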

edited Sep 18 '18 at 16:47

answered Sep 18 '18 at 16:42

Márcio Coelho


Karn Kumar · Answer 4 · 2018-09-18T17:33:43.600

It should be pretty straightforward: use rstrip to chop off the trailing \n character from tmpHtmlB.

>>> tmpHtmlB = "my string\n"
>>> tmpHtmlB.rstrip()
'my string'

In your case it should be:

tmpHtmlB = tmpHtml.rstrip()

Even if there are multiple trailing newline characters, you can use it as follows: the canonical way to strip end-of-line (EOL) characters is the string rstrip() method, which removes any trailing \r or \n.

\r\n - on a Windows computer
\r - on classic Mac OS (before OS X)
\n - on Linux and macOS

>>> tmpHtmlB = "Test String\n\n\n"
>>> tmpHtmlB.rstrip("\r\n")
'Test String'

OR

>>> tmpHtmlB.rstrip("\n")
'Test String'
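One caveat worth checking against your own data: rstrip only touches the end of the string, so newlines in the middle of an HTML document stay where they are. The sketch below, using a made-up HTML string, contrasts rstrip with joining the result of splitlines(), which drops every line break regardless of platform convention.

```python
# Hypothetical multi-line HTML text (already decoded to str).
html = "<html>\n<body>\nHi\n</body>\n</html>\n\n"

# rstrip removes only the trailing EOL characters.
trailing_only = html.rstrip("\r\n")
print(repr(trailing_only))  # '<html>\n<body>\nHi\n</body>\n</html>'

# splitlines() splits on \n, \r\n and \r; joining removes them all.
all_removed = "".join(html.splitlines())
print(all_removed)  # <html><body>Hi</body></html>
```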
