Beautiful Soup lxml Character Encoding Issue

Question

I'm trying to parse a web page that contains non-printable characters and write the result to a file in Python. I'm using Python 2.7 with requests and Beautiful Soup.

I get the page with requests and parse it with the following:

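import bs4
import requests

# url, data, and recon are assumed to be defined elsewhere in the script
for option in recon:
    data['opts'] = '/c' + option
    print "Getting: ",
    print option
    r = requests.post(url, data)
    print r.content  # raw response bytes; these display correctly
    page = bs4.BeautifulSoup(r.content, "lxml", from_encoding='utf-8')
    print page
    tag = page.pre.contents  # list of children of the first pre tag
    print tag[0]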

When testing, print r.content shows the page properly in all its unformatted glory. The page is a .cfm, and the text I'm looking for falls between "pre" tags. After parsing, though, Beautiful Soup interprets some of the non-printable characters as "br" tags, so tag ends up as a list of 2 items instead of a single item holding all the text between the pre tags. Is there a way to either get just the text between the pre tags with requests alone, or to do something differently with Beautiful Soup so that it doesn't misinterpret the characters?
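To make the first option concrete, here is a minimal sketch of pulling the raw bytes between the pre tags straight out of the response body with a regular expression, skipping the HTML parser entirely. The function name extract_pre and the sample bytes are made up for illustration, and it assumes the page has a single, non-nested pre block:

```python
import re

# Bytes pattern, so the response is never decoded or re-interpreted.
# DOTALL lets "." span newlines; IGNORECASE tolerates <PRE>.
PRE_RE = re.compile(br"<pre\b[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)

def extract_pre(raw_html):
    """Return the raw bytes between the first pair of pre tags, or None."""
    match = PRE_RE.search(raw_html)
    return match.group(1) if match else None

# Hypothetical sample standing in for r.content
sample = b"<html><body><pre>line one\x0b\x0cline two\r\n</pre></body></html>"
print(repr(extract_pre(sample)))
```

In the loop above this would be called as extract_pre(r.content). A regex is fragile for general HTML, so this is only reasonable when the page layout is known and simple.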

I've read through the following, plus the requests and Beautiful Soup docs, trying to figure it out, but have had no luck so far:

Joel on Software - Character Sets

SO utf-8 vs unicode

SO Getting text between tags

score 0 · Answer 1 · answered Jun 20 '17 at 00:05

I was overthinking the problem. I just Base64-encoded the data with certutil on Windows before transfer, removed the first and last lines (the header and footer that certutil adds), and then decoded it on the far side.
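The decoding step might look roughly like the sketch below. It assumes the received text still has certutil's header and footer lines attached and drops them on the receiving side. The answer does not say where the lines were removed, so that placement is an assumption. The function name decode_certutil_payload and the sample bytes are hypothetical.

```python
import base64

def decode_certutil_payload(text):
    """Decode Base64 text after dropping its first and last lines."""
    lines = text.strip().splitlines()
    # Skip the header (first) and footer (last) line, then rejoin the
    # wrapped Base64 body into a single string.
    body = "".join(line.strip() for line in lines[1:-1])
    return base64.b64decode(body)

# Round trip with made-up binary data that includes non-printable bytes
original = b"\x00\x01binary\x0b\x0cdata\xff"
encoded = base64.b64encode(original).decode("ascii")
wrapped = ("-----BEGIN CERTIFICATE-----\n" + encoded +
           "\n-----END CERTIFICATE-----\n")
print(decode_certutil_payload(wrapped) == original)
```

Because Base64 output is plain ASCII, the non-printable bytes never pass through the HTML parser in raw form, which is why this sidesteps the original problem.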

gr0k
