I'm trying to parse a web page that contains non-printable characters and write the result to a file in Python. I'm using Python 2.7 with requests and Beautiful Soup.
I get the page with requests and parse it with the following:
for option in recon:
    data['opts'] = '/c' + option
    print "Getting: ",
    print option
    r = requests.post(url, data)
    print r.content
    page = bs4.BeautifulSoup(r.content, "lxml", from_encoding='utf-8')
    print page
    tag = page.pre.contents
    print tag[0]
When testing, print r.content shows the page properly in all its unformatted glory. The page is a .cfm, and the text I'm looking for falls between "pre" tags. After parsing, though, Beautiful Soup interprets some of the non-printable characters as "br" tags, so tag ends up as a list of 2 items instead of a single string holding all the text between the pre tags. Is there a way to either get just the text between the pre tags with requests alone, or do something differently with Beautiful Soup so that it does not misinterpret the characters?
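To make the first option concrete, here is a minimal sketch of pulling the text out of the raw response without any HTML parser. It assumes the page has a single pre block and that the tags themselves arrive intact; `sample` is a made-up stand-in for r.content and `extract_pre` is a hypothetical helper name, neither comes from the real page.

```python
import re

# Hypothetical stand-in for r.content (assumption): one <pre> block
# containing a vertical tab and a CRLF among the text.
sample = b"<html><body><pre>line one\x0bline two\r\nline three</pre></body></html>"


def extract_pre(raw):
    """Return the bytes between the first <pre ...> and </pre>, or None."""
    # DOTALL lets '.' span newlines; the non-greedy group stops at the
    # first closing tag, so nothing after </pre> is captured.
    match = re.search(br"<pre\b[^>]*>(.*?)</pre>", raw, re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else None


text = extract_pre(sample)
```

The regex treats the body as opaque bytes, so control characters are kept exactly as sent. On the Beautiful Soup side, page.pre.get_text() joins every text node inside the tag and might also sidestep the split list, though that is untested against this page.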
I've read through the following while trying to figure it out, plus the requests and Beautiful Soup docs, but have had no luck so far: