
I'm currently learning Python 3. I am scraping a site for some data, which works fine, but when it comes to printing out the p tags I just can't get it to work as I expect.

from urllib import request
from bs4 import BeautifulSoup

# Placeholder URL; urlopen needs a scheme such as http://
data = request.urlopen('http://www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style': 'padding-left: 10px;'})
dialog = stat.findChildren('p')

childlist = []
for child in dialog:
    childtext = child.get_text()
    # have tried child.string as well (exactly the same result)
    childlist.append(childtext.encode('utf-8', 'ignore'))
    # have tried str(childtext.encode('utf-8', 'ignore')) too

print(childlist)

That all works, but the output is printed as bytes:

b'This is a ptag.string'
b'\xc2\xa0' (probably &nbsp;)
b'this is anotherone'

Real sample text, ASCII-encoded:

b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"

Note that Announcement is p and the rest is 'strong' under a p tag.

The same sample, UTF-8 encoded:

b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "

What I want to get:

"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"

As you can see, the non-ASCII characters are stripped with "ascii", but since some of them are &nbsp;, that destroys some line breaks, and I have yet to figure out how to print those correctly. Also, the b's are still there!

I really can't figure out how to remove the b's and encode or decode properly. I have tried every "solution" I could find on Google.

The HTML content is UTF-8 encoded.

I would rather not change the full data before processing, because that would mess up my other work and I don't think it is needed.

Prettify does not work.

Any suggestions?

taleinat

1 Answer


First, you're getting output of the form b'stuff' because you are calling .encode(), which returns a bytes object. If you want to print strings for reading, keep them as strings!
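A minimal sketch of the difference, using a made-up string in place of the scraped text (the sample value and variable names are assumptions for illustration, not taken from the real page):

```python
# Illustrative value standing in for child.get_text()
childtext = "Announcement\xa0\u2013\xa0Firefox users"

as_bytes = childtext.encode("utf-8")  # a bytes object; its repr starts with b'
as_text = as_bytes.decode("utf-8")    # decoding gives back the original str

print(type(as_bytes).__name__)  # bytes
print(type(as_text).__name__)   # str
print(as_bytes)                 # shows the b'...' form seen in the question

# Appending the str itself keeps b'...' items out of the list
childlist = [childtext]
```

In other words, dropping the .encode() call in the loop should be enough to get rid of the b prefixes.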

As a guess, I assume you want to print strings from the HTML nicely, pretty much as they would be seen in a browser. For that, you need to unescape the HTML entities in the string, as described in this SO answer, which for Python 3.4 and later means:

import html
html.unescape(childtext)

Among other things, this will convert any &nbsp; sequences in the HTML string into '\xa0' characters, which are printed as spaces. However, if you want to break lines on these characters despite &nbsp; literally meaning "non-breaking space", you'll have to replace those with actual spaces before printing, e.g. using x.replace('\xa0', ' ').
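As a rough illustration of both steps, here is a small sketch. The sample string is invented for the example and only resembles the text in the question; html.unescape and str.replace are the only library calls involved.

```python
import html

# Invented sample containing HTML entities, as raw markup might
raw = "Announcement&nbsp;&ndash;&nbsp;Firefox users may encounter browser warnings"

# &nbsp; becomes '\xa0' and &ndash; becomes '\u2013'
unescaped = html.unescape(raw)

# Swap non-breaking spaces for ordinary spaces so lines can wrap there
spaced = unescaped.replace("\xa0", " ")

# ascii() makes the otherwise invisible characters visible when inspecting
print(ascii(unescaped))
print(ascii(spaced))
```

If the two parts should end up as separate list items, splitting the result on the dash separator is one option, assuming that separator is used consistently on the page.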
