
Looking for some help. I am working on a project scraping specific Craigslist posts using Beautiful Soup in Python. I can successfully display emojis found within the post title but have been unsuccessful within the post body. I've tried different variations but nothing has worked so far. Any help would be appreciated.

Code:

f = open("clcondensed.txt", "w")
html2 = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
soup = BeautifulSoup(html2.content,"html.parser")
#Post Title 
title = soup.find(id="titletextonly")       
title1 = soup.title.string.encode("ascii","xmlcharrefreplace")
f.write(title1)
#Post Body  
body = soup.find(id="postingbody")          
body = str(body)
body = body.encode("ascii","xmlcharrefreplace")
f.write(body)

Error received from the body:

'ascii' codec can't decode byte 0xef in position 273: ordinal not in range(128)
Phil21

1 Answer

You should convert the body with unicode() rather than str():

body = unicode(body)

Please refer to the NavigableString section of the Beautiful Soup documentation.


Update:

Sorry for the hasty answer; it was not quite right.

Here you should use the lxml parser instead of html.parser, because html.parser does not handle emoji written as NCRs (Numeric Character References) well.

In my test, when the decimal value of an NCR emoji is greater than 65535, such as the 🚢 emoji in your HTML, html.parser decodes it to the wrong character, \ufffd, instead of u"\U0001F6A2". I cannot find an exact Beautiful Soup reference for this, but the lxml parser handles it correctly.
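To make the 65535 threshold concrete, here is a small sketch using only the standard library. It is written for Python 3 (an assumption; the code in this thread is Python 2) and only illustrates what an NCR for the ship emoji looks like; it does not reproduce the parser behaviour itself.

```python
import html

# The ship emoji from the post, written as an explicit code point.
ship = "\U0001F6A2"

# Its code point is above 65535 (0xFFFF), so it lies outside the
# Basic Multilingual Plane.
code_point = ord(ship)
print(code_point)           # 128674
print(code_point > 0xFFFF)  # True

# xmlcharrefreplace turns it into a decimal NCR, which is the form
# the parser has to decode again when it reads the page.
ncr = ship.encode("ascii", "xmlcharrefreplace")
print(ncr)                  # b'&#128674;'

# Decoding the NCR correctly should give back the original character,
# not the replacement character \ufffd.
decoded = html.unescape(ncr.decode("ascii"))
print(decoded == ship)      # True
```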

Below is the tested code:

import requests
from bs4 import BeautifulSoup
f = open("clcondensed.txt", "w")
html = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
soup = BeautifulSoup(html.content, "lxml")
#Post Title
title = soup.find(id="titletextonly")
title = unicode(title)
f.write(title.encode('utf-8'))
#Post Body
body = soup.find(id="postingbody")
body = unicode(body)
f.write(body.encode('utf-8'))
f.close()
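As a variation on the file-writing step only, the sketch below opens the file with an explicit UTF-8 encoding so that each write does not need its own .encode('utf-8') call. The sample text and the temporary path are made up for illustration, and no page is fetched or parsed here. io.open accepts the encoding argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical text standing in for the scraped title or body.
text = u"Boat for sale \U0001F6A2"

# Write to a temporary directory so the example leaves no file behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "clcondensed.txt")

# The file object encodes to UTF-8 on write, so unicode text can be
# passed in directly.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read it back with the same encoding to check the emoji survived.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == text)
```

The with block also closes the file on exit, so no separate f.close() is needed.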

See the lxml documentation on entity handling for more options.

If lxml is not installed, see the lxml installation instructions.

Hope this helps.

Fogmoon