Python Unicode Not Parsing Apostrophes

Question

I'm trying to parse a title of the following NYT article: https://www.nytimes.com/2018/01/14/us/politics/david-perdue-trump-shithole.html

The title I would like my code to parse is "Hopes Dim for DACA Deal as Lawmakers Battle Over Trump’s Immigration Remarks - The New York Times." I get this result when I run print soup.html.head.title in the debugger session started by the code below. Other than capturing stdout in a variable (which seems roundabout), is there a smarter way to get the text I want?

Alternate Try #1

(Pdb) str(soup.html.head.title)
'<title>Hopes Dim for DACA Deal as Lawmakers Battle Over Trump\xe2\x80\x99s Immigration Remarks - The New York Times</title>'

Alternate Try #2

(Pdb) soup.html.head.title.encode('utf-8')
'<title>Hopes Dim for DACA Deal as Lawmakers Battle Over Trump\xe2\x80\x99s Immigration Remarks - The New York Times</title>'

Alternate Try #3

(Pdb) soup.html.head.title.encode('ascii')
'<title>Hopes Dim for DACA Deal as Lawmakers Battle Over Trump&#8217;s Immigration Remarks - The New York Times</title>'

Code:

from __future__ import division

import regex as re
import string
import urllib2
import pdb
from collections import Counter

from bs4 import BeautifulSoup
from cookielib import CookieJar

PARSER_TYPE = 'html.parser'

class NYT(object):
    def __init__(self, url, title='test-title'):
        self.url = url
        self.title = get_title(url)

def get_title(url):
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(url)
    soup = BeautifulSoup(p.read(), PARSER_TYPE)
    title = soup.html.head.title.string
    pdb.set_trace() # trying a few different things here
    title = re.sub(r'[^\x00-\x7F]+',"", title).replace(" - The New York Times", "")
    return title

The question is, how can I get the get_title method to return the string I want, "Hopes Dim for DACA Deal as Lawmakers Battle Over Trump’s Immigration Remarks - The New York Times." – stk1234, Jan 20 '18 at 13:21

score 0 · Answer 1 · answered Jan 15 '18 at 18:04

This should solve your problem.

from bs4 import BeautifulSoup
import requests
url = 'https://www.nytimes.com/2018/01/14/us/politics/david-perdue-trump-shithole.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.string)

Maybe this article will shed some light as to why that's happening: Python Requests and Unicode
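
As far as I can tell, the three attempts in the question all show the same character, U+2019 (RIGHT SINGLE QUOTATION MARK), just in different representations: the debugger displays the escaped form of a byte string, while print writes the decoded text. Here is a minimal sketch of that relationship, written for Python 3 (the question's code is Python 2, so this is an assumption about what you are willing to run); the names raw and text are only illustrative.

```python
import html

# UTF-8 bytes as they appear in "Alternate Try #1" and "#2".
raw = b"Trump\xe2\x80\x99s"

# Decoding the bytes gives the text that print() displays.
text = raw.decode("utf-8")

print(raw)          # escaped byte form, as a debugger shows it
print(ascii(text))  # same text with the non-ASCII character escaped

# "Alternate Try #3" shows the same character as an HTML character reference.
print(html.unescape("Trump&#8217;s") == text)

# Position 5 is U+2019, RIGHT SINGLE QUOTATION MARK.
print(text[5] == "\u2019")
```

So nothing is lost in parsing; the escapes are only how the byte string is displayed.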

– Chris, answered Jan 15 '18 at 18:04

I need the method to return the string. print(soup.title.string) yields the correct string in my console, but I can't return a print statement. I get a syntax error. – stk1234 Jan 20 '18 at 13:24
Maybe try returning it without the print()? – Chris Jan 25 '18 at 13:34
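
To make the comment above concrete: the regex in the question's get_title, [^\x00-\x7F]+, removes every non-ASCII character, so it would also remove the apostrophe. Below is a rough sketch (Python 3 assumed) of returning the title with only the site suffix removed. The helper names clean_title and strip_non_ascii and the SUFFIX constant are hypothetical, made up for this illustration.

```python
import re

# Hypothetical constant: the site suffix seen in the question's title.
SUFFIX = " - The New York Times"


def clean_title(raw_title, suffix=SUFFIX):
    """Return the title without the trailing suffix, keeping non-ASCII text."""
    title = raw_title.strip()
    if title.endswith(suffix):
        title = title[: -len(suffix)]
    return title


def strip_non_ascii(raw_title):
    """The question's regex, kept for comparison: drops non-ASCII characters."""
    return re.sub(r"[^\x00-\x7F]+", "", raw_title)


raw = ("Hopes Dim for DACA Deal as Lawmakers Battle Over "
       "Trump\u2019s Immigration Remarks - The New York Times")

print(ascii(clean_title(raw)))  # apostrophe survives as \u2019
print(strip_non_ascii(raw))     # apostrophe is gone: "Trumps"
```

Inside get_title, something like return clean_title(soup.title.string) should then hand back the text itself rather than printing it.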
