I'm trying to parse the title of the following NYT article: https://www.nytimes.com/2018/01/14/us/politics/david-perdue-trump-shithole.html
The title I would like my code to extract is "Hopes Dim for DACA Deal as Lawmakers Battle Over Trump’s Immigration Remarks - The New York Times." I get this result when I run print soup.html.head.title in the debugger session set up in the code below. Other than capturing stdout in a variable (which seems roundabout), is there a smarter way to get the text I want?
Alternate Try #1
(Pdb) str(soup.html.head.title)
'<title>Hopes Dim for DACA Deal as Lawmakers Battle Over Trump\xe2\x80\x99s Immigration Remarks - The New York Times</title>'
Alternate Try #2
(Pdb) soup.html.head.title.encode('utf-8')
'<title>Hopes Dim for DACA Deal as Lawmakers Battle Over Trump\xe2\x80\x99s Immigration Remarks - The New York Times</title>'
Alternate Try #3
(Pdb) soup.html.head.title.encode('ascii')
'<title>Hopes Dim for DACA Deal as Lawmakers Battle Over Trump&#8217;s Immigration Remarks - The New York Times</title>'
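For reference, the \xe2\x80\x99 bytes in tries #1 and #2 look like the UTF-8 encoding of the right single quotation mark (U+2019) in "Trump’s" rather than corruption. Here is a minimal sketch of that round trip. It uses a hard-coded sample string instead of the real page, and the variable names are illustrative only:

```python
# Sample text, hard-coded for illustration (not fetched from the NYT page).
title_text = u"Trump\u2019s Immigration Remarks"

# Encoding to UTF-8 turns the curly apostrophe (U+2019) into three bytes.
title_bytes = title_text.encode("utf-8")

# Decoding those bytes as UTF-8 restores the original text unchanged.
round_tripped = title_bytes.decode("utf-8")

print(repr(title_bytes))
print(round_tripped == title_text)
```

If that reading is right, the text is intact and only the byte-level representation looks odd.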
Code:
from __future__ import division
import regex as re
import string
import urllib2
import pdb
from collections import Counter
from bs4 import BeautifulSoup
from cookielib import CookieJar
PARSER_TYPE = 'html.parser'
class NYT(object):
    def __init__(self, url, title='test-title'):
        self.url = url
        self.title = get_title(url)

def get_title(url):
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(url)
    soup = BeautifulSoup(p.read(), PARSER_TYPE)
    title = soup.html.head.title.string
    pdb.set_trace() # trying a few different things here
    title = re.sub(r'[^\x00-\x7F]+', "", title).replace(" - The New York Times", "")
    return title
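As far as I can tell, soup.html.head.title.string already gives a Unicode string without the <title> tags, so what remains is plain string handling. One thing I noticed is that the regex in get_title deletes the curly apostrophe outright, which would leave "Trumps" in the result. Below is a rough sketch of an alternative cleanup step that works on a hard-coded sample string. The names clean_title, SUFFIX, and ascii_only are hypothetical and not part of BeautifulSoup or my code above:

```python
# Hypothetical helper, shown only to illustrate the cleanup step.
SUFFIX = u" - The New York Times"

# Typographic quotes mapped to their plain ASCII counterparts.
QUOTE_MAP = (
    (u"\u2018", u"'"),
    (u"\u2019", u"'"),
    (u"\u201c", u'"'),
    (u"\u201d", u'"'),
)

def clean_title(raw_title, ascii_only=False):
    """Strip the site suffix; optionally replace curly quotes with ASCII ones."""
    title = raw_title.strip()
    if title.endswith(SUFFIX):
        title = title[:-len(SUFFIX)]
    if ascii_only:
        # Replace curly quotes instead of deleting them, so "Trump's" survives.
        for curly, plain in QUOTE_MAP:
            title = title.replace(curly, plain)
    return title

# Sample input standing in for soup.html.head.title.string (an assumption).
sample = (u"Hopes Dim for DACA Deal as Lawmakers Battle Over "
          u"Trump\u2019s Immigration Remarks - The New York Times")

cleaned = clean_title(sample)
ascii_cleaned = clean_title(sample, ascii_only=True)
```

With ascii_only left off, the curly apostrophe is kept as-is. With it on, the apostrophe becomes a plain ' rather than disappearing.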