How do I deal with \u2019 and \u201c type characters when reading UTF-8 encoded JSON into Python? I've got a lot of files of this type:
{
"id": 1,
"title": "Place names consultation draws to a close",
"releaseDateTime": "2006-07-12T00:00:00+09:30",
"mainContent": "<P><FONT face=Arial>The Government’s six-week public consultation process into the naming of localities across the ... on ...</FONT></P>\n<P><FONT face=Arial>Planning ... said nearly 50 submissions have already been received, and encouraged members of the public to submit suggestions or comments before the deadline.</FONT></P>\n<P><FONT face=Arial>“There are numerous localities throughout the Northern Territory that have one or more ‘unofficial’ names, while other localities have no name at all,” ...
that I'm reading into a dictionary
import json

jrels = {}  # maps each release id to its parsed JSON record
for file in files:
    with open(file) as ff:
        j = json.load(ff)
    jrels[j['id']] = j
and then attempting to strip the HTML from it with
BeautifulSoup(jrels[id]['mainContent'], 'lxml').get_text()
but unfortunately I end up with a whole bunch of Unicode characters I'm not sure how to deal with:
u'The Government\u2019s six-week public consultation process into the naming of localities across the ... on ...\nPlanning ... said nearly 50 submissions have already been received, and encouraged members of the public to submit suggestions or comments before the deadline.\n\u201cThere are numerous localities throughout ... that have one or more \u2018unofficial\u2019 names, while other localities have no name at all,\u201d ...
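A minimal sketch that reproduces the escaped output above without the files or BeautifulSoup. It assumes Python 3, where `ascii()` gives the same escaped form that Python 2's `repr()` shows for unicode strings. The record is a made-up, shortened version of the sample, and the variable names are only for illustration.

```python
import json

# Hypothetical record, shortened from the sample above. The file on disk
# holds the real curly-quote characters, encoded as UTF-8 bytes.
raw_bytes = '{"id": 1, "mainContent": "Government’s “unofficial” names"}'.encode("utf-8")

record = json.loads(raw_bytes.decode("utf-8"))
content = record["mainContent"]

# ascii() returns an ASCII-only view with \uXXXX escapes, like the output above.
escaped_view = ascii(content)
print(escaped_view)

# Each escape stands for exactly one character in the string itself.
apostrophe_count = content.count("\u2019")
print(apostrophe_count)
```

The escaped form and the string it was produced from appear to be the same data, just displayed differently.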
Everything displays properly using print(), but I want to be able to build a time-series in Pandas using releaseDateTime and then split the mainContent text into n-grams etc.
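Roughly the downstream step I have in mind, sketched with the standard library only (the real version would use Pandas). The record, its `text` key, and the `word_ngrams` helper are hypothetical and only here for illustration. `datetime.fromisoformat` needs Python 3.7 or later.

```python
from datetime import datetime

# Hypothetical record, shortened from the sample above.
record = {
    "releaseDateTime": "2006-07-12T00:00:00+09:30",
    "text": "There are numerous localities that have no name at all",
}


def word_ngrams(text, n):
    """Return the list of n-word tuples in text (hypothetical helper)."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Parse the ISO 8601 timestamp, keeping its +09:30 offset.
released = datetime.fromisoformat(record["releaseDateTime"])
bigrams = word_ngrams(record["text"], 2)
print(released.isoformat(), bigrams[:3])
```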
Is there a way of reading properly and decoding this without going through the labour of, say, looping through thousands of files, making a dictionary of errant unicode characters using regex, and substituting them that way?
Thanks in advance.