How do I deal with characters such as \u2019 and \u201c when reading UTF-8 encoded JSON into Python? I've got a lot of files like this:

{
  "id": 1,
  "title": "Place names consultation draws to a close",
  "releaseDateTime": "2006-07-12T00:00:00+09:30",
  "mainContent": "<P><FONT face=Arial>The Government’s six-week public consultation process into the naming of localities across the ... on ...</FONT></P>\n<P><FONT face=Arial>Planning ... said nearly 50 submissions have already been received, and encouraged members of the public to submit suggestions or comments before the deadline.</FONT></P>\n<P><FONT face=Arial>“There are numerous localities throughout the Northern Territory that have one or more ‘unofficial’ names, while other localities have no name at all,” ... 

that I'm reading into a dictionary

import io
import json

jrels = {}
for path in files:
    with io.open(path, encoding='utf-8') as ff:  # decode explicitly as UTF-8
        j = json.load(ff)
        jrels[j['id']] = j

and then attempting to strip the HTML from mainContent with

BeautifulSoup(jrels[id]['mainContent'], 'lxml').get_text()

but unfortunately I end up with a whole bunch of Unicode escapes that I'm not sure how to deal with:

u'The Government\u2019s six-week public consultation process into the naming of localities across the ... on ...\nPlanning ... said nearly 50 submissions have already been received, and encouraged members of the public to submit suggestions or comments before the deadline.\n\u201cThere are numerous localities throughout ... that have one or more \u2018unofficial\u2019 names, while other localities have no name at all,\u201d ...

Everything displays properly using print(), but I want to be able to build a time series in pandas using releaseDateTime and then split the mainContent text into n-grams etc.

Is there a way of reading properly and decoding this without going through the labour of, say, looping through thousands of files, making a dictionary of errant unicode characters using regex, and substituting them that way?

Thanks in advance.

curlew77
  • It already is decoded. I'm not sure what the problem is. – Ignacio Vazquez-Abrams Sep 27 '15 at 08:31
  • Why don't the Unicode chars appear with print()? I think I want it as an ascii string to make it easier to handle using basic language processing tools. I don't want those Unicode chars to screw up the rest of my processing. Is that clearer? – curlew77 Sep 27 '15 at 08:34
  • The correct solution is for the rest of the processing to handle `unicode`s properly, so that it can handle languages that use characters outside of ASCII. – Ignacio Vazquez-Abrams Sep 27 '15 at 08:36
  • I understand that, but I'm still getting my head around how to do it, and most of the materials I'm using as reference don't seem to support `unicode`, which makes things harder. It would be simpler for me, in this particular case, if I could just translate the `unicode` chars into their `ascii` equivalents. – curlew77 Sep 27 '15 at 08:42
  • If you're using [NLTK](http://www.nltk.org), it now supports Unicode. If you're using the original NLTK book, a new Unicode version will be out next year, but in the meantime there's the [online version](http://www.nltk.org/book), which is (mostly) up to date. – PM 2Ring Sep 27 '15 at 09:52
  • Thanks, that's very helpful – curlew77 Sep 27 '15 at 09:53
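
For the ASCII fallback asked about in the comments, here is a minimal sketch of one possible approach, not a definitive recipe. It assumes Python 3 and uses only the standard library. The names `PUNCT_TO_ASCII` and `to_ascii` are made up for illustration, and the mapping covers only a handful of typographic characters, so it would need extending for other data. Note that the `\u2019` escapes in the output above are only how the interpreter displays the string; the string already holds the real characters, which is why print() shows them correctly.

```python
import unicodedata

# Hypothetical mapping from typographic punctuation to ASCII stand-ins.
# Keys are Unicode code points, as str.translate expects.
PUNCT_TO_ASCII = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
    0x2026: "...",  # horizontal ellipsis
}


def to_ascii(text):
    """Return a lossy, ASCII-only approximation of text."""
    # Replace the punctuation we know about with explicit equivalents.
    text = text.translate(PUNCT_TO_ASCII)
    # Decompose accented letters into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    # Drop anything that still has no ASCII representation.
    return text.encode("ascii", "ignore").decode("ascii")


sample = "The Government\u2019s \u201cunofficial\u201d names"
print(to_ascii(sample))
```

This conversion throws information away (any character without a mapping or an ASCII base letter is silently dropped), so it only suits cases where the downstream tools genuinely cannot handle Unicode text.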
