
I'm having a hard time with some parsing issues related to emojis. I have a JSON response requested from the Brandwatch site using urllib.(1) I then have to decode it as UTF-8; however, when I do so I get UTF-16 surrogate escapes, and json.loads cannot deal with them. (2)

I've also tried BeautifulSoup4, which works great; however, when there is a &quot; in the site result, it gets transformed into a literal ", and then json.loads cannot deal with it because it says a , is missing. After tons of searching, I gave up trying to escape that ", which would be the ideal fix.(3)

So now, I'm stuck with both "solutions/problems". Any ideas on how to proceed?

Obs: This is a program that fetches data from Brandwatch and puts it into a MySQL database, so performance is a concern here.

Obs2: pyjq is a jq binding for Python which performs the request, and I can change the opener it uses.

(1) - For the first approach, using urllib, these are the relevant parts of the code:

def downloader(url):
    return json.loads(urllib.request.urlopen(url).read().decode('utf8'))

...


parsed = pyjq.all(jqparser,url=url, vars={"today" : start_date}, opener=downloader)

Error raised:

Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***

If I print the result of urllib.request.urlopen(url).read().decode('utf8') instead of sending it to json.loads, this is what appears. These escape sequences seem to be emojis encoded as UTF-16 surrogate pairs:

"fullname":"Botinhas\uD83D\uDC62"

(2) - For the second approach, using BeautifulSoup4, here's the relevant part of the code (same as above, only the downloader function changed):

def downloader(url):
    return json.loads(BeautifulSoup(urllib.request.urlopen(url), 'lxml').get_text())

...

parsed = pyjq.all(jqparser,url=url, vars={"today" : start_date}, opener=downloader)

And this is the error raised:

Expecting ',' delimiter: line 1 column 4814765 (char 4814764)

Printing the result, the " before Diretas Já should have been escaped:

"title":"Por "Diretas Já", manifestações pelo país ocorrem em preparação ao "Ocupa Brasília" - Sindicato dos Engenheiros no Estado do Rio de Janeiro"

I've thought of running a regex over the text; however, I'm not sure whether that would be the most appropriate solution here, since performance is a concern.

(3) - [Image: part of the Brandwatch result with the &quot; problem mentioned above]

UPDATE:

As Martin stated in the comments, I ran a replace swapping &quot; for nothing (i.e. removing it). Then it raised the former problem with the emoji:

Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***

UPDATE2:

I've added this to the downloader function:

re.sub(r'\\u(d|D)([a-z|A-Z|0-9]{3})', "", urllib.request.urlopen(url).read().decode('utf-8','ignore'))
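For context, this is roughly how it sits inside the downloader now (a sketch, with the character class tightened to hex digits, which is all a \u escape can contain):

import json
import re
import urllib.request

def downloader(url):
    raw = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
    # Strip the literal \uDXXX escape sequences before parsing.
    # Note: this also removes non-surrogate escapes such as \uD7A3 (Hangul);
    # r'\\u[dD][89a-fA-F][0-9a-fA-F]{2}' would match only surrogates.
    cleaned = re.sub(r'\\u[dD][0-9a-fA-F]{3}', '', raw)
    return json.loads(cleaned)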

It solved the issue; however, I don't think it's the best way to solve it. If anybody knows a better option, please share it.

PhaSath
  • [Does this link](https://groups.google.com/forum/#!msg/beautifulsoup/hm3fh27NUkM/n1NyTqHg97kJ) help at all? Beautiful soup can't *not* convert the characters to HTML entities. Therefore you have to run a pass to turn them *back* into unicode and *then* another pass to escape the ones that need escaping. – Martin Jun 02 '17 at 16:20
  • @Martin I've made an update based on what you said. – PhaSath Jun 02 '17 at 16:37
  • Ok, have you explored the idea of using (or at least testing) the data with another program, a competitor to "beautifulSoup4"? Maybe BS4 just doesn't handle the full UTF-8 properly. Also, maybe the [UTF-8 needs a BOM](https://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom)? Does yours have a BOM, and can you remove it if it does? – Martin Jun 02 '17 at 16:57
  • I'm trying both `beautifulsoup4` and `urllib` itself. About the BOM, it doesn't have one. – PhaSath Jun 02 '17 at 17:05
  • Yeah, the BOM is not part of the issue. I just knew it was an issue that comes up in UTF-8 processing on occasion. I'm afraid I can't give you any specific help on `urllib` etc. – Martin Jun 02 '17 at 17:06
