I'm having a hard time with some parsing issues related to emojis.
I have a JSON document requested from the Brandwatch site using urllib.(1) I then have to decode it as UTF-8; however, when I do so, I get surrogates, and json.loads cannot deal with them.(2)
I've tried to use BeautifulSoup4, which works great; however, when there's a &quot; in the site result, it is transformed to ", and then json.loads cannot deal with it, because it says that a , is missing. After tons of searching, I gave up trying to escape the ", which would be the ideal solution.(3)
So now, I'm stuck with both "solutions/problems". Any ideas on how to proceed?
Note: This is a program that fetches data from Brandwatch and puts it into a MySQL database, so performance is an issue here.
Note 2: PyJQ is jq for Python; it does the request, and I can change the opener.
(1) - For the first approach, using urllib, these are the relevant parts of the code:
def downloader(url):
    return json.loads(urllib.request.urlopen(url).read().decode('utf8'))
...
parsed = pyjq.all(jqparser, url=url, vars={"today": start_date}, opener=downloader)
Error raised:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
If I print the result of urllib.request.urlopen(url).read().decode('utf8') instead of sending it to json.loads, this is what appears. These escapes seem to be emojis:
"fullname":"Botinhas\uD83D\uDC62"
(2) - For the second approach, using BeautifulSoup4, here's the relevant part of the code (same as above, only the downloader function changed):
def downloader(url):
    return json.loads(BeautifulSoup(urllib.request.urlopen(url), 'lxml').get_text())
...
parsed = pyjq.all(jqparser, url=url, vars={"today": start_date}, opener=downloader)
And this is the error raised:
Expecting ',' delimiter: line 1 column 4814765 (char 4814764)
Printing the text shows that the " before Diretas Já should be escaped:
"title":"Por "Diretas Já", manifestações pelo país ocorrem em preparação ao "Ocupa Brasília" - Sindicato dos Engenheiros no Estado do Rio de Janeiro"
I've thought of running a regex; however, I'm not sure it would be the most appropriate solution in this case, as performance is an issue.
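A minimal sketch of why the HTML-parser route breaks the JSON, and one regex-free way around it. It assumes the raw response carries quotes inside string values as &quot; entities; the sample body below is made up from the title above, and html.unescape stands in for what an HTML parser's text extraction roughly does.

```python
import html
import json

# Hypothetical raw body, assuming quotes inside values arrive as &quot;
raw = '{"title": "Por &quot;Diretas Já&quot;, manifestações pelo país"}'

# Unescaping before parsing turns &quot; into a bare ", which ends the
# JSON string early and makes the parser expect a ',' delimiter.
try:
    json.loads(html.unescape(raw))
    unescape_first_ok = True
except json.JSONDecodeError:
    unescape_first_ok = False

# Parsing first keeps the JSON valid; entities can be decoded per value afterwards.
parsed = json.loads(raw)
title = html.unescape(parsed["title"])
```

If the assumption holds, decoding entities after json.loads avoids having to escape anything by hand.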
(3) - Part of the Brandwatch result with the " problem mentioned above
UPDATE:
As Martin suggested in the comments, I ran a replace swapping &quot; for nothing. That brought back the earlier emoji problem:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
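The surrogate error can be reproduced without Brandwatch or pyjq. The sketch below assumes Python 3, and the second sample value is hypothetical: a complete pair such as \uD83D\uDC62 is merged by json.loads into a single character, so the encode error would only be expected when half of a pair is missing (for instance in a truncated name). Whether the real data contains such truncated values is an assumption, not something confirmed here.

```python
import json

# A complete surrogate pair: json.loads merges it into one code point.
paired = json.loads('{"fullname": "Botinhas\\uD83D\\uDC62"}')["fullname"]
paired_bytes = paired.encode("utf-8")  # encodes without error

# A lone high surrogate (hypothetical truncated value) stays a surrogate.
lone = json.loads('{"fullname": "Botinhas\\uD83D"}')["fullname"]
try:
    lone.encode("utf-8")
    lone_encodes = True
except UnicodeEncodeError:
    # Same failure as in the traceback: "surrogates not allowed"
    lone_encodes = False
```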
UPDATE2:
I've added this to the downloader function:
re.sub(r'\\u(d|D)([a-z|A-Z|0-9]{3})', "", urllib.request.urlopen(url).read().decode('utf-8','ignore'))
It solved the issue; however, I don't think it's the best way to solve it. If anybody knows a better option, please share it.
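One possible alternative to the regex, as a sketch only: parse the JSON first, so that valid emoji pairs are kept, and then drop only the unpaired surrogates from each string. The helper name strip_lone_surrogates and the sample data are hypothetical, and its cost on a multi-megabyte response has not been measured.

```python
import json


def strip_lone_surrogates(obj):
    """Recursively drop unpaired surrogates from every string in parsed JSON."""
    if isinstance(obj, str):
        # errors='ignore' skips code points that UTF-8 cannot encode,
        # which for a Python str means lone surrogates.
        return obj.encode("utf-8", "ignore").decode("utf-8")
    if isinstance(obj, list):
        return [strip_lone_surrogates(item) for item in obj]
    if isinstance(obj, dict):
        return {
            strip_lone_surrogates(key): strip_lone_surrogates(value)
            for key, value in obj.items()
        }
    return obj  # numbers, booleans and None pass through unchanged


# Hypothetical sample: one truncated emoji, one complete emoji.
sample = json.loads(
    '{"fullname": "Botinhas\\uD83D", "other": "Botinhas\\uD83D\\uDC62", "n": [1, "x\\uDC62y"]}'
)
cleaned = strip_lone_surrogates(sample)
```

In the downloader, this would be applied to the result of json.loads before returning it to pyjq, so complete emojis survive instead of being removed along with the broken ones.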