
From an ajax call, I got back something like this:

{"d":"\u003cdiv class=\"popup_title\"\u003eBENTELER Autótechnika Kft.\u003c/div\u003e\u003cdiv style=\"font-size:10pt;font-weight:bold;\"\u003e8060 Mór, Akai út 5.

I'd like to convert it to a "usable" format, so that \u003c simply becomes a < character.

The header of the ajax call says that this is ISO-8859-2 encoded (content-type: text/plain; charset=iso-8859-2), but I'm unsure.

I tried to use iconv with many options, but no luck.

What is interesting is that for instance this site:

https://www.online-toolz.com/tools/text-unicode-entities-convertor.php

does the trick without any extra settings; I just can't find out what the "from encoding" should be.

I'd be happy to use iconv.

Benjamin W.
user2194805
    You shouldn't need to guess. Does the API call return JSON or not? If so, you can file a non-compliance report as RFCs say JSON should be UTF-8 encoded. In the meantime, you can convert from what the header says the encoding is. Then use a JSON library. Writing your own code to just parse JSON's UTF-16 code unit escapes doesn't get the whole job done. – Tom Blodget Jun 01 '19 at 22:00

3 Answers


The character set is simply ASCII. These are escape codes used e.g. by JavaScript (and Python).

If the value you get from the AJAX call is valid JSON (as presumably it will be), use a JSON tool to extract it.

bash$ jq -r .d <<\:
{"d":"\u003cdiv class=\"popup_title\"\u003eBENTELER Autótechnika Kft.\u003c/div\u003e\u003cdiv style=\"font-size:10pt;font-weight:bold;\"\u003e8060 Mór, Akai út 5."}
:
<div class="popup_title">BENTELER Autótechnika Kft.</div><div style="font-size:10pt;font-weight:bold;">8060 Mór, Akai út 5.
tripleee
  • Thanks, this works fine; however, redirecting jq's output into a file gives a weird result (https://github.com/stedolan/jq/issues/1110). Of course, it's solvable. – user2194805 Jun 01 '19 at 16:43
  • Sorry, typo fixed; `.d` not `-d` – tripleee Jun 01 '19 at 17:08
  • The accented characters are of course not ASCII - since they are displaying correctly for you, they are probably in your system's default encoding, which I'm guessing is UTF-8 if you are on a sensible platform, and ridiculously unlikely if you are on Windows. – tripleee Jun 02 '19 at 07:53

The easiest way to do this is with a JSON parser for your language of choice, which will convert it into an appropriate data structure and unescape it. What you're seeing is a Unicode escape representing U+003C, which is the < character. JSON serializers often escape angle brackets since they have special meaning in HTML and XML, and escaping them means that the JSON can be literally inserted into those types of documents.
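As a minimal sketch of the parser approach, here is how it could look in Python with the standard json module. The payload is a shortened sample based on the question's response, with the closing quote and brace added so that it is complete JSON; the variable names are only illustrative.

```python
import json

# Shortened sample based on the question's payload; the closing quote and
# brace are added here because the excerpt in the question is truncated.
raw = r'{"d":"\u003cdiv class=\"popup_title\"\u003eBENTELER Autótechnika Kft.\u003c/div\u003e"}'

# Parsing unescapes both the \uXXXX sequences and the \" quotes.
html = json.loads(raw)["d"]
print(html)

# A parser also joins UTF-16 surrogate pairs into a single code point,
# which a simple per-escape substitution would not do.
emoji = json.loads(r'"\ud83d\ude00"')
```

The second call illustrates why a real parser is preferable to a hand-written substitution: characters outside the Basic Multilingual Plane are written as two consecutive \u escapes.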

Alternately, if you want to unescape them from the command line without parsing the JSON, you can pipe it to Perl or Ruby to do so, like this:

perl -CSD -pe 's/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge'

or

ruby -pe '$_.gsub!(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }'

Note that the encoding that you get from the server is likely a red herring. JSON is required to be in Unicode, and it's likely that the server is just misconfigured. If you're certain the data is actually in ISO-8859-2, in violation of the spec, you can fix it by piping the output of the following command to one of the perl or ruby commands above:

iconv -f ISO-8859-2 -t UTF-8
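If the response really is ISO-8859-2, the same two steps (convert the bytes first, then unescape) could be sketched in Python as follows. The function name decode_response and the sample bytes are hypothetical, made up for illustration; only the charset name comes from the question's Content-Type header.

```python
import json

def decode_response(body: bytes, charset: str = "iso-8859-2") -> str:
    """Decode the raw bytes with the declared charset, then parse the JSON
    and return the value stored under the "d" key."""
    text = body.decode(charset)    # same job as: iconv -f ISO-8859-2 -t UTF-8
    return json.loads(text)["d"]   # unescapes \u003c, \u003e and \"

# Hypothetical sample: what such a response might look like on the wire
# if the server really sent ISO-8859-2 bytes.
sample = '{"d":"\\u003cdiv\\u003e8060 Mór, Akai út 5.\\u003c/div\\u003e"}'.encode("iso-8859-2")
print(decode_response(sample))
```

If the server is merely mislabelling UTF-8 data, passing charset="utf-8" instead would be the matching choice.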
bk2204

You could use the %b formatting directive of Bash's printf:

$ encoded='{"d":"\u003cdiv class=\"popup_title\"\u003eBENTELER Autótechnika Kft.\u003c/div\u003e\u003cdiv style=\"font-size:10pt;font-weight:bold;\"\u003e8060 Mór, Akai út 5.'
$ printf -v decoded '%b\n' "$encoded"
$ printf '%s\n' "$decoded"
{"d":"<div class=\"popup_title\">BENTELER Autótechnika Kft.</div><div style=\"font-size:10pt;font-weight:bold;\">8060 Mór, Akai út 5.

From the manual:

%b
Causes printf to expand backslash escape sequences in the corresponding argument in the same way as echo -e (see Bash Builtins).


As Charles points out in his comment, %b as such isn't limited to Bash's printf, but required by POSIX; interpretation of \uHHHH escapes, on the other hand, only happens in Bash, as described in the escape sequences for echo.

Benjamin W.
    `printf %b` [is POSIX-specified](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html), not a bashism (though `\u003c` support *is* a bashism; and obviously, the POSIX definition doesn't refer to the specification-violating usage of `echo -e` to do anything other than print `-e` on output). – Charles Duffy Jun 01 '19 at 19:12