Detecting and parsing escape character "\" from a JSON File?

Question

I am having a problem with data from a JSON file. I am using the following link from Google:

http://www.google.com/finance/company_news?q=AAPL&output=json

My problem occurs when I parse the data and put it on screen. The data is not being decoded properly for some reason.

The raw data:

 1.) one which must have set many of the company\x26#39;s board on the edge of their
 2.) Making Less Money From Next \x3cb\x3e...\x3c/b\x3e

When I bring in the data I do the following:

// "is" (InputStream) and "json" (String) are declared elsewhere.
DefaultHttpClient httpClient = new DefaultHttpClient();
HttpPost httpPost = new HttpPost(url);
HttpResponse httpResponse = httpClient.execute(httpPost);
HttpEntity httpEntity = httpResponse.getEntity();
is = httpEntity.getContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(
        is, "iso-8859-1"), 8);
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
    // Re-append the line break that readLine() strips.
    sb.append(line).append("\n");
}
is.close();
json = sb.toString();

The output I receive, using org.json to extract the data from the JSON file, is the following (notice the lack of backslashes):

1.)one which must have set many of the companyx26#39;s board on the edge of their
2.)Making Less Money From Next x3cbx3e...x3c/bx3e

My current method for handling the first problem is this:

JSONRowData.setJTitle((Html.fromHtml((article.getString(TAG_TITLE).replaceAll("x26", "&")))).toString());

The second one escapes me though (no pun intended).

I assume this doesn't work because the backslash is used for escape characters. I've tried many different methods of reading the data in, but I've had no luck. Is there a way I can import the data so that this is handled without using regular expressions?

Solution

Our nemesis today: "\x26" -- ASCII (in Hexadecimal Notation)

Read the raw data into a char array (the Apache Commons IO library is a convenient way to do this). Then loop over the array looking for "\". On a hit, check whether the next array position holds "x". If it does, take the following two characters: these are the hex digits of an ASCII code. Convert the hex value to an integer, cast it to a char, and append that character to a StringBuilder.

If there is no match (with "\"), append the char to the StringBuilder unchanged. Once the loop finishes, call .toString() to turn the result into a String.

From there, the data may still contain some HTML remnants (&#39; and/or <b> in this case). Using Html.fromHtml() took care of these.
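
The loop described above might be sketched roughly as follows. This is only an illustration, not the asker's actual code: the class name HexEscapeDecoder and the method decode are made up for this example, and it assumes the input is already in memory as a String. It also ignores the case where the backslash is itself escaped, and if it is run on the whole raw response before JSON parsing, a decoded quote or backslash could break the surrounding JSON string.

```java
final class HexEscapeDecoder {
    private HexEscapeDecoder() {}

    /** Replaces each \xHH sequence (HH = two hex digits) with the character whose code is HH. */
    static String decode(String raw) {
        char[] chars = raw.toCharArray();
        StringBuilder out = new StringBuilder(chars.length);
        int i = 0;
        while (i < chars.length) {
            // A candidate needs four characters: backslash, 'x', and two hex digits.
            if (chars[i] == '\\' && i + 3 < chars.length && chars[i + 1] == 'x') {
                int hi = Character.digit(chars[i + 2], 16);
                int lo = Character.digit(chars[i + 3], 16);
                if (hi >= 0 && lo >= 0) {
                    // Combine the two digits into one code and cast it to a char.
                    out.append((char) (hi * 16 + lo));
                    i += 4;
                    continue;
                }
            }
            // No match: copy the character through unchanged.
            out.append(chars[i]);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "\\x3c" is the four characters \x3c at run time.
        System.out.println(decode("Making Less Money From Next \\x3cb\\x3e...\\x3c/b\\x3e"));
    }
}
```

On Android, the decoded string would then be handed to Html.fromHtml() as described above to deal with the remaining entities and tags.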

See [this](http://stackoverflow.com/a/8715600/645270). And, have you tried escaping the escape char? (as suggested in the second answer) — keyser, Jun 13 '12 at 18:26
@Keyser I did notice the link before but it doesn't provide a viable solution. i could escape the escape but wouldn't that require the use of regex to replace "\" with "\\"? — wdziemia, Jun 13 '12 at 18:43
Answer is below, along with the description of the method to solve this issue in the comments of the answer — wdziemia, Jun 14 '12 at 22:28
Reminds me a lot of the link :p Too bad there wasn't a better solution. — keyser, Jun 15 '12 at 06:29

score 3 · Accepted Answer · 2012-06-14T23:15:05.023

The problem here is that Google (or at least that URL) is supplying invalid JSON¹ ². The JSON library, while not rejecting the invalid JSON outright, is parsing it in a "well, let's ignore this \ nonsense and continue" manner. That is, it's not the rendering that is wrong; it is the input that is wrong.

¹ In JSON, \x is not allowed to appear in a string (except if the \ is itself escaped), because an unescaped \ can only be followed by a small set of characters: ", \, /, b, f, n, r, t, or u plus four hex digits. That set does not include x. Escapes for character codes must be written as \u1234 and not \x12.
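
To make the rule concrete, here is a minimal sketch of a check for escapes that JSON does not allow. The class JsonEscapeCheck and the method firstInvalidEscape are hypothetical names for this illustration and are not part of org.json; the input is assumed to be the text between the quotes of a single string literal.

```java
final class JsonEscapeCheck {
    private JsonEscapeCheck() {}

    // Characters that may directly follow a backslash in a JSON string (besides 'u').
    private static final String SIMPLE_ESCAPES = "\"\\/bfnrt";

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Returns the index of the first backslash starting an escape JSON forbids, or -1 if none. */
    static int firstInvalidEscape(String body) {
        int n = body.length();
        for (int i = 0; i < n; i++) {
            if (body.charAt(i) != '\\') {
                continue;
            }
            if (i + 1 >= n) {
                return i; // dangling backslash at the end
            }
            char next = body.charAt(i + 1);
            if (next == 'u') {
                // A unicode escape needs exactly four hex digits after the 'u'.
                if (i + 5 >= n) {
                    return i;
                }
                for (int k = i + 2; k <= i + 5; k++) {
                    if (!isHexDigit(body.charAt(k))) {
                        return i;
                    }
                }
                i += 5; // skip past the whole escape
            } else if (SIMPLE_ESCAPES.indexOf(next) >= 0) {
                i += 1; // skip the escaped character
            } else {
                return i; // for example a backslash followed by 'x'
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidEscape("The stock\\x26#39;s price"));
    }
}
```

Because an escaped backslash is consumed as a pair, a literal backslash followed by x26 is accepted, while a bare \x26 is reported.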

The only "fixes" I can think of are really gross hacks: i.e. read in raw text and convert \x12 to \u0012. (Actually, it's not that bad of a hack because no context-sensitive stuff needs to be taken into account; however, it should not be required! Shame on Google.)

² Extracted invalid JSON string literal:

"Apple Inc. (NASDAQ:AAPL) shares continued to lead large cap tech stocks in top performance this year. The stock\x26#39;s price showed no major move following a key event started Monday."

(To make this valid, replace \x26 with \u0026 or &.)

Happy coding and -- good luck :)

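In Java, one [untested] approach might be to use a regular expression (via String.replaceAll). Note that a literal backslash must be escaped once for the regex and again for the Java string literal, and that the two digits are hexadecimal, so \d is not enough:

inputString.replaceAll("\\\\x([0-9A-Fa-f]{2})", "\\\\u00$1")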
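
A self-contained sketch of that rewrite is below. HexEscapeRewriter and toUnicodeEscapes are names chosen for this example only. It assumes every \xHH in the raw text should become \u00HH, and it does not special-case a backslash that is itself escaped.

```java
import java.util.regex.Pattern;

final class HexEscapeRewriter {
    private HexEscapeRewriter() {}

    // Regex: one literal backslash, 'x', then exactly two hex digits (captured as group 1).
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    /** Rewrites each two-digit hex escape into the four-digit unicode form JSON accepts. */
    static String toUnicodeEscapes(String rawJson) {
        // In a replacement string, two backslashes produce one literal backslash;
        // $1 inserts the captured hex digits.
        return HEX_ESCAPE.matcher(rawJson).replaceAll("\\\\u00$1");
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeEscapes("The stock\\x26#39;s price"));
    }
}
```

The rewritten text can then be passed to the JSON parser, which decodes \u0026 to & on its own; the HTML entity that remains still needs something like Html.fromHtml().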

edited Jun 14 '12 at 23:15

answered Jun 14 '12 at 00:22

I was afraid of this, my wishful thinking always gets the better of me. I'll try to work with the raw data and maybe I can work it out from there. I could get the input as XML, but then the data is wrapped in and nested within all sorts of HTML tags and it's a mess. Thank you for the response, I'll try to get an answer out of the Google devs as well. – wdziemia Jun 14 '12 at 04:29
@wdziemia Actually, that JSON is all sorts of broken. I jumped on the broken in the question, but the keys are also *not JSON strings* and are thus invalid... looks like someone generated "JavaScript object literals" and *not* JSON. I will try not to think about it anymore, because it makes my head hurt: the service being provided by a well-established IT company (rumored to be full of really smart people) which introduced ProtocolBuffers... – Jun 14 '12 at 05:15
Got it working, thank you for the help! Read the raw data into a char array and then replaced the ASCII characters in hex notation with their respective decimal values. Then cast the decimal value to a character. Html.fromHtml() took care of any HTML entity codes/HTML tags left over. Thanks again! – wdziemia Jun 14 '12 at 22:25
@wdziemia I am glad you figured something out. I would likely try to use Strings instead of character arrays, however. I have updated my post with a small [untested] example which may work as well... – Jun 14 '12 at 23:04
This is hilarious. I just visited the link and the keys are still unquoted and probably many other issues as well. Did none of you report this? It's very easy to get JSON right. I'll report it... tomorrow.. maybe – Esailija Aug 12 '12 at 23:46
