I am trying to use a regular expression to produce this kind of string

{
 "key1"
:
value1
,
"key2"
:
"value2"
,
"arrayKey"
:
[
{
"keyA"
:
valueA
,
"keyB"
:
"valueB"
,
"keyC"
:
[
0
,
1
,
2
]
}
]
}

from the output of JSONObject.toString(), which is one long line of text in my Android Java app:

{"key1":"value1","key2":"value2","arrayKey":[{"keyA":"valueA","keyB":"valueB","keyC":[0,1,2]}]}

I found this regular expression for finding all commas.

/(,)(?=(?:[^"]|"[^"]*")*$)/

Now I need to know:

0- if this is reliable, that is, does what they say.

1- if this also works with commas inside double-quotes.

2- if this takes into account escaped double-quotes.

3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.

5- It has to be used with the multi-line flag to work with multi-line text.

6- It has to work with replaceAll().

The resulting regular expression will be used for replacing each symbol with a two-char sequence made of the symbol itself plus a \n character.

The resulting text has to be still JSON text.

Subsequent replace actions will take place also for the other symbols

: [ ] { } 

and other symbols that can be found in JSON files outside the alphanumeric sequences between quotes (I do not know if the mentioned symbols are the only ones).

Mushif Ali Nawaz
P5music
  • Well, your questions 0 to 3 already indicate that regex is not a good fit for handling JSON. I'd suggest using a proper parser instead. For example, if values could contain commas, quotes, colons etc., it can get very complicated, if not impossible, to create a regex that would fit _all_ possible variations. – Thomas Oct 23 '19 at 09:30
  • What exactly are you trying to achieve in the end? This reads like you're trying to format the JSON string into human readable form - that would need to include indentation as well. Note that the JSON libraries out there should already provide that (at least Jackson does). – Thomas Oct 23 '19 at 09:36
  • The JSON which you provided is not valid. – Mushif Ali Nawaz Oct 23 '19 at 09:36
  • Possible duplicate of [Convert JSON String to Pretty Print JSON output using Jackson](https://stackoverflow.com/questions/14515994/convert-json-string-to-pretty-print-json-output-using-jackson) – Mushif Ali Nawaz Oct 23 '19 at 09:43
  • @Thomas Please see edits. Forget symbols other than commas for now. I have to put the JSON text in the mentioned form because I am experimenting with Git merges in certain ways that my app needs. It's more than formatting in human readable form, although that would be a useful addition. Why are you saying it does not work for JSON? – P5music Oct 23 '19 at 09:52
  • @Mushif Ali Nawaz It's not duplicated because I do not use Jackson library, and the text has to be as in the question, not in the form that the library yields. – P5music Oct 23 '19 at 09:56
  • "Why are you saying it does not work for JSON?" - Well, one of the things that can cause a headache would be commas in strings. If no string value contains escaped double quotes you might be able to ignore those commas, but once there are escaped double quotes it becomes more complex (your expression wouldn't handle those). Now add potential single quotes, especially for string values. Texts now could contain escaped or unescaped double or single quotes depending on what's used to delimit the value ... that's a whole new level of complexity. – Thomas Oct 23 '19 at 12:30
  • In general, one could say that regular expressions are a good fit for [regular languages](https://en.wikipedia.org/wiki/Regular_language) (in fact "a regular language can be expressed using a regular expression"), but since [json isn't a regular language](https://cstheory.stackexchange.com/questions/3987/is-json-a-regular-language) but at least a context-free one, there's a good chance that you'll eventually run into a json that your regex isn't able to match properly. – Thomas Oct 23 '19 at 12:44
  • @Thomas So the right solution would be a custom parser that reads all the characters and performs the replace only outside a proper " " or ' ' region, I mean checking the number of non-escaped quotes, starting counting when one is found and going forward until the quotation is closed. It's not as difficult in fact. Do you confirm? Example: "dasdalkj,uouoiuu\",ohoho\"", starts at 0, ends at 27, 28 has to be replaced by , plus \n. Is it right? – P5music Oct 23 '19 at 12:55
  • Well, I'd personally use an existing parser and a custom formatter (or formatter configuration). – Thomas Oct 23 '19 at 13:04
  • @Thomas I already wrote down the parser, because I avoid using libraries, but I would like to know what you would use, please. – P5music Oct 23 '19 at 13:26
  • I'd most likely use Jackson because that's what I'm most familiar with. However, there might be better suited parsers/formatters for your needs. Why are you avoiding the use of libraries in the first place? – Thomas Oct 23 '19 at 13:52
  • @Thomas In fact I think it may make my app heavier or license-burdened (even with permissive libraries). However, I do not see the point in using a library for just a simple function. Please write all your comments into an answer and I can accept it, otherwise I will answer myself. Thank you – P5music Oct 23 '19 at 14:21

2 Answers

0

It's not that simple, but yes, if you want to do it then you need to match the characters ([,{,",',:) and replace each one with itself followed by a newline character, like:

[ should get replaced with [\n

The answer to your question is yes: it's very reliable and easy to implement, as a single line of code does it all. That's what regex is made for.

DHRUV GUPTA
  • Yes, it is what the regular expression is for. But there are some questions. – P5music Oct 23 '19 at 10:26
  • "its just a single line of code" - well it depends. Consider some json like `{ "problematic_field" : "I'll cause some \"problems\" here, just because I can, and I'll even add some escaped non-standard json just to be mean: { 'singleQuoteField' : true, 'someNestedArray': [ \"double\", 'single' ] }" }` - What would the simple regex do here? Would that fit the OP's requirements? ;) – Thomas Oct 23 '19 at 13:58
  • @Thomas you can apply escape characters filter in it. – DHRUV GUPTA Oct 24 '19 at 05:49
  • Well it's not escape _characters_ but escape _sequences_ and they are a little harder to add. Even so it surely is possible but I'd bet that if you come up with a solution that fits the example above I can come up with yet another example that breaks it again. In the end the regex will become very complex and might still not be able to handle all cases - that's what I try to point out: regex is only a suitable fit for json if you have some measure of control over the content of that json. – Thomas Oct 24 '19 at 08:27
  • No need to get sarcastic or even offensive :) - The main point is this: use a proper json parser instead of regex especially if you don't have any control over the json this needs to be applied to ("...it could be manually edited by the user"). Thus it isn't as easy as it seems. – Thomas Oct 24 '19 at 08:59
-1

0- if this is reliable, that is, does what they say.

Let's break down the expression a little:

  • (,) is a capturing group that matches a single comma
  • (?=...) would mean a positive lookahead meaning the comma would need to be followed by a match of that group's content
  • (?:...)* would be a non-capturing group that can occur 0 to many times
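  • [^"]|"[^"]*" would match either any character except a double quote ([^"]) or (|) a pair of double quotes with any characters in between except other double quotes ("[^"]*")
  • $ anchors the lookahead at the end of the input, so the whole remainder after the comma has to consist of such pieces, i.e. it has to contain an even number of double quotes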

As you can see especially the last part could make it unreliable if there are escaped double quotes in a text value, so the answer would be "this is reliable if the input is simple enough".

1- if this also works with commas inside double-quotes.

If the double quote pairs are correctly identified any commas in between would be ignored.

2- if this takes into account escaped double-quotes.

Here's one of the major problems: escaped double quotes would need to be handled. This can get quite complex if you want to handle arbitrary cases, especially if the texts could contain commas as well.
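
To make the problem concrete, here is a small sketch. The class, constant and method names are made up for illustration, and ESCAPE_AWARE is only one possible variant of the pattern (not something from the question): it treats a backslash plus the following character as one unit inside a string. It assumes double-quoted strings only and is not meant as a general JSON solution.

```java
final class EscapedQuoteDemo {
    // The pattern from the question, written as a Java string literal.
    static final String NAIVE = "(,)(?=(?:[^\"]|\"[^\"]*\")*$)";

    // Assumed variant: a string body is a run of characters that are neither
    // a quote nor a backslash, or backslash escape pairs such as \" and \\.
    static final String ESCAPE_AWARE =
            ",(?=(?:[^\"\\\\]|\"(?:[^\"\\\\]|\\\\.)*\")*$)";

    // Appends a line break after every comma the given pattern matches.
    static String breakWith(String regex, String json) {
        return json.replaceAll(regex, ",\n");
    }

    public static void main(String[] args) {
        // The text is {"k":"a,b\"c","n":1}; the first comma is inside a string
        // and is followed by an escaped double quote.
        String json = "{\"k\":\"a,b\\\"c\",\"n\":1}";
        System.out.println(breakWith(NAIVE, json));
        System.out.println("----");
        System.out.println(breakWith(ESCAPE_AWARE, json));
    }
}
```

With the original pattern the comma inside "a,b\"c" is matched as well, so a raw line break ends up inside the string value and the result is no longer valid JSON. The variant leaves that comma alone on this input, but because the lookahead rescans the rest of the text for every comma it may become slow on large documents.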

3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.

Single quotes aren't allowed by the JSON specification but many parsers support them because humans tend to use them anyway. Thus you might need to take them into account and that makes no. 2 even more complex because now there might be an unescaped double quote in a single quote text.

5- It has to be used with the multi-line flag to work with multi-line text.

I'm not entirely sure about that but adding the multi-line flag shouldn't hurt. You could add it to the expression itself though, i.e. by prepending (?m).
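
One way to check this is to run the pattern with and without the inline flag on a text that contains a line break. The following is only a sketch with made-up class and method names. It relies on two facts about Java regular expressions: [^"] also matches line terminators, and without (?m) the $ matches at the end of the input.

```java
final class MultilineFlagDemo {
    // The pattern from the question, written as a Java string literal.
    static final String REGEX = "(,)(?=(?:[^\"]|\"[^\"]*\")*$)";

    // No flag: $ only matches at the end of the whole input.
    static String withoutFlag(String text) {
        return text.replaceAll(REGEX, "$1\n");
    }

    // Inline multi-line flag: $ may also match before a line terminator.
    static String withFlag(String text) {
        return text.replaceAll("(?m)" + REGEX, "$1\n");
    }

    public static void main(String[] args) {
        // Two lines of text; the comma in "x,y" is inside a string.
        String text = "{\"a\":1,\"b\":\"x,y\"\n,\"c\":2}";
        System.out.println(withoutFlag(text));
        System.out.println("----");
        System.out.println(withFlag(text));
        System.out.println(withoutFlag(text).equals(withFlag(text)));
    }
}
```

On this input both calls give the same result, so the flag does not seem to be required for multi-line text, at least as long as no string value itself spans several lines.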

6- It has to work with replaceAll().

In its current form the regex would work with String#replaceAll() because it only matches the comma - the lookahead is used to determine a match but won't result in the wrong parts being replaced. The matches themselves might not be correct though, as described above.
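
As a minimal sketch of that call (class and method names are hypothetical, and the sample text is a shortened version of the one in the question with a comma added inside a string value): in Java the pattern is passed without the surrounding /.../ delimiters, and $1 in the replacement refers to the captured comma.

```java
final class CommaBreakDemo {
    // The pattern from the question, without the /.../ delimiters.
    static final String COMMA_OUTSIDE_QUOTES = "(,)(?=(?:[^\"]|\"[^\"]*\")*$)";

    // Replaces each matched comma with the comma itself plus a line break.
    static String breakAfterCommas(String json) {
        return json.replaceAll(COMMA_OUTSIDE_QUOTES, "$1\n");
    }

    public static void main(String[] args) {
        // No escaped quotes here, so the pattern's assumptions hold.
        String json = "{\"key1\":\"value1\",\"msg\":\"a, b\",\"keyC\":[0,1,2]}";
        System.out.println(breakAfterCommas(json));
    }
}
```

The comma inside "a, b" is left untouched here because an odd number of double quotes follows it, while every other comma gets a line break appended.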

That being said, you should note that JSON is not a regular language and only regular languages are a perfect fit for regular expressions.

Thus I'd recommend using a proper JSON parser (there are quite a lot out there) to parse the JSON into POJOs (might just be a bunch of generic JsonObject and JsonArray instances) and reformat that according to your needs.

Here's an example of how Jackson could be used to accomplish that: https://kodejava.org/how-to-pretty-print-json-string-using-jackson/

In fact, since you're already using JSONObject.toString() you probably don't need the parser itself but just a proper formatter (if you want/need to roll your own you could have a look at the org.json.JSONObject sources).
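
If you do roll your own, a single pass over the characters that tracks whether the current position is inside a string is probably easier to get right than a regex. The following is only a sketch under stated assumptions: the class and method names are made up, only double-quoted strings are recognised (single quotes are not handled), and it applies just the "symbol plus \n" rule from the question, not full pretty-printing.

```java
final class JsonLineBreaker {
    /**
     * Appends '\n' after every character contained in symbols that occurs
     * outside a double-quoted string. Backslash escapes inside strings are
     * skipped, so \" does not end a string.
     */
    static String breakAfter(String json, String symbols) {
        StringBuilder out = new StringBuilder(json.length() + 16);
        boolean inString = false; // currently between two unescaped quotes
        boolean escaped = false;  // previous character was an unescaped backslash
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            out.append(c);
            if (inString) {
                if (escaped) {
                    escaped = false;   // this character belongs to the escape
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;  // closing quote
                }
            } else if (c == '"') {
                inString = true;       // opening quote
            } else if (symbols.indexOf(c) >= 0) {
                out.append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The text is {"k":"a,b\"c","n":[0,1]}
        String json = "{\"k\":\"a,b\\\"c\",\"n\":[0,1]}";
        System.out.println(breakAfter(json, ","));
        System.out.println("----");
        System.out.println(breakAfter(json, ",:[]{}"));
    }
}
```

Passing "," only breaks after commas, while passing ",:[]{}" covers the other symbols from the question in the same pass, so no subsequent replace actions are needed.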

Thomas