Remove escape sequence characters like newline, tab and carriage return from JSON file

Question

I have a JSON file with 80+ fields. When I extract the message field from the JSON file below using jq, the output contains newline and tab characters. I want to remove these escape sequences. I tried doing it with sed, but it did not work.

Sample JSON file:

{
"HOSTNAME":"server1.example",
"level":"WARN",
"level_value":30000,
"logger_name":"server1.example.adapter",
"content":{"message":"ERROR LALALLA\nERROR INFO NANANAN\tSOME MORE ERROR INFO\nBABABABABABBA\n BABABABA\t ABABBABAA\n\n BABABABAB\n\n"}
}

Can anyone help me with this?

so you **never** want a new-line or tab char in that file? OR are there multiple entries in one file? (Please update your Q, and I will delete this comment). Good luck. — shellter, Oct 29 '16 at 16:17
If you use the `-r` option, `jq` will translate escape sequences into real newlines, tabs etc. Is that what you want? `jq -r .content.message file.json`? — hek2mgl, Oct 29 '16 at 16:36
For clarity, please add the expected output matching the sample input to your question (one remaining ambiguity is whether you want the enclosing double quotes stripped as well or not). — mklement0, Oct 29 '16 at 18:49

score 25 · Accepted Answer · edited May 23 '17 at 12:01

A pure jq solution:

$ jq -r '.content.message | gsub("[\\n\\t]"; "")' file.json
ERROR LALALLAERROR INFO NANANANSOME MORE ERROR INFOBABABABABABBA BABABABA ABABBABAA BABABABAB

If you want to keep the enclosing " characters, omit -r.
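The question title also mentions carriage returns, which the character class above does not cover. A minimal sketch of one way to extend it, assuming the message may contain \r; the inline sample (the `json` variable and its text) is hypothetical and stands in for file.json:

```shell
# Hypothetical one-line sample; the \r is added for illustration and is
# not part of the question's data.
json='{"content":{"message":"ERROR A\nINFO B\tMORE\r\nEND\n"}}'

# Same filter as above, with \\r added to the character class.
cleaned=$(printf '%s' "$json" | jq -r '.content.message | gsub("[\\n\\t\\r]"; "")')
printf '%s\n' "$cleaned"
```

Because gsub runs on the decoded string, it removes the actual control characters rather than the two-character escape sequences.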

Note: peak's helpful answer contains a generalized regular expression that matches all control characters in the ASCII and Latin-1 Unicode range by way of a Unicode category specifier, \p{Cc}. jq uses the Oniguruma regex engine.

Other solutions use an additional utility, such as sed or tr.

Using sed to unconditionally remove the escape sequences \n and \t:

$ jq '.content.message' file.json | sed 's/\\[tn]//g'
"ERROR LALALLAERROR INFO NANANANSOME MORE ERROR INFOBABABABABABBA BABABABA ABABBABAA BABABABAB"

Note that the enclosing " are still there, however. To remove them, add another substitution to the sed command:

$ jq '.content.message' file.json | sed 's/\\[tn]//g; s/"\(.*\)"/\1/'
ERROR LALALLAERROR INFO NANANANSOME MORE ERROR INFOBABABABABABBA BABABABA ABABBABAA BABABABAB
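One caveat with "unconditionally": sed sees the JSON-encoded text, so it cannot tell a real \n escape from an escaped backslash that happens to be followed by n or t. A small sketch of that edge case, using a made-up message (not from the question) in which the JSON text contains `C:\\new`; printf stands in for the jq output:

```shell
# Stand-in for `jq '.content.message' file.json`: a JSON-encoded string
# holding an escaped backslash (\\) followed by "new", then a \n escape.
# The sample text is hypothetical.
encoded() {
  printf '%s\n' '"C:\\new\nline"'
}

# The same substitution as above also eats the second backslash plus "n".
mangled=$(encoded | sed 's/\\[tn]//g')
printf '%s\n' "$mangled"
```

The approaches that decode the string first (gsub inside jq, or jq -r piped to tr) should not have this problem, since they operate on the actual characters.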

A simpler option that also removes the enclosing " (note: output has no trailing \n):

$ jq -r '.content.message' file.json | tr -d '\n\t'
ERROR LALALLAERROR INFO NANANANSOME MORE ERROR INFOBABABABABABBA BABABABA ABABBABAA BABABABAB

Note how -r is used to make jq interpolate the string (expanding the \n and \t sequences), which are then removed - as literals - by tr.
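Since tr also deletes the final newline, the shell prompt ends up on the same line as the output. A minimal sketch of one way to put a single trailing newline back, assuming it is acceptable to hold the message in a shell variable; `raw_output` is a hypothetical stand-in for the jq -r call so the snippet runs without file.json:

```shell
# Stand-in for `jq -r '.content.message' file.json` (hypothetical sample
# text); printf emits real newline and tab characters, as jq -r would.
raw_output() {
  printf 'ERROR A\nINFO B\tMORE\n\n'
}

# tr deletes every newline and tab, including the final newline.
cleaned=$(raw_output | tr -d '\n\t')

# Print the result followed by exactly one newline.
printf '%s\n' "$cleaned"
```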

score 8 · Answer 2 · edited Nov 01 '16 at 20:34

With your input, the following incantation:

$ jq 'walk(if type == "string" then gsub("\\p{Cc}"; "<>") else . end)'

produces:

{
  "HOSTNAME": "server1.example",
  "content": {
    "message": "ERROR LALALLA<>ERROR INFO NANANAN<>SOME MORE ERROR INFO<>BABABABABABBA<> BABABABA<> ABABBABAA<><> BABABABAB<><>"
  },
  "level": "WARN",
  "level_value": 30000,
  "logger_name": "server1.example.adapter"
}

Of course, the above invocation is just illustrative:

- you might not need to use walk/1 at all. (walk/1 walks the input JSON.)
- you might want to use a different character class, or specify a pipeline of gsub/2 invocations.
- if you simply want to excise the control characters, specify "" as the second argument of gsub/2.

If you do want to use walk/1 but your jq does not have it, then simply add its definition (easily available on the web) before its invocation.
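As a sketch of the "excise" variant mentioned above, assuming a jq build that provides walk/1 and Oniguruma regex support; the inline sample (the `json` variable and its contents) is hypothetical, and -c is used only to keep the output on one line:

```shell
# Hypothetical two-field sample standing in for the question's file.
json='{"content":{"message":"A\nB\tC\r\n"},"level":"WARN"}'

# Walk every value; in strings, delete all control characters (\p{Cc}).
cleaned=$(printf '%s' "$json" |
  jq -c 'walk(if type == "string" then gsub("\\p{Cc}"; "") else . end)')
printf '%s\n' "$cleaned"
```

Unlike the single-field filters, this keeps the document structure intact and cleans every string value, which may or may not be what the question needs.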

edited Nov 01 '16 at 20:34 by mklement0
answered Oct 29 '16 at 17:43 by peak

++ for several advanced techniques, but, truthfully, the simple `jq -r '.content.message | gsub("[\\n\\t]"; "")' file.json` solution that _could_ be derived from your answer is obscured by the incidental / generalized information. – mklement0 Nov 01 '16 at 20:34
@mklement0 - (1) The question includes the phrase "from JSON file" and mentions a large number of fields. Since it's not clear what is actually needed, I thought a generally useful answer would be more generally useful :-)) (2) The question mentions "escape sequence characters" generally, and TAB, NL and CR specifically, whereas the solution you mention in these comments does not cover all three. – peak Nov 01 '16 at 20:45
Fair points - there's often ambiguity in the description itself and inconsistencies between the description and the sample data ("newline characters and tab spaces [sic]" are mentioned alongside "escape sequences"). I personally find your answer very useful and learned from it, but my point was that a "gentler" framing with more context could have helped. – mklement0 Nov 01 '16 at 20:50

score 2 · Answer 3 · answered Jan 04 '23 at 14:29

With jq v1.6 the following is possible

jq -rc ".content.message" file.json

answered Jan 04 '23 at 14:29 by Suman Maity

Since only a string value is being extracted, `-c` (`--compact-output`) has no effect here, and your solution doesn't do what the question asks for (removal of newlines and tabs). – mklement0 Aug 30 '23 at 20:34
