Delete a non-ascii character, only if a condition applies, in bash

Question

I have a very specific problem that I've been trying to solve, without success.

I have a log that is created by dumping a TCP/IP socket. The dump converts the hex to ASCII, but naturally there are some special characters in it.

I've managed to remove them, but I'm stuck on one case: sometimes a 0x0A is sent, which breaks my applications. When I try to remove it, I also remove the valid 0x0A at the end of the line.

Basically, I have, in the log file:

08-14-2017 10:00:00 String={Teste String}
08-14-2017 10:00:00 String={
Teste String2}
08-14-2017 10:00:00 String={
Teste String3}
08-14-2017 10:00:00 String={Teste String4}

I want the final result as

08-14-2017 10:00:00 String={Teste String}
08-14-2017 10:00:00 String={Teste String2}
08-14-2017 10:00:00 String={Teste String3}
08-14-2017 10:00:00 String={Teste String4}

The payload is always between { and }, so a 0x0A after the } is valid, but one inside the braces is not.

Every command I've tried either removes all the 0x0A characters or doesn't work at all.

I've tried things like

sed 's/^[^}]*}//'
sed 's/\x0A$//'

Any thoughts?

Are you applying the sed command on the ASCII text or on the hex? — pchaigno, Aug 14 '17 at 13:56

anubhava · Answer 1 · 2017-08-14T15:05:47.047

3

Another simpler awk:

awk '{printf "%s%s", $0, (/}/ ? ORS : "")}' file

08-14-2017 10:00:00 String={Teste String}
08-14-2017 10:00:00 String={Teste String2}
08-14-2017 10:00:00 String={Teste String3}
08-14-2017 10:00:00 String={Teste String4}

This awk command prints each record and appends a line break (ORS) only if the line contains }; otherwise it prints the record without a line break, which joins it to the next line.
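To try this without touching the real log, the command can be run against a small inline sample. This is a minimal sketch: the `sample` and `joined` variable names are illustrative and not part of the answer, and it assumes the closing } of each record sits on the record's last line.

```shell
# Two records from the question; the second has a stray 0x0A right after "{".
sample='08-14-2017 10:00:00 String={Teste String}
08-14-2017 10:00:00 String={
Teste String2}'

# Print every line; emit ORS (a newline by default) only after lines containing "}".
joined=$(printf '%s\n' "$sample" | awk '{ printf "%s%s", $0, (/}/ ? ORS : "") }')

printf '%s\n' "$joined"
```

Because the only test is whether a line contains }, a } that appeared inside the payload before a stray newline would end the output line early.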

edited Aug 14 '17 at 15:05

answered Aug 14 '17 at 14:15


1

This is awesome. – dawg Aug 14 '17 at 14:19
1

Yes, it should have been `ORS` :) – anubhava Aug 14 '17 at 15:05

score 1 · Answer 2 · answered Aug 14 '17 at 14:05

This is certainly possible with sed, but it's easier to read and understand in awk:

awk 'BEGIN{ OFS=FS="{"; ORS=RS="}" } { gsub(/[^[:print:]]/,"",$2) } 1' input.txt

What does this do?

First, we set our input and output field separators to {, and our input and output record separators to }. This lets us predictably grab the bracketed text as a specific field (at least based on your sample data).
Next, we replace any non-printable characters in field #2 with a null string, eliminating newlines, backspaces, etc.
Finally, we print the line using awk shorthand.
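
The effect of the separators can be seen in isolation with a short sketch. The sample data and the `fields` variable are made up for illustration, and the `NF > 1` guard is an addition that skips the leftover text after the final }.

```shell
# Illustrative input: the second record has a newline right after "{".
sample='A={x}
B={
y}'

# RS="}" ends each record at a closing brace; FS="{" makes the braced text field 2.
# NF > 1 skips the trailing newline that follows the last "}".
fields=$(printf '%s\n' "$sample" |
    awk 'BEGIN { FS = "{"; RS = "}" } NF > 1 { gsub(/\n/, "", $2); print NR ": " $2 }')

printf '%s\n' "$fields"
```

Each output line shows the record number and the braced text with its newline removed, which is the part the answer's command cleans before printing.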

score 1 · Answer 3 · answered Aug 14 '17 at 15:13

With GNU awk for multi-char RS we can just isolate each {...} string and remove newlines within it:

$ awk -v RS='{[^}]+}' '{ORS=gensub(/\n/,"","g",RT)}1' file
08-14-2017 10:00:00 String={Teste String}
08-14-2017 10:00:00 String={Teste String2}
08-14-2017 10:00:00 String={Teste String3}
08-14-2017 10:00:00 String={Teste String4}

For this specific case the other awk answers will work just fine. The above is simply a more general solution to the problem of isolating a delimited string so that you can operate on it, such as removing characters as in this case.

pchaigno · Answer 4 · 2017-08-14T15:08:51.983

0

With sed:

Linux:

$ sed -r ':a;N;$!ba;s/(\{[^}]*)\n([^{]*\})/\1\2/g' file
08-14-2017 10:00:00 String={Teste String}
08-14-2017 10:00:00 String={Teste String2}
08-14-2017 10:00:00 String={Teste String3}
08-14-2017 10:00:00 String={Teste String4}

FreeBSD and macOS:

sed -E -e ':a' -e 'N;$!ba' -e 's/(\{[^}]*)\n([^{]*\})/\1\2/g' file

Explanations

-e ':a' -e 'N;$!ba' loops until the last line, appending each line to the pattern space, so the substitution sees the whole file at once and can match across line breaks. See this SO answer for details.

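(\{[^}]*) matches an opening brace and the text up to the newline, with no closing brace in between.

([^{]*\}) does the opposite: it matches the text after the newline up to the closing brace, with no opening brace in between.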
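
The slurp-then-substitute idea can be checked on a small sample. This is a sketch, assuming a sed that accepts `-E` (see the comments below on portability); the sample data and the `slurped` and `fixed` variable names are illustrative.

```shell
# Step 1: the ':a' / 'N' / '$!ba' loop appends every line to the pattern space,
# so a later s/// sees the embedded newlines. Here they are turned into commas.
slurped=$(printf 'a\nb\nc\n' | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/,/g')

# Step 2: the same loop, followed by the brace-aware substitution.
sample='X={a}
Y={
b}'
fixed=$(printf '%s\n' "$sample" |
    sed -E -e ':a' -e 'N' -e '$!ba' -e 's/(\{[^}]*)\n([^{]*\})/\1\2/g')

printf '%s\n' "$slurped" "$fixed"
```

Note that the substitution removes at most one newline per {...} pair, so a payload containing several stray newlines would need repeated passes.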

edited Aug 14 '17 at 15:08

answered Aug 14 '17 at 14:08


Doesn't work for me in FreeBSD or macOS. Is this GNU-sed specific? – ghoti Aug 14 '17 at 14:09
Works when you split it up: `sed -E -e ':a' -e 'N;$!ba' -e 's/(\{[^}]*)\n([^{]*\})/\1\2/g'` .. non-GNU sed appears to want labels not to be followed by semicolons. – ghoti Aug 14 '17 at 14:14
@ghoti Thanks. I updated. This should work with both GNU-sed and non-GNU-sed (?). – pchaigno Aug 14 '17 at 14:17
2

`\n` is not portable across sed versions (you need backslash followed by a literal newline for portability) and `-E` will only work in GNU and OSX sed while `-r` will only work in GNU sed. – Ed Morton Aug 14 '17 at 15:03
1

Also, sed in Solaris 10 does not support `-E` or `-r`, so a BRE-based solution would be preferred for maximum portability. In bash, you may be able to get the embedded literal newline using format substitution, i.e `$'foo\nbar'`. – ghoti Aug 14 '17 at 15:06
And at the end of the day this simply isn't a job for sed at all since an awk solution will be clearer, simpler, more efficient, more portable, easier to enhance/maintain, etc. so why bother polishing it? – Ed Morton Aug 14 '17 at 15:08

dawg · Answer 5 · 2017-08-14T14:51:39.287

0

Perl:

$ perl -0777 -pe 's/(\{[^}]*)\x0A([^}]*\})/$1$2/g' file
08-14-2017 10:00:00 String={Teste String}
08-14-2017 10:00:00 String={Teste String2}
08-14-2017 10:00:00 String={Teste String3}
08-14-2017 10:00:00 String={Teste String4}

Pure Bash (based on anubhava's awk):

while IFS= read -r line; do
    le=""
    [[ $line =~ \} ]] && le=$'\n'
    printf "%s%s" "$line" "$le"
done <file
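
The same loop can be wrapped in a function that reads standard input. This is a sketch rather than the answer's own code: `join_braced` is a hypothetical name, the empty `IFS=` is there so `read` leaves each line untouched, and the `|| [ -n "$line" ]` clause is an addition so that a final line without a trailing newline is not dropped. It sticks to POSIX constructs, so it runs under bash as well as plain sh.

```shell
# join_braced: hypothetical helper; copies stdin to stdout, ending a line
# only when it contains "}".
join_braced() {
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *'}'*) printf '%s\n' "$line" ;;  # record complete: keep the newline
            *)     printf '%s' "$line" ;;    # stray newline: drop it
        esac
    done
}

result=$(printf '%s\n' 'String={Teste String}' 'String={' 'Teste String2}' | join_braced)
printf '%s\n' "$result"
```

For a real log the function would be fed from the file, for example `join_braced <file`.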

edited Aug 14 '17 at 14:51

answered Aug 14 '17 at 14:12

