Remove duplicate characters using AWK gsub()

Question

I'm trying to reformat some text by removing newline and duplicated space characters.

My input text looks like this:

     hello  ! hello 
you! 

 hello



   world!!! hello


   universe  !

and I'm trying to format it like this:

hello !
hello you!
hello world!
hello universe !

I tried using this command:

awk -v RS='!' '{gsub("^ *|\n", ""); gsub(" +", " ")} NF{print $0 RS }' file

But I still get some spaces at the beginning of the line:

 hello !
hello you!
 hello world!
hello universe !

I don't understand why the first gsub is not removing the leading space (that should be matched by the pattern ^ *).

What is wrong in this awk script?

I'm also interested in a sed command that performs the same formatting.

It's because `^` matches only at the beginning of the record, and in those records there is a newline between the leading space and `hello`. The first line shouldn't have a space in front of it, though. Use `[[:space:]]` instead. – 123 Jul 07 '16 at 07:00

John1024 · Accepted Answer · 2016-07-07T07:22:56.523

3

$ awk -v RS='!' '{gsub(/^[[:space:]]*/, ""); gsub(/[[:space:]]+/, " ")} NF{print $0 RS}' file
hello !
hello you!
hello world!
hello universe !

-v RS='!'

This sets the record separator to an exclamation point.

gsub(/^[[:space:]]*/, "")

This removes all leading white space from the record.

[[:space:]] is a POSIX character class, interpreted according to the current locale, that matches any white space, which includes blanks, tabs, newlines, and some other more obscure white space.

gsub(/[[:space:]]+/, " ")

This replaces each remaining run of white space with a single blank.

NF{print $0 RS}

If there are any words in this record, this prints the record followed by the record separator.
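
To see why the character class matters, the sketch below runs the question's substitutions and the ones from this answer on a single made-up record that starts with a blank, a newline, and another blank. The brackets in the output are only there to make leading spaces visible, and the variable names are illustrative. It assumes an awk that supports POSIX character classes (gawk and recent mawk do).

```shell
# Substitutions from the question: "^ *" only matches at the very start of
# the record, so the blank that follows the first newline is left in place.
before=$(printf ' \n hello   world' |
    awk -v RS='!' '{gsub("^ *|\n", ""); gsub(" +", " "); print "[" $0 "]"}')

# Substitutions from this answer: [[:space:]] also matches newlines, so the
# whole leading run of white space is removed in one match.
after=$(printf ' \n hello   world' |
    awk -v RS='!' '{gsub(/^[[:space:]]*/, ""); gsub(/[[:space:]]+/, " "); print "[" $0 "]"}')

printf '%s\n' "$before" "$after"
```

The first line printed keeps a leading space inside the brackets, while the second does not.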

edited Jul 07 '16 at 07:22

answered Jul 07 '16 at 06:59

John1024

The `{print $0 RS }` is still needed in order to get the `!` at the end. Anyway, your answer (with the `[[:space:]]`) is good. thanks. – oliv Jul 07 '16 at 07:11
maybe also update the output to match with the command (and add the 4 `!`). – oliv Jul 07 '16 at 07:20

123 · Answer 2 · 2016-07-07T07:39:10.227

1

In sed

sed ':1;/!/!{$!{N;b1}};s/!\{2,\}/!/;s/\n*//g;s/^ *//;s/ \{1,\}/ /g;s/!/&\n/;/^$/d;P;D' file

:1
/!/!{
        $!{
                N
                b1
        }
}
s/!\{2,\}/!/
s/\n*//g
s/^ *//
s/ \{1,\}/ /g
s/!/&\n/
/^$/d
P
D
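
For readers who want the explanation requested in the comments, here is the same script with a comment before each command. This is a sketch that assumes GNU sed (the `\n` in the replacement of `s/!/&\n/` is not portable), and `sample` is a hypothetical helper that only reproduces the input from the question.

```shell
# Hypothetical helper: emits the sample input from the question.
sample() {
    printf '%s\n' '     hello  ! hello ' 'you! ' '' ' hello' '' '' '' \
        '   world!!! hello' '' '' '   universe  !'
}

result=$(sample | sed '
# Loop target for the b1 branch below.
:1
# If the pattern space has no "!" yet and this is not the last input line,
# append the next line (N) and jump back to the label.
/!/!{
    $!{
        N
        b1
    }
}
# Collapse the first run of two or more "!" into a single "!".
s/!\{2,\}/!/
# Remove the newlines that N put into the pattern space.
s/\n*//g
# Strip leading blanks, then squeeze each run of blanks to one.
s/^ *//
s/ \{1,\}/ /g
# Put a newline after the first "!" so that P and D can split there.
s/!/&\n/
# Nothing left in the pattern space: delete it and start the next cycle.
/^$/d
# Print up to the first newline, delete that part, and rerun the script
# on the remainder without reading new input.
P
D
')

printf '%s\n' "$result"
```

On the question's input this should print the four lines shown under "and I'm trying to format it like this".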

edited Jul 07 '16 at 07:39

answered Jul 07 '16 at 07:24

123

Can you please give some explanation? BTW, there is an extra newline at the end of the output. – oliv Jul 07 '16 at 07:29
@oliv Blank lines will be deleted now. As for the explanation, it's pretty self-explanatory if you look at what each command does. – 123 Jul 07 '16 at 07:32
hmm... _self explanatory_ ... ok for the `s` substitution. The `/!/!` part is a mystery for me... same for the `P;D` command... – oliv Jul 07 '16 at 07:43
`/!/` matches a single `!` in the pattern space, and the trailing `!` negates this match, meaning that if the pattern space does not contain `!`, the following block/command is executed. `P` and `D` are documented in the man page. – 123 Jul 07 '16 at 07:47
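
The two pieces discussed in these comments can be tried in isolation. A small sketch, assuming GNU sed (again because of `\n` in the replacement); the input strings and variable names are made up for illustration.

```shell
# /!/!p prints only the lines that do NOT contain "!".
no_bang=$(printf '%s\n' 'a' 'b!' 'c' | sed -n '/!/!p')
printf '%s\n' "$no_bang"

# s/!/&\n/ splits the line after the first "!". P prints the part before
# the newline, D deletes it and restarts the script on what is left.
split=$(printf '%s\n' 'one!two' | sed 's/!/&\n/;P;D')
printf '%s\n' "$split"
```

The first command should print `a` and `c`; the second should print `one!` and `two` on separate lines.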
