Remove \r character from String pattern matched in AWK

Question

I'm quite new to AWK, so apologies for the basic question. I've found many references for removing Windows line-ending characters from files, but none that match a regular expression and then remove the Windows line-ending characters from the match.

I have a file named infile.txt that contains a line like so:

...
DATAFILE   data5v.dat
...

Within a shell script I want to capture the filename argument data5v.dat from this infile.txt and remove any carriage return character, \r, IF present. The carriage return may not always be present. So I have to match a word and then remove the \r subsequently.

I have tried the following but it is not working how I expect:

FILENAME=$(awk '/DATAFILE/ { print gsub("\r", "", $2) }' $INFILE)

Can I store the string returned from matching my regex /DATAFILE/ in a variable within my AWK statement to subsequently apply gsub?

Use `sub(/\r$/,"")`. I use GNU awk and `RS="\r?\n"`. – James Brown Mar 25 '22 at 11:46

Ed Morton · Accepted Answer · 2022-03-25T14:03:53.720

File names can contain white space, including \rs, blanks, and tabs, so to do this robustly you can't remove all \rs with gsub() and you can't rely on there being any single field, e.g. $2, that contains the whole file name.

If your input fields are tab-separated you need:

awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' file

or this otherwise:

awk '/DATAFILE/ { sub(/[^[:space:]]+[[:space:]]+/,""); sub(/\r$/,""); print }' file

The above assumes your file names don't start with spaces and don't contain newlines.

To test any solution for robustness try:

printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '...' | cat -TEv

and make sure that the output looks like it does below:

$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' | cat -TEv
foo ^M^Ibar$

$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^[:space:]]+[[:space:]]+/,""); sub(/\r$/,""); print }' | cat -TEv
foo ^M^Ibar$

Note the blank, ^M (CR), and ^I (tab) in the middle of the file name as they should be but no ^M at the end of the line.

If your version of cat doesn't support -T or -E then do whatever you normally do to look for non-printing chars, e.g. od -c or vi the output.
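To connect this back to the shell script in the question, here is a minimal sketch of capturing the result in a shell variable. It assumes tab-separated input, so it uses the first command above. The sample file contents and the variable names `infile` and `datafile` are made up for illustration.

```shell
# Hypothetical sample input: tab-separated, CRLF line endings,
# and a blank inside the file name.
infile=$(mktemp)
printf 'HEADER\tx\r\nDATAFILE\tmy data5v.dat\r\n' > "$infile"

# Drop everything up to and including the first tab, then drop a
# trailing CR if there is one, and print what is left of the line.
datafile=$(awk '/DATAFILE/ { sub(/[^\t]+\t/, ""); sub(/\r$/, ""); print }' "$infile")

printf '[%s]\n' "$datafile"   # prints [my data5v.dat]
rm -f "$infile"
```

The brackets in the final printf make a stray trailing CR or blank easy to spot when the value is echoed to a terminal.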

tshiono · Answer 2 · 2022-03-25T23:33:32.987

With GNU awk, would you please try the following:

FILENAME=$(awk -v RS='\r?\n' '/DATAFILE/ {print $2}' "$INFILE")
echo "$FILENAME"

It sets the record separator RS to a regex that matches an optional \r followed by \n, so the carriage return is consumed as part of the line ending.
As a side note, uppercase names are not recommended for user variables because they may conflict with reserved system variable names.
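A regex or multi-character RS is a GNU awk extension. For other awks, a rough portable sketch is to strip the trailing \r from the record before printing the field. This assumes the file name contains no white space, as `$2` already does above. The sample input and the lowercase variable names are hypothetical.

```shell
# Hypothetical sample input with CRLF line endings.
infile=$(mktemp)
printf 'TITLE   demo\r\nDATAFILE   data5v.dat\r\n' > "$infile"

# sub() with no third argument edits $0, which makes awk re-split
# the fields, so $2 no longer carries the trailing CR.
filename=$(awk '/DATAFILE/ { sub(/\r$/, ""); print $2 }' "$infile")

printf '[%s]\n' "$filename"   # prints [data5v.dat]
rm -f "$infile"
```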

edited Mar 25 '22 at 23:33

answered Mar 25 '22 at 12:16

You should mention that requires GNU awk for multi-char RS, it wouldn't work with a POSIX awk. – Ed Morton Mar 25 '22 at 17:48

tripleee · Answer 3 · 2022-03-25T12:00:49.060

Awk simply applies each line of script to each input line. You can easily remove the carriage return and then apply some other logic to the input line. For example,

FILENAME=$(awk '/\r/ { sub(/\r/, "") }
     /DATAFILE/ { print $2 }' "$INFILE")

Note also the quotes around "$INFILE"; see "When to wrap quotes around a shell variable".
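As a quick check that the two rules cooperate whether or not a carriage return is present, the sketch below feeds the same hypothetical line through the script once with CRLF and once with a plain LF. The variable names are for illustration only.

```shell
# Same two-rule script as above, kept in one variable for reuse.
script='/\r/ { sub(/\r/, "") } /DATAFILE/ { print $2 }'

# The first rule only fires when the record contains a CR; the
# second rule then sees the already-cleaned record.
with_cr=$(printf 'DATAFILE   data5v.dat\r\n' | awk "$script")
without_cr=$(printf 'DATAFILE   data5v.dat\n' | awk "$script")

[ "$with_cr" = "$without_cr" ] && echo "same result: $with_cr"
```

Both runs should yield `data5v.dat`, so the script does not need to know in advance which line-ending style the file uses.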

edited Mar 25 '22 at 12:00

answered Mar 25 '22 at 11:46

This did not work. It returned an empty string – daragh Mar 25 '22 at 11:56
How did you test it? Works for me: `awk '/\r/ { sub(/\r/, "") } /DATAFILE/ { print $2 }' <<<$'hello\r\nthis is DATAFILE\r\ngood bye'` at the Bash prompt prints `is` – tripleee Mar 25 '22 at 12:00
Use infile.txt instead of $INFILE – stark Mar 25 '22 at 12:11
The test on the command line works but not when I run it within the shell script? The following is working within the shell script: { sub(/\r/, "", $2); print $2 }. This is from anubhava's comment on the original question – daragh Mar 25 '22 at 12:33
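The form quoted in the comment above differs from the attempt in the question in one respect: sub() and gsub() return the number of replacements made, not the modified string. The following sketch contrasts the two on a hypothetical input line. The variable names are illustrative.

```shell
# Printing the return value of gsub() prints a count, not the text.
count=$(printf 'DATAFILE   data5v.dat\r\n' | awk '/DATAFILE/ { print gsub(/\r/, "", $2) }')

# Calling gsub() as a statement edits $2 in place; print it afterwards.
name=$(printf 'DATAFILE   data5v.dat\r\n' | awk '/DATAFILE/ { gsub(/\r/, "", $2); print $2 }')

printf 'count=%s name=[%s]\n' "$count" "$name"   # prints count=1 name=[data5v.dat]
```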

score 0 · Answer 4 · answered Mar 29 '22 at 09:52

Who says you need GNU awk? This works with mawk:

 gecho -ne  "test\r\nabc\n\rdef\n" \
 \
 | mawk NF=NF FS='\r' OFS='' | odview

0000000        1953719668      1667391754      1717920778              10
           t   e   s   t  \n   a   b   c  \n   d   e   f  \n            
          164 145 163 164 012 141 142 143 012 144 145 146 012            
           t   e   s   t  nl   a   b   c  nl   d   e   f  nl            
          116 101 115 116  10  97  98  99  10 100 101 102  10            
           74  65  73  74  0a  61  62  63  0a  64  65  66  0a            

0000015

gawk in POSIX mode (-P) is also fine with it:

gecho -ne  "test\r\nabc\n\rdef\n" \
\
| gawk -Pe  NF=NF FS='\r' OFS='' | odview

0000000        1953719668      1667391754      1717920778              10
           t   e   s   t  \n   a   b   c  \n   d   e   f  \n            
          164 145 163 164 012 141 142 143 012 144 145 146 012            
           t   e   s   t  nl   a   b   c  nl   d   e   f  nl            
          116 101 115 116  10  97  98  99  10 100 101 102  10            
           74  65  73  74  0a  61  62  63  0a  64  65  66  0a            

0000015
