
With the following file debug.txt:

a,b,1
a,b,2
a,b,3
a,b,4
a,b,5

This prints the third column of only the first row:

$ awk -F ',' '$1 == "a" {print($3)}' debug.txt 
1

while this prints the third column of all 5 rows:

$ awk -F ',' '$2 == "b" {print($3)}' debug.txt 
1
2
3
4
5

Why is this? And how would I select all rows where the first column matches a?

EDIT: Here are the raw contents of debug.txt:

$ cat -v debug.txt 
a,b,1
^Ma,b,2
^Ma,b,3
^Ma,b,4
^Ma,b,5
$ hexdump -Cv debug.txt
00000000  61 2c 62 2c 31 0a 0d 61  2c 62 2c 32 0a 0d 61 2c  |a,b,1..a,b,2..a,|
00000010  62 2c 33 0a 0d 61 2c 62  2c 34 0a 0d 61 2c 62 2c  |b,3..a,b,4..a,b,|
00000020  35                                                |5|
00000021
$ file debug.txt
debug.txt: ASCII text, with CR, LF line terminators

Note that 5 is the last character in the file (no trailing newline).

Hintron
  • Huh?? You must have problems with your file. Your first command is correct. Is the file created on Windows with a UTF-16 character set or something strange? What does `file debug.txt` show? (or better, `hexdump -Cv debug.txt`) – David C. Rankin Feb 13 '20 at 08:16
  • It can't possibly happen. `awk -F ',' '$1 == "a" {print($3)}` is working fine on my machine – Inian Feb 13 '20 at 08:17
  • Works as expected for me. Try creating a new data file and enter the data by hand (not paste or slurp) to make sure it's exactly correct. If that makes a difference, look at the first file with `cat -v` or `sed -n l` to see _exactly_ what's in it. PS: comma doesn't need quoting; `awk -F, '...'` is fine. – dave_thompson_085 Feb 13 '20 at 08:17
  • Ok, good to know that at least the commands look good. I will investigate the file binary to see what's up and update the question with more info. Thanks. P.S. this is on Kubuntu 19.10, though I created the file with my Sublime Text 3 IDE, I believe. – Hintron Feb 13 '20 at 19:01
  • I can confirm that this works as expected on my other Kubuntu 19.10 machine with the file created via Sublime Text 3. – Hintron Feb 13 '20 at 19:05

1 Answer

TL;DR:

A stray carriage return (CR or \r) after each line feed ends up at the start of the following line, so awk sees ^Ma rather than a as the first column. As a result, $1 == "a" is false on every line but the first.

Explanation

debug.txt has unusual line endings: each line ends with the byte sequence 0x0a 0x0d, which cat -v debug.txt shows as a newline followed by ^M at the start of the next line.

The Wikipedia article for newline indicates that Unix/Linux newlines are a single 0x0a (\n or LF), while Windows newlines are 0x0d 0x0a (\r\n or CRLF). Somehow, debug.txt has 'backwards' Windows newlines: 0x0a 0x0d (\n\r or LFCR). This is the cause of all the trouble.

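awk on Linux gives neither sequence special treatment: by default it splits records on LF only. With CRLF, the CR stays at the end of the line and so becomes part of the last column ($3). That leaves $1 untouched, and although the CR is still printed, a terminal does not show it. With LFCR, awk sees a regular Unix newline followed by a stand-alone carriage return, which becomes the first character of the next line.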

Since the CR is now at the start of the next line, awk splits that line's first column as ^Ma rather than a. $1 == "a" then compares "^Ma" with "a", which is false, so every line except the first is skipped. The second column is unaffected, which is why $2 == "b" matches all five rows.
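To see the stray CR directly, one option is to print the length of the first column for each line. The sketch below recreates debug.txt from the hexdump in the question with printf and then reports, per line, the line number, the length of $1, and whether $1 == "a" holds (1 for true, 0 for false).

```shell
# Recreate debug.txt byte-for-byte from the hexdump in the question:
# every line but the last ends with LF CR, and there is no trailing newline.
printf 'a,b,1\n\ra,b,2\n\ra,b,3\n\ra,b,4\n\ra,b,5' > debug.txt

# For each line print: line number, length of $1, and the result of $1 == "a".
# A length of 2 means $1 holds a CR followed by the letter a.
awk -F ',' '{ print NR, length($1), ($1 == "a") }' debug.txt
# 1 1 1
# 2 2 0
# 3 2 0
# 4 2 0
# 5 2 0
```

Only the first line has a one-character $1, which matches the single row printed by the original command.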

Examples

The following files have the same contents as debug.txt except that the lines end with 0x0a 0x0d (LF + CR), 0x0d 0x0a (CR + LF), and 0x0a (LF), respectively (debug.txt and debug-lfcr.txt are identical):

$ cat -v debug-lfcr.txt 
a,b,1
^Ma,b,2
^Ma,b,3
^Ma,b,4
^Ma,b,5
$ awk -F ',' '($1 == "a") {print($3)}' debug-lfcr.txt 
1
$ cat -v debug-crlf.txt 
a,b,1^M
a,b,2^M
a,b,3^M
a,b,4^M
a,b,5
$ awk -F ',' '($1 == "a") {print($3)}' debug-crlf.txt 
1
2
3
4
5
$ cat -v debug-lf.txt 
a,b,1
a,b,2
a,b,3
a,b,4
a,b,5
$ awk -F ',' '($1 == "a") {print($3)}' debug-lf.txt 
1
2
3
4
5
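For anyone who wants to reproduce these results, the three files can be recreated with printf. This is a sketch under one assumption: none of the files ends with a trailing line terminator. That matches debug.txt, but it is a guess for the other two. The loop then counts how many rows satisfy $1 == "a" in each file.

```shell
# Recreate the three test files; the only difference is the line terminator.
# Assumption: no terminator after the final 5 in any of them.
printf 'a,b,1\n\ra,b,2\n\ra,b,3\n\ra,b,4\n\ra,b,5' > debug-lfcr.txt
printf 'a,b,1\r\na,b,2\r\na,b,3\r\na,b,4\r\na,b,5' > debug-crlf.txt
printf 'a,b,1\na,b,2\na,b,3\na,b,4\na,b,5'         > debug-lf.txt

# Count the rows whose first column is exactly "a" in each file.
for f in debug-lfcr.txt debug-crlf.txt debug-lf.txt; do
    n=$(awk -F ',' '$1 == "a" { n++ } END { print n + 0 }' "$f")
    printf '%s: %s matching rows\n' "$f" "$n"
done
# debug-lfcr.txt: 1 matching rows
# debug-crlf.txt: 5 matching rows
# debug-lf.txt: 5 matching rows
```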

How to fix the files

So the solution is to replace all LFCR sequences with either CRLF or LF.

To convert to just LF from LFCR, strip away all the CRs:

tr -d '\r' < debug.txt > debug-cured.txt
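A quick way to check the result is to count the CR bytes left in the cured file and rerun the original awk command on it. The sketch below recreates debug.txt from the hexdump first so that it is self-contained.

```shell
# Recreate the LFCR file from the question, then strip every CR.
printf 'a,b,1\n\ra,b,2\n\ra,b,3\n\ra,b,4\n\ra,b,5' > debug.txt
tr -d '\r' < debug.txt > debug-cured.txt

# Keep only the CR bytes of the cured file and count them; 0 means none remain.
tr -cd '\r' < debug-cured.txt | wc -c

# The original command now prints the third column of all five rows.
awk -F ',' '$1 == "a" {print($3)}' debug-cured.txt
```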

To convert to CRLF from LFCR, strip away the CRs and add them back in to the end of each line:

tr -d '\r' < debug.txt | sed -e '$a\' | sed 's/$/\r/' > debug-cured.txt

(The sed -e '$a\' stage is optional: it merely adds a newline to the end of the file if one is not already there. This avoids ending the file with a stand-alone CR, which could cause problems later. Both the '$a\' trick and \r in the replacement text rely on GNU sed.)

See Remove carriage return in Unix, Add text at the end of each line, and https://unix.stackexchange.com/questions/31947/how-to-add-a-newline-to-the-end-of-a-file.
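If rewriting the file is not an option, another approach is to strip the CRs inside awk before the comparison runs. This is a sketch of an alternative, not something from the commands above. It relies on the standard awk behaviour that changing $0 with gsub causes the columns to be split again.

```shell
# Recreate the LFCR file from the question.
printf 'a,b,1\n\ra,b,2\n\ra,b,3\n\ra,b,4\n\ra,b,5' > debug.txt

# Remove every CR from the current line first; awk then re-splits the columns,
# so $1 is a plain "a" by the time the comparison runs.
awk -F ',' '{ gsub(/\r/, "") } $1 == "a" { print $3 }' debug.txt
# 1
# 2
# 3
# 4
# 5
```

The same command also works on CRLF input, where it removes the CR that would otherwise stay attached to the third column.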

Aside

The reason why there were LFCR newlines is that I was dealing with software that was outputting text to both a virtual console and a hardware UART. This specific software's print function would detect an LF in the text and inject a CR afterwards. A hardware UART needs both LF and CR, but the order doesn't matter. So the software opted for LFCR, since it's a slightly faster implementation than CRLF.

Hintron