TL;DR:
A stray carriage return character (CR or \r) after each newline is causing awk to read ^Ma as the first column, so $1 == "a" is false on all but the first line.
Explanation
debug.txt has some bizarre newlines. Each line ends with the byte sequence 0x0a0d, which cat -v debug.txt shows as a newline followed by a ^M.
The Wikipedia article for newline indicates that Unix/Linux newlines are just 0x0a (\n or LF), while Windows newlines are 0x0d0a (\r\n or CRLF). Somehow, debug.txt has 'backwards' Windows newlines: 0x0a0d (\n\r or LFCR). This is the cause of all the trouble.
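To see the byte order for yourself, a dump with od is often clearer than cat -v. The following is a minimal sketch: it builds a small stand-in file (hypothetical sample data, not the real debug.txt) with LF + CR line endings and inspects it.

```shell
# Build a small stand-in for debug.txt with LF+CR line endings (hypothetical sample data).
sample=$(mktemp)
printf 'a,b,1\n\ra,b,2\n\ra,b,3\n' > "$sample"

# Dump the bytes; each LFCR line break appears as \n immediately followed by \r.
od -c "$sample"

# Count the carriage returns in the file.
cr_count=$(tr -cd '\r' < "$sample" | wc -c | tr -d ' ')
echo "carriage returns: $cr_count"
```

In a CRLF file the \r would appear before each \n in the od output instead of after it.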
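awk has no trouble with a regular Windows newline here: with CRLF (at least on Unix-like systems), the record still ends at the LF and the CR stays at the end of the last field, so $1 is unaffected. However, when awk sees LFCR at the end of the first line, it treats it as a regular Unix newline followed by a stand-alone carriage return.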
Since the CR is now on the next line, when awk splits out the first column of the next line, it sees ^Ma instead of a. So $1 == "a" evaluates to "^Ma" == "a", which is false, and every line except the first is ignored.
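One way to make the hidden character visible is to print the length of the first field on each line. This is a small sketch using hypothetical sample data with LF + CR line endings in place of debug.txt:

```shell
# Feed awk three LFCR-terminated lines and print each record number
# together with the length of its first field.
lengths=$(printf 'a,b,1\n\ra,b,2\n\ra,b,3\n' | awk -F ',' '{ print NR, length($1) }')
echo "$lengths"
```

The first field is one character long on the first line and two characters long on the others, because the stray CR has become part of $1.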
Examples
The following files have the same contents as debug.txt, except that their lines end with 0x0a0d (LF + CR), 0x0d0a (CR + LF), and 0x0a (LF), respectively (debug.txt and debug-lfcr.txt are identical):
$ cat -v debug-lfcr.txt
a,b,1
^Ma,b,2
^Ma,b,3
^Ma,b,4
^Ma,b,5
$ awk -F ',' '($1 == "a") {print($3)}' debug-lfcr.txt
1
$ cat -v debug-crlf.txt
a,b,1^M
a,b,2^M
a,b,3^M
a,b,4^M
a,b,5
$ awk -F ',' '($1 == "a") {print($3)}' debug-crlf.txt
1
2
3
4
5
$ cat -v debug-lf.txt
a,b,1
a,b,2
a,b,3
a,b,4
a,b,5
$ awk -F ',' '($1 == "a") {print($3)}' debug-lf.txt
1
2
3
4
5
How to fix the files
So the solution is to replace all LFCR sequences with either CRLF or LF.
To convert to just LF from LFCR, strip away all the CRs:
tr -d '\r' < debug.txt > debug-cured.txt
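A quick way to check the result is to confirm that no CR bytes remain and that the original awk command now matches every line. The sketch below runs on a hypothetical stand-in for debug.txt created in a temporary directory:

```shell
# Hypothetical stand-in for debug.txt with LF+CR line endings.
workdir=$(mktemp -d)
printf 'a,b,1\n\ra,b,2\n\ra,b,3\n' > "$workdir/debug.txt"

# Strip every CR, as in the command above.
tr -d '\r' < "$workdir/debug.txt" > "$workdir/debug-cured.txt"

# No CR bytes should remain, and awk should now match every line.
remaining=$(tr -cd '\r' < "$workdir/debug-cured.txt" | wc -c | tr -d ' ')
matched=$(awk -F ',' '($1 == "a") {print($3)}' "$workdir/debug-cured.txt")
echo "CRs left: $remaining"
echo "$matched"
```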
To convert to CRLF from LFCR, strip away the CRs and add them back in to the end of each line:
tr -d '\r' < debug.txt | sed -e '$a\' | sed 's/$/\r/' > debug-cured.txt
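(The | sed -e '$a\' stage is optional; it merely adds a newline to the end of the file if one is not already there. This avoids ending the file with a stand-alone CR, which could cause problems later.) Note that \r in the replacement text is understood by GNU sed; other sed implementations may not interpret it as a carriage return.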
See Remove carriage return in Unix, Add text at the end of each line, and https://unix.stackexchange.com/questions/31947/how-to-add-a-newline-to-the-end-of-a-file.
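If rewriting the file is not an option, the CRs can also be dropped inside awk itself before the comparison. This is only a sketch of that alternative, shown on hypothetical sample data rather than the real debug.txt:

```shell
# Hypothetical sample with LF+CR line endings, standing in for debug.txt.
result=$(printf 'a,b,1\n\ra,b,2\n\ra,b,3\n' |
  awk -F ',' '{ gsub(/\r/, "") }            # drop every CR from the record, awk then re-splits the fields
              ($1 == "a") { print($3) }')
echo "$result"
```

Because gsub modifies $0, the fields are recomputed before the $1 == "a" test runs, so every line matches again. The same filter should also be harmless on files with LF or CRLF endings.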
Aside
The reason why there were LFCR newlines is that I was dealing with software that was outputting text to both a virtual console and a hardware UART. This specific software's print function would detect an LF in the text and inject a CR afterwards. A hardware UART needs both LF and CR, but the order doesn't matter. So the software opted for LFCR, since it's a slightly faster implementation than CRLF.