
If column 1 of file1 is present in column 1 of file2, I want to print the whole row of file1. This was solved in a previous post, so I got my code working, but I couldn't figure out why my earlier attempt didn't work. File1 is tab-delimited.

File 1:

10001 stuff1,stuff2

20002 stuff3,stuff4

30003 stuff5

File 2:

10001

30003

Output:

10001 stuff1,stuff2

30003 stuff5

This is what worked, but I don't understand what gsub is or why we only print if greater than 0:

awk '{gsub(/\r/,"")} NR==FNR{c[$1]++;next};c[$1]>0' file2 file1

I tried this, adapted from other people's code, and I want to know what was wrong with it:

nawk -F, 'BEGIN{FS=OFS='\t'} FNR==NR{array[$1]; next;} ($1 in array){print$0}' file2 file1

The code that didn't work prints all of file1 instead of only the lines whose first column is present in file2. Any idea what I did wrong?

quantumDog

1 Answer


The gsub() is [incorrectly] trying to handle DOS line endings; see "Why does my tool output overwrite itself and how do I fix it?". Try this (untested):

awk '{sub(/\r$/,"")} NR==FNR{a[$1]; next} $1 in a' file2 file1

If file2 doesn't have DOS line endings then you can remove {sub(/\r$/,"")}.
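To make the suggestion above concrete, here is a minimal sketch that rebuilds the sample data from the question and runs the command against it. The scratch directory and the choice to give file2 CRLF line endings are assumptions made for illustration; the real files may differ.

```shell
# Hypothetical scratch directory holding sample data that mirrors the question.
tmp=$(mktemp -d)
printf '10001\tstuff1,stuff2\n20002\tstuff3,stuff4\n30003\tstuff5\n' > "$tmp/file1"
# file2 is written with DOS (CRLF) line endings on purpose (an assumption).
printf '10001\r\n30003\r\n' > "$tmp/file2"

# Strip one trailing \r from every line, remember the keys from file2,
# then print the file1 lines whose first field is one of those keys.
result=$(awk '{sub(/\r$/,"")} NR==FNR{a[$1]; next} $1 in a' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With this data the two matching rows of file1 (10001 and 30003) are printed and the 20002 row is skipped.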

Your first script:

awk '{gsub(/\r/,"")} NR==FNR{c[$1]++;next};c[$1]>0' file2 file1

will produce the output you wanted, but it's the wrong way to do this for three reasons. It removes every \r when you should only remove one at the end of a line. It creates a counter array when you don't need any counts; you just need to test whether the key is present as an array index. And because the test c[$1]>0 references c[$1], it keeps adding entries to that array as it reads file1, using more (probably a lot more) memory than necessary.
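The memory point can be checked in isolation. This small sketch (the array names c and a and the key "x" are just placeholders) compares a value test such as c[key]>0 with a membership test using in, then counts how many elements each array ends up holding.

```shell
counts=$(awk 'BEGIN {
    if (c["x"] > 0) print "unexpected"   # merely referencing c["x"] creates that element
    if ("x" in a)   print "unexpected"   # the in operator only tests; nothing is created
    nc = 0; for (k in c) nc++            # count elements now present in c
    na = 0; for (k in a) na++            # count elements now present in a
    print nc, na
}')
printf '%s\n' "$counts"   # prints: 1 0
```

So every non-matching key from file1 that goes through c[$1]>0 leaves an empty entry behind, while $1 in a leaves the array untouched.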

Your second script:

nawk -F, 'BEGIN{FS=OFS='\t'} FNR==NR{array[$1]; next;} ($1 in array){print$0}' file2 file1

is failing because a) it's not removing the \rs and/or b) your first 2 fields aren't tab-separated. Setting FS to , and then changing it to \t doesn't make sense and in fact you don't need to set it at all given your input data. You don't need {print $0} as that's the default action when a condition is true.
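One further detail that may be worth checking, not covered above, is the shell quoting in that BEGIN block. The single quotes around \t sit inside a single-quoted awk program, so they close and reopen the shell string rather than reaching awk. The sketch below only prints what the shell would hand to awk; the variable names seen and fixed are made up for the demonstration.

```shell
# The inner single quotes end and restart the shell string, so \t is left
# unquoted and the shell reduces it to a plain t before awk ever runs.
seen=$(printf '%s\n' 'BEGIN{FS=OFS='\t'}')
printf '%s\n' "$seen"     # prints: BEGIN{FS=OFS=t}

# Double quotes inside the single-quoted program reach awk intact.
fixed=$(printf '%s\n' 'BEGIN{FS=OFS="\t"}')
printf '%s\n' "$fixed"    # prints: BEGIN{FS=OFS="\t"}
```

In the first form awk sees t as an unset variable, not as a tab, so if you ever do need a tab separator, write it as "\t" inside the program or pass it with -F'\t'.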

Ed Morton
  • Thanks so much for explaining all this. I had to read it several times, but now I get it. One thing left that I do not understand is how you know the `\r`s are there. Is that something that the shell reads at delimiters or end of lines from Windows files? – quantumDog Aug 16 '22 at 16:11
  • I don't know they're there, but if `x` in file2 doesn't match the `x` in a line like `x y` in file1, then that first `x` probably isn't just `x` and instead has some invisible character attached, and `x\r` is the most likely, most frequent problem. Windows uses `\r\n` (`CR LF`) as the newline indicator and Unix uses `\n` (`LF`) alone, so when Windows outputs `x\r\n` it means "`x` followed by newline" to Windows tools, but Unix tools read that as "`x\r` followed by newline" since there's nothing special about the `\r` character in a Unix file: it gets treated the same as `y` in `xy\n`. – Ed Morton Aug 16 '22 at 16:36
  • Why can't you do `{sub(/\r$/"")}` rather than `{sub(/\r$/,"")}`? I can't figure out what the comma does. – quantumDog Aug 17 '22 at 01:17
  • `sub()` is a function and the comma separates its arguments, as is common for functions in most Algol-based languages: C, Pascal, Ada, Java, etc. See `sub()` in the awk man page. – Ed Morton Aug 17 '22 at 13:45
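As a follow-up to the question in the comments about how to tell whether the `\r`s are there, here is one way to check, sketched with inline sample input rather than the real files (the variable names are placeholders). It counts lines ending in a carriage return and shows that a key read from such a line does not compare equal to the clean key.

```shell
# Count the lines that end in a carriage return; a nonzero count
# suggests the input has DOS (CRLF) line endings.
crlf_lines=$(printf '10001\r\n30003\r\n' | awk '/\r$/{n++} END{print n+0}')
printf '%s\n' "$crlf_lines"   # prints: 2

# Without stripping, the first field is "10001\r", which is not "10001".
cmp=$(printf '10001\r\n' | awk '{print (($1 == "10001") ? "same" : "different")}')
printf '%s\n' "$cmp"          # prints: different
```

To run the same check on a real file, replace the printf pipeline with the file name as awk's argument.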