AWK : matching two files by first column doesn't work

Question

I have little issue using awk which start to give me white hairs..

I have two files with different contents, but the first column value is the same.

Files are generated during a script as csv file (semicolon separated) and lokks like so:

main_file.csv
---
151597-21;151597;21;3;15;;"Vente OK";excluded
151598-21;151598;21;3;15;;"Vente OK";excluded
151599-0;151599;0;0;10;;;programmed
151600-0;151600;0;0;10;;;programmed
151601-0;151601;0;0;10;;;programmed
151602-0;151602;0;0;10;;;programmed
151603-0;151603;0;0;10;;;programmed
151604-0;151604;0;0;10;;;programmed
151605-0;151605;0;0;10;;;programmed
151606-0;151606;0;0;10;;;programmed
151607-0;151607;0;0;10;;;programmed
...
151622-0;151622;0;0;10;;;programmed
151623-0;151623;0;0;10;;;programmed
151624-0;151624;0;0;10;;;programmed
151625-0;151625;0;0;10;;;programmed
...


filter_file.csv
---
151622-0;151622;0

I want to compare those two files and create a third one that containts lines from "main_file.csv" that matches the ones in the "filter_file.csv" using the first column value as comparaison.

As the exemple shows, I should get a "result_file.csv" with one line in it, unfortunately I get an empty file.

The experted output should be:

151622-0;151622;0;0;10;;;programmed

Here is the commmand I tried:

awk 'BEGIN {FS=OFS=";"} NR==FNR{a[$1]=1; next} a[$1]{print}' filter_file.csv main_file.csv > result_file.csv

If I'm understood awk correctly, it should explain as this:

awk '                           # starting awk program
BEGIN {FS=OFS=";"}              # define column separator as commat for both files (main & filter)
NR==FNR{a[$1]=1; next}          # during read of the first file (filter_file.csv), create an array 'a' with first column value as index
a[$1]{print}                    # during read of the second file (main_file.csv), if first column value exist as an index of the array 'a', print the whole line in the 'result_file.csv'
' 
filter_file.csv main_file.csv   # files to be compared
> result_file.csv               # direct the output to the third file

but I'm afraid I missed something :/

EDIT: update to add a bit of context:

the command is executed from a php script like so:

$awk_cmd = 'awk \'BEGIN {FS=OFS=";"} NR==FNR {a[$1]=1; next} $1 in a {print}\' ' . $filter_file . ' ' . $ref_file . ' > ' . $match_file;
exec($awk_cmd);

where $filter_file, $ref_file and $match_file are the full path of the files.

EDIT2 :
I tested the grep command and get the following output:

0000000   1   5   1   6   2   2   -   0   ;   1   5   1   6   2   2   ;
0000020   0   ;   0   ;   1   0   ;   ;   ;   p   r   o   g   r   a   m
0000040   m   e   d  \r  \n
0000045

fwiw: `a[$1]{print}` will create a new array entry (with value of 0) if one does not exist; while this shouldn't affect functionality (in this case) it will require more memory; `$1 in a` should provide the same benefit (if $1 is an index of array `a[]` then print) without the additional memory usage — markp-fuso, Aug 22 '23 at 15:48
when I run your `awk` script I get the expected output (single line); not sure how you get nothing unless, perhaps, there are some non-printing characters in the data files? may be of help to look at the output from `grep -h 151622-0 filter_file.csv main_file.csv | od -c` ... looking for anything other than the printable characters and `\n` — markp-fuso, Aug 22 '23 at 15:53
setting aside the `php` code for a minute ... if you cut-n-paste your `awk` script from this question and run it at the (bash) command prompt ... does it generate any output? — markp-fuso, Aug 22 '23 at 15:57
the `od -c` output shows no issues that should cause no output; as for the syntax error ... *a)* I don't get that when cutting-n-pasting your `awk` script into my env and *b)* from your comment it looks like the semicolon is not wrapped in double quotes (unlike the script in the question) ... ? — markp-fuso, Aug 22 '23 at 16:04
just curious ... what is your `awk` version? (ie, what is the output from `awk --version`)? — markp-fuso, Aug 22 '23 at 16:05
I suspect your problem is about generally how to call commands from PHP, nothing to do with your awk script - you just happen to be calling awk but it could be sed or perl or anything else. You probably need to escape additional characters or use additional and/or different quotes or something. Google "how to call awk from PHP" or similar. — Ed Morton, Aug 22 '23 at 16:06
` \r \n` ... That is a DOS file, clean all your files with `dos2unix filter_file.csv main_file.csv` Good luck! — shellter, Aug 22 '23 at 22:52
@shellter I think this does not matter here. We don't care about DOS or UNIX as long as in both formats the last byte is `\n` and we are interested only in the first field. Unfortunately this question was closed as duplicate for the wrong reason. — Renaud Pacalet, Aug 23 '23 at 09:15
I can't reproduce the problem; `exec($awk_cmd);` works for me with the given input files — Fravadona, Aug 23 '23 at 18:54

Renaud Pacalet · Answer 1 · 2023-08-23T09:00:03.930

2

The format of your files is apparently DOS (\r\n) but this should not be a problem here. The following assumes that your csv files are simple ones (no multi-lines records, no quoted fields with ; in them...). The example outputs are prefixed with -| .

With any POSIX awk:

awk -F';' 'NR==FNR {a[$1];next} $1 in a' filter_file.csv main_file.csv
-| 151622-0;151622;0;0;10;;;programmed

-F';' defines ; as the input field separator. While parsing the first file (NR==FNR is true only for the first file) store first field ($1) as key of array a and move to next line. While parsing the second file, if first field is a key of array a ($1 in a), print the line (default action).

You could also use sort and join:

join -t';' -o 1.{1..8} <( sort -t';' main_file.csv ) <( sort -t';' filter_file.csv )
-| 151622-0;151622;0;0;10;;;programmed

edited Aug 23 '23 at 09:00

answered Aug 23 '23 at 06:37

Renaud Pacalet

25,260
3
34
51

1

I can't add a complete anwser as somebody close the question but the complete solution was to first encode the files as @shelter suggested , then run the awk command (which I modify to set the seprateor as you suggested (not using BEGIN with double quote). Thanks everybody! – Dexter0015 Aug 23 '23 at 09:09
1

Encoding the files is not needed here, you don't really care about the line ending as long as `\r\n` ends with `\n`. Said differently, in DOS mode your lines are 1 byte longer than in UNIX mode, but as you are interested only in the first field... – Renaud Pacalet Aug 23 '23 at 09:12
I'm confused, we deploy the code on our test server (ubuntu) and it doesn't work either way (with or without encoding) – Dexter0015 Aug 23 '23 at 14:17
Did you test the `awk` script alone on this test server (without the PHP wrapper)? – Renaud Pacalet Aug 23 '23 at 14:20
yep, same result: empty file – Dexter0015 Aug 23 '23 at 14:26
Interesting. Same data files and same version of `awk`? – Renaud Pacalet Aug 23 '23 at 14:46
no. local version : GNU Awk 5.0.0, API: 2.0 (GNU MPFR 4.1.0, GNU MP 6.2.1) / server version : GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0) - files generated by the exact same request. – Dexter0015 Aug 23 '23 at 14:54
I suggest that you compare the generated data files between the working and non-working cases. Even if they are generated by the exact same request (to what?) they could be different. – Renaud Pacalet Aug 23 '23 at 15:03
I'm off for two days, i'll try to rewrite the question to reopen it on monday. – Dexter0015 Aug 23 '23 at 15:39

score 0 · Accepted Answer · answered Aug 28 '23 at 07:52

First thanks everybody for your suggestions!

I anwser myself, as none of the answers fully address the issue, though the complete solution is a mix between @shelter encoding suggestion and @Renaud Pacalet rewrite of the awk command.

Something I didn't specified in my question (because at the time I didn't though it would have any impact) is that I tested the code on a local environnement on Windows 10 and that's part of the issue as @shelter suggested...

So before executing the awk command, I add a step in my code to convert the files to compare:

dos2unix filter_file.csv main_file.csv

I also modified how I specify the column separator for awf as @Renaud Pacalet suggested (rather , which give me the following command:

awk -F';' 'NR==FNR {a[$1]=1; next} $1 in a {print}' filter_file.csv main_file.csv > result_file.csv

And this two changes combines give me the correct result.

In my php script it give something like this:

$convert = 'dos2unix  ' . $filter_file . ' ' . $ref_file;
exec($convert);

$awk_cmd = 'awk -F\';\' \'NR==FNR {a[$1]=1; next} $1 in a {print}\' ' . $filter_file . ' ' . $ref_file . ' > ' . $match_file;
exec($awk_cmd);

where files are called using variables as it contain full path to it.

Once thoses chonges were apply, everyting worked on my local server, BUT, one pushed on the test server (ubuntu) we still get the error. Turn out dos2unix was not installed on the test server... Onece installed, everything works as intended.

Still, I don't get why I only get this issue when the filter file countain only one line. When the filter file countain several lines, I never encoutered the issue, strange...

AWK : matching two files by first column doesn't work

2 Answers2