Use grep or awk to find matching lines between txt and csv file

Question

I have a csv file that looks like this (named csvfile.csv):

ENSG00000000003.15;4;0;22;21;2;10;0;0;0;0;6;0;8;2;5;15;0;0
ENSG00000000005.6;0;0;0;0;0;1;0;0;0;0;0;0;0;3;1;2;0;0
ENSG00000000419.14;18;57;76;8;12;28;35;5;73;13;50;31;21;92;39;134;31;59
ENSG00000000457.14;3;6;43;2;0;15;6;0;44;8;22;8;2;65;5;22;32;20
ENSG00000000460.17;55;3;14;28;16;7;32;15;8;19;8;18;23;12;23;25;65;94
ENSG00000000938.13;10;89;862;164;5;297;243;1;582;102;492;46;103;251;124;167;109;1108
ENSG00000000971.16;0;0;9;39;0;75;20;0;6;0;18;0;1;18;17;35;0;0
ENSG00000001036.14;87;98;69;32;47;41;35;46;153;72;48;38;32;19;56;103;186;166
ENSG00000001084.13;0;23;60;25;0;29;63;12;41;4;44;0;7;6;2;1;0;18
ENSG00000001167.15;0;0;56;0;13;22;25;0;34;13;23;4;2;87;4;7;32;28
ENSG00000001460.18;22;3;27;15;1;34;11;3;11;5;15;19;5;36;6;18;21;26
ENSG00000001461.17;74;131;159;87;181;101;144;60;347;148;98;65;60;90;96;46;335;224
ENSG00000001497.18;0;0;90;2;0;38;2;0;44;19;21;0;32;40;16;3;0;43

I'm trying to store the counts (numbers behind the gene names) of only certain genes which are stored in a txt file named text.txt:

ENSG00000001036.14
ENSG00000001461.17

I would also like to preserve the sample names. So my desired output is:

ENSG00000001036.14;87;98;69;32;47;41;35;46;153;72;48;38;32;19;56;103;186;166
ENSG00000001461.17;74;131;159;87;181;101;144;60;347;148;98;65;60;90;96;46;335;224

I have tried the following:

awk -f, '$1==text.txt' csvfile.csv > new.csv.

However, this does not give me the desired output. Can anyone help with this problem?

Thanks!

The problem is that your awk command does not really make sense. You use the `-f` option (awk source file) instead of `-F` (field separator). You try to use a comma as field separator instead of a semi-colon. You try to compare the first field of the CSV file with the literal string `text.txt` instead of the content of the file. You did not double-quote `text.txt`, which forces awk to consider it as a variable name and causes a syntax error because of the dot. Finally you output the result to a file named `new.csv.` instead of `new.csv`. — Renaud Pacalet, Nov 16 '21 at 08:02

score 3 · Answer 1 · answered Nov 15 '21 at 20:05

3

You can do:

awk -F ';' '
FNR==NR{idx[$1]; next}
FNR==1 || $1 in idx
' textile.txt csvfile.csv

Prints:

;sample_a;sample_b;sample_c;sample_d
gene_1;1;1;1;1
gene_2;2;2;2;2

answered Nov 15 '21 at 20:05

dawg

98,345
23
131
206

Hi for some reason the code does not work. I thought its possible that my example which I used for simplicity sake is not a good reflection of my actual dataset. Now I changed my example so its more representative. Do you think that your provided code should work or maybe there is another solution for this problem? – bart joosten Nov 15 '21 at 20:34
@bartjoosten If that doesn't work then you almost certainly have DOS line endings in the first file at least - see https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it. This IS the right answer to the question you posted. – Ed Morton Feb 02 '22 at 15:07

score 2 · Answer 2 · answered Nov 15 '21 at 20:16

2

The awk solution by dawg should work. I would add one way with the join command:

$ join -t';' data.csv text.txt
gene_1;1;1;1;1
gene_2;2;2;2;2

answered Nov 15 '21 at 20:16

Kent

189,393
32
233
301

`head -1 data.csv; join -t';' data.csv text.txt` to add the header! – dawg Nov 15 '21 at 20:33
Hi for some reason the code does not work. I thought its possible that my example which I used for simplicity sake is not a good reflection of my actual dataset. Now I changed my example so its more representative. Do you think that your provided code should work or maybe there is another solution for this problem? – bart joosten Nov 15 '21 at 20:34
@bartjoosten: With your changed example, BOTH Kent's and my solution 'work'. Please define what you mean by *the code does not work*? – dawg Nov 15 '21 at 20:36
@bartjoosten pls make sure the `.txt` file doesn't have trailing spaces. – Kent Nov 15 '21 at 20:40

score 2 · Answer 3 · answered Nov 15 '21 at 20:58

2

I'd use

grep -F -f text.txt csvfile.csv

That's assuming the data in text.txt can't inadvertently match elsewhere in the lines of the csv file.

answered Nov 15 '21 at 20:58

glenn jackman

238,783
38
220
352

score 1 · Answer 4 · answered Nov 15 '21 at 20:39

You could store the first field in an array for the first file only using FNR==NR, and then for the file(s) after that print the whole line if the first field is present in the array keys.

Note that your values are separated by ; instead of ,

awk -F ';' '
FNR==NR{a[$1]; next}
$1 in a
' text.txt csvfile.csv > new.csv

Content in new.csv

ENSG00000001036.14;87;98;69;32;47;41;35;46;153;72;48;38;32;19;56;103;186;166
ENSG00000001461.17;74;131;159;87;181;101;144;60;347;148;98;65;60;90;96;46;335;224

Use grep or awk to find matching lines between txt and csv file

4 Answers4