
I have two files, file1 and file2, and I want to delete unmatched rows based on the first column, i.e. ID.

File1

ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
14,LL,JAPAN

File2

ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
16,WW,DUBAI

I want to delete these lines:

From File1
14,LL,JAPAN 

From File2
16,WW,DUBAI

Using the diff command I am able to find these lines, but I want to delete them.

Also, using awk I am able to redirect these lines to another file with the command below:

awk 'FNR==NR{a[$1];next};!($1 in a)' File1 File2 > File3

But I don't know how to delete them from the original files.

Can we do this? Something like:

Using sed -id 's/AWK OUTPUT/g' File1

Thanks

Priyanka
  • Are the files sorted? – Allan Feb 07 '18 at 07:09
  • I am able to delete lines now using sed and awk : sed -i '/'`awk 'FNR==NR{a[$1];next};!($1 in a)' file2 file1`'/d' file1 – Priyanka Feb 07 '18 at 07:09
  • No @Allan, files are not sorted. – Priyanka Feb 07 '18 at 07:10
  • Earlier tried on 2 small sample files and it worked but when i tried on big files having 50K record getting error : sed -i '/'`awk 'FNR==NR{a[$1];next};!($1 in a)' File2 File1`'/d' File1 -bash: /usr/bin/sed: Argument list too long – Priyanka Feb 07 '18 at 07:35
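The "Argument list too long" failure in the last comment comes from substituting the entire awk output into sed's command line, which is capped by the kernel's argument-length limit. A sketch of one way around it (sample data recreated inline; file names are the question's): write the unmatched lines to a temporary file and let grep -v read them as fixed, whole-line patterns.

```shell
set -e
cd "$(mktemp -d)"

# Recreate the question's sample files.
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '12,SS,BERLIN' '13,QQ,FRANCE' '14,LL,JAPAN' > File1
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '12,SS,BERLIN' '13,QQ,FRANCE' '16,WW,DUBAI' > File2

# Lines of File1 whose ID is missing from File2, written to a file
# instead of being spliced into a sed command line.
awk -F',' 'FNR==NR{a[$1];next} !($1 in a)' File2 File1 > unmatched1

# -F: fixed strings, -x: whole-line match, -v: invert -> keep everything else.
grep -v -x -F -f unmatched1 File1 > File1.new && mv File1.new File1

cat File1
```

The same two steps, with File1 and File2 swapped, clean up File2. This never puts the data on a command line, so it scales to 50k-row files.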

3 Answers


You do not even need awk for this operation; grep is more than enough:

$ more file1 file2
::::::::::::::
file1
::::::::::::::
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
14,LL,JAPAN
::::::::::::::
file2
::::::::::::::
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
16,WW,DUBAI

$ grep -f <(grep -oP '^[^,]*,' file1) file2 > new_file2; grep -f <(grep -oP '^[^,]*,' file2) file1 > new_file1

$ more new_file*
::::::::::::::
new_file1
::::::::::::::
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
::::::::::::::
new_file2
::::::::::::::
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE

Explanations:

You use grep -oP '^[^,]*,' to extract the ID together with its trailing comma from each line, then call grep -f again, passing that list of patterns as if it were a file, to filter the other file; this prints only the matching lines. You do the same in the other direction.

However, both files are identical at the end of the process, so you do not actually need to run grep -f <(grep -oP '^[^,]*,' file2) file1 > new_file1
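One caveat with this approach (a sketch with made-up ids, not part of the original data): the extracted patterns such as 11, are unanchored regexes, so they can also match inside a longer id. Prefixing each pattern with ^ avoids this; -P requires GNU grep, as in the answer itself.

```shell
set -e
cd "$(mktemp -d)"

# Hypothetical data where ID 11 is a prefix of ID 111.
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '111,ZZ,OSLO' > file1
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' > file2

# Unanchored: the pattern "11," also matches inside "111,ZZ,OSLO".
grep -c -f <(grep -oP '^[^,]*,' file2) file1                    # prints 3
# Anchored with ^: only the intended rows survive.
grep -c -f <(sed 's/^/^/' <(grep -oP '^[^,]*,' file2)) file1    # prints 2
```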

Another way of processing is to use the following command (cut extracts the ID column of each file, paste interleaves the two sorted lists so that uniq -D can keep only the ids present in both, and grep -F -f then keeps the corresponding rows):

$ grep -F -f <(paste -d'\n' <(cut -d',' -f1 file1 | sort -n) <(cut -d',' -f1 file2 | sort -n) | uniq -D | uniq) file1 > new_file1

$ more new_file1
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
Allan
  • Thanks for the command but i am using AWK to compare both files based on first column i.e. ID – Priyanka Feb 07 '18 at 07:20
  • Is it possible to improve performance for this command, my both files having 50+k records and this command running from last 5 mins and still running.. – Priyanka Feb 07 '18 at 07:28
  • Sorry I didn't know your files were that big, currently the complexity is at least like O(n^2) Let me think of a better solution! – Allan Feb 07 '18 at 07:36
  • Thanks Allan, its working fine but there is two problems, 1st :It will not going to delete lines from files only it will display. 2nd is i am comparing rows based on first column only, if ID is same in both files and name is different then its going to ignore that line also and i want that line in file. – Priyanka Feb 07 '18 at 07:54
  • both issues can be fixed easily! ;-) for the writing part you can redirect the standard output to a new file right or do you need absolutely to delete the content of the initial 2 files? note that the resulting files will be the same – Allan Feb 07 '18 at 07:58
  • both points have been fixed and I think it should go a bit faster as well as my first attempt – Allan Feb 07 '18 at 08:06
  • Allan, if we redirect then new file will contain those lines only which are not matched and problem remain same with original files. So it should delete lines. For 2nd, need to compare lines based on first column only and then delete. – Priyanka Feb 07 '18 at 08:07

Answer using the tee command (notice I have removed the ! sign):

 awk 'FNR==NR{a[$1];next};($1 in a)' File1 File2 | tee File1 

You can use the original awk command you have written with a small change:

1) remove the !, so the script keeps the matching lines instead of the unmatched ones;

2) pipe the awk output back into the file with tee.

Be aware that the commands in a pipeline start concurrently, not one after the other, so tee can truncate File1 before awk has finished reading it; this tends to work for small files but is not safe in general.
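A safer variant of the same idea (a sketch using a temporary file instead of tee, with -F',' added so the comparison really is on the ID column rather than the whole line):

```shell
set -e
cd "$(mktemp -d)"

# Small sample files in the question's format.
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '14,LL,JAPAN' > File1
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '16,WW,DUBAI' > File2

# Keep only File2 rows whose ID appears in File1; write to a temp file
# first so File2 is not truncated while awk might still be reading it.
awk -F',' 'FNR==NR{a[$1];next} ($1 in a)' File1 File2 > File2.new && mv File2.new File2

cat File2
```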

shaiki siegal

One using GNU awk's inplace edit (see link for discussion on versions):

$ awk -i inplace '            # inplace edit modifies the original files
BEGIN { FS="," }              # comma-separated files
NR==FNR {                     # first pass over file1: count each id
    a[$1]++
}
($1 in a) && a[$1]==ARGIND {  # id has appeared in every file read so far
    a[$1]++                   # ... increase count
    print                     # ... and print back to the current file
}' file1 file2 file1

Results:

$ cat file1
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
$ cat file2
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE

Notice: inplace editing modifies the original files, so beware. If you want to test it without making copies of the originals, remove the -i inplace and replace the print with print FILENAME, $0.

On the first pass over file1, each id is counted in a: the first rule sets it to 1, and since a[$1]==ARGIND==1 the second rule then bumps it to 2 and prints the record, so file1 is written back in its original form (assuming there are no duplicate ids). For file2, only the ids already seen in file1 have a[$1]==2==ARGIND, so those are incremented and printed back to file2. The same test then applies to file1 when it is processed a second time (ARGIND==3).

James Brown