
I have two files, file1 and file2, and I want to delete unmatched rows based on the first column, i.e. ID.

File1

ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
14,LL,JAPAN

File2

ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
16,WW,DUBAI

I want to delete these lines:

From File1
14,LL,JAPAN 

From File2
16,WW,DUBAI

Using the diff command I am able to find these lines, but I want to delete them.

Also, using awk I am able to redirect these lines to another file with the command below:

awk 'FNR==NR{a[$1];next};!($1 in a)' File1 File2 > File3

But I don't know how to delete them from the original files.

Can we do this? Something like:

Using sed -id 's/AWK OUTPUT/g' File1

Thanks

Priyanka
  • Are the files sorted? – Allan Feb 07 '18 at 07:09
  • I am able to delete lines now using sed and awk : sed -i '/'`awk 'FNR==NR{a[$1];next};!($1 in a)' file2 file1`'/d' file1 – Priyanka Feb 07 '18 at 07:09
  • No @Allan, files are not sorted. – Priyanka Feb 07 '18 at 07:10
  • Earlier tried on 2 small sample files and it worked but when i tried on big files having 50K record getting error : sed -i '/'`awk 'FNR==NR{a[$1];next};!($1 in a)' File2 File1`'/d' File1 -bash: /usr/bin/sed: Argument list too long – Priyanka Feb 07 '18 at 07:35
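The "Argument list too long" failure in the last comment comes from substituting the entire awk output into sed's command line, which is capped by the kernel's argument-length limit. A sketch of one way around it (sample data recreated inline; file names are the question's): write the unmatched lines to a temporary file and let grep -v read them as fixed, whole-line patterns.

```shell
set -e
cd "$(mktemp -d)"

# Recreate the question's sample files.
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '12,SS,BERLIN' '13,QQ,FRANCE' '14,LL,JAPAN' > File1
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '12,SS,BERLIN' '13,QQ,FRANCE' '16,WW,DUBAI' > File2

# Lines of File1 whose ID is missing from File2, written to a file
# instead of being spliced into a sed command line.
awk -F',' 'FNR==NR{a[$1];next} !($1 in a)' File2 File1 > unmatched1

# -F: fixed strings, -x: whole-line match, -v: invert -> keep everything else.
grep -v -x -F -f unmatched1 File1 > File1.new && mv File1.new File1

cat File1
```

The same two steps, with File1 and File2 swapped, clean up File2. This never puts the data on a command line, so it scales to 50k-row files.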

3 Answers


You do not even need awk for this operation; grep is more than enough:

$ more file1 file2
::::::::::::::
file1
::::::::::::::
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
14,LL,JAPAN
::::::::::::::
file2
::::::::::::::
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
16,WW,DUBAI

$ grep -f <(grep -oP '^[^,]*,' file1) file2 > new_file2; grep -f <(grep -oP '^[^,]*,' file2) file1 > new_file1

$ more new_file*
::::::::::::::
new_file1
::::::::::::::
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
::::::::::::::
new_file2
::::::::::::::
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE

Explanations:

You use grep -oP '^[^,]*,' to extract the ID together with its trailing comma from each line, then call grep -f again, passing that list of patterns as if it were a file, to filter the other file; this prints only the matching lines. You do the same in the other direction.

However, both files are identical at the end of the process, so you do not actually need to run grep -f <(grep -oP '^[^,]*,' file2) file1 > new_file1
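One caveat with this approach (a sketch with made-up ids, not part of the original data): the extracted patterns such as 11, are unanchored regexes, so they can also match inside a longer id. Prefixing each pattern with ^ avoids this; -P requires GNU grep, as in the answer itself.

```shell
set -e
cd "$(mktemp -d)"

# Hypothetical data where ID 11 is a prefix of ID 111.
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '111,ZZ,OSLO' > file1
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' > file2

# Unanchored: the pattern "11," also matches inside "111,ZZ,OSLO".
grep -c -f <(grep -oP '^[^,]*,' file2) file1                    # prints 3
# Anchored with ^: only the intended rows survive.
grep -c -f <(sed 's/^/^/' <(grep -oP '^[^,]*,' file2)) file1    # prints 2
```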

Another way of processing is to use the following command (cut extracts the ID column of each file, paste interleaves the two sorted lists so that uniq -D can keep only the ids present in both, and grep -F -f then keeps the corresponding rows):

$ grep -F -f <(paste -d'\n' <(cut -d',' -f1 file1 | sort -n) <(cut -d',' -f1 file2 | sort -n) | uniq -D | uniq) file1 > new_file1

$ more new_file1
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
Allan
  • Thanks for the command but i am using AWK to compare both files based on first column i.e. ID – Priyanka Feb 07 '18 at 07:20
  • Is it possible to improve performance for this command, my both files having 50+k records and this command running from last 5 mins and still running.. – Priyanka Feb 07 '18 at 07:28
  • Sorry I didn't know your files were that big, currently the complexity is at least like O(n^2) Let me think of a better solution! – Allan Feb 07 '18 at 07:36
  • Thanks Allan, its working fine but there is two problems, 1st :It will not going to delete lines from files only it will display. 2nd is i am comparing rows based on first column only, if ID is same in both files and name is different then its going to ignore that line also and i want that line in file. – Priyanka Feb 07 '18 at 07:54
  • both issues can be fixed easily! ;-) for the writing part you can redirect the standard output to a new file right or do you need absolutely to delete the content of the initial 2 files? note that the resulting files will be the same – Allan Feb 07 '18 at 07:58
  • both points have been fixed and I think it should go a bit faster as well as my first attempt – Allan Feb 07 '18 at 08:06
  • Allan, if we redirect then new file will contain those lines only which are not matched and problem remain same with original files. So it should delete lines. For 2nd, need to compare lines based on first column only and then delete. – Priyanka Feb 07 '18 at 08:07

Answer using the tee command (notice I have removed the ! sign):

 awk 'FNR==NR{a[$1];next};($1 in a)' File1 File2 | tee File1 

You can use the original awk command you have written with a small change:

1) remove the !, so the script keeps the matching lines instead of the unmatched ones;

2) pipe the awk output back into the file with tee.

Be aware that the commands in a pipeline start concurrently, not one after the other, so tee can truncate File1 before awk has finished reading it; this tends to work for small files but is not safe in general.
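A safer variant of the same idea (a sketch using a temporary file instead of tee, with -F',' added so the comparison really is on the ID column rather than the whole line):

```shell
set -e
cd "$(mktemp -d)"

# Small sample files in the question's format.
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '14,LL,JAPAN' > File1
printf '%s\n' 'ID,NAME,ADDRESS' '11,PP,LONDON' '16,WW,DUBAI' > File2

# Keep only File2 rows whose ID appears in File1; write to a temp file
# first so File2 is not truncated while awk might still be reading it.
awk -F',' 'FNR==NR{a[$1];next} ($1 in a)' File1 File2 > File2.new && mv File2.new File2

cat File2
```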

shaiki siegal

One using GNU awk's inplace edit (see link for discussion on versions):

$ awk -i inplace '            # inplace edit modifies the original files
BEGIN { FS="," }              # comma-separated files
NR==FNR {                     # first pass over file1: count each id
    a[$1]++
}
($1 in a) && a[$1]==ARGIND {  # id has appeared in every file read so far
    a[$1]++                   # ... increase count
    print                     # ... and print back to the current file
}' file1 file2 file1

Results:

$ cat file1
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE
$ cat file2
ID,NAME,ADDRESS
11,PP,LONDON
12,SS,BERLIN
13,QQ,FRANCE

Notice: inplace editing modifies the original files, so beware. If you want to test it without making copies of the originals, remove the -i inplace and replace the print with print FILENAME, $0.

On the first pass over file1, each id is counted in a: the first rule sets it to 1, and since a[$1]==ARGIND==1 the second rule then bumps it to 2 and prints the record, so file1 is written back in its original form (assuming there are no duplicate ids). For file2, only the ids already seen in file1 have a[$1]==2==ARGIND, so those are incremented and printed back to file2. The same test then applies to file1 when it is processed a second time (ARGIND==3).

James Brown