
I have two files:

File1:

ABC123
XYZ123

File2:

ABC123,APPLE
ABC123,BALL
XYZ123,BAT
ABC123,CAT
HJK456,MAT

I want to remove from File2 every line whose pattern appears in File1; that is, I want to remove the ABC123 and XYZ123 lines from File2. To do this I am running the script below.

        while read -r line
        do
            # one sed process, and one full rewrite of File2, per File1 line
            sed -i "/$line/d" File2
        done < File1

After running this script, File2 contains

HJK456,MAT

This script serves my purpose, but I need it to work where File1 has 100,000 entries and File2 has 500,000 entries. I know sed is slow here. Can anyone help me find a command that will do this job faster?
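
For anyone who wants to time alternatives at this scale, here is a minimal sketch for generating sample data of roughly these sizes; the KEY/VAL naming and the modulus are invented purely for benchmarking:

        # 100,000 keys for File1
        seq 1 100000 | sed 's/^/KEY/' > File1
        # 500,000 key,value rows for File2; keys above 100000 have no
        # match in File1, so some rows survive the filter
        seq 1 500000 | awk '{ printf "KEY%d,VAL%d\n", $1 % 120000, $1 }' > File2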

  • "1 lakh"? What is that? – Droppy Jun 02 '16 at 10:15
  • @Droppy: 'lakh' is the Indian currency equivalent for 100,000 :) – Inian Jun 02 '16 at 10:19
  • 1
    First time I've heard a currency used to describe the number of lines in a file before... – Droppy Jun 02 '16 at 10:20
  • @Droppy : It's not a currency, simply a metric like million! – blackSmith Jun 02 '16 at 10:22
  • This is a duplicate. Please search before asking questions – 123 Jun 02 '16 at 10:23
  • I did not find any faster script in my search. That's the reason for asking this question. – Programmer Jun 02 '16 at 10:24
  • look in the grep man page. – 123 Jun 02 '16 at 10:25
  • I mean there are 1,00,000 entries in File1 and 5,00,000 entries in File2. – Programmer Jun 02 '16 at 10:25
  • But grep will be slower than sed, right? – Programmer Jun 02 '16 at 10:25
  • @Programmer what would make you think that? And I don't think anything is slower than running processes for each line in a read while loop. – 123 Jun 02 '16 at 10:27
  • My experience is that sed, awk & grep are the fastest you can use. If you are finding performance problems, I would try rethinking the while loop, but I don't see an obvious answer. – jordi Jun 02 '16 at 10:30
  • Can the tr command help? Yes, as you said, I am rethinking the while loop itself. Next I am thinking of splitting the file into multiple files and running the commands. – Programmer Jun 02 '16 at 10:32
  • 6
    `grep -vf file file2` – 123 Jun 02 '16 at 10:33
  • 1
    Might get bashed for this, but just going to try it! bash `join` and `sort` could help, but am not sure of the performance metrics for it. Can you try grep `join -v 2 <(sort file1) <(sort file2 | cut -d ',' -f 1)` file2. It is actually grep followed by ` join -v 2 <(sort file1) <(sort file2 | cut -d ',' -f 1)` file2 ` and file2 – Inian Jun 02 '16 at 10:34
  • @123's suggestion is more elegant than mine – Inian Jun 02 '16 at 10:35
  • @123 - Thanks for the solution. I think this will serve my purpose. I have set it running now. – Programmer Jun 02 '16 at 11:28
  • @Inian - Thanks for your reply as well. – Programmer Jun 02 '16 at 11:28
  • @123 You should post that as an answer so that Programmer can accept it and mark this question as answered. – Anthony Geoghegan Jun 02 '16 at 13:59
  • @AnthonyGeoghegan You can post it if you want; it already exists on the site though. This should have been closed as a dupe, but I honestly can't be bothered looking for the other one. – 123 Jun 02 '16 at 14:04
  • @123 https://stackoverflow.com/questions/3832988/difference-between-the-content-of-two-files is the best I could find but it's not an exact duplicate and your answer is more elegant than any of those. I'd suggest that Inian post an answer given he/she put so much effort into it already. – Anthony Geoghegan Jun 02 '16 at 14:14
  • I did `grep -Fvf file file2`, which is much faster compared to `grep -vf file file2` (see the sketches below). – Programmer Jun 15 '16 at 15:52
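
Pulling the comments together, a minimal sketch of the fixed-string variant that Programmer confirmed was fastest, using the file names from the question (the temporary-file rename is one common way to apply the result in place, and is not from the comments):

        # -F  treat each File1 line as a fixed string, not a regex
        #     (faster, and safe if a key contains regex metacharacters)
        # -v  invert the match: keep only rows matching no File1 key
        # -f  read the match strings from File1
        grep -Fvf File1 File2 > File2.new && mv File2.new File2

One caveat: grep matches anywhere in the line, so a File1 key that also occurs in the value column would delete that row too; if that can happen, filter on the first field instead, as in the join sketch below.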
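
And a sketch of the `join`/`sort` route from Inian's comment, as reconstructed above: first compute the keys present in File2 but absent from File1, then keep only the rows carrying those keys. The keep_keys file name is invented here, and LC_ALL=C is an assumption to keep sort and join agreeing on byte order:

        # join -v 2 prints lines of its second input that have no match
        # in the first; both inputs must be sorted the same way
        LC_ALL=C join -v 2 <(LC_ALL=C sort File1) \
            <(cut -d ',' -f 1 File2 | LC_ALL=C sort -u) > keep_keys
        # keep only the File2 rows whose key survived
        grep -Ff keep_keys File2

On the sample data this prints HJK456,MAT, matching the expected output.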

0 Answers