
I have two files:

File1:

ABC123
XYZ123

File2:

ABC123,APPLE
ABC123,BALL
XYZ123,BAT
ABC123,CAT
HJK456,MAT

I want to remove from File2 every line whose pattern appears in File1; that is, I want to remove the ABC123 and XYZ123 lines from File2. To do this I am running the script below.

        while read -r line
        do
            # one sed process, and one full rewrite of File2, per File1 line
            sed -i "/$line/d" File2
        done < File1

After running this script, File2 contains

HJK456,MAT

This script serves my purpose, but I need it to work where File1 has 100,000 entries and File2 has 500,000 entries. I know sed is slow here. Can anyone help me find a command that will do this job faster?
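
For anyone who wants to time alternatives at this scale, here is a minimal sketch for generating sample data of roughly these sizes; the KEY/VAL naming and the modulus are invented purely for benchmarking:

        # 100,000 keys for File1
        seq 1 100000 | sed 's/^/KEY/' > File1
        # 500,000 key,value rows for File2; keys above 100000 have no
        # match in File1, so some rows survive the filter
        seq 1 500000 | awk '{ printf "KEY%d,VAL%d\n", $1 % 120000, $1 }' > File2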

  • "1 lakh"? What is that? – Droppy Jun 02 '16 at 10:15
  • @Droppy: 'lakh' is the Indian currency equivalent for 100,000 :) – Inian Jun 02 '16 at 10:19
  • 1
    First time I've heard a currency used to describe the number of lines in a file before... – Droppy Jun 02 '16 at 10:20
  • @Droppy : It's not a currency, simply a metric like million! – blackSmith Jun 02 '16 at 10:22
  • This is a duplicate. Please search before asking questions – 123 Jun 02 '16 at 10:23
  • I did not find any faster script in my search. That's the reason for asking this question. – Programmer Jun 02 '16 at 10:24
  • look in the grep man page. – 123 Jun 02 '16 at 10:25
  • I mean there are 1,00,000 entries in File1 and 5,00,000 entries in File2. – Programmer Jun 02 '16 at 10:25
  • But grep will be slower than sed, right? – Programmer Jun 02 '16 at 10:25
  • @Programmer what would make you think that? And I don't think anything is slower than running processes for each line in a read while loop. – 123 Jun 02 '16 at 10:27
  • My experience is that sed, awk & grep are the fastest you can use. If you are finding performance problems, I would try rethinking the while loop, but I don't see an obvious answer. – jordi Jun 02 '16 at 10:30
  • Can the tr command help? Yes, as you said, I am rethinking the while loop itself. Next I am thinking of splitting the file into multiple files and running the commands. – Programmer Jun 02 '16 at 10:32
  • 6
    `grep -vf file file2` – 123 Jun 02 '16 at 10:33
  • 1
    Might get bashed for this, but just going to try it! bash `join` and `sort` could help, but am not sure of the performance metrics for it. Can you try grep `join -v 2 <(sort file1) <(sort file2 | cut -d ',' -f 1)` file2. It is actually grep followed by ` join -v 2 <(sort file1) <(sort file2 | cut -d ',' -f 1)` file2 ` and file2 – Inian Jun 02 '16 at 10:34
  • @123's suggestion is more elegant than mine – Inian Jun 02 '16 at 10:35
  • @123 - Thanks for the solution. I think this will serve my purpose. I have set it running now. – Programmer Jun 02 '16 at 11:28
  • @Inian - Thanks for your reply as well. – Programmer Jun 02 '16 at 11:28
  • @123 You should post that as an answer so that Programmer can accept it and mark this question as answered. – Anthony Geoghegan Jun 02 '16 at 13:59
  • @AnthonyGeoghegan You can post it if you want; it already exists on the site though. This should have been closed as a dupe, but I honestly can't be bothered looking for the other one. – 123 Jun 02 '16 at 14:04
  • @123 https://stackoverflow.com/questions/3832988/difference-between-the-content-of-two-files is the best I could find but it's not an exact duplicate and your answer is more elegant than any of those. I'd suggest that Inian post an answer given he/she put so much effort into it already. – Anthony Geoghegan Jun 02 '16 at 14:14
  • I did `grep -Fvf file file2`, which is much faster compared to `grep -vf file file2` (see the sketches below). – Programmer Jun 15 '16 at 15:52
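
Pulling the comments together, a minimal sketch of the fixed-string variant that Programmer confirmed was fastest, using the file names from the question (the temporary-file rename is one common way to apply the result in place, and is not from the comments):

        # -F  treat each File1 line as a fixed string, not a regex
        #     (faster, and safe if a key contains regex metacharacters)
        # -v  invert the match: keep only rows matching no File1 key
        # -f  read the match strings from File1
        grep -Fvf File1 File2 > File2.new && mv File2.new File2

One caveat: grep matches anywhere in the line, so a File1 key that also occurs in the value column would delete that row too; if that can happen, filter on the first field instead, as in the join sketch below.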
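
And a sketch of the `join`/`sort` route from Inian's comment, as reconstructed above: first compute the keys present in File2 but absent from File1, then keep only the rows carrying those keys. The keep_keys file name is invented here, and LC_ALL=C is an assumption to keep sort and join agreeing on byte order:

        # join -v 2 prints lines of its second input that have no match
        # in the first; both inputs must be sorted the same way
        LC_ALL=C join -v 2 <(LC_ALL=C sort File1) \
            <(cut -d ',' -f 1 File2 | LC_ALL=C sort -u) > keep_keys
        # keep only the File2 rows whose key survived
        grep -Ff keep_keys File2

On the sample data this prints HJK456,MAT, matching the expected output.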

0 Answers