Bash: Compare two large text files and get the matching IP addresses

Question

I have two text files: 1.txt has 11,000 IPs and 2.txt has 1 million IPs. I want to match 1.txt against 2.txt and get the lines of 1.txt whose IP also appears in 2.txt.

#1.txt
1,1.1.1.1
2,2.2.2.2
3,3.3.3.3
.........

#2.txt
51.51.6.10
12.10.25.16
1.3.50.55
0.0.0.0
6.6.6.6
1.1.1.1
2.2.2.2
5.5.5.5
6.6.6.6
7.7.7.7
20.200.100.30
...and so on, for 1 million lines of IPs

Expected result:
1,1.1.1.1
2,2.2.2.2

I tried awk -F, 'NR==FNR{a[$0];next}($2 in a)' 2.txt 1.txt. It gives the correct answer on a smaller subset (test runs), but on the original files (11,000 IPs against 1 million) it returns every IP in 1.txt.
I tried sed -n -f <(sed 's|.*|/,&$/p|' 2.txt) 1.txt, but the process is killed automatically.
I tried comm -23 1.txt 2.txt > 3.txt, which again returns all the IPs from 1.txt.

I'm not sure where I'm going wrong. Or is matching against 1 million IPs not possible with sed, awk, comm, or similar tools? Can someone help me find the issue?

Reference Used : http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file

The awk command won't change how it works for larger sets of data... — 123, Feb 17 '16 at 10:24

mauro · Accepted Answer · 2016-02-17T10:52:48.173

Assumption #1: files are sorted as shown in your original question

Assumption #2: ip addresses are unique

If you want just the IP addresses:

$ comm -12 <(cut -d, -f2 1.txt) 2.txt 
1.1.1.1
2.2.2.2

If you want the whole line in 1.txt:

$ comm -12 <(cut -d, -f2 1.txt) 2.txt | while read -r ip ; do grep -Fw -- "$ip" 1.txt ; done
1,1.1.1.1
2,2.2.2.2
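
The loop above starts a separate grep for every matching address. A possible single-pass alternative, sketched here under the same two assumptions, loads the common addresses into an awk array and prints the lines of 1.txt whose second field is an exact key. The sample data and the intermediate file names (ips1.txt, common.txt) are illustrative only, and temporary files are used instead of process substitution so the sketch also runs in plain sh.

```shell
#!/bin/sh
# Illustrative sample data (not the real files); both lists are already sorted.
workdir=$(mktemp -d)
printf '%s\n' '1,1.1.1.1' '2,2.2.2.2' '3,3.3.3.3' > "$workdir/1.txt"
printf '%s\n' '1.1.1.1' '11.1.1.1' '2.2.2.2' '5.5.5.5' > "$workdir/2.txt"

# Step 1: addresses present in both files.
cut -d, -f2 "$workdir/1.txt" > "$workdir/ips1.txt"
LC_ALL=C comm -12 "$workdir/ips1.txt" "$workdir/2.txt" > "$workdir/common.txt"

# Step 2: read common.txt into an array, then print every line of 1.txt
# whose second comma-separated field is exactly one of those addresses.
result=$(awk -F, 'NR == FNR { want[$0]; next } $2 in want' \
    "$workdir/common.txt" "$workdir/1.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Because the lookup compares whole fields, 11.1.1.1 in 2.txt does not cause 1,1.1.1.1 to be reported twice or any partial match to slip through.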

UPDATE

If Assumption #1 does not hold, you have to sort 1.txt and 2.txt on the fly.

This is the solution to get just common IP addresses:

$ comm -12 <(cut -d, -f2 1.txt |sort) <(sort 2.txt) 
1.1.1.1
2.2.2.2

and this will show the full line from 1.txt:

$ comm -12 <(cut -d, -f2 1.txt |sort) <(sort 2.txt) | while read -r ip ; do grep -Fw -- "$ip" 1.txt ; done
1,1.1.1.1
2,2.2.2.2
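
For unsorted inputs, the same full-line result can probably also be produced with join, which merges the two sorted files in one pass instead of running one grep per address. This is only a sketch: the sample data and the temporary file names (1.byip, 2.sorted) are invented for illustration, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
#!/bin/sh
export LC_ALL=C   # same collation for sort and join
workdir=$(mktemp -d)
# Illustrative, deliberately unsorted sample data.
printf '%s\n' '3,3.3.3.3' '1,1.1.1.1' '2,2.2.2.2' > "$workdir/1.txt"
printf '%s\n' '6.6.6.6' '2.2.2.2' '0.0.0.0' '1.1.1.1' '6.6.6.6' > "$workdir/2.txt"

# join needs both inputs sorted on the join field:
# field 2 (the IP) of 1.txt, and the only field of 2.txt.
sort -t, -k2,2 "$workdir/1.txt" > "$workdir/1.byip"
sort -u "$workdir/2.txt" > "$workdir/2.sorted"   # -u drops duplicate IPs

# Print "index,ip" from 1.txt for every IP that also occurs in 2.txt.
result=$(join -t, -1 2 -2 1 -o 1.1,1.2 "$workdir/1.byip" "$workdir/2.sorted")
printf '%s\n' "$result"

rm -rf "$workdir"
```

The output comes back ordered by IP rather than in the original order of 1.txt.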

I also ran a quick test on my small MacBook Air with 1 million IPs in 1.txt and 0.5 million IPs in 2.txt. It takes 19 seconds when the files have to be sorted.
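
One detail about the sort step is worth spelling out: comm expects its inputs in plain lexical order under the collation it runs with, so a file sorted numerically (sort -n) can still be rejected as unsorted. The sketch below uses sort -c to check this on a few made-up addresses. Pinning LC_ALL=C for both sort and comm is a common precaution and an assumption here, not something the answer itself requires.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Made-up addresses chosen so numeric and lexical order differ.
printf '%s\n' '10.0.0.1' '9.0.0.1' '100.0.0.1' > "$workdir/ips.txt"

# Numeric order: 9.0.0.1, 10.0.0.1, 100.0.0.1
LC_ALL=C sort -n "$workdir/ips.txt" > "$workdir/numeric.txt"
# Byte-wise lexical order: 10.0.0.1, 100.0.0.1, 9.0.0.1
LC_ALL=C sort "$workdir/ips.txt" > "$workdir/lexical.txt"

first_numeric=$(head -n 1 "$workdir/numeric.txt")
first_lexical=$(head -n 1 "$workdir/lexical.txt")

# sort -c exits non-zero when the file is not in the order sort would produce,
# which is the order comm relies on.
if LC_ALL=C sort -c "$workdir/numeric.txt" 2>/dev/null; then numeric_ok=yes; else numeric_ok=no; fi
if LC_ALL=C sort -c "$workdir/lexical.txt" 2>/dev/null; then lexical_ok=yes; else lexical_ok=no; fi

echo "numeric sort usable by comm: $numeric_ok"
echo "lexical sort usable by comm: $lexical_ok"

rm -rf "$workdir"
```

If a file was prepared with sort -n beforehand, re-sorting it with a plain sort inside the pipeline, as in the UPDATE commands, avoids the mismatch.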

edited Feb 17 '16 at 10:52

answered Feb 17 '16 at 10:26


When getting the whole line in 1.txt, I get this status: comm : file 2 is not in sorted order. comm : file 2 is not in sorted order. But I have already sorted the files (#sort nu). – Arun Feb 17 '16 at 10:42
I tried it again, @mauro. I just get empty results when I redirect to a file and check it. I'm running it on an Ubuntu 12.04 machine. – Arun Feb 17 '16 at 11:01
@Arun. Check your input files... `head -n 5 1.txt | od -c` and similar for 2.txt – mauro Feb 17 '16 at 11:07
Checking against #1.txt @mauro, I got:
0000000 3 , 5 0 . 1 6 . 2 2 4 . 1 0 7 \n
0000020 5 , 9 5 . 1 7 0 . 7 2 . 1 1 3 \n
0000040 6 , 2 1 2 . 8 4 . 6 6 . 1 9 \n 7
0000060 , 2 1 3 . 1 7 9 . 5 4 . 1 1 \n 9
0000100 , 1 1 8 . 1 2 7 . 4 5 . 4 5 \n
0000117
– Arun Feb 17 '16 at 11:16
Checking against #2.txt @mauro, I got this:
0000000 1 . 5 2 . 3 5 . 1 6 \n 1 . 2 3
0000020 4 . 5 1 . 2 3 0 \n 4 . 2 6 . 6
0000040 7 . 1 1 2 \n 4 . 2 6 . 2 0 9 .
0000060 2 5 0 \n 4 . 5 3 . 1 1 1 . 3 3
0000100 \n
0000102
– Arun Feb 17 '16 at 11:17
@Arun. They seem OK. The only explanation for empty results is that there are no common IPs. Check with a "controlled test env" where you know the expected result in advance. – mauro Feb 17 '16 at 11:37
I will check once again, @mauro. I also tried creating separate database tables and wrote a JOIN query to match the IPs in one table (1.txt) against the other (2.txt). Frankly, the DB crashed while rendering the matches, so I wanted to find a way to do it in bash. Anyway, thanks for your help! I will check it under a "controlled test env". – Arun Feb 17 '16 at 11:46
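
A controlled test of the kind suggested in the comments could look roughly like this: build two small files whose overlap is known in advance, run the sorted comm pipeline on them, and compare the output with the expected list. The addresses reuse the sample from the question, and the file names (a.sorted, b.sorted, expected.txt, actual.txt) are invented for the test.

```shell
#!/bin/sh
export LC_ALL=C   # same collation for sort and comm
workdir=$(mktemp -d)

# 1.txt uses the "index,ip" layout; 2.txt is a bare, unsorted list with one duplicate.
printf '%s\n' '1,1.1.1.1' '2,2.2.2.2' '3,3.3.3.3' > "$workdir/1.txt"
printf '%s\n' '51.51.6.10' '6.6.6.6' '1.1.1.1' '2.2.2.2' '6.6.6.6' > "$workdir/2.txt"
# The overlap we expect, written by hand in sorted order.
printf '%s\n' '1.1.1.1' '2.2.2.2' > "$workdir/expected.txt"

# Same pipeline as in the answer, with temporary files instead of <( ... ).
cut -d, -f2 "$workdir/1.txt" | sort -u > "$workdir/a.sorted"
sort -u "$workdir/2.txt" > "$workdir/b.sorted"
comm -12 "$workdir/a.sorted" "$workdir/b.sorted" > "$workdir/actual.txt"

# cmp -s is silent and only reports through its exit status.
if cmp -s "$workdir/expected.txt" "$workdir/actual.txt"; then
    status=PASS
else
    status=FAIL
fi
echo "controlled test: $status"

rm -rf "$workdir"
```

If this passes but the real files still produce nothing, the difference is more likely in the data (ordering, stray characters, no real overlap) than in the commands.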
