I have two text files: 1.txt has 11,000 IPs and 2.txt has 1 million IPs. I want to match 1.txt against 2.txt and get the lines of 1.txt whose IP also appears in 2.txt.

#1.txt
1,1.1.1.1
2,2.2.2.2
3,3.3.3.3
.........

#2.txt
51.51.6.10
12.10.25.16
1.3.50.55
0.0.0.0
6.6.6.6
1.1.1.1
2.2.2.2
5.5.5.5
6.6.6.6
7.7.7.7
20.200.100.30
Likewise, 1 million lines of IPs.......

Matching result:
1,1.1.1.1
2,2.2.2.2
  1. I tried awk -F, 'NR==FNR{a[$0];next}($2 in a)' 2.txt 1.txt. It gives the correct answer on a small subset (test runs), but against the original files (11,000 IPs against 1 million) it returns every IP in 1.txt.

  2. I tried sed -n -f <(sed 's|.*|/,&$/p|' 2.txt) 1.txt. The process is killed automatically.

  3. I tried comm -23 1.txt 2.txt > 3.txt. Again, it returns all the IPs from 1.txt.

I'm not sure where I'm making a mistake. Is matching against 1 million IPs not possible with sed, awk, comm, or similar tools? Can someone help me figure out what the issue is?

Reference used: http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file
Arun

1 Answer

Assumption #1: the files are sorted, as shown in your original question

Assumption #2: IP addresses are unique

If you want just the IP addresses:

$ comm -12 <(cut -d, -f2 1.txt) 2.txt 
1.1.1.1
2.2.2.2

If you want the whole line in 1.txt:

$ comm -12 <(cut -d, -f2 1.txt) 2.txt | while read -r ip ; do grep ",$ip\$" 1.txt ; done
1,1.1.1.1
2,2.2.2.2
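
The loop above starts one grep per matching IP. As a minimal sketch of a single-pass alternative, awk can read the common IPs from comm on standard input, keep them in an array, and then print each line of 1.txt whose second field is exactly one of them, so 1.1.1.1 cannot match a line ending in 41.1.1.10. The sample files and the names dir, want, and full_lines are made up for illustration, and the inputs are assumed to be sorted already (Assumption #1).

```shell
# Hypothetical sample data standing in for the real 1.txt and 2.txt
# (both already sorted, as in Assumption #1).
dir=$(mktemp -d)
printf '%s\n' '1,1.1.1.1' '2,2.2.2.2' '3,3.3.3.3' '4,41.1.1.10' > "$dir/1.txt"
printf '%s\n' '1.1.1.1' '2.2.2.2' '5.5.5.5' > "$dir/2.txt"

# comm -12 prints the IPs present in both inputs ("-" is standard input).
# awk then reads that list first (NR == FNR) and stores each IP as a key;
# for 1.txt it prints a line only when field 2 is one of the stored keys.
full_lines=$(cut -d, -f2 "$dir/1.txt" \
  | comm -12 - "$dir/2.txt" \
  | awk -F, 'NR == FNR { want[$0]; next } $2 in want' - "$dir/1.txt")

printf '%s\n' "$full_lines"
```

With this sample data the sketch prints lines 1 and 2 of 1.txt and skips 4,41.1.1.10, even though that line contains 1.1.1.1 as a substring.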

UPDATE

If Assumption #1 does not hold, you have to sort 1.txt and 2.txt on the fly.

This gets just the common IP addresses:

$ comm -12 <(cut -d, -f2 1.txt |sort) <(sort 2.txt) 
1.1.1.1
2.2.2.2
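
One thing worth checking here: comm expects both inputs in the collation order of the current locale, so a numeric sort (sort -n) or a sort run under a different locale can produce "file N is not in sorted order" warnings and unreliable output. A minimal sketch that pins sort and comm to the same bytewise order with LC_ALL=C follows. The sample data and the names dir and common are made up for illustration, and sort -u is used only to enforce Assumption #2.

```shell
# Hypothetical, deliberately unsorted sample data with one duplicate.
dir=$(mktemp -d)
printf '%s\n' '1,9.9.9.9' '2,10.0.0.1' '3,2.2.2.2' > "$dir/1.txt"
printf '%s\n' '6.6.6.6' '10.0.0.1' '9.9.9.9' '6.6.6.6' > "$dir/2.txt"

# Sort both inputs bytewise with duplicates removed, then compare them
# under the same locale so that sort and comm agree on the ordering.
cut -d, -f2 "$dir/1.txt" | LC_ALL=C sort -u > "$dir/1.sorted"
LC_ALL=C sort -u "$dir/2.txt" > "$dir/2.sorted"
common=$(LC_ALL=C comm -12 "$dir/1.sorted" "$dir/2.sorted")

printf '%s\n' "$common"
```

Here 10.0.0.1 sorts before 9.9.9.9 because the comparison is lexical, which is the order comm needs. The intermediate files stand in for the <( ) process substitutions so that the sketch also runs in a plain sh.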

and this will show the full line from 1.txt:

$ comm -12 <(cut -d, -f2 1.txt |sort) <(sort 2.txt) | while read -r ip ; do grep ",$ip\$" 1.txt ; done
1,1.1.1.1
2,2.2.2.2

I also ran a quick test on my small MacBook Air using 1 million IPs in 1.txt and 0.5 million IPs in 2.txt. It takes 19 seconds when the files have to be sorted.

mauro
  • When getting the whole line in 1.txt, I get the status "comm: file 2 is not in sorted order" twice. But I have already sorted the files (#sort nu). – Arun Feb 17 '16 at 10:42
  • I tried it again, @mauro. I just get empty results when I redirect the output to a file and check it. I'm running it on an Ubuntu 12.04 machine. – Arun Feb 17 '16 at 11:01
  • @Arun. Check your input files... `head -n 5 1.txt | od -c` and similar for 2.txt – mauro Feb 17 '16 at 11:07
  • Checking #1.txt, @mauro, I got: 0000000 3 , 5 0 . 1 6 . 2 2 4 . 1 0 7 \n 0000020 5 , 9 5 . 1 7 0 . 7 2 . 1 1 3 \n 0000040 6 , 2 1 2 . 8 4 . 6 6 . 1 9 \n 7 0000060 , 2 1 3 . 1 7 9 . 5 4 . 1 1 \n 9 0000100 , 1 1 8 . 1 2 7 . 4 5 . 4 5 \n 0000117 – Arun Feb 17 '16 at 11:16
  • Checking #2.txt, @mauro, I got this: 0000000 1 . 5 2 . 3 5 . 1 6 \n 1 . 2 3 0000020 4 . 5 1 . 2 3 0 \n 4 . 2 6 . 6 0000040 7 . 1 1 2 \n 4 . 2 6 . 2 0 9 . 0000060 2 5 0 \n 4 . 5 3 . 1 1 1 . 3 3 0000100 \n 0000102 – Arun Feb 17 '16 at 11:17
  • @Arun. They seem OK. The only explanation for empty results is that there are no common IPs. Check with a "controlled test env" where you know the expected result in advance. – mauro Feb 17 '16 at 11:37
  • I will check once again, @mauro. I tried creating separate database tables and wrote a JOIN query to match the IPs in one table (1.txt) against the other (2.txt). Frankly speaking, the DB crashed while rendering the matches, so I wanted to find a way to do it in bash. Anyway, thanks for your help indeed! I will check it under a "controlled test env". – Arun Feb 17 '16 at 11:46
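
A "controlled test env" of the kind suggested in the comments could be sketched as follows: generate two files whose overlap is known in advance, run the sorted comm pipeline, and compare the match count with the expected number. The address ranges, file sizes, and the names dir and found are arbitrary choices for illustration, not values from the thread.

```shell
dir=$(mktemp -d)

# 1.txt: 1000 numbered, unique IPs in 10.0.x.y.
awk 'BEGIN {
  for (i = 0; i < 1000; i++)
    printf "%d,10.0.%d.%d\n", i + 1, int(i / 250), i % 250
}' > "$dir/1.txt"

# 2.txt: 5000 unique IPs, of which exactly 200 also occur in 1.txt.
awk 'BEGIN {
  for (i = 0; i < 200; i++)    # shared with 1.txt
    printf "10.0.%d.%d\n", int(i / 250), i % 250
  for (i = 0; i < 4800; i++)   # present only in 2.txt
    printf "172.16.%d.%d\n", int(i / 250), i % 250
}' > "$dir/2.txt"

# Sort and compare under one locale, then count the common IPs.
cut -d, -f2 "$dir/1.txt" | LC_ALL=C sort > "$dir/1.sorted"
LC_ALL=C sort "$dir/2.txt" > "$dir/2.sorted"
found=$(LC_ALL=C comm -12 "$dir/1.sorted" "$dir/2.sorted" | wc -l | tr -d ' ')

echo "expected 200 matches, found $found"
```

If the count is right on generated data but the real files still give empty or complete output, the difference most likely lies in the real input (ordering, stray characters, or no actual overlap) rather than in the pipeline.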