BASH - Find duplicates in multiple files

Question

I have multiple files in the same directory. Each file represents a user and contains the IPs used to log into that account, one per line.

I want to create a script that checks whether the same IP occurs in multiple files and prints the duplicates.

I've tried using awk, but with no luck. Any help appreciated!

[edit] your question to show concise, testable sample input and expected output plus what you've tried so far (i.e. a [mcve]) so we can start trying to help you. — Ed Morton, Nov 11 '16 at 00:51
You mention matching same values in different files and duplicates. Could you clarify if you only want to find matching values in different files or also duplicate entries in the same files? Those would be two different results. — artdanil, Nov 11 '16 at 18:47
Related: Find duplicates in two files: https://stackoverflow.com/q/15470260/873282 — koppor, Feb 08 '18 at 00:59

Jamil Said · Answer 1 · 2016-11-11T07:58:08.863

Assuming that no IP address is repeated within the same file, this should work for IPv4 addresses in most Bash versions:

#!/bin/bash
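# For IPv4 addresses, assuming no IP address is repeated within the same file; the result is stored in /tmp/repeated-ips
mkdir -p /tmp
# -r: recurse into the folder, -h: omit file names, -E: extended regex, -o: print only the matched IP
grep -rhEo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /home/user/folder > /tmp/ipaddresses-holder
# uniq -d prints each line that occurs more than once in the sorted list
sort /tmp/ipaddresses-holder | uniq -d > /tmp/repeated-ips
exit 0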

The script below is a little more complex, but it works whether or not an IP address is repeated within a single file:

#!/bin/bash
# For IPv4 addresses; the result is stored in /tmp/repeated-ips
mkdir -p /tmp
# Without -h, each match is prefixed with its file name ("file:IP")
grep -rEo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /home/user/folder > /tmp/ipaddresses-holder
# sort -u keeps one "file:IP" line per file, removing repeats within a file
sort -u /tmp/ipaddresses-holder > /tmp/ipaddresses-holder2
# Strip the file names again so that only the IPs remain
grep -rhEo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/ipaddresses-holder2 > /tmp/ipaddresses-holder3
sort /tmp/ipaddresses-holder3 | uniq -d > /tmp/repeated-ips
exit 0

In both cases, the result is stored in the file /tmp/repeated-ips.
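
As a minimal sketch of the same idea without temporary files, assuming one IP per line as described in the question: deduplicate each file on its own first, then look for values that still occur more than once. The function name cross_file_dups is hypothetical and not part of either script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line (IP) that occurs in at least two of the given files.
cross_file_dups() {
    local f
    for f in "$@"; do
        sort -u -- "$f"      # drop repeats inside a single file
    done | sort | uniq -d    # anything still repeated came from 2+ files
}
```

Called as cross_file_dups /home/user/folder/*, it should print each shared IP once. Unlike the grep-based scripts, it compares whole lines and does not validate that they look like IP addresses.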

Jay Rajput · Answer 2 · 2016-11-11T01:24:34.577

Use the following awk command:

awk '$0 in a {print FILENAME, "IP:", $0, "also in:", a[$0]; next} {a[$0] = FILENAME}' /tmp/user*

Assuming that your files contain only IPs, like this:

[tmp]$cat /tmp/user1
1.1.1.1
[tmp]$cat /tmp/user2
2.2.2.2
[tmp]$cat /tmp/user3
1.1.1.1

Output

[tmp]$awk '$0 in a {print FILENAME, "IP:", $0, "also in:", a[$0]; next} {a[$0] = FILENAME}' /tmp/user*
/tmp/user3 IP: 1.1.1.1 also in: /tmp/user1

Explanation

awk '
  $0 in a {                        # if the IP already exists in array a
    print FILENAME, "IP:", $0,     # print the output (a trailing comma lets
       "also in:", a[$0];          # the statement continue on the next line)
    next;                          # get the next record without further
  }                                # processing
  {a[$0] = FILENAME}               # if reached here, then we are seeing the IP
'                                  # for the first time, so store it

edited Nov 11 '16 at 01:24

answered Nov 11 '16 at 01:14


My understanding is that there is only a single IP in the file. It is tricky to answer the question without knowing the format for the file storing the IP for the user – Jay Rajput Nov 11 '16 at 01:21
You've reverted your change, so I'm reposting my comment: If the same IP is listed in the same file multiple times, your script will write about that, but the OP only wants information about the same IP appearing in different files. – chw21 Nov 11 '16 at 01:44
Yeah I thought about that. Without knowing the requirements, it was unnecessary cluttering the code. I will let the OP comment and let us know the requirements, before I change. There are tons of things..like what happens if the IP can be expanded in one place and compressed at other place..Shall that be matched? – Jay Rajput Nov 11 '16 at 01:57

chw21 · Answer 3 · 2016-11-11T01:31:55.363

Not sure I understand your question correctly, so here's what I think you want to do:

You have several files. Each file refers to a specific user and logs every IP address that that user has used to log in from. Example:

$ cat alice.txt
192.168.1.1
192.168.1.5
192.168.1.1
192.168.1.1
$ cat bob.txt
192.168.0.1
192.168.1.3
192.168.1.2
192.168.1.3
$ cat eve.txt
192.168.1.7
192.168.1.5
192.168.1.7
192.168.0.7

You want to find out whether the same IP address appears in multiple files.

Here's what I came up with.

#!/usr/bin/env bash
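# Pass the user files as arguments; --exclude requires GNU grep
for source_file in "$@"
do
    # sort -u collapses repeats within the same file
    for search_term in $(sort -u "${source_file}")
    do
        # -F: fixed string, -x: whole-line match, so 192.168.1.1 does not match 192.168.1.10
        found=$(grep -Fx --exclude="${source_file}" -- "${search_term}" "$@")
        if [[ -n "${found}" ]]; then
            echo "Found ${search_term} from ${source_file} also here:"
            echo "${found}"
        fi
    done
done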

It's probably not the best solution.
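
One detail worth checking when matching IPs with fixed-string grep: -F alone matches substrings, so 192.168.1.1 is also found inside 192.168.1.10, while adding -x restricts the match to whole lines. The sketch below illustrates this; the helper name ip_in_file is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file contains the IP as a complete line.
# -F: treat the pattern as a fixed string, -x: match whole lines only, -q: no output
ip_in_file() {
    grep -Fxq -- "$1" "$2"
}
```

With a file that contains only the line 192.168.1.10, ip_in_file 192.168.1.1 on that file fails, whereas a plain grep -F for 192.168.1.1 on the same file succeeds.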

score 0 · Answer 4 · answered Nov 11 '16 at 16:09

How about something like:

diff -u <(cat * | sort) <(cat * | sort | uniq)

In other words, the difference between all the files concatenated and sorted, and all the files concatenated, sorted, and then the duplicates removed.
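
In that diff output, the lines prefixed with "-" are the extra copies. As a rough sketch of a more compact way to get the same information, uniq -d lists each repeated line once. Like the diff command, it also counts repeats inside a single file. The function name all_repeats is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each line that occurs more than once across all given files,
# including lines repeated within one file.
all_repeats() {
    sort -- "$@" | uniq -d
}
```

Run from the directory holding the user files, all_repeats * would correspond to the diff command above.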

EvansWinner

