
Alright, so I have a file, let's call it file1.txt. It has 5,000 lines, and I have file2.txt, which has 2,000,000 lines.

I ran the following command: comm -23 <(sort file2.txt) <(sort file1.txt) > file3.txt

I now have file3.txt with 1,996,000 lines. I would like to extract the 1000 unique lines that file1.txt contains. How would that be possible?

I have tried: comm -23 <(sort file1.txt) <(sort file3.txt) > file4.txt, to no avail. file4.txt was not filtered; it was a copy of file1.txt.

Thanks in advance.

PS: I am using cygwin so some functionality may be limited. Thanks a lot

  • Please add sample input (no descriptions, no images, no links) and your desired output for that sample input to your question (no comment). – Cyrus Aug 02 '20 at 05:29
  • As side note, you do not have to re-sort file3. It is already sorted. – dash-o Aug 02 '20 at 05:37

3 Answers


Using awk to get the unique lines of file1. First, some test data (comments are not part of the data):

file1:

1  # unique in file1 so this is what we want
2  # common in file1 and file2

file2:

2  # common in file1 and file2
3  # unique in file2

The awk:

$ awk '
NR==FNR {         # process file1
    a[$0]         # hash all records
    next
}                 # process file2 below this point
($0 in a) {       # if common entry found in hash
    delete a[$0]  # delete it from the hash
}
END {             # in the end
    for(i in a)   # loop all leftovers
        print i   # and output them
}' file1 file2    # mind the order

Output:

1

The output will not be in any particular order, since awk's `for (i in a)` iteration order is unspecified.
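As a quick sanity check, the one-liner above can be run against the sample files (a sketch; any POSIX awk should behave the same):

```shell
# Recreate the sample data from above
printf '1\n2\n' > file1
printf '2\n3\n' > file2

# Hash file1, delete entries also seen in file2, print the leftovers
awk 'NR==FNR { a[$0]; next } ($0 in a) { delete a[$0] } END { for (i in a) print i }' file1 file2
# prints: 1
```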

James Brown

On the surface, the issue is with using file3 to extract the unique lines from file1. Given that file3 contains only the lines unique to file2, the last comm (file1 and file3) will not remove anything from file1.

Consider instead:

comm -23 <(sort -t: -u file1.txt) <(sort -t: -u file2.txt)
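A minimal sanity check of that approach (a sketch; assumes bash for the process substitution, and two tiny files where only `a` is unique to file1):

```shell
printf 'b\na\nc\n' > file1.txt
printf 'c\nb\nd\n' > file2.txt

# -2 suppresses lines unique to file2, -3 suppresses common lines,
# leaving only the lines unique to file1
comm -23 <(sort file1.txt) <(sort file2.txt)
# prints: a
```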
dash-o
  • Thanks for your response. I tried that, and got this `-bash: $(sort file1.txt): ambiguous redirect` – Henry Stark Aug 02 '20 at 06:18
  • @HenryStark I've fixed the syntax. Also sped up the process using the unique option on sort. – dash-o Aug 02 '20 at 09:26
  • Thanks, it seems to have worked (but sometimes a few lines are skipped) - I'm still able to locate manually a few lines that are still in both files (is it because of Linux-Windows text formatting differences?) – Henry Stark Aug 02 '20 at 12:05
  • @HenryStark, there are small (and annoying) differences between the default sort order and the order expected by 'comm'. Consider passing '-t:' (replace the : with a character that does NOT exist on the line). It usually helps in those situations. – dash-o Aug 02 '20 at 12:18
  • I've run that command (replaced : with {, after removing all { from the text files). The files are almost all alphanumeric (A-Z, a-z, 0-9), so I figured removing a symbol would be the best way to make it work, but it still misses a few lines. Made sure there was no syntax error. Is there a better solution than `comm`, I wonder? Currently my file1.txt is around 420,000 lines and file2 is around 7,000,000 lines. Could the size possibly be causing the error? (Thanks again for the help so far btw) – Henry Stark Aug 05 '20 at 04:01
  • @HenryStark I do not believe size matters. Those utilities have been around for a long time and will not fail on large inputs. Try to produce minimal data that shows the problem. Also, consider posting the offending output. It is hard for SO users to give you feedback without knowing what the problem is. – dash-o Aug 05 '20 at 04:25
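If the stray matches come from Windows line endings (as speculated in the comments above), stripping carriage returns from both files before comparing usually resolves it (a sketch; the `.unix.txt` names are just placeholders):

```shell
# Remove CR characters left over from CRLF (Windows) line endings
tr -d '\r' < file1.txt > file1.unix.txt
tr -d '\r' < file2.txt > file2.unix.txt

comm -23 <(sort file1.unix.txt) <(sort file2.unix.txt)
```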
grep -F -x -f file1.txt file3.txt

Full disclosure: this answer was found here.
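Note that, as the comment below points out, this prints the lines the two files have in common rather than file1's unique lines. Inverting the match with `-v` and searching file1.txt against file2.txt directly would give the unique lines (a sketch; may be slow when the pattern file is very large):

```shell
# -F: fixed strings, -x: whole-line match, -v: invert,
# -f: read patterns from file2.txt
# Prints the lines of file1.txt not found in file2.txt
grep -F -x -v -f file2.txt file1.txt
```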

petrus4
  • The above solution has a few issues: (1) it does not find the unique lines of file1 NOT found in file3, (2) it is slow for large files, as it attempts to match each pattern (from file1) against every line of file3. – dash-o Aug 02 '20 at 12:42
  • I apologise. I only tested it on a pair of small files. – petrus4 Aug 02 '20 at 12:44
  • I've tried this already :) No issues & thanks for your time – Henry Stark Aug 02 '20 at 19:31