Finding difference between two text files with millions of rows

Question

I have two text files: text file 1 contains all the rows, and text file 2 is a subset of file 1. How do I compare the two line by line and return the rows in file 1 that don't exist in file 2?

#|# is the delimiter and the first row is the header.

Format:

DB#|#row_id#|#Date#|#Time#|#Entry#|#Source#|#Date2
GP120#|#1#|#2021-10-01#|#16:51:01#|#1#|#REPO 3.0 SETUP#|#2021-06-29 00:00:00

Why not just use the [`diff`](https://en.wikipedia.org/wiki/Diff) utility, or an existing tool like that? Is there a reason you need to do it in R? Will matching rows be exactly the same, or is there some level of difference that you'll accept? Do you want the rows themselves, or just the row numbers? — divibisan, Jan 25 '22 at 23:16
Thanks, rows that match will be identical between the two files. I want the entire row as the output, and I am choosing R because that's what I am comfortable with! — Xin, Jan 25 '22 at 23:18
The fastest and easiest way to do this is probably with unix tools built for this, as in this question: [How to remove the lines which appear on file B from another file A?](https://stackoverflow.com/q/4366533/8366499) — divibisan, Jan 25 '22 at 23:24
But if you want to do it in R, you can read in each file as a data frame and then get the "anti-join" of them as in these questions: [https://stackoverflow.com/questions/35809923/subsetting-a-data-frame-to-the-rows-not-appearing-in-another-data-frame](https://stackoverflow.com/q/35809923/8366499) or [R selecting all rows from a data frame that don't appear in another](https://stackoverflow.com/q/17427916/8366499) — divibisan, Jan 25 '22 at 23:24
If you're comfortable using R but want speed ... `filediff <- system('diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" ', intern = TRUE)` (taken from divibisan's first link). This will return a `character` vector of the lines in `file-a` that are not in `file-b`. — r2evans, Jan 26 '22 at 01:55
... and it will be far faster than anything R can do, especially if you have "millions of rows". — r2evans, Jan 26 '22 at 02:08
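
For anyone who wants to stay entirely in R, here is a minimal base-R sketch of the "rows in file 1 but not in file 2" idea discussed in the comments. It relies on the asker's statement that matching rows are identical, so it compares whole lines as strings rather than parsing the `#|#` delimiter. The function name `rows_missing_from_subset`, the temporary files, and every demo row except the first are made up for illustration; this is one possible approach, not a benchmarked one.

```r
# Sketch: return the header plus every row of the full file that is
# absent from the subset file. Assumes matching rows are exactly
# identical, so lines are compared as plain strings.
rows_missing_from_subset <- function(full_path, subset_path) {
  full_lines   <- readLines(full_path)
  subset_lines <- readLines(subset_path)

  header <- full_lines[1]   # the first row is the header
  body   <- full_lines[-1]

  # Keep only the data rows that never appear in the subset file
  missing <- body[!(body %in% subset_lines)]

  c(header, missing)
}

# Demo on two small temporary files (hypothetical rows in the question's format)
full_file   <- tempfile(fileext = ".txt")
subset_file <- tempfile(fileext = ".txt")

header <- "DB#|#row_id#|#Date#|#Time#|#Entry#|#Source#|#Date2"
row1 <- "GP120#|#1#|#2021-10-01#|#16:51:01#|#1#|#REPO 3.0 SETUP#|#2021-06-29 00:00:00"
row2 <- "GP120#|#2#|#2021-10-01#|#16:52:10#|#1#|#REPO 3.0 SETUP#|#2021-06-29 00:00:00"
row3 <- "GP120#|#3#|#2021-10-01#|#16:53:45#|#1#|#REPO 3.0 SETUP#|#2021-06-29 00:00:00"

writeLines(c(header, row1, row2, row3), full_file)
writeLines(c(header, row1, row3), subset_file)

result <- rows_missing_from_subset(full_file, subset_file)
print(result)

# If columns are needed afterwards, split on the literal delimiter;
# fixed = TRUE stops "|" from being treated as a regex operator.
fields <- strsplit(result[-1], "#|#", fixed = TRUE)
```

Note that this reads both files fully into memory, which may or may not be acceptable for millions of rows depending on line length and available RAM. If it is not, the `diff` call from the comment above is likely the better route.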

0 Answers