
I’ve got two text files, each with several hundred lines. Some of the lines exist in both files, and I want to remove those so that they exist in only one of the files. Basically, I want to reduce them to get a unique set of lines. The catch is that I can’t sort them (they are stripped-down dumps of my Chromium history).

What is the easiest way to do this?

I tried WinDiff, but that gave incorrect results. I figure that I could knock together a PHP script in a while, but am hoping that there is an easier way (preferably a command-line tool).

Synetech

2 Answers


Well, I ended up writing a PHP script after all.

I read each file into a string, then exploded the strings into arrays using \r\n as the delimiter. I then iterated through the arrays to remove any elements that exist in both, and finally dumped them back out to a file.

The only problem came when I tried to refactor the stripping routine into a function: passing the array that gets changed (elements removed) by reference slowed it down to the point of needing to be Ctrl-C’d, so I just passed it by value and returned the new array (counterintuitive). Also, using unset to delete the elements was slow no matter what, so I just set each element to an empty string and skipped those during the dump.
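
For comparison, the approach described in this answer (load both files, drop the lines that also occur in the other file, write out what is left) can be sketched in a few lines of shell and awk. This is not the original PHP script, only a rough equivalent under some assumptions: the function name `strip_common` and the file names are hypothetical, the files are assumed to fit in memory, and a trailing \r is stripped so that files with \r\n line endings compare correctly (the output then uses plain \n).

```shell
# strip_common FILE_A FILE_B   (hypothetical helper name)
# Prints the lines of FILE_A that do not occur in FILE_B,
# keeping FILE_A's original order. Nothing is sorted.
strip_common() {
  awk -v other="$2" '
    BEGIN {
      # Load every line of FILE_B into a lookup table.
      while ((getline line < other) > 0) {
        sub(/\r$/, "", line)   # tolerate \r\n line endings
        seen[line]
      }
      close(other)
    }
    { sub(/\r$/, "") }         # same normalisation for FILE_A
    !($0 in seen)              # print only lines not found in FILE_B
  ' "$1"
}
```

It would be called as `strip_common a.txt b.txt > a.new.txt`, writing to a new file rather than overwriting a.txt in place. Because membership is checked against a hash table, no array elements have to be deleted or blanked out along the way.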

Community
Synetech

If you have a bash shell (e.g. Cygwin), the following command removes from a.txt every line that also appears in b.txt, leaving the remaining lines of a.txt in their original order:

comm -12 <(sort -u a.txt) <(sort -u b.txt) | while IFS= read -r dupe; do dupe_escaped=$(printf '%s\n' "$dupe" | sed 's/[][\.*^$/]/\\&/g'); sed -i -e "/^${dupe_escaped}\$/d" a.txt; done
codecraft
  • Like I said, I can’t sort because then I lose the order of the visits to the URLs, thus losing all context. If I could sort, it would be **really** easy. – Synetech Feb 27 '11 at 23:20
  • The sorting only creates an intermediate list of duplicates, which is then used to filter the duplicates out of the unsorted file; the file itself is never reordered. – codecraft Feb 27 '11 at 23:23
  • If you want to merge both files into one you could also use the AWK tool: `awk '!($0 in a) {a[$0];print}' a.txt b.txt` – codecraft Feb 27 '11 at 23:29
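
The two points made in the comments above can be tried on small sample files. The following is a minimal sketch: `common_lines` and `merge_unique` are hypothetical wrapper names, and temporary files stand in for bash process substitution so that it also runs in a plain POSIX sh.

```shell
# common_lines A B   (hypothetical helper name)
# Prints the intermediate list of lines present in both files.
# Only sorted copies are compared; A and B are left untouched.
common_lines() {
  tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
  sort -u "$1" > "$tmp_a"
  sort -u "$2" > "$tmp_b"
  comm -12 "$tmp_a" "$tmp_b"   # -12: keep only lines common to both
  rm -f "$tmp_a" "$tmp_b"
}

# merge_unique A B   (hypothetical helper name)
# Wraps the awk one-liner from the comment: prints each distinct
# line once, in the order it is first seen across A and then B.
merge_unique() {
  awk '!($0 in a) {a[$0]; print}' "$1" "$2"
}
```

With a.txt containing the lines c, a, b and b.txt containing b, d, c, `common_lines a.txt b.txt` prints b and c, while `merge_unique a.txt b.txt` prints c, a, b, d, so the visit order of the first file survives the merge.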