I have a text file (A.txt) with 20,000 domain names, one per line. I have another text file (B.txt) that contains thousands of Whois records compiled together. I want to see which domains in A.txt are not referenced in B.txt. It's trivial to do this one by one, but how can I do it in bulk? Thanks

user1543782
  • Is using [spreadsheets/Excel](http://stackoverflow.com/questions/4160243/join-two-spreadsheets-on-a-common-column-in-excel-or-openoffice) out of the question? – Primoz Mar 22 '13 at 10:13

1 Answer

You could edit A.txt so that its lines have the form `example.com A other stuff`, and B.txt so that its lines have the form `example.com B other stuff`. Then sort the two files together. Next, run a Notepad++ regular expression replace, searching for `^([^ ]+) A .*\r\n(\1 B )` and replacing with `\2`. The effect is that any A.txt line immediately followed by a B.txt line for the same domain is removed, leaving the B.txt line. If several A.txt lines match one B.txt line, run the replace repeatedly until no more lines are replaced. Finally, delete the B.txt lines: use a regular expression to mark the lines matching `^([^ ]+) B `, then remove the bookmarked lines. What remains is the unmatched A.txt lines.

Without knowing the format of the source files A.txt and B.txt, I cannot suggest a regular expression that puts the domain name followed by an A or B at the start of each line.
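For illustration only, the sort-and-replace steps can also be sketched in Python rather than Notepad++. This is a rough sketch, not a tested tool: the function name `unmatched_a_lines` is hypothetical, the input lines are assumed to be already tagged in the `domain A ...` / `domain B ...` form described above, and `\n` is used as the line ending instead of `\r\n`.

```python
import re


def unmatched_a_lines(a_lines, b_lines):
    """Return the tagged A lines that have no B line for the same domain.

    Assumes every line already looks like 'example.com A ...' or
    'example.com B ...' (domain, one space, tag, one space, anything).
    """
    # Sort both sets together so that, for each domain, the A lines
    # end up directly before the B lines.
    merged = "\n".join(sorted(a_lines + b_lines)) + "\n"

    # Same idea as the Notepad++ search: an A line directly followed
    # by a B line for the same domain is replaced by the start of the B line.
    pattern = re.compile(r"^([^ ]+) A .*\n(\1 B )", re.MULTILINE)

    # Repeat until nothing is replaced, to handle several A lines
    # for one domain.
    while True:
        merged, count = pattern.subn(r"\2", merged)
        if count == 0:
            break

    # Drop the B lines; the unmatched A lines are what is left.
    b_line = re.compile(r"^[^ ]+ B ")
    return [line for line in merged.splitlines() if not b_line.match(line)]


if __name__ == "__main__":
    # Made-up sample data, standing in for the tagged A.txt and B.txt.
    a = ["a.com A -", "b.org A -", "c.net A -"]
    b = ["b.org B whois record", "d.io B whois record"]
    for line in unmatched_a_lines(a, b):
        print(line)
```

As with the Notepad++ approach, this only works once the domain has been moved to the start of each B.txt line, which depends on the format of the Whois records.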

AdrianHHH