
I work in SEO, and I sometimes have to manage lists of domains to be considered for certain actions in our campaigns. On my iMac I have two lists: one provided for consideration - unfiltered.txt - and another listing the domains I have already analyzed - used.txt. The new one provided for consideration (unfiltered.txt) looks like this:

site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz
... etc

The list of domains to be used as a filter, i.e. the entries to be eliminated (used.txt), looks like this:

site4.org
site5.me
site6.co.nz
gland.org.uk
kland.co.nz
site7.de
site8.it
... etc

Is there a way to use my OS X terminal to remove from unfiltered.txt all the lines found in used.txt? I found a software solution that only partially solves the problem: besides the exact entries from used.txt, it also eliminates lines that merely contain those entries as substrings. That makes the filter broader than intended and removes domains I still need.

For example, if my unfiltered.txt contains a domain named fogland.org.uk, it gets eliminated automatically just because my used.txt file contains the domain gland.org.uk.

The files are pretty big (close to 100k lines each). I have a pretty good configuration - SSD, 7th-gen i7, 16GB RAM - but I would rather not let this run for hours just for one operation.

... hope it makes sense.

TIA

designarti
  • Duplicate of [Remove Lines from File which appear in another File](http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file) – davidcondrey Dec 31 '16 at 10:28

4 Answers


You can do that with awk. You pass both files to awk. While it is parsing the first file - where FNR, the record number within the current file, equals NR, the record number across all files - it makes a note of each domain it has seen. Then, while parsing the second file, it only prints records that were not seen in the first file:

awk 'FNR==NR{seen[$0]++;next} !seen[$0]' used.txt unfiltered.txt 

Sample Output for your input data

site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz

awk is included and delivered as part of macOS - no need to install anything.
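
For readability, here is the same program written out with comments - purely a sketch of what the one-liner above does, with identical behaviour:

awk '
    # First pass (used.txt): FNR==NR only while reading the first file,
    # so record every domain that has already been used
    FNR == NR { seen[$0]++; next }
    # Second pass (unfiltered.txt): print only lines not seen in used.txt
    !seen[$0]
' used.txt unfiltered.txt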

Mark Setchell
  • thank you, but is this faster than grep for large files? – designarti Dec 31 '16 at 08:07
  • Your question didn't mention speed was an issue and I don't know how big your files are, nor how fast your disk is - please just try it. – Mark Setchell Dec 31 '16 at 08:23
  • Alright. Your solution is given on several other Stack Overflow pages. The main problem everybody mentions when it comes to awk is speed. I tried it a few days ago; it is the simplest solution, but that's about it. Speed is one factor, file size is another. I will add speed and file size to the specs. Thanks again. – designarti Dec 31 '16 at 10:04
  • I just tested with 100,000 lines in each file in under 1 second. – Mark Setchell Dec 31 '16 at 11:38
  • 1,000,000 lines in each file takes 2.6 seconds. – Mark Setchell Dec 31 '16 at 15:56
  • "used.txt" 100k and "unfiltered.txt" 100k as well? I mean did you filted 100k entries with another 100k lines file? Because I've tried it a few days ago and I had to stop it after a few minutes. If you are right, that HAS to be the answer for this. – designarti Jan 01 '17 at 09:42

I have always used

grep -v -F -f expunge.txt filewith.txt > filewithout.txt

to do this. When "expunge.txt" is too large, you can do it in stages, cutting it into manageable chunks and filtering one after another:

cp filewith.txt original.txt

and loop as required:
    grep -v -F -f chunkNNN.txt filewith.txt > filewithout.txt
    mv filewithout.txt filewith.txt
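
If you want to generate the chunks yourself, a minimal sketch of the whole loop - the 20000-line chunk size is an arbitrary assumption, and split's default suffixes (chunkaa, chunkab, ...) differ from the chunkNNN.txt names used above:

    # split the large pattern file into ~20000-line pieces
    split -l 20000 expunge.txt chunk
    for c in chunk*; do
        grep -v -F -f "$c" filewith.txt > filewithout.txt
        mv filewithout.txt filewith.txt
    done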

You could even do this in a pipe (only the first stage reads original.txt; each later stage filters the output of the previous one):

 grep -v -F -f chunk01.txt original.txt |
 grep -v -F -f chunk02.txt |
 grep -v -F -f chunk03.txt > purged.txt
LSerni

You can use comm. I haven't got a mac here to check but I expect it will be installed by default. Note that both files must be sorted. Then try:

comm -2 -3 unfiltered.txt used.txt

Check the man page for further details.
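
If your files are not sorted yet, a minimal sketch of the full sequence - the .sorted and filtered.txt names are just placeholders:

sort unfiltered.txt > unfiltered.sorted
sort used.txt > used.sorted
# column 1 of comm's output = lines only in unfiltered.sorted, i.e. not yet used
comm -23 unfiltered.sorted used.sorted > filtered.txt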

user133831
  • It works with grep, but grep doesn't handle large files well. I found grep -Fvx -f used.txt unfiltered.txt > final.txt && mv final.txt unfiltered.txt. I assume comm wouldn't help either when working with 100k lines in each file. What does "sorted" mean? – designarti Dec 27 '16 at 19:00
  • I expect `comm` would be much faster than `grep` at this task. I just tested using 2 files with 100K lines and `comm` took about 0.2s on my laptop on its first run. I tried with `grep` but it hadn't finished after a minute so I killed it. – user133831 Dec 27 '16 at 23:17
  • "sorted" means to be arranged in order. There are different ordering systems but in this case which you use doesn't matter as long as both files use the same one. A common sorting order is "alphabetical". See https://en.wikipedia.org/wiki/Sorting – user133831 Dec 27 '16 at 23:21
  • Ah - I just saw you want to match substrings. `comm` won't do that. I can't think of a better tool to use than grep - though maybe if you have multiple cores some divide and conquer might help. – user133831 Dec 27 '16 at 23:32
  • I'm gonna put a bounty on this in 1 day. Thanks for your effort. – designarti Dec 29 '16 at 08:32
  • I'd be surprised if there is a common command line tool for this - I expect that for any decent performance you will need an implementation of quicksearch or similar. – user133831 Dec 29 '16 at 16:25

You can use comm and process substitution to do everything in one line:

comm -23 <(sort unfiltered.txt) <(sort used.txt) > unfiltered_new.txt
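
If you then want the filtered list to replace the original, a possible follow-up step (assuming the unfiltered_new.txt name used above) is:

mv unfiltered_new.txt unfiltered.txt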

P.S. tested on my Mac running OSX 10.11.6 (El Capitan)

mauro