Simplest way to remove smaller data set from bigger one in command line

Question

I am looking for the easiest way to filter one file against another.

INPUT: txt files
File 1 (bigger one):

abc111
abc112
abc113
abc114
abc115
...
zbc999

File 2 (smaller one):

abc111
abc112
abc113

OUTPUT: I want a new file containing only the non-recurring (unique) entries. In other words, the output file must contain only those lines from the larger file 1 that do not occur in the smaller file 2.

BTW:
How can I do this easily when the file names are long and tedious to type at the console every time?

While this is doable with only standard functionality, the best way I can think of is still extremely painful. So what you want is simply to find a program that can do this easily, which makes the question off topic. — Jon, May 30 '14 at 10:58

Stephan · Accepted Answer · 2014-05-30T12:10:50.757


"extremely painful"?

@echo off
REM step 1: remove duplicates from file1
REM (start with an empty temp file; "delims=" keeps lines with spaces intact)
type nul >file1.tmp
for /f "usebackq delims=" %%i in ("file1.txt") do (
 findstr /x /L /c:"%%i" file1.tmp>nul ||echo %%i>>file1.tmp
)

REM step 2: extract lines that don't exist in file2
findstr /v /x /L /g:file2.txt file1.tmp >output.txt

type output.txt

edited May 30 '14 at 12:10

answered May 30 '14 at 11:13

Stephan

Is there a way to choose file1.txt and file2.txt from a popup window? – MarekW May 30 '14 at 11:30
Well, **that** would be the "extremely painful" part. But did you know that you can use the `TAB` key to complete a filename at the command prompt? Enter the first two or three characters and press `TAB` repeatedly; it will cycle through all matching filenames. – Stephan May 30 '14 at 11:35
Your basic logic is sound, but this has bugs, one of which cannot be solved. FIND is a poor choice because it will report that `234` matches `1234`. FINDSTR is better because you can specify the `/X` option to match the entire line exactly. It also needs the `/L` option to prevent regex interpretation. But `\\` and `\"` literals within search strings require the leading `\` to be escaped as `\\`. More troubling is a horrific FINDSTR bug that can cause searches with multiple strings of different lengths to sometimes miss matches. See http://stackoverflow.com/q/8844868/1012053. – dbenham May 30 '14 at 11:42
@dbenham's last point can be resolved with another `for` loop that checks one string after another. Ugly and slow, but reliable (?). I didn't get the "But `\\` and `\"` literals..." part; maybe some characters are not shown as intended. (I read the link, but parts of it are still a bit beyond my horizon.) I edited my answer to include the first parts of dbenham's comment. – Stephan May 30 '14 at 12:33
@Stephan: So... do we agree on "extremely painful" now? ;-) Frankly, batch files are about the most masochistic way to do this. PowerShell at the very least. – Jon May 30 '14 at 20:36
@jon: well - uhm - ok, point for you. But it's not the logic that's painful, it's working around some bugs. – Stephan May 31 '14 at 06:06
@Stephan: Agreed. That's just how batch files turn out a lot of the time: the logic is simple, but the execution is another issue entirely. – Jon May 31 '14 at 22:24
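
For comparison, the "simple logic" discussed above (drop duplicates from the bigger file, then drop every line that also occurs in the smaller file) can be sketched outside of batch. This is only an illustration in Python, not something proposed in the thread; it assumes Python is available, and the function names `difference` and `filter_files` are made up for this example. Lines are compared as whole strings, so no pattern matching or escaping is involved.

```python
from pathlib import Path


def difference(big_lines, small_lines):
    """Return the lines of big_lines that are not in small_lines.

    Only the first occurrence of each line is kept, in original order.
    """
    exclude = set(small_lines)   # lines to remove (the smaller file)
    seen = set()                 # lines already written (duplicate check)
    kept = []
    for line in big_lines:
        if line in exclude or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return kept


def filter_files(big_path, small_path, out_path):
    """Hypothetical file wrapper: read both files, write the filtered result."""
    big = Path(big_path).read_text().splitlines()
    small = Path(small_path).read_text().splitlines()
    kept = difference(big, small)
    Path(out_path).write_text("".join(line + "\n" for line in kept))
    return kept
```

With the file names used in the answer, the call would be `filter_files("file1.txt", "file2.txt", "output.txt")`.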