1

This is tricky; I've been searching for hours and can't find anything helpful :(

I don't care how it's done (PowerShell, batch, Notepad++ or any other software), but this is what I want to do:

I have a text file text1.txt with 2888 lines and another file text2.txt with 3440 lines. The second file already contains the exact same 2888 lines that are in the first file.

So what I want to do is "remove" those 2888 lines of text1.txt from text2.txt, keeping only the remaining "unique" lines.

aleeis
  • You don't say whether the lines of `text1.txt` are leading, trailing or intermixed with the lines in `text2.txt`. Could there also be duplicate lines within the 2888 or 552 lines? That may be important for the method used. –  Jun 26 '18 at 06:37

4 Answers

3

This is two lines in batch; you can use findstr to compare the two files.

findstr /V /G:text1.txt /L /X text2.txt >text3.txt
move /y text3.txt text2.txt

/G gets search strings from text1.txt
/V returns everything except those strings
/L indicates that the lines in text1.txt are meant to be taken literally instead of as regex (you only need this if your lines contain symbols that are used by regex, like [ and ] or $)
/X matches full lines, so "stone" won't get picked up by "one" for example

The data gets stored in a temporary file because redirecting immediately to text2.txt wipes out the file. Once the temporary file is created, move overwrites the old file and /y does it without asking if you're sure you want to overwrite the file.
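For very large files, the same filter can also be sketched outside of batch. The following is a minimal Python sketch of the logic described above (whole-line, literal, case-sensitive matching, written to a temporary file that then replaces the original). The function name `remove_known_lines` is made up for illustration, and UTF-8 encoding is an assumption about the data.

```python
import os
import tempfile
from pathlib import Path


def remove_known_lines(known_path, target_path):
    """Drop every line of target_path that also appears, whole and verbatim, in known_path.

    Returns the number of lines kept. Hypothetical helper; assumes UTF-8 text.
    """
    # Load the lines to exclude into a set so each lookup is a hash check.
    with open(known_path, encoding="utf-8") as known:
        exclude = {line.rstrip("\r\n") for line in known}

    target = Path(target_path)
    # Write to a temporary file first and swap it in afterwards, mirroring the
    # text3.txt + move step: writing straight to the target would truncate it.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    kept = 0
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as out, \
            open(target, encoding="utf-8", newline="") as src:
        for line in src:
            # Compare without the line ending, but write the line back unchanged.
            if line.rstrip("\r\n") not in exclude:
                out.write(line)
                kept += 1
    os.replace(tmp_name, target)
    return kept
```

Calling `remove_known_lines("text1.txt", "text2.txt")` would rewrite text2.txt in place. Only the lines of text1.txt are held in memory; text2.txt is streamed line by line.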

SomethingDark
  • 1
    Nice, but I think you need the `/L` and the `/X` switches too. Also note that there is a nasty `findstr` bug with case-sensitive literal search strings: [Why doesn't this FINDSTR example with multiple literal search strings find a match?](https://stackoverflow.com/q/8921253)... – aschipfl Jun 26 '18 at 11:01
  • Do I need both `/L` and `/X`? That seems a bit redundant. Also, I don't know what the raw data looks like, so I can't tailor my answer accordingly just yet. – SomethingDark Jun 26 '18 at 21:48
  • 2
    `/G` does not define literal or regex mode, so `/L` is required; `/X` matches whole lines, which is needed here, I think, as the OP is always talking about lines... – aschipfl Jun 26 '18 at 21:55
  • If all they're checking are simple words, I don't need to specify literal vs regex; since I still don't have sample data, I'll add `/L` just in case. – SomethingDark Jun 27 '18 at 14:23
  • Thanks both! Actually the code only worked with `/L /X`; without them it gets stuck and never finishes. My text file is 3 million lines, and I know there are 44k original lines. The sad thing is that with `/L /X` I only get 20k :( – aleeis Jun 28 '18 at 15:06
1

You can easily do this using Notepad++.

First, copy the contents of text1.txt into text2.txt in Notepad++.

After merging, you can use this regex (Notepad++ 6 or later) in the Search and Replace dialog:

^(.*?)$\s+?^(?=.*^\1$)

and replace with nothing. For each set of duplicate rows, this keeps only the last occurrence in the file. You need to check the options "Regular expression" and ". matches newline".
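To make the described effect concrete, here is a minimal Python sketch of "keep only the last occurrence of each duplicated line". It does not run the regex itself; it only reproduces the behaviour stated above, and the function name `keep_last_occurrence` is made up for illustration.

```python
def keep_last_occurrence(lines):
    """Return lines with duplicates removed, keeping each line's final occurrence."""
    # Remember the index of the last time each distinct line appears.
    last_index = {line: i for i, line in enumerate(lines)}
    # Keep a line only at the position of its last appearance.
    return [line for i, line in enumerate(lines) if last_index[line] == i]
```

With this behaviour, a line that came from text1.txt still appears once in the merged result.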

NullPointer
1

Install the CudaText editor, then install its Sort plugin via the menu Plugins/AddonManager/Install.

  • open file1 in 1st tab
  • open file2 in 2nd tab
  • make a new tab (3rd) and paste the contents of the first 2 tabs into it (in each of the first 2 tabs: Select All, Copy; then in the 3rd tab: Paste)
  • in this 3rd (filled) tab, do Select All
  • call Sort plugin: menu item "Plugins/ Sort/ Remove duplicate lines + origins"
  • 3rd tab has the result
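
The steps above amount to concatenating both files and then dropping every line that occurs more than once, including its first copy. A minimal Python sketch of that idea follows; it assumes the plugin command behaves as its name suggests, and the function name `drop_duplicates_and_origins` is made up for illustration.

```python
from collections import Counter


def drop_duplicates_and_origins(lines):
    """Keep only the lines that occur exactly once in the combined input."""
    # Count how often each distinct line appears.
    counts = Counter(lines)
    # A line that appears in both files has a count of at least 2 and is dropped.
    return [line for line in lines if counts[line] == 1]
```

For example, passing the lines of text1.txt followed by the lines of text2.txt would leave only the lines that text2.txt does not share with text1.txt, provided neither file contains internal duplicates.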
Prog1020
  • Thank you! But it gets stuck and never finishes :( My file is 3 million lines, so it's a big file. I have 16 GB of RAM and a good CPU, so I don't know if there is an easy alternative like the one you described! – aleeis Jun 28 '18 at 15:08
0

A PowerShell solution is missing, so try this:

## Q:\Test\2018\06\26\SO51033576.ps1
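# Read both files into arrays of lines
$text1 = Get-Content '.\text1.txt'
$text2 = Get-Content '.\text2.txt'
# '<=' marks lines that occur only in the reference object ($text2)
(Compare-Object $text2 $text1 | Where-Object SideIndicator -eq '<=').InputObject |
    Set-Content '.\new-text2.txt'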