
I have a text file that is over 50GB. It contains many lines, each around 15 characters long on average. I want every line to be unique (case sensitive): if a line is exactly the same as an earlier one, it must be removed, without changing the order of the remaining lines or sorting the file in any way.

My question is different from similar ones because the file is huge and cannot be handled by the solutions I have found so far.

I have tried:

awk !seen[$0]++ bigtextfile.txt > duplicatesremoved.txt

It starts off fast, but very soon I get the following error:

awk: (FILENAME=bigtextfile.txt FNR=19083509) fatal: more_nodes: nextfree: can't allocate 4000 bytes of memory (Not enough space)

The above error appears when the output file is about 200MB.

Is there another fast way to do the same thing on Windows?

  • If this is a 50GB log file then you really need to look into a log rotator; that would help keep your environment clean. – RavinderSingh13 Sep 29 '19 at 14:12
  • It is not a log file – aris melachroinos Sep 29 '19 at 14:31
  • What's wrong with `sort -u file`? Ah, never mind I see you're on Windows. If you're able to use UNIX tools instead then [edit] your question to include concise, testable sample input and expected output and you'll probably get an answer. – Ed Morton Sep 29 '19 at 14:33
  • Windows doesn't come with `awk`, so I think it's safe to say they can install GNU `sort` or `uniq`. – SomethingDark Sep 29 '19 at 14:37
  • Yes, I did install awk and tried that. I don't mind installing something to try, but it has to be on Windows. Also, `sort -u` sorts the file as well, but I do not want to do anything more than remove the duplicates; sorry I did not mention that previously. – aris melachroinos Sep 29 '19 at 14:40
  • That's where posting sample input/output helps to demonstrate your needs. Can you install Cygwin on Windows so you can run UNIX tools from that? Again, post some sample input/output as a starting point to getting help. @SomethingDark thanks for the info, but I don't know which tools are available for Windows, nor the Windows quoting rules, nor how to pass the output of one command to the input of another, etc., so I personally wouldn't be able to provide an answer that runs directly on Windows, but hopefully others can. – Ed Morton Sep 29 '19 at 14:41
  • Have you tried a solution from [this thread](https://stackoverflow.com/q/11689689), particularly [`JSORT.BAT`](http://www.dostips.com/forum/viewtopic.php?f=3&t=5595) from [this answer](https://stackoverflow.com/a/11691976)? – aschipfl Sep 29 '19 at 14:42
  • @aschipfl I saw that it sorts the file, so I skipped that solution. – aris melachroinos Sep 29 '19 at 14:48
  • @EdMorton Yes I can install Cygwin if that helps. Sorry I do not understand what exactly you want me to post. – aris melachroinos Sep 29 '19 at 14:48
  • `uniq` should be able to do what you want, but if awk ran out of memory, you may run into issues. I definitely recommend trying it out, though. I'd test it on my side, but I don't have a 50 GB text file lying around. – SomethingDark Sep 29 '19 at 14:50
  • @aris I want you to post a few lines of text including duplicate lines as your sample input and then the output you'd want given that input file. I'll post an answer you can run on cygwin so you can see what I mean. – Ed Morton Sep 29 '19 at 14:51
  • From what I saw, `uniq` requires the input to be sorted first, doesn't it? – aris melachroinos Sep 29 '19 at 14:53
  • @SomethingDark for `uniq` to work the input has to be sorted and `sort` has a `-u` flag to only output unique values so `uniq` itself isn't useful in this context. – Ed Morton Sep 29 '19 at 14:53
  • If the amount of duplicates is nontrivial, splitting the file in two halves and then processing the filtered halves combined may be feasible. Of course, this is pretty much what `sort -u` does for large files behind the scenes. – tripleee Sep 29 '19 at 14:53
  • @tripleee Yes that is what I thought too, just wanted to know if there was another easier way. – aris melachroinos Sep 29 '19 at 14:56
  • Alternatively, can you partition the file e.g. by the first character on each line? By definition, duplicates will have the same first character but if the file has significant internal variability, this should allow you to split it into sub-gigabyte chunks. – tripleee Sep 29 '19 at 14:56
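
A rough sketch of that partitioning idea for a Cygwin/UNIX shell (untested; the chunk file names, the reliance on cat -n's tab separator, and the assumptions that lines contain no tabs and start with filename-safe characters are all illustrative, not from the thread):

$ cat -n bigtextfile.txt > numbered.tmp                              # prefix each line with its original number (cat -n uses a tab separator)
$ awk -F'\t' '{ print > ("chunk." substr($2,1,1)) }' numbered.tmp    # split into chunks by the first character of the line content
$ for f in chunk.*; do awk -F'\t' '!seen[$2]++' "$f"; done | sort -n | cut -f2- > duplicatesremoved.txt

The last line keeps only the first occurrence within each (smaller) chunk, restores the original order by sorting on the line numbers, and strips the numbers again. Each awk run only has to hold one chunk's unique lines in memory, and duplicates always share a first character so they always land in the same chunk; the final sort -n still touches the whole file, but sorting huge files is exactly what sort handles well (see the answer below).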

1 Answer


You could do this on a UNIX box or Cygwin on top of Windows:

$ cat file
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.

Loud the winds howl, loud the waves roar,
Speed, bonnie boat, like a bird on the wing,
Thunderclaps rend the air;
Onward! the sailors cry;
Baffled, our foes stand by the shore,
Carry the lad that's born to be King
Follow they will not dare.
Over the sea to Skye.


$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.

Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.

The only command above that has to process the whole file at once is sort, and sort is designed to use temporary files, paging, etc. to handle exactly that for large files (see https://unix.stackexchange.com/q/279096/133219), so IMHO it's your best shot at being able to do this.
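
If sort itself runs short on memory or temporary space, GNU sort lets you cap its in-memory buffer with -S and write its temporary files to whichever drive has the most free room with -T; for example (the directory below is just a placeholder):

$ cat -n file | sort -S 2G -T /cygdrive/d/tmp -k2 -u | sort -S 2G -T /cygdrive/d/tmp -n | cut -f2-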

Start with the cat -n file and then add each command to the pipeline one at a time to see what it's doing (see below). In short: it first adds line numbers, then sorts uniquely by content to keep one copy of each value, then sorts numerically by the original line numbers to get the original line order back, and finally removes the line numbers added in the first step:

$ cat -n file
     1  Speed, bonnie boat, like a bird on the wing,
     2  Onward! the sailors cry;
     3  Carry the lad that's born to be King
     4  Over the sea to Skye.
     5
     6  Loud the winds howl, loud the waves roar,
     7  Speed, bonnie boat, like a bird on the wing,
     8  Thunderclaps rend the air;
     9  Onward! the sailors cry;
    10  Baffled, our foes stand by the shore,
    11  Carry the lad that's born to be King
    12  Follow they will not dare.
    13  Over the sea to Skye.
    14


$ cat -n file | sort -k2 -u
     5
    10  Baffled, our foes stand by the shore,
     3  Carry the lad that's born to be King
    12  Follow they will not dare.
     6  Loud the winds howl, loud the waves roar,
     2  Onward! the sailors cry;
     4  Over the sea to Skye.
     1  Speed, bonnie boat, like a bird on the wing,
     8  Thunderclaps rend the air;


$ cat -n file | sort -k2 -u | sort -n
     1  Speed, bonnie boat, like a bird on the wing,
     2  Onward! the sailors cry;
     3  Carry the lad that's born to be King
     4  Over the sea to Skye.
     5
     6  Loud the winds howl, loud the waves roar,
     8  Thunderclaps rend the air;
    10  Baffled, our foes stand by the shore,
    12  Follow they will not dare.


$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.

Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.

Ed Morton
  • After a couple of seconds I got this error: `sort: string comparison failed: Invalid or incomplete multibyte or wide character` and `sort: Set LC_ALL='C' to work around the problem`. I did try `LC_ALL='C'` but got the same error; am I doing something wrong? – aris melachroinos Sep 29 '19 at 15:09
  • Oh forgot to export the var – aris melachroinos Sep 29 '19 at 15:11
  • You **may** also need to run `dos2unix` or similar (e.g. `sed 's/\r$//' file`) on your input file if it was created on Windows and contains `\r\n` line endings instead of just `\n`s but I don't think you will as the commands shown should just treat `\r`s like any other characters. Did it work for you? – Ed Morton Sep 29 '19 at 15:21
  • I tried your suggestion and it filled my SSD, so I had to cancel it; I will try again now that I have freed up more space. (I had 70GB+ of space available, but it needs more for some reason.) I will report back here when it finishes. – aris melachroinos Sep 29 '19 at 15:32
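
Putting the workarounds from these comments together, the full Cygwin run might look something like this (the temp directory is a placeholder for whichever drive has the most free space; note that sort's temporary files can need roughly as much room as the input itself, on top of the roughly 50GB output):

$ export LC_ALL=C
$ sed 's/\r$//' bigtextfile.txt | cat -n | sort -T /cygdrive/d/tmp -k2 -u | sort -T /cygdrive/d/tmp -n | cut -f2- > duplicatesremoved.txt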