
I want to remove duplicate lines from a file of roughly 34GB and then load it into a MySQL database. Loading it into the database with constraints takes a lot of time. I tried sorting the file and then using an awk command, but that took a long time and ran into memory problems. Is there another way to handle this?

  • Processing 34GB of data will always take quite some time. Why not simply load everything into the database and use the power of your database to remove the duplicate lines? That's what I would probably do (a sketch of this approach follows these comments). – KIKO Software Jan 26 '18 at 15:28
  • `cat file | sort | uniq > newfile` – Blue Jan 26 '18 at 15:49
  • @FrankerZ So you think sort will be able to do its job on a 34GB file on a common machine without any problem? The OP already said he had memory problems... – Jean-Baptiste Yunès Jan 26 '18 at 15:55
  • @Jean-BaptisteYunès I recommend reading [this](https://stackoverflow.com/a/930051/4875631). – Blue Jan 26 '18 at 15:56
  • @FrankerZ Wow! I wasn't aware of this; it seems to be specific to the GNU version of sort. Anyway, Linux is so widespread... Make it an answer, then. – Jean-Baptiste Yunès Jan 26 '18 at 16:00
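
For the database-side suggestion above, here is a minimal sketch of what "let the database deduplicate" could look like. It assumes the MySQL client and server allow `LOAD DATA LOCAL` (local_infile enabled) and that each line fits in a VARCHAR(255); the database, table, and column names (mydb, deduped, line) are hypothetical:

# Load the file into a table with a UNIQUE key; the IGNORE keyword makes
# LOAD DATA silently skip rows that would duplicate an existing key.
mysql --local-infile=1 mydb <<'SQL'
CREATE TABLE deduped (
  line VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_line (line)
);
LOAD DATA LOCAL INFILE '/path/to/large_filename.txt'
  IGNORE INTO TABLE deduped
  LINES TERMINATED BY '\n' (line);
SQL

Note that checking a unique key for every row is exactly the "load with constraints" cost the question mentions; an alternative is to load into an unconstrained staging table first and then copy the distinct rows out with INSERT ... SELECT DISTINCT.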

1 Answer


From [this answer](https://stackoverflow.com/a/930051/4875631):

The "Algorithmic details of UNIX Sort command" post explains that Unix sort uses an external R-way merge sorting algorithm. The link goes into more detail, but in essence it divides the input into smaller portions (that fit into memory), sorts each portion, and then merges the sorted portions together at the end.
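
To make that concrete, here is a rough shell illustration of the same idea using GNU coreutils. GNU sort already does this chunking and merging internally, so this is only to show the mechanism; the chunk size of 10 million lines is an arbitrary placeholder:

# 1. Split the big file into chunks small enough to sort in memory.
split -l 10000000 large_filename.txt chunk_
# 2. Sort each chunk independently (in place).
for f in chunk_*; do sort "$f" -o "$f"; done
# 3. Merge the already-sorted chunks (-m merges without re-sorting) and dedupe.
sort -m chunk_* | uniq > unique_filename.txt
rm chunk_*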

Simply use the following to remove the duplicates. It should be memory efficient and should work for you without involving MySQL:

sort large_filename.txt | uniq > unique_filename.txt
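
If memory or temporary disk space is a concern, GNU sort also accepts -u (deduplicate while sorting), -S (cap the in-memory buffer), and -T (put its temporary files on a disk with enough free space). The size and path below are placeholders:

# One-step equivalent of sort | uniq, with explicit memory and temp-dir limits.
sort -u -S 2G -T /path/to/big_tmp large_filename.txt > unique_filename.txt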