
I understand how to do this using std::string and std::unordered_set, but each line and each element of the set carries a lot of unnecessary overhead, so an unordered_set holding even half the lines from the file ends up 5-10 times larger than the file itself.

Is it possible (and if so, how) to reduce memory consumption so that, for example, duplicates can be removed from a 10 gigabyte file using no more than 20 gigabytes of RAM? And of course, it needs to run in O(n) time.
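
For reference, this is a minimal sketch of the naive version I mean (the file names are placeholders); it works, but the per-string and per-node overhead of the set is what blows up the memory:

#include <fstream>
#include <string>
#include <unordered_set>

int main()
{
    // Placeholder file names; the real input is ~10 GB of short ASCII lines.
    std::ifstream input("input.txt");
    std::ofstream output("output_nodupes.txt");

    std::unordered_set<std::string> seen;  // every unique line is stored here in full
    std::string line;
    while (std::getline(input, line))
    {
        // insert().second is true only for lines that were not already in the set
        if (seen.insert(line).second)
            output << line << '\n';
    }
}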

Chase
  • Are the duplicates always next to each other? – sweenish Aug 16 '22 at 19:08
  • Unfortunately no, duplicates can be scattered around the file randomly. Theoretically, there is an option to sort, but this will not work with files of several hundred gigabytes – Chase Aug 16 '22 at 19:12
  • Memory map the file (using `mmap()` or similar), and maintain hashes of the lines along with pointers to those lines. Don't reduce the file until you've indexed all of the duplicates. After you identify duplicates, then compress. – JohnFilleau Aug 16 '22 at 19:12
  • @JohnFilleau, please tell me how to store this data so that there is instant access by hash and the structure does not take up as much space per element as unordered_set? What is the best hash to use for strings of ASCII characters from 5 to 50 characters long? – Chase Aug 16 '22 at 19:16
  • 5-10 times larger? I don't understand. Might be larger, but shouldn't be more than 2x, possibly less. Read line, if not in set, output it, and add to set. If necessary at end, delete original file & rename output. – Avi Berger Aug 16 '22 at 19:17
  • @AviBerger, yes, such a naive implementation uses 5-10 times more memory than the file itself takes. [Link here](https://stackoverflow.com/questions/37177163/why-does-unordered-set-use-significantly-more-ram-than-data-it-contains) – Chase Aug 16 '22 at 19:21
  • @ThomasWeller, even using twice as much RAM in my case would be a good result. If you have an even more efficient algorithm, I will be glad to hear – Chase Aug 16 '22 at 19:22
  • @ThomasWeller ty. I am ready to send you a python program that will generate 10 gigabytes of random data similar to what I have in my files – Chase Aug 16 '22 at 19:27
  • @Chase thanks, got it. I didn't think about short lines. – Avi Berger Aug 16 '22 at 19:32
  • @Chase -- *Most memory efficient way to remove duplicate lines in a text file* -- I'll be honest with you -- use `awk` or a utility to do this. I remember when someone was being interviewed for a C++ position and was asked a similar question (concerning searching for text in a file). The answer was -- use grep, not C++. – PaulMcKenzie Aug 16 '22 at 19:38
  • @PaulMcKenzie: the time for writing a fast C++ program will definitely exceed the time waiting for a slower algorithm to do the job reliably. – Thomas Weller Aug 16 '22 at 19:40
  • @ThomasWeller I wrote it, [this](https://replit.com/@ChaseBooth/GenerateFileDup#main.py) generates about 350 MB of random data per minute – Chase Aug 16 '22 at 19:47
  • @Chase: how does that code ensure that the file will be 10 GB in the end? Can't see how that'll work. Why `if randint(0, 50) == 25:`? – Thomas Weller Aug 16 '22 at 19:51
  • @ThomasWeller, `if randint(0, 50) == 25:` adds a small amount (a couple of percent) of matching lines to the file. And I know that, on average, a file of 500 million random lines of length about 15 weighs 10 gigabytes. Besides, the exact size is not that important: you can test on a 5 gigabyte file or on a 15 gigabyte one. – Chase Aug 16 '22 at 19:56
  • @ThomasWeller, during the process you can create a temporary file, as long as it doesn't slow down the algorithm; in the end, all lines without duplicates should be written to the result file – Chase Aug 16 '22 at 19:59

4 Answers


You can quickly find duplicate lines using the hash of each line as shown in other answers. But if you only store the hashes then that assumes there is no hash collision. If you use std::hash that won't be true. You can probably get away with it if you use a good cryptographic hash.
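
To put a rough number on that risk (my estimate, assuming a well-distributed 64-bit hash and the ~500 million lines mentioned in the comments): the birthday bound gives an expected number of colliding pairs of about n^2 / 2^65 ≈ (5×10^8)^2 / 3.7×10^19 ≈ 0.007, so a single collision is unlikely but not guaranteed to be absent. With a weak or truncated hash that estimate no longer applies, which is why the quality of the hash matters here.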

With your input being only 10 GB I would suggest a different approach. Well, apart from the trivial one: on a modern system you can probably just load the whole 10 GB into memory and store each line as a std::string.

But let's save some memory:

  • First, you should mmap the file so all of its data is accessible from C++ without it all being loaded into memory at the same time.
  • Create a std::unordered_multimap<std::size_t, std::string_view> lines; to keep track of lines already seen in the input.
  • Loop over the input file and create a string_view for every line of text, compute the hash and look it up in lines. If the hash exists, compare the line against the other lines with the same hash. If the line is unique then add it to lines and output it. (A sketch of this loop follows below.)
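
A minimal sketch of that loop, assuming POSIX mmap() and writing the unique lines to stdout; error handling and the final replacement of the input file are left out, and the file name is a placeholder:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <functional>
#include <string_view>
#include <unordered_map>

int main()
{
    // Placeholder file name; error handling omitted for brevity.
    const int fd = open("input.txt", O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const char * data = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // hash -> view into the mapped file, one entry per unique line seen so far
    std::unordered_multimap<std::size_t, std::string_view> lines;

    const char * p   = data;
    const char * end = data + st.st_size;
    while (p < end)
    {
        const char * nl = static_cast<const char *>(std::memchr(p, '\n', end - p));
        const std::string_view line(p, (nl ? nl : end) - p);
        p = nl ? nl + 1 : end;

        const std::size_t h = std::hash<std::string_view>{}(line);
        auto [first, last] = lines.equal_range(h);
        const bool seen = std::any_of(first, last,
            [&](const auto & kv) { return kv.second == line; });
        if (!seen)
        {
            lines.emplace(h, line);                        // remember the unique line
            std::fwrite(line.data(), 1, line.size(), stdout);
            std::fputc('\n', stdout);
        }
    }

    munmap(const_cast<char *>(data), static_cast<std::size_t>(st.st_size));
    close(fd);
}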

This will use about 32 bytes of memory per (unique) line, I think. So with short lines the memory required might exceed the size of the input file. On the other hand, with short lines there are probably a lot fewer unique lines.

PS: You can save memory by only storing the beginning of each line. And if you estimate the number of (unique) lines you can use a hashtable with a different collision strategy (without bins) to bring it down to 8 bytes per line (sketched below).
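
A rough sketch of that 8-bytes-per-line idea; the class name and interface are made up for illustration, and it assumes the input file is mmap()ed as above and that the table is sized comfortably above the number of unique lines, since nothing here handles a full table. On a hash hit it goes back to the mapped file to compare the actual text, so a collision never drops a line:

#include <cstdint>
#include <cstring>
#include <functional>
#include <string_view>
#include <vector>

// Flat open-addressed set of byte offsets into the mmap()ed input file.
// Each slot is 8 bytes; size the table to roughly 2x the estimated number
// of unique lines so probe chains stay short.
class OffsetSet
{
public:
    OffsetSet(const char * fileData, std::size_t fileSize, std::size_t slots)
        : _data(fileData), _size(fileSize), _slots(slots, kEmpty) {}

    // Returns true if the line was not seen before and was inserted.
    bool insert(std::string_view line, std::size_t offset)
    {
        std::size_t i = std::hash<std::string_view>{}(line) % _slots.size();
        while (_slots[i] != kEmpty)                       // linear probing
        {
            if (lineAt(_slots[i]) == line) return false;  // real duplicate
            i = (i + 1) % _slots.size();
        }
        _slots[i] = offset;
        return true;
    }

private:
    static constexpr std::size_t kEmpty = SIZE_MAX;

    // Re-reads the line text from the mapped file, so only the offset is stored.
    std::string_view lineAt(std::size_t offset) const
    {
        const char * s  = _data + offset;
        const char * nl = static_cast<const char *>(
            std::memchr(s, '\n', _size - offset));
        return nl ? std::string_view(s, nl - s)
                  : std::string_view(s, _size - offset);
    }

    const char *             _data;   // mmap()ed input file contents
    std::size_t              _size;   // size of the mapped file in bytes
    std::vector<std::size_t> _slots;  // byte offset of one stored line per slot
};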

Goswin von Brederlow

Sure, you can do this in O(n^2) time using only the amount of memory required to hold two lines, a boolean flag, and two file offsets.

The basic algorithm would be:

  • Open input file
  • Open output file
  • Set flag to false
  • Set pre-read offset to 0
  • While more input:
    • Read a first input line
    • Save current input file offset as post-read offset
    • Seek input file to offset 0
    • While current input file offset is less than pre-read offset:
      • Read a second input line
      • If first input line equals second input line:
        • Set flag to true
        • Break
    • If flag is false:
      • Write first input line to output file
    • Set flag to false
    • Seek input file to post-read offset
    • Set pre-read offset to post-read offset

This is, of course, extremely time-inefficient, but it's about as memory-efficient as you can get.

Possible C++ implementation:

#include <fstream>
#include <string>

// inputFilePath and outputFilePath are assumed to be defined (e.g. as function parameters)
std::ifstream input(inputFilePath);
std::ofstream output(outputFilePath);

std::streamoff offset_before = 0;
std::streamoff offset_after = 0;
bool found_dupe = false;

std::string line1;
while (std::getline(input, line1)) {
    offset_after = input.tellg();

    input.seekg(0);
    std::string line2;
    while (input.tellg() < offset_before && std::getline(input, line2)) {
        if (line1 == line2) {
            found_dupe = true;
            break;
        }
    }

    if (!found_dupe) {
        output << line1 << '\n';
    }

    found_dupe = false;
    input.seekg(offset_after);
    offset_before = offset_after;
}

Live Demo

Miles Budnek
  • Is the questioner young enough that he might live to see this algorithm complete on a 10 gigabyte data set? :) – Jeremy Friesner Aug 16 '22 at 19:44
  • To be honest, I'm even afraid to imagine how long it can take to execute an algorithm at a speed of O(N^2) on a file of a couple of tens of gigabytes... – Chase Aug 16 '22 at 19:51
  • If Moore's Law keeps up there'll probably be a computer fast enough for it to finish at some point. I will admit this answer was a bit tongue-in-cheek, based purely on the title of the question. – Miles Budnek Aug 16 '22 at 19:56

I know it's a bit late for an answer to this question, but just for fun I wrote an implementation that I think is quite memory-efficient while still being reasonably performant.

In particular, this solution runs in O(N*log(N)) time and uses (on my machine) just 360 kilobytes(!) of heap memory while de-duping a 100,000,000 line (5 gigabyte) text file that contains 99,990,000 randomly-arranged duplicate lines, and finishes in 6 minutes, 32 seconds.

Of course it does cheat a bit, in that it writes out a temporary index file to disk (the index contains a hash value for each line in the input file, as well as that line's position within the input file). The index file needs 16 bytes for each text line, so it came out to ~1.4GB in my test.

To do the de-duplication, the program mmap()'s the index file into RAM, sorts its contents by hash code, then scans the index and invalidates any now-adjacent entries with the same hash code that refer to the same string in the input file.

After that, it re-sorts the index file by byte-offset and then iterates through the index one more time to generate the de-duped output file.

Output from my test run (on a 2018 Intel Mac Mini) is as follows:

Jeremys-Mac-mini-2:~ jaf$ time ./a.out
Step 1:  Read input file [big_input_file.txt] and write index file [index_file.tmp]
Step 2:  mmap() index file [index_file.tmp] into RAM and sort its entries by hash-code
Step 3:  Iterate through index file [index_file.tmp] and invalidate any duplicate entries
Step 4:  Re-sort the index file [index_file.tmp] by byte-offset in preparation for generating output
Step 5:  Write output file [big_output_file.txt]
Step 6:  Delete index file and exit
Final result:  Out of 100000001 lines read, 99990000 duplicate lines detected and removed.
real        6m32.800s
user        3m39.215s
sys         2m41.448s

Source code is below (I compiled it with g++ -std=c++20 -O3 ./dedup_file.cpp):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#include <algorithm>
#include <array>
#include <fstream>
#include <iostream>
#include <span>
#include <string>
#include <string_view>

using SizeTPair = std::array<size_t, 2>;

static const SizeTPair INVALID_INDEX_ENTRY = {(std::size_t)-1, (std::size_t)-1};  // special value meaning "not a valid index entry"

// Given a pointer into the contents of the input file, returns a string_view representing
// the line of text starting there.  (This is necessary since we can't modify the input file
// and the lines in the input file are not NUL-terminated)
static std::string_view GetStringAtOffset(const char * inputFileMap, size_t offset)
{
   if (offset == (size_t)-1) return "";

   const char * s  = &inputFileMap[offset];
   const char * nl = strchr(s, '\n');
   return nl ? std::string_view(s, nl-s) : std::string_view(s);
}

// Comparison functor to sort SizeTPairs by the text they point to
// breaks ties by sorting by line-number (so that if a duplicate line is
// detected in the text, it will always be the second instance of that line that
// is excluded from our program's output, not the first instance)
class SortIndicesByStringCompareFunctor
{
public:
   SortIndicesByStringCompareFunctor(const char * inputFileMap) : _inputFileMap(inputFileMap) {/* empty */}

   bool operator()(const SizeTPair & a, const SizeTPair & b) const
   {
      const std::string_view textA = GetStringAtOffset(_inputFileMap, a[0]);
      const std::string_view textB = GetStringAtOffset(_inputFileMap, b[0]);
      if (textA != textB) return (textA < textB);
      return (a[1] < b[1]); // sub-sort by line number
   }

private:
   const char * _inputFileMap;
};

static void WriteEntryToIndexFile(std::ofstream & indexFile, const SizeTPair & entry, size_t & indexSizeItems)
{
   indexFile.write(reinterpret_cast<const char *>(&entry), 2*sizeof(size_t));
   indexSizeItems++;
}

int main(int, char **)
{
   const char * bigInputFileName  = "big_input_file.txt";
   const char * indexFileName     = "index_file.tmp";
   const char * bigOutputFileName = "big_output_file.txt";

   std::cout << "Step 1:  Read input file [" << bigInputFileName << "] and write index file [" << indexFileName << "]" << std::endl;

   // Step 1.  Read through the big input-text file, and generate a binary
   // index-file containing (for each line) that line's hash-code and also
   // its location in the input file
   size_t indexSizeItems = 0;
   size_t inputFileSizeBytes = 0;
   {
      std::ifstream inputFile;
      inputFile.open(bigInputFileName, std::ios_base::binary | std::ios_base::ate);  // binary only so we can get valid file-offsets out of tellg()
      inputFileSizeBytes = inputFile.tellg();  // get file size
      inputFile.seekg(0, std::ios_base::beg);  // then go back to the beginning of the file so we can read it

      std::ofstream indexFile;
      indexFile.open(indexFileName, std::ios_base::binary);

      std::string nextLine;
      while(inputFile.good())
      {
         const std::streampos curFileOffset = inputFile.tellg();  // do this before reading the line:  record our current read-offset into the file
         std::getline(inputFile, nextLine);
         WriteEntryToIndexFile(indexFile, {std::hash<std::string>()(nextLine), (std::size_t)curFileOffset}, indexSizeItems);
      }

      // Add a final dummy-entry to the end of the index, just to force the flushing of any
      // final text-line(s) in our for-loop in step (3)
      WriteEntryToIndexFile(indexFile, INVALID_INDEX_ENTRY, indexSizeItems);
   }

   std::cout << "Step 2:  mmap() index file [" << indexFileName << "] into RAM and sort its entries by hash-code" << std::endl;

   // Step 2.  mmap() the index-file we just generated, and sort its contents by hash-code (sub-sort by byte-offset)
   const int indexFD = open(indexFileName, O_RDWR, (mode_t)0666);
   if (indexFD < 0) {std::cerr << "Couldn't open() index file!" << std::endl; exit(10);}

   char * indexFileMap = (char *) mmap(0, indexSizeItems*(2*sizeof(size_t)), PROT_READ | PROT_WRITE, MAP_SHARED, indexFD, 0);
   if (indexFileMap == MAP_FAILED) {std::cerr << "mmap() of index file failed!" << std::endl; exit(10);}

   SizeTPair * index = reinterpret_cast<SizeTPair *>(indexFileMap);
   std::span<SizeTPair> indexSpan(index, index+indexSizeItems);
   std::sort(std::begin(indexSpan), std::end(indexSpan));

   std::cout << "Step 3:  Iterate through index file [" << indexFileName << "] and invalidate any duplicate entries" << std::endl;

   // Step 3.  Go through the index file and invalidate any duplicate
   // entries (i.e. any entries that have the same hash code and same
   // underlying string as a previous entry)
   const int inputFD = open(bigInputFileName, O_RDONLY, (mode_t)0666);
   if (inputFD < 0) {std::cerr << "Couldn't open() input file!" << std::endl; exit(10);}

   const char * inputFileMap = (const char *) mmap(0, inputFileSizeBytes, PROT_READ, MAP_SHARED, inputFD, 0);
   if (inputFileMap == MAP_FAILED) {std::cerr << "mmap() of input file failed!" << std::endl; exit(10);}

   size_t dupesRemoved = 0;
   ssize_t runStartIdx = -1;
   for (size_t i=0; i<indexSizeItems; i++)
   {
      SizeTPair & curEntry = index[i];

      // swap to put the line number in [0] and the hash in [1], since in the future
      // we will want to sort by line number and this will make it easier to do that.
      std::swap(curEntry[0], curEntry[1]);

      const size_t curByteOffset = curEntry[0];
      const size_t curHash       = curEntry[1];

      if (runStartIdx >= 0)
      {
         if (curHash != index[runStartIdx][1])
         {
            // A run of identical hashes started at (runStartIdx) and ended just before (i)
            if ((i-runStartIdx)>1)
            {
               // Re-sort the index-entries-with-identical-hashes by the strings they represent
               // so that we can find and remove any duplicate-strings easily.  (We have to do this
               // because the same hash could, at least in principle, be associated with two different strings)
               std::span<SizeTPair> duplicateHashesSpan(index+runStartIdx, index+i);
               std::sort(std::begin(duplicateHashesSpan), std::end(duplicateHashesSpan), SortIndicesByStringCompareFunctor(inputFileMap));
               std::string_view previousEntryTextLine;
               for (size_t j=runStartIdx; j<i; j++)
               {
                  const std::string_view curEntryTextLine = GetStringAtOffset(inputFileMap, index[j][0]);
                  if (curEntryTextLine == previousEntryTextLine)
                  {
                     index[j] = INVALID_INDEX_ENTRY;
                     dupesRemoved++;
                  }
                  previousEntryTextLine = curEntryTextLine;
               }
            }
            runStartIdx = i;
         }
      }
      else runStartIdx = i;
   }

   std::cout << "Step 4:  Re-sort the index file [" << indexFileName << "] by byte-offset in preparation for generating output" << std::endl;

   // Step 4.  Re-sort the index file by byte-offset (note that each line's byte-offset is stored
   //          as the first entry in its SizeTPair now!)
   std::sort(std::begin(indexSpan), std::end(indexSpan));

   std::cout << "Step 5:  Write output file [" << bigOutputFileName << "]" << std::endl;

   // Step 5.  Read through the big text file one more time, and
   // write out only those lines that still exist in the index file
   std::ofstream outputFile;
   outputFile.open(bigOutputFileName);
   for (size_t i=0; i<indexSizeItems; i++)
   {
      const SizeTPair & curEntry = index[i];
      if (curEntry == INVALID_INDEX_ENTRY) break;  // these will all have been sorted to the end so we can stop now
                                      else outputFile << GetStringAtOffset(inputFileMap, curEntry[0]) << std::endl;
   }
   outputFile.close();

   // Step 6.  Clean up our mess and exit
   std::cout << "Step 6:  Delete index file and exit" << std::endl;
   close(inputFD);
   close(indexFD);
   remove(indexFileName);

   std::cout << "Final result:  Out of " << (indexSizeItems-1) << " lines read, " << dupesRemoved << " duplicate lines detected and removed. " << std::endl;
   return 0;
}
Jeremy Friesner

This code reads the input file line by line, storing only the hashes of the strings in memory. If a line was not seen before, it writes it to the output file. If a line was seen before, it does nothing.

It uses sparsepp to reduce the memory footprint.

Input data:

  • 12 GB file size
  • ~197,000,000 different lines
  • line length < 120 characters

Build:

  • C++20
  • release build
  • do not run within Visual Studio (no debugger attached)

Processing:

  • AMD Ryzen 2700X
  • 32 GB of RAM
  • Seagate SSD
  • 190 seconds
  • 954 MB virtual memory peak

Is that good enough? I can't tell, because your performance requirements are quite vague and you didn't give proper performance comparison data. It may depend on your machine, your data, your RAM size, your hard disk speed, ...

#include <chrono>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <array>
#include <cstring>
#include <functional>
#include <random>
#include <string>
#include "spp.h"
int main()
{
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    std::ifstream input;
    std::ofstream output;
    input.open("10gb.txt");
    output.open("10gb_nodupes.txt");
    std::string inputline;
    spp::sparse_hash_map<size_t, size_t> hashes;  // only line hashes are kept in memory
    while (std::getline(input, inputline))
    {
        std::size_t hash = std::hash<std::string>{}(inputline);
        if (!hashes.contains(hash))
        {
            // first time this hash is seen: keep the line and remember the hash
            output << inputline << '\n';
            hashes[hash]=0;
        }
    }
        }
    }
    input.close();
    output.close();
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::seconds>(end - begin).count() << "[s]" << std::endl;
    std::cout << "Done";
}
Thomas Weller
  • It looks like you could save memory by using a `std::unordered_set` instead. You’re only concerned with the key, not a pair. – Blastfurnace Aug 16 '22 at 20:23
  • I don't know, the program uses [more than 3 GB](https://prnt.sc/a2STo8FKJ6l_) on a [700 MB file](https://prnt.sc/ZsbdBHoFn4Ld) – Chase Aug 16 '22 at 20:30
  • @ThomasWeller The file was generated by the python program [I sent you](https://replit.com/@ChaseBooth/GenerateFileDup#main.py). The generation was stopped manually at 760 megabytes – Chase Aug 16 '22 at 20:33
  • The program was compiled using the MSVC 2022 compiler. If I should set some particular compiler options, please tell me which ones, because it is quite difficult to figure that out there. – Chase Aug 16 '22 at 20:37
  • @ThomasWeller, yes, this is a realistic scenario, lines longer than 40-50 characters are not expected in principle – Chase Aug 16 '22 at 20:38
  • Configuration release, no optimization, x64 system – Chase Aug 16 '22 at 20:39
  • Specifically, I have not tried your hash-only approach, it spends almost half as much memory as my own program. I'm sorry that you've wasted so much time because of me, but I'm not very knowledgeable on the subject and therefore I can't understand which information from VS is valuable and which is not – Chase Aug 16 '22 at 20:44
  • Can you please give me a link to the hash library you used? I will try with this algorithm – Chase Aug 16 '22 at 20:59
  • @ThomasWeller, I have absolutely no idea how this happened, but replacing the standard library with this one, along with a few other small changes, totally helped. I am a poor student and I cannot thank you for your time, but I ask you to send me the address of your bitcoin wallet, I will transfer as much as I can – Chase Aug 16 '22 at 21:53
  • @Blastfurnace A `std::unordered_set` of hashes assumes you have no hash collisions. – Goswin von Brederlow Aug 17 '22 at 06:16
  • @Chase: no need. It was a nice refresher for my rusted C++ anyways. – Thomas Weller Aug 17 '22 at 06:26
  • This approach is unreliable. It assumes `std::hash{}(inputline);` always yields different values for different `inputline` values - that's not the way hashing works. Statistically, you will probably get away with it for 1) larger word size (e.g. a 64-bit app with 64-bit `size_t` & hash output, rather than a 32-bit app/hash, helps), 2) a relatively small number of distinct `inputline` values (as a rule of thumb, stay well under 2^32~=4b keys for 64 bit hashes, and under 2^16=64k keys for 32-bit hashes), and 3) a genuinely strong hash function. Chase uses MSVC2022 => v. weak hash. – Tony Delroy Aug 17 '22 at 14:19
  • FWIW, this approach could be salvaged simply by using a better hash function... I'd say given the scales here, it should be a 128-bit hash. A little google-fu and OP can find one that's suitable. (VC++'s string hashing just does some xoring and shifting using ten characters evenly spaced along the string... it doesn't even look at every character; it'd be terrible for this on many input sets.) – Tony Delroy Aug 17 '22 at 18:05