1

Good afternoon, We are building a prototype of a deduper. We are using a array of STL strings to store the records to be depuped. The array looks like this:

std::string* StringArray = new std::string[NumberDedupeRecords]

The records are very large, as large as 160,000,000 bytes. When we try to store a std::string version of a record to deduped in the std::string* StringArray, STL makes a deep copy of the string and mallocs a new buffer of at least 160,000,000 bytes. We quickly run out of heap memory and get a std::bad_alloc exception. Is there a workaround to avoid the deep copy and std::bad_alloc? Perhaps we should use a new data structure for storing the std::string records to be deduped or maybe we should save auto_ptr's.

We show a code snippet here:

std::string clara5(curr.getPtr()); 
char* const maryptr = (curr.getPtr() + n - curr.low()); 
maryptr[54] = '\x0'; 
StringArray[StringArrayCount] = clara5; 
curr.mPtr = (char*)StringArray[StringArrayCount].c_str(); 

std::multiset<Range>::iterator miter5 = ranges_type.lower_bound(Range(n));
(*miter5).mPtr = curr.mPtr; StringArrayCount += 1;

Thank you.

Mat
  • 202,337
  • 40
  • 393
  • 406
Frank
  • 1,406
  • 2
  • 16
  • 42
  • Can you please post code snippets of how you are storing records to the StringArray? – user258808 May 19 '11 at 18:47
  • What is the meaning of "dedupe" and "depupe"? – fredoverflow May 19 '11 at 18:49
  • 1
    [This](http://stackoverflow.com/questions/1116040/) might help. – Björn Pollex May 19 '11 at 18:50
  • @user258808, Thank you for your reply. Here is some code: std::string clara5(curr.getPtr()); char* const maryptr = (curr.getPtr() + n - curr.low()); maryptr[54] = '\x0'; StringArray[StringArrayCount] = clara5; curr.mPtr = (char*)StringArray[StringArrayCount].c_str(); Thank your for your help. std::multiset::iterator miter5 = ranges_type.lower_bound(Range(n)); (*miter5).mPtr = curr.mPtr; StringArrayCount += 1; – Frank May 19 '11 at 18:52
  • 2
    @Frank Comments are not meant for code; edit your question instead. – Etienne de Martel May 19 '11 at 18:53
  • @Space_C0wb0y That looks promising, but wouldn't it only work if there are duplicate records? – Chris Frederick May 19 '11 at 18:55
  • @Fred Overflow. Here is a short explanation of dedupe. Suppose you have a very large database of customer prospects gather from different sources. There were will be a significiant number of duplicates with the same name, address, company, zipcode. To avoid sending marketing material to the same customer prospect, you remove the duplicates. That is the essence of deduping. Thank you for your inquiry, – Frank May 19 '11 at 18:57
  • Etienne de Martel , Thank you for your reply. I put a code snippet in the question. – Frank May 19 '11 at 19:03

3 Answers3

5

You can simply take a pointer or reference to the original std::string- including smart pointer if you find it necessary to enforce various ownership strategems.

Puppy
  • 144,682
  • 38
  • 256
  • 465
  • 2
    I think accessing those strings through `boost::shared_ptr`s (or equivalent) would be the best way. – Etienne de Martel May 19 '11 at 18:53
  • @DeadMG, Thank you for answer. Ijust accepted your answer. The oproblem take a taking pointers to the original std:: string is that suppose we modify the records during processing by removing all junk characters when cleanse the records, then we end up with multiple copy of the same record and it is hard to keep track of which is correct. Would smart auto_ptr resolve this problem. Thank you for your help. – Frank May 19 '11 at 19:09
  • @Etienne de Martel, I think your idea of boost::shared pointers ia good one. Unfortunately, We have some UNIX customers. So I have ported our Windows program to LINUX/UNIX. The problem is that UNIX licenses are 5 to 6 years old. So, my supervisor is reluctant to use Boost sinceit would mean we would have a costly purchase of Solarisand IBM-AIX UNIX licenses, Is their an C/C++ equivalent to boost::shared_ptr? Thank you for your help. – Frank May 19 '11 at 19:15
  • @Frank: I don't see the connection between using Boost and having to purchase licences. Boost is free. However, if you take a pointer to a `std::string`, then you modify the string, you will not end up with multiple copies, as they are the *same* std::string. If you process the original string, then access it through the pointer, it will reflect the changes. – Puppy May 19 '11 at 19:22
  • @DeadMG, Thank you for your answer. I will try to use your idea of taking pointers to a std::string. As for Boost, I worked on a large product using the Boost regex library. I download the latest distribution of Boost and it compiled perfectly in Windows and Red Hat Linux. But then, when I tried compiling it on our 6 year old UNIX compilers, I got multiple compile error on the template copy. So I modified the Boost regex source code to run on Solaris and IBM AIX. Finally I ended having 3 versions of the Boost regex source code. My supervisor told me this a Subversion nightmare. Thank you. – Frank May 19 '11 at 19:32
2

If possible, rather than trying to use smart pointers, you may want to change your code so that you only have a few instances of std::string in memory at a time. This of course will depend on your access patterns, but you may be able to load and process one string (record) at a time rather than allocating an array for all of them at once.

EDIT: Given that the OP is trying to remove duplicates, this may not work very well.

Chris Frederick
  • 5,482
  • 3
  • 36
  • 44
  • Christopher Frederick , Thank you for your answer . I think you have a good idea here. I will try it. I just accepted your answer. The keyissue is since this we removing multiple duplicates, we would like to cache some strings. However, We have a limited amount of heap memory, so we can't make too many deep copies of large std::string, Thank you for your help. – Frank May 19 '11 at 19:20
1

I think the real answer to your problem is to use a rope - see http://www.sgi.com/tech/stl/Rope.html - std::string is not really designed to be use for very large strings.