Find duplicates in a stringlist very fast

Question

what is the fastest way to find duplicates in a Tstringlist. I get the data I need to search for duplicates in a Stringlist. My current idea goes like this :

    var  TestStringList, DataStringList  : TstringList;


    for i := 0 to  DataStringList.Items-1 do
    begin
        if TestStringList.Indexof(DataStringList[i])< 0 < 0 then
        begin
          TestStringList.Add(DataStringList[i])
        end
        else
        begin
           memo1.ines.add('duplicate item found');
        end;

    end;
   ....

Don't withold Delphi version you use! Edit the question and set the version-related tag! — Arioch 'The, Nov 29 '13 at 11:46

Ken White · Answer 1 · 2013-11-29T14:04:12.980

Just for completeness, (and because your code doesn't actually use the duplicate, but just indicates one has been found): Delphi's TStringList has the built-in ability to deal with duplicate entries, in it's Duplicates property. Setting it to dupIgnore will simply discard any duplicates you attempt to add. Note that the destination list has to be sorted, or Duplicates has no effect.

TestStringList.Sorted := True;
TestStringList.Duplicates := dupIgnore;

for i := 0 to  DataStringList.Items-1 do
   TestStringList.Add(DataStringList[i]);
Memo1.Lines.Add(Format('%d duplicates discarded',
                      [DataStringList.Count - TestStringList.Count]));

A quick test shows that the entire loop can be removed if you use Sorted and Duplicates:

TestStringList.Sorted := True;
TestStringList.Duplicates := dupIgnore;

TestStringList.AddStrings(DataStringList);
Memo1.Lines.Add(Format('%d duplicates discarded',
                      [DataStringList.Count - TestStringList.Count]));

See the TStringList.Duplicates documentation for more info.

score 7 · Answer 2 · answered Nov 29 '13 at 11:48

I think that you are looking for duplicates. If so then you do the following:

Case 1: The string list is ordered

In this scenario, duplicates must appear at adjacent indices. In which case you simply loop from 1 to Count-1 and check whether or not the elements of index i is the same as that at index i-1.

Case 2: The string list is not ordered

In this scenario we need a double for loop. It looks like this:

for i := 0 to List.Count-1 do
  for j := i+1 to List.Count-1 do
    if List[i]=List[j] then
      // duplicate found

There are performance considerations. If the list is ordered the search is O(N). If the list is not ordered the search is O(N²). Clearly the former is preferable. Since a list can be sorted with complexity O(N log N), if performance becomes a factor then it will be advantageous to sort the list before searching for duplicates.

or introduce a temporary collection, optimized for dup search — Arioch 'The, Nov 29 '13 at 11:55
@Arioch'The You are quite right. That brings it back to `O(N)`. — David Heffernan, Nov 29 '13 at 11:57

Arioch 'The · Answer 3 · 2013-11-29T12:18:51.947

4

Judging by the use of IndexOf you use an unsorted list. The scaling factor of your algorithm then is n^2. That is slow. You can optimize it as David shown by limiting search area in the internal search and then the average factor would be n^2/2 - but that still scales badly.

Note: scaling factor here makes sense for limited workloads, say dozen or hundreds of strings per list. For larger sets of data asymptotic analysis O(...) measure would suit better. However finding O-measures for QuickSort and for hash-lists is a trivial task.

Option 1: Sort the list. Using quick-sort it would have scaling factor n + n*log(n) or O(n*log(n)) for large loads.

Set Duplicates to accept
Set Sorted to True
Iterate the sorted list and check if the next string exists and is the same
http://docwiki.embarcadero.com/Libraries/XE3/en/System.Classes.TStringList.Duplicates
http://docwiki.embarcadero.com/Libraries/XE3/en/System.Classes.TStringList.Sorted

Option 2: use hashed list helper. In modern Delphi that would be TDictionary<String,Boolean>, in older Delphi there is a class used by TMemIniFile

You iterate your stringlist and then check if the string was already added into the helper collection.

The scaling factor would be a constant for small data chunks and O(1) for large ones - see http://docwiki.embarcadero.com/Libraries/XE2/en/System.Generics.Collections.TDictionary.ContainsKey

If it was not - you add it with "false" value.
If it was - you switch the value to "true"

For older Delphi you can use THashedStringList in a similar pattern (thanks @FreeConsulting)

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/IniFiles_THashedStringList_IndexOf.html

edited Nov 29 '13 at 12:18

answered Nov 29 '13 at 11:49

Arioch 'The

15,799
35
62

Your big O notation is a bit off. There's no such thing as `O(n + n log n)`. It's just `O(n log n)`. And likewise, it's `O(n^2)` rather than `O(n^2/2)`. – David Heffernan Nov 29 '13 at 11:54
1

@DavidHeffernan id did not used O-notation here. I used a close, related term, but not O. And yes, it would be `n*(log(n)+1)` for you have to iterate the list after the sorting – Arioch 'The Nov 29 '13 at 11:58
1

`TMemIniFile` uses `THashedStringList` which in turn uses Jenkins hash to avoid full string comparison. – Free Consulting Nov 29 '13 at 11:59
Well, it's easy to get something right if you use terminology that nobody else uses and you get to choose what it means?! But `n + n*log(n)` is meaningless. You are adding together unrelated quantities. It's like adding seconds to metres! – David Heffernan Nov 29 '13 at 12:00
1

@FreeConsulting thank you, but i did not use it and am not very experienced with it. – Arioch 'The Nov 29 '13 at 12:03
1

@DavidHeffernan we always do. When you talk about n*log(n) in quick sort - which operations do you mean, comparison ? or reading ? or writing ? or copying (reading+writing) ? We take the average spread of operations and how it would react to the changes of n there. You say we should account for operations inside QuickSort but should ignore operations we would explicitly do when iterating and comparing after the sort. Why not the other way around ? – Arioch 'The Nov 29 '13 at 12:06
@Arioch'The I don't think you've really grasped big O – David Heffernan Nov 29 '13 at 12:10
1

@DavidHeffernan i just do not think we are always going to plus-infinity. Hundred or thousand strings are also a practical case. Asymptotic analysis is good - but is not always the only measure. – Arioch 'The Nov 29 '13 at 12:13
1

Information I gave is correct, you could incorporate it into your answer, but keep in mind what THashedStringList computes hash values for Keys[] and Values[] separately but not for whole Strings[], so it is valuable more likely for illustrative rather than practical purposes. – Free Consulting Nov 29 '13 at 12:27
@FreeConsulting is hope its "IndexOf" is working correctly and using the hashes ? i remember JCL XML parser used it – Arioch 'The Nov 29 '13 at 12:28

score 2 · Answer 4 · edited May 23 '17 at 12:24

Unfortunately it is unclear what you want to do with the duplicates. Your else clause suggests you just want to know whether there is one (or more) duplicate(s). Although that could be the end goal, I assume you want more.

Extracting duplicates

The previously given answers delete or count the duplicate items. Here an answer for keeping them.

procedure ExtractDuplicates1(List1, List2: TStringList; Dupes: TStrings);
var
  Both: TStringList;
  I: Integer;
begin
  Both := TStringList.Create;
  try
    Both.Sorted := True;
    Both.Duplicates := dupAccept;
    Both.AddStrings(List1);
    Both.AddStrings(List2);
    for I := 0 to Both.Count - 2 do
      if (Both[I] = Both[I + 1]) then
        if (Dupes.Count = 0) or (Dupes[Dupes.Count - 1] <> Both[I]) then
          Dupes.Add(Both[I]);
  finally
    Both.Free;
  end;
end;

Performance

The following alternatives are tried in order to compare performance of the above routine.

procedure ExtractDuplicates2(List1, List2: TStringList; Dupes: TStrings);
var
  Both: TStringList;
  I: Integer;
begin
  Both := TStringList.Create;
  try
    Both.AddStrings(List1);
    Both.AddStrings(List2);
    Both.Sort;
    for I := 0 to Both.Count - 2 do
      if (Both[I] = Both[I + 1]) then
        if (Dupes.Count = 0) or (Dupes[Dupes.Count - 1] <> Both[I]) then
          Dupes.Add(Both[I]);
  finally
    Both.Free;
  end;
end;

procedure ExtractDuplicates3(List1, List2, Dupes: TStringList);
var
  I: Integer;
begin
  Dupes.Sorted := True;
  Dupes.Duplicates := dupAccept;
  Dupes.AddStrings(List1);
  Dupes.AddStrings(List2);
  for I := Dupes.Count - 1 downto 1 do
    if (Dupes[I] <> Dupes[I - 1]) or (I > 1) and (Dupes[I] = Dupes[I - 2]) then
      Dupes.Delete(I);
  if (Dupes.Count > 1) and (Dupes[0] <> Dupes[1]) then
    Dupes.Delete(0);
  while (Dupes.Count > 1) and (Dupes[0] = Dupes[1]) do
    Dupes.Delete(0);
end;

Although ExtractDuplicates3 marginally performs better, I prefer ExtractDuplicates1 because it reeds better and the TStrings parameter provides more usability. ExtractDuplicates2 performs noticeable worst, which demonstrates that sorting all items afterwards in a single run takes more time then continuously sorting every single item added.

Note

This answer is part of this recent answer for which I was about to ask the same question: "how to keep duplicates?". I didn't, but if anyone knows or finds a better solution, please comment, add or update this answer.

The question clearly discards them - `if TestStringList.Indexof(DataStringList[i])< 0`, ignoring the obvious `< 0 < 0` typo. Nice answer though, for future readers who might not want to do the same. +1. — Ken White, Nov 30 '13 at 00:39
@Ken On the question code: the if handles the unique items, the else handles the duplicates. Besides logging, it is unclear what OP really wants to do with the items that appear in both lists. Thx though. — NGLN, Nov 30 '13 at 02:49
No, the code simply writes the fact that an item wasn't added (not which item it was that isn't added). There's no point in that unless you're going to count the items that weren't added, and that can be done simply by subtracting counts. :-) I stand by my previous comment. :-) — Ken White, Nov 30 '13 at 02:52

score 1 · Answer 5 · answered Nov 25 '14 at 20:59

This is an old thread but I thought this solution may be useful.

An option is to pump the values from one stringlist to another one with the setting of TestStringList.Duplicates := dupError; and then trap the exception.

var  TestStringList, DataStringList  : TstringList;
TestStringList.Sorted := True;
TestStringList.Duplicates := dupError;

for i := 0 to  DataStringList.Items-1 do
begin
    try
      TestStringList.Add(DataStringList[i])
    except
        on E : EStringListError do begin
            memo1.Lines.Add('duplicate item found');
        end;
    end;
end;

....

Just note that the trapping of the exception also masks the following errors: There is not enough memory to expand the list, the list tried to grow beyond its maximal capacity, a non-existent element of the list was referenced. (i.e. the list index was out of bounds).

score 0 · Answer 6 · edited Dec 27 '14 at 14:24

0

function TestDuplicates(const dataStrList: TStringList): integer;
begin 
  with TStringlist.create do begin
    {Duplicates:= dupIgnore;}
    for it:= 0 to DataStrList.count-1 do begin
      if IndexOf(DataStrList[it])< 0 then
        Add(DataStrList[it])
      else 
        inc(result)
    end;
    Free;
  end;
end;

edited Dec 27 '14 at 14:24

NGLN

43,011
8
105
200

answered Dec 27 '14 at 14:00

user3971311

21
1

Find duplicates in a stringlist very fast

6 Answers6

Extracting duplicates

Performance

Note