3

In a column with hundreds or even 1-2 thousand strings of approximately 40 characters, with one string per cell and many repeating entries, what is the best way to populate the column to conduct the fastest possible search later? The search should return a row number so that the corresponding row can be deleted.

Is there some way to append metadata or label to a cell/row for faster search? Is there some other mechanism that can identify cells that will make searching easier?

I'm new to VBA, and I want to set out on the best path before I get too far into the project and have to search through thousands of strings.

edit: Someone requested an example cell: The cells will have email addresses in them. I can control the email addresses on the server, so they will roughly be 40 characters long each. They will contain alphanumeric characters only.

Community
  • 1
  • 1
ampersand
  • 4,264
  • 2
  • 24
  • 32
  • (Assuming that you want to keep track of duplicates) A hidden sheet maybe, with as many rows as the original one, where each cell corresponds to the original entry (first occurrence)'s row number? – dotNET Aug 26 '17 at 14:29
  • *Don't use a loop-type search*, in VBA use **Application.WorksheetFunction.MATCH()** to get the row number, assuming the strings are about length 40 and less than 255. – Gary's Student Aug 26 '17 at 14:38
  • 1
    A Dictionary provides instant lookups - read the strings as keys and row IDs as values as a one-time operation, then retrieve rows or use `.Exists()` – paul bica Aug 26 '17 at 14:50
  • Could you please provide several cells just as the example? – omegastripes Aug 26 '17 at 15:36
  • good suggestions everyone, thanks. I'll try some of these things. @omegastripes, I updated my post with info on the cell contents. – ampersand Aug 30 '17 at 01:33
  • Should search just return first instance row numer if there are repeating entries found? Do you need to update the worksheet after each row deletion, or you have a list of emails, and worksheet should be updated once after all matced to the list entries are deleted in batch? In former case `System.Collections.ArrayList` is worth a shot. – omegastripes Aug 30 '17 at 03:59
  • @omegastripes search should result in all rows with that email in the search column. After that, it's the 2nd option you listed: those rows will be deleted in batch. (The next step is to append new rows with the same emails but different contents in the other cells and likely a different # of rows) – ampersand Aug 30 '17 at 11:04
  • I was trying to keep the list of rows with the same value in another cell, but once rows are deleted, all of these references need to be updated, so that didn't work.... – ampersand Sep 03 '17 at 16:37
  • @paulbica, so are you suggesting I should create this dictiory every time the worksheet is opened? Create a dictionary then update it as needed? Does the dictionary persist throughout the lifetime of an open worksheet? – ampersand Sep 03 '17 at 16:39
  • Yes, the dictionary will be created every time you open the file, but it will persist while the file is open. 2 or 3 thousand string is relatively small so reading them will be almost instant (as opposed to 100s of thousands of rows) – paul bica Sep 03 '17 at 19:49

1 Answers1

1

Example of a fast way to implement a dictionary lookup

  • Data is on Sheet1, and starts in column A
  • The strings are in column B

Option Explicit

Public Sub SearchStrings()

    Dim ur As Variant, r As Long, d As Object

    Const COL_ID = 2

    Set d = CreateObject("Scripting.Dictionary") 'or Reference to Microsof Scripting Runtime

    d.CompareMode = TextCompare 'Case insensitive, or "BinaryCompare" otherwise

    ur = Sheet1.UsedRange.Columns(COL_ID)   'read strings from column COL_ID into array

    For r = LBound(ur) To UBound(ur)        'populate dictionary; Key = string (unique)

        If Not IsError(ur(r, 1)) Then d(CStr(ur(r, 1))) = r      'Item = row id

    Next

    Debug.Print d.Keys()(3)     'prints the string in row 3
    Debug.Print d.Items()(3)    'prints the row number of the 3rd string

End Sub

If you want to store string duplicates use this:

        If Not IsError(ur(r, 1)) Then d(COL_ID & "-" & r) = CStr(ur(r, 1))

which is Key = Column ID & "-" & row ID (2-5), and Item = String itself

paul bica
  • 10,557
  • 4
  • 23
  • 42
  • Thanks so much for this. I ended up using a variation of this approach where every key is a unique email, and the value is a Range containing the cells with that email. This allows me to quickly delete a group of rows and add new ones. The persistence of the Dictionary is the key to this. Not sure how it will scale as the # of entries will climb into the thousands. I also need to figure out a way to store the Dictionary in the file and reload on file open. The wrinkle is that each entry is a Range. Any ideas? I can post another question if that's what you suggest. – ampersand Sep 05 '17 at 00:10
  • The performance of arrays and dictionaries are as goog as any operations performed in memory so you don't need to worry until you get to a million rows. Reading data from disk when the file is open can be improved with a fast SSD but for just a few thousand it should be fine – paul bica Sep 05 '17 at 01:04
  • What Im not sure about is the "entry as a Range". You mean the items in the dictionary? If so, there are ways to deal with that; for example create multiple keys for the same email + an index, and each entry should contain no more than 256 chars, similar to the implementation in [this question](https://stackoverflow.com/questions/30959315/excel-vba-performance-1-million-rows-delete-rows-containing-a-value-in-less). Or create key like "email_row" - it needs parsing but it'll still be fast. You may need to provide more details if you run into issues, so a new question can help – paul bica Sep 05 '17 at 01:15
  • Yes you understood correctly. The items in the dictionary are ranges. Your suggestions and link are helpful. Thank you!. – ampersand Oct 02 '17 at 19:02