4

My sub compares two lists of strings and returns the closest matches. I've found that the sub gets tripped up over some common words such as "the" and "facility". I would like to write a function that would be supplied an array of words to exclude and check each string for these words and exclude them if found.

Here is a sample input:

|aNames        |  bNames        | words to exclude
|thehillcrest  |oceanview health| the
|oceanview, the|hillCrest       | health

Intended Output:

|aResults     |bResuts
|hillcrest    |hillcrest
|oceanview    |oceanview

So far I have:

Dim ub as Integer
Dim excludeWords() As String

'First grab the words to be excluded
If sheet.Cells(2, 7).Value <> "" Then
  For y = 2 To sheet.Range("G:G").End(xlDown).Row
    ub = UBound(excludeWords) + 1             'I'm getting a subscript out of range error here..?
    ReDim Preserve excludeWords(0 To ub)
    excludeWords(ub) = sheet.Cells(y, 7).Value
  Next y
End If

Then my comparison function, using a double loop, will compare each string in column A with column B. Before the comparison, the value in column a and b will go through our function which will check for these words to exclude. It's possible that there will be no words to exclude, so the parameter should be optional:

Public Function normalizeString(s As String, ParamArray a() As Variant)
  if a(0) then           'How can I check?
    for i = 0 to UBound(a)
      s = Replace(s, a(i))
    next i
  end if
  normalizeString = Trim(LCase(s))
End Function

There's probably a few parts in this code that won't work. Might you be able to point me in the right direction?

Thank you!

Community
  • 1
  • 1
ZAR
  • 2,550
  • 4
  • 36
  • 66
  • How does `oceanview, the` become `Oceanview`? You can replace `the` but that would give you `oceanview,` and not `Oceanview`. Would your list of words to exclude also include special characters? – Siddharth Rout Nov 06 '14 at 21:21
  • Right, I should edit: case doesn't matter here (I'm calling LCASE on it) – ZAR Nov 06 '14 at 21:31
  • 1
    I am not talking about case here :) I am talking about the `comma` Please see the answer posted below. You may have to refresh the page – Siddharth Rout Nov 06 '14 at 21:32
  • ooo, right. I'd probably like to get rid of that too! – ZAR Nov 06 '14 at 21:32

2 Answers2

6

To store the list in the array, you can do this

Sub Sample()
    Dim excludeWords As Variant
    Dim lRow As Long

    With Sheet1 '<~~ Change this to the relevant sheet
        '~~> Get last row in Col G
        lRow = .Range("G" & .Rows.Count).End(xlUp).Row

        excludeWords = .Range("G2:G" & lRow).Value

        'Debug.Print UBound(excludeWords)

        'For i = LBound(excludeWords) To UBound(excludeWords)
            'Debug.Print excludeWords(i, 1)
        'Next i
    End With
End Sub

And then pass the array to your function. The above array is a 2D array and hence needs to be handled accordingly (see commented section in the code above)

Also like I mentioned in the comments above

How does oceanview, the become Oceanview? You can replace the but that would give you oceanview, (notice the comma) and not Oceanview.

You may have to pass those special characters to Col G in the sheet or you can handle them in your function using a loop. For that you will have to use the ASCII characters. Please see this

Followup from comments

Here is something that I wrote quickly so it is not extensively tested. Is this what you are looking for?

Sub Sample()
    Dim excludeWords As Variant
    Dim lRow As Long

    With Sheet1
        lRow = .Range("G" & .Rows.Count).End(xlUp).Row

        excludeWords = .Range("G2:G" & lRow).Value

        '~~> My column G has the word "habilitation" and "this"
        Debug.Print normalizeString("This is rehabilitation", excludeWords)

        '~~> Output is "is rehabilitation"
    End With
End Sub

Public Function normalizeString(s As String, a As Variant) As String
    Dim i As Long, j As Long
    Dim tmpAr As Variant

    If InStr(1, s, " ") Then
        tmpAr = Split(s, " ")

        For i = LBound(a) To UBound(a)
            For j = LBound(tmpAr) To UBound(tmpAr)
                If LCase(Trim(tmpAr(j))) = LCase(Trim(a(i, 1))) Then tmpAr(j) = ""
            Next j
        Next i
        s = Join(tmpAr, " ")
    Else
        For i = LBound(a) To UBound(a)
            If LCase(Trim(s)) = LCase(Trim(a(i, 1))) Then
                s = ""
                Exit For
            End If
        Next i
    End If

    normalizeString = Trim(LCase(s))
End Function
Siddharth Rout
  • 147,039
  • 17
  • 206
  • 250
  • Since I'm using the levenshtein distance algorithm, i'm not too worried about small things like commas. But whole words such as "rehabilitation" can completely mess up the comparison – ZAR Nov 06 '14 at 21:40
  • Also, how is the normalizeString function looking? Is this valid? – ZAR Nov 06 '14 at 21:41
  • Yes that could be a problem in case of `false positive`. I haven't checked `normalizeString yet` – Siddharth Rout Nov 06 '14 at 21:42
  • For `levenshtein distance algorithm` or case of `False Positives` what you can do is split the word `split(word," ")` and then match it with the array so that you match with complete words – Siddharth Rout Nov 06 '14 at 21:43
  • @ZAR: See the followup that I posted. You may have to refrsh the page – Siddharth Rout Nov 06 '14 at 21:53
  • @ZAR: Kool. I am off to bed now. It is already 3:25 AM :P If you have any questions, I will answer them when I get up.. i.e if someone else has not addressed them... – Siddharth Rout Nov 06 '14 at 21:56
  • @ZAR: BTW I feel what you are actually trying to achieve has nothing to do with `levenshtein distance algorithm` Yours is more of a simple replacing of words. – Siddharth Rout Nov 06 '14 at 22:04
  • What I've written here, yes. But I've simplified the actual program just to emphasize the problem I'm facing. The reason I use the levenshtein algorithm is for cases like "Mary's convalescent" versus "Marys convalescent". The two sources of data have many small alterations, not all of which I can account for. Instead, I just compare them all and whichever have the lowest levenshtein distance is the most likely match. – ZAR Nov 06 '14 at 23:40
  • 1
    it worked! You are awesome! Fighting code crime at 3 in the morning, what a vigilante! So I wasn't aware that arrays could be created implicitly like this. I thought that you had to first dimension them with size params and then redimension for every addition. Here it seems that split automatically creates "a" into an array.. huh.. – ZAR Nov 07 '14 at 00:09
5

First of all, you cannot call UBound function for the Array that doesn't have a size yet:

Dim excludeWords() As String

ub = UBound(excludeWords) + 1  'there is no size yet

To remove some of the unwanted words use Replace function

String1 = Replace(String1, "the", "")

To do the comparison you described I would use Like function. Here is documentation. http://msdn.microsoft.com/pl-pl/library/swf8kaxw.aspx

PolaEla
  • 116
  • 4
  • Thanks for the heads up on UBound. So what are my options here? I am using the replace function already. Finally, I'm using the levenshtein distance algorithm to compute similarity. But that section isn't necessary to include in this question, I didn't think. Thanks @PolaEla – ZAR Nov 06 '14 at 21:29
  • Yes, probably your way to compare is better ;). So what is the main question here? – PolaEla Nov 06 '14 at 21:47