Logically parse a string in excel to trim near duplicates

Question

Assume the string:

item1, item1N, item1Z, item1fhg, item1_any_letters, item2, item3, item3N, item3H

my goal output is simply

item1, item2, item3

this is about a 100,000 line Excel file currently, but can be migrated to another program etc if needed temporarily.

Essentially I need to determine duplicates (any initial phrase ending in a number) with no regard to letters after the number. Some phrases might have for example "Brand item2, Brand item34" as well, the only determining factor of a duplicate is any and all terminology AFTER the number.

any ideas on where to begin with this? Each string usually has between 2 and 500 values in it, seperated by comma and a space. No comma follows the final value.

This would be pretty easy in any programming language; it's not easy in Excel. — Carl Manaster, Jun 26 '12 at 21:41
sure; but if needed to somehow migrate it into a seperate file to be used with a language, I can do that. I just don't personally know any languages that are able to do this (html, javascript only) — user1484009, Jun 26 '12 at 21:50
I am not master, not even a novice, at this, but would RegExp help. VBA has RegEx. See [here](http://stackoverflow.com/questions/9150552/lookbehind-on-regex-for-vba) — Scott Holtzman, Jun 26 '12 at 21:52
So you need unique items per line, or across the whole file? — Tim Williams, Jun 26 '12 at 21:56
@CarlManaster Actually it is straightforward in Excel VBA - which should be unsurprising given VBA is a programming language. Use of the `regexp` further simplifies your suggested pseudocode — brettdj, Jun 27 '12 at 03:55

score 3 · Accepted Answer · answered Jun 27 '12 at 00:01

Sub Tester()

    Dim re As Object, match As Object
    Dim dict As Object
    Dim arr, arrItems, x As Long, y As Long
    Dim val, matches, valMatch


    Set dict = CreateObject("scripting.dictionary")
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "([\w ]+\d+)"
    re.ignorecase = True
    re.Global = True

    arr = ActiveSheet.Range("A1:A100").Value

    For x = LBound(arr, 1) To UBound(arr, 1)
        arrItems = Split(arr(x, 1), ",")
        dict.RemoveAll
        For y = LBound(arrItems) To UBound(arrItems)

            val = Trim(arrItems(y))

            If re.Test(val) Then
               Set matches = re.Execute(val)
               valMatch = matches(0).Value
               If Not dict.exists(valMatch) Then dict.Add valMatch, 1
            End If
        Next y

        Debug.Print arr(x, 1)
        Debug.Print Join(dict.keys, ",") 'where do you want this?

    Next x

End Sub

+ 1 Daniel Craig + RegExp : Deadly Combination :) – Siddharth Rout Jun 27 '12 at 06:04 — Siddharth Rout, Jun 27 '12 at 06:04

brettdj · Answer 2 · 2012-06-27T03:58:04.183

A VBA approach that is somehwat similar to Tim's for the first pathway

Use a RegExp to remove the invalid charcaters (characters after a number and before a comma)

Eliminate the duplicates with
a) Use a Dictionary
b) Excel's inbuilt remove duplicates functionality (writes to a sheet)

Const strDelim = ", "

Sub TestMe()
Dim strTest As String
Dim x
strTest = "item1, item1N, item1Z, item1fhg, item1_any_letters, item2, item3, item3N, item3H"
x = Split(DeDupe(strTest), strDelim)
'fix last element
x(UBound(x)) = Left$(x(UBound(x)), Len(x(UBound(x))) - 1)
Call Method2(x)
End Sub

Sub Method2(ByVal x)
Dim objDic As Object
Dim y As Variant
Set objDic = CreateObject("Scripting.Dictionary")
Dim lngRow As Long
For lngRow = LBound(x) To UBound(x)
objDic(x(lngRow)) = 1
Next lngRow
MsgBox Join(objDic.keys, strDelim)
End Sub      

Function DeDupe(strIn As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Global = True
.Pattern = "(.+?\d+)[^\d]+(,|$)"
DeDupe = .Replace(strIn, "$1,")
End With
End Function

Option B

    'another potential option. Not applied in this code
    Sub Method1(ByVal x)
    Dim y As Variant
    Dim rng1 As Range
    With ActiveSheet
    .[a1].Resize(UBound(x) + 1, 1) = Application.Transpose(x)
    .Columns("A").RemoveDuplicates Columns:=1, Header:=xlNo
    y = Application.Transpose(Range([a1], Cells(Rows.Count, "A").End(xlUp)))
    End With
    MsgBox Join(y, strDelim)
    End Sub

score 0 · Answer 3 · answered Jun 26 '12 at 22:14

This is probably imperfect, since it's a quick hack which only removes the rightmost non-digit strings. You will need some regexp knowledge to tune it to your needs.

Anyway, follow the "installation" steps given here, save the module, and you will be able to write in your sheet a formula such as

=S(A1;"[^0-9]*$";"")

in, say, the B1 cell. If A1 cell contains "Item 1234 blah blah", then B1 will now contain "Item 1234". Drag the formula in all cells of column B, and save values to another Excel file for sorting (or you can try sorting and sub-totaling in-place).

Unfortunately, I do not believe that doing this in 100,000+ cells is practical (I even advise against subtotaling in-place).

You would be much better served by installing textools (sed, grep, uniq...) for Windows, and running your file through a filter. Assuming that each row represents one item as above, a filter such as

sed -e 's/^\([^0-9][^0-9]*[0-9][0-9]*\).*/\1/g' | sort | uniq -c | sort -rn

would get your 100,000 line file and return something like

79283 Item 1
 1234 Item 2
  993 Item 3
  ..........

(on some platforms you could have written (\D+\d+) instead of ([^0-9]..., but I'm unsure of the Windows behaviour).

An even better choice of tools would be (Strawberry)Perl, which has CSV support too, or Python language.

Logically parse a string in excel to trim near duplicates

3 Answers3