3

Assume the string:

item1, item1N, item1Z, item1fhg, item1_any_letters, item2, item3, item3N, item3H

my goal output is simply

item1, item2, item3

this is about a 100,000 line Excel file currently, but can be migrated to another program etc if needed temporarily.

Essentially I need to determine duplicates (any initial phrase ending in a number) with no regard to letters after the number. Some phrases might have for example "Brand item2, Brand item34" as well, the only determining factor of a duplicate is any and all terminology AFTER the number.

any ideas on where to begin with this? Each string usually has between 2 and 500 values in it, seperated by comma and a space. No comma follows the final value.

brettdj
  • 54,857
  • 16
  • 114
  • 177
  • This would be pretty easy in any programming language; it's not easy in Excel. – Carl Manaster Jun 26 '12 at 21:41
  • sure; but if needed to somehow migrate it into a seperate file to be used with a language, I can do that. I just don't personally know any languages that are able to do this (html, javascript only) – user1484009 Jun 26 '12 at 21:50
  • 1
    I am not master, not even a novice, at this, but would RegExp help. VBA has RegEx. See [here](http://stackoverflow.com/questions/9150552/lookbehind-on-regex-for-vba) – Scott Holtzman Jun 26 '12 at 21:52
  • So you need unique items per line, or across the whole file? – Tim Williams Jun 26 '12 at 21:56
  • 2
    @CarlManaster Actually it is straightforward in Excel VBA - which should be unsurprising given VBA is a programming language. Use of the `regexp` further simplifies your suggested pseudocode – brettdj Jun 27 '12 at 03:55

3 Answers3

3
Sub Tester()

    Dim re As Object, match As Object
    Dim dict As Object
    Dim arr, arrItems, x As Long, y As Long
    Dim val, matches, valMatch


    Set dict = CreateObject("scripting.dictionary")
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "([\w ]+\d+)"
    re.ignorecase = True
    re.Global = True

    arr = ActiveSheet.Range("A1:A100").Value

    For x = LBound(arr, 1) To UBound(arr, 1)
        arrItems = Split(arr(x, 1), ",")
        dict.RemoveAll
        For y = LBound(arrItems) To UBound(arrItems)

            val = Trim(arrItems(y))

            If re.Test(val) Then
               Set matches = re.Execute(val)
               valMatch = matches(0).Value
               If Not dict.exists(valMatch) Then dict.Add valMatch, 1
            End If
        Next y

        Debug.Print arr(x, 1)
        Debug.Print Join(dict.keys, ",") 'where do you want this?

    Next x

End Sub
Tim Williams
  • 154,628
  • 8
  • 97
  • 125
2

A VBA approach that is somehwat similar to Tim's for the first pathway

  1. Use a RegExp to remove the invalid charcaters (characters after a number and before a comma)
  2. Eliminate the duplicates with
    a) Use a Dictionary
    b) Excel's inbuilt remove duplicates functionality (writes to a sheet)

    Const strDelim = ", "
    
    Sub TestMe()
    Dim strTest As String
    Dim x
    strTest = "item1, item1N, item1Z, item1fhg, item1_any_letters, item2, item3, item3N, item3H"
    x = Split(DeDupe(strTest), strDelim)
    'fix last element
    x(UBound(x)) = Left$(x(UBound(x)), Len(x(UBound(x))) - 1)
    Call Method2(x)
    End Sub
    
    Sub Method2(ByVal x)
    Dim objDic As Object
    Dim y As Variant
    Set objDic = CreateObject("Scripting.Dictionary")
    Dim lngRow As Long
    For lngRow = LBound(x) To UBound(x)
    objDic(x(lngRow)) = 1
    Next lngRow
    MsgBox Join(objDic.keys, strDelim)
    End Sub      
    
    Function DeDupe(strIn As String) As String
    Dim objRegex As Object
    Set objRegex = CreateObject("vbscript.regexp")
    With objRegex
    .Global = True
    .Pattern = "(.+?\d+)[^\d]+(,|$)"
    DeDupe = .Replace(strIn, "$1,")
    End With
    End Function
    

Option B

    'another potential option. Not applied in this code
    Sub Method1(ByVal x)
    Dim y As Variant
    Dim rng1 As Range
    With ActiveSheet
    .[a1].Resize(UBound(x) + 1, 1) = Application.Transpose(x)
    .Columns("A").RemoveDuplicates Columns:=1, Header:=xlNo
    y = Application.Transpose(Range([a1], Cells(Rows.Count, "A").End(xlUp)))
    End With
    MsgBox Join(y, strDelim)
    End Sub
brettdj
  • 54,857
  • 16
  • 114
  • 177
0

This is probably imperfect, since it's a quick hack which only removes the rightmost non-digit strings. You will need some regexp knowledge to tune it to your needs.

Anyway, follow the "installation" steps given here, save the module, and you will be able to write in your sheet a formula such as

=S(A1;"[^0-9]*$";"")

in, say, the B1 cell. If A1 cell contains "Item 1234 blah blah", then B1 will now contain "Item 1234". Drag the formula in all cells of column B, and save values to another Excel file for sorting (or you can try sorting and sub-totaling in-place).

Unfortunately, I do not believe that doing this in 100,000+ cells is practical (I even advise against subtotaling in-place).

You would be much better served by installing textools (sed, grep, uniq...) for Windows, and running your file through a filter. Assuming that each row represents one item as above, a filter such as

sed -e 's/^\([^0-9][^0-9]*[0-9][0-9]*\).*/\1/g' | sort | uniq -c | sort -rn

would get your 100,000 line file and return something like

79283 Item 1
 1234 Item 2
  993 Item 3
  ..........

(on some platforms you could have written (\D+\d+) instead of ([^0-9]..., but I'm unsure of the Windows behaviour).

An even better choice of tools would be (Strawberry)Perl, which has CSV support too, or Python language.

LSerni
  • 55,617
  • 10
  • 65
  • 107