1

I have a data set with several hundred rows. Most rows have complete information, but in some cases two rows share the same key while some attributes are repeated, others are not. Here is an example:

Key   Campaign   Message  Stat1  Stat2 Stat3 Stat4  
123   Fun        yay      1      2  
123   temp       yay                    3     4  

Intended result
123   Fun        yay      1      2      3     4

Issues:

  1. Needs to search the entire dataframe of hundreds of records, most of which are not duplicates. Ignore the non-duplicates
  2. Has to specify that when combining rows to accept the Campaign data that is NOT "temp"
  3. All other columns where data matches is ok
  4. Columns where one value is null will result in the non-null value being used in the new record
  5. I am open to solutions in R, SQL or excel (vba)

Appreciate any help!

Drew
  • 24,851
  • 10
  • 43
  • 78
BigData
  • 38
  • 7
  • 1
    I would go into Excel and select blanks and then press ctrl+up, see https://www.extendoffice.com/documents/excel/771-excel-fill-blank-cells-with-value-above.html. Other than that I would use if then statements in VBA. Still, maybe an aggregate in R, if the key is unique and specific. A distinct function in SQL. – Lowpar Jul 14 '16 at 21:16
  • If you want to make this a `mysql` question, then supply a schema and sample data in sqlfiddle. – Drew Jul 15 '16 at 01:05
  • It currently resides in a .csv file, so it is most easily dealt with in R or Excel. I'm even trying a workaround in java, but I'm hoping for an R or Excel solution. I think one piece of this might be to deal with the Campaign attribute first as a separate function, and then tackle the merge. – BigData Jul 15 '16 at 01:17
  • But if you want to model it after a mysql load data infile question like [This Answer](http://stackoverflow.com/a/38385192) I gave him, and follow it up with all the relevant tables for a `join`, fine with me. But it took me 3 hours at least asking him for info (common), and finally said I will make up my own schema and showed him. So, that is an option for `mysql` if you want to go that route. If you document it well, a lot of mysql people here will answer it for you. – Drew Jul 15 '16 at 02:09
  • Thank you @Lowpar - the excel workaround is useful... for now. As this data set grows I will look for something more automated, but that excel shortcut did the trick. – BigData Jul 15 '16 at 13:53

1 Answers1

1

Turned out to be a bit more involved than I thought, but here it is. I am using a collection to merge duplicate keys. Change IGNORE_TEMP constant to include or exclude temp records.

enter image description here

Sub mergeNonNulls()

    ' change this constant to ignore or include temp results
    Const IGNORE_TEMP As Boolean = True

    ' temporary store of merged rows
    Dim cMerged As New Collection

    ' data part of the table
    Dim data As Range
    Set data = ActiveSheet.[a2:g3]

    Dim rw As Range  ' current row
    Dim r As Range   ' temporary row
    Dim c As Range   ' temporary cell

    Dim key As String
    Dim arr() As Variant
    Dim v As Variant
    Dim vv As Variant

    Dim i As Long
    Dim isChanged As Boolean

    For Each rw In data.Rows
        key = rw.Cells(1)  ' the first column is key

        If IGNORE_TEMP And rw.Cells(2) = "temp" Then
            DoEvents ' pass temp if enabled
        Else
            If Not contains(cMerged, key) Then
                ' if this is new key, just add it
                arr = rw
                cMerged.Add arr, key
            Else
                ' if key exists - extract, merge nulls and replace
                arr = cMerged(key)

                ' iterate through cells in current and stored rows,
                ' identify blanks and merge data if current is empty
                i = 1
                isChanged = False
                For Each c In rw.Cells
                    If Len(Trim(arr(1, i))) = 0 And Len(Trim(c)) > 0 Then
                        arr(1, i) = c
                        isChanged = True
                    End If
                    i = i + 1
                Next

                ' collections in vba are immutable, so if temp row
                ' was changed, replace it in collection
                If isChanged Then
                    cMerged.Remove key
                    cMerged.Add arr, key
                End If
            End If
        End If
    Next

    ' output the result
    Dim rn As Long: rn = 1 ' output row
    Dim numRows As Long
    Dim numCols As Long

    With ActiveSheet.[a6]  ' output start range
        For Each v In cMerged
            numRows = UBound(v, 1) - LBound(v, 1) + 1
            numCols = UBound(v, 2) - LBound(v, 2) + 1
            .Cells(rn, 1).Resize(numRows, numCols).Value = v
            rn = rn + 1
        Next
    End With

End Sub

' function that checks if the key exists in a collection
Function contains(col As Collection, key As String) As Boolean
    On Error Resume Next
    col.Item key
    contains = (Err.Number = 0)
    On Error GoTo 0
End Function
Logan Reed
  • 882
  • 7
  • 13
  • Thank you @LoganReed I had similar issues, it seemed easy and turned into a bigger project than I originally imagined. I REALLY appreciate the vba code. – BigData Jul 15 '16 at 20:39
  • @BigData Enjoy! It should handle fairly large tables, but if output will be slow -- just enclose the display section into `Application.ScreenUpdating = False`. – Logan Reed Jul 15 '16 at 20:47