-2

I have excel with Survey Results. There are multiple submissions(rows) by the same email. I want to find the most filled / most completed row and delete the rest of the duplicate rows with the same email address.

I have a table like this. Multiple Submissions from a single email.

Email Question 1 Question 2 Question 3
Email1 Answer Answer Answer
Email1 Answer Answer
Email2 Answer
Email2 Answer
Email2 Answer Answer

I need to delete the least filed duplicated rows. and get Results like bellow.

Email Question 1 Question 2 Question 3
Email1 Answer Answer Answer
Email2 Answer Answer

How can I do this using openrefine or Excel?

Robert Mearns
  • 11,796
  • 3
  • 38
  • 42
Kombuwa
  • 1,613
  • 4
  • 20
  • 35
  • Afaik your examples aren't complete: Do you mean that a valid series must cover all subvariants? What if Email2 in row 5 shows "Answer1" as in OP but "Answer3" instead of "Answer2"? - Are there three *unique* "Email2" rows left or does a new row have to be filled with all the answers? – T.M. Nov 25 '22 at 09:57

6 Answers6

6

This is easily possible with Excel.

  1. Create a helper column containing the formula COUNTA(%RowRange%).

  2. Then sort this helper column from largest to smallest.

  3. Last you select the entire table, click "Remove Duplicates" and in the pop-up box unselect all fields except email.

See these step-by-step screenshots:

Helper column formula

Remove duplicates

enter image description here

Note:

In a case like this:

Email Question 1 Question 2 Question 3
Email2 Answer
Email2 Answer Answer

The answer to question 2 would be lost of course. If you need to consolidate such cases to result in one row like this:

Email Question 1 Question 2 Question 3
Email2 Answer Answer Answer

That is also easily possible. Let me know if this is what you need or if the first solution is sufficient.

GWD
  • 3,081
  • 14
  • 30
3

Use Excel Formulas

These steps, using only formulas, give you the result that you require.

  1. Add a helper column to calculate the number of answers per row.
=COUNTA($B$2:$D$2)
  1. Add a second helper column to calculate the most number of answers per Email.
=MAXIFS($E$2:$E$6,$A$2:$A$6,A2)
  1. Add a third helper column to calculate if a particular row has the most number of answers for the Email.
=IF(E2=F2,"Keep","Remove")
  1. Combine the UNIQUE and FILTER functions to extract the e-mails to keep
=UNIQUE(FILTER(A1:D6,G1:G6="Keep",""),FALSE,FALSE)

An e-mail may have two or more rows with the same number of Most no. of answers.
Step 4 keeps the first row and removes the others.

enter image description here


Use Power Query

An alternate approach using the same methodology, is to use the Power Query functionality built into Excel.

In the M code, the assumption is that the source data is held in the same workbook as the query.
The data is in an Excel Table called Table1

  1. Select a cell in the table holding the data.
  2. On the Data menu tab in the Get & Transform Data section, click on From Table/Range
  3. Power Query will launch. On the Home menu tab, click on Advanced Editor.
  4. Copy and paste the code below into the window, replacing the existing code. Click on the Done button.
  5. On the Home menu tab, click on Close & Load

A new sheet will be added with the filtered data.

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Email", type text}, {"Question 1", type text}, {"Question 2", type text}, {"Question 3", type text}}),
    #"Added Column - No. of answers" = Table.AddColumn(#"Changed Type", "No. of answers", each List.Count(List.Select({[Question 1],[Question 2],[Question 3]},each _ <> null))),
    #"Grouped Rows - By No. of answers" = Table.Group(#"Added Column - No. of answers", {"Email"}, {{"Most no. of answers", each List.Max([No. of answers]), type number}, {"All", each _, type table [Email=nullable text, Question 1=nullable text, Question 2=nullable text, Question 3=nullable text, No. of answers=number]}}),
    #"Expanded grouped data" = Table.ExpandTableColumn(#"Grouped Rows - By No. of answers", "All", {"Question 1", "Question 2", "Question 3", "No. of answers"}, {"Question 1", "Question 2", "Question 3", "No. of answers"}),
    #"Added Column - Keep or Remove" = Table.AddColumn(#"Expanded grouped data", "Keep or Remove", each if ([Most no. of answers] = [No. of answers]) then "Keep" else "Remove"),
    #"Filtered Rows" = Table.SelectRows(#"Added Column - Keep or Remove", each ([Keep or Remove] = "Keep")),
    #"Removed Added Columns" = Table.RemoveColumns(#"Filtered Rows",{"Most no. of answers", "No. of answers", "Keep or Remove"}),
    #"Removed Duplicates" = Table.Distinct(#"Removed Added Columns", {"Email"})
in
    #"Removed Duplicates"

enter image description here

Robert Mearns
  • 11,796
  • 3
  • 38
  • 42
3

Remove Least Filled Duplicates

enter image description here

  • After reviewing your post, it seems logical that you want to keep the last occurrence of an email with the most answers while this code keeps the first (see Email2 in the screenshot: it should keep the one in the 30th data row). To get this functionality, only a simple change is needed:

    If qArr(dict(Key)) > qArr(sr) Then IsRow = True
    
  • The result is ordered by when the first email was sent (2, 5, 3, 1, 4), not necessarily the email with the first maximum answers (2, 5, 1, 3, 4 or 5, 1, 3, 4, 2 after adopting the previously mentioned change). Also, the ID column, which has nothing to do with the code's functionality, 'testifies' about this. If this is necessary, let me know. The change is a little more elaborate.

Sub RemoveLeastFilledDuplicates()
    
    ' Define constants.
    Const WORKSHEET_NAME As String = "Sheet1"
    Const E_COLUMN As Long = 1
    Const Q_COLUMNS As String = "2:4"
    Const Q_DELIMITER As String = ":" ' one character only
    
    ' Reference the workbook and the worksheet.
    Dim wb As Workbook: Set wb = ThisWorkbook ' workbook containing this code
    Dim sws As Worksheet: Set sws = wb.Worksheets(WORKSHEET_NAME)
    
    Dim rg As Range, srCount As Long, cCount As Long
    
    ' Reference the range.
    With sws.Range("A1").CurrentRegion
        srCount = .Rows.Count - 1 ' use -1 to exclude headers
        cCount = .Columns.Count
        Set rg = .Resize(srCount).Offset(1)
    End With
    
    ' Write the values from the range to the Data array.
    Dim Data() As Variant: Data = rg.Value
    
    ' Extract the first and last question columns.
    Dim qPos As Long: qPos = InStr(Q_COLUMNS, Q_DELIMITER)
    Dim qFirst As Long: qFirst = Left(Q_COLUMNS, qPos - 1)
    Dim qLast As Long: qLast = Right(Q_COLUMNS, Len(Q_COLUMNS) - qPos)
    
    ' Define the Questions array.
    Dim qArr() As Long: ReDim qArr(1 To srCount)
    
    Dim sr As Long, c As Long, qCount As Long
       
    ' Write the number of non-empty question columns to the Questions array.
    For sr = 1 To srCount
        For c = qFirst To qLast
            If Len(CStr(Data(sr, c))) > 0 Then qCount = qCount + 1
        Next c
        If qCount > 0 Then qArr(sr) = qCount: qCount = 0
    Next sr
            
    ' Write the unique values from the email column to a dictionary.
    ' The 'keys' will hold the emails while the 'items' will hold the rows.
    ' The data from the Questions array is used to determine the row
    ' with the most non-empty question columns per email.
    ' The emails are ordered by their first occurrence.
            
    Dim dict As Object: Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbTextCompare
    
    Dim Key As Variant
    Dim IsRow As Boolean
    
    For sr = 1 To srCount
        Key = Data(sr, E_COLUMN)
        If Not IsError(Key) Then
            If Len(CStr(Key)) > 0 Then
                If dict.Exists(Key) Then
                    If qArr(dict(Key)) >= qArr(sr) Then IsRow = True
                End If
                If IsRow Then IsRow = False Else dict(Key) = sr
            End If
        End If
    Next sr
    
    Erase qArr
    
    ' Write the data from the dictionary to the top of the Data array.
    
    Dim dr As Long
    
    ' Write to the top.
    For Each Key In dict.Keys
        sr = dict(Key)
        dr = dr + 1
        For c = 1 To cCount
            Data(dr, c) = Data(sr, c)
        Next c
    Next Key
    
    Set dict = Nothing
    
    ' Clear the bottom.
    For sr = dr + 1 To srCount
        For c = 1 To cCount
            Data(sr, c) = Empty
        Next c
    Next sr
    
    ' Write the values from the Data array to the range.
    
    rg.Value = Data
    
    ' Inform.
    
    MsgBox "Least filled duplicates removed.", vbInformation
    
End Sub
VBasic2008
  • 44,888
  • 5
  • 17
  • 28
2

I am going to assume that what you mean by:

delete least filled duplicate rows?

is to preserve the rows with the maximum number of responses rather than remove the rows with the least number of questions responded for a given email. If you want to remove the row per email group with the least number of responses I provided a modified version of Formula 1 at the end that satisfies this requirement.

Note: Based on your input data both approaches will produce the same result, but in a more general case the result will differ.

In cell F2 you can put the following formula (Formula 1);

=LET(emails, A2:A6, ans, B2:D6, cols, COLUMN(ans), ones, TOCOL(cols/cols),
 nonEmpty, IF(ans <>"", 1, 0), counts, MMULT(nonEmpty, ones),
 rank, MAP(emails, counts, LAMBDA(e,c, SUMPRODUCT((emails=e) * (counts > c))+1)),
 out, FILTER(HSTACK(emails, ans), rank = 1),
 IF(out=0, "", out)
)

and here is the output:

sample excel file

Note: This solution doesn't require helper columns to obtain the result.

Explanation

We use LET function for easy reading and composition. We are going to count non-blanks via MMULT. First, we identify non-empty answers as follows: IF(ans <>"", 1, 0) and name it nonEmpty. We define a unit vector (ones) to do the multiplication and sum by row. The name counts, has the output of MMULT(nonEmpty, ones). This is the corresponding output for the sample data:

3
2
1
1
2

Now we are going find the rank for rows within the same email based on counts. The rank 1 will be for the email with the maximum number of answers, then rank 2, and so on. For more information check the answer to the question: Set row maximum for FILTER formula provided by @DavidLeal. This is the condition to calculate the rank by descending order of counts, and we name it rank.

MAP(emails, counts, LAMBDA(e,c, SUMPRODUCT((emails=e) * (counts > c))+1))

Because we want to filter the result for emails with the maximum number of responses, then we filter by rank equal to 1. If we want to select also the row with the second most answered questions, then the condition will be instead rank <= 2. So it allows you to customize the result you want to filter.

FILTER(HSTACK(emails, ans), rank = 1)

The final step is just to remove the 0's generated by FILTER with an empty string.

Note: Once we have counts, there are multiple ways of achieving the same result. For example via DROP/REDUCE/VSTACK pattern. Check the answer from this question: how to transform a table in Excel from vertical to horizontal but with different length from @DavidLeal:

=LET(emails, A2:A6, ans, B2:D6,
 cols, COLUMN(ans), ones, TOCOL(cols/cols),
 nonEmpty, IF(ans <>"", 1, 0), counts, MMULT(nonEmpty, ones),
 uxEmails, UNIQUE(emails), 
 out, DROP(REDUCE("", uxEmails, LAMBDA(acc, email, LET(
  maxCnt, MAX(FILTER(counts, emails=email)), result, 
    FILTER(HSTACK(emails, ans), (emails=email) * (counts = maxCnt)),
    VSTACK(acc, result)))),1),
 IF(out=0, "", out)
)

If one or more rows per unique email have the maximum number of responses, all corresponding rows will be returned.

In my opinion, the rank approach gives more flexibility in case you want to show rows with a wider range of rank values and it is a less verbose formula.

Exclude least filled duplicate rows

If you want instead to exclude the rows with the least number of responses for a given group, the Formula 1 can be modified as follow:

=LET(emails, A2:A6, ans, B2:D6, cols, COLUMN(ans), ones, TOCOL(cols/cols),
 nonEmpty, IF(ans <>"", 1, 0), counts, MMULT(nonEmpty, ones),
 rank, MAP(emails, counts, LAMBDA(e,c, SUMPRODUCT((emails=e) * (counts < c))+1)),
 out, FILTER(HSTACK(emails, ans), rank > 1),
 IF(out=0, "", out)
)

We made the following changes:

  • counts > c -> counts < c
  • rank = 1 -> rank > 1

The first change reverses the rank, now the highest rank (rank =1) will be the the email within a group with the least number of responses. So if we want to exclude it, we just filter for rank greater than 1.

David Leal
  • 6,373
  • 4
  • 29
  • 56
  • 1
    Beautiful. I usually prefer formulaic solutions over programmatic solutions when I can as its usually easier for future users to manage when source data changes. – Cameron Critchlow Nov 25 '22 at 19:22
1

This is a straightforward solution how to solve this with OpenRefine.

  1. Count the number of answers in a new column named Sorting Helper via the GREL expression.
forEach(
    row.columnNames,
    colName,
    if(isNonBlank(row.cells[colName].value), 1, 0))
  .sum()

Note that there will be an offset in the count, as we additionaly count the column Email. This has no impact on the sorting.

  1. Sort the column Email.
  2. Additional descending sort on the Sorting Helper column.
  3. Use the blank down function via the Email => Edit cells => Blank down.
  4. Filter the rows with now empty Email cells using a Facet via Email => Facet => Customized facets => Facet by blank (null or empty string).
  5. Delete the filtered rows via All => Edit rows => Remove matching rows.
b2m
  • 529
  • 3
  • 11
  • This logic not detect which duplicated rows are least filed. I only need most completed (filled) row – Kombuwa Nov 21 '22 at 10:11
  • You are right... I read "last" filled... not "least" filled which makes a complete difference. I updated the answer accordingly. – b2m Nov 21 '22 at 12:17
1

You have quite some answers already (and strangely haven't accepted any of the working).

Here's another way to spill the desired results in Office 365:

=LET(data,   A2:D6,
     count,  MMULT(--(data<>""),TRANSPOSE(COLUMN(data)^0)),
     e,      TAKE(data,,1),
     ue,     UNIQUE(e),
     mec,    DROP(REDUCE(0,e,LAMBDA(a,b,HSTACK(a,(e=b)*count))),,1),
     emec,   BYCOL(mec,LAMBDA(x,MAX(x))),
     uec,      UNIQUE(e&"|"&TRANSPOSE(emec)),
     ec,      e&"|"&count,
SUBSTITUTE(
           INDEX(data,
                 XMATCH(uec,ec),
                 SEQUENCE(1,COLUMNS(data)))
           ,"",""))

This counts the number of non-blanks per row then checks the max number of non-blanks per unique email name. Than the data is indexed to match the rows where the count of answers equals the max count for the email name.

enter image description here

P.b
  • 8,293
  • 2
  • 10
  • 25