How to get distinct rows in a pandas df and merge the duplicate items into a column?

Question

I am in a bit of a weird situation. I have already solved my programming problem before but I am looking back on it and trying to implement it using pandas. I thought this would be a good place to practice using pandas.

I am querying a database, doing some calculations, and then displaying the results onto a GUI with a PyQt QTableWidget.

An example table after the calculations could look like this:

test_list = [["a", "b", "c", "d"],
            ["1", "3", "5", "7"],
            ["1", "4", "5", "7"],
            ["2", "3", "6", "8"],
            ["2", "4", "6", "9"]]

What I want to do before I display it is: get the distinct rows based on columns "a", "c", and "d", and merge the dropped elements from column "b" back into the column. The result I want looks like this:

['a', 'b', 'c', 'd']
['1', '3, 4', '5', '7']
['2', '3', '6', '8']
['2', '4', '6', '9']

Notice how in column "b", "3, 4" are both represented in their row.

Here is how I did it initially with lists and dictionaries:

def mergeDistinct(my_list):
    new_list_dict = {}

    for elem in my_list[1:]:
        key_str = (elem[0], elem[2], elem[3])
        if key_str in new_list_dict.keys():
            new_list_dict[key_str][1] += ", " + elem[1]
        else:
            new_list_dict[key_str] = elem[::]
            new_list_dict[key_str][1] = elem[1]

    ret_list = new_list_dict.values()

    return [my_list[0]] + ret_list

I loop over all of the rows and use a dictionary to keep track of what distinct combination of values I have seen so far. I think it feels a bit clunky and I am trying my hand at the pandas library. I feel like it should definitely be possible but maybe I don't know the right term to google to understand how to do it.

This is what I have so far:

df = pd.DataFrame(data=test_list[1:], columns=test_list[0])

def mergeDistinctPandas(my_df):
        #I feel like this is close but I don't know how to continue
        df = my_df.set_index(['a', 'b', 'c', 'd']).groupby(level=['a', 'c', 'd'])
        # for elem in df:
            # print(elem)
        # new_df = pd.DataFrame()
        # for elem in df:
            # merged = pd.concat([elem[1] for i, row in elem[1].iterrows()])  #.to_frame()
            # merged.index = ['duplicate_{}'.format(i) for i in range(len(merged))]
            # new_df = pd.concat([new_df, merged], axis=1)

        return False

If I print out what I have so far I see the rows are separated and I should be able to merge them back, leaving "b" separated, but I can't see how to do it.

If pandas isn't suited to this problem, that's fine too, I'm just trying to get to grips with it. Thanks for the help.

Here are some related questions I have found: How to "select distinct" across multiple data frame columns in pandas? and How do I merge duplicate rows into one on a DataFrame when they have different values

score 5 · Accepted Answer · answered Oct 07 '17 at 01:28

5

df.groupby([‘a’, ‘c’, ‘d’]).b.apply(‘, ‘.join) \
  .reset_index()[df.columns]

answered Oct 07 '17 at 01:28

piRSquared

285,575
57
475
624

LOL cute comma :-) – BENY Oct 07 '17 at 01:35

How to get distinct rows in a pandas df and merge the duplicate items into a column?

1 Answers1