
I'm working on a dataset that can contain a lot of duplicates. I want to deduplicate it based on 2 columns and take the first non-null value of each column within every 'group' of duplicates. The only problem is that there are two 'ID' columns that I don't want to lose any information from.

EDIT: to explain it a bit better. We assume that all rows that have the same combination of name/city are the same person. I want to have as few rows as possible in the dataset without losing any information from my_first_id and my_second_id.

Here is an example of my data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["John", "John", "Louis", "Maria", "Maria", "Maria", "Maria"],
                   "city": ["New-York", "New-York", "Paris", "Madrid", "Madrid", "Madrid", "Madrid"],
                   "to_fill": [np.nan, "test", "abc", "123", "345", np.nan, np.nan],
                   "my_first_id": ["id1", np.nan, "id2", np.nan, "id4", "id5", "id6"],
                   "my_second_id": [np.nan, "id_a", "id_b", "id_c", np.nan, np.nan, np.nan]})

print(df)
name   city      to_fill  my_first_id  my_second_id
John   New-York  NaN      id1          NaN
John   New-York  test     NaN          id_a
Louis  Paris     abc      id2          id_b
Maria  Madrid    123      NaN          id_c
Maria  Madrid    345      id4          NaN
Maria  Madrid    NaN      id5          NaN
Maria  Madrid    NaN      id6          NaN

And now the expected output:

name   city      to_fill  my_first_id  my_second_id
John   New-York  test     id1          id_a
Louis  Paris     abc      id2          id_b
Maria  Madrid    123      id4          id_c
Maria  Madrid    NaN      id5          NaN
Maria  Madrid    NaN      id6          NaN

There is this question, pandas group by and find first non null value for all columns, that helped me get the first non-null value of each column in a group of duplicates, but the problem is that I'm losing the last 2 rows of my expected output.

df.groupby(["name", "city"]).first().reset_index()

Result:

name   city      to_fill  my_first_id  my_second_id
John   New-York  test     id1          id_a
Louis  Paris     abc      id2          id_b
Maria  Madrid    123      id4          id_c

Any idea? Thanks for your help!

  • This doesn't follow logically. How do you know if that's the same John from New York? If we can safely assume that Maria from Madrid is actually in the table twice, but there are at least two others that are distinct, which one? – Kenny Ostrom Feb 16 '22 at 15:54
  • Let's just assume that everything that has the same combination of name and city is actually the same person :) What I really want is to not 'lose' any information contained in `my_first_id` and `my_second_id` by deduplicating with a groupby. If I do for example `groupby.first().reset_index()`, I will lose the `id_5` and `id_6` in `my_first_id` column from Maria, and that's what I want to avoid. Is that clear? – victorfink Feb 16 '22 at 15:57
  • So to actually answer your question, all the Maria in the dataset are the same one! I want to have as few rows as possible without losing information from `my_first_id` and `my_second_id`. That's why we end up with 3 rows for Maria. – victorfink Feb 16 '22 at 16:05
  • Do you need to keep NaN for to_fill and my_second_id for id5 and 6 row? – Emma Feb 16 '22 at 16:34
  • I would prefer to keep NaN for to_fill for id5 and id6 but if not that's also ok for me. The most important thing for me is to not lose information from the 2 ids columns :) – victorfink Feb 16 '22 at 16:40
  • `df.groupby(['name', 'city']).agg({'to_fill': 'first', 'my_first_id': lambda x: x.dropna().tolist(), 'my_second_id': 'first'}).explode('my_first_id')` This will keep all ids but it does not keep NaN. – Emma Feb 16 '22 at 17:05
  • It's getting close to what I want but if there is an `id_d` in `my_second_id` on the same row as `id5`, `id_d` is getting replaced by `id_c` – victorfink Feb 17 '22 at 08:13

1 Answer


I came up with a solution. I'm using groupby and then checking, for each group, whether I actually want to deduplicate it or not. It seems highly inefficient, but it does the trick. If someone comes up with a better solution, I will gladly accept it.

First, we add columns counting the number of unique IDs per group, boolean columns indicating, for each row, whether each ID is present or not, and lastly a count of the number of IDs per row (useful for sorting the dataframe later).

df["first_id_count"] = df.groupby(["name", "city"])["my_first_id"].transform('nunique')
df["second_id_count"] = df.groupby(["name", "city"])["my_second_id"].transform('nunique')

def check_if_id_is_present(x):
    # True if the value is an actual ID, False if it is NaN (or anything unexpected)
    try:
        return not pd.isnull(x)
    except Exception:
        return False
    
df["my_first_id_present"] = df["my_first_id"].apply(check_if_id_is_present)
df["my_second_id_present"] = df["my_second_id"].apply(check_if_id_is_present)

def create_count_ids_per_row(x):
    # x is a row restricted to the two ID columns; count how many of them are filled in
    try:
        count = 0
        if not pd.isnull(x.iloc[0]):
            count += 1
        if not pd.isnull(x.iloc[1]):
            count += 1
        return count
    except Exception:
        return 0

df["ids_count"] = df[["my_first_id", "my_second_id"]].apply(create_count_ids_per_row, axis=1)

Then, we can start the groupby and iterate over each group.

df_final = pd.DataFrame()
ids_to_deduplicate = ["first_id_count", "second_id_count"]
ids_present = ["my_first_id_present", "my_second_id_present"]
grouped = df.groupby(["name", "city"])

for name, group in grouped:
    if group["first_id_count"].iloc[0] < 2 and group["second_id_count"].iloc[0] < 2:
        # if there are strictly fewer than 2 unique values in both my_first_id and my_second_id
        # then we can safely deduplicate and add to the final dataframe
        df_final = pd.concat([df_final, group.groupby(["name", "city"]).first().reset_index()])
    else:
        # if not, we have to separate the dataframe into 2
        # one we want to deduplicate
        # one we shouldn't touch
        df_duplicate = group.copy()
        # first, we sort by the number of ids per row
        df_duplicate = df_duplicate.sort_values(by=["ids_count"], ascending=False)
        # and reset the index, for ease of use
        df_duplicate = df_duplicate.reset_index(drop=True)

        # rows we want to deduplicate
        rows_to_deduplicate = []
        # rows we want to keep
        rows_to_keep = []
        # create a list with flags for each id column
        flags_list = [False]*len(ids_to_deduplicate)
        for idx in df_duplicate.index:
            flag = False
            # first, we check if one of our flags is set to True at the same time as our row in df_duplicate
            # in this case, it means that we already have a row with an id, and we now have another row with another id
            # so we want to keep this row to not lose any information
            for idx_id, my_id in enumerate(ids_to_deduplicate):
                if flags_list[idx_id] and df_duplicate.loc[idx, ids_present[idx_id]]:
                    # we add it to the rows to keep
                    rows_to_keep.append(idx)
                    flag = True
                    break
            if flag:
                continue
            # now, we know that we want to deduplicate this row, otherwise we would have flagged it
            for idx_id, my_id in enumerate(ids_to_deduplicate):
                if not(flags_list[idx_id]) and df_duplicate.loc[idx, ids_present[idx_id]]:
                    # we add it to the rows to deduplicate
                    rows_to_deduplicate.append(idx)
                    # we have to set the corresponding booleans in flags_list
                    # there can be several ids on a row, so we loop over all id columns
                    for idx_id_temp, id_to_check in enumerate(ids_to_deduplicate): 
                        if df_duplicate.loc[idx, ids_present[idx_id_temp]]:
                            flags_list[idx_id_temp] = True
                    break

        # now we have our 2 separate dataframes
        df_duplicate_keep = df_duplicate.loc[rows_to_keep].copy()
        df_duplicate_deduplicate = df_duplicate.loc[rows_to_deduplicate].copy()

        # we can keep one, and deduplicate the other and concatenate the result
        df_final_duplicate = pd.concat([df_duplicate_keep, df_duplicate_deduplicate.groupby(["name", "city"]).first().reset_index()])
        
        # and add the result to our final dataframe
        df_final = pd.concat([df_final, df_final_duplicate])

And to clean our mess:

df_final = df_final.drop("ids_count", axis=1)
for col in ids_to_deduplicate:
    df_final = df_final.drop(col, axis=1)
for col in ids_present:
    df_final = df_final.drop(col, axis=1)

And we have the desired output.
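
As a quick sanity check (just a sketch, assuming the original dataframe is still available as df), we can verify that no ID was dropped along the way:

# every non-null ID from the original dataframe must survive the deduplication
assert set(df["my_first_id"].dropna()) == set(df_final["my_first_id"].dropna())
assert set(df["my_second_id"].dropna()) == set(df_final["my_second_id"].dropna())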

Again, this seems really ugly, so if anyone has a better solution, feel free to share.

  • Instead of `pd.concat`, use the answer here https://stackoverflow.com/questions/41888080/python-efficient-way-to-add-rows-to-dataframe with `StringIO` and `csv_writer`. This will speed up the process considerably if you have a lot of data. – victorfink Feb 17 '22 at 16:19
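
A minimal sketch of that pattern (the names here are illustrative, not taken from the linked answer): write the rows into an in-memory CSV buffer inside the loop, and build the final dataframe only once at the end instead of calling pd.concat on every iteration.

import csv
from io import StringIO

buffer = StringIO()
writer = csv.writer(buffer)
# write the header once, then one writer.writerow(...) per kept row inside the groupby loop
writer.writerow(["name", "city", "to_fill", "my_first_id", "my_second_id"])

# ... writer.writerow([...]) calls go here, inside the loop ...

buffer.seek(0)
df_final = pd.read_csv(buffer)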