0

I have the following dataframe:

index   name     path
0       Dina     "gs://my_bucket/folder1/img1.png"
1       Dina     "gs://my_bucket/folder1/img2.png"
2       Lane     "gs://my_bucket/folder1/img3.png"
3       Bari     "gs://my_bucket/folder1/img4.png"
4       Andrew   "gs://my_bucket/folder1/img5.png"
5       Andrew   "gs://my_bucket/folder1/img6.png"
6       Andrew   "gs://my_bucket/folder1/img7.png"
7       Beti     "gs://my_bucket/folder1/img7.png"
8       Ladin    "gs://my_bucket/folder1/img5.png"
...

I would like to get new dataframe which will have the unique names appears only once, and the path column will be list with the matching paths. The output should look like this:

index   name     path
0       Dina     ["gs://my_bucket/folder1/img1.png","gs://my_bucket/folder1/img2.png"]
1       Lane     ["gs://my_bucket/folder1/img3.png"]
2       Bari     ["gs://my_bucket/folder1/img4.png"]
3       Andrew   ["gs://my_bucket/folder1/img5.png","gs://my_bucket/folder1/img6.png","gs://my_bucket/folder1/img7.png"]
4       Beti     ["gs://my_bucket/folder1/img7.png"]
5       Ladin    ["gs://my_bucket/folder1/img5.png"]
...

The result should have number of rows equal to unique names in the dataframe. At the moment I'm using something I did with chatgpt, but it used function that I don't understand why is it used and also it duplicates the names of the rows, so if I know I suppose to have 842 unique names, I get 992 ...

This is chatGPT solution:

# Define a custom aggregation function to combine links as a list
def combine_links(links):
    return list(set(links))  # Convert links to a list and remove duplicates

# Group the GeoDataFrame by 'name' and 'dili' and aggregate the 'link' column
result = df.groupby(['name'))['path'].agg(combine_links).reset_index()

My goal is to find a solution the gives me in the end the right number of rows, which is number of unique names.

Reut
  • 1,555
  • 4
  • 23
  • 55
  • It seems like the chatGPT solution should work, can you provide a sample data for where the names are repeating. Provide the input as well as the output for that data – dracarys May 21 '23 at 09:57

2 Answers2

1

I think the answer is hidden here: What is the difference between pandas agg and apply function?

My code works:

import pandas as pd


def combine_links(links):
    return list(set(links))

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon','Parrot', 'Parrot', 'Falcon'],
                       'Max Speed': [380., 370., 24., 26., 370.]})

df_new=df.groupby(['Animal'])['Max Speed'].apply(combine_links).reset_index()

print(df_new)
1

A possible solution:

df.groupby('name')['path'].agg(list).reset_index()

Output:

     name                                               path
0  Andrew  [gs://my_bucket/folder1/img5.png, gs://my_buck...
1    Bari                  [gs://my_bucket/folder1/img4.png]
2    Beti                  [gs://my_bucket/folder1/img7.png]
3    Dina  [gs://my_bucket/folder1/img1.png, gs://my_buck...
4   Ladin                  [gs://my_bucket/folder1/img5.png]
5    Lane                  [gs://my_bucket/folder1/img3.png]
PaulS
  • 21,159
  • 2
  • 9
  • 26