1

I have a fairly large excel file (5000 rows) with some company information. Many companies have been added multiple times, but all of them are missing some columns. My goal is to merge all duplicate companies without losing any of the information from other columns.

Example:

enter image description here

My search results have only showed how to combine two different data frames and how to merge cells if there is no risk of overlap.

(https://www.datacamp.com/community/tutorials/joining-dataframes-pandas)

Probably the closest answer is in this one.

It does, however, merge the different values into single cell instead of splitting them into multiple different cells.

Nazim Kerimbekov
  • 4,712
  • 8
  • 34
  • 58
Anssi
  • 25
  • 4

1 Answers1

1

You could use a groupby() with the agg option to specify what you want to do with each column. Here is what it could look like :

df = pd.DataFrame({'Company' : ['Apple', 'Google', 'Apple', 'Microsoft'],
                   'Score' : [7, np.nan, 6, 8], 
                   'Employees' : [8888, 55000, np.nan, np.nan],
                   'Description': ['Tasty', np.nan, "Likes trees", "Doesn't like apples"]})

df.groupby('Company').agg({'Score':  max, 'Employees': max, 'Description':list})

Output :

           Score  Employees            Description
Company                                           
Apple        7.0     8888.0   [Tasty, Likes trees]
Google       NaN    55000.0                  [nan]
Microsoft    8.0        NaN  [Doesn't like apples]
vlemaistre
  • 3,301
  • 13
  • 30
  • 1
    Thanks for taking the time to reply (and even copying the the terms from picture :p). That output isn't, however, what I'd like to achieve. I'd like to have additional column for the description from the 2nd Apple row ("Likes trees".) so that no information is lost. – Anssi Jun 27 '19 at 12:58
  • You can change the `'Description' : 'first'` to `Description' : list` to keep all of the descriptions inside of a list. I edited my answer if you want to see the result – vlemaistre Jun 27 '19 at 13:29