-1

Need help merging multiple rows of data with various datatypes for multiple columns

I have a dataframe that contains 14 columns and x number of rows of data. An example slice of the dataframe is linked below:

Current Example of my dataframe

Current Example of my dataframe

I want to be able to merge all four rows of data into a single row based on the "work order" column. See linked image below. I am currently using pandas to take data from four different data sources and create a dataframe that has all the relevant data I want based on each work order number. I have tried various methods including groupby, merge, join, and others without any good results.

How I want my dataframe to look in the end

How I want my dataframe to look in the end

I essentially want to groupby the work order value, merge all the site names into a single value, then have all data essentially condense to a single row. If there is identical data in a column then I just want it to merge together. If there are values in a column that are different (such as in "Operator Ack Timestamp") then I don't mind the data being a continuous string of data (ex. one date after the next within the same cell).

example dataframe data:

df = pd.DataFrame({'Work Order': [10025,10025,10025,10025],
                   'Site': ['SC1', 'SC1', 'SC1', 'SC1'],
                   'Description_1':['','','Inverter 10A-1 - No Comms',''],
                   'Description_2':['','','Inverter 10A-1 - No Comms',''],
                   'Description_3':['Inverter 10A-1 has lost communications.','','',''],
                   'Failure Type':['','','Communications',''],
                   'Failure Class':['','','2',''],
                   'Start of Fault':['','','2021-05-30 06:37:00',''],
                   'Operator Ack Timestamp':['2021-05-30 8:49:21','','2021-05-30 6:47:57',''],
                   'Timestamp of Notification':['2021-05-30 07:18:58','','',''],
                   'Actual Start Date':['','2021-05-30 6:37:00','','2021-05-30 6:37:00'],
                   'Actual Start Time':['','06:37:00','','06:37:00'],
                   'Actual End Date':['','2021-05-30 08:24:00','',''],
                   'Actual End Time':['','08:24:00','','']})

df.head()
  • Please include samples of data as text, not images, so that people can actually use them. This [page](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) might help. – joao Jun 08 '21 at 16:22
  • @joao just added the dataframe data – jonny_two_knives Jun 08 '21 at 17:06

1 Answers1

1

4 steps to get expected output:

  1. Replace empty values by pd.NA,
  2. Group your data by Work Order column because it seems to be the index key,
  3. For each group, fill NA value by last valid observation and keep the last record,
  4. Reset index to have the same format as input.

I choose to group by "Work Order" because it seems to be the index key of your dataframe.
The index of your dataframe is "Work Order":

df = df.set_index("Work Order")
out = df.replace({'': pd.NA}) \
        .groupby("Work Order", as_index=False) \
        .apply(lambda x: x.ffill().tail(1)) \
        .reset_index(level=0, drop=True)```

>>> out.T  # transpose for better vizualisation
Work Order                                                   10025
Site                                                           SC1
Description_1                            Inverter 10A-1 - No Comms
Description_2                            Inverter 10A-1 - No Comms
Description_3              Inverter 10A-1 has lost communications.
Failure Type                                        Communications
Failure Class                                                    2
Start of Fault                                 2021-05-30 06:37:00
Operator Ack Timestamp                          2021-05-30 6:47:57
Timestamp of Notification                      2021-05-30 07:18:58
Actual Start Date                               2021-05-30 6:37:00
Actual Start Time                                         06:37:00
Actual End Date                                2021-05-30 08:24:00
Actual End Time                                           08:24:00
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I added an example dataframe above to the original question if that helps. – jonny_two_knives Jun 08 '21 at 17:12
  • To save to an excel file, you can use `out.to_excel('FileLocation.xlsx', index=False)` – Corralien Jun 08 '21 at 19:51
  • This is exactly what I was looking for. In the larger dataframe that this slice of data came from, I have the "work order" column set as the index. So I removed the last part of your code >>> .reset_index(drop=True) <<< and it displays a double index for "work order." I originally tried just changing True to False in the reset index portion of the code, but got an argument error. Any ideas on preventing a double index but maintaining "work order" as an index? – jonny_two_knives Jun 08 '21 at 20:06
  • I updated my answer to take into account your last comment. – Corralien Jun 08 '21 at 20:16
  • if I wanted the columns "Actual Start Date" and "Actual Start Time" to display the earliest date how would I go about doing that? Let's say I want "6/5/21 1:35:00" to be the output value in the final "grouped-by" row at the end assuming the two available input values were 6/5/21 1:35:00 and 6/7/21 5:23:00 in that column index? Could you pass an argument to make sure the lowest date is displayed? – jonny_two_knives Jun 08 '21 at 21:42