0

I'm working on this raw data frame that needs some cleaning. So far, I have transformed this xlsx file

enter image description here

into this pandas dataframe:

print(df.head(16))
                   date technician alkalinity colour     uv    ph turbidity  \
0   2020-02-01 00:00:00  Catherine       24.5     33   0.15  7.24      1.53   
1            Unnamed: 2        NaN        NaN    NaN    NaN   NaN      2.31   
2            Unnamed: 3        NaN        NaN    NaN    NaN   NaN      2.08   
3            Unnamed: 4        NaN        NaN    NaN    NaN   NaN       2.2   
4            Unnamed: 5     Michel         24     35  0.152  7.22      1.59   
5            Unnamed: 6        NaN        NaN    NaN    NaN   NaN      1.66   
6            Unnamed: 7        NaN        NaN    NaN    NaN   NaN      1.71   
7            Unnamed: 8        NaN        NaN    NaN    NaN   NaN      1.53   
8   2020-02-02 00:00:00  Catherine         24    NaN  0.145  7.21      1.44   
9           Unnamed: 10        NaN        NaN    NaN    NaN   NaN      1.97   
10          Unnamed: 11        NaN        NaN    NaN    NaN   NaN      1.91   
11          Unnamed: 12        NaN        NaN   33.0    NaN   NaN      2.07   
12          Unnamed: 13     Michel         24     34   0.15  7.24      1.76   
13          Unnamed: 14        NaN        NaN    NaN    NaN   NaN      1.84   
14          Unnamed: 15        NaN        NaN    NaN    NaN   NaN      1.72   
15          Unnamed: 16        NaN        NaN    NaN    NaN   NaN      1.85   

   temperature  
0            3  
1          NaN  
2          NaN  
3          NaN  
4            3  
5          NaN  
6          NaN  
7          NaN  
8            3  
9          NaN  
10         NaN  
11         NaN  
12           3  
13         NaN  
14         NaN  
15         NaN

From here, I want to combine the rows so that I only have one row for each date. The values for each row will be the mean in the respective columns. ie.

print(new_df.head(2))
          date      time  alkalinity  colour     uv    ph  turbidity  temperature
0   2020-02-01  00:00:00       24.25      34  0.151  7.23       1.83            3
1   2020-02-02  00:00:00          24    33.5  0.148  7.23       1.82            3

How can I accomplish this when I have Unnamed values in my date column? Thanks!

CG7
  • 101
  • 3
  • 13
  • I would suggest pandas [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) functionality, however, if those unnamed values are of no use then can they be filtered out? – Agnij Oct 05 '21 at 18:57
  • I need to use those values so I can calculate the mean. If you take a look at the xlsx file above, the date spans over 8 columns. – CG7 Oct 05 '21 at 19:07
  • Okay so then you will have to know which unnamed belongs to which date and after data transformation do the **groupby** – Agnij Oct 05 '21 at 19:09
  • Yes, which is what I've been asking. How do I replace the `Unnamed` values in the date column with the true date? – CG7 Oct 05 '21 at 19:11
  • Okay, I have posted a solution that can be referred to or the other answer that is given. – Agnij Oct 05 '21 at 19:20

3 Answers3

1

Try setting the values to NaN and then use ffill:

df.loc[df.date.str.contains('Unnamed', na=False), 'date'] = np.nan

df.date = df.date.ffill()
Anders Källmar
  • 366
  • 1
  • 4
0

If I understand, correctly you want to drop rows that contain 'Unnamed' in the date column, right?

Please look here: https://stackoverflow.com/a/27360130/12790501

The solution would be something like this:

df = df.drop(df['Unnamed' in df.date].index)

Edit:

No, I would like to replace those Unnamed values with the date so I could then use the groupby('date') function to return the mean values for the columns

so in the case you should just iterate over the whole table

last_date = ''
for i in df.index:
    if 'Unnamed' not in df.at[i, 'date']:
        last_date = df.at[i, 'date']
    else:
        df.at[i, 'date'] = last_date
Petr Synek
  • 323
  • 2
  • 7
0

If the 'date' column is of type object i.e. string then just write a logic to loop over the number as seen in image provided it follows a certain pattern-

for _ in range(2,9):
    df.loc[(df['date'] == 'Unnamed: '+str(_), 'date'] = your_value
Agnij
  • 561
  • 3
  • 13