Replace Unnamed values in date column with true values

Question

I'm working on this raw data frame that needs some cleaning. So far, I have transformed this xlsx file

into this pandas dataframe:

print(df.head(16))

                   date technician alkalinity colour     uv    ph turbidity  \
0   2020-02-01 00:00:00  Catherine       24.5     33   0.15  7.24      1.53   
1            Unnamed: 2        NaN        NaN    NaN    NaN   NaN      2.31   
2            Unnamed: 3        NaN        NaN    NaN    NaN   NaN      2.08   
3            Unnamed: 4        NaN        NaN    NaN    NaN   NaN       2.2   
4            Unnamed: 5     Michel         24     35  0.152  7.22      1.59   
5            Unnamed: 6        NaN        NaN    NaN    NaN   NaN      1.66   
6            Unnamed: 7        NaN        NaN    NaN    NaN   NaN      1.71   
7            Unnamed: 8        NaN        NaN    NaN    NaN   NaN      1.53   
8   2020-02-02 00:00:00  Catherine         24    NaN  0.145  7.21      1.44   
9           Unnamed: 10        NaN        NaN    NaN    NaN   NaN      1.97   
10          Unnamed: 11        NaN        NaN    NaN    NaN   NaN      1.91   
11          Unnamed: 12        NaN        NaN   33.0    NaN   NaN      2.07   
12          Unnamed: 13     Michel         24     34   0.15  7.24      1.76   
13          Unnamed: 14        NaN        NaN    NaN    NaN   NaN      1.84   
14          Unnamed: 15        NaN        NaN    NaN    NaN   NaN      1.72   
15          Unnamed: 16        NaN        NaN    NaN    NaN   NaN      1.85   

   temperature  
0            3  
1          NaN  
2          NaN  
3          NaN  
4            3  
5          NaN  
6          NaN  
7          NaN  
8            3  
9          NaN  
10         NaN  
11         NaN  
12           3  
13         NaN  
14         NaN  
15         NaN

From here, I want to combine the rows so that I only have one row for each date. The values for each row will be the mean in the respective columns. ie.

print(new_df.head(2))

          date      time  alkalinity  colour     uv    ph  turbidity  temperature
0   2020-02-01  00:00:00       24.25      34  0.151  7.23       1.83            3
1   2020-02-02  00:00:00          24    33.5  0.148  7.23       1.82            3

How can I accomplish this when I have Unnamed values in my date column? Thanks!

I would suggest pandas [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) functionality, however, if those unnamed values are of no use then can they be filtered out? — Agnij, Oct 05 '21 at 18:57
I need to use those values so I can calculate the mean. If you take a look at the xlsx file above, the date spans over 8 columns. — CG7, Oct 05 '21 at 19:07
Okay so then you will have to know which unnamed belongs to which date and after data transformation do the **groupby** — Agnij, Oct 05 '21 at 19:09
Yes, which is what I've been asking. How do I replace the `Unnamed` values in the date column with the true date? — CG7, Oct 05 '21 at 19:11
Okay, I have posted a solution that can be referred to or the other answer that is given. — Agnij, Oct 05 '21 at 19:20

Anders Källmar · Accepted Answer · 2021-10-05T19:42:29.810

1

Try setting the values to NaN and then use ffill:

df.loc[df.date.str.contains('Unnamed', na=False), 'date'] = np.nan

df.date = df.date.ffill()

edited Oct 05 '21 at 19:42

answered Oct 05 '21 at 19:34

Anders Källmar

366
1
4

`ValueError: Cannot mask with non-boolean array containing NA / NaN values` – CG7 Oct 05 '21 at 19:37
Try setting `na=False`. Updated answer. – Anders Källmar Oct 05 '21 at 19:42

Petr Synek · Answer 2 · 2021-10-05T19:17:07.003

0

If I understand, correctly you want to drop rows that contain 'Unnamed' in the date column, right?

Please look here: https://stackoverflow.com/a/27360130/12790501

The solution would be something like this:

df = df.drop(df['Unnamed' in df.date].index)

Edit:

No, I would like to replace those Unnamed values with the date so I could then use the groupby('date') function to return the mean values for the columns

so in the case you should just iterate over the whole table

last_date = ''
for i in df.index:
    if 'Unnamed' not in df.at[i, 'date']:
        last_date = df.at[i, 'date']
    else:
        df.at[i, 'date'] = last_date

edited Oct 05 '21 at 19:17

answered Oct 05 '21 at 18:57

Petr Synek

323
2
7

No, I would like to replace those `Unnamed` values with the date so I could then use the `groupby('date')` function to return the mean values for the columns – CG7 Oct 05 '21 at 19:03
Ok, I have updated the answer – Petr Synek Oct 05 '21 at 19:20
`TypeError: argument of type 'datetime.datetime' is not iterable` – CG7 Oct 05 '21 at 19:40

score 0 · Answer 3 · answered Oct 05 '21 at 19:18

0

If the 'date' column is of type object i.e. string then just write a logic to loop over the number as seen in image provided it follows a certain pattern-

for _ in range(2,9):
    df.loc[(df['date'] == 'Unnamed: '+str(_), 'date'] = your_value

answered Oct 05 '21 at 19:18

Agnij

561
3
13

Replace Unnamed values in date column with true values

3 Answers3