0

I have this dataset available at: https://gitlab.com/creativitylabb/dataset-test/-/raw/main/final_pagination.csv where the data looks like this after processing:

    TimeStamp   Source  Sensor  Value   LocationLat LocationLong    Measurement
TimeStamp                           
01.02.2021 07:00:00 01.02.2021 07:00:00 Waqi    pm10    16.0    45.716700   25.633300   µg/m3
01.02.2021 07:00:00 01.02.2021 07:00:00 Waqi    no2 4.0 45.716700   25.633300   µg/m3
01.02.2021 07:00:00 01.02.2021 07:00:00 Waqi    no2 2.3 45.716700   25.633300   µg/m3
01.02.2021 07:00:00 01.02.2021 07:00:00 Waqi    o3  19.8    45.716700   25.633300   µg/m3
01.02.2021 08:00:00 01.02.2021 08:00:00 Waqi    no2 28.5    45.659833   25.614488   µg/m3

The processing I used:

from datetime import datetime

import pandas as pd

df = pd.read_csv('https://gitlab.com/creativitylabb/dataset-test/-/raw/main/final_pagination.csv')

df = df.drop(['id', 'index', 'type', 'score', 'Unnamed: 0'], 1)

df['TimeStamp'] = df['TimeStamp'].apply(lambda x: datetime.utcfromtimestamp(x / 1000).strftime('%d.%m.%Y %H:%M:%S'))

df = df.sort_values(by='TimeStamp').reset_index(drop=True)

print(df.head().to_string())

df.index = df['TimeStamp']

The Sensor value contains sensors like pm10, pm2.5, co2 and so on. The Value column contains the sensor's measured value. How could I split the data into other columns, so that I can have a column with pm10 values, another with pm2.5 values and so on? (preferably without having all the other columns Nan)

Example output:

    TimeStamp   Source  pm10 pm25 LocationLat LocationLong    Measurement
TimeStamp                           
01.02.2021 07:00:00 01.02.2021 07:00:00 Waqi   16.0  20  45.716700   25.633300   µg/m3
01.02.2021 07:00:00 01.02.2021 07:00:00 Waqi   4.0 21 45.716700   25.633300   µg/m3
01.02.2021 07:00:00 01.02.2021 07:00:00 Waqi    2.3 20 45.716700   25.633300   µg/m3
01.02.2021 07:00:00 01.02.2021 07:00:00 Waqi    19.8 25   45.716700   25.633300   µg/m3
01.02.2021 08:00:00 01.02.2021 08:00:00 Waqi    28.5  24  45.659833   25.614488   µg/m3
Ian Kurtis
  • 137
  • 7
  • 1
    Will you please show a sample dataframe containing your expected output? –  Jan 03 '22 at 18:02
  • Does this answer your question? [How can I pivot a dataframe?](https://stackoverflow.com/questions/47152691/how-can-i-pivot-a-dataframe) – G. Anderson Jan 03 '22 at 18:04
  • @richardec I updated the answer with the example output – Ian Kurtis Jan 03 '22 at 18:08
  • @G.Anderson but pivot table gives me NaN values for all the other columns but one, with the sensor value. Is there any way to escape this? – Ian Kurtis Jan 03 '22 at 18:15
  • @IanKurtis I am a bit confused... in your sample output, if you're keeping the rest of the rows unchanged, how can you have a new column of pm10 with values for every row when the number of pm10 measurements isn't the same as the total number of rows? for example, during the timestamp `01.02.2021 07:00:00` there is only one `pm10` measurement so if you create a new column for `pm10` during that timestamp, what are the other rows for that same timestamp going to be for the column `pm10`? – Derek O Jan 03 '22 at 18:59

1 Answers1

1

From this output, what is the problem with NaN:

out = df.reset_index(drop=True) \
        .pivot_table(values='Value', columns='Sensor', aggfunc='mean',
                     index=['TimeStamp', 'Source', 'LocationLat', 
                            'LocationLong', 'Measurement'])

Output:

Sensor                                                                cho2    co2    no2    o3   pm1  pm10  pm25  so2
TimeStamp           Source      LocationLat LocationLong Measurement                                                 
01.02.2021 07:00:00 Waqi        45.716700   25.633300    µg/m3         NaN    NaN   3.15  19.8   NaN  16.0   NaN  NaN
01.02.2021 08:00:00 Waqi        45.716700   25.633300    µg/m3         NaN    NaN   2.85  21.3   NaN  16.0   NaN  NaN
                                45.634330   25.633872    µg/m3         NaN    NaN  18.30   NaN   NaN  29.0   NaN  NaN
                                45.649117   25.604136    µg/m3         NaN    NaN  13.15   3.6   NaN  16.0   NaN  NaN
                                45.654271   25.624061    µg/m3         NaN    NaN  11.00   NaN   NaN  18.0   NaN  NaN
...                                                                    ...    ...    ...   ...   ...   ...   ...  ...
31.10.2021 23:59:41 UradMonitor 45.641274   25.614817    µg/m3         NaN    NaN    NaN   NaN  11.0  20.0  18.0  NaN
31.10.2021 23:59:42 UradMonitor 45.651464   25.615426    µg/m3         7.0  591.0    NaN  20.0  25.0  35.0  31.0  NaN
31.10.2021 23:59:47 UradMonitor 45.559140   25.324771    µg/m3         NaN    NaN    NaN   NaN  12.0  18.0  17.0  NaN
31.10.2021 23:59:56 UradMonitor 45.657148   25.563752    µg/m3         7.0  681.0    NaN  20.0  20.0  28.0  25.0  NaN
31.10.2021 23:59:57 UradMonitor 45.661962   25.605432    µg/m3         NaN    NaN    NaN   NaN  11.0  25.0  21.0  NaN

[266450 rows x 8 columns]
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I want to use the data for forecasting. Won't the NaN values influence the result of the forecast? Or should I try to have a mean value per day for all the sensors? Maybe in this way I can get away from some NaN values... – Ian Kurtis Jan 04 '22 at 08:06
  • Your data seems to be driven by `pmX` values. Maybe you can fill nan by interpolating values for other columns or front fill last known value, etc. You can also apply ML techniques to fill values. – Corralien Jan 04 '22 at 08:47