[First off, these are my first "real" experiments with pandas, so the terminology in this question might be off.]
I am working with the GHCN weather data (https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/). The data consists of a CSV file that I load into a data frame (simplified here):
import pandas as pd

data = {
'station': {0: 'AE000041196', 1: 'AE000041196', 2: 'AE000041196', 3: 'AEM00041194', 4: 'AEM00041194', 5: 'AEM00041194', 6: 'AEM00041194', 7: 'AEM00041217', 8: 'AEM00041217', 9: 'AEM00041217'},
'date': {0: 20210101, 1: 20210101, 2: 20210102, 3: 20210103, 4: 20210101, 5: 20210101, 6: 20210101, 7: 20210101, 8: 20210101, 9: 20210101},
'measurement': {0: 'TMAX', 1: 'PRCP', 2: 'TAVG', 3: 'TMAX', 4: 'TMIN', 5: 'PRCP', 6: 'TAVG', 7: 'TMAX', 8: 'TMIN', 9: 'TAVG'},
'value': {0: 278, 1: 0, 2: 214, 3: 266, 4: 178, 5: 0, 6: 217, 7: 262, 8: 155, 9: 202}
}
df = pd.DataFrame(data)
Each row specifies a station and a date, as well as a measurement type and its value. In the actual data, there are around 50 different measurement types. I need to turn this data into a more "traditional" format where each column is one measurement type, and each row contains the data for a given station and date.
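To make the target format explicit, here is a hand-built sketch of what the wide result should look like for the sample data above (the column order and the use of NaN for missing station/date/measurement combinations are my assumptions, not something the dataset documentation prescribes):

```python
import pandas as pd

# Desired "traditional" layout for the sample data: one row per
# (station, date), one column per measurement type, NaN where a
# station/date combination has no reading of that type.
expected = pd.DataFrame({
    'station': ['AE000041196', 'AE000041196', 'AEM00041194',
                'AEM00041194', 'AEM00041217'],
    'date':    [20210101, 20210102, 20210101, 20210103, 20210101],
    'PRCP':    [0.0,   None,  0.0,   None,  None],
    'TAVG':    [None,  214.0, 217.0, None,  202.0],
    'TMAX':    [278.0, None,  None,  266.0, 262.0],
    'TMIN':    [None,  None,  178.0, None,  155.0],
})
print(expected)
```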
So far I can only come up with a manual approach that is terribly slow:
result = pd.DataFrame()
for key, group in df.groupby(['station', 'date']):
    vals = {}
    for idx, row in group.iterrows():
        vals['station'] = row['station']
        vals['date'] = row['date']
        vals[row['measurement']] = row['value']
    # DataFrame.append was removed in pandas 2.0; concatenate a
    # one-row frame per group instead
    result = pd.concat([result, pd.DataFrame([vals])], ignore_index=True)
It works, but surely there must be a more "pandas" way of doing this, ideally one that also allows parallel processing across multiple CPU cores?