How can I compute average monthly value from large daily JSON dataset in pandas dataframe?

Question

I'm relatively new to working with pandas dataframes to read in data and I'm having some trouble working with my dataset. I've been reading many other StackOverflow posts on a similar problem but I've been having trouble applying those solutions to my case possibly because of the structure of my JSON data. My JSON data as arranged into my dataframe df = pd.DataFrame.from_records(data) generally looks like this

             dateTime                          value
0   01/16/20 04:32:42   {'bpm': 70, 'confidence': 0}
1   01/16/20 04:32:57   {'bpm': 70, 'confidence': 0}
2   01/16/20 04:33:12   {'bpm': 70, 'confidence': 1}

and so on for many daily values.

My goal is to read all of this raw daily data and compute a monthly average of "bpm" and plot to a matplot graph. My problem is I'm having trouble using pandas datetime or mean() operations because I don't think pandas accepts my dateTime format as actually in datetime and when I try to use mean() it gives me an pandas.core.base.DataError: No numeric types to aggregate error.

How can I use built in pandas tools to allow me to compute a monthly average by grouping my daily values together based on the month?

for file in os.listdir(data_dir):  # look at every file in the folder
    if file.startswith("heart_rate") and file.endswith(".json"):  # only want heart_rate-date.json files
        with open(os.path.join(data_dir, file)) as f:  # open each file in data_dir
            data = json.load(f)
            df = pd.DataFrame.from_records(data)
            print(df)
            #df.dateTime = pd.to_datetime(df.dateTime)
            #df['Month'] = df['dateTime'].dt.month
            for i, j in enumerate(data):
                if data[i]['value']['confidence'] > 0:
                    daily_avg_bpm += data[i]['value']['bpm']
                    daily_date = data[i]['dateTime'].split()[0]
                    my_date = datetime.datetime.strptime(daily_date, "%m/%d/%y").date()
                    days.append(my_date)
            months.append(daily_date[:2])
            daily_avg_bpm /= len(data)
            dates.append(daily_date)
            avg_bpms.append(round(daily_avg_bpm))
        f.close()
plt.xlabel('Month')
plt.ylabel('Heart Rate')
plt.title("Fitbit Heart Rate")
for i, j in enumerate(dates):
    plt.plot(dates[i], avg_bpms[i])
plt.show()

Hi Prayson, Unfortunately not yet. I was trying to apply your suggestion but I was having errors with the apply() and I've been getting list indices must. Now I'm getting DataError raise DataError("No numeric types to aggregate") pandas.core.base.DataError: No numeric types to aggregate. — bpiekars, Dec 21 '20 at 19:07
It means you do not have numerical values to resample. What do you get by `df.dtypes`? — Prayson W. Daniel, Dec 21 '20 at 21:21
My dataframe values look like the example shown in my question with datetime and value. The numerical values I want are the ['values']['bpm'] values but df.dtypes gives me an error TypeError: 'Series' object is not callable — bpiekars, Dec 21 '20 at 22:34
As you see, you have Series not DataFrame. So the example data are not the same. Can you try `df['value'].apply(pd.Series)` ? Does that expand value? — Prayson W. Daniel, Dec 22 '20 at 06:42

Prayson W. Daniel · Accepted Answer · 2020-12-21T09:08:33.573

The easiest way it to transform your values to columns and cast dateTime to actual datetime. With that you can use resample to aggregate on the frequency desire:

import pandas as pd

data = pd.DataFrame({'dateTime':[ '01/16/20 04:32:42', '01/16/20 04:32:57', '02/16/20 04:33:12', '03/16/20 04:33:12'],
'value': [{'bpm': 70, 'confidence': 0}, {'bpm': 75, 'confidence': 0},  
          {'bpm': 73, 'confidence': 1}, {'bpm': 78, 'confidence': 1}]})

# expland 
df = data['value'].apply(pd.Series)

# to datetime and set index 
df['dateTime'] = pd.to_datetime(data['dateTime'])
df.set_index('dateTime', inplace=True)

# data resample to Monthy with mean 
example = df.resample('M', kind='period').mean()

# plot
example['bpm'].plot(title="Fitbit Heart Rate", xlabel="Month", ylabel="Heart Rate");

Note:

For a graph to show, make sure you have data with N > 1 of the selected frequency. So if the frequency is "M" for Month, make sure there are at least 2 months in the dataset.

Read Pandas resample documentation for more details.

`UserWarning: Attempting to set identical left == right == 600.0 results in singular transformations; automatically expanding. ax.set_xlim(left, right) ` and shows nothing. — Abhishek Rai, Dec 21 '20 at 07:21
That is because we only have data for one month. So you will not see anything. change the date above to `['01/16/20 04:32:42', '01/16/20 04:32:57', '02/16/20 04:33:12']` and you will see the chart. Note: I have update data to reflect it — Prayson W. Daniel, Dec 21 '20 at 08:51

score 0 · Answer 2 · answered Dec 21 '20 at 06:34

0

I do not have your data, so this might not work right away. But except that you should be able to use this method.

Steps

Convert value column's dict values into columns (Reference)
Group By (Reference)
Average (Reference)

import pandas as pd

df = pd.DataFrame.from_records(data)
# Step 1
df = pd.concat([df.drop(['value'], axis=1), df['value'].apply(pd.Series)], axis=1)
# Step 2
df.groupby(by=[df.dateTime])
# Step 3
print(df['bpm'].mean().sort_values())

answered Dec 21 '20 at 06:34

Inyoung Kim 김인영

1,434
1
17
38

I'm confused about the line `print(df['bpm'].mean().sort_values())` because df['bpm'].mean() returns a float value so this code throws the error `AttributeError: 'numpy.float64' object has no attribute 'sort_values'` Do you mean collect these values in a column and then sort? – bpiekars Dec 21 '20 at 07:27

How can I compute average monthly value from large daily JSON dataset in pandas dataframe?

2 Answers2

Note: