0

Data import from csv:

Date Item_1 Item 2
1990-01-01 34 78
1990-01-02 42 19
. . .
. . .
2020-12-31 41 23
df = pd.read_csv(r'Insert file directory')
df.index = pd.to_datetime(df.index)
gb= df.groupby([(df.index.year),(df.index.month)]).mean()

Issue: So basically, the requirement is to group the data according to year and month before processing and I thought that the groupby function would have grouped the data so that the mean() calculate the averages of all values grouped under Jan-1990, Feb-1990 and so on. However, I was wrong. The output result in the average of all values under Item_1

My example is similar to the below post but in my case, it is calculating the mean. I am guessing that it has to do with the way the data is arranged after groupby or some parameters in mean() have to be specified but I have no idea which is the cause. Can someone enlighten me on how to correct the code?

Pandas groupby month and year

Update: Hi all, I have created the sample data file .csv with 3 items and 3 months of data. I am wondering if the cause has to do with the conversion of data into df when it is imported from .csv because I have noticed some weird time data on the leftmost as shown below:

enter image description here

Link to sample file is: https://www.mediafire.com/file/t81wh3zem6vf4c2/test.csv/file

  • Your code seems to work correctly for me. Perhaps you can add a bit more detail to drill down on the problem. – bicarlsen Aug 10 '21 at 15:22
  • Hi bicarlsen, I have recreated the sample data in the same file format (link is provided above) and update the code to show how I import the .csv file into the df. I have tried to run the code again and it still calculates the average of entire values under each item. – user14074078 Aug 10 '21 at 16:35
  • @user14074078 have you tested [my proposed solution](https://stackoverflow.com/a/68730009/16343464)? – mozway Aug 10 '21 at 16:39
  • @mozway yes, I have tried your proposed solution and the result is the same. I am wondering if the conversion of csv, resulted in some unwanted item being included. – user14074078 Aug 10 '21 at 16:45

2 Answers2

1
import pandas as pd

df = pd.read_csv( 'test.csv', index_col = 'date' )
df.index = pd.to_datetime( df.index )

df.groupby([(df.index.year),(df.index.month)]).mean()

Seems to do the trick from the provided data.

bicarlsen
  • 1,241
  • 10
  • 27
  • Hi bicarlsen, inclusion of the argument, index_col = 'date', works! Thanks for your help!!! Really appreciate it. Once again, I am caught-off guard by small details in code, lol. – user14074078 Aug 11 '21 at 04:24
0

IIUC, you want to calculate the mean of all elements. You can use numpy's mean function that operates on the flattened array by default:

df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
gb = df.groupby([(df.index.year),(df.index.month)]).apply(lambda d: np.mean(d.values))

output:

date  date
1990  1       0.563678
      2       0.489105
      3       0.459131
      4       0.755165
      5       0.424466
      6       0.523857
      7       0.612977
      8       0.396031
      9       0.452538
      10      0.527063
      11      0.397951
      12      0.600371
dtype: float64
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Hi mozway, I have recreated the sample data in the same file format (link is provided above) and update the code to show how I import the .csv file into the df. I have tried to run the code again and it still calculates the average of entire values under each item. – user14074078 Aug 10 '21 at 16:39
  • @user14074078 I updated the code. Apparently, this works only on the raw array. – mozway Aug 10 '21 at 16:57
  • Hi mozway, I have tried your updated code, it results in "ValueError: time data '0' does not match format '%d/%m/%Y' (match)" However, bicarlsen's slight alteration to the argument for pd.read_csv seems to do the trick. Once again, thank you for your assistance in this. Really appreciate it! – user14074078 Aug 11 '21 at 04:19