0

I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple excel files.)

I use a for-loop and append(), but it returns the last ones, not the whole. here is my code:

hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour) ## it works correctly, reads individual Excel spreadsheet 
    data = pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
    final.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean},ignore_index=True)
smci
  • 32,567
  • 20
  • 113
  • 146
  • This isn't reproducible, you need to explain/add imports to show here `get_data()` comes from and what it does. Does it read from file, from streaming, filter and existing dataframe? Really all this code does is process onte big dataframe, groupby hour `hh`, append `std` and `mean` columns. So it's aggregating `data` by `hh` into `final`. – smci Mar 14 '22 at 22:56
  • A style tip: to avoid wack variable names like `s_td`, `meean`, I'd append an underscore: `std_`, `mean_`. Or append some _suffix . But anyway you don't need to name them as separate variables if you put all the code in one aggregate/assign statement. – smci Mar 14 '22 at 22:57
  • Possible duplicate: [Naming returned columns in Pandas aggregate function?] -> [Multiple aggregations of the same column using pandas GroupBy.agg()](https://stackoverflow.com/questions/12589481/multiple-aggregations-of-the-same-column-using-pandas-groupby-agg) – smci Mar 14 '22 at 22:59
  • get_data() is a function that return a big list that is converted into a DataFrame. the problem was that I must introduced my intended final Dateframe outside of the loop. – sina hemati Mar 14 '22 at 23:01
  • But I'm asking where is that big list coming from? If it's all coming from one big CSV/Excel/stream/database, why not read `data` in as one big dataframe, then generate `final = data.groupby('hh').agg(...)` ? – smci Mar 14 '22 at 23:03
  • that is a great idea for naming. thanks – sina hemati Mar 14 '22 at 23:03
  • I had a big data in multi excel file, after importing I conducted numerous analysis on that and then I got it as a list. then to further analysis I converted that into a dataframe – sina hemati Mar 14 '22 at 23:07

3 Answers3

1

I am not sure, but I believe you should assign the final.append(... to a variable:

final = final.append({'Month':j ,'Hour':j,'standard deviation':x,'average':y},ignore_index=True)

Update

If time efficiency is of interest to you, it is suggested to use a list of your desired values ({'Month':j ,'Hour':j,'standard deviation':x,'average':y}), and assign this list to the dataframe. It is said it has better performance.(Thanks to @stefan_aus_hannover)

TheFaultInOurStars
  • 3,464
  • 1
  • 8
  • 29
  • but this procedure works well for below code: **import pandas as pd df = pd.DataFrame(columns=['A','b']) for i in range(5): x=i*(1/2) g='reza' df = df.append({'A': g,'b':x}, ignore_index=True) print(df) – sina hemati Mar 14 '22 at 22:30
  • 1
    @sinahemati Thanks for the commnet, sina. Without a doubt, you will need to save the `append` results into a datafarme, because actually, it returns a dataframe. I suggest you take a look at [this link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html) – TheFaultInOurStars Mar 14 '22 at 22:32
  • If you append the results to a list instead of the final dataframe at the end of each iteration of the for loop, you can append the list of lists to the dataframe after each iteration of the for loop and it will shave down on runtime – stefan_aus_hannover Mar 14 '22 at 22:33
  • 1
    Look, in this part: `df = df.append({'A': g,'b':x}, ignore_index=True` you are assigning the results of the `df.append(...` to df itself. I suggested the same thing: append the result of `final.append` to final. In your code you are not assigning the `final.append(...` to any other variable. – TheFaultInOurStars Mar 14 '22 at 22:34
  • @stefan_aus_hannover Thanks for the comment, Stefan. I edited the answer as per you comment. – TheFaultInOurStars Mar 14 '22 at 22:36
  • I tested it as well (final=final.append({'Month':j ,'Hour':j,'standard deviation':x,'average':y},ignore_index=True)). and my answer includes just the last iteration – sina hemati Mar 14 '22 at 22:38
  • my answer: Month Hour standard deviation average 0 05:00 05:00 0.189272 1.747301 – sina hemati Mar 14 '22 at 22:39
  • 1
    Amirhossein jan, I appreciate your favorable helps, my problem was fixed. – sina hemati Mar 14 '22 at 22:54
  • Just directly aggregate `data` into `final`. With `final = data.agg(...)` – smci Mar 14 '22 at 23:00
  • Just directly aggregate `data`: `final = data.agg(...)` and append that result row to `final`. – smci Mar 15 '22 at 00:32
0

This is what I am referring to in the comments on Amirhossein's answer:

hh=['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:``
    month=1
    hour = j
    data = get_data(month, hour) ## it works correctly 
    data=pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td=data.iloc[:,4].std()
    meean=data.iloc[:,4].mean()
    lister.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean})
final = final.append(pd.DataFrame(lister),ignore_index=True)
stefan_aus_hannover
  • 1,777
  • 12
  • 13
  • I got your points. I must exclude '' final = pd.DataFrame(columns=['Month','Hour','standard deviation','average']) '' from the for loop. now it works well without using list.many thnaks for your attention and time – sina hemati Mar 14 '22 at 22:53
0

Conceptually you're just doing aggregate by hour, with the two functions std, mean; then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note the .agg/.aggregate() function accepts a dict of {'result_col': aggregating_function} and allows you to pass multiple aggregating functions, and directly name their result column, so no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), then no need to read in columns 0..3.

for hour in hh:
    # Read in columns-of-interest from individual Excel sheet for this month and day...
    data = get_data(1, hour)
    data = pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])

    # Compute corresponding row of the aggregate...
    dat_hh_aggregate = pd.DataFrame({['Month':whatever ,'Hour':hour]})
    dat_hh_aggregate = dat_hh_aggregate.append(data.agg({'standard deviation':pd.Series.std, 'average':pd.Series.mean)})

final = final.append(dat_hh_aggregate, ignore_index=True)

Notes:

  • pd.read_excel usecols=['Flowday','Interval',...] allows you to avoid reading in columns that you aren't interested in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass the list of columns-of-interest. But you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
  • There's no need to store separate local variables s_td, meean, just directly use .aggregate()
  • There's no need to have both lister and final. Just have one results dataframe final, and append to it, ignoring the index. (If you get issues with that, post updated code here, make sure it's reproducible)
smci
  • 32,567
  • 20
  • 113
  • 146