
I have a very large data frame from which I would like to pull a subsample, perform some calculation, and then write these results into a new data frame. As a sample, please consider:

import numpy as np
import pandas as pd

df_test = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'])
df_test

returning this:

    a   b   c   d   e
0   1   9   0   3   0
1   5   4   1   0   3
2   9   3   6   3   5
3   6   2   5   9   7
4   9   0   7   9   5

Now I would like to "extract" 3 rows at a time, rolling from the beginning, and calculate the average of each column (as an example; other calculations would work too):

df_1
    a   b   c   d   e
0   1   9   0   3   0
1   5   4   1   0   3
2   9   3   6   3   5

df_2 
    a   b   c   d   e
1   5   4   1   0   3
2   9   3   6   3   5
3   6   2   5   9   7

df_3 
    a   b   c   d   e
2   9   3   6   3   5
3   6   2   5   9   7
4   9   0   7   9   5

The result data frame is then:

result
    a   b   c   d   e
0   5   5.3 2.3 2   2.7
1   6.7 3   4   4   5
2   8   1.7 6   7   5.7

How can I do that?

eternity1

1 Answer


Use rolling and remove the first NaN rows with iloc or dropna:

N = 3
# option 1: slice off the first N-1 rows, which are all NaN
df = df.rolling(N).mean().iloc[N-1:]

# option 2: equivalently, drop the rows that are entirely NaN
df = df.rolling(N).mean().dropna(how='all')

print (df)
          a         b         c    d         e
2  5.000000  5.333333  2.333333  2.0  2.666667
3  6.666667  3.000000  4.000000  4.0  5.000000
4  8.000000  1.666667  6.000000  7.0  5.666667

If you also need the mean of the first row, and of the first two rows, add the parameter min_periods:

df1 = df.rolling(3, min_periods=1).mean()
print (df1)
          a         b         c    d         e
0  1.000000  9.000000  0.000000  3.0  0.000000
1  3.000000  6.500000  0.500000  1.5  1.500000
2  5.000000  5.333333  2.333333  2.0  2.666667
3  6.666667  3.000000  4.000000  4.0  5.000000
4  8.000000  1.666667  6.000000  7.0  5.666667

EDIT:

A manual approach is to create one-row DataFrames and then join them all together:

dfs = []
N = 3
for x in np.arange(len(df) + 1)[N:]:
    df1 = df.iloc[np.arange(x - N, x)]   # current N-row window
    s = df1.mean().to_frame().T          # one-row DataFrame of column means
    dfs.append(s)

df2 = pd.concat(dfs, ignore_index=True)
print (df2)
          a         b         c    d         e
0  5.000000  5.333333  2.333333  2.0  2.666667
1  6.666667  3.000000  4.000000  4.0  5.000000
2  8.000000  1.666667  6.000000  7.0  5.666667
jezrael
  • I'm assuming that `iloc` is faster than `dropna` if you know how many rows to drop? – user3483203 Apr 08 '18 at 06:29
  • @chrisz - exactly, hence the first solution :) – jezrael Apr 08 '18 at 06:29
  • thank you jezrael and chrisz. Apologies, your solution answers exactly what I asked in my example; however, I was looking for a more manual approach (maybe average was not a good example, given there is a built-in function), for instance a self-defined function. Additionally, imagine the data is so large that it cannot all be loaded into one data frame at a time, but could be queried from a SQL database (i.e. a 1 GB table in Postgres or SQLite). – eternity1 Apr 08 '18 at 06:39
  • @eternity1 - Native support is chunked reading - check [`this`](https://stackoverflow.com/a/29522443/2901002), but it reads rows 0-5 first, then 6-10, ... Rolling chunks are not supported in `read_sql`. Maybe a solution would be to build the SQL queries dynamically and read them in a loop, but I think performance would be bad (a rough sketch of this idea follows after the comments). – jezrael Apr 08 '18 at 06:50
  • @jezrael: thanks, good idea with the dynamic SQL queries: it brought me some ideas. I am going to open a new question on how that could best be done. Looking forward to your great input as always! – eternity1 Apr 08 '18 at 06:57
  • additionally, how would you tackle writing the result into a new data frame using a test_function() that takes the subset data frame as input? – eternity1 Apr 08 '18 at 07:00
  • @eternity1 - Not sure if I 100% understand what you need, but you could create a list of `DataFrames`, process each sub-data-frame in a loop, and finally use `concat` to join all the sub-data-frames together (see the sketch after these comments). – jezrael Apr 08 '18 at 07:02
  • Just to understand: when I create a list of all sub data frames and iterate my calculation through each of them, wouldn't I have the same problem as loading all the data into one data frame? If the problem is only getting the data into Python, and once it's in I can access it without problems, then I would rather do that, so I don't have to run a dynamic SQL query for each sub data frame and can instead process the whole data frame as you suggested above with iloc. BTW, I have 4 tables of 1 GB each which I need to access for the analysis. – eternity1 Apr 08 '18 at 07:15
  • @eternity1 - check the edited answer; you are right, you need to read the csv first. – jezrael Apr 08 '18 at 07:26
  • @jezrael: thank you, that's very helpful. I also see np.arange used for the counter in the loop and inside iloc for the first time. Do the dimensions (columns) of the sub data frames have to be the same? I am asking because, depending on the sub data frame, there could be missing data in one or more of the columns, which I would then disregard in the analysis. Or would I need to build some intermediate data frame out of the sub data frames? – eternity1 Apr 08 '18 at 10:10
  • Hmmm, if you use np.average the column names are not important; only the dimensions need to be the same. – jezrael Apr 08 '18 at 10:44
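
A rough sketch of the dynamic-query idea from the comments, assuming a SQLite file data.db with a table named data (the file name, table name, and a stable row order are all assumptions here; in practice you would ORDER BY a key column). Each iteration reads only the current N-row window from the database:

import sqlite3
import pandas as pd

N = 3
con = sqlite3.connect('data.db')  # hypothetical database file
n_rows = pd.read_sql('SELECT COUNT(*) AS n FROM data', con)['n'][0]

dfs = []
for start in range(n_rows - N + 1):
    # pull only the current N-row window; one query per window
    window = pd.read_sql(f'SELECT * FROM data LIMIT {N} OFFSET {start}', con)
    dfs.append(window.mean().to_frame().T)

result = pd.concat(dfs, ignore_index=True)
con.close()

This keeps memory use down to one window at a time, but as noted in the comments it issues one query per window, so expect poor performance on large tables.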
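
And a sketch of the self-defined-function idea discussed above; test_function is a hypothetical stand-in for whatever per-window calculation is needed, and its dropna(axis=1) discards columns with missing data in a given window, per the last two comments. pd.concat aligns on column names, so windows that dropped different columns still combine into one result frame:

import numpy as np
import pandas as pd

def test_function(sub_df):
    # stand-in for any per-window calculation: here, the mean of
    # the columns that have no missing data in this window
    return sub_df.dropna(axis=1).mean().to_frame().T

N = 3
dfs = [test_function(df.iloc[np.arange(x - N, x)])
       for x in np.arange(len(df) + 1)[N:]]

result = pd.concat(dfs, ignore_index=True)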