Using np.array_split and then saving each split into dataframes

Question

Appending data to a dataframe but changing rows after certain # of columns

The above is my previous post, where I attempted to convert an 1800-row x 1-column dataframe into a 300-row x 6-column dataframe with:

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind_from_stats

i = 0
k = 2
j = 2

result = []
df = pd.DataFrame()
print(data.shape)
while j < data.shape[1]:
    tstat, data_stat = ttest_ind_from_stats(data.loc[i][k], data.loc[i + 1][k], data.loc[i + 2][k], data.loc[i][j],
                                            data.loc[i + 1][j], data.loc[i + 2][j])
    result.append([data_stat])
    #print(i, k, i, j)
    #print(i + 1, k, i + 1, j)
    #print(i + 2, k, i + 2, j)
    j+=1
    if j == data.shape[1]:
        j = 2
        i = i + 3
    if i == data.shape[0]:
        k = k + 1
        i = 0
        if k > data.shape[1]-1:
            break

data_result = pd.DataFrame(result)

a = np.array(data_result)
b = a.reshape(int(data.shape[0]*2),6)
data_result_new = pd.DataFrame(b)
data_result_new.columns = ['col1','col2','col3','col4','col5','col6']

I would then like to further split the dataframe into six chunks. I was thinking about using np.array_split like:

c = np.array_split(b,6)

This line would be added right after b = a.reshape(int(data.shape[0]*2),6) (I know the data_result_new lines won't work if split is applied).

For example:

The starting data table would look like:

col1    col2   col3    col4    col5    col6
1       0.658  0.1067  0.777   0.459   0.3307
1       0.622  0.4178  0.3158  0.7674  0.7426
1       0.622  0.4178  0.3158  0.7674  0.7426
1       0.622  0.4178  0.3158  0.7674  0.7426
1       0.622  0.4178  0.3158  0.7674  0.7426
.
.
.
.
0.123   1      0.1222  0.111   0.123   0.1234
0.123   1      0.1222  0.111   0.123   0.1234
0.123   1      0.1222  0.111   0.123   0.1234
0.123   1      0.1222  0.111   0.123   0.1234
0.123   1      0.1222  0.111   0.123   0.1234
.
.
.

and so on. Please note that the numbers are just random for this post; they are essentially p-values, so for testing you can use any floating-point numbers. The rows come in groups of 50, which is why I would like to separate the 300x6 df into 6 dfs of 50x6. Because of the data size, I wasn't able to insert all of it and had to abbreviate the table as above, but for actual testing you can generate random values in a df of shape 300x6 (not counting the headers).

What I want is:

[df1]
col1    col2   col3    col4    col5    col6
1       0.658  0.1067  0.777   0.459   0.3307
1       0.622  0.4178  0.3158  0.7674  0.7426
1       0.622  0.4178  0.3158  0.7674  0.7426
1       0.622  0.4178  0.3158  0.7674  0.7426
1       0.622  0.4178  0.3158  0.7674  0.7426

[df2]
col1    col2   col3    col4    col5    col6
0.123   1      0.1222  0.111   0.123   0.1234
0.123   1      0.1222  0.111   0.123   0.1234
0.123   1      0.1222  0.111   0.123   0.1234
0.123   1      0.1222  0.111   0.123   0.1234
0.123   1      0.1222  0.111   0.123   0.1234

and so on. I am not sure how I would iterate over each split from np.array_split then save as separate dataframes. Any help or suggestions would be appreciated.

score 1 · Accepted Answer · answered Jan 20 '20 at 16:50

It might depend on how you want to access the data afterwards, but you could add an extra column to the dataframe to assign group labels, then group the data by this column and create a list of dataframes from the groups.

import numpy as np
import pandas as pd

data = np.random.rand(300,6)
df = pd.DataFrame(data)

# label each row with its 50-row group number (0 to 5); relies on the default 0..299 integer index
df["label"] = df.index // 50
gb = df.groupby("label")
df_list = [gb.get_group(x).set_index("label") for x in gb.groups]

df.head(3)

df.tail(3)

for x in df_list: # each dataframe should have 50 rows and 6 columns
    assert x.shape == (50, 6)

# print first dataframe head (rows should be same as head printed above)
df_list[0].head(3) # and access the values/numpy array by df_list[0].values

# print last section (rows should be same as tail printed above)
df_list[5].tail(3) # and access the values/numpy array by df_list[5].values
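
If you would rather stay with np.array_split as in the question, a minimal sketch could look like the one below. It assumes `b` is the reshaped 300x6 array (random values stand in for it here), and the `columns` list and the `dfs` dictionary are names made up for illustration. np.array_split returns a plain list of arrays, so each one can be wrapped in its own dataframe with a list comprehension:

```python
import numpy as np
import pandas as pd

# Stand-in for the reshaped p-value array `b` from the question (random values).
rng = np.random.default_rng(0)
b = rng.random((300, 6))
columns = ["col1", "col2", "col3", "col4", "col5", "col6"]

# Splitting 300 rows into 6 sections along axis 0 gives six arrays of 50 rows each.
chunks = np.array_split(b, 6)

# Wrap each chunk in its own DataFrame, reusing the same column names.
df_list = [pd.DataFrame(chunk, columns=columns) for chunk in chunks]

# Optional: key the frames by name (df1 ... df6) instead of by list position.
dfs = {f"df{i}": frame for i, frame in enumerate(df_list, start=1)}
```

Each frame built this way gets a fresh 0 to 49 index. Keeping them in a list or dictionary is usually easier to loop over than creating six separate variables.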
