How to split a list of data frames into two lists?

Question

This is my initial data frame df:

col1    col2    col3
  1       0.5     10
  1       0.3     11
  5       1.4     1
  3       1.5     2
  1       0.9     10
  3       0.4     7
  1       1.2     9
  3       0.1     11
  4       0.1     11

I converted it into a list of data frames list_df:

n = 3 # the value of "n" does not matter
list_df = [df[i:i+n] for i in range(0, df.shape[0],n)]

list_df

[
  pd.DataFrame(
    col1    col2    col3
      1       0.5     10
      1       0.3     11
      5       1.4     1),
  pd.DataFrame(
    col1    col2    col3
      3       1.5     2
      1       0.9     10
      3       0.4     7),
  pd.DataFrame(
    col1    col2    col3
      1       1.2     9
      3       0.1     11
      4       0.1     11)
]

How can I randomly split this list into two lists of data frames, list_df1 and list_df2, so that list_df1 contains 70% of the data frames and list_df2 contains the rest?

I tried to use masking, but it does not work with a list of data frames.

Do you want to slice a list into n=2 partitions? Check this: https://stackoverflow.com/questions/2659900/slicing-a-list-into-n-nearly-equal-length-partitions — Belbahar Raouf, Feb 23 '20 at 22:54
@BelbaharRaouf: Thanks, but I think that it's different from what I need. I have a list of data frames. In fact, the value of `n` (i.e. the number of rows in data frames) does not matter. — Fluxy, Feb 23 '20 at 22:55
Does this one help: https://stackoverflow.com/a/48561916/1534017 ? Also works on list of data frames. — Cleb, Feb 23 '20 at 22:58
@Cleb: yes, seems to be very close to what I need. how can I define the indices at which the list of data frames should be split? — Fluxy, Feb 23 '20 at 23:03
@Cleb: `list_df1, list_df1 = np.split(list_df, [6])` This does not seem to work. — Fluxy, Feb 23 '20 at 23:05
I am on my phone, so I cannot play around, but finding the index for `0.7 * len(your_list)` should not be that hard, I think. — Cleb, Feb 23 '20 at 23:05
@Cleb: Yes, it's not the problem to get index. But I get an error `ValueError: cannot copy sequence with size 10 to array axis with dimension 23` and `AttributeError: 'list' object has no attribute 'swapaxes'`. — Fluxy, Feb 23 '20 at 23:09
Does this answer your question? [How to split data into trainset and testset randomly?](https://stackoverflow.com/questions/17412439/how-to-split-data-into-trainset-and-testset-randomly) — AMC, Feb 24 '20 at 01:30
As far as I can tell the fact that the list contains Dataframes is irrelevant. — AMC, Feb 24 '20 at 01:31

score 1 · Accepted Answer · answered Feb 23 '20 at 23:09

You can use np.random.choice from numpy with replace=False to sample the indices to keep, and then filter list_df:

import math
import numpy as np

# number of elements that make up 70% of list_df (rounded down)
n_70pct = math.floor(len(list_df) * 0.7)

# sample that many distinct indices of list_df (no repeats, all in range)
int_sample = np.random.choice(len(list_df), n_70pct, replace=False).tolist()

# list_df1 gets the data frames whose indices are in int_sample
list_df1 = [list_df[i] for i in int_sample]

# list_df2 gets the data frames whose indices are not in int_sample
list_df2 = [list_df[i] for i in range(len(list_df)) if i not in int_sample]
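
For anyone who wants to run the idea end to end, here is a minimal self-contained sketch. It rebuilds the data frame from the question and splits the list by shuffling its indices with the standard library. The helper name split_list and its fraction and seed parameters are made up for illustration; they are not part of pandas or numpy.

```python
import random

import pandas as pd

# Example data frame from the question.
df = pd.DataFrame({
    "col1": [1, 1, 5, 3, 1, 3, 1, 3, 4],
    "col2": [0.5, 0.3, 1.4, 1.5, 0.9, 0.4, 1.2, 0.1, 0.1],
    "col3": [10, 11, 1, 2, 10, 7, 9, 11, 11],
})

n = 3  # chunk size; its value does not matter for the split
list_df = [df[i:i + n] for i in range(0, df.shape[0], n)]


def split_list(items, fraction=0.7, seed=None):
    """Hypothetical helper: randomly split items into two lists.

    The first list gets floor(len(items) * fraction) elements,
    the second list gets the remaining ones.
    """
    rng = random.Random(seed)      # seed only makes the split repeatable
    indices = list(range(len(items)))
    rng.shuffle(indices)           # random order, every index exactly once
    cut = int(len(items) * fraction)
    first = [items[i] for i in indices[:cut]]
    second = [items[i] for i in indices[cut:]]
    return first, second


list_df1, list_df2 = split_list(list_df, fraction=0.7, seed=0)
print(len(list_df1), len(list_df2))  # 2 1
```

Because every index appears exactly once in the shuffled order, no data frame can land in both lists or be left out, and the contents of the list (data frames or anything else) play no role in the split.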
