This is my initial data frame df:

col1    col2    col3
  1       0.5     10
  1       0.3     11
  5       1.4     1
  3       1.5     2
  1       0.9     10
  3       0.4     7
  1       1.2     9
  3       0.1     11
  4       0.1     11

I converted it into a list of data frames list_df:

n = 3 # the value of "n" does not matter
list_df = [df[i:i+n] for i in range(0, df.shape[0],n)]

list_df

[
  pd.DataFrame(
    col1    col2    col3
      1       0.5     10
      1       0.3     11
      5       1.4     1),
  pd.DataFrame(
    col1    col2    col3
      3       1.5     2
      1       0.9     10
      3       0.4     7),
  pd.DataFrame(
    col1    col2    col3
      1       1.2     9
      3       0.1     11
      4       0.1     11)
]
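
For reference, here is a minimal, self-contained sketch of the setup above. The column values are copied from the table in the question; the use of `.iloc` and `len(df)` is only a stylistic assumption and should be equivalent to the slicing shown earlier.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "col1": [1, 1, 5, 3, 1, 3, 1, 3, 4],
    "col2": [0.5, 0.3, 1.4, 1.5, 0.9, 0.4, 1.2, 0.1, 0.1],
    "col3": [10, 11, 1, 2, 10, 7, 9, 11, 11],
})

n = 3  # rows per chunk; the exact value does not matter

# Cut the frame into consecutive chunks of n rows each.
list_df = [df.iloc[i:i + n] for i in range(0, len(df), n)]

print(len(list_df))                        # 3
print([chunk.shape for chunk in list_df])  # [(3, 3), (3, 3), (3, 3)]
```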

How can I randomly split this list into two lists of data frames, list_df1 and list_df2, so that list_df1 contains 70% of the data frames and list_df2 contains the rest?

I tried to use masking, but it does not work with a list of data frames.

Fluxy
  • Do you want to slice a list into n=2 partitions? Check this: https://stackoverflow.com/questions/2659900/slicing-a-list-into-n-nearly-equal-length-partitions – Belbahar Raouf Feb 23 '20 at 22:54
  • @BelbaharRaouf: Thanks, but I think that it's different from what I need. I have a list of data frames. In fact, the value of `n` (i.e. the number of rows in data frames) does not matter. – Fluxy Feb 23 '20 at 22:55
  • Does this one help: https://stackoverflow.com/a/48561916/1534017 ? Also works on list of data frames. – Cleb Feb 23 '20 at 22:58
  • @Cleb: yes, seems to be very close to what I need. how can I define the indices at which the list of data frames should be split? – Fluxy Feb 23 '20 at 23:03
  • @Cleb: `list_df1, list_df1 = np.split(list_df, [6])` This does not seem to work. – Fluxy Feb 23 '20 at 23:05
  • I am on my phone, so I cannot play around, but finding the index for `0.7 * len(your_list)` should not be that hard, I think. – Cleb Feb 23 '20 at 23:05
  • @Cleb: Yes, it's not the problem to get index. But I get an error `ValueError: cannot copy sequence with size 10 to array axis with dimension 23` and `AttributeError: 'list' object has no attribute 'swapaxes'`. – Fluxy Feb 23 '20 at 23:09
  • Does this answer your question? [How to split data into trainset and testset randomly?](https://stackoverflow.com/questions/17412439/how-to-split-data-into-trainset-and-testset-randomly) – AMC Feb 24 '20 at 01:30
  • As far as I can tell the fact that the list contains Dataframes is irrelevant. – AMC Feb 24 '20 at 01:31

1 Answer

You can use np.random.choice from numpy to draw a sample of indices without replacement, and then filter list_df with it:

import math
import numpy as np

# number of elements that make up 70% of list_df (rounded down)
n_70pct = math.floor(len(list_df) * 0.7)

# sample 70% of the indices of list_df; replace=False ensures no index repeats
int_sample = np.random.choice(len(list_df), n_70pct, replace=False).tolist()

# list_df1 gets the data frames whose indices are in int_sample
list_df1 = [list_df[i] for i in int_sample]

# list_df2 gets the data frames whose indices are not in int_sample
list_df2 = [list_df[i] for i in range(len(list_df)) if i not in int_sample]
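
As noted in the comments, nothing here depends on the elements being data frames, so the same idea can be sketched with the standard library alone. The helper name `split_list` and its `frac` and `seed` parameters are hypothetical, introduced only for illustration, and the strings stand in for data frames.

```python
import random


def split_list(items, frac=0.7, seed=None):
    """Randomly partition items into two lists.

    The first list receives int(len(items) * frac) elements (rounded down),
    the second receives the rest. Original order is kept within each list.
    """
    rng = random.Random(seed)
    n_first = int(len(items) * frac)
    # Draw distinct indices, so no element is picked twice.
    chosen = set(rng.sample(range(len(items)), n_first))
    first = [item for i, item in enumerate(items) if i in chosen]
    second = [item for i, item in enumerate(items) if i not in chosen]
    return first, second


# Stand-ins for data frames; any list of objects works the same way.
chunks = [f"chunk_{i}" for i in range(10)]
part1, part2 = split_list(chunks, frac=0.7, seed=0)
print(len(part1), len(part2))
```

Because the count is rounded down, a short list splits unevenly: with the three data frames in the question, 70% comes out as two frames in the first list and one in the second.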
fmarm