
I have a list like this with about 141 entries:

training = [40.0,49.0,77.0,...... 3122.0]

My goal is to split the list into consecutive 20% parts, starting with the first 20%. I did it like this:

testfile_first20 = training[0:int(len(set(training))*0.2)]
testfile_second20 = training[int(len(set(training))*0.2):int(len(set(training))*0.4)]
testfile_third20 = training[int(len(set(training))*0.4):int(len(set(training))*0.6)]
testfile_fourth20 = training[int(len(set(training))*0.6):int(len(set(training))*0.8)]
testfile_fifth20 = training[int(len(set(training))*0.8):]

Is there any way to do this automatically in a loop? This is how I select the folds for k-fold cross-validation.

Thank you.

petezurich
raffa_sa
  • training[0:len(training)//5]. It's been a while since I’ve used Python, but that should work. It takes the length of training, integer-divides it by five (i.e. 20% of training), and returns that slice of values. – TheEpicPanic Nov 22 '18 at 15:40
  • Possible duplicate of [How do you split a list into evenly sized chunks?](https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks) – Abhishek Dujari Nov 22 '18 at 16:04

4 Answers


You can use list comprehensions:

div_length = int(0.2*len(set(training)))
testfile_divisions = [training[i*div_length:(i+1)*div_length] for i in range(5)]

This will give you your results stacked in a list:

>>> [testfile_first20, testfile_second20, testfile_third20, testfile_fourth20, testfile_fifth20]

If len(training) does not divide equally into five parts, then you can either have five full divisions with a sixth taking the remainder as follows:

import math

div_length = math.floor(0.2*len(set(training)))
testfile_divisions = [training[i*div_length:min(len(training), (i+1)*div_length)] for i in range(6)]

or you can have four full divisions with the fifth taking the remainder as follows:

import math

div_length = math.ceil(0.2*len(set(training)))
testfile_divisions = [training[i*div_length:min(len(training), (i+1)*div_length)] for i in range(5)]
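
If neither a sixth remainder chunk nor a short fifth chunk is wanted, a third option is to spread the leftover elements over the first few divisions, so that the sizes differ by at most one. The following is only a minimal sketch: the helper name `split_evenly` is hypothetical (it is not part of the answer above), it uses `len(training)` directly rather than `len(set(training))`, and the data is a stand-in for the real list.

```python
def split_evenly(items, k=5):
    """Split items into k consecutive parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), k)
    parts = []
    start = 0
    for i in range(k):
        # The first `extra` parts each take one of the leftover elements.
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts

training = [float(x) for x in range(141)]  # stand-in data, not the asker's real list
testfile_divisions = split_evenly(training)
print([len(part) for part in testfile_divisions])
```

With 141 entries this gives one part of 29 elements and four parts of 28, and no element is dropped.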
berkelem
  • If I try this, I get an error like this: slice indices must be integers or None or have an __index__ method – raffa_sa Nov 22 '18 at 15:41
  • If I run `for i in range(5): print(len(testfile_divisions[i]))` I get `28 55 82 109 137`, but every part of the list should have the same length. – raffa_sa Nov 22 '18 at 15:45
  • Ah okay. I've corrected the code. I think this should work. – berkelem Nov 22 '18 at 15:47
  • I just found an error: if `len(training)` is not divisible by 5, I lose some elements, which should not happen @berkelem – raffa_sa Nov 27 '18 at 08:49
  • I updated the answer. There are two ways you can handle this, either having five full divisions with a remainder or four full divisions with the fifth division being a remainder. – berkelem Nov 27 '18 at 10:56

Here's a simple take with a list comprehension:

lst = list('abcdefghijkl')
l = len(lst)

[lst[i:i+l//5] for i in range(0, l, l//5)]

# [['a', 'b'], 
#  ['c', 'd'], 
#  ['e', 'f'], 
#  ['g', 'h'], 
#  ['i', 'j'], 
#  ['k', 'l']]

Edit: Actually now that I look at my answer, it's not a true 20% representation as it returns 6 sublists instead of 5. What is expected to happen when the list cannot be equally divided into 5 parts? I'll leave this up for now until further clarifications are given.

r.ook

You can loop this by storing the "size" of 20% and the current starting point in two variables, then advancing the starting point by that size on each iteration:

start = 0
twenty_pct = len(training) // 5

parts = []
for k in range(5):
    parts.append(training[start:start+twenty_pct])
    start += twenty_pct

However, I suspect there are numpy/pandas/scipy operations that might be a better match for what you want. For example, sklearn includes a function called KFold: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
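
As a rough illustration of what k-fold selection does with such parts, the sketch below holds out each part once as the test set and joins the remaining parts into the training set. It is plain Python written for illustration, not sklearn's `KFold` implementation; the generator name `kfold_pairs` and the sample data are assumptions.

```python
def kfold_pairs(parts):
    """Yield (train, test) pairs, holding out each part once as the test set."""
    for i, test in enumerate(parts):
        # Concatenate every part except the i-th one.
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        yield train, test

training = [float(x) for x in range(10)]  # stand-in data, length divisible by 5
twenty_pct = len(training) // 5
parts = [training[i * twenty_pct:(i + 1) * twenty_pct] for i in range(5)]

folds = list(kfold_pairs(parts))
for train, test in folds:
    print(test, len(train))
```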

aghast

Something like this; because of rounding, the parts may differ in size by one element.

tlen = float(len(training))    
testfiles = [ training[ int(i*0.2*tlen): int((i+1)*0.2*tlen) ] for i in range(5) ]
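
The same kind of boundaries can also be computed with integer arithmetic only, which sidesteps floating-point rounding altogether. A small sketch on stand-in data (the names `n` and `bounds` are illustrative, not from the answer above):

```python
training = [float(x) for x in range(141)]  # stand-in data
n = len(training)

# Six boundaries delimit five parts; the last boundary is exactly n.
bounds = [i * n // 5 for i in range(6)]
testfiles = [training[bounds[i]:bounds[i + 1]] for i in range(5)]
print([len(part) for part in testfiles])
```

Because each part starts where the previous one ends and the final boundary equals `n`, every element lands in exactly one part.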
jlanik