Efficient splitting of data in Python

Question

Consider the following code:

one, two = sales.random_split(0.5, seed=0)
set_1, set_2 = one.random_split(0.5, seed=0)
set_3, set_4 = two.random_split(0.5, seed=0)

What I am trying to do here is randomly split the data in my `sales` SFrame (which is similar to a Pandas DataFrame) into 4 roughly equal parts.

What is a Pythonic/Efficient way to achieve this?

Can you clarify why this isn't Pythonic or efficient as written? One issue I can see is creating a number of folds that isn't a power of two, but that sounds different from what you're asking. — papayawarrior, Dec 17 '15 at 20:19

score 2 · Accepted Answer · answered Dec 17 '15 at 15:03

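import numpy as np

np.random.seed(0)              # fix the seed so the split is reproducible
np.random.shuffle(arr)         # arr is a NumPy array; shuffled in place along its first axis
sets = np.array_split(arr, 4)  # four parts of nearly equal size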
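
If copying the data into an array is a concern, one possible variation is to shuffle row indices instead of the rows themselves and split those. This is only a sketch: the helper name `random_index_folds` and the sizes used are made up for illustration, and it also shows that `np.array_split` copes with a row count that is not a multiple of the number of parts.

```python
import numpy as np

def random_index_folds(n_rows, n_folds, seed=0):
    """Return n_folds arrays of shuffled row indices that together cover range(n_rows).

    Hypothetical helper for illustration; not part of NumPy or SFrame.
    """
    rng = np.random.default_rng(seed)      # local generator, leaves global state alone
    shuffled = rng.permutation(n_rows)     # random ordering of 0 .. n_rows-1
    return np.array_split(shuffled, n_folds)

folds = random_index_folds(10, 4)
sizes = [len(f) for f in folds]            # [3, 3, 2, 2]: the first parts get the extra rows
```

Each entry of `folds` can then be used to select rows, e.g. `arr[folds[0]]` for a NumPy array. How best to select rows by index from an `SFrame` is not covered here.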

John Zwinck

Would you please explain why this is more efficient? Also, I see that you are using `NumPy`, which means I need to convert the `SFrame` into a NumPy array. Won't the conversion add overhead? – Khurram Majeed Dec 17 '15 at 15:30
@KhurramMajeed: I haven't tested whether it's faster than your original code, but I consider this code to be efficient and NumPythonic. Give it a try and see if it speeds things up. If not, maybe stick with your original. I'm sure you can convert the `sets` back to `SFrame`s at the end if you need to. – John Zwinck Dec 18 '15 at 02:04
