
Consider the following code:

one, two = sales.random_split(0.5, seed=0)
set_1, set_2 = one.random_split(0.5, seed=0)
set_3, set_4 = two.random_split(0.5, seed=0)

What I am trying to do in this code is randomly split the data in my `sales` SFrame (which is similar to a Pandas DataFrame) into 4 roughly equal parts.

Is there a more Pythonic or efficient way to achieve this?

Khurram Majeed
  • Can you clarify why this isn't Pythonic or efficient as written? One issue I can see is creating a number of folds that isn't a power of two, but that sounds different from what you're asking. – papayawarrior Dec 17 '15 at 20:19

1 Answer

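import numpy as np

# `arr` is assumed to hold the rows of `sales` as a NumPy array
np.random.seed(0)  # fixed seed so the split is reproducible
np.random.shuffle(arr)  # shuffles the rows in place
sets = np.array_split(arr, 4)  # four roughly equal parts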
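If converting the whole dataset is a concern, one variation is to shuffle row indices rather than the rows themselves. The following is only a minimal sketch of that idea: `split_indices` is a hypothetical helper name, and how the resulting index arrays are mapped back onto an SFrame is left open, since that depends on the SFrame API.

```python
import numpy as np


def split_indices(n_rows, n_parts, seed=0):
    """Return n_parts arrays of shuffled row indices of roughly equal size.

    Hypothetical helper for illustration; it only touches indices,
    not the underlying data.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)  # random ordering of 0..n_rows-1
    # array_split tolerates n_rows not being divisible by n_parts
    return np.array_split(shuffled, n_parts)


parts = split_indices(10, 4)
sizes = [len(p) for p in parts]  # part sizes differ by at most one
```

Each row index appears in exactly one part, so the parts are disjoint and together cover the whole dataset.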
John Zwinck
  • Could you explain why this is more efficient? I also see that you are using `NumPy`, which means I would need to convert the `SFrame` into a NumPy array. Won't that conversion add overhead? – Khurram Majeed Dec 17 '15 at 15:30
  • @KhurramMajeed: I haven't tested whether it's faster than your original code, but I consider this code to be efficient and NumPythonic. Give it a try and see if it speeds things up. If not, maybe stick with your original. I'm sure you can convert the `sets` back to `SFrame`s at the end if you need to. – John Zwinck Dec 18 '15 at 02:04
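
On the point raised in the comments about fold counts that are not a power of two: repeated halving can only produce 2, 4, 8, ... parts. One possible alternative, sketched below under the assumption that a per-row label can later be used to filter the original data, is to assign each row a balanced random fold label. `fold_labels` is a hypothetical name, and nothing here relies on any SFrame call.

```python
import numpy as np


def fold_labels(n_rows, n_folds, seed=0):
    """Return one fold label per row, balanced across n_folds.

    Hypothetical helper for illustration: labels 0..n_folds-1 are laid
    out cyclically and then shuffled, so fold sizes differ by at most one.
    """
    rng = np.random.default_rng(seed)
    labels = np.arange(n_rows) % n_folds  # 0, 1, ..., n_folds-1, 0, 1, ...
    return rng.permutation(labels)  # shuffled copy of the labels


labels = fold_labels(10, 3)
counts = np.bincount(labels, minlength=3)  # number of rows in each fold
```

Because only a label array is created, the data itself would not need to be converted up front. Whether this is actually faster than the nested `random_split` calls has not been measured here.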