
I have a pandas DataFrame of shape (7761940, 16). I converted it into a list of 7762 NumPy arrays using np.array_split, each array of shape (1000, 16).

Now I need to take a slice of the first 50 rows from each array and create a new array of shape (388100, 16) from them. The number 388100 comes from 7762 arrays multiplied by 50 rows.

I know this is some sort of slicing and indexing, but I could not manage it.

  • List comprehension: `np.vstack([arr[:50,:] for arr in split_list])` should work. Alternatively, reshape the original array to (7762, 1000, 16), then slice with `[:, :50, :]` and reshape back to 2-D with (-1, 16). – hpaulj Dec 17 '19 at 21:21
  • Not all your arrays will have 1000 rows. You may be better off padding. – Mad Physicist Dec 24 '19 at 21:51
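
For reference, a minimal runnable sketch of both suggestions from hpaulj's comment, using a toy array whose length is an exact multiple of the chunk size (the real 7761940-row data is not, which is what the padding remark above is about); all names and sizes here are illustrative:

import numpy as np

x = np.arange(8000 * 16, dtype=float).reshape(8000, 16)  # toy stand-in for the data

# Option 1: split into chunks, take the first 50 rows of each, stack.
split_list = np.array_split(x, 8)
out1 = np.vstack([arr[:50, :] for arr in split_list])

# Option 2: reshape to 3-D, slice the middle axis, flatten back to 2-D.
out2 = x.reshape(8, 1000, 16)[:, :50, :].reshape(-1, 16)

assert np.array_equal(out1, out2)  # both have shape (400, 16)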

3 Answers

Score: 1

If you split the array, you waste memory. If you pad the array to allow a clean reshape, you waste memory. This is not a huge problem, but it can be avoided. One way is to use the arcane np.lib.stride_tricks.as_strided function. This function is dangerous, and we break some of its rules here, but as long as you only want the first 50 rows of each chunk, and the last chunk has at least 50 rows, everything will be fine:

x = ...  # your data as a (7761940, 16) numpy array
chunks = int(np.ceil(x.shape[0] / 1000))  # 7762
# For a C-contiguous (row-major) array, the row stride is the largest stride,
# so stepping 1000 rows between chunks means a first stride of 1000 * x.strides[0].
view = np.lib.stride_tricks.as_strided(
    x, shape=(chunks, 1000, x.shape[-1]),
    strides=(max(x.strides) * 1000, *x.strides))

This will create a view of shape (7762, 1000, 16) into the original memory, without making a copy. Since your original array does not have a multiple of 1000 rows, the last plane will have some memory that doesn't belong to you. As long as you don't try to access it, it won't hurt you.

Now accessing the first 50 elements of each plane is trivial:

data = view[:, :50, :]

You can then merge the first two dimensions to get the final result:

data.reshape(-1, x.shape[-1])

Since the sliced view is not contiguous, this final reshape does make a copy, but only of the (388100, 16) result, never of the full original array.

A much healthier way would be to pad and reshape the original.
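
For reference, a minimal sketch of that pad-and-reshape route on a toy array (names and sizes are illustrative; this matches the question's data because its last partial chunk, 940 rows, has at least 50 real rows):

import numpy as np

# Toy stand-in: 1270 rows with chunk size 100, so the last chunk holds 70
# real rows -- enough that the first 50 rows of every chunk are genuine data.
x = np.arange(1270 * 16, dtype=float).reshape(1270, 16)
chunk = 100

pad = -x.shape[0] % chunk  # zero rows needed to complete the last chunk (30 here)
padded = np.pad(x, ((0, pad), (0, 0)))  # pads with zero rows by default
result = padded.reshape(-1, chunk, x.shape[1])[:, :50, :].reshape(-1, x.shape[1])
print(result.shape)  # (650, 16): 13 chunks * 50 rows each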

Mad Physicist
  • I really appreciate your reply; saving memory has a vital impact on the process. Great thanks @Mad Physicist. – saad shaban Dec 25 '19 at 05:55
  • @saad. You should test my answer since I didn't have access to a desktop when I wrote it. Feel free to select it if it works. – Mad Physicist Dec 25 '19 at 09:16
Score: 0

After benefiting from the comments and some research, I came up with a solution:

my_data = np.array_split(dataframe, 7762)  # split the dataframe into a list of
                                           # 7762 ndarrays, each about 1000 x 16
my_list = []                               # define a new list object
for i in range(0, 7762):                   # loop over the 7762 ndarrays
    my_list.append(my_data[i][0:50, :])    # append the first 50 rows of each ndarray
  • You can iterate directly over `my_data`, and therefore reduce the last three lines to a list comprehension. This won't create your array for you, and isn't very efficient. – Mad Physicist Dec 24 '19 at 21:54
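
A sketch of what that comment suggests, which also stacks the pieces into the final array (the small DataFrame below is a stand-in for the real (7761940, 16) one):

import numpy as np
import pandas as pd

dataframe = pd.DataFrame(np.random.rand(7762, 16))  # toy stand-in for the real data

chunks = np.array_split(dataframe.to_numpy(), 8)  # 7762 in the real case
result = np.vstack([c[:50] for c in chunks])      # first 50 rows of each chunk
print(result.shape)                               # (400, 16)
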
Score: -1

You can do something like this:

  1. Split the data of shape (7761940 x 16) into a list of 7762 chunks, each roughly (1000 x 16):

    data_first_split = np.array_split(data, 7762)

  2. Slice the data to (7762 x 50 x 16) by taking the first 50 rows of each chunk. Since data_first_split is a Python list, it cannot be sliced along three axes directly; stack the slices instead:

    data_second_split = np.stack([chunk[:50] for chunk in data_first_split])

  3. Reshape to get (388100 x 16):

    data_final = np.reshape(data_second_split, (7762 * 50, 16))
    

As @hpaulj mentioned, you can also do it using np.vstack. IMO you should also give NumPy strides a look.

Rick M.