Preamble

From another question, I understood that the pandas constructor routines avoid copying the provided numpy array only if all entries of the array share the same data type. If the constructor is fed a structured numpy array whose columns have different types, it makes a copy.

Implementation reference

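import pandas

# Object is the user-defined class described below.
df_dict = {}
for i in range(5):
    obj = Object(1000000)
    arr = obj.getNpArr()  # non-owning view on obj's internal data
    print(arr[:10])
    df_dict[i] = pandas.DataFrame.from_records(arr)
print("The DataFrames are :")
for i in range(5):
    print(df_dict[i].head(10))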

In this case, Object(N) constructs an instance of Object that internally allocates and initializes a 2D array of shape (N,3), where each row has dtypes 'f8', 'i4', 'i4'. Object manages the lifetime of this data and deallocates it on destruction. The function Object.getNpArr() returns a np.recarray that points to the internal data and has the above-mentioned dtype. Importantly, the returned array does not own the data; it is just a view.
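To make the "view that does not own its data" part concrete, here is a minimal sketch. It is not the real Object class: the field names x, a, b are placeholders, and a numpy-owned array stands in for the memory that Object allocates internally.

```python
import numpy as np

# Record layout from the description: one float64 and two int32 fields.
# The field names are placeholders, not taken from the original code.
rec_dtype = np.dtype([("x", "f8"), ("a", "i4"), ("b", "i4")])

# Stand-in for the memory that Object would allocate and manage.
backing = np.zeros(5, dtype=rec_dtype)

# A record-array view on the same memory; no data is copied here.
view = backing.view(np.recarray)

print(view.flags.owndata)                # the view does not own its data
print(np.shares_memory(view, backing))   # it aliases the backing storage

# A write to the backing storage is visible through the view.
backing["x"][0] = 42.0
print(view.x[0])
```

With the real Object, the backing memory is freed on destruction, so any view like this becomes dangling once the owner is gone.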

Problem

The DataFrames printed at the end show corrupted data (compared with the arrays printed inside the first loop). I do not expect this behaviour, since the array fed to the pandas construction function is copied (I checked this separately).
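For reference, a check of this kind could look like the sketch below. It assumes a small numpy-owned structured array with placeholder field names instead of the real Object data, and the outcome may depend on the pandas version, so no expected output is stated.

```python
import numpy as np
import pandas as pd

# Placeholder field names; only the dtypes come from the description.
rec_dtype = np.dtype([("x", "f8"), ("a", "i4"), ("b", "i4")])
arr = np.zeros(4, dtype=rec_dtype).view(np.recarray)
arr.x[:] = [0.5, 1.5, 2.5, 3.5]

df = pd.DataFrame.from_records(arr)

# Check 1: does any column of the DataFrame alias the source array?
shared = {name: bool(np.shares_memory(df[name].to_numpy(), arr))
          for name in df.columns}
print(shared)

# Check 2: does overwriting the source change the DataFrame afterwards?
snapshot = df.copy(deep=True)
arr.x[:] = -1.0
unchanged = bool(df.equals(snapshot))
print(unchanged)
```

If every entry of shared is False and unchanged is True, the DataFrame holds its own copy of the data for this input.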

I do not have many ideas about the cause, or about how to avoid the data corruption. My only guess is:

  • the constructor starts allocating memory for its own data, which takes a long time because of the large size, and then copies the array
  • before or during the allocation/copy, the GIL is released and control returns to the for loop
  • the for loop proceeds to the next iteration before the copy of the array is complete
  • at the next iteration, the name obj is rebound to the new Object and the old memory is deallocated, which corrupts the DataFrame copy from the previous iteration, which is probably still in progress.

If this is really the cause of the issue, how can I work around it? Is there a way to have the GIL released only once the copy of the array is actually complete?

Or, if my guess is wrong, what is the cause of the data corruption?
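Whatever the cause turns out to be, one defensive pattern I can think of is to take an explicit copy of the view and to keep every owner alive as long as its DataFrame is in use. The sketch below is only an illustration: FakeObject is a hypothetical stand-in for Object (it uses numpy-owned memory, so it cannot reproduce the corruption), and the field names are placeholders.

```python
import numpy as np
import pandas as pd


class FakeObject:
    """Hypothetical stand-in for Object; not the real implementation."""

    def __init__(self, n):
        self._data = np.zeros(n, dtype=[("x", "f8"), ("a", "i4"), ("b", "i4")])
        self._data["x"] = np.arange(n)

    def getNpArr(self):
        # Return a record-array view on the internal data (no copy).
        return self._data.view(np.recarray)


df_dict = {}
owners = []  # references that keep each owner (and its memory) alive
for i in range(5):
    obj = FakeObject(1000)
    owners.append(obj)
    arr = obj.getNpArr()
    # Explicit copy: the resulting array owns its memory.
    owned = np.array(arr, copy=True)
    df_dict[i] = pd.DataFrame.from_records(owned)

print(df_dict[0].head(3))
```

The explicit copy costs extra memory for a large array, and the owners list may be unnecessary if the copy alone is enough; both are included here only to rule out lifetime problems.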

dariobaron