Preamble
From another question I understood that pandas constructor routines avoid copying the provided numpy array only if the data type is the same for all entries. If the constructor is fed a structured numpy array with different types in its columns, it makes a copy.
Implementation reference
import pandas

# Object is the user-defined class described below.
df_dict = {}
for i in range(5):
    obj = Object(1000000)
    arr = obj.getNpArr()  # view on the data owned by obj
    print(arr[:10])
    df_dict[i] = pandas.DataFrame.from_records(arr)

print("The DataFrames are :")
for i in range(5):
    print(df_dict[i].head(10))
In this case, Object(N) constructs an instance of Object, which internally allocates and initializes a 2D array of shape (N,3), with dtypes 'f8','i4','i4' on each row. Object manages the lifetime of this data, deallocating it on destruction. The function Object.getNpArr() returns a np.recarray that points to the internal data and has the dtype mentioned above. Importantly, the returned array does not own the data; it is just a view.
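To make the setup concrete, here is a minimal sketch of a stand-in for Object. The class name FakeObject and the field names "x", "a", "b" are assumptions made for illustration; only the dtypes come from the description above. One difference matters: here NumPy keeps the buffer alive for as long as the view exists, whereas the real Object frees its memory on destruction.

```python
import numpy as np


class FakeObject:
    """Hypothetical stand-in for Object; a NumPy array plays the internal buffer."""

    # Field names are assumptions; the dtypes 'f8', 'i4', 'i4' come from the text.
    DTYPE = np.dtype([("x", "f8"), ("a", "i4"), ("b", "i4")])

    def __init__(self, n):
        self._buf = np.zeros(n, dtype=self.DTYPE)
        self._buf["x"] = np.arange(n, dtype="f8")

    def getNpArr(self):
        # No data is copied: the recarray points at self._buf.
        return self._buf.view(np.recarray)


obj = FakeObject(5)
arr = obj.getNpArr()

print(type(arr).__name__, arr.dtype)
print("owns data:", arr.flags.owndata)
print("shares memory with buffer:", np.shares_memory(arr, obj._buf))

# A write to the internal buffer is visible through the view.
obj._buf["x"][0] = 42.0
print(arr.x[0])
```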
Problem
The DataFrames printed at the end show corrupted data (compared with the arrays printed inside the first loop). I did not expect this behaviour, since the array fed to the pandas construction function is copied (I checked this separately).
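A check of this kind could be sketched per column as follows. The helper shared_columns and the field names are hypothetical, and the result may depend on the pandas version. A False only means no overlap was detected between the source field and the array pandas hands back for that column.

```python
import numpy as np
import pandas as pd


def shared_columns(arr, df):
    """Map each field name to True if the DataFrame column overlaps arr in memory."""
    return {
        name: bool(np.shares_memory(arr[name], df[name].to_numpy()))
        for name in arr.dtype.names
    }


# Field names are assumptions; the dtypes follow the description above.
dtype = np.dtype([("x", "f8"), ("a", "i4"), ("b", "i4")])
arr = np.zeros(1000, dtype=dtype).view(np.recarray)

df = pd.DataFrame.from_records(arr)
report = shared_columns(arr, df)
print(report)
```

Checking every column separately matters because columns of different dtypes may be handled differently inside the DataFrame.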
I do not have many ideas about the cause, or about how to avoid the data corruption. The only guess I can make is:
- the constructor starts allocating the memory for its own data, which takes a long time because of the size, and then copies
- before or during the allocation/copy, the GIL is released and control goes back to the for loop
- the for loop proceeds before the copy of the array is completed, going on to the next iteration
- at the next iteration the name obj is rebound to the new Object and the old memory is deallocated, which corrupts the copy held by the DataFrame from the previous iteration, which is probably still running.
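One way to probe this guess is to snapshot each DataFrame right after it is constructed and compare after the loop. In single-threaded code, from_records only returns once its own work is done. So if a DataFrame differs from its snapshot later, the change happened after construction, which would suggest memory still shared with a destroyed Object rather than an interrupted copy. The sketch below uses a hypothetical get_view in place of Object(n).getNpArr(), with assumed field names. With NumPy-owned buffers nothing is freed, so every comparison is expected to succeed here.

```python
import numpy as np
import pandas as pd

# Field names are assumptions; the dtypes follow the description above.
DTYPE = np.dtype([("x", "f8"), ("a", "i4"), ("b", "i4")])


def get_view(i, n):
    """Hypothetical replacement for Object(n).getNpArr()."""
    buf = np.zeros(n, dtype=DTYPE)
    buf["x"] = i
    return buf.view(np.recarray)


df_dict, snapshots = {}, {}
for i in range(5):
    arr = get_view(i, 1000)
    df_dict[i] = pd.DataFrame.from_records(arr)
    # from_records has returned, so any copying it does is already finished.
    snapshots[i] = df_dict[i].copy(deep=True)

# Compare each DataFrame with the snapshot taken right after construction.
unchanged = {i: df_dict[i].equals(snapshots[i]) for i in range(5)}
print(unchanged)
```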
If this really is the cause of the issue, how can I work around it? Is there a way to hold the GIL until the copy of the array is actually done?
Or, if my guess is wrong, what is the cause of the data corruption?
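One thing that could be tried, assuming the corruption comes from a DataFrame still referencing memory owned by Object, is to hand pandas an array that owns its data. This is a sketch under that assumption, not a confirmed fix. The buffer named internal stands in for Object's memory, and the field names are hypothetical.

```python
import numpy as np
import pandas as pd

# Field names are assumptions; the dtypes follow the description above.
DTYPE = np.dtype([("x", "f8"), ("a", "i4"), ("b", "i4")])

internal = np.zeros(4, dtype=DTYPE)   # stands in for Object's internal memory
internal["x"] = [1.0, 2.0, 3.0, 4.0]
view = internal.view(np.recarray)     # what getNpArr() is described as returning

owned = view.copy()                   # fresh buffer, independent of `internal`
df = pd.DataFrame.from_records(owned)

# Simulate the memory being overwritten after the Object is destroyed.
internal["x"] = -1.0
print(df["x"].tolist())
```

The explicit copy costs extra memory and time for large N, but it decouples the DataFrame from the lifetime of the object that produced the view.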