I am running the Pandas/NumPy data manipulation code shown below on a random sample DataFrame:
import pandas as pd
import numpy as np

nrows = 200
# random integer DataFrame with 8 columns
df = pd.DataFrame(np.random.randint(0, 25, size=(nrows, 8)), columns=list('ABCDEFGH'))
array_val = df.values
# pairwise row comparison: entry [i, j] is True if rows i and j match in at least one column
array_obj = ((array_val == array_val[:, None]).any(axis=-1))
print(array_obj.dtype)
print(array_obj.shape)
The code is supposed to return a boolean array of shape (nrows, nrows). So, for example, a DataFrame with nrows of 500 would return a result of shape (500, 500).
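To illustrate the intended output, here is the same comparison on a tiny made-up 3x2 array (just for illustration, not part of my actual data):

import numpy as np
small = np.array([[1, 2],
                  [1, 3],
                  [4, 5]])
# entry [i, j] is True when rows i and j agree in at least one column
print((small == small[:, None]).any(axis=-1))
# [[ True  True False]
#  [ True  True False]
#  [False False  True]]
print((small == small[:, None]).any(axis=-1).shape)  # (3, 3)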
The code runs successfully for lower values of nrows such as 5,000 or 20,000 (you may need more than 16 GB of RAM for nrows above 10,000, though).
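For context, here is my rough back-of-the-envelope memory estimate, assuming the broadcast comparison materialises the full (nrows, nrows, 8) boolean array at 1 byte per element:

nrows = 80_000
n_cols = 8
# intermediate (nrows, nrows, n_cols) bool array from the broadcast comparison
intermediate_gb = nrows * nrows * n_cols / 1e9
# final (nrows, nrows) bool array returned by .any(axis=-1)
result_gb = nrows * nrows / 1e9
print(intermediate_gb, result_gb)  # 51.2 and 6.4 (GB)

That would clearly exceed my RAM, but I'd expect a MemoryError in that case rather than an AttributeError.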
However, I've noticed an issue when I increase nrows above roughly 75,000 to 80,000. This line
array_obj = ((array_val == array_val[:, None]).any(axis=-1))
throws an error:
AttributeError: 'bool' object has no attribute 'any'.
I've already checked whether it might be exceeding the maximum array size, but it looks like 75k rows should be well under the limit (see How Big can a Python List Get?).
If this isn't a memory or data-structure problem, what's the root cause and the appropriate fix?
Edit: I've searched around, and some similar posts mention that the issue can depend on your machine/OS and Pandas/NumPy package versions. I'd be curious to see whether anyone manages to get the sample code running with nrows = 80000 in their environment.
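If you do try it, it would help to post your environment details alongside the result; a quick snippet for that (just the standard version attributes, nothing specific to this problem):

import platform
import sys
import numpy as np
import pandas as pd

print("Python:", sys.version)
print("OS:", platform.platform())
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)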