
I am running the Pandas/NumPy data manipulation code below on a random sample DataFrame:

import pandas as pd
import numpy as np

nrows = 200
df = pd.DataFrame(np.random.randint(0, 25, size=(nrows, 8)), columns=list('ABCDEFGH'))
array_val = df.values
# array_obj[i, j] is True if rows i and j share a value in at least one column
array_obj = ((array_val == array_val[:,None]).any(axis=-1))
print(array_obj.dtype)
print(array_obj.shape)

The code is supposed to return an array with shape (nrows, nrows). So, for example, a DataFrame with nrows of 500 would return a result with shape (500, 500).

The code runs successfully for lower values of nrows such as 5,000 or 20,000. (You may need more than 16 GB of RAM to run the logic for nrows above 10,000, though.)
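
For context, here is a rough back-of-the-envelope estimate of what the comparison has to hold in memory, assuming NumPy materialises the full broadcast intermediate at one byte per boolean:

# array_val has shape (nrows, 8) and array_val[:,None] has shape (nrows, 1, 8),
# so the == broadcasts to a (nrows, nrows, 8) boolean intermediate before
# .any(axis=-1) reduces it to (nrows, nrows).
for nrows in (200, 20_000, 80_000):
    intermediate_gb = nrows * nrows * 8 / 1e9   # one byte per np.bool_
    result_gb = nrows * nrows / 1e9             # final (nrows, nrows) bool array
    print(f"nrows={nrows:>6}: intermediate ~{intermediate_gb:.1f} GB, result ~{result_gb:.1f} GB")

At 80,000 rows the intermediate alone works out to roughly 50 GB, before counting the ~6.4 GB result.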

However, I've noticed an issue once I increase nrows above roughly 75,000–80,000. This line

array_obj = ((array_val == array_val[:,None]).any(axis=-1))

throws an error:

AttributeError: 'bool' object has no attribute 'any'.

I've already checked whether it may be exceeding the maximum array size, but 75k rows should be well under the limit discussed in "How Big can a Python List Get?".

If this isn't a memory/data-structure problem, what's the root cause, and what is the appropriate fix?

Edit: I've searched around, and some similar posts mention that the issue depends on your machine/OS and your Pandas/NumPy package versions. I'd be curious to see if anyone manages to get the sample code running for nrows = 80000 on their environment.

bigchungus
  • At some point `(array_val == array_val[:,None])` must evaluate to either `True` or `False` when you are expecting it to be a NumPy ndarray or pandas Series. – wwii Sep 15 '22 at 22:34
  • @wwii afaik the code I provided runs as-is. You can quickly verify it on an online environment like https://www.online-python.com/. However, the logic throws an error if you increase nrows = 200 to a higher value like nrows = 80000, assuming your machine has enough RAM. – bigchungus Sep 15 '22 at 22:39
  • What happens with just one column? – wwii Sep 15 '22 at 22:50
  • Using one or two columns does appear to work for 80000 rows. But if it's a memory issue, I'd expect Python to throw a more specific exception about that. – bigchungus Sep 15 '22 at 22:58
  • If you compare 2 arrays that differ in shape, you will get a `False`, and in recent enough versions a `DeprecationWarning`. But if they match in length you get a boolean array of the same length. I can't think of a case, though, where this `outer` type of comparison would produce the single `False` (a small repro of the scalar fallback is sketched after these comments). – hpaulj Sep 15 '22 at 23:50
  • No problem running with nrows=80000, although it did take well over a minute. Python needed >50GB to run this. Increasing nrows much past the point it consumed all my RAM caused it to throw the same `AttributeError` (on my 128GB machine that was at about nrows=160000). python 3.10.11, pandas 1.5.3, numpy 1.24.3. Presumably it is therefore somehow memory-related. – fantabolous Jun 02 '23 at 06:05
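
To illustrate the fallback hpaulj describes, here is a minimal repro (behaviour depends on the NumPy version: older releases return a plain Python bool plus a DeprecationWarning when an elementwise comparison fails, while newer ones raise an error instead). Presumably the same fallback kicks in above when the huge broadcast intermediate can't be allocated:

import numpy as np

a = np.arange(3)
b = np.arange(4)

result = a == b   # (3,) vs (4,) can't broadcast, so the elementwise comparison fails
print(type(result), result)
# On older NumPy this prints <class 'bool'> False (with a DeprecationWarning);
# calling result.any() then raises
# AttributeError: 'bool' object has no attribute 'any'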

1 Answer


I ended up getting the code to run without an error at 75k/80k rows by using another environment with a different Pandas/NumPy version, though I'm still not sure why the issue is tied to the package version.
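
If switching environments isn't an option, one workaround is to build the (nrows, nrows) result in row chunks, so that only a (chunk, nrows, 8) boolean intermediate exists at any one time instead of the full (nrows, nrows, 8) one. This is just a sketch (the helper name and chunk_size are illustrative, and the final boolean result still needs ~6.4 GB on its own at 80k rows):

import numpy as np
import pandas as pd

def rowwise_any_equal(values, chunk_size=2000):
    """result[i, j] is True if rows i and j share a value in at least one column."""
    nrows = values.shape[0]
    out = np.empty((nrows, nrows), dtype=bool)
    for start in range(0, nrows, chunk_size):
        stop = min(start + chunk_size, nrows)
        # Intermediate here is (stop - start, nrows, ncols) instead of (nrows, nrows, ncols)
        out[start:stop] = (values[start:stop, None, :] == values[None, :, :]).any(axis=-1)
    return out

nrows = 200
df = pd.DataFrame(np.random.randint(0, 25, size=(nrows, 8)), columns=list('ABCDEFGH'))
array_obj = rowwise_any_equal(df.values)
print(array_obj.dtype, array_obj.shape)   # bool (200, 200)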

bigchungus