I'm trying to write a function that partitions a pandas DataFrame into two subsets based on a feature vector.
My DataFrame has two columns: an ndarray of length 10000,
which is my feature vector, and an integer that represents the label for that vector.
question just checks whether the value at a given index of the feature vector is >= 1.
I have tried the approach below, and it works, but it is way too slow for my use case.
def partition(dataset, question):
    true_rows, false_rows = [], []
    for row in dataset.iterrows():
        if question.match(row[1][0]):
            true_rows.append(row[1])
        else:
            false_rows.append(row[1])
    return pd.DataFrame.from_dict(true_rows), pd.DataFrame.from_dict(false_rows)
I have found an approach that I think might work, but I get the following error when I call g.get_group():
TypeError: unhashable type: 'numpy.ndarray'
np.dot between the feature vector and the question vector should do the same job as match.
def partition(dataset, question):
    df = dataset
    # Build a mask that labels each row True or False
    mask = df.apply(lambda x: np.dot(x[0], question.vector) >= 1)
    df['mask'] = mask
    g = df.groupby('mask')
    true_rows = g.get_group(True)
    false_rows = g.get_group(False)
    return true_rows, false_rows
It seems like this should work if I can just find a way to get the rows of each group.
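For reference, here is a minimal sketch of the mask idea without groupby, so nothing has to be hashed. It assumes the feature vectors all have the same length and can be stacked into one 2-D array, and that question.vector is a one-hot vector. The Question class, the column names "features" and "label", and the feature_col parameter are hypothetical stand-ins, not my real code. One thing worth keeping in mind: DataFrame.apply works column by column unless axis=1 is passed, so the lambda in the attempt above may not be receiving rows at all.

```python
import numpy as np
import pandas as pd


class Question:
    """Hypothetical stand-in: a one-hot vector selecting one feature index."""

    def __init__(self, index, size):
        self.vector = np.zeros(size)
        self.vector[index] = 1.0


def partition(dataset, question, feature_col="features"):
    # Stack the per-row arrays into one (n_rows, n_features) matrix.
    features = np.stack(dataset[feature_col].tolist())
    # One matrix-vector product replaces the per-row np.dot calls.
    mask = (features @ question.vector) >= 1
    # Boolean indexing selects the rows directly, no groupby needed.
    return dataset[mask], dataset[~mask]


# Toy data: 4 rows with feature vectors of length 5 instead of 10000.
df = pd.DataFrame({
    "features": [np.array([0, 2, 0, 0, 0]),
                 np.array([1, 0, 0, 0, 0]),
                 np.array([0, 1, 3, 0, 0]),
                 np.array([0, 0, 0, 0, 0])],
    "label": [0, 1, 0, 1],
})

# Ask whether the feature at index 1 is >= 1.
true_rows, false_rows = partition(df, Question(index=1, size=5))
print(true_rows["label"].tolist())
print(false_rows["label"].tolist())
```

Since the mask is a plain boolean array, the two subsets come straight from dataset[mask] and dataset[~mask], and the input DataFrame is left unmodified.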