I'm trying to write a function that partitions a pandas DataFrame into two subsets based on a feature vector.

My DataFrame consists of two columns: an ndarray of length 10000, which is my feature vector, and an integer, which is the label for that vector.

The question just checks whether the element at a given index of the feature vector is >= 1.

I have tried this approach, and it works, but it is way too slow for my use case.

def partition(dataset, question):
    true_rows, false_rows = [], []
    # iterrows() yields (index, row) pairs; row[0] is the feature vector
    for _, row in dataset.iterrows():
        if question.match(row[0]):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return pd.DataFrame.from_dict(true_rows), pd.DataFrame.from_dict(false_rows)

I have found an approach I think might work, but I get the following error when I call g.get_group():

TypeError: unhashable type: 'numpy.ndarray'

np.dot between the feature vector and the question vector should do the same job as match.

def partition(dataset, question):
  df = dataset

  # making a mask dataframe with label True or False
  mask = df.apply(lambda x: np.dot(x[0], question.vector)>= 1)
  df['mask'] = mask

  g = df.groupby('mask')

  true_rows = g.get_group(True)
  false_rows = g.get_group(False)
  return true_rows, false_rows

It seems like this should work if I can just find a way to get the rows in each group.
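The claim that np.dot can stand in for match can be checked on a small example. This is only a sketch: it assumes the question vector is one-hot (1 at the tested index, 0 elsewhere), which the post does not state, and the names index, question_vector and features are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the question: it tests whether features[index] >= 1.
index = 2
question_vector = np.zeros(5)
question_vector[index] = 1.0  # assumed one-hot encoding of the tested index

features = np.array([0.0, 3.0, 1.0, 0.0, 2.0])

# Direct index check, as match is described in the post.
index_check = features[index] >= 1
# Dot product with a one-hot vector picks out the same element.
dot_check = np.dot(features, question_vector) >= 1

print(index_check, dot_check)
```

If the question vector has more than one non-zero entry, the dot product sums several elements and the two checks are no longer guaranteed to agree.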

eskillx
    Welcome to Stack Overflow! Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] your question to include a sample of your input and expected output so that we can better understand what you're trying to do. – G. Anderson Oct 19 '21 at 23:06

1 Answer

OK, I figured it out. For some reason it would not work when my columns had default names (numbers).

df = df.rename(columns={0:'vector', 1:'label'})

I did this to the dataset I was sending in, and it worked.
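For the speed problem in the question, a possible variant is sketched below. It is not the code used above: it swaps groupby for plain boolean indexing, and it assumes the columns are named 'vector' and 'label' as in the rename, that every feature vector has the same length, and that the question can be reduced to a numeric array (passed here as question_vector, a hypothetical stand-in for question.vector).

```python
import numpy as np
import pandas as pd


def partition(dataset, question_vector):
    """Split dataset into rows whose dot product with question_vector is >= 1, and the rest."""
    # Stack the per-row arrays into one 2-D matrix of shape (n_rows, n_features).
    features = np.stack(dataset["vector"].tolist())
    # One matrix-vector product gives a plain boolean mask with one entry per row.
    mask = features @ question_vector >= 1
    return dataset[mask], dataset[~mask]


# Tiny made-up dataset with default (numeric) column names, then renamed.
df = pd.DataFrame({
    0: [np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0]), np.array([3.0, 0.0, 1.0])],
    1: [0, 1, 0],
})
df = df.rename(columns={0: "vector", 1: "label"})

question_vector = np.array([1.0, 0.0, 0.0])  # assumed one-hot question
true_rows, false_rows = partition(df, question_vector)
print(true_rows["label"].tolist())
print(false_rows["label"].tolist())
```

Because the mask is an ordinary boolean array, nothing has to be hashed, and the input DataFrame is not modified by adding a 'mask' column.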

eskillx