I'm working on a project where my original dataframe is:

      A     B    C   label
0     1     2    2    Nan
1     2     4    5    7
2     3     6    5    Nan
3     4     8    7    Nan
4     5    10    3    8
5     6    12    4    8

However, I have an array with new labels for certain points in the original dataframe (to identify the points I only used columns A and B). Something like this:

X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]

My goal is to add the new labels to the original dataframe. I know that each combination of A and B is unique. What is the fastest way to assign each new label to the correct row?

This is my try:

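y_labeled = np.array(y_labeled).astype('float64')

current_position = 0
for point in X_labeled:
    # select the (unique) row whose A and B match this point
    row = df.loc[(df['A'] == point[0]) & (df['B'] == point[1])]
    # .loc accepts an index object; .at only accepts a single scalar label
    df.loc[row.index, 'label'] = y_labeled[current_position]
    current_position += 1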

Wanted output (rows with index 1 and 2 are changed):

       A     B    C   label
0     1     2    2    Nan
1     2     4    5    5
2     3     6    5    9
3     4     8    7    Nan
4     5    10    3    8
5     6    12    4    8

This may be fine for small datasets, but I'm currently using it for datasets with more than 25,000 labels. Is there a faster way?

Also, in some cases I use all columns except the 'label' column to identify a row. That dataframe consists of 64 columns, so my method is not practical there. Does anyone have an idea how to improve this?

Thanks in advance

yatu
Sarah De Cock

2 Answers

2

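The best solution is to turn your arrays into a dataframe and use df.update():

# build a small dataframe holding the new labels, keyed by A and B
new = pd.DataFrame(X_labeled, columns=['A', 'B'])
new['label'] = y_labeled
new = new.set_index(['A', 'B'])
# index the original dataframe the same way so the rows align on (A, B)
df = df.set_index(['A', 'B'])
# update() works in place and overwrites only where `new` has non-NaN values
df.update(new)
df = df.reset_index()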
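
For reference, here is a self-contained sketch of the same df.update() idea run on the sample data from the question. The key_cols list is a name introduced here for illustration, not something from the question. For the 64-column case it could presumably be built from every column except 'label', as hinted in the comment. This also assumes the key columns have the same dtype in both frames, otherwise the index values may not match.

```python
import numpy as np
import pandas as pd

# Sample data from the question (label is float so it can hold NaN)
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 12],
    "C": [2, 5, 5, 7, 3, 4],
    "label": [np.nan, 7, np.nan, np.nan, 8, 8],
})
X_labeled = [[2, 4], [3, 6]]
y_labeled = [5, 9]

# Hypothetical helper name: the columns that identify a row.
# For the wide case, something like:
#   key_cols = [c for c in df.columns if c != "label"]
key_cols = ["A", "B"]

# New labels as a dataframe keyed by the same columns
new = pd.DataFrame(X_labeled, columns=key_cols)
new["label"] = np.asarray(y_labeled, dtype="float64")

# Align both frames on the key columns, then update in place
df = df.set_index(key_cols)
df.update(new.set_index(key_cols))
df = df.reset_index()

print(df)
```

Because the rows are matched through the index, the order of X_labeled does not have to follow the row order of df.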
Josh Friedlander
1

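Here's a NumPy-based approach aimed at performance. To vectorize this, we need a way to test whether each row of columns A and B appears among the rows of X_labeled. We can view each two-column row as a single element of a 1D array (based on this answer) and then use np.in1d to build a boolean mask that indexes the dataframe and assigns the values in y_labeled. Note that the values are assigned in the row order of df, so y_labeled has to be ordered the way the matching rows appear in df: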

import numpy as np

X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]

a = df.values[:,:2].astype(int) #indexing on A and B

def view_as_1d(a):
    a = np.ascontiguousarray(a)
    return a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[-1])))

ix = np.in1d(view_as_1d(a), view_as_1d(X_labeled))
df.loc[ix, 'label'] = y_labeled

print(df)

   A   B  C label
0  1   2  2   Nan
1  2   4  5     5
2  3   6  5     9
3  4   8  7   Nan
4  5  10  3     8
5  6  12  4     8
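
If the order of X_labeled is not guaranteed to follow the row order of df, a left merge on the key columns is one possible order-independent alternative. This is only a sketch on the sample data: key_cols and new_label are names made up for this example, and it swaps the NumPy view trick for pandas' merge, so it is probably not as fast.

```python
import numpy as np
import pandas as pd

# Sample data from the question (label is float so it can hold NaN)
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 12],
    "C": [2, 5, 5, 7, 3, 4],
    "label": [np.nan, 7, np.nan, np.nan, 8, 8],
})

# Deliberately NOT in the row order of df
X_labeled = [[3, 6], [2, 4]]
y_labeled = [9, 5]

key_cols = ["A", "B"]  # hypothetical name for the identifying columns

# New labels as a dataframe; "new_label" is a temporary column name
new = pd.DataFrame(X_labeled, columns=key_cols)
new["new_label"] = np.asarray(y_labeled, dtype="float64")

# A left merge keeps every row of df in its original order;
# rows without a new label get NaN in "new_label"
merged = df.merge(new, on=key_cols, how="left")

# Prefer the new label where present, otherwise keep the old one.
# to_numpy() avoids index alignment, since merge resets the index.
df["label"] = merged["new_label"].fillna(merged["label"]).to_numpy()

print(df)
```

This relies on the key combinations being unique in the new labels, as stated in the question. Duplicate keys would make the merge return more rows than df has.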
yatu