Match on multiple columns using array

Question

I'm working on a project where my original dataframe is:

      A     B    C   label
0     1     2    2    Nan
1     2     4    5    7
2     3     6    5    Nan
3     4     8    7    Nan
4     5    10    3    8
5     6    12    4    8

However, I have an array with new labels for certain points in the original dataframe (the points are identified using only columns A and B). Something like this:

X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]

My goal is to add the new labels to the original dataframe. I know that each combination of A and B is unique. What is the fastest way to assign each new label to the correct row?

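This is my attempt:

y_labeled = np.array(y_labeled).astype('float64')

current_position = 0
for point in X_labeled:
    # rows where both A and B match the labeled point
    row = df.loc[(df['A'] == point[0]) & (df['B'] == point[1])]
    # .loc accepts an index of row labels; .at only accepts a single scalar label
    df.loc[row.index, 'label'] = y_labeled[current_position]
    current_position += 1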

Desired output (rows with index 1 and 2 are changed):

       A     B    C   label
0     1     2    2    Nan
1     2     4    5    5
2     3     6    5    9
3     4     8    7    Nan
4     5    10    3    8
5     6    12    4    8

For small datasets this may be okay, but I'm currently using it for datasets with more than 25000 labels. Is there a faster way?

Also, in some cases I match on all columns except the column 'label'. That dataframe consists of 64 columns, so my method cannot be used there. Does anyone have an idea how to improve this?

Thanks in advance

No one knows your desired output. – David Smolinski, Apr 18 '20 at 17:35
What's X in your code? – Robert Navado, Apr 18 '20 at 17:36
I have edited the post. Thanks for the comments – Sarah De Cock, Apr 18 '20 at 17:43

score 2 · Answer 1 · answered Apr 18 '20 at 18:19

The best solution is to turn your arrays into a dataframe and use df.update(), which aligns rows on the index:

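# Build a dataframe of the new labels, keyed by the same columns
new = pd.DataFrame(X_labeled, columns=['A', 'B'])
new['label'] = y_labeled
new = new.set_index(['A', 'B'])
# Index the original on the same keys so update() aligns rows by (A, B)
df = df.set_index(['A', 'B'])
df.update(new)  # modifies df in place
df = df.reset_index()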
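
For the case where the match uses many key columns (for example, every column except 'label'), the same idea can be wrapped in a small helper. This is only a sketch: the function name update_labels and its key_cols and label_col parameters are made up for illustration, and it assumes each key combination is unique in the new labels.

```python
import numpy as np
import pandas as pd


def update_labels(df, X_labeled, y_labeled, key_cols, label_col="label"):
    """Hypothetical helper: overwrite label_col for rows whose key_cols match X_labeled."""
    # Dataframe of the new labels, indexed by the key columns
    new = pd.DataFrame(X_labeled, columns=key_cols)
    new[label_col] = y_labeled
    new = new.set_index(key_cols)

    out = df.set_index(key_cols)  # a new object; the caller's df is left as it was
    out.update(new)               # aligns on the key index, not on row order
    # Restore the original column order after moving the keys back to columns
    return out.reset_index()[list(df.columns)]


# Sample data from the question
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 12],
    "C": [2, 5, 5, 7, 3, 4],
    "label": [np.nan, 7, np.nan, np.nan, 8, 8],
})
X_labeled = [[2, 4], [3, 6]]
y_labeled = [5, 9]

# Illustrative choice of keys: every column except C and label, i.e. ['A', 'B']
key_cols = [c for c in df.columns if c not in ("C", "label")]
result = update_labels(df, X_labeled, y_labeled, key_cols)
print(result)
```

With the 64-column dataframe, key_cols would be every column except 'label', and X_labeled would need one value per key column in that same order.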

Josh Friedlander

score 1 · Accepted Answer · answered Apr 18 '20 at 18:10

Here's a NumPy-based approach aimed at performance. To vectorize this, we need a way to check which rows of columns A and B appear in X_labeled. We can view each row of these two columns as a single element of a 1D array (based on this answer), and then use np.in1d to build a boolean mask that indexes the dataframe and assigns the values in y_labeled:

import numpy as np

X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]

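a = df.values[:,:2].astype(int)  # columns A and B as an integer array

def view_as_1d(a):
    # View each row as one opaque (void) element so whole rows can be compared
    a = np.ascontiguousarray(a)
    return a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[-1])))

# Boolean mask: True where the (A, B) row of df appears in X_labeled
ix = np.in1d(view_as_1d(a), view_as_1d(X_labeled))
# Values are written in df's row order, so X_labeled must follow that order
df.loc[ix, 'label'] = y_labeled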

print(df)

   A   B  C label
0  1   2  2   Nan
1  2   4  5     5
2  3   6  5     9
3  4   8  7   Nan
4  5  10  3     8
5  6  12  4     8
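
The mask assignment above hands out y_labeled in the dataframe's row order, which works when X_labeled lists its points in that same order. A possible order-independent variant, sketched here with pandas' MultiIndex.get_indexer instead of the void-view trick, looks up the position of each dataframe row in X_labeled. The names key_cols, lookup, target and pos are illustrative, and unique (A, B) pairs in X_labeled are assumed.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 12],
    "C": [2, 5, 5, 7, 3, 4],
    "label": [np.nan, 7, np.nan, np.nan, 8, 8],
})
# Deliberately listed in a different order than the rows appear in df
X_labeled = [[3, 6], [2, 4]]
y_labeled = [9, 5]

key_cols = ["A", "B"]
# Index built from the labeled points (assumed unique) and from df's key columns
lookup = pd.MultiIndex.from_frame(pd.DataFrame(X_labeled, columns=key_cols))
target = pd.MultiIndex.from_frame(df[key_cols])

# For each row of df: its position in X_labeled, or -1 if it is not there
pos = lookup.get_indexer(target)
matched = pos >= 0

# Pick the label that belongs to each matched row, whatever the input order
df.loc[matched, "label"] = np.asarray(y_labeled, dtype="float64")[pos[matched]]
print(df)
```

The lookup is still vectorized, so it avoids the Python-level loop from the question.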
