2

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.

I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.

I have a dummy example of the dataframe and series in question:

In [6]: df
Out[6]:
   idxA  idxB  var2
0     0     1   2.0
1     0     2   3.0
2     2     4   2.0
3     2     1   1.0
In [8]: labels
Out[8]:
0    A
1    B
2    C
3    D
4    E
Name: label, dtype: object

Currently, I'm converting the series to a dictionary and using replace:

label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)

which does give me the expected result:

In [12]: df
Out[12]:
  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0

However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?

Thanks!

EDIT: I accepted @coldspeed's answer. Couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:

In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Matt Stone
  • 185
  • 1
  • 1
  • 8

1 Answers1

4

You can call map for each column using apply:

df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df

  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0

This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do

cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)

map is faster than replace, depending on the number of columns to replace. Your mileage may vary.

df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)

%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)

6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cs95
  • 379,657
  • 97
  • 704
  • 746
  • 2
    Thanks! That was much more efficient. I actually saw an even more extreme speed-up on my machine of about an order of magnitude. – Matt Stone Jan 29 '19 at 03:44
  • 1
    it's worth noting that in the case where the value to be replaced is not a key in the lookup, `replace` will leave the value to be replaced intact whereas the above method using `map` will insert `NaN`. – quantum285 Dec 20 '22 at 11:51