0

I have a python pandas dataframe with multiple columns. One of the columns has a value that matters. Which column matters changes for each row. So I decided to add a column that gives the value that matters. By using numpy.digitize it is fast to find which column to use for me. I store the array in a new column, let's call it 'assigned columns column'. But I struggle to find a time efficient way to store the actual value into a new column.
I tried

data['name'] = data.apply(lambda row: row[row['assigned columns column']], axis=1)

but it is very slow with my number of rows.

value_that_matters = [data.loc[i,strip] for i, strip in enumerate(expected_strips) ]  
data['name'] = value_that_matters

was much quicker with expected_strips being the numpy array returned by numpy.digtize and content of the 'assigned columns column'. But there must be a quicker, better way to grab specific column's value for each row... Please tell me :)

Ferris
  • 5,325
  • 1
  • 14
  • 23
Alice
  • 9
  • 2

1 Answers1

0

you can create a multi_index map obj_map to get the values.

df_str = '''
   a  b  c  d target
0  1  2  3  4      a
1  1  2  3  4      b
2  1  2  3  4      c
3  1  2  3  4      d
4  1  2  3  4      a
5  1  2  3  4      c
6  1  2  3  4      a
7  1  2  3  4      b
'''
df = pd.read_csv(io.StringIO(df_str.strip()), sep='\s+')

cols = ['a', 'b', 'c', 'd']
obj_map = df[cols].stack()
idx = df.set_index('target', append=True).index
# obj_map.loc[idx]

df['name'] = obj_map.loc[idx].tolist()

df

   a  b  c  d target  name
0  1  2  3  4      a     1
1  1  2  3  4      b     2
2  1  2  3  4      c     3
3  1  2  3  4      d     4
4  1  2  3  4      a     1
5  1  2  3  4      c     3
6  1  2  3  4      a     1
7  1  2  3  4      b     2
Ferris
  • 5,325
  • 1
  • 14
  • 23
  • Thank you very much ! I had an issue of having lots of NaN in the columns which apparently leads to the values being dropped and hence creates a KeyError but once replaced it worked fast and smoothly. Very educational for a newbie like me to get into multi-indexing. – Alice Mar 03 '21 at 09:48
  • if you have "NaN" values, then instead of using `.loc[idx]`, you can consider using pandas `join` or `merge`, see https://stackoverflow.com/questions/53645882/pandas-merging-101 – Ferris Mar 03 '21 at 09:53