2

I have the following dataframe:

import pandas as pd
import numpy as np
d={'P':['A[55]','B','C[98]"'],
   'Q':['C[89]','F[98]','K[97]'],
    'S':['B[89]',67,98],
   'id':['a','b','c']
  }
df=pd.DataFrame(data=d)

How to place only data inside [] for selected columns, i.e. "P" and "Q".

ID should be left as such.

Expected output:

P      Q       S         id
55      89   89           a
np.nan  98    np.nan      b
98      97      np.nan    c
panda
  • 615
  • 6
  • 12

2 Answers2

4

You can handle all columns with one str.extract operation if you stack the data first.

df.loc[:,'P':'S'] = (
    df.loc[:,'P':'S'].astype(str).stack().str.extract(r'\[(\d+)\]').unstack()[0])
df

     P   Q    S id
0   55  89   89  a
1  NaN  98  NaN  b
2   98  97  NaN  c

Details
First, convert the data to string and stack it,

df.loc[:,'P':'S'].astype(str).stack()

0  P     A[55]
   Q     C[89]
   S     B[89]
1  P         B
   Q     F[98]
   S        67
2  P    C[98]"
   Q     K[97]
   S        98
dtype: object

You can then extract numbers with str.extract:

df.loc[:,'P':'S'].astype(str).stack().str.extract(r'\[(\d+)\]')

       0
0 P   55
  Q   89
  S   89
1 P  NaN
  Q   98
  S  NaN
2 P   98
  Q   97
  S  NaN 

Then, unstack and assign back.

df.loc[:,'P':'S'].astype(str).stack().str.extract(r'\[(\d+)\]').unstack()[0]

     P   Q    S
0   55  89   89
1  NaN  98  NaN
2   98  97  NaN
cs95
  • 379,657
  • 97
  • 704
  • 746
  • it gives syntax error when number of columns df.loc[:,'P':'Q':'S':'P1':'Q1':'S1':'P2':'P3':'Q2':'Q3':'S2':'S3'] = ( df.loc[:,'P':'Q':'S':'P1':'Q1':'S1':'P2':'P3':'Q2':'Q3':'S2':'S3'].astype(str).stack().str.extract(r'\[(\d+)\]').unstack()[0]) – panda Feb 27 '19 at 07:19
  • @panda Unfortunately, that isn't the correct syntax for selecting multiple columns. I suggest specifying a list of columns to `loc`. – cs95 Feb 27 '19 at 07:22
  • @panda Can you run just `df.loc[:,'P':'S'].astype(str).stack().str.extract(r'\[(\d+)\]').unstack()` and see if you can extract the data you need in the format? You can see the Details section of my answer for an example. – cs95 Feb 27 '19 at 07:30
  • df.loc[:,'P':'S','Q'].astype(str).stack().str.extract(r'\[(\d+)\]').unstack()[0] if I run it with [0] for df it gives indexing error without [0] also the same – panda Feb 27 '19 at 07:34
  • can you help for columns P,Q,S on df – panda Feb 27 '19 at 07:35
  • @panda `df.loc[:,['P', 'Q', 'S']] = ( df.loc[:,['P', 'Q', 'S']].astype(str)....)` – cs95 Feb 27 '19 at 07:45
3

Use extract to fetch digits and change others to NaN. You can iterate over each column, if you have more number of columns:

df.loc[:, 'P'] = df.loc[:, 'P'].str.extract('(\d+)')
df.loc[:, 'Q'] = df.loc[:, 'Q'].str.extract('(\d+)')
df.loc[:, 'S'] = df.loc[:, 'S'].str.extract('(\d+)')
df


+---+-----+----+-----+----+
|   |  P  | Q  |  S  | id |
+---+-----+----+-----+----+
| 0 | 55  | 89 | 89  | a  |
| 1 | NaN | 98 | NaN | b  |
| 2 | 98  | 97 | NaN | c  |
+---+-----+----+-----+----+
meW
  • 3,832
  • 7
  • 27
  • it works instead of three lines of df.loc[:, 'P'] = df.loc[:, 'P'].str.extract('(\d+)') for column values can it be changed to singe line – panda Feb 27 '19 at 07:04