2

Please refer to the below runnable demo code. It is trying to sr.str.extract() the "a" column into multiple columns, and insert those columns into the origianl df immediately after the "a" column.
Please combine steps [1] and [2] in the below code in a better way.

import re
import pandas as pd

df = pd.DataFrame({
     "a" : {1: 'a', 2: 'aa', 3: 'aaa'} ,
     "b" : {1: 'b', 2: 'bb', 3: 'bbb'} ,
     "c" : {1: 'b', 2: 'bb', 3: 'bbb'} ,
})

"""
df==
+----+-----+-----+-----+
|    | a   | b   | c   |
|----+-----+-----+-----|
|  1 | a   | b   | b   |
|  2 | aa  | bb  | bb  |
|  3 | aaa | bbb | bbb |
+----+-----+-----+-----+
"""
# step [1] sr.str.extract
rex = re.compile(r'(?P<firstletter>\w) (?P<secondletter>\w+)', re.X)
cols = df['a'].str.extract(rex)

# step [2] insert extracted columns back into the original df
df['firstletter'], df['secondletter'] = 0, 0
df['firstletter'] = cols['firstletter']
df['secondletter'] = cols['secondletter']
df = df['a firstletter secondletter b c'.split()]

"""
# Or, a more concise step [2], but too hard to glance thru and remember, also prone to mistake:
for col in cols.columns[::-1]:
    df.insert(df.columns.get_loc('a')+1, col, cols[col])
"""

# result:
"""
df==
+----+-----+---------------+----------------+-----+-----+
|    | a   | firstletter   | secondletter   | b   | c   |
|----+-----+---------------+----------------+-----+-----|
|  1 | a   | nan           | nan            | b   | b   |
|  2 | aa  | a             | a              | bb  | bb  |
|  3 | aaa | a             | aa             | bbb | bbb |
+----+-----+---------------+----------------+-----+-----+
"""
SeaBean
  • 22,547
  • 3
  • 13
  • 25
eliu
  • 2,390
  • 1
  • 17
  • 29
  • I am hoping for a `inplace` operation that looks like: `df.str.extract(col='a', rex, inplace=True)`, which will default insert location after col `'a'`. – eliu May 29 '21 at 19:20
  • See my edited solutions with 2 options below. – SeaBean May 29 '21 at 20:18
  • Note that `inplace` solution is generally considered a bad practice and is of high chance it will be depreciated in future Pandas version. See [this post](https://stackoverflow.com/a/60020384/15070697) and also [this article](https://www.dataschool.io/future-of-pandas/#inplace) – SeaBean May 29 '21 at 20:26

2 Answers2

2

There are 2 possible solutions:

Solution 1:

This solution would most resemble your ideal solution while recommend you make some slight (but constructive) change to the naming of the extracted columns. E.g. instead of firstletter and secondletter, for the extraction of column a, we name it with prefix a_ as a_firstletter and a_secondletter. Then we can use one statement as follows:

import re

df = df.assign(**df['a'].str.extract(r'(?P<a_firstletter>\w) (?P<a_secondletter>\w+)', re.X)).sort_index(axis=1)

Result:

print(df)


     a a_firstletter a_secondletter    b    c
1    a           NaN            NaN    b    b
2   aa             a              a   bb   bb
3  aaa             a             aa  bbb  bbb

Solution 2:

This solution you can keep on using firstletter and secondletter as the extracted column names.

You can split the columns into 2 portions by .iloc : left portion df_left from first column till column a and right portion df_right from the columns after column a till the end. Then concat the left portion df_left, the newly extracted columns cols, and right portion df_right together along the columns by pd.concat(), as follows:

df_left = df.iloc[:, 0: df.columns.get_loc('a')+1]
df_right = df.iloc[:, df.columns.get_loc('a')+1:]

df = pd.concat([df_left, cols, df_right], axis=1)

Result:

print(df)


     a firstletter secondletter    b    c
1    a         NaN          NaN    b    b
2   aa           a            a   bb   bb
3  aaa           a           aa  bbb  bbb
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • solution 1's syntax is brand new for me, very nice one. I guess the rearranging of columns will always require some extra effort. – eliu May 29 '21 at 20:27
  • @eliu Yes, it's quite tricky to use `df.assign(**df...)` with `**` for unpacking the columns. Also the column re-arranging here is by `.sort_index()` to sort the 2 extracted columns back together with the column for extraction. – SeaBean May 29 '21 at 20:29
  • couldn't really use `.sort_index()` in the real world, the other columns will be rearranged. But the `assign` is really nice – eliu May 29 '21 at 20:32
  • @eliu Since you have already maken good use of regex with `Named Capturing Group` so that the extracted columns by `.str.extract()` already carries the column names, you don't need to explicitly code the 2 column names again in the subsequent codes. – SeaBean May 29 '21 at 20:32
  • @eliu If you can't use `.sort_index()` to sort the column names, then probably you have to use my 2nd solution then. As you require specific locations of the extracted columns, I am afraid this would be the last resort if you don't want to use `.insert()` – SeaBean May 29 '21 at 20:37
1

I believe you need.

import re

df = pd.DataFrame({
     "a" : {1: 'a', 2: 'aa', 3: 'aaa'} ,
     "b" : {1: 'b', 2: 'bb', 3: 'bbb'} ,
     "c" : {1: 'b', 2: 'bb', 3: 'bbb'} ,
})

df[['firstletter','secondletter']] = df['a'].str.extract(r"(\w)(\w+)", expand=True)
print(df)

Output:

     a    b    c firstletter secondletter
1    a    b    b         NaN          NaN
2   aa   bb   bb           a            a
3  aaa  bbb  bbb           a           aa
Rakesh
  • 81,458
  • 17
  • 76
  • 113
  • 1
    I does get around the problem of typing 'firstletter' 'secondletter' twice, both in regex and in the code. but it still just a bit short on putting the columns at the desired location. – eliu May 29 '21 at 19:31