Option 1
str.extractall
-
(df.set_index('Item').Origin
.str.extractall(r'\b(?P<Origin>\w+)\b')
.reset_index()
.drop('match', 1))
Item Origin
0 Weapons US
1 Weapons UK
2 Weapons AUS
3 Drugs US
4 Data Canada
The pattern extracts all words into a separate row -
\b # word boundary
(?P<Origin> # named capture group called "Origin"
\w+ # any string of letters/digits (you can change this to [^\s]+ instead to match punctuation
) # closing 1st capture group
\b # word boundary
Note that you can squeeze out a bit more performance by removing the drop
call and replacing it with a del
call -
i = (df.set_index('Item').Origin
.str.extractall(r'\b(?P<Origin>\w+)\b')
.reset_index())
del i['match']
Option 2
str.split
+ stack
-
(df.set_index('Item').Origin
.str.split(expand=True)
.stack()
.reset_index(level=0, name='Origin'))
Item Origin
0 Weapons US
1 Weapons UK
2 Weapons AUS
0 Drugs US
0 Data Canada
Performance
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
(df.set_index('Item').Origin
.str.extractall(r'\b(?P<Origin>\w+)\b')
.reset_index()
.drop('match', 1))
1 loop, best of 3: 226 ms per loop
%%timeit
(df.set_index('Item').Origin
.str.split(expand=True)
.stack()
.reset_index(level=0, name='Origin'))
10 loops, best of 3: 142 ms per loop
Note that performance on your actual data might differ based on the nature of your data, etc.