
I have some darknet data that looks like the below. The Origin column contains multiple countries in one row.

import pandas as pd

df = pd.DataFrame({'Item': ['Weapons', 'Drugs', 'Data'], 'Origin': ['US UK AUS', 'US', 'Canada']})

I am trying to split the first row into multiple rows, one per country. In this case, three:

df = pd.DataFrame({'Item': ['Weapons', 'Weapons', 'Weapons','Drugs', 'Data'], 'Origin': ['US', 'UK', 'AUS', 'US', 'Canada']})

Is there simple code I could use for this?

If not, how can I make it work?


1 Answer


Option 1
str.extractall -

(df.set_index('Item').Origin
   .str.extractall(r'\b(?P<Origin>\w+)\b')
   .reset_index()
   .drop('match', axis=1))

      Item  Origin
0  Weapons      US
1  Weapons      UK
2  Weapons     AUS
3    Drugs      US
4     Data  Canada

The pattern extracts each word into a separate row -

\b           # word boundary
(?P<Origin>  # named capture group called "Origin"
\w+          # one or more word characters (change this to [^\s]+ to also match punctuation; sketched below)
)            # close the capture group
\b           # word boundary
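
For instance, if your origins could contain punctuation (a hypothetical case; the sample data above has none), a minimal sketch of the [^\s]+ variant would be -

(df.set_index('Item').Origin
   .str.extractall(r'(?P<Origin>[^\s]+)')
   .reset_index()
   .drop('match', axis=1))

This matches runs of non-whitespace, so tokens like "U.S." survive intact (the \b anchors are dropped, since word boundaries don't apply cleanly around punctuation).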

Note that you can squeeze out a bit more performance by removing the drop call and replacing it with a del call -

i = (df.set_index('Item').Origin
       .str.extractall(r'\b(?P<Origin>\w+)\b')
       .reset_index())

del i['match']

Option 2
str.split + stack -

(df.set_index('Item').Origin
  .str.split(expand=True)
  .stack()
  .reset_index(level=0, name='Origin'))

      Item  Origin
0  Weapons      US
1  Weapons      UK
2  Weapons     AUS
0    Drugs      US
0     Data  Canada
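
The leftover index here (0, 1, 2, 0, 0) is the residual level from stack. If you want a clean 0..n index, a small sketch with one more reset chained on -

(df.set_index('Item').Origin
  .str.split(expand=True)
  .stack()
  .reset_index(level=0, name='Origin')
  .reset_index(drop=True))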

Performance

df = pd.concat([df] * 10000, ignore_index=True)

%%timeit 
(df.set_index('Item').Origin
   .str.extractall(r'\b(?P<Origin>\w+)\b')
   .reset_index()
   .drop('match', axis=1))

1 loop, best of 3: 226 ms per loop

%%timeit 
(df.set_index('Item').Origin
  .str.split(expand=True)
  .stack()
  .reset_index(level=0, name='Origin'))

10 loops, best of 3: 142 ms per loop

Note that performance on your actual data may differ depending on its size and makeup.
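
As an aside beyond the two options above: if you're on pandas 0.25 or newer, DataFrame.explode can produce the same result with less ceremony. A sketch, assuming a recent pandas version -

(df.assign(Origin=df['Origin'].str.split())
   .explode('Origin')
   .reset_index(drop=True))

I haven't benchmarked this here, so treat it as a convenience alternative rather than a performance claim.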
