I'm trying to make a transition from R to Python. One package that I heavily relied on was the data.table package. I am struggling to replicate this in Py/Pandas or just Python.

Update: included dummy data - thank you @cmaher for suggestion

import pandas as pd
d = {'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']}
df = pd.DataFrame(data=d)
df

# R solution
library(data.table)
library(stringr)

df <- data.table(id = c(1,2,3), x1=c('1_a', '1_b', 'NX'))

df[str_detect(x1, '\\d') & !str_detect(x1, 'NX'), c("x2", "x3") := tstrsplit(x1, "_", fixed=TRUE)][!str_detect(x1, '\\d'), 'x3' := x1]

df
> df
   id  x1 x2 x3
1:  1 1_a  1  a
2:  2 1_b  1  b
3:  3  NX NA NX

# python-pandas attempt
df['x2'], df['x2'] = df['x1'].apply(
    lambda x: df['x1'].str.split('_', 1).str if (df['x1'].str.contains('\\d')) & 
    ~(df['x1'].str.contains('NX')) else df['x1'])
user2340706
  • Please read [how to make a good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Questions such as this one are much more constructive if they include sample data & desired output, rather than just a code chunk to translate. – cmaher Mar 23 '18 at 18:10
  • Do you want string to be separated by underscore or want to extract the number part of the string in x2 and string part in x3? – Vaishali Mar 23 '18 at 20:32
  • Split by underscore, mainly to do what you mentioned: x2 = number and x3 = string. – user2340706 Mar 23 '18 at 20:51

2 Answers

1

So are you looking for something like this?

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']})
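# Split on the first underscore; expand=True returns one column per piece
df[['x2', 'x3']] = df['x1'].str.split('_', n=1, expand=True)
# Rows without an underscore have no second piece, so fall back to the original string
df.loc[df['x3'].isnull(), 'x3'] = df['x1']
# In those same rows x2 just repeats x1, so mark it as missing
df.loc[df['x2'] == df['x1'], 'x2'] = np.nan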
df

out:

   id   x1   x2  x3
0   1  1_a    1   a
1   2  1_b    1   b
2   3   NX  NaN  NX
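
If you would rather keep the two row filters from the data.table call explicit, boolean masks can play the same role. This is only a sketch on the sample data from the question; the names `has_digit`, `has_nx`, `to_split` and `parts` are made up for illustration, and it assumes at least one row contains an underscore (otherwise `parts` would have no second column).

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']})

# Same row filters as the data.table call: "contains a digit" and "contains NX"
has_digit = df['x1'].str.contains(r'\d')
has_nx = df['x1'].str.contains('NX')
to_split = has_digit & ~has_nx

# Split every row once; rows without '_' get a missing second piece
parts = df['x1'].str.split('_', n=1, expand=True)

# Keep the split pieces only where the filter holds, missing elsewhere
df['x2'] = parts[0].where(to_split)
df['x3'] = parts[1].where(to_split)

# Second data.table step: rows without a digit keep x1 as x3
df.loc[~has_digit, 'x3'] = df['x1']

print(df)
```

On the sample data this should give the same three rows as the R result, with a missing value in `x2` for the `NX` row.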
It_is_Chris
  • Sorry, `NA` is R's equivalent of 'missing data'. – user2340706 Mar 23 '18 at 20:20
  • @user2340706 This should work for you. It splits each string in `df['x1']` on `'_'`; `df['x3']` defaults to `df['x1']`, and `df['x2']` is null if there is no `_` on which to split. – It_is_Chris Mar 23 '18 at 22:23
1

As I see in your comments, your intent is to put the numbers in x2 and the strings in x3. Maybe the following code fits your requirements, using the `re` module:

import pandas as pd
import re
d = {'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']}
df = pd.DataFrame(data=d)
print(df)

def findPattern(pattern, string):
    m= re.search(pattern,string)
    if m:
        return m.group()
    else:
        return None

df['x2'] = df.x1.apply(lambda x: findPattern(r"\d+",x)) 
df['x3'] = df.x1.apply(lambda x: findPattern(r"[a-zA-Z]+",x))

print(df)

The output:

   id   x1    x2  x3
0   1  1_a     1   a
1   2  1_b     1   b
2   3   NX  None  NX
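
The same idea can probably be written without a helper function, using the vectorized `Series.str.extract`. This is a minimal sketch on the same sample data, not a drop-in replacement: it reports a missing match as `NaN` rather than `None`.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']})

# With a single capture group and expand=False, str.extract returns a Series;
# rows where the pattern does not match become NaN
df['x2'] = df['x1'].str.extract(r'(\d+)', expand=False)
df['x3'] = df['x1'].str.extract(r'([a-zA-Z]+)', expand=False)

print(df)
```

As with `re.search`, only the first match in each string is kept.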
migjimen