0

I am new to Pandas and looking for some inputs on if there is a better way to achieve the following:

I have potentially millions of records of the form:

>>> s=pd.DataFrame({"col A": pd.Categorical(["typeA", "typeB", "typeC"]), 
... "col B": pd.Series(["a.b/c/d/e", "a:b:c:d:e", "a.b.c.d.e"])})
>>> s
   col A      col B
0  typeA  a.b/c/d/e
1  typeB  a:b:c:d:e
2  typeC  a.b.c.d.e

I need to add column C to the dataframe which will be a.b for typeA, a for typeB and a for type C. Here is what I have now:

>>> def parseColB(s):
...     col_split=re.split('[:,/,.]',s)
...     if len(col_split) < 2:
...             return ""
...     return col_split[0]
... 

I add the new column with the following apply call:

>>> s = s.assign(ColC = s["col B"].apply(parseColB))
>>> s
   col A      col B ColC
0  typeA  a.b/c/d/e    a
1  typeB  a:b:c:d:e    a
2  typeC  a.b.c.d.e    a

Problem with this approach is I get "a" in ColC for typeA rather than "a.b". Is there a way to add the ColC efficiently based on "col A" value?

Tried the suggestion on this being a possible duplicate based on Henry's comment. I have it almost working:

>>> s=pd.DataFrame({"col A": pd.Categorical(["typeA", "typeB", "typeC"]),
...  "col B": pd.Series(["a.b/c/d/e", "a:b:c:d:e", "a.b.c.d.e"])})
>>> s
   col A      col B
0  typeA  a.b/c/d/e
1  typeB  a:b:c:d:e
2  typeC  a.b.c.d.e
>>> choices = [s['col B'].str.split("/"), s['col B'].str.split(":"), s['col B'].str.split(".")]
>>> conditions = [s['col A'] == 'typeA', s['col A'] == 'typeB', s['col A'] == 'typeC']
>>> s['col C'] = np.select(conditions, choices, default="")
>>> s
   col A      col B            col C
0  typeA  a.b/c/d/e   [a.b, c, d, e]
1  typeB  a:b:c:d:e  [a, b, c, d, e]
2  typeC  a.b.c.d.e  [a, b, c, d, e]

Updated choices to use apply and that gives the desired results. Is this the right way or any other optimization is possible?

>>> choices = [s['col B'].str.split("/").apply(lambda x : x[0]), s['col B'].str.split(":").apply(lambda x : x[0]), s['col B'].str.split(".").apply(lambda x : x[0])]
>>> s['col C'] = np.select(conditions, choices, default="")
>>> s
   col A      col B col C
0  typeA  a.b/c/d/e   a.b
1  typeB  a:b:c:d:e     a
2  typeC  a.b.c.d.e     a
Manoj Guglani
  • 134
  • 1
  • 11

1 Answers1

0

You can do it this way.

s["col C"] = s["col B"].str.split('/|:').apply(lambda x: x[0]).apply(lambda x: ''.join([x.split('\.')[0][0] if (len(x)>3) else x]))

Output

     col A     col B    col C
0   typeA   a.b/c/d/e   a.b
1   typeB   a:b:c:d:e   a
2   typeC   a.b.c.d.e   a
moys
  • 7,747
  • 2
  • 11
  • 42
  • this solution does work but I was hoping to do a split based on the "col A" value and not just based on length. I am trying Henry's suggestion but haven't been able to make it work yet. – Manoj Guglani Feb 18 '20 at 04:20