I am new to Pandas and looking for some inputs on if there is a better way to achieve the following:
I have potentially millions of records of the form:
>>> s=pd.DataFrame({"col A": pd.Categorical(["typeA", "typeB", "typeC"]),
... "col B": pd.Series(["a.b/c/d/e", "a:b:c:d:e", "a.b.c.d.e"])})
>>> s
col A col B
0 typeA a.b/c/d/e
1 typeB a:b:c:d:e
2 typeC a.b.c.d.e
I need to add column C to the dataframe which will be a.b for typeA, a for typeB and a for type C. Here is what I have now:
>>> def parseColB(s):
... col_split=re.split('[:,/,.]',s)
... if len(col_split) < 2:
... return ""
... return col_split[0]
...
I add the new column with the following apply call:
>>> s = s.assign(ColC = s["col B"].apply(parseColB))
>>> s
col A col B ColC
0 typeA a.b/c/d/e a
1 typeB a:b:c:d:e a
2 typeC a.b.c.d.e a
Problem with this approach is I get "a" in ColC for typeA rather than "a.b". Is there a way to add the ColC efficiently based on "col A" value?
Tried the suggestion on this being a possible duplicate based on Henry's comment. I have it almost working:
>>> s=pd.DataFrame({"col A": pd.Categorical(["typeA", "typeB", "typeC"]),
... "col B": pd.Series(["a.b/c/d/e", "a:b:c:d:e", "a.b.c.d.e"])})
>>> s
col A col B
0 typeA a.b/c/d/e
1 typeB a:b:c:d:e
2 typeC a.b.c.d.e
>>> choices = [s['col B'].str.split("/"), s['col B'].str.split(":"), s['col B'].str.split(".")]
>>> conditions = [s['col A'] == 'typeA', s['col A'] == 'typeB', s['col A'] == 'typeC']
>>> s['col C'] = np.select(conditions, choices, default="")
>>> s
col A col B col C
0 typeA a.b/c/d/e [a.b, c, d, e]
1 typeB a:b:c:d:e [a, b, c, d, e]
2 typeC a.b.c.d.e [a, b, c, d, e]
Updated choices to use apply and that gives the desired results. Is this the right way or any other optimization is possible?
>>> choices = [s['col B'].str.split("/").apply(lambda x : x[0]), s['col B'].str.split(":").apply(lambda x : x[0]), s['col B'].str.split(".").apply(lambda x : x[0])]
>>> s['col C'] = np.select(conditions, choices, default="")
>>> s
col A col B col C
0 typeA a.b/c/d/e a.b
1 typeB a:b:c:d:e a
2 typeC a.b.c.d.e a