Vectorizing the splitting of a string column into multiple columns in pandas

Question

MRE:

import pandas as pd

df = pd.DataFrame({"title": ["Canada,Chris,Data Scientist", "Korea,Kim,Analyst", "HK,Lai,Scientist"],
                   "R": [0.7, 0.2, 0.3]})

My goal is to vectorize splitting the title column into separate country, name, and job columns.

My current method is:

df["country"] = df["title"].apply(lambda x:x.split(",")[0])
df["name"] = df["title"].apply(lambda x:x.split(",")[1])
df["job"] = df["title"].apply(lambda x:x.split(",")[2])

This successfully outputs:

    title                        R     country  name    job
0   Canada,Chris,Data Scientist  0.7    Canada  Chris   Data Scientist
1   Korea,Kim,Analyst            0.2    Korea   Kim     Analyst
2   HK,Lai,Scientist             0.3    HK      Lai     Scientist

However, this operation is not vectorized.

Usually, a vectorized string operation would look like:

df["title"].str.split(",")

but then I cannot select a single element from each list in the resulting Series.
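For reference, one way to pick a single element out of each list is to index through the `.str` accessor a second time. This is a minimal sketch using the MRE above; the intermediate name `parts` is only illustrative, and whether this counts as "vectorized" in the performance sense is a separate matter.

```python
import pandas as pd

df = pd.DataFrame({"title": ["Canada,Chris,Data Scientist", "Korea,Kim,Analyst", "HK,Lai,Scientist"],
                   "R": [0.7, 0.2, 0.3]})

# Split once; each row of `parts` is a Python list of strings.
parts = df["title"].str.split(",")

# The .str accessor also supports positional indexing on list elements.
df["country"] = parts.str[0]
df["name"] = parts.str.get(1)   # .str.get(i) is equivalent to .str[i]
df["job"] = parts.str[2]

print(df)
```

Rows with fewer than three fields would get a missing value in the corresponding column rather than raising an error.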

`df[["country", "name", "job"]] = df['title'].str.split(',', expand=True)` should do it. — Ch3steR, Oct 12 '21 at 14:20
@Ch3steR This is beautiful! Thank you. I should've looked more into what parameters split method offers, my bad. — haneulkim, Oct 12 '21 at 14:22
Ch3steR's comment would do. However, please remember that string operations under `.str` are not vectorized. — Quang Hoang, Oct 12 '21 at 14:23
What do you mean by `but I cannot select one element for each list in Series.`? — I'mahdi, Oct 12 '21 at 14:26
@QuangHoang You sure it's not a vectorized operation? https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html says that .str.split() is a vectorized string method. — haneulkim, Oct 12 '21 at 14:26
@user1740577 Each row in the Series is a list. I cannot access the first, second, and third elements. — haneulkim, Oct 12 '21 at 14:27
@haneulkim What do you want? Do you still have a question, or is your problem solved? — I'mahdi, Oct 12 '21 at 14:28
In my experience, they are not *vectorized* as in *parallelized*. They may be *vectorized* as in `np.vectorize`. I did try running a `for` loop and comparing it with the string operations, albeit a couple of years ago; I don't think much has changed in this respect. — Quang Hoang, Oct 12 '21 at 14:31
Alright, I just ran a quick test on the concatenation of 1000 copies of your data: `[x.split(',') for x in df.title]` is almost twice as fast as `df.title.str.split(',')`. — Quang Hoang, Oct 12 '21 at 14:36
@QuangHoang You are right, it is indeed faster. I need to look deeper into whether the speed bottleneck is because the operation is not vectorized (parallelized) or due to something else. — haneulkim, Oct 13 '21 at 02:24
There are several questions on SO about this topic. [Here](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care/54028200#54028200) is an example. I can't quite recall the others. — Quang Hoang, Oct 13 '21 at 02:32
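To make the comment thread concrete, here is a rough sketch that applies the `expand=True` suggestion to the MRE and then repeats the kind of timing comparison described above on 1000 concatenated copies of the data. The names `big`, `out_a`, `out_b`, `t_str`, and `t_loop` are illustrative, and the timings will vary with the pandas version, the data, and the machine, so no particular result is assumed here.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"title": ["Canada,Chris,Data Scientist", "Korea,Kim,Analyst", "HK,Lai,Scientist"],
                   "R": [0.7, 0.2, 0.3]})

# Suggestion from the comments: expand=True returns one column per field.
df[["country", "name", "job"]] = df["title"].str.split(",", expand=True)

# Larger frame for timing: 1000 copies of the original three rows.
big = pd.concat([df[["title", "R"]]] * 1000, ignore_index=True)
cols = ["country", "name", "job"]

# Option A: the .str accessor with expand=True.
out_a = big["title"].str.split(",", expand=True)
out_a.columns = cols

# Option B: a plain list comprehension wrapped in a DataFrame.
out_b = pd.DataFrame([x.split(",") for x in big["title"]],
                     columns=cols, index=big.index)

# Time only the split step, as in the comparison from the comments.
t_str = timeit.timeit(lambda: big["title"].str.split(","), number=20)
t_loop = timeit.timeit(lambda: [x.split(",") for x in big["title"]], number=20)

print(f".str.split:         {t_str:.4f} s for 20 runs")
print(f"list comprehension: {t_loop:.4f} s for 20 runs")
```

Both options produce the same three columns; the list comprehension assumes every row has exactly three comma-separated fields and no missing values, whereas `.str.split` tolerates missing values.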
