MRE:

import pandas as pd

df = pd.DataFrame({"title": ["Canada,Chris,Data Scientist", "Korea,Kim,Analyst", "HK,Lai,Scientist"],
                   "R": [0.7, 0.2, 0.3]})

My goal is to split the title column into separate country, name, and job columns using a vectorized operation.

My current method is:

df["country"] = df["title"].apply(lambda x:x.split(",")[0])
df["name"] = df["title"].apply(lambda x:x.split(",")[1])
df["job"] = df["title"].apply(lambda x:x.split(",")[2])

which successfully outputs

    title                        R     country  name    job
0   Canada,Chris,Data Scientist  0.7    Canada  Chris   Data Scientist
1   Korea,Kim,Analyst            0.2    Korea   Kim     Analyst
2   HK,Lai,Scientist             0.3    HK      Lai     Scientist

However, this operation is not vectorized.

Usually, a vectorized string operation would look like:

df["title"].str.split(",")

but then I cannot select a single element from each list in the resulting Series.
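
For reference, a minimal sketch of element-wise selection on the split result, assuming the `.str` accessor's indexing (`.str[i]`) and `Series.str.get(i)` behave on a Series of lists the way they do on strings. This only illustrates the selection step with the MRE data; it is not a claim about performance.

```python
import pandas as pd

df = pd.DataFrame({"title": ["Canada,Chris,Data Scientist", "Korea,Kim,Analyst", "HK,Lai,Scientist"],
                   "R": [0.7, 0.2, 0.3]})

# Each row of `parts` is a list such as ["Canada", "Chris", "Data Scientist"]
parts = df["title"].str.split(",")

# .str[i] and .str.get(i) both pick the i-th element of every list
df["country"] = parts.str[0]
df["name"] = parts.str.get(1)
df["job"] = parts.str[2]

print(df)
```

Both spellings select by position, so a row with fewer than three fields would yield a missing value in the later columns rather than raising an error.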

haneulkim
  • `df[["country", "name", "job"]] = df['title'].str.split(',', expand=True)` should do it. – Ch3steR Oct 12 '21 at 14:20
  • @Ch3steR This is beautiful! Thank you. I should've looked more into what parameters split method offers, my bad. – haneulkim Oct 12 '21 at 14:22
  • Ch3steR's comment would do. However, please remember that string operations `.str` are not vectorized. – Quang Hoang Oct 12 '21 at 14:23
  • What do you mean by `but I cannot select one element for each list in Series.`? – I'mahdi Oct 12 '21 at 14:26
  • @QuangHoang You sure it's not a vectorized operation? https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html says that .str.split() is a vectorized string method. – haneulkim Oct 12 '21 at 14:26
  • @user1740577 Each row in the Series is a list. I cannot access the first, second, and third elements. – haneulkim Oct 12 '21 at 14:27
  • @haneulkim What do you want? Do you still have a question, or is your problem solved? – I'mahdi Oct 12 '21 at 14:28
  • From my experience, they are not *vectorized* as in *parallelized*. They may be *vectorized* as in `np.vectorize`. I did try running a `for` loop and compared the string operations, albeit a couple of years ago; I don't think much has changed in this respect. – Quang Hoang Oct 12 '21 at 14:31
  • Alright, I just ran a quick test on the concatenation of 1000 copies of your data: `[x.split(',') for x in df.title]` is almost twice as fast as `df.title.str.split(',')`. – Quang Hoang Oct 12 '21 at 14:36
  • @QuangHoang You are right, it is indeed faster. I need to look deeper into whether the speed bottleneck is because the operation is not vectorized (parallelized) or due to something else. – haneulkim Oct 13 '21 at 02:24
  • There are several questions on SO about this topic. [Here](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care/54028200#54028200) is an example. I can't quite recall the others. – Quang Hoang Oct 13 '21 at 02:32
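
The two approaches discussed in the comments can be put side by side in a rough sketch. The names `big`, `cols`, `via_expand`, and `via_loop` are made up for illustration, and the 1000-copy concatenation mirrors the quick test described above. Timings depend on the machine and pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"title": ["Canada,Chris,Data Scientist", "Korea,Kim,Analyst", "HK,Lai,Scientist"],
                   "R": [0.7, 0.2, 0.3]})
cols = ["country", "name", "job"]

# Approach 1: let pandas split and expand into columns in one call
via_expand = df["title"].str.split(",", expand=True)
via_expand.columns = cols

# Approach 2: plain Python list comprehension, then build the frame
via_loop = pd.DataFrame([x.split(",") for x in df["title"]],
                        columns=cols, index=df.index)

# Either result can be assigned back to the original frame
df[cols] = via_expand

# Rough timing on 1000 concatenated copies of the data
big = pd.concat([df[["title"]]] * 1000, ignore_index=True)
t_str = timeit.timeit(lambda: big["title"].str.split(","), number=100)
t_loop = timeit.timeit(lambda: [x.split(",") for x in big["title"]], number=100)
print(f".str.split: {t_str:.4f}s, list comprehension: {t_loop:.4f}s")
```

The list comprehension assumes every row has exactly three comma-separated fields; `expand=True` instead pads shorter rows with missing values.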
