
I have a pandas DataFrame in which one column contains strings. I want to split that column into an unknown number of columns according to word count.

Suppose I have a DataFrame df:

Index        Text
0          He codes
1          He codes well in python
2          Python is great language
3          Pandas package is very handy 
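
For reproducibility, the frame above could be built as in the sketch below. The column name `Text` is taken from the table; everything else is just one plausible way to construct it.

```python
import pandas as pd

# Sample data copied from the table above; the default RangeIndex gives 0..3.
df = pd.DataFrame(
    {
        "Text": [
            "He codes",
            "He codes well in python",
            "Python is great language",
            "Pandas package is very handy",
        ]
    }
)
```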

Now I want to split the Text column into multiple columns, each containing two words.

Index         0                 1                 2
0          He codes          NaN               NaN
1          He codes          well in           python
2          Python is         great language    NaN
3          Pandas package    is very           handy 

How can I do this in Python? Please help. Thanks in advance.

  • Are you sure that the given example captures what you describe? – RunTheGauntlet Jun 29 '20 at 09:36
  • What do you mean by unknown number of columns? Do you mean `n` columns, i.e. a number of columns that you set yourself? – DaveIdito Jun 29 '20 at 09:37
  • @DaveIdito By the unknown number of columns, I mean that if the longest sentence contains 10 words, the data frame will get 5 new columns. I don't know the maximum number of words a sentence may contain, because I will be scraping web data. – Tushar Agrawal Jun 29 '20 at 09:41

2 Answers


Given a DataFrame df whose Text column holds sentences that need to be split into two-word chunks:

import pandas as pd

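def splitter(s):
    # Split on whitespace, then re-join the words two at a time.
    spl = s.split()
    return [" ".join(spl[i:i+2]) for i in range(0, len(spl), 2)]

# Each row becomes a list of chunks; shorter rows are padded with missing values.
df_new = pd.DataFrame(df["Text"].apply(splitter).to_list())

#                 0               1       2
# 0        He codes            None    None
# 1        He codes         well in  python
# 2       Python is  great language    None
# 3  Pandas package         is very   handy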
mabergerx
  • Thanks for the solution. What changes should I make if I want to change the number of words in each column from 2 to any other number? – Tushar Agrawal Jun 29 '20 at 10:12
  • You would then have to adapt the `splitter` function and include an `n` parameter that is then substituted instead of the 2 in the function. Don't forget to add the argument to the function call later as well :) – mabergerx Jun 29 '20 at 10:15
  • I'd try to avoid `apply`; it might work well for tiny datasets but won't scale. Try to use vectorised solutions in the pandas API. See: [Should I ever use Apply](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code) – Umar.H Jun 29 '20 at 10:16
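
The `n` parameter described in the comments above might look like the sketch below. The names `split_every` and `words_per_col` are hypothetical and not part of the original answer, and a two-row sample frame is rebuilt here only so the block runs on its own.

```python
import pandas as pd

def split_every(s, words_per_col=2):
    """Split a sentence into chunks of `words_per_col` words each."""
    words = s.split()
    return [
        " ".join(words[i:i + words_per_col])
        for i in range(0, len(words), words_per_col)
    ]

# Two rows from the question's example, rebuilt for a self-contained run.
df = pd.DataFrame({"Text": ["He codes", "He codes well in python"]})

# Series.apply forwards extra keyword arguments to the function.
chunks = df["Text"].apply(split_every, words_per_col=3)
df_new = pd.DataFrame(chunks.to_list())
```

With `words_per_col=3`, the second row splits into "He codes well" and "in python", so the result has two columns instead of three.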

IIUC, we can combine `str.split`, `groupby` with `cumcount` and floor division, and `unstack`:

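s = (
    df["Text"]
    .str.split(r"\s+", expand=True)  # raw string avoids an invalid-escape warning
    .stack()                         # long format, one word per row
    .to_frame("words")
    .reset_index(1, drop=True)       # keep only the original row label
)
# Number the words within each original row, then floor-divide by 2
# so that every pair of consecutive words shares one chunk id.
s["count"] = s.groupby(level=0).cumcount() // 2
final = s.rename_axis("idx").groupby(["idx", "count"])["words"].agg(" ".join).unstack(1)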

print(final)

count               0               1       2
idx                                          
0            He codes             NaN     NaN
1            He codes         well in  python
2           Python is  great language     NaN
3      Pandas package         is very   handy
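
To see why the floor division groups words in pairs, it may help to look at the chunk id each word receives. The sketch below works on a single sentence from the example; the variable names are illustrative only, and `position` stands in for what `cumcount()` produces per row in the code above.

```python
import pandas as pd

words = pd.Series("He codes well in python".split())

# Position of each word within the sentence: 0, 1, 2, 3, 4.
position = pd.Series(range(len(words)))

# Floor division by 2 maps positions 0-1 to chunk 0, 2-3 to chunk 1, and 4 to chunk 2.
chunk = position // 2

# Joining the words that share a chunk id rebuilds the two-word cells.
joined = words.groupby(chunk).agg(" ".join)
```

Replacing the divisor 2 with another integer changes how many words land in each chunk.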
Umar.H