
Let's suppose that I have the following in a DataFrame in pandas:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing tennis.
3   This is the third document and it looks very good today.

and I want to split the text of each id into chunks of 3 words, so that I finally have the following:

id  text
1   I am the
1   first document and
1   I am very
1   happy
2   Here is the
2   second document and
2   it likes playing
2   tennis
3   This is the
3   third document and
3   it looks very
3   good today

Keep in mind that my dataframe may also have other columns besides these two, which should simply be copied to the new dataframe in the same way as `id` above.

What is the most efficient way to do this?
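
For reference, here is the example dataframe above as reproducible code (a minimal sketch):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['I am the first document and I am very happy.',
             'Here is the second document and it likes playing tennis.',
             'This is the third document and it looks very good today.']
})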

I reckon that the solution to my question is quite close to the solution given here: Tokenise text and create more rows for each row in dataframe.

This may help too: Python: Split String every n word in smaller Strings.


2 Answers


You can use something like:

def divide_chunks(l, n):
    # yield successive n-sized chunks from list l
    for i in range(0, len(l), n):
        yield l[i:i + n]
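
The `unnesting` helper is not defined in this answer (see the comments below); a minimal sketch of such a helper, assuming every column listed in `explode` holds list-like values, might be:

import pandas as pd

def unnesting(df, explode):
    # repeat each row's index once per element of the first exploded column
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten every exploded column into one long column of matching length
    flat = pd.concat(
        [pd.DataFrame({col: [item for lst in df[col] for item in lst]}) for col in explode],
        axis=1,
    )
    flat.index = idx
    # re-attach the columns that were not exploded
    return flat.join(df.drop(columns=explode), how='left')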

Then using unnesting:

df['text_new']=df.text.apply(lambda x: list(divide_chunks(x.split(),3)))
df_new=unnesting(df,['text_new']).drop('text',1)
df_new.text_new=df_new.text_new.apply(' '.join)
print(df_new)

              text_new  id
0             I am the   1
0   first document and   1
0            I am very   1
0               happy.   1
1          Here is the   2
1  second document and   2
1     it likes playing   2
1              tennis.   2
2          This is the   3
2   third document and   3
2        it looks very   3
2          good today.   3

EDIT: here is a different approach that avoids the `unnesting` helper:

m=(pd.DataFrame(df.text.apply(lambda x: list(divide_chunks(x.split(),3))).values.tolist())
.unstack().sort_index(level=1).apply(' '.join).reset_index(level=1))
m.columns=df.columns
print(m)

   id                 text
0   0             I am the
1   0   first document and
2   0            I am very
3   0               happy.
0   1          Here is the
1   1  second document and
2   1     it likes playing
3   1              tennis.
0   2          This is the
1   2   third document and
2   2        it looks very
3   2          good today.
  • Hey thanks; it looks interesting (upvote). To be honest, I think that there will be a slightly simpler solution, but I may be wrong. By the way, is `unnesting` a function that you can call as you do above? I cannot find it for now. – Outcast May 31 '19 at 13:59
  • Ah ok, so you probably mean the function at the end of the answer at this link. By the way, did you see this link which I posted above: https://stackoverflow.com/a/56290477/9024698? I thought that the answer there could be modified to accommodate my problem above (or not). Also @jezrael seems to be away for now, so I cannot have his magic at my disposal. – Outcast May 31 '19 at 14:07
  • Also, what is the computational complexity of your solution? I think that with a nested `for` loop you could do it too, but perhaps it would be pretty computationally expensive. I also think that we can consider things like this: https://stackoverflow.com/questions/40425033/python-split-string-every-n-word-in-smaller-strings (if it works). – Outcast May 31 '19 at 14:13
  • Ok let's see. To be honest, I have used the solution above (https://stackoverflow.com/questions/56290155/tokenise-text-and-create-more-rows-for-each-row-in-dataframe/56290477#56290477) and it worked, but I did not fully understand it, so I cannot modify it easily to try to solve my current problem. – Outcast May 31 '19 at 14:24
  • @PoeteMaudit I have added another solution (a different approach); you can test which one is faster – anky May 31 '19 at 14:53
  • Cool, thanks, even though the one below is significantly more self-contained and it does not seem to be that slow on my laptop. Yours may well be faster, but still, the one below is not that slow. – Outcast May 31 '19 at 15:22
  • @PoeteMaudit Great :) I think the second approach will be faster than the first. Anyway, you are the best judge – anky May 31 '19 at 15:27
  • Cool, thank you in any case :) (upvoted again after your edit) – Outcast May 31 '19 at 15:28

A self-contained solution, maybe a little slower:

# Split every n words
n = 3

# in case id is not the index yet
df.set_index('id', inplace=True)

new_df = df.text.str.split(' ', expand=True).stack().reset_index()

new_df = (new_df.groupby(['id', new_df.level_1//n])[0]
                .apply(lambda x: ' '.join(x))
                .reset_index(level=1, drop=True)
         )

`new_df` is now a Series:

id
1               I am the
1     first document and
1              I am very
1                 happy.
2            Here is the
2    second document and
2       it likes playing
2                tennis.
3            This is the
3     third document and
3          it looks very
3            good today.
Name: 0, dtype: object
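
If a regular dataframe with `id` and `text` columns is needed instead of a Series (as discussed in the comments), one option, as a sketch:

new_df = new_df.rename('text').reset_index()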
  • Hey thanks (upvote). It certainly looks more self-contained. However, yes, it is also a matter of how computationally expensive it is. – Outcast May 31 '19 at 14:37
  • By the way, what happens if the dataframe has other columns too (which should simply be copied to the new dataframe like the `id` column), as I write in my post? Does it still work? – Outcast May 31 '19 at 14:38
  • It still works if you set all the other columns as the index, and replace the `reset_index(level=1)` with `reset_index(level=-1)`. – Quang Hoang May 31 '19 at 14:40
  • Ok cool let me see (even though I just realised that I do not really need to carry these columns to my new dataframe for now so it is ok). – Outcast May 31 '19 at 14:41
  • Hey, I think that it actually works (!!). Great stuff. However, it returns a series (with the id as the index, I think) and not a dataframe. Can you properly reset the index and create a dataframe? The dataframe should have an index column which is reset, the `id` column and the `text` column. (I do not know if this is relevant, but in my actual data my `id` column is not a number but a string.) – Outcast May 31 '19 at 14:52
  • `reset_index()` on the returned series gives you a dataframe. – Quang Hoang May 31 '19 at 14:53
  • Cool, I know this, but I was just wondering if there was something that could be changed in your code above, for example where you do `.reset_index(level=1, drop=True)`. If not, then doing `reset_index()` at the end is reasonable. – Outcast May 31 '19 at 14:54
  • I don't think so. `reset_index()` does not allow dropping selected levels while keeping others. Chaining is the only option. – Quang Hoang May 31 '19 at 14:56