
My question is related to this past question of mine: Split text in cells and create additional rows for the tokens.

Let's suppose that I have the following in a DataFrame in pandas:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing tennis.
3   This is the third document and it looks very good today.

and I want to split the text of each id into tokens of a random number of words (varying between two values, e.g. 1 and 5), so that I end up with something like the following:

id  text
1   I am the
1   first document
1   and I am very
1   happy
2   Here is
2   the second document and it
2   likes playing
2   tennis
3   This is the third
3   document and
3   it looks
3   very good today

Keep in mind that my DataFrame may also have columns other than these two, which should simply be copied to the new DataFrame in the same way as id above.

What is the most efficient way to do this?

Outcast

1 Answer


Define a function to extract chunks in a random fashion using itertools.islice:

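from itertools import islice
import random

import pandas as pd

lo, hi = 3, 5  # min and max words per chunk; change this to whatever

def extract_chunks(it):
    """Consume the iterator `it`, returning chunks of lo..hi words each."""
    chunks = []
    while True:
        # islice takes up to n items; the final chunk may be shorter than lo
        chunk = list(islice(it, random.randint(lo, hi)))
        if not chunk:  # iterator exhausted
            break
        chunks.append(' '.join(chunk))

    return chunks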

Call the function in a list comprehension to keep overhead low, then stack to get your output:

pd.DataFrame(
    [extract_chunks(iter(text.split())) for text in df['text']],
    index=df['id'],
).stack()

id   
1   0                    I am the
    1        first document and I
    2              am very happy.
2   0                 Here is the
    1         second document and
    2    it likes playing tennis.
3   0           This is the third
    1       document and it looks
    2            very good today.

You can extend the extract_chunks function to perform tokenisation. Right now it simply splits on whitespace, which you can modify.
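As a rough sketch of that extension, the variant below takes the raw string plus a tokeniser callable instead of a ready-made iterator. The name `extract_chunks_tokenised`, the `tokenise` and `rng` parameters, and the regex are all assumptions made for illustration; they are not part of the answer's code.

```python
import random
import re
from itertools import islice


def extract_chunks_tokenised(text, lo=1, hi=5, tokenise=None, rng=random):
    """Split `text` into chunks of lo..hi tokens (hypothetical helper)."""
    if lo < 1:
        # A chunk size of 0 would yield an empty chunk and end the loop early.
        raise ValueError("lo must be at least 1")
    if tokenise is None:
        # Assumed tokeniser: runs of word characters, or single punctuation marks.
        def tokenise(s):
            return re.findall(r"\w+|[^\w\s]", s)
    it = iter(tokenise(text))
    chunks = []
    while True:
        chunk = list(islice(it, rng.randint(lo, hi)))
        if not chunk:  # tokens exhausted
            break
        chunks.append(" ".join(chunk))
    return chunks


chunks = extract_chunks_tokenised("I am the first document and I am very happy.")
```

With this assumed regex the trailing full stop becomes its own token, so it can end up in a chunk by itself; swap in a different `tokenise` callable if that is not what you want.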


Note that if you have other columns you want to carry along untouched, you can concatenate them with the chunks and melt:

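u = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']])

# Every column except 'text' is kept as an identifier when melting.
# drop(columns=...) is used because pandas 2.0 no longer accepts a positional axis.
# Rows with fewer chunks are padded with missing values, so drop those afterwards.
(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(id_vars=df.columns.difference(['text']).tolist())
   .dropna(subset=['value']))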
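If melting feels indirect, one alternative (a different technique from the concat/melt approach above) is `DataFrame.explode`, available from pandas 0.25. The sketch below is self-contained; the `lang` column and the `lo`/`hi` keyword arguments on the helper are made up for illustration.

```python
import random
from itertools import islice

import pandas as pd


def extract_chunks(it, lo=1, hi=5):
    """Consume iterator `it`, returning space-joined chunks of lo..hi words."""
    chunks = []
    while True:
        chunk = list(islice(it, random.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(" ".join(chunk))
    return chunks


df = pd.DataFrame({
    "id": [1, 2],
    "text": [
        "I am the first document and I am very happy.",
        "Here is the second document and it likes playing tennis.",
    ],
    "lang": ["en", "en"],  # hypothetical extra column that should be copied
})

# Replace each text with its list of chunks, then give every chunk its own row.
chunk_lists = pd.Series(
    [extract_chunks(iter(text.split())) for text in df["text"]],
    index=df.index,
)
out = df.assign(text=chunk_lists).explode("text").reset_index(drop=True)
```

Every other column is repeated once per chunk, and the column order of the original frame is preserved.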
cs95
  • Thank you, it looks interesting :) (upvote) . By the way, I would like the final output to be a common dataframe with index_reset and two columns: id and text - no multiindex etc. You could fix that for the sake of completeness. – Outcast Jun 07 '19 at 10:44
  • @PoeteMaudit add .reset_index(level=1, name="text", drop=True) – cs95 Jun 07 '19 at 11:59
  • By the way, are you sure about your answer to my last question? I think that the answer is `.reset_index(level=1, drop=True).reset_index(name="text")`. Your answer returns a series which does not have the `text` title anywhere. – Outcast Jun 07 '19 at 13:21
  • @PoeteMaudit the name argument assigns a name to the series and then resets the index so the non-index column gets that name. But looking at it now you probably nailed it better. Sorry, not at my workstation. – cs95 Jun 07 '19 at 13:32
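For reference, the chain discussed in the comments above can be checked with a small self-contained sketch using the sample data from the question. `Series.reset_index` ignores `name` when `drop=True`, so the two-step form is the one that yields a flat frame with `id` and `text` columns. The `lo`/`hi` keyword arguments and the extra `dropna()` are assumptions added here, not part of the original answer.

```python
import random
from itertools import islice

import pandas as pd


def extract_chunks(it, lo=1, hi=5):
    """Consume iterator `it`, returning space-joined chunks of lo..hi words."""
    chunks = []
    while True:
        chunk = list(islice(it, random.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(" ".join(chunk))
    return chunks


df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "I am the first document and I am very happy.",
        "Here is the second document and it likes playing tennis.",
        "This is the third document and it looks very good today.",
    ],
})

stacked = (
    pd.DataFrame(
        [extract_chunks(iter(text.split())) for text in df["text"]],
        index=df["id"],
    )
    .stack()
    .dropna()  # guard: ragged rows are padded, and some pandas versions keep the padding
)

# Drop the chunk-position level first, then turn the id index into a column.
flat = stacked.reset_index(level=1, drop=True).reset_index(name="text")
```

`flat` is an ordinary DataFrame with a default integer index and the two columns `id` and `text`.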