How can I split the output of str.split() column in pandas?

Question

Here's the thing, I have this sort of dataset (let's call it df):

id       text
A1       How was your experience?: Great\nWhat did you buy?: A book\n
B1       How was your experience?: Good\nWhat did you buy?: A pen\n
C2       How was your experience?: Awful\nWhat did you buy?: A pencil\n

As you can see, this is a table containing a survey and I'm trying to get only the answers from the column text. My first tought was to try to split the text, just like this:

df['text_splitted'] = df.text.str.split('\n')

And then I would do something like this:

df['final_text'] = df. text_splitted.str.split(':')

However, final_text is returning NaN. What just happened? Why is the new column returning null? Is there any way I can fix this (or a better way to do what I'm trying to do here)?

It's because `df['text_splitted']` is not a string but a list (`split`) — Corralien, May 17 '21 at 22:59
@Corralien is correct using `.str.split(':')` on a list is going to result in null/NaN. However, to improve this and future questions your data should be an easily __copyable__ piece of code that can be used to build your dataframe easily. See [MRE - Minimal, Reproducible, Example](https://stackoverflow.com/help/minimal-reproducible-example), and [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888). If you include your expected output then you're likely to get not only where you went wrong, but also a working solution. — Henry Ecker, May 17 '21 at 23:03

score 1 · Answer 1 · answered May 17 '21 at 23:45

As you wrote you need to split two times your column text. Afterward you can create a dataframe with 3 columns:

id from your original dataframe
question (even rows) from the previous split
answer (odd rows) from the previous split

text = df["text"].str.strip().str.split("\n").explode().str.split(": ").explode()

out = pd.merge(df["id"], pd.DataFrame({"question": text[0::2], "answer": text[1::2]}),
               left_index=True, right_index=True).reset_index(drop=True)

What do you think about this format?

>>> out
   id                  question    answer
0  A1  How was your experience?     Great
1  A1         What did you buy?    A book
2  B1  How was your experience?      Good
3  B1         What did you buy?     A pen
4  C2  How was your experience?     Awful
5  C2         What did you buy?  A pencil

score 0 · Answer 2 · answered May 17 '21 at 23:19

You can use a combination of .apply() and .split() to get the answers

df = pd.DataFrame({'text': ['How was your experience?: Great\nWhat did you buy?: A book\n']})

Input DF

    text
0   How was your experience?: Great\nWhat did you ..

Split into questions and answers

df['questions'] = df['text'].apply(lambda x: [y.split(":")[0] for y in x.split("\n")])
df['answers'] = df['text'].apply(lambda x: [y.split(":")[1] for y in x.split("\n") if len(y)>1])

Output DF

    answers              questions
0   [ Great, A book]    [How was your experience?, What did you buy?, ]

Not sure why the answers is returning "IndexError: list index out of range". any idea? — dsbr__0, May 18 '21 at 22:00

score 0 · Answer 3 · answered May 18 '21 at 00:15

You can try this:

df.set_index('id')['text'].str.replace(r'\\n$', '').str.split(r'\\n').explode().str.split(': ', expand=True)

                           0         1
id                                    
A1  How was your experience?     Great
A1         What did you buy?    A book
B1  How was your experience?      Good
B1         What did you buy?     A pen
C2  How was your experience?     Awful
C2         What did you buy?  A pencil

How can I split the output of str.split() column in pandas?

3 Answers3