1

Here's the thing, I have this sort of dataset (let's call it df):

id       text
A1       How was your experience?: Great\nWhat did you buy?: A book\n
B1       How was your experience?: Good\nWhat did you buy?: A pen\n
C2       How was your experience?: Awful\nWhat did you buy?: A pencil\n

As you can see, this is a table containing a survey and I'm trying to get only the answers from the column text. My first tought was to try to split the text, just like this:

df['text_splitted'] = df.text.str.split('\n')

And then I would do something like this:

df['final_text'] = df. text_splitted.str.split(':')

However, final_text is returning NaN. What just happened? Why is the new column returning null? Is there any way I can fix this (or a better way to do what I'm trying to do here)?

dsbr__0
  • 241
  • 1
  • 3
  • 3
    It's because `df['text_splitted']` is not a string but a list (`split`) – Corralien May 17 '21 at 22:59
  • 2
    @Corralien is correct using `.str.split(':')` on a list is going to result in null/NaN. However, to improve this and future questions your data should be an easily __copyable__ piece of code that can be used to build your dataframe easily. See [MRE - Minimal, Reproducible, Example](https://stackoverflow.com/help/minimal-reproducible-example), and [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888). If you include your expected output then you're likely to get not only where you went wrong, but also a working solution. – Henry Ecker May 17 '21 at 23:03

3 Answers3

1

As you wrote you need to split two times your column text. Afterward you can create a dataframe with 3 columns:

  • id from your original dataframe
  • question (even rows) from the previous split
  • answer (odd rows) from the previous split
text = df["text"].str.strip().str.split("\n").explode().str.split(": ").explode()

out = pd.merge(df["id"], pd.DataFrame({"question": text[0::2], "answer": text[1::2]}),
               left_index=True, right_index=True).reset_index(drop=True)

What do you think about this format?

>>> out
   id                  question    answer
0  A1  How was your experience?     Great
1  A1         What did you buy?    A book
2  B1  How was your experience?      Good
3  B1         What did you buy?     A pen
4  C2  How was your experience?     Awful
5  C2         What did you buy?  A pencil
Corralien
  • 109,409
  • 8
  • 28
  • 52
0

You can use a combination of .apply() and .split() to get the answers

df = pd.DataFrame({'text': ['How was your experience?: Great\nWhat did you buy?: A book\n']})

Input DF

    text
0   How was your experience?: Great\nWhat did you ..

Split into questions and answers

df['questions'] = df['text'].apply(lambda x: [y.split(":")[0] for y in x.split("\n")])
df['answers'] = df['text'].apply(lambda x: [y.split(":")[1] for y in x.split("\n") if len(y)>1])

Output DF

    answers              questions
0   [ Great, A book]    [How was your experience?, What did you buy?, ]
Shubham Periwal
  • 2,198
  • 2
  • 8
  • 26
0

You can try this:

df.set_index('id')['text'].str.replace(r'\\n$', '').str.split(r'\\n').explode().str.split(': ', expand=True)

                           0         1
id                                    
A1  How was your experience?     Great
A1         What did you buy?    A book
B1  How was your experience?      Good
B1         What did you buy?     A pen
C2  How was your experience?     Awful
C2         What did you buy?  A pencil
Andreas
  • 8,694
  • 3
  • 14
  • 38