Hi, I'm trying to create multiple columns in my dataframe based on the multiple lines within each [comment] column cell. The source data is a .csv file.
This is my dataset sample:
+---------+-----------------------------------------+
| id      | comment                                 |
+---------+-----------------------------------------+
| 123ab12 | DATE: 2/3/21 10:23:42 AM CST            |
|         | STAGE: 1                                |
|         | SCORE: 2,321                            |
|         | NAME: Sally                             |
|         | HOBBY: Swimming                         |
|         | NOTES: But she doesn't like: sun, fish  |
+---------+-----------------------------------------+
| 123ab12 | DATE: 4/3/21 8:15:20 AM CST             |
|         | STAGE: 1                                |
|         | SCORE: 500                              |
|         | NAME: Tom                               |
|         | HOBBY: Running                          |
|         | AGE: 26                                 |
|         | NOTES: He needs new pair of sport shoes |
+---------+-----------------------------------------+
This is what I want to get:
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| id      | date                   | stage | score | name  | hobby    | age | notes                            |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 2/3/21 10:23:42 AM CST | 1     | 2,321 | Sally | Swimming |     | But she doesn't like: sun, fish  |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 4/3/21 8:15:20 AM CST  | 1     | 500   | Tom   | Running  | 26  | He needs new pair of sport shoes |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
Note that:
- Some comments may have an additional line for AGE
- A colon (:) may appear twice in the NOTES line of the [comment] column, e.g. NOTES: bla bla bla : further sentence
- ID can be duplicated
- There are different IDs and thousands of rows
My initial thought was to:
- somehow use a regex that treats the line breaks (\n) before NOTES: as column separators (but the AGE line that sometimes appears seems to mess it up, or my brain just isn't working...)
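To make the idea concrete, here is a rough sketch of the kind of parsing I have in mind, in plain Python so it can be tried without the real file. It assumes every line in a comment starts with a capitalised key followed by a colon, and that only the first colon on a line separates the key from the value. The names LINE_RE and parse_comment are made up for illustration.

```python
import re

# Hypothetical pattern: an upper-case key at the start of a line, a colon,
# then the rest of the line as the value (assumption about the data layout).
LINE_RE = re.compile(r"^([A-Z]+):[ \t]*(.*)$", flags=re.MULTILINE)


def parse_comment(comment):
    """Turn one multi-line comment cell into a {column: value} dict."""
    # Only the first colon on each line ends the key, so any further
    # colons inside NOTES stay part of the value.
    return {key.lower(): value.strip() for key, value in LINE_RE.findall(comment)}


# First comment from the sample above (it has no AGE line).
sample = (
    "DATE: 2/3/21 10:23:42 AM CST\n"
    "STAGE: 1\n"
    "SCORE: 2,321\n"
    "NAME: Sally\n"
    "HOBBY: Swimming\n"
    "NOTES: But she doesn't like: sun, fish"
)
parsed = parse_comment(sample)
```

Because each comment becomes a dict keyed by whatever lines it contains, an optional line such as AGE would simply be absent for some rows rather than shifting the other columns. I assume the dicts could then be expanded into columns with something like pd.DataFrame(df["comment"].map(parse_comment).tolist()), but I have not tried that on the real data.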
Your help is much appreciated. Thank you!