2

Hi I'm trying to create multiple columns in my dataframe based on multiple lines within the [comment] column cell. The source data is a .csv file

This is my dataset sample

+---------+-----------------------------------------+
| id      | comment                                 | 
+---------+-----------------------------------------+
| 123ab12 | DATE: 2/3/21 10:23:42 AM CST            |  
|         | STAGE: 1                                | 
|         | SCORE: 2,321                            |  
|         | NAME: Sally                             |  
|         | HOBBY: Swimming                         |  
|         | NOTES: But she doesn't like: sun, fish  |
+---------+-----------------------------------------+
| 123ab12 | DATE: 4/3/21 8:15:20 AM CST             |  
|         | STAGE: 1                                |  
|         | SCORE: 500                              |  
|         | NAME: Tom                               |  
|         | HOBBY: Running                          |  
|         | AGE: 26                                 |  
|         | NOTES: He needs new pair of sport shoes |
+---------+-----------------------------------------+

This is what I want to get

+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| id      | date                   | stage | score | name  | hobby    | age | notes                            |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 2/3/21 10:23:42 AM CST | 1     | 2,321 | Sally | Swimming |     | But she doesn't like: sun, fish  |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 4/3/21 8:15:20 AM CST  | 1     | 500   | Tom   | Running  | 26  | He needs new pair of sport shoes |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+

Note that :

  • Some comments may have an additional line for AGE
  • Colon : may appear twice in NOTES in the [comment] column, e.g NOTES: bla bla bla : further sentence
  • ID can be duplicated
  • There are different IDs and thousands of rows

My initial thought was to :

  • somehow use a regex to use the line breaks \n before NOTES: as column separator (but the AGE line that sometimes appear seem to mess it up, or my brain just isn't working...)

Your help is much appreciated. Thank you!

buddemat
  • 4,552
  • 14
  • 29
  • 49
zumba jig
  • 25
  • 3
  • 1
    What is your source data. It is best to solve it before you create the initial dataframe. If we need to work with this dataframe, we can do so. Also, have you tried anything to solve this? If so what haver you trtied – Joe Ferndz Mar 09 '21 at 02:14
  • Hi joe, the source data is a csv file, and I will be outputting the data into a csv as well after transforming it. I tried regex but didn't make much sense on my end :( – zumba jig Mar 09 '21 at 09:15
  • can you share sample records from the source csv file please? It will be best to solve it at source than to move it into a dataframe and then rearrange then convert it back to csv – Joe Ferndz Mar 10 '21 at 17:49
  • 1
    Please don't add your data as images, but instead copy it into the post. This makes it easier for others to work on your problem. There are [online table generators](https://www.tablesgenerator.com/) that can help you make pretty text or markdown tables in no time. – buddemat Mar 10 '21 at 23:45

1 Answers1

1

You can use str.extract and a regex with named capture groups to directly capture the extracted data into dataframe columns with the respective group names (see this answer to a question on pandas split list into columns with regex).

You can use the fixed parts of your comments column (i.e. the labels and newlines) as anchors and make the AGE: part optional.

Regex:

DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n(?:AGE: )?(?P<age>[\s\S]*?)(\n)?NOTES:(?P<notes>[\s\S]+)

Explanation:

  1. Each column is extracted by the pattern of the form ANCHOR: (?P<groupname>[\s\S]+)\n
  • ANCHOR: - This are just your plain text labels, i.e. DATE: , STAGE: , etc.
  • (?P<groupname>- This starts a named capture group. The <groupname> directly becomes the dataframe column name.
  • [\s\S]+ - greedy match any series of (at least one) characters (including newlines, see this answer)
  1. For the column age, we need some changes, as the AGE: anchor may be present or not:
  • [\s\S]+? - the last group right before the AGE: anchor is matched lazily, else it would greedily include the whole AGE: part that follows in the match
  • (?:AGE: )? - the AGE: anchor itself is enclosed in an optional non-capturing group, since it may be there or not
  • (?P<age>[\s\S]*?) - the named capture group for age is permitted to be empty, unlike the others
  • (?:\n)? - the trailing newline character is of course also optional and should not be captured

Altogether, this will find matches in your string whether the AGE: part is there (https://regex101.com/r/tn6ixo/2/) or not (https://regex101.com/r/tn6ixo/1/).

Full example:

Input CSV file (comments.csv):

id;comments
123ab12;"DATE: 2/3/21 10:23:42 AM CST
STAGE: 1
SCORE: 2,321
NAME: Sally
HOBBY: Swimming
NOTES: But she doesn't like: sun, fish"
123ab12;"DATE: 4/3/21 8:15:20 AM CST
STAGE: 1
SCORE: 500
NAME: Tom
HOBBY: Running
AGE: 26
NOTES: He needs new pair of sport shoes"

Python script:

import pandas as pd

df = pd.read_csv('comments.csv', delimiter=';')

ef =  df['comments'].str.extract('DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n(?:AGE: )?(?P<age>[\s\S]*?)(?:\n)?NOTES:(?P<notes>[\s\S]+)', expand=True)

Outcome:

                     date stage  score   name     hobby age                              notes
0  2/3/21 10:23:42 AM CST     1  2,321  Sally  Swimming        But she doesn't like: sun, fish
1   4/3/21 8:15:20 AM CST     1    500    Tom   Running  26   He needs new pair of sport shoes

Supplemental

Please note that this will yield a dataframe where all columns have dtype: object. You may want to convert some of your columns, e.g.

ef[['stage', 'age']] = ef[['stage', 'age']].apply(pd.to_numeric)
ef['score'] = ef['score'].str.replace(',', '').astype(int)
ef[['name', 'hobby', 'notes']] = ef[['name', 'hobby', 'notes']].astype('string')
ef['date'] = pd.to_datetime(ef['date'])

Beware, the latter command will not be able to automatically identify your timezone correctly, as CST is an ambiguous timezone. Instead, you will end up with naive timestamps.

To add your timezone information, you can e.g. add a timezone with pytz :

import pytz

ef['date'] = ef['date'].apply(lambda x: x.replace(tzinfo=pytz.timezone('America/Chicago')))

You can also create a DatetimeIndex and localize or convert to your timezone.

buddemat
  • 4,552
  • 14
  • 29
  • 49