Python parsing unstructured data

Question

I have a file that contains following data in it. I am trying to parse the data.

08/23/21 04:00:05 AM


/* ----------------- data1----------------- */ 

make: honda   model: civic
year: 2019
trim: "lx"
owner: phillip

/* ----------------- data2----------------- */ 

make: toyota  model: highlander
year: 2021
trim: "Platinum"

I want to see the data like this:

Make, Model, Year, trim, Owner
Honda, civic, 2019, lx, phillip
toyota, highlander, 2021, platinum, Rex

here is my code: I was tryin to create dictionary and then load to panda dataframe. I think i am not on right direction.

def fix_line(record):
    #split every field and value into a seperate line
    results = []
    mini_collection = []
    if not record.startswith("/*"):
        #for data in record.rstrip('\n').strip().split('   '):
        for data in record.rstrip('\n').split('   '):
            if ':' not in data:
                mini_collection.append(data)
            else:
                results.append(data)
    return results
                    
def create_dictionary(data):   
    record = {}                
    for line in fix_line(data):
        line = line.strip()
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record

Can you elaborate on this format? Is this a standard output from some other program or system? Can we make guarantees about the dataset like that data always starts on line 4, and every block has values for all keys? Or do we have to interpret everything on the fly? — Henry Ecker, Sep 19 '21 at 02:24
@HenryEcker yes, it is a file from another system. Yes, every data starts from line 4. Some blocks may have more keys and values, and some may have less keys and value. — st_bones, Sep 19 '21 at 11:53

score 2 · Answer 1 · answered Sep 19 '21 at 04:06

2

Heres one way:

import re
import yaml #python -m pip install pyyaml
import pandas as pd 

s = """08/23/21 04:00:05 AM


/* ----------------- data1----------------- */ 

make: honda
model: civic
year: 2019
trim: lx
owner: phillip

/* ----------------- data2----------------- */ 

make: toyota
model: highlander
year: 2021
trim: Platinum
owner: Rex
"""

lines = re.split("/*\s*/", s)
records = [yaml.load(line) for line in lines if "make:" in line]
df = pd.DataFrame(records)

output:

     make       model  year      trim    owner
0   honda       civic  2019        lx  phillip
1  toyota  highlander  2021  Platinum      Rex

answered Sep 19 '21 at 04:06

anon01

10,618
8
35
58

how do I apply re.split if I have data like this? /* ----------------- data1----------------- */ make: honda model: civic year: 2019 trim: lx owner: phillip /* ----------------- data2----------------- */ make: toyota model: highlander year: 2021 trim: Platinum owner: Rex – st_bones Sep 20 '21 at 00:29
@annon01 - I have made some changes to the input file, trim has value with double quote and owner does not exist data2. what changes I should do in your code? – st_bones Sep 20 '21 at 15:42
you will have to write a custom cleaning function if your data is dirty - there is no easy solution for that – anon01 Sep 20 '21 at 16:08

Raymond Toh · Answer 2 · 2021-09-21T12:32:10.417

0

Try using re.finditer with the following pattern to create dictionary based on finding. Then append to a dataframe.

import re

pattern = """
    (?P<make>(?<=(make:\ ))\w+) #use lookbehind regex to get make
    (\s + model: \ )            #Skip to model
    (?P<model>\w+)              #Get Model
    (\s year: \ )               #Skip to year
    (?P<year>\d+)               #Get year
    (\s + trim: \ ")            #Skip to trim
    (?P<trim>\w+)               #Get trim
    ("\s)                       #Skip to owner
    (?P<owner>.*)               #Get owner
"""

df = pd.DataFrame([item.groupdict() for item in re.finditer(pattern, data, re.VERBOSE)])
df["owner"] = df["owner"].str.replace("owner: ", "")
df
Out[563]: 
     make       model  year      trim    owner
0   honda       civic  2019        lx  phillip
1  toyota  highlander  2021  Platinum

edited Sep 21 '21 at 12:32

answered Sep 19 '21 at 03:02

Raymond Toh

779
1
8
27

2

Not the DV, but [never grow a DataFrame](https://stackoverflow.com/a/56746204/15497888). `DataFrame.append` in a loop has quadratic time complexity. It would be better to add all of those dictionaries to a list then build the DataFrame at the end. `df = pd.DataFrame([item.groupdict() for item in re.finditer(pattern, data, re.VERBOSE)])` – Henry Ecker Sep 19 '21 at 03:21
@HenryEcker good idea. i learn something new today. Thanks for feedback! – Raymond Toh Sep 19 '21 at 03:29
@RaymondToh - I have modified my input file data little bit, like trim has value with double quote and owner does not exist data2 and make and model are on one line. how do I write the pattern for this? – st_bones Sep 21 '21 at 01:52
@st_bones are you able to provide me a sample? Also, do use `list comprehension` method provided by Henry and mark a tick to the solution. – Raymond Toh Sep 21 '21 at 02:11
@RaymondToh sample data is in the original request description, I have modified the sample data in the description box. Thank you for your help. – st_bones Sep 21 '21 at 12:05
@st_bones ur data2 output should not have rex anymore? – Raymond Toh Sep 21 '21 at 12:21
1

@st_bones this is the best results I can get using `regex`. Because `data1` and `data2` do not have the same pattern. so the `str.replace` is needed. If this work for you, do upvote and approve the solution :) – Raymond Toh Sep 21 '21 at 12:34

Python parsing unstructured data

2 Answers2