Splitting non-obvious messy strings in Python

Question

I have this string:

Model:                ARIMA                                       BIC:                 417.2273
Dependent Variable:   D.Sales of shampoo over a three year period Log-Likelihood:      -196.17
Date:                 2018-09-24 13:20                            Scale:               1.0000
No. Observations:     35                                          Method:              css-mle
Df Model:             6                                           Sample:              02-01-1901
Df Residuals:         29                                                               12-01-1903
Converged:            1.0000                                      S.D. of innovations: 64.241
No. Iterations:       19.0000                                     HQIC:                410.098
AIC:                  406.3399

and I want to turn it into a dictionary. I already use split("\n") and I get:

Model: ARIMA BIC: 417.2273
Dependent Variable: D.Sales of shampoo over a three year period Log-Likelihood: -196.17
Date: 2018-09-24 13:20 Scale: 1.0000
No. Observations: 35 Method: css-mle
Df Model: 6 Sample: 02-01-1901
Df Residuals: 29 12-01-1903
Converged: 1.0000 S.D. of innovations: 64.241
No. Iterations: 19.0000 HQIC: 410.098
AIC: 406.3399

but I don't see a good way to split each line into keys and values for a dictionary. Maybe I'm missing something obvious?

Also, note the formatting of the dates next to 'Sample:': the value continues on the following line.

I want something like: {"Model": "ARIMA", "BIC": 417.2273, ...}

You haven't shown what the resulting dictionary should look like — roganjosh, Sep 24 '18 at 20:59
Do you have options on how to import the string? You could look at the example [here](https://stackoverflow.com/questions/4914008/how-to-efficiently-parse-fixed-width-files) on parsing a fixed-width file, which this seems to be. — smp55, Sep 24 '18 at 21:21
no i don't believe so. at the end of the day it's a string parsing question. — user3662456, Sep 24 '18 at 21:34
Do you have a list of _all_ the possible keys? (model, Bic..)? Can the ':' character appear in the _values_? — eddiewould, Sep 24 '18 at 21:43
Try splitting on ':' + 0 or more whitespace characters (use regex). Assert you have an even number of items. Then assign odd ones to keys, even ones to values. — eddiewould, Sep 24 '18 at 21:45
I'm not really familiar with regex, and i'm particularly stumped with: 'Date: 2018-09-24 13:20 Scale: 1.0000' or these 'No. Observations: 35 ' — user3662456, Sep 24 '18 at 22:01

score 0 · Answer 1 · answered Sep 25 '18 at 09:34

The primary problem is that there are several columns side by side. Since both keys and values contain whitespace, you cannot simply split on whitespace. Instead, you have to separate the columns first, then parse the data within each column.

If the length of columns is unknown

Use the first line to identify the length of columns. Once you have the columns separated, you can easily separate keys and values at the colon.

If the placement of keys is stable, you can exploit the fact that neither the keys nor the values on the first line contain spaces, so a plain whitespace split yields alternating keys and values.

lines = input_string.splitlines()
key_values = lines[0].split()  # split first line into keys and values
column_keys = key_values[::2]  # extract the keys by taking every second element
column_starts = [lines[0].find(key) for key in column_keys]  # index of each key

Once you are at this point, proceed as if the length of columns were known.
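
One caveat with `str.find` is that it returns the first occurrence of the key text, so it could misfire if a key also appears inside an earlier value. As an alternative, the start indices can be read from regex match positions. This is only a minimal sketch: the `sample` string and the `row` helper are made up for illustration, and it assumes that no value on the first line contains a colon.

```python
import re

def row(key1, value1, key2, value2):
    # Hypothetical helper: builds one fixed-width line with two key/value columns.
    return f"{key1:<12}{value1:<11}{key2:<10}{value2}"

sample = "\n".join([
    row("Model:", "ARIMA", "BIC:", "417.2273"),
    row("Date:", "2018-09-24", "Scale:", "1.0000"),
])

first_line = sample.splitlines()[0]
# Assumption: on the first line, every token ending in ':' is a key.
column_starts = [match.start() for match in re.finditer(r"\S+:", first_line)]
print(column_starts)  # [0, 23]
```

Because the positions come from the matches themselves, no second search through the line is needed.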

If the length of columns is known

Separate the columns on their start indices.

column_ends = column_starts[1:] + [None]
# separate all key: value lines
key_values = [
    line[start:end]
    # ordering is important - need to parse column-first for the next step
    for start, end in zip(column_starts, column_ends)
    for line in lines
]
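
To see what the column-first ordering produces, here is a small self-contained run. The two-line input and the start indices are invented for illustration; the slicing itself is the same as above.

```python
# Hypothetical two-column input; the second column starts at index 23.
lines = [
    "Model:".ljust(12) + "ARIMA".ljust(11) + "BIC:".ljust(10) + "417.2273",
    "Date:".ljust(12) + "2018-09-24".ljust(11) + "Scale:".ljust(10) + "1.0000",
]
column_starts = [0, 23]
column_ends = column_starts[1:] + [None]  # None slices to the end of the line

cells = [
    line[start:end]
    for start, end in zip(column_starts, column_ends)  # outer loop: columns
    for line in lines                                   # inner loop: rows
]
for cell in cells:
    print(repr(cell.strip()))
```

All cells of the first column come out before any cell of the second, so a value that continues on the next row directly follows the cell that holds its key.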

Since Sample uses a multi-line value, we cannot neatly split keys from values on the colon. Instead, we must track the previously seen key to insert it for key-less values.

data = {}
for line in key_values:
    # skip cells that are empty or contain only whitespace
    if not line.strip():
        continue
    # check if there is a key at the start of the line
    if line[0] != ' ':
        # insert key/value pairs
        key, value = line.split(':', 1)
        data[key.strip()] = value.strip()
    else:
        # append dangling values
        value = line
        data[key.strip()] += '\n' + value.strip()

This gives you a key: value dictionary of strings:

{'Model': 'ARIMA',
 'Dependent Variable': 'D.Sales of shampoo over a three year period',
 'Date': '2018-09-24 13:20',
 'No. Observations': '35',
 'Df Model': '6',
 'Df Residuals': '29',
 'Converged': '1.0000',
 'No. Iterations': '19.0000',
 'AIC': '406.3399',
 'BIC': '417.2273',
 'Log-Likelihood': '-196.17',
 'Scale': '1.0000',
 'Method': 'css-mle',
 'Sample': '02-01-1901\n12-01-1903',
 'S.D. of innovations': '64.241',
 'HQIC': '410.098'}
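
For reuse, the steps can be wrapped into a single function. The following is a sketch under a few assumptions: the name `parse_columns`, the `row` helper, and the shortened sample are hypothetical, the column start indices are passed in, and a cell that begins with whitespace is treated as a continuation of the previous key in the same column.

```python
def parse_columns(text, column_starts):
    """Parse side-by-side 'key: value' columns into a dict of strings."""
    lines = text.splitlines()
    column_ends = column_starts[1:] + [None]
    data = {}
    for start, end in zip(column_starts, column_ends):
        key = None  # last key seen in the current column
        for line in lines:
            cell = line[start:end]
            if not cell.strip():
                continue  # nothing in this column on this line
            if not cell[0].isspace():
                # the cell starts with a key: split at the first colon only
                key, value = cell.split(":", 1)
                key = key.strip()
                data[key] = value.strip()
            elif key is not None:
                # key-less cell: continuation of the previous value
                data[key] += "\n" + cell.strip()
    return data


def row(left, right=""):
    # Hypothetical helper: pads the left column to 30 characters.
    return left.ljust(30) + right


text = "\n".join([
    row("Model:        ARIMA", "BIC:      417.2273"),
    row("Df Model:     6", "Sample:   02-01-1901"),
    row("Df Residuals: 29", "          12-01-1903"),
    row("AIC:          406.3399"),
])
result = parse_columns(text, [0, 30])
print(result)
```

Resetting `key` for each column keeps a dangling value at the top of one column from being attached to the last key of the previous column.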

If you need to convert the values into non-strings, I suggest explicitly converting each field. You can use a dispatch table for each key to define the conversion.

import time

converters = {
 'Model': str, 'Dependent Variable': str,
 'Date': lambda field: time.strptime(field, '%Y-%m-%d %H:%M'),
 'No. Observations': int, 'Df Model': int, 'Df Residuals': int,
 'Converged': float, 'No. Iterations': float, 'AIC': float,
 'BIC': float, 'Log-Likelihood': float, 'Scale': float,
 'Method': str, 'Sample': str, 'S.D. of innovations': float,
 'HQIC': float
}
converted_data = {key: converters[key](data[key]) for key in data}
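
Note that indexing `converters[key]` raises a `KeyError` for any field without an entry in the table. If the set of keys may vary, one option is to fall back to `str` for unknown keys. The sketch below uses a shortened, made-up `data` dictionary (the `Comment` field is hypothetical) and splits the two-line `Sample` value into a list instead of guessing at its date format.

```python
from datetime import datetime

# Shortened example input; 'Comment' is a hypothetical field with no converter.
data = {
    "Model": "ARIMA",
    "Date": "2018-09-24 13:20",
    "No. Observations": "35",
    "BIC": "417.2273",
    "Sample": "02-01-1901\n12-01-1903",
    "Comment": "not in the table",
}

converters = {
    "Date": lambda field: datetime.strptime(field, "%Y-%m-%d %H:%M"),
    "No. Observations": int,
    "BIC": float,
    "Sample": str.splitlines,  # keep both dates as strings, one per list item
}

# Keys without a converter are passed through str, which leaves them unchanged.
converted = {
    key: converters.get(key, str)(value)
    for key, value in data.items()
}
print(converted["Sample"])  # ['02-01-1901', '12-01-1903']
```

Only the fields that need a non-string type have to be listed; everything else passes through untouched.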
