I have this string:

Model:                ARIMA                                       BIC:                 417.2273
Dependent Variable:   D.Sales of shampoo over a three year period Log-Likelihood:      -196.17
Date:                 2018-09-24 13:20                            Scale:               1.0000
No. Observations:     35                                          Method:              css-mle
Df Model:             6                                           Sample:              02-01-1901
Df Residuals:         29                                                               12-01-1903
Converged:            1.0000                                      S.D. of innovations: 64.241
No. Iterations:       19.0000                                     HQIC:                410.098
AIC:                  406.3399

and I want to turn it into a dictionary. I already call split("\n") and I get:

Model: ARIMA BIC: 417.2273
Dependent Variable: D.Sales of shampoo over a three year period Log-Likelihood: -196.17
Date: 2018-09-24 13:20 Scale: 1.0000
No. Observations: 35 Method: css-mle
Df Model: 6 Sample: 02-01-1901
Df Residuals: 29 12-01-1903
Converged: 1.0000 S.D. of innovations: 64.241
No. Iterations: 19.0000 HQIC: 410.098
AIC: 406.3399

but I don't see a good way to split to put it into a dictionary. Maybe I'm missing something obvious?

Also, note that the dates next to 'Sample:' span two lines.

I want something like: {"Model": "ARIMA", "BIC": 417.2273, ...}

user3662456
  • 2
    You haven't shown what the resulting dictionary should look like – roganjosh Sep 24 '18 at 20:59
  • 2
    Do you have options on how to import the string? You could look at the example [here](https://stackoverflow.com/questions/4914008/how-to-efficiently-parse-fixed-width-files) on parsing a fixed-width file, which this seems to be. – smp55 Sep 24 '18 at 21:21
  • No, I don't believe so. At the end of the day it's a string parsing question. – user3662456 Sep 24 '18 at 21:34
  • Does the first line always contain the Model and BIC keys? – MisterMiyagi Sep 24 '18 at 21:43
  • Do you have a list of _all_ the possible keys? (model, Bic..)? Can the ':' character appear in the _values_? – eddiewould Sep 24 '18 at 21:43
  • Try splitting on ':' + 0 or more whitespace characters (use regex). Assert you have an even number of items. Then assign odd ones to keys, even ones to values. – eddiewould Sep 24 '18 at 21:45
  • I'm not really familiar with regex, and I'm particularly stumped by lines like 'Date: 2018-09-24 13:20 Scale: 1.0000' or 'No. Observations: 35 '. – user3662456 Sep 24 '18 at 22:01

1 Answer

The primary problem is that there are several columns side by side. Since both keys and values contain whitespace, you cannot split on that. Instead, you have to first separate the columns, then parse the data.


If the length of columns is unknown

Use the first line to identify where each column starts. Once you have the columns separated, you can easily separate keys and values at the colon.

If the placement of keys is stable, you can exploit that neither the keys nor the values on the first line contain spaces, so a plain whitespace split alternates between keys and values.

lines = input_string.splitlines()
key_values = lines[0].split()  # split first line into keys and values
column_keys = key_values[::2]  # extract the keys by taking every second element
column_starts = [lines[0].find(key) for key in column_keys]  # index of each key

Once you are at this point, proceed as if the length of columns were known.
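As a quick sanity check of the column detection, here is a minimal sketch on a hypothetical two-line excerpt. The name `sample_lines` and the padding widths (22, 44, 21) are assumptions chosen to mimic the layout in the question; they are not taken from the real output. Note that `str.find` returns the first occurrence, so this only works if a key's text does not also appear earlier on the first line.

```python
# Hypothetical two-line excerpt; ljust() pads each field so that the
# second column starts at index 22 + 44 = 66 on every line.
sample_lines = [
    "Model:".ljust(22) + "ARIMA".ljust(44) + "BIC:".ljust(21) + "417.2273",
    "Date:".ljust(22) + "2018-09-24 13:20".ljust(44) + "Scale:".ljust(21) + "1.0000",
]

# The first line has no spaces inside keys or values, so a whitespace
# split alternates key, value, key, value.
first_line_tokens = sample_lines[0].split()
column_keys = first_line_tokens[::2]  # every second token is a key
column_starts = [sample_lines[0].find(key) for key in column_keys]

print(column_keys)    # ['Model:', 'BIC:']
print(column_starts)  # [0, 66]
```

The second line could not be used for this step, because its value `2018-09-24 13:20` contains a space and would break the key/value alternation.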


If the length of columns is known

Separate the columns on their start indices.

column_ends = column_starts[1:] + [None]
# separate all key: value lines
key_values = [
    line[start:end]
    # ordering is important - need to parse column-first for the next step
    for start, end in zip(column_starts, column_ends)
    for line in lines
]

Since Sample uses a multi-line value, we cannot neatly split keys from values on the colon. Instead, we must track the previously seen key to insert it for key-less values.

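data = {}
key = None  # most recently seen key, reused for key-less continuation lines
for line in key_values:
    # skip empty or whitespace-only cells, e.g. the missing second column of the last line
    if not line.strip():
        continue
    # check if there is a key at the start of the line
    if line[0] != ' ':
        # insert key/value pairs; split only on the first colon so times like 13:20 survive
        key, value = line.split(':', 1)
        key = key.strip()
        data[key] = value.strip()
    elif key is not None:
        # append dangling values to the previously seen key
        data[key] += '\n' + line.strip()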

This gives you a key: value dictionary of strings:

{'Model': 'ARIMA',
 'Dependent Variable': 'D.Sales of shampoo over a three year period',
 'Date': '2018-09-24 13:20',
 'No. Observations': '35',
 'Df Model': '6',
 'Df Residuals': '29',
 'Converged': '1.0000',
 'No. Iterations': '19.0000',
 'AIC': '406.3399',
 'BIC': '417.2273',
 'Log-Likelihood': '-196.17',
 'Scale': '1.0000',
 'Method': 'css-mle',
 'Sample': '02-01-1901\n12-01-1903',
 'S.D. of innovations': '64.241',
 'HQIC': '410.098'}
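The steps above can be combined into one function. The following is a sketch under the same assumptions as the answer (a first line with space-free keys and values, and column-aligned text). The names `parse_two_column_summary` and `_row`, and the padded sample, are hypothetical and only serve the illustration.

```python
def parse_two_column_summary(text):
    """Parse a side-by-side 'key: value' summary into a dict of strings."""
    lines = text.splitlines()
    # Assumption: the first line holds one space-free key and value per column.
    column_keys = lines[0].split()[::2]
    column_starts = [lines[0].find(key) for key in column_keys]
    column_ends = column_starts[1:] + [None]

    data = {}
    key = None  # most recently seen key, used for continuation lines
    # Walk column by column so a continuation line follows its own key.
    for start, end in zip(column_starts, column_ends):
        for line in lines:
            cell = line[start:end]
            if not cell.strip():
                continue  # empty cell, e.g. a short last line
            if cell[0] != ' ':
                key, value = (part.strip() for part in cell.split(':', 1))
                data[key] = value
            elif key is not None:
                data[key] += '\n' + cell.strip()
    return data


def _row(left_key, left_value, right_key="", right_value=""):
    """Build one padded line; the second column starts at index 66."""
    left = left_key.ljust(22) + left_value
    return (left.ljust(66) + right_key.ljust(21) + right_value).rstrip()


# Hypothetical excerpt of the summary, including the two-line 'Sample' value.
sample = "\n".join([
    _row("Model:", "ARIMA", "BIC:", "417.2273"),
    _row("Df Model:", "6", "Sample:", "02-01-1901"),
    _row("Df Residuals:", "29", "", "12-01-1903"),
    _row("AIC:", "406.3399"),
])
result = parse_two_column_summary(sample)
print(result)
```

Because the outer loop runs over columns, every key of the first column ends up before the keys of the second column, which matches the ordering of the dictionary shown above.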

If you need non-string values, I suggest converting each field explicitly. A dispatch table that maps each key to its conversion function keeps this readable.

import time

converters = {
 'Model': str, 'Dependent Variable': str,
 'Date': lambda field: time.strptime(field, '%Y-%m-%d %H:%M'),
 'No. Observations': int, 'Df Model': int, 'Df Residuals': int,
 'Converged': float, 'No. Iterations': float, 'AIC': float,
 'BIC': float, 'Log-Likelihood': float, 'Scale': float,
 'Method': str, 'Sample': str, 'S.D. of innovations': float,
 'HQIC': float
}
converted_data = {key: converters[key](data[key]) for key in data}
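The two-line 'Sample' value can get its own converter as well. The sketch below assumes the dates are month-day-year (`%m-%d-%Y`), which the question does not state explicitly. The helper `parse_sample_range` and the small `raw` dictionary are hypothetical. It also uses `dict.get` with a `str` fallback, so keys without a registered converter stay strings instead of raising a `KeyError`.

```python
from datetime import datetime


def parse_sample_range(field):
    """Turn '02-01-1901\\n12-01-1903' into a (start, end) pair of dates."""
    # Assumption: month-day-year ordering; adjust the format string if not.
    start, end = (
        datetime.strptime(part, '%m-%d-%Y').date()
        for part in field.split('\n')
    )
    return start, end


# Hypothetical subset of the parsed string dictionary.
raw = {'BIC': '417.2273', 'Method': 'css-mle', 'Sample': '02-01-1901\n12-01-1903'}
converters = {'BIC': float, 'Sample': parse_sample_range}

# Keys without a registered converter (here 'Method') are kept as strings.
converted = {key: converters.get(key, str)(value) for key, value in raw.items()}
print(converted)
```

Whether a silent fallback or a loud `KeyError` is preferable depends on how much you trust the input; the strict lookup catches unexpected keys early.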
MisterMiyagi