I have a data frame that looks like this:

    account_id  result
0   588930      {"symbol": "MSFT", "balance": 0.00...

and when I print a single cell value of the result column I get the following, which seems to be a string of dicts:

'{"symbol": "MSFT", "balance": 0.00, "transactionId": 10496491},{"symbol": "AAPL", "balance": 300.12, "transactionId": 10509620},{"symbol": "TSLA", "balance": 40.4, "transactionId": 10632589}'

Other users may have different symbol assets.

I want to parse the content of result into dictionaries so that I can expand it into multiple columns, where the column names are the symbols (e.g. MSFT, TSLA, ...) and the values are the balances.

I haven't been able to transform the string into dictionaries to be able to access the contents.

Thanks!

UPDATE

I tried the following

import json

def string_to_dict(dict_string):
    # Convert to proper json format
    dict_string = dict_string.replace("'", '"').replace('u"', '"')
    return json.loads(dict_string)

df.result = df.result.apply(string_to_dict)

But I get the following error

JSONDecodeError: Extra data: line 1 column 93 (char 92)

which I believe means that json.loads cannot decode multiple dictionaries at the same time, as described here.

finstats
  • It's not a string of dicts, it's a string of JSON. Use `json.loads` to convert it to a data structure. I believe pandas has a way to do that on a whole column. – Tim Roberts Jun 19 '23 at 03:48
  • Hi @TimRoberts, have tried using json.loads however I wasn't able to convert the strings as there are varying number of dictionaries in each row – finstats Jun 19 '23 at 03:59
  • Then perhaps you need to convert your data to a suitable format BEFORE you shove it into a dataframe. Many people hurt themselves by moving to pandas too early. Where did this data come from? How are you going to handle cases where different clients have different holdings? The columns won't line up. – Tim Roberts Jun 19 '23 at 04:06
  • it is a proprietary data set from a client, and I cannot ask for too much customization. For cases with different holdings, the client would have a 0 or NaN for that column – finstats Jun 19 '23 at 04:17
  • We're straying into XY territory. Would it be possible to update the question to describe the actual problem you're trying to solve, rather than how to fix an attempted solution? What is the data source, and how are you querying for the data before populating the data frame? Can you show a contrived sample of the raw data? – ymas Jun 19 '23 at 04:18
  • The data did not arrive to you as a dataframe. That's my point. You sucked something up into a dataframe, and you did it too early. – Tim Roberts Jun 19 '23 at 06:40

1 Answer

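Your data is not valid JSON: each cell holds several comma-separated objects with nothing enclosing them. You have to wrap the string in square brackets so that it becomes a JSON array (a list of dicts once parsed).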
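
To see why the brackets matter, here is a minimal sketch using a shortened copy of the cell value from the question. The variable names (`cell`, `records`, `symbols`) are only illustrative. Parsing the raw string fails with the same "Extra data" error, while the bracketed string parses into a list of dicts:

```python
import json

# Shortened sample of one 'result' cell: comma-separated objects, no brackets.
cell = ('{"symbol": "MSFT", "balance": 0.00, "transactionId": 10496491},'
        '{"symbol": "AAPL", "balance": 300.12, "transactionId": 10509620}')

try:
    json.loads(cell)
    raw_ok = True
except json.JSONDecodeError as exc:
    # The parser stops after the first object and rejects what follows it.
    raw_ok = False
    error_message = exc.msg

# Wrapping the string in '[' and ']' turns it into a valid JSON array.
records = json.loads(f'[{cell}]')
symbols = [rec['symbol'] for rec in records]
```
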

You can use:

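import json
import pandas as pd

def process(row):
    # Wrap the cell in '[' and ']' so it parses as a JSON array of records
    data = json.loads(f'[{row}]')
    # One-row frame: a column per symbol, holding that symbol's balance
    return pd.DataFrame({rec['symbol']: [rec['balance']] for rec in data})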

balance = pd.concat(df['result'].map(process).tolist()).set_index(df.index)
out = pd.concat([df.drop(columns='result'), balance], axis=1)

Output:

>>> out
  account_id  MSFT    AAPL  TSLA
0     588930   0.0  300.12  40.4

An example with multiple rows:

data = {'account_id': ['588930', '588931'],
        'result': ['{"symbol": "MSFT", "balance": 0.00, "transactionId": 10496491},{"symbol": "AAPL", "balance": 300.12, "transactionId": 10509620},{"symbol": "TSLA", "balance": 40.4, "transactionId": 10632589}',
                   '{"symbol": "MSFT", "balance": 10.3, "transactionId": 10496491},{"symbol": "CSCO", "balance": 0.0, "transactionId": 10509620},{"symbol": "ABNB", "balance": 26.8, "transactionId": 10632589}']}
df = pd.DataFrame(data)

Output:

>>> out.fillna(0)
  account_id  MSFT    AAPL  TSLA  CSCO  ABNB
0     588930   0.0  300.12  40.4   0.0   0.0
1     588931  10.3    0.00   0.0   0.0  26.8
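
As a possible variation (a sketch, not part of the approach above), each cell can be reduced to a plain `{symbol: balance}` dict and the whole column passed to a single `pd.DataFrame` call, which avoids building one small frame per row. The helper name `balances_by_symbol` and the shortened sample data are hypothetical, and the sketch assumes each symbol appears at most once per row (a repeated symbol would keep only its last balance):

```python
import json
import pandas as pd

def balances_by_symbol(cell):
    """Parse one 'result' cell into a {symbol: balance} dict."""
    records = json.loads(f'[{cell}]')  # wrap in brackets to get a JSON array
    return {rec['symbol']: rec['balance'] for rec in records}

# Shortened, illustrative data in the same shape as the question's.
df = pd.DataFrame({
    'account_id': ['588930', '588931'],
    'result': [
        '{"symbol": "MSFT", "balance": 0.00, "transactionId": 10496491},'
        '{"symbol": "AAPL", "balance": 300.12, "transactionId": 10509620}',
        '{"symbol": "MSFT", "balance": 10.3, "transactionId": 10496491},'
        '{"symbol": "ABNB", "balance": 26.8, "transactionId": 10632589}',
    ],
})

# One dict per row; pandas aligns the keys into columns and leaves NaN
# where an account does not hold a symbol.
balance = pd.DataFrame(df['result'].map(balances_by_symbol).tolist(), index=df.index)
out = df.drop(columns='result').join(balance).fillna(0)
```

Whether this is faster in practice depends on the data, so it is worth timing both versions on a real sample.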
Corralien