-1

I can't use Pyspark as an FYI!

My data looks like this:

0   { "CountryOfManufacture": "China", "Tags": ["U...
1   { "CountryOfManufacture": "China", "Tags": ["U...
2   { "CountryOfManufacture": "China", "Tags": [] }
3   { "CountryOfManufacture": "Japan", "Tags": ["3...
4   { "CountryOfManufacture": "Japan", "Tags": ["1...
... ...
222 { "CountryOfManufacture": "USA", "ShelfLife": ...
223 { "CountryOfManufacture": "USA", "ShelfLife": ...
224 { "CountryOfManufacture": "USA", "ShelfLife": ...
225 { "CountryOfManufacture": "USA", "ShelfLife": ...
226 { "CountryOfManufacture": "USA", "ShelfLife": .

So there are different values contained in the dictionary. I am only interested in the first one (Country of Manufacture) and want to split this out and then add to another dataframe.

Thanks!

  • Does this answer your question? [Convert list of dictionaries to a pandas DataFrame](https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe) – Pranav Hosangadi Oct 12 '21 at 15:47
  • Welcome to Stack Overflow! Please take the [tour], read [what's on-topic here](/help/on-topic), [ask], and the [question checklist](//meta.stackoverflow.com/q/260648/843953). You'd find an answer quicker with a web-search for the title of your question (which would probably take you to the link I gave above) than asking a new question. [How much research effort is expected of Stack Overflow users?](//meta.stackoverflow.com/a/261593/843953) – Pranav Hosangadi Oct 12 '21 at 15:49

2 Answers2

0

If all your dictionaries have the same keys (or even if they don't! see comment from Pranav below!), then pandas.DataFrame.from_records will work well (link to docs page).

import pandas as pd

data = [{'CountryOfManufacture': 'China', 'col_2': 'a'},
        {'CountryOfManufacture': 'Japan', 'col_2': 'b'},
        {'CountryOfManufacture': 'China', 'col_2': 'c'},
        {'CountryOfManufacture': 'USA', 'col_2': 'd'}]

df = pd.DataFrame.from_records(data)
print(df.head())

#   CountryOfManufacture col_2
# 0                China     a
# 1                Japan     b
# 2                China     c
# 3                  USA     d

if you just want one column, you can select that column after, df["CountryOfManufacture"], or use the exclude keyword and provide a list of all the columns you don't want df = pd.DataFrame.from_records(data, exclude=['col_2'])

chris
  • 1,267
  • 7
  • 20
  • 1
    This works even if they _don't_ have the same keys, you'll just end up with `NA`s in the missing keys – Pranav Hosangadi Oct 12 '21 at 15:46
  • Thanks for your help! I have tried the from_records approach and unfortunately can't seem to get it to work either on the data as originally shown to you, or in list form. Will post an answer showing what is happening! – CharleeBee Oct 12 '21 at 19:36
0

When I try to use from_records my result looks like this:

                                        CustomFields
0  { "CountryOfManufacture": "China", "Tags": ["U...
1  { "CountryOfManufacture": "China", "Tags": ["U...
2    { "CountryOfManufacture": "China", "Tags": [] }
3  { "CountryOfManufacture": "Japan", "Tags": ["3...
4  { "CountryOfManufacture": "Japan", "Tags": ["1...

I think this is because my data is in an unusual format. My data was provided in a CSV file originally, and this was one of the columns. All other columns were in integer/float/object format, whilst this column was already in dictionary format when you viewed it in Excel.

The data you used for your example below is formatted as I would expect, but this is what mine looks like when converted into a list:

['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": [] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["Comedy"] }', ...

As you can see, I have additional quotes outside of each dictionary list, illustrated with a single line here: ['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }'.

Is there a way to get around this without pyspark?

Thanks!