0

i have the following dataframe and i want to normalize the column Ambience, to get one column for touristy, romantic etc. Im probably missing something obvious, so i really need some help here

                 business_id                                           Ambience
0       6iYb2HFDywm3zjuRg0shjw  {'touristy': False, 'hipster': False, 'romanti...
1       tCbdrRPZA0oiIYSmHG3J0w  {'romantic': False, 'intimate': False, 'touris...
2       bvN78flM8NLprQ1a1y5dRg                                                NaN
3       oaepsyvc0J17qwi8cfrOWg                                                NaN
4       PE9uqAjdw0E4-8mjGl3wVA                                                NaN
...                        ...                                                ...
160580  D2mHoIDXx9N8mS1pGoKV9Q                                                NaN
160581  bQX-kwVTyZgcdZGEPzce6Q                                                NaN
160582  wvFZ06nmPmQ2-IVoPqVYLA                                                NaN
160583  GB75wPibj3IjNauaoCxyGA                                                NaN
160584  ngmLL5Y5OT-bYHKU0kKrYA  {'romantic': False, 'intimate': False, 'classy...

I tried multiple solutions on StackOverflow, but they just dont work for me, for example:

df = df["Ambience"].apply(pd.Series)

or

df["Ambience"] = df["Ambience"].apply(
    literal_eval
)
df2 = pd.json_normalize(df["Ambience"])

i tried filling in the NaN beforehand, but that didnt help also:

df.Ambience = df.Ambience.fillna(
    {i: {} for i in df.index}
)

or some things i didnt understand to the fullest:

pd.concat(list(df.Ambience.map(json_normalize)))

at this point im just clueless, every solution either doesnt work and or i get some result like:

                                            0
0       {'touristy': False, 'hipster': False, 'romanti...
1       {'romantic': False, 'intimate': False, 'touris...
2                                                     NaN
3                                                     NaN
4                                                     NaN
...                                                   ...
160580                                                NaN
160581                                                NaN
160582                                                NaN
160583                                                NaN
160584  {'romantic': False, 'intimate': False, 'classy...

whats even more confusing, for another similar dataframe the df.apply(pd.Series) worked fine but a little bit slow:

   business_id                                              hours
0       6iYb2HFDywm3zjuRg0shjw  {'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'...
1       tCbdrRPZA0oiIYSmHG3J0w  {'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ...
2       bvN78flM8NLprQ1a1y5dRg  {'Thursday': '11:0-18:0', 'Friday': '11:0-18:0...
3       oaepsyvc0J17qwi8cfrOWg                                               None
4       PE9uqAjdw0E4-8mjGl3wVA  {'Monday': '16:0-19:0', 'Tuesday': '16:0-19:0'...
...                        ...                                                ...
160580  D2mHoIDXx9N8mS1pGoKV9Q  {'Monday': '9:0-19:0', 'Tuesday': '9:0-19:0', ...
160581  bQX-kwVTyZgcdZGEPzce6Q  {'Monday': '10:0-18:0', 'Tuesday': '10:0-18:0'...
160582  wvFZ06nmPmQ2-IVoPqVYLA  {'Monday': '0:0-0:0', 'Wednesday': '12:0-17:0'...
160583  GB75wPibj3IjNauaoCxyGA                                               None

df = df["hours"].apply(pd.Series)


           Monday    Tuesday  Wednesday   Thursday     Friday    Saturday      Sunday
0       11:0-23:0  11:0-23:0  11:0-23:0  11:0-23:0  11:0-23:0   11:0-23:0   11:0-23:0
1        5:0-18:0   5:0-17:0   5:0-18:0   5:0-18:0   5:0-18:0    5:0-18:0    5:0-18:0
2             NaN        NaN        NaN  11:0-18:0  11:0-18:0   11:0-18:0   11:0-18:0
3             NaN        NaN        NaN        NaN        NaN         NaN         NaN
4       16:0-19:0  16:0-19:0  16:0-19:0  16:0-19:0  16:0-19:0    9:0-11:0         NaN
...           ...        ...        ...        ...        ...         ...         ...
160580   9:0-19:0   9:0-19:0   9:0-19:0   9:0-19:0   9:0-19:0    9:0-19:0    9:0-19:0
160581  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0    9:0-17:0         NaN
160582    0:0-0:0        NaN  12:0-17:0  19:0-21:0  19:0-21:0  15:30-21:0   16:0-18:0
160583        NaN        NaN        NaN        NaN        NaN         NaN         NaN
160584        NaN  17:0-21:0  17:0-21:0  17:0-21:0  17:0-21:0  11:30-21:0  12:0-19:30
mcmaddud
  • 1
  • 2
  • The values in the Ambience column are Python dictionaries or strings? – aaossa Feb 11 '22 at 19:38
  • 2
    This question would really benefit from a [mre], there is a huge amount of data here, but none of it is actually working data that we can use in Python. You can [edit] the question to supply a minimal input and the expected output. – Kraigolas Feb 11 '22 at 19:42
  • You have NaN in your column, so you need a little more steps before being able to do `json_normalize`. Take a look at this and check the column type and see which case it fits you. https://stackoverflow.com/q/63876637/2956135 – Emma Feb 11 '22 at 20:03
  • Please provide enough code so others can better understand or reproduce the problem. – Community Feb 22 '22 at 11:39

1 Answers1

0

Thanks for the input, actually i found the solution. I never thought of it but some values in the column were floats, some strings etc.

stackoverflow.com/a/58155933/17328150 this solved my problem.

The types of the column were all over the place, so converting them first to strings and then to dicts, solved the problem!

df.Ambience = df.Ambience.fillna(
    {i: {} for i in df_attributes_series.index}
)

df = df["Ambience"].astype("str")
df = df.apply(lambda x: ast.literal_eval(x))
df = df.apply(pd.Series)
mcmaddud
  • 1
  • 2