I have a dataset in the following format:

df = pd.read_csv("data_processing.csv")
df
    user_id volume
0   a       {"BTCUSDT":1000,"USDTINR":20}
1   b       {"BTCINR":30,"USDTINR":10,"ETHINR":15}
2   c       {"XRPINR":10,"ETHUSDT":500,"XRPUSDT":200}
3   d       {"ETHINR":5}

I want to convert the above dataset into the following format:

df
   user_id  symbol  volume
0   a       BTCUSDT 1000.0
1   a       USDTINR 20.0
2   b       USDTINR 10.0
3   b       BTCINR  30.0
4   b       ETHINR  15.0
5   c       XRPINR  10.0
6   c       ETHUSDT 500.0
7   c       XRPUSDT 200.0
8   d       ETHINR  5.0

What I've tried so far:

I converted the strings in the "volume" column to dicts:

df['volume'] = df['volume'].map(eval)

Then I reshaped the "volume" column so that all the dict keys end up in one column and all the values in another:

df2 = pd.json_normalize(df['volume']).stack().to_frame(name='volume').reset_index()

But now I'm having trouble mapping the user_id values onto the rows of the resulting dataframe.
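
For reference, the attempt above can probably be finished by mapping the first level of the stacked index back to user_id. The following is only a minimal sketch: it assumes df has the default 0..n-1 index, rebuilds the sample data inline instead of reading data_processing.csv, and swaps eval for json.loads because the strings shown are valid JSON. The names parsed and row are illustrative only.

```python
import json
import pandas as pd

# Sample data rebuilt inline (assumption: this mirrors data_processing.csv).
df = pd.DataFrame({
    "user_id": ["a", "b", "c", "d"],
    "volume": [
        '{"BTCUSDT":1000,"USDTINR":20}',
        '{"BTCINR":30,"USDTINR":10,"ETHINR":15}',
        '{"XRPINR":10,"ETHUSDT":500,"XRPUSDT":200}',
        '{"ETHINR":5}',
    ],
})

# Parse the JSON strings without evaluating arbitrary code.
parsed = df["volume"].map(json.loads)

# Expand the dicts into columns, then stack them into (row, symbol) pairs.
# dropna() removes the symbols a user did not trade.
df2 = (
    pd.json_normalize(parsed.tolist())
    .stack()
    .dropna()
    .rename_axis(["row", "symbol"])
    .reset_index(name="volume")
)

# "row" holds the position of the source row, which equals df's index
# labels here, so it can be mapped straight back to user_id.
df2["user_id"] = df2["row"].map(df["user_id"])
df2 = df2[["user_id", "symbol", "volume"]]
print(df2)
```

If df does not have the default index, calling df.reset_index(drop=True) first should keep the mapping valid.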

3 Answers

One way:

import pandas as pd

df = pd.DataFrame({'user_id': {0: 'a', 1: 'b', 2: 'c', 3: 'd'},
 'volume': {0: {'BTCUSDT': 1000, 'USDTINR': 20},
  1: {'BTCINR': 30, 'USDTINR': 10, 'ETHINR': 15},
  2: {'XRPINR': 10, 'ETHUSDT': 500, 'XRPUSDT': 200},
  3: {'ETHINR': 5}}})

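# set user_id as the index before stacking so that it is kept as a key, not stacked as a value
df = df.join(df.pop('volume').apply(pd.Series)).set_index('user_id').stack().reset_index()
df.columns = ['user_id', 'symbol', 'volume']

OUTPUT:

  user_id   symbol  volume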
0       a  BTCUSDT  1000.0
1       a  USDTINR    20.0
2       b  USDTINR    10.0
3       b   BTCINR    30.0
4       b   ETHINR    15.0
5       c   XRPINR    10.0
6       c  ETHUSDT   500.0
7       c  XRPUSDT   200.0
8       d   ETHINR     5.0
Nk03

First, convert each volume dict to a list of key-value pairs such as [('BTCUSDT', 1000), ('USDTINR', 20)]. Then use explode to put each pair on its own row and split the pairs into two columns. Finally, join the result back to the original df.

(
    df.drop('volume', axis=1)  # pass axis by keyword; the positional form was removed in pandas 2.0
    .join(df.volume.apply(lambda x: list(x.items())).explode().apply(lambda x: pd.Series(x, ['symbol', 'volume'])))
)
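
The same chain can be written as separate steps so the intermediate results can be inspected. This is a sketch rather than the code above verbatim: it builds the two columns with pd.DataFrame(...) instead of a row-wise apply(pd.Series), and it assumes every dict has at least one entry (an empty dict would explode to NaN and break step 3). The names pairs, exploded, split and out are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["a", "b", "c", "d"],
    "volume": [
        {"BTCUSDT": 1000, "USDTINR": 20},
        {"BTCINR": 30, "USDTINR": 10, "ETHINR": 15},
        {"XRPINR": 10, "ETHUSDT": 500, "XRPUSDT": 200},
        {"ETHINR": 5},
    ],
})

# Step 1: one list of (symbol, volume) tuples per row.
pairs = df["volume"].apply(lambda d: list(d.items()))

# Step 2: one tuple per row; the original index label is repeated.
exploded = pairs.explode()

# Step 3: split each tuple into two columns, keeping the repeated index.
split = pd.DataFrame(
    exploded.tolist(), index=exploded.index, columns=["symbol", "volume"]
)

# Step 4: join on the index, so each user_id is repeated once per pair.
out = df.drop(columns="volume").join(split)
print(out)
```

The index of out keeps the original row labels (0, 0, 1, 1, 1, ...); out.reset_index(drop=True) renumbers it if needed.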
Allen Qin

You can also use pd.DataFrame() to expand the dict into columns. Then, use .stack() to convert the expanded columns into rows, as follows:

df_out = (df.drop('volume', axis=1).join(pd.DataFrame(df['volume'].tolist(), index=df.index))
            .set_index('user_id', append=True).stack().reset_index([1,2])
         )
df_out.columns = ['user_id', 'symbol', 'volume']

Expanding with pd.DataFrame is generally faster than expanding with .apply(pd.Series). See the answers to the post "Split / Explode a column of dictionaries into separate columns with pandas" for details.

Result:

print(df_out)

  user_id   symbol  volume
0       a  BTCUSDT  1000.0
0       a  USDTINR    20.0
1       b  USDTINR    10.0
1       b   BTCINR    30.0
1       b   ETHINR    15.0
2       c   XRPINR    10.0
2       c  ETHUSDT   500.0
2       c  XRPUSDT   200.0
3       d   ETHINR     5.0
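
To check the speed remark above on your own machine, a rough timing sketch along these lines may help. The data is synthetic (the four sample dicts repeated 250 times, an assumption made only for illustration), the function names are hypothetical, and the timings will vary with the pandas version and hardware, so no numbers are quoted here.

```python
import timeit
import pandas as pd

# Synthetic input: the four sample dicts repeated to get 1000 rows.
base = [
    {"BTCUSDT": 1000, "USDTINR": 20},
    {"BTCINR": 30, "USDTINR": 10, "ETHINR": 15},
    {"XRPINR": 10, "ETHUSDT": 500, "XRPUSDT": 200},
    {"ETHINR": 5},
]
volume = pd.Series(base * 250)

def via_series():
    # Builds one pd.Series per row, then combines them.
    return volume.apply(pd.Series)

def via_dataframe():
    # Hands the whole list of dicts to the DataFrame constructor at once.
    return pd.DataFrame(volume.tolist(), index=volume.index)

t_series = timeit.timeit(via_series, number=3)
t_frame = timeit.timeit(via_dataframe, number=3)
print(f"apply(pd.Series): {t_series:.4f}s for 3 runs")
print(f"pd.DataFrame:     {t_frame:.4f}s for 3 runs")
```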
SeaBean