0

I have a problem in splitting the values in a list in a series into columns

Background

I made a table in postgresql and read the data in pandas

# Read the table from postgresql 
sql = "select * from music_store;"
dat = pd.read_sql_query(sql, conn)
dat

And the data looks like this: enter image description here

The data type is as follows:

print(dat.dtypes) 

enter image description here

I would like to normalize the table and what I want to achieve is: enter image description here

Issue

To tackle this, I tried a couple of methods First, I tried to use set_index and explode method as follows

dat.set_index(['transaction_id', 'customer_name', 'cashier_name', 'year']).\
    apply(lambda x: x.str.split(',').explode()).reset_index()
dat

It returns the same result as original

Then I tried to use the stack methods as follows:

dat.set_index(['transaction_id', 'customer_name', 'cashier_name', 'year']).stack()
dat

And I got the same result.

Finally, since it is shown as string value by default, I suspected that the issue is related to the brackets, so I tried to remove the brackets first by using the replace function as follows:

dat['albums_purchased'] = dat['albums_purchased'].str.replace('[', '').replace(']', '')
dat

Interesting, I got a nan result enter image description here

Questions

  1. What is my issue here? And How can I solve the issue?
  2. Why would I get a NaN when I tried to replace a char in the string value?
  3. The task is actually to normalize the tables in the database. In case there are more then 10 million roles, would this approach still works or how to I proceed in order to speed up the process?

Thank you for all of your help in advance!

smci
  • 32,567
  • 20
  • 113
  • 146
Pak Hang Leung
  • 389
  • 5
  • 15

0 Answers0