I have a problem in splitting the values in a list in a series into columns
Background
I made a table in postgresql and read the data in pandas
# Read the table from postgresql
sql = "select * from music_store;"
dat = pd.read_sql_query(sql, conn)
dat
The data type is as follows:
print(dat.dtypes)
I would like to normalize the table and what I want to achieve is:
Issue
To tackle this, I tried a couple of methods First, I tried to use set_index and explode method as follows
dat.set_index(['transaction_id', 'customer_name', 'cashier_name', 'year']).\
apply(lambda x: x.str.split(',').explode()).reset_index()
dat
It returns the same result as original
Then I tried to use the stack methods as follows:
dat.set_index(['transaction_id', 'customer_name', 'cashier_name', 'year']).stack()
dat
And I got the same result.
Finally, since it is shown as string value by default, I suspected that the issue is related to the brackets, so I tried to remove the brackets first by using the replace function as follows:
dat['albums_purchased'] = dat['albums_purchased'].str.replace('[', '').replace(']', '')
dat
Interesting, I got a nan result
Questions
- What is my issue here? And How can I solve the issue?
- Why would I get a NaN when I tried to replace a char in the string value?
- The task is actually to normalize the tables in the database. In case there are more then 10 million roles, would this approach still works or how to I proceed in order to speed up the process?
Thank you for all of your help in advance!