I'm trying to split array values to columns.

I've created a Google Colab notebook and you can find my code here.

Here is a screenshot of the data (Hashtags):

Here is a representation of the data.

    codes
1   [71020]
2   [77085]
3   [36415]
4   [99213, 99287]
5   [99233, 99233, 99233]

I want to split these arrays into separate columns.

To something like this (screenshot - Hashtags split to columns):

Here is a representation of it.

                   code_1      code_2      code_3   
1                  71020
2                  77085
3                  36415
4                  99213       99287
5                  99233       99233       99233

I tried the following code, which I got from this Stack Overflow post, but it doesn't give the expected results:

df_hashtags_splitted = pd.DataFrame(df['hashtags'].tolist())

What am I doing wrong?

Nimantha
  • What are the unexpected results you are getting? – Zorgoth May 22 '22 at 05:07
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community May 22 '22 at 05:57

1 Answer

The reason is that the lists are still stored as strings in the hashtags column when you read the file with read_csv, so tolist() returns a list of strings rather than a list of lists. You can convert them while reading the data (the following code is taken from the Colab notebook):

import pandas as pd
from ast import literal_eval

url = "https://raw.githubusercontent.com/hashimputhiyakath/datasets/main/hashtags10.csv"

# Notice the added converter to turn strings into lists.
df = pd.read_csv(url, converters={'hashtags': literal_eval})

After that, the solution you mentioned works as expected:

df_hashtags_splitted = pd.DataFrame(df['hashtags'].tolist(), index=df.index).add_prefix('hashtag_')
print(df_hashtags_splitted.head(10))
          hashtag_0     hashtag_1         hashtag_2       hashtag_3           hashtag_4       hashtag_5    hashtag_6         hashtag_7  hashtag_8       hashtag_9 hashtag_10 hashtag_11
0         longcovid     covidhelp              None            None                None            None         None              None       None            None       None       None
1            mumbai         covid      hospitalbeds  covidemergency           mahacovid       oxygenbed  mumbaicovid  covid19indiahelp  covidhelp  covidresources       None       None
2   kawahcoffeeshop   coffeelover             kawah       costarica            puravida         heredia       oxygen              None       None            None       None       None
3           lucknow        mumbai         hyderabad           delhi            verified  covidresources    covidhelp  covid19indiahelp       None            None       None       None
4            oxygen          None              None            None                None            None         None              None       None            None       None       None
5  covid19indiahelp        mahara              None            None                None            None         None              None       None            None       None       None
6            oxygen       amadoda              None            None                None            None         None              None       None            None       None       None
7  plasmadonordelhi  plasmamumbai  covid19indiahelp       covidhelp  covidemergency2021            None         None              None       None            None       None       None
8            oxygen  conservation           wilding       rewilding         environment  sustainability  restorative       agriculture   wildlife    biodiversity      water   wildswim
9             covid      verified            mumbai          oxygen  covidemergency2021         covid19    covidhelp    covidresources       None            None       None       None
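
To check the behaviour without downloading the file, here is a minimal sketch on a small in-memory CSV. The `csv_text` contents are made up for illustration and are not taken from the real dataset; the only assumption is that list cells are stored as quoted strings, as in the original file.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical CSV content standing in for the real file;
# each list cell is stored as a quoted string.
csv_text = '''hashtags
"['longcovid', 'covidhelp']"
"['oxygen']"
"['covid', 'verified', 'mumbai']"
'''

# Without a converter, every cell is a plain string such as "['oxygen']",
# so building a DataFrame from tolist() yields a single column of strings.
raw = pd.read_csv(io.StringIO(csv_text))
broken = pd.DataFrame(raw['hashtags'].tolist())
print(type(raw['hashtags'].iloc[0]), broken.shape)

# With the converter, every cell is parsed into a real Python list.
df = pd.read_csv(io.StringIO(csv_text), converters={'hashtags': literal_eval})

# Each list element now gets its own column; shorter lists are padded
# with missing values on the right.
wide = pd.DataFrame(df['hashtags'].tolist(), index=df.index).add_prefix('hashtag_')
print(wide)
```

Checking `type(df['hashtags'].iloc[0])` is a quick way to tell which of the two situations you are in.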

Alternatively, to convert the strings to lists after you have read the CSV, you can do:

df['hashtags'] = df['hashtags'].map(literal_eval)
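
The layout sketched in the question uses 1-based column names (code_1, code_2, ...), while add_prefix keeps pandas' 0-based labels. One possible way to get the 1-based names is shown below, on a frame rebuilt by hand from the question's `codes` example (the data is retyped here and already holds real lists, not strings). The conversion to the nullable `Int64` dtype is an optional extra, not part of the original answer.

```python
import pandas as pd

# Frame retyped from the question's example; cells are already real lists.
df = pd.DataFrame(
    {'codes': [[71020], [77085], [36415], [99213, 99287], [99233, 99233, 99233]]},
    index=range(1, 6),
)

# One column per list position; shorter lists are padded with NaN.
wide = pd.DataFrame(df['codes'].tolist(), index=df.index)

# Rename the 0-based positional labels to 1-based code_N names.
wide.columns = [f'code_{i + 1}' for i in wide.columns]

# Optional: padding turns the padded columns into floats; the nullable
# integer dtype keeps the codes as integers alongside missing values.
wide = wide.astype('Int64')
print(wide)
```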
user2246849
  • `from ast import literal_eval; df['hashtags'] = df['hashtags'].map(literal_eval); df_hashtags_splitted = pd.DataFrame(df['hashtags'].tolist()).add_prefix('hashtag_'); df_hashtags_splitted.head()` – Hashim Hamza Puthiyakath May 22 '22 at 13:15
  • @HashimHamzaPuthiyakath right, I had forgotten about the `hashtag_` prefix. I added that to the answer as well, thanks! – user2246849 May 22 '22 at 13:18