2

I have a dataframe that looks like this:

df=pd.read_csv('https://raw.githubusercontent.com/amanaroratc/hello-world/master/ask_git.csv')

    Channel_ID  Video_Category_Name score_pct
0   UC--bUZc5c9WseZNqGR6KLxA    Autos & Vehicles    0.213702
1   UC--bUZc5c9WseZNqGR6KLxA    Entertainment   0.786298
2   UC-B1L3oT81XgeeGh6S12qgQ    People & Blogs  1.000000
3   UC-N_7HFKrSsYxCSA_kfdRSA    People & Blogs  0.137261
4   UC-N_7HFKrSsYxCSA_kfdRSA    Pets & Animals  0.862739
... ... ... ...
819 UCzsNLZ9GrGXRjt0QmvWFm2Q    Entertainment   0.945243
820 UCzsNLZ9GrGXRjt0QmvWFm2Q    Film & Animation    0.002046
821 UCzsNLZ9GrGXRjt0QmvWFm2Q    Music   0.002797
822 UCzsNLZ9GrGXRjt0QmvWFm2Q    News & Politics 0.000433
823 UCzsNLZ9GrGXRjt0QmvWFm2Q    People & Blogs  0.000358

There are 15 distinct values in Video_Category_Name:

df.Video_Category_Name.unique()

gives

array(['Autos & Vehicles', 'Entertainment', 'People & Blogs',
       'Pets & Animals', 'Howto & Style', 'Education', 'Gaming', 'Music',
       'Comedy', 'Travel & Events', 'Science & Technology',
       'Nonprofits & Activism', 'Sports', 'Film & Animation',
       'News & Politics'], dtype=object)
 In [3]: iwantthis
  Out[3]:
     Channel_ID  Autos & Vehicles Entertainment People & Blogs ...
  0  UC--bUZc5c9WseZNqGR6KLxA  0.213702 0.786298 0 ...
  1  UC-B1L3oT81XgeeGh6S12qgQ  0        0        1.0000 ...

How do I create a column for each of these 15 and fill value from score_pct (0 if it does not exist)? Not sure how to use unstack/melt/pivot or something else

AmanArora
  • 2,379
  • 6
  • 19
  • 22
  • Would be really helpful if you provided input and desired output as code or in copy-pasteable form. You'd already have an answer from me if you didn't expect me to type in the data from your picture by hand. – timgeb Mar 04 '22 at 09:18
  • You can check out [this](https://stackoverflow.com/a/20159305/3620003) post for help with asking pandas questions. – timgeb Mar 04 '22 at 09:20
  • @timgeb fixed, with link. Thanks for your input – AmanArora Mar 04 '22 at 09:32
  • That df has 824 rows. The post I linked to suggests 5. You still did not provide the output you want. – timgeb Mar 04 '22 at 09:34

1 Answers1

2

I think pivot() is the right function for your problem. It takes the categorial values of Video_Category_Name and creates new columns, which are filled with the value of score_pct. Non existing values are replaced by zero with `filna(0):

df = df.pivot(index='Channel_ID', columns='Video_Category_Name', values='score_pct').fillna(0).reset_index()

Output:

Video_Category_Name Channel_ID  Autos & Vehicles    Comedy  Education   Entertainment   Film & Animation    Gaming  Howto & Style   Music   News & Politics Nonprofits & Activism   People & Blogs  Pets & Animals  Science & Technology    Sports  Travel & Events
0   UC--bUZc5c9WseZNqGR6KLxA    0.213702    0.0 0.0 0.786298    0.0 0.0 0.0 0.0 0.0 0.0 0.000000    0.000000    0.0 0.0 0.0
1   UC-B1L3oT81XgeeGh6S12qgQ    0.000000    0.0 0.0 0.000000    0.0 0.0 0.0 0.0 0.0 0.0 1.000000    0.000000    0.0 0.0 0.0
2   UC-N_7HFKrSsYxCSA_kfdRSA    0.000000    0.0 0.0 0.000000    0.0 0.0 0.0 0.0 0.0 0.0 0.137261    0.862739    0.0 0.0 0.0
3   UC-T4JheeuNl2DVg-B-v7McA    0.000000    0.0 0.0 0.000000    0.0 0.0 1.0 0.0 0.0 0.0 0.000000    0.000000    0.0 0.0 0.0
4   UC-WG1VP4am6NaUtANEJxRQw    0.000000    0.0 0.0 0.000000    0.0 0.0 0.0 0.0 0.0 0.0 1.000000    0.000000    0.0 0.0 0.0

Edit 1: As the comment mentions this only works if Channel_ID is unique. If it is not (or to be safe) you can also include the index in the pivot operation. Afterwards restore the index again:

df = df.reset_index().pivot(index=['index', 'Channel_ID'], columns='Video_Category_Name', values='score_pct').fillna(0).reset_index(level=1)

Edit 2: The Video_Category_Name in the dataframe is only the label of the columns and should not change anything. However, you can easily remove it with this line:

df = df.rename_axis(None, axis=1) 

For the solution of Edit 1 you might also want to remove the index name, which can be done with the same operation and a different axis:

df = df.rename_axis(None, axis=0)
JANO
  • 2,995
  • 2
  • 14
  • 29
  • Note that this will raise a `ValueError` if there are any index, columns combinations with multiple values. Does not happen for OP's sample data though. – timgeb Mar 04 '22 at 09:42
  • @JANO, how do I remove that "Video_Category_Name" above the indices though? – AmanArora Mar 04 '22 at 09:49
  • @timgeb You`re right! I added a solution that uses the index as well. – JANO Mar 04 '22 at 09:55
  • @AmanArora I added a second edit with the line that removes "Video_Category_Name" – JANO Mar 04 '22 at 09:55