I'm trying to split a column called "category", which contains strings, into two new columns, "category" and "subcategory".

It's based on a Kickstarter dataset we collected from webrobots.io. Each "category" field contains a string that looks like this:

In: frame.category[1]
Out: {"id":325,"name":"Calendars","slug":"publishing/calendars","position":4,"parent_id":18,"color":14867664,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/publishing/calendars"}}} 

For every row, I'm trying to get the part after '"slug":"' and before the slash (publishing) into a new column "category", and the part after the slash and before the closing quotation mark (calendars) into a new column "subcategory". I've tried str.split and str.extract and presume that extract is what I need, but I'm very new to regular expressions, so all my attempts have failed.

This is what I've tried so far; it just gives me two columns, both containing NaN all the way through:

frame["category"].str.extract(r'(slug":")(/)')

It would be great if the result came out as two new columns, each holding one of the two words separated by the slash after "slug":"
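
For reference, the pattern above asks for a "/" immediately after slug":", which never occurs in the data, so every row fails to match and comes back as NaN. A minimal sketch of a pattern that captures the two words instead, assuming each cell is a plain string like the sample (the one-row frame here is made up for illustration, and the second group is optional in case some slugs have no slash):

```python
import pandas as pd

# Hypothetical two-row frame mirroring the sample shown above
frame = pd.DataFrame({
    "category": [
        '{"id":325,"name":"Calendars","slug":"publishing/calendars","position":4}',
        '{"id":18,"name":"Publishing","slug":"publishing","position":11}',
    ]
})

# Group 1: text after "slug":" up to the slash or closing quote
# Group 2 (optional): text after the slash up to the closing quote
pattern = r'"slug":"([^/"]+)(?:/([^"]+))?"'
parts = frame["category"].str.extract(pattern)
parts.columns = ["category", "subcategory"]
print(parts)
```

Each pair of parentheses becomes one output column, so the groups have to wrap the text you want to keep rather than the delimiters around it.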

Edit: Thanks to Nev1111's idea of treating the column as its own DataFrame, and to joris on this thread, I've arrived at the following code, which works perfectly, although it might not be the best solution:

# Parse each 'category' string into a dict, then expand every key into its own column
# (note: eval executes whatever text is in the cell, so only use it on trusted data)
df = frame['category'].map(eval).apply(pd.Series)
# Split "slug" on the slash and store the two parts as category and subcategory
frame[['category', 'subcategory']] = df['slug'].str.split('/', expand=True)
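
A possible variation on the code above, assuming every cell is valid JSON text: json.loads parses the string without executing it, and it also copes with JSON literals such as null that eval would reject. The two-row frame is made up for illustration.

```python
import json
import pandas as pd

# Hypothetical sample rows; the real column comes from the webrobots.io export
frame = pd.DataFrame({
    "category": [
        '{"id":325,"slug":"publishing/calendars","parent_id":18}',
        '{"id":21,"slug":"art/painting","parent_id":null}',
    ]
})

# Parse each JSON string into a dict, then pull out only the "slug" value
slugs = frame["category"].map(json.loads).map(lambda d: d["slug"])

# Split on the slash; expand=True returns one column per piece
frame[["category", "subcategory"]] = slugs.str.split("/", expand=True)
print(frame)
```

Pulling out only "slug" skips building a full DataFrame from every key, which is all the split needs.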

When printing "frame" I get the two new columns with category and subcategory.

2 Answers

Based on what you have shown us, that column has dtype object and holds dicts, so you can look up the key directly:

frame["category"].str.get('slug') 
BENY
  • Thanks a lot. I do get an output, but it just returns NaN for every row. ```python Out: 0 NaN 1 NaN 2 NaN 3 NaN 4 NaN 5 NaN 6 NaN 7 NaN 8 NaN 9 NaN 10 NaN ``` – Philip Tissot May 23 '19 at 10:10
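
The NaN result reported in the comment is consistent with the cells being JSON text rather than parsed dicts, in which case a key lookup has nothing to find. A quick sketch for checking which case applies, and for looking up the key once the text is parsed (the one-row frame is a made-up stand-in for the real data):

```python
import json
import pandas as pd

# Hypothetical stand-in for the real column
frame = pd.DataFrame({"category": ['{"id":325,"slug":"publishing/calendars"}']})

# Inspect the Python type of each cell: str means the JSON is still unparsed
cell_types = frame["category"].map(type).unique()
print(cell_types)

# If the cells are str, parse them first, then look up the key on each dict
parsed = frame["category"].map(json.loads)
slug = parsed.map(lambda d: d.get("slug"))
print(slug)
```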
0
from pandas import DataFrame

df=DataFrame( {"id":325,"name":"Calendars","slug":"publishing/calendars","position":4,"parent_id":18,"color":14867664,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/publishing/calendars"}}} ) 

df[['Category','Subcategory']]=df['slug'].str.split('/',expand=True)
Nev1111
  • Getting ValueError: Columns must be same length as key, can you help? :) ```python -> 3113 self._setitem_array(key, value) 3114 else: 3115 # set column ~\Anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_array(self, key, value) 3133 if isinstance(value, DataFrame): 3134 if len(value.columns) != len(key): -> 3135 raise ValueError('Columns must be same length as key') 3136 for k1, k2 in zip(key, value.columns): 3137 self[k1] = value[k2] ``` – Philip Tissot May 22 '19 at 19:27
  • Sorry if the comment doesn't make sense, I'm not very familiar with how things are done on Stack Overflow :) – Philip Tissot May 22 '19 at 19:28
  • Hi, it's ok. Did you call DataFrame constructor to create a dataframe? I'm going to update the answer now – Nev1111 May 22 '19 at 20:15
  • Hi, it works completely fine when I make a new dataframe from the content of one cell, but if I read the whole 'category' column as df and run it, I get KeyError: 'slug'. Please check my edited post :) – Philip Tissot May 23 '19 at 10:17
  • Try this: frame[['category','subcategory']]=df.str.split('/',expand=True) instead of frame[['category','subcategory']]=df['slug'].str.split('/',expand=True) – Nev1111 May 23 '19 at 14:03