0

So I have a DataFrame which looks like this:

df = pd.DataFrame({'feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"]})

df
       feature1  feature2                     unparsed_features
    0        34         1  neoclassical, heavy, $2, old, bronze
    1        45         0                 romanticism, gold, $5
    2        52         1                  baroque, xs, $3, new

I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.

I have a general idea of what each column can possibly be as shown below:

main_dict = {
 'weight': ['heavy','light'],
 'age': ['new','old'],
 'colour': ['gold','silver','bronze'],
 'size': ['xs','s','m','l','xl','xxl','xxxl'],
 'price': ['$'],
 'period': ['renaissance','baroque','rococo','neoclassical','romanticism']
}

Ideally I would like my Dataframe to look like the following:

df
   feature1  feature2                     unparsed_features weight price  age  \
0        34         1  neoclassical, heavy, $2, old, bronze  heavy    $2  old   
1        45         0                 romanticism, gold, $5           $5        
2        52         1                  baroque, xs, $3, new           $3  new   

  size  colour        period  
0       bronze  neoclassical  
1         gold   romanticism  
2   xs               baroque

I know the first step would be to split the string by comma but I am lost after that.

df['unparsed_features'].str.split(',')

Thank you for your help.

user3374113
  • 593
  • 1
  • 7
  • 20
  • You may need create you own dict , for those certain item belong to which field – BENY Nov 13 '18 at 15:11
  • Hi, this question may help: https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows Answer (scroll down a bit): `df.unparsed_features.str.split(",", expand=True)` Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok. – Evan Nov 13 '18 at 15:27
  • 1
    Possible duplicate of [pandas: How do I split text in a column into multiple rows?](https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows) – Evan Nov 13 '18 at 15:28

2 Answers2

0

Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:

for key, list_item in main_dict.items():
    if key =='price':
        df[key] = df.unparsed_features.str.extract('(\$\d+)').fillna('')
    else:
        df[key] = df.unparsed_features.str.extract('((^|\W)' +'|(^|\W)'.join(list_item) + ')').fillna('')

\$\d+ allows to look for any digit after the symbol $ and (^|\W) look for a space or the beginning of the line before any word in list_item.

And you get as expected:

   feature1  feature2                     unparsed_features  weight   age  \
0        34         1  neoclassical, heavy, $2, old, bronze   heavy   old   
1        45         0                 romanticism, gold, $5                 
2        52         1                  baroque, xs, $3, new           new   

    colour size price        period  
0   bronze         $2  neoclassical  
1     gold         $5   romanticism  
2            xs    $3       baroque  
Ben.T
  • 29,160
  • 6
  • 32
  • 54
0

Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach

for keys in main_dict:
    data_list = []
    for value in df.unparsed_features: # for every row
        for l_data in main_dict[keys]:
            if keys == 'price':
                matching = [v for v in value.split(',') if l_data in v]
            else:
                matching = [v for v in value.split(',') if l_data == v.strip()]

            if matching:
                break

        if matching:
            data_list.append(matching[0])
        else:
            data_list.append(None)

        matching = ''  
    df[keys] = data_list

Output

   feature1  feature2                     unparsed_features  weight   age  \
0        34         1  neoclassical, heavy, $2, old, bronze   heavy   old   
1        45         0                 romanticism, gold, $5    None  None   
2        52         1                  baroque, xs, $3, new    None   new   

    colour  size price        period  
0   bronze  None    $2  neoclassical  
1     gold  None    $5   romanticism  
2     None    xs    $3       baroque  
iamklaus
  • 3,720
  • 2
  • 12
  • 21