2

I have a dataset with a complex column in pandas. One of column product_info has various types of contents :

#Input type1
df['productInfo'][0]
#Output type1
'Salt & pepper shakers,Material: stoneware,Dimensions: 
H6.5cm,Dachshund designs,1x black and tan, 1x brown,Hand 
painted,Dishwasher safe'
#Output type2
'Pineapple string lights,Dimensions: 400x6x10cm,10 pineapple shaped LED lights,In a gold hue,3x AA batteries required (not included)'
#Output type 3
''

So essentially my productInfo column contains the above 3 kinds of contents. What i want is to get the Material for groupby analysis: extracted from the productInfo column of the dataframe, of course only when these values exist, if they don't, just set these values as null/None or whatever

I have tried boolean masks but can't seem to make them work, anyone with any suggestion is highly appreciated.

Thanks in advance

Edit: this was my original df: original df

My df after extracting Material from ProductInfo: df after extracting Material from ProductInfo

My df after extracting Material and Dimensions from ProductInfo: enter image description here

Hopefully, you guys get an idea of what I'm trying to achieve. Most of my task is to do text extraction from complex text blobs inside each column. If I find the relevant columns from the text clumps using regex then I update the columns else make them null. It has proven to be a big challenge, please if any of you guys can help me extract the useful info like Material and Dimensions from the productInfo text clump to their own columns, that'd be very helpful of you guys.

Thanks in Advance for anyone who tries to help me and sorry for my vague question without providing relevant information.

Happy Panda-ing(If that's a word!!) :)

  • I don't know what your expected output is... can you give me an idea? – cs95 Dec 27 '17 at 05:46
  • Is this what you are after? https://stackoverflow.com/questions/37600711/pandas-split-column-into-multiple-columns-by-comma – Keith Dec 27 '17 at 06:12
  • Thanks, Keith but the problem of mine is a bit trickier, I don't just have comma separation, As you can see my columns have the above 3 different output patterns. For output 1, I want to extract the Material and Dimensions using regex probably and put in Dimensions and Material Column, for output 2 I just want Dimensions and put in Dimensions column, for output 3, I want Null value in Dimensions and Material Column. Is there some way to do it elegantly in Pandas, my method is too convoluted – Siddharth Paul Dec 30 '17 at 04:26
  • What is wrong with your current approach using the `apply` function? Is it too slow? – Ryan Jan 07 '18 at 03:50
  • Its not Slow, Its too complicated! Too many regular expressions, Too much intermediary steps to get each data unit – Siddharth Paul Jan 15 '18 at 12:45

1 Answers1

0

I imported pandas and re

import pandas as pd
import re

I created a helper function that does a simple regex to get the material and dimensions. I delete the material and dimension strings from the original string, returning a Series with the updated description, material, and dimensions.

def get_material_and_dimensions(row):
    description = row['productInfo']

    material = re.search(r'Material: (.*?),', description)
    if material:
        material = material.group(1)
        description = description.replace(f'Material: {material},', '')

    dimensions = re.search(r'Dimensions: (.*?),', description)
    if dimensions:
        dimensions = dimensions.group(1)
        description = description.replace(f'Dimensions: {dimensions},', '')

    return pd.Series([description, material, dimensions], index=['description', 'material', 'dimensions'])

Apply the function to the DataFrame

myseries = df.apply(get_material_and_dimensions, axis=1)

Then add the series to the original DataFrame, replacing df['productInfo'] with the clean df['description']

df = df.join(myseries)
df['productInfo'] = df['description']
df.drop('description', inplace=True, axis=1)
Paulina Khew
  • 397
  • 4
  • 13