I exploded a dataset:
import pandas as pd
data_frame = pd.DataFrame({'Name': ['1a-c', '2a-f'],
                           'Dimensions': ['a: 5cm, b: 6cm, c: 7cm', 'a: 4cm, c: 3cm, e: 5cm']})
or
Name Dimensions
1a-c a: 5cm b: 6cm c: 7cm
2a-f a: 4cm c: 3cm e: 5cm
to look like this:
Name Dimensions
1a a: 5cm b: 6cm c: 7cm
1b a: 5cm b: 6cm c: 7cm
1c a: 5cm b: 6cm c: 7cm
2a a: 4cm c: 3cm e: 5cm
2b a: 4cm c: 3cm e: 5cm
2c a: 4cm c: 3cm e: 5cm
2d a: 4cm c: 3cm e: 5cm
2e a: 4cm c: 3cm e: 5cm
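But now I want to create a function so that, if 1a is in column 'Name', the value in 'Dimensions' is replaced with just the content after a: and before b:; likewise, if 1b is in 'Name', the value in 'Dimensions' is replaced with the content after b:, and so on. Where a label has no dimension given, the result should be nan: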
Name Dimensions
1a 5cm
1b 6cm
1c 7cm
2a 4cm
2b nan
2c 3cm
2d nan
2e 5cm
i.e.
data_frame = pd.DataFrame({'Name': ['1a', '1b', '1c'], 'Dimensions': ['5cm', '6cm', '7cm']})
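To pin down the intended mapping, here is a minimal sketch of it done row by row. It assumes the label is always the last character of 'Name' and that each dimension is written as `label: value`; the helper name `pick_dimension` is hypothetical and only used for illustration.

```python
import re

import numpy as np
import pandas as pd

# The exploded frame from above (rows 1a-1c and 2a-2e).
data_frame = pd.DataFrame({
    'Name': ['1a', '1b', '1c', '2a', '2b', '2c', '2d', '2e'],
    'Dimensions': ['a: 5cm b: 6cm c: 7cm'] * 3 + ['a: 4cm c: 3cm e: 5cm'] * 5,
})

def pick_dimension(row):
    # Hypothetical helper: parse "label: value" pairs into a dict, then
    # look up the label taken from the last character of 'Name'.
    pairs = dict(re.findall(r'([a-z]):\s*([^\s,]+)', row['Dimensions']))
    return pairs.get(row['Name'][-1], np.nan)

data_frame['Dimensions'] = data_frame.apply(pick_dimension, axis=1)
print(data_frame)
```

Labels with no matching dimension (2b and 2d here) come out as nan, which matches the table above.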
I think the correct regex for a: would be
(?<=a:)(.*?)(?=[b-z]:)
This will capture the content between a: and b: (or, in some cases, between a: and c:, or a: and z:, if the values in between are not supplied in the dataframe).
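As a quick sanity check of that pattern outside pandas, the sketch below runs it on one sample string with Python's `re` module. The sample strings are taken from the example data; the `.strip()` is an addition to drop the surrounding spaces the lazy group picks up.

```python
import re

# Lookbehind for "a:", lazy capture, lookahead for the next "label:".
pattern = re.compile(r'(?<=a:)(.*?)(?=[b-z]:)')

sample = 'a: 5cm b: 6cm c: 7cm'
match = pattern.search(sample)
captured = match.group(1).strip() if match else None
print(captured)  # 5cm

# The lookahead needs a later label to exist, so a string that
# ends right after the a: value gives no match at all.
only_a = pattern.search('a: 5cm')
print(only_a)  # None
```

One consequence worth noting: the same construction for the last label in a string (for example c: in the first row) would find nothing, because no further `label:` follows it.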
But I am not sure how to write out a function that will extract that expression and apply it only to rows where 'a' is in column 'Name' and 'a' is also in column 'Dimensions' (the reason for this condition is that, in the dataframe, the dimensions for a-z are not always given).
I wrote out this function using .loc, but I think I am misusing .extract():
data_frame.loc[(data_frame.Name.str.contains('a', na=False)) & (data_frame.Dimensions.str.contains('a:', na=False)), 'Dimensions']=data_frame.loc[(data_frame.Name.str.contains('a', na=False)) & (data_frame.Dimensions.str.contains('a:', na=False)), 'Dimensions'].str.extract('(?<=a:)(.*?)(?=[b-z]:)')
However, the .extract() result does not overwrite the 'Dimensions' value in the row where 'Name' is 1a.
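One thing that may be relevant here: `Series.str.extract` defaults to `expand=True`, which returns a DataFrame (with a column named 0) even for a single capture group, whereas `expand=False` returns a Series. The sketch below, using only the three 1a-1c rows as sample data, compares the two and assigns the Series version through .loc. It is an illustration of the difference, not necessarily the best fix.

```python
import pandas as pd

data_frame = pd.DataFrame({
    'Name': ['1a', '1b', '1c'],
    'Dimensions': ['a: 5cm b: 6cm c: 7cm'] * 3,
})

# Same row condition as in the attempt above.
mask = (data_frame['Name'].str.contains('a', na=False)
        & data_frame['Dimensions'].str.contains('a:', na=False))

pattern = r'(?<=a:)(.*?)(?=[b-z]:)'

# Default expand=True: a one-column DataFrame.
as_frame = data_frame.loc[mask, 'Dimensions'].str.extract(pattern)
# expand=False with one capture group: a Series.
as_series = data_frame.loc[mask, 'Dimensions'].str.extract(pattern, expand=False)

print(type(as_frame).__name__, type(as_series).__name__)

# Assign the Series (whitespace stripped) back to the masked rows only.
data_frame.loc[mask, 'Dimensions'] = as_series.str.strip()
print(data_frame)
```

With this sample, only the 1a row is rewritten; the 1b and 1c rows keep their full 'Dimensions' strings until the same step is repeated for their labels.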
Many thanks for any suggestions or help!