0

I exploded a dataset:

data_frame = pd.DataFrame({'Name': ['1a-c' '2a-f],
    'Dimensions': ['a: 5cm, b: 6cm,  c: 7cm'], ['a: 4cm, c: 3cm, f: 5cm']})

or

Name Dimensions
1a-c a: 5cm b: 6cm c: 7cm
2a-f a: 4cm c: 3cm e: 5cm

to look like this:

Name     Dimensions
1a     a: 5cm b: 6cm c: 7cm
1b     a: 5cm b: 6cm c: 7cm
1c     a: 5cm b: 6cm c: 7cm
2a     a: 4cm c: 3cm f: 5cm
2b     a: 4cm c: 3cm e: 5cm
2c     a: 4cm c: 3cm e: 5cm
2d     a: 4cm c: 3cm e: 5cm
2e     a: 4cm c: 3cm e: 5cm

But now I want to create a function so that if 1a is in column 'Name', then replace 'Dimensions' just with content after a: and before b:; likewise, if 1b, replace row in 'Dimensions' with content after b:

Name     Dimensions
1a       5cm
1b       6cm
1c       7cm
2a       4cm
2b       nan
2c       3cm
2d       nan
2e       5cm

i.e.
data_frame= pd.DataFrame({'Name': ['1a', '1b', '1c'] Dimensions=['5c', '6cm', '7cm']})

I think the correct regex expression for a: would be

(?<=a:)(.*?)(?=[b-z]):
This will capture the content between a: and b: (or, in some cases, a: and c: or a: and z: if those values are not supplied in the dataframe)

But I am not sure how to write out a function that will extract that expression and apply it only to rows where 'a' is in column 'Name' and 'a' is also in column 'Dimensions' (the reason for this condition is that, in the dataframe, the dimensions for a-z are not always given).

I wrote out this function using .loc but I think I am misusing .extract()

data_frame.loc[(data_frame.Name.str.contains('a', na=False)) & (data_frame.Dimensions.str.contains('a:', na=False)), 'Dimensions']=data_frame.loc[(data_frame.Name.str.contains('a', na=False)) & (data_frame.Dimensions.str.contains('a:', na=False)), 'Dimensions'].str.extract('(?<=a:)(.*?)(?=[b-z]:)')

.extract() does not write over the row 'Name: 1a'.

Many thanks for any suggestions or help!

kylemaxim
  • 95
  • 5
  • It's rather difficult to determine what's in the df. `Name` and `Dimensions` are different columns in your example code, but in the df it looks like a single column `Name Dimensions` with a long string in it. Can you show [minimal code](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) to create this df and show desired output as a valid df or readable print that illustrates all needed information (types, etc)? Thanks. – ggorlen Aug 25 '20 at 21:38
  • @ggorlen Sorry about that. I edited it -- hopefully it is clearer now. – kylemaxim Aug 25 '20 at 21:50
  • Thanks but the df in the top example is invalid. Is it an array of strings? If the syntax was valid I could punch this into my repl and start helping you out but as it stands I'm confused as to what sort of data we're dealing with. – ggorlen Aug 25 '20 at 21:54
  • @ggorlen I edited it again. Hopefully it is now valid (I did not create the dataframe by code, so I am not entirely sure how to code that--basically, it is a dataframe with two columns 'Name' and 'Dimensions' and I have exploded an axis 1a-c into three axes 1a, 1b, 1c) – kylemaxim Aug 25 '20 at 22:06

1 Answers1

0

Here is a start to get you close to your desired output:

data_frame = pd.DataFrame({'Name': ['1a-c'], 'Dimensions': ['a: 5cm, b: 6cm,  c: 7cm']})

data_frame = data_frame.set_index('Name')

d = data_frame['Dimensions'].str.split(', |: ', expand=True)\
        .rename(columns=lambda x: f'Meas_{x//2}' if x%2 else f'Dim_{x//2}')

df_out = pd.wide_to_long(d.reset_index(), ['Dim', 'Meas'], 'Name', 'No', sep='_' )\
           .reset_index()

print(df_out)

Output:

  Name  No Dim Meas
0  1a-c   0   a  5cm
1  1a-c   1   b  6cm
2  1a-c   2   c  7cm

Update using new data: Where data_frame,

   Name              Dimensions
0  1a-c  a: 5cm, b: 6cm, c: 7cm
1  2a-f  a: 4cm, c: 3cm, e: 5cm

Use this:

data_frame = data_frame.set_index('Name')

d = data_frame['Dimensions'].str.split(', |: ', expand=True)\
        .rename(columns=lambda x: f'Meas_{x//2}' if x%2 else f'Dim_{x//2}')

df_out = pd.wide_to_long(d.reset_index(), ['Dim', 'Meas'], 'Name', 'No', sep='_' )\
           .reset_index()

df_out['Name'] = df_out['Name'].str[0] + df_out['Dim']

df_out[['Name','Meas']].sort_values('Name')

Output:

  Name Meas
0   1a  5cm
2   1b  6cm
4   1c  7cm
1   2a  4cm
3   2c  3cm
5   2e  5cm
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • I see that this would work for a simplified dataset as I presented, but the problem is, in some cases, a row might have dimensions for 1a-1z; sometimes it might just have dimensions for a, c and d, or just d. So the index won't realign because I will have 1a, 1b and 1c but maybe only two values. This is why the solution I was aiming for used .loc[] so that if '1a' is in 'Name', a: height would be extracted for that row, but my code doesn't achieve that... – kylemaxim Aug 26 '20 at 00:08
  • 1
    Add some varying data to our questions. I think we can handled your case. – Scott Boston Aug 26 '20 at 00:41
  • Thanks for this. It's not quite what I had in mind. I don't want to drop a row (like 2b and 2d in this scenario) but I'll think about how I can apply your solution! – kylemaxim Aug 27 '20 at 11:47