I have an input column, Dimension, and I want to derive an output in a specific form from the text in that column.

I have used

df['output'] = df['Dimension'].str.extractall(r'(\d*\.?\d+)').astype(float).unstack().prod(axis=1) 

but I'm not able to get the desired output for Model E. Please help me do this in Python.

Model  Dimension                     output
A      4.31 m x 2 m x 3.222 m        27.77364
B      220m                          220
C      'St 473m                      473
D      rangeZng 2x250m               500
E      Original 250ml 2s 35%         500
F      Qstd 550ml 1+1                550
G      very good cream 250ml 2s 35%  500
H      very good cream 250ml 2s 45%  500
Fedor

1 Answer

You could change your regex so that it skips numbers followed by `%` and numbers next to a `+`:

df['output'] = (df['Dimension'].str.extractall(r'(?<![+])(\d*\.?\d+)(?![%+])(?=\D|$)')
                .astype(float).unstack().prod(axis=1)
               )

Output:

  Model                     Dimension     output
0     A        4.31 m x 2 m x 3.222 m   27.77364
1     B                          220m  220.00000
2     C                      'St 473m  473.00000
3     D               rangeZng 2x250m  500.00000
4     E         Original 250ml 2s 35%  500.00000
5     F                Qstd 550ml 1+1  550.00000
6     G  very good cream 250ml 2s 35%  500.00000
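
To see why Model E differs between the two patterns, here is a minimal sketch using only the standard library `re` module instead of pandas. The helper names `kept_numbers` and `product` are made up for illustration; the two patterns are copied from the question and the answer. The final lookahead `(?=\D|$)` is what stops the engine from keeping the `3` of `35%` as a partial match.

```python
import math
import re

# Pattern from the question: grabs every number, including the "35" in "35%".
ORIGINAL = re.compile(r'(\d*\.?\d+)')
# Pattern from the answer: skips numbers preceded by "+" or followed by "%" or "+",
# and requires the match to end at a non-digit or at the end of the string.
REVISED = re.compile(r'(?<![+])(\d*\.?\d+)(?![%+])(?=\D|$)')


def kept_numbers(pattern, text):
    """Return the numeric substrings that `pattern` captures in `text`."""
    return pattern.findall(text)


def product(pattern, text):
    """Multiply the captured numbers, mirroring the per-row product in pandas."""
    return math.prod(float(n) for n in kept_numbers(pattern, text))


text_e = "Original 250ml 2s 35%"
print(kept_numbers(ORIGINAL, text_e))  # ['250', '2', '35']
print(kept_numbers(REVISED, text_e))   # ['250', '2']
print(product(REVISED, text_e))        # 500.0
print(kept_numbers(REVISED, "Qstd 550ml 1+1"))  # ['550']
```

Note that `math.prod` returns 1 for a string with no numbers, whereas the pandas version simply has no matches for such a row, so this sketch is only meant for inspecting the regex.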

mozway
  • I ran this but it says ValueError: Index contains duplicate entries, cannot reshape. – Chayan Banerjee Jun 23 '23 at 10:59
  • @CBCB please provide a [minimal reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) that reproduces the error, the code works well with the current input. Very likely you first need to reset your index: `df = df.reset_index(drop=True)` – mozway Jun 23 '23 at 11:18
  • You can also replace `.unstack().prod(axis=1)` by `.groupby(level=0).prod()` (this will aggregate by index, so be careful if you have duplicates!) – mozway Jun 23 '23 at 11:24
  • Yes, I have many duplicates in my dataset, and they are in different columns. How do I get the output if I have duplicates? @mozway – Chayan Banerjee Jul 03 '23 at 08:42
  • What matters is that your index is not duplicated (run `df = df.reset_index(drop=True)` before my code), but please focus your question on a specific issue. It looks to me that the core problem here (extracting the product of numbers) is solved, no? Also, why don't you provide a reproducible example (output of `df.to_dict('tight')` as [edit](https://stackoverflow.com/posts/76420393/edit) to your question) to avoid wasting time? – mozway Jul 03 '23 at 08:48
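
The two suggestions from the comments (resetting the index, then aggregating with `groupby(level=0)`) can be combined as in the sketch below. The sample frame and its duplicated index are assumptions made up for illustration, since no reproducible example was posted; selecting column `0` of the `extractall` result is also a choice made here so that the assignment receives a Series.

```python
import pandas as pd

PATTERN = r'(?<![+])(\d*\.?\d+)(?![%+])(?=\D|$)'

# Hypothetical frame whose index labels repeat (0, 0, 1, 1), e.g. after a concat.
df = pd.DataFrame(
    {
        "Model": ["A", "D", "E", "F"],
        "Dimension": [
            "4.31 m x 2 m x 3.222 m",
            "rangeZng 2x250m",
            "Original 250ml 2s 35%",
            "Qstd 550ml 1+1",
        ],
    },
    index=[0, 0, 1, 1],
)

# With repeated index labels, the (index, match) pairs are no longer unique,
# which is the situation in which unstack() reports that it cannot reshape.
try:
    df["Dimension"].str.extractall(PATTERN).astype(float).unstack()
    unstack_failed = False
except ValueError:
    unstack_failed = True
print("unstack failed:", unstack_failed)

# Give every row a unique label first, then multiply the matches per row.
df = df.reset_index(drop=True)
matches = df["Dimension"].str.extractall(PATTERN)[0].astype(float)
df["output"] = matches.groupby(level=0).prod()
print(df)
```

Without the `reset_index` step, `groupby(level=0)` would multiply together the numbers of every row sharing an index label, which is the caveat raised in the comment above.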