0

I'm getting very confused by regular expressions and I need help. I have the following string:

x='def{{{12.197835/// -0.001172, 12.19788 7.3E-5, //+{{12.196705 -1.7E-5, 12.196647 -0.001189///}}}Def'

This string comes from one cell of a specific column in a pandas DataFrame. Each cell contains different unwanted characters, mainly letters, "/" and "{".

I want to have this output:

x='12.197835,-0.001172, 12.19788,7.3E-5,12.196705 ,-1.7E-5, 12.196647 -0.001189'

That is, remove everything that is not part of a number, but keep a "-" that comes before a number and an exponent such as "E-5", where "E-" has a digit before it.

I used this expression to get only the digits:

print(re.findall(r"\d+\.*\d*",x))
>>>['12.197835', '0.001172', '12.19788', '7.3', '5', '12.196705', '1.7', '5', '12.196647', '0.001189']

My problem is that this expression does not preserve the '-' or the 'E'. I tried to keep them with the following expression:

print(re.findall(r"\d+\.*\d*",x) or (r"^-?[0-9]\d+\.*\d+*\[E-]",x))

but I get the same output:


>>>['12.197835', '0.001172', '12.19788', '7.3', '5', '12.196705', '1.7', '5', '12.196647', '0.001189']

I thought this might be because I'm using "or" and the first condition is already satisfied, so I also tried "and", but that gives very weird results:

>>>('^-?[0-9]\\d+\\.*\\d+*\\[E-]', 'def{{{12.197835/// -0.001172, 12.19788 7.3E-5, //+{{12.196705 -1.7E-5, 12.196647 -0.001189///}}}Def')

My end goal is to reduce the first string to only the digits, '-', and any 'E' that is followed by '-' (the desired output):

x='12.197835,-0.001172, 12.19788,7.3E-5,12.196705 ,-1.7E-5, 12.196647 -0.001189'
Reut
  • You can use OR like `re.findall(r'pattern1|pattern2', x)`. Not like `re.findall(...) or (r'...', x)` – Wiktor Stribiżew Nov 26 '20 at 14:02
  • Why not use a proper parser for that format? – superb rain Nov 26 '20 at 14:03
  • @superb rain This is part of a lambda function running on a very long pandas DataFrame. If you have any idea how to do it better, I would love to hear it :) – Reut Nov 26 '20 at 14:03
  • @WiktorStribiżew I have now tried `re.findall(r'\d+\.*\d*|^-?[0-9]\d+\.*\d+*\[E-]', x)` but got the error "error: multiple repeat at position 27". Maybe I didn't understand you correctly. – Reut Nov 26 '20 at 14:07
  • Replace `+*` with `*`. Also, `^` matches at the start of the string, you need to remove the anchors. – Wiktor Stribiżew Nov 26 '20 at 14:10
  • Does this answer your question? [Regex to match scientific notation](https://stackoverflow.com/questions/41668588/regex-to-match-scientific-notation) – Egal Nov 26 '20 at 14:18
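
To illustrate the points raised in the comments above, here is a minimal sketch of why `re.findall(...) or (...)` returned the same list and why `and` returned the tuple, next to alternation written inside a single pattern. The shortened input string and the variable names are assumptions made for this illustration only.

```python
import re

# Shortened sample input, for illustration only.
x = 'def{{{12.197835/// -0.001172, 12.19788 7.3E-5}}}Def'

digits_only = re.findall(r"\d+\.*\d*", x)

# Python's `or` returns its first truthy operand. The list is non-empty,
# so the tuple on the right is never used (and never treated as a regex).
result_or = digits_only or (r"-?\d+", x)

# `and` returns its last operand when the first one is truthy,
# which is why the tuple itself is what came back.
result_and = digits_only and (r"-?\d+", x)

# Regex alternation belongs inside one pattern string, separated by `|`.
# The more specific alternative (with the exponent) is listed first.
alternation = re.findall(r"-?\d+\.\d+E-\d+|-?\d+\.\d+", x)

print(result_or)
print(result_and)
print(alternation)
```

Neither `or` nor `and` combines patterns: both operate on the Python values, not on the regular expressions.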

2 Answers

1

You may use

import re
x='def{{{12.197835/// -0.001172, 12.19788 7.3E-5, //+{{12.196705 -1.7E-5, 12.196647 -0.001189///}}}Def'
print(re.findall(r'[+-]?\d*\.?\d+(?:[eE][+-]?\d+)?', x))  # Extracting all numbers into a list
# => ['12.197835', '-0.001172', '12.19788', '7.3E-5', '12.196705', '-1.7E-5', '12.196647', '-0.001189']
print(",".join(re.findall(r'[+-]?\d*\.?\d+(?:[eE][+-]?\d+)?', x))) # Creating a comma-separated string
# => 12.197835,-0.001172,12.19788,7.3E-5,12.196705,-1.7E-5,12.196647,-0.001189

See the Python demo and the regex demo.

Regex details

  • [+-]? - an optional + or -
  • \d* - zero or more digits
  • \.? - an optional .
  • \d+ - one or more digits
  • (?:[eE][+-]?\d+)? - an optional occurrence of e or E followed by an optional + or - and then one or more digits.
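
The pieces listed above can also be written as one commented pattern using `re.VERBOSE`, which may be easier to maintain. The following is only a sketch: the names `NUMBER` and `extract_numbers` are hypothetical, and converting the matches to `float` is an extra step that is not part of the original answer.

```python
import re

# Same pattern as above, split across lines with re.VERBOSE so that
# each piece can carry a comment.
NUMBER = re.compile(r"""
    [+-]?              # optional sign
    \d*                # zero or more digits
    \.?                # optional decimal point
    \d+                # one or more digits
    (?:[eE][+-]?\d+)?  # optional exponent, e.g. E-5
""", re.VERBOSE)


def extract_numbers(text):
    """Return every number found in text, converted to float."""
    return [float(match) for match in NUMBER.findall(text)]


x = 'def{{{12.197835/// -0.001172, 12.19788 7.3E-5, //+{{12.196705 -1.7E-5, 12.196647 -0.001189///}}}Def'
values = extract_numbers(x)
print(values)
```

Because the function takes a single string, it could presumably be applied to a DataFrame column cell by cell, for example with `Series.apply`.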
Wiktor Stribiżew
0

Hope this helps you (without using regex).

x='def{{{12.197835/// -0.001172, 12.19788 7.3E-5, //+{{12.196705 -1.7E-5, 12.196647 -0.001189///}}}Def'

x=x.replace('{','').replace('}','').replace('def','').replace('Def','').replace('/','').replace('  ',' ').replace(' ',',').replace(',,',',')

print(x)

[Result]:

12.197835,-0.001172,12.19788,7.3E-5,+12.196705,-1.7E-5,12.196647,-0.001189

AziMez
  • Thank you for your answer. The reason I don't use replace is that the characters are not always the same in my case; that was only one example. I'll edit my post. – Reut Nov 26 '20 at 14:12