
I'm working with the following string:

'"name": "Gnosis", \n        "symbol": "GNO", \n        "rank": "99", \n        "price_usd": "175.029", \n        "price_btc": "0.0186887", \n        "24h_volume_usd": "753877.0"'

I have to use re.sub() in Python to remove only the double quotes (") that enclose the numbers, so that I can parse the string as JSON later. I've tried several regular expressions without success. Here is my best attempt:

exp = re.compile(r': (")\D+\.*\D*(")', re.MULTILINE)
response = re.sub(exp, "", string)

I've searched a lot for a similar problem but have not found a similar question.

EDIT:

In the end I used this (thanks to S. Kablar):

formatted = re.sub(r'"(-*\d+(?:\.\d+)?)"', r"\1", string)
parsed = json.loads(formatted)

The problem is that this endpoint returns badly formatted JSON, with the numbers encoded as strings.

Other users answered "Parse the string first with json, and later convert numbers to float" using a for loop. I think that is a very inefficient way to do it, and it also forces you to choose between int and float for the values in your response. To settle the question, I wrote this gist, which compares the different approaches with benchmarks, and for now I'm going to trust regex in this case.
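
A comparison along those lines can be sketched with the standard `timeit` module. This is only an illustrative harness, not the gist mentioned above: the sample payload, the function names `with_regex` and `with_loop`, and the run count are assumptions, and the timings depend on the machine and the payload size, so none are quoted here.

```python
import json
import re
import timeit

# Hypothetical sample payload, shaped like the string in the question.
RAW = '{"name": "Gnosis", "rank": "99", "price_usd": "175.029", "24h_volume_usd": "753877.0"}'
NUMBER = re.compile(r'"(-?\d+(?:\.\d+)?)"')


def with_regex(raw=RAW):
    # Strip the quotes around numeric values, then parse.
    return json.loads(NUMBER.sub(r"\1", raw))


def with_loop(raw=RAW):
    # Parse first, then convert numeric strings one by one.
    result = {}
    for key, value in json.loads(raw).items():
        try:
            value = int(value) if value.strip().isdigit() else float(value)
        except ValueError:
            pass
        result[key] = value
    return result


# Both approaches should agree on this sample before timing them.
assert with_regex() == with_loop()

for func in (with_regex, with_loop):
    seconds = timeit.timeit(func, number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s for 10,000 runs")
```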

Thanks everyone for your help

  • Where are the double quotes (which you call commas) coming from? Have you tried something ad hoc like `json.loads('{' + yourstring + '}')` and checked the result? – Jon Clements Feb 03 '18 at 18:23
  • Using regex is not the correct way of doing such things. As stated in the comments to S.Kablar's answer, a string of `"\"43"` will get corrupted by your regular expression replacement. Never use regular expressions on structured data; use the intended parser. – Daniel Feb 03 '18 at 23:06
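
The corruption described in the last comment can be reproduced with a short sketch. The key name `note` and the sample document are made up for illustration; the pattern is a simplified form of the one used in this thread.

```python
import json
import re

NUMBER = re.compile(r'"(-?\d+(?:\.\d+)?)"')

# Valid JSON whose string value starts with an escaped quote: the value is "43 (quote included).
text = r'{"note": "\"43"}'
print(json.loads(text))  # {'note': '"43'}

# The regex sees "43" (escaped quote, digits, closing quote) and strips both quotes.
corrupted = NUMBER.sub(r"\1", text)
print(corrupted)  # {"note": "\43}

try:
    json.loads(corrupted)
except json.JSONDecodeError as error:
    print("no longer valid JSON:", error)
```

The substitution has no notion of where a JSON string starts or ends, so it removes the closing quote of the value along with the escaped one.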

3 Answers

8

Regex: "(-?\d+(?:[\.,]\d+)?)" Substitution: \1

Details:

  • () Capturing group
  • (?:) Non capturing group
  • \d Matches a digit (equal to [0-9])
  • + Matches between one and unlimited times
  • ? Matches between zero and one times
  • \1 Group 1.

Python code:

import re

def remove_quotes(text):
    # Drop the quotes around anything that looks like a number
    return re.sub(r"\"(-?\d+(?:[\.,]\d+)?)\"", r'\1', text)

print(remove_quotes('"percent_change_7d": "-23.43"'))
# "percent_change_7d": -23.43
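
As a rough end-to-end sketch, the same idea can be applied to the fragment from the question and the result parsed with `json.loads`. Two assumptions are made here: the fragment is wrapped in braces to form a JSON object, and the pattern accepts only a dot as decimal separator, because an unquoted `1,5` would not be a valid JSON number.

```python
import json
import re


def remove_quotes(text):
    # Variant that accepts only "." as the decimal separator (an assumption of this sketch).
    return re.sub(r'"(-?\d+(?:\.\d+)?)"', r"\1", text)


fragment = (
    '"name": "Gnosis", \n        "symbol": "GNO", \n        "rank": "99", \n'
    '        "price_usd": "175.029", \n        "price_btc": "0.0186887", \n'
    '        "24h_volume_usd": "753877.0"'
)

# Wrap the fragment in braces so that it is a complete JSON object.
parsed = json.loads("{" + remove_quotes(fragment) + "}")
print(parsed)
# {'name': 'Gnosis', 'symbol': 'GNO', 'rank': 99, 'price_usd': 175.029,
#  'price_btc': 0.0186887, '24h_volume_usd': 753877.0}
```

The key `"24h_volume_usd"` is left alone because the digits after its opening quote are not followed directly by a closing quote.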
Srdjan M.
2

Parse the string first with json, and later convert numbers to floats:

import json

string = '{"name": "Gnosis", \n        "symbol": "GNO", \n        "rank": "99", \n        "price_usd": "175.029", \n        "price_btc": "0.0186887", \n        "24h_volume_usd": "753877.0"}'

data = json.loads(string)
response = {}
for key, value in data.items():
    try:
        value = int(value) if value.strip().isdigit() else float(value)
    except ValueError:
        pass
    response[key] = value
Daniel
  • This works, and is more maintainable than a regexp (I'm upvoting even if I'm still wary of edge cases :-) ). It can also be adapted to complex JSON values with [this answer](https://stackoverflow.com/questions/10756427/loop-through-all-nested-dictionary-values) – LSerni Feb 03 '18 at 18:51
  • This approach does not work because the JSON parser processes the numbers as strings. To be more precise, the string is part of the response returned by [this endpoint](https://api.coinmarketcap.com/v1/ticker). – Álvaro Mondéjar Feb 03 '18 at 18:59
  • @ÁlvaroMondéjar. What makes you think it doesn't work? Have you actually tried it? – ekhumoro Feb 03 '18 at 19:10
  • @Álvaro Mondéjar: that is the idea behind this: convert the strings to numbers afterwards with Python. It is much more stable than using regex. – Daniel Feb 03 '18 at 19:17
  • Isn't regex plus `json.loads()` better? [See this gist](https://gist.github.com/mondeja/a8215d34ee1c0c850d7b3f64ab6b2260). – Álvaro Mondéjar Feb 03 '18 at 20:09
  • And what if you want to preserve both ints and floats? This is not a good idea in this case. – Álvaro Mondéjar Feb 03 '18 at 20:20
  • @ÁlvaroMondéjar: test for int added. By the way, JSON only knows floats. – Daniel Feb 03 '18 at 20:27
  • So what? In Python you can decode non floating point numbers as ints. You can see in [`json.JSONDecoder()` documentation](https://docs.python.org/3/library/json.html#encoders-and-decoders). – Álvaro Mondéjar Feb 03 '18 at 21:13
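
The parse-then-convert idea discussed in these comments can be extended to nested data while keeping ints and floats apart. The following is a minimal sketch, not code from any of the answers: the helper names `to_number` and `convert`, the strict patterns, and the sample payload (including the made-up `tags` key) are all assumptions.

```python
import json
import re

# Strict patterns, so that strings such as "inf" or "nan" are never converted.
INT_RE = re.compile(r"-?\d+")
FLOAT_RE = re.compile(r"-?\d+\.\d+")


def to_number(value):
    """Return int or float for numeric strings, anything else unchanged."""
    if isinstance(value, str):
        if INT_RE.fullmatch(value):
            return int(value)
        if FLOAT_RE.fullmatch(value):
            return float(value)
    return value


def convert(node):
    """Walk dicts and lists recursively and convert numeric strings in place of the leaves."""
    if isinstance(node, dict):
        return {key: convert(child) for key, child in node.items()}
    if isinstance(node, list):
        return [convert(child) for child in node]
    return to_number(node)


raw = '[{"name": "Gnosis", "rank": "99", "price_usd": "175.029", "tags": ["1", "x"]}]'
print(convert(json.loads(raw)))
# [{'name': 'Gnosis', 'rank': 99, 'price_usd': 175.029, 'tags': [1, 'x']}]
```

The regular expressions here are applied to individual, already-decoded values, so they cannot damage the JSON structure itself.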
1

You came close. You want to keep the numbers and the colon, so those are what you need to put in parentheses, not the rest. Also, digits are \d, not \D (which matches non-digits).

So:

exp = re.compile(r'(: *)"(\d+\.?\d*)"', re.MULTILINE)
response = re.sub(exp, "\\1\\2", string)

\d+\.?\d*  means "one or more digits, an optional point, then any number of digits"
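
A small runnable sketch of this pattern follows. The function name `unquote_numbers` and the sample inputs are illustrative assumptions. Note that the pattern has no minus sign, so negative values keep their quotes.

```python
import re

# Group 1 keeps the colon and spaces, group 2 keeps the number; the quotes are dropped.
PATTERN = re.compile(r'(: *)"(\d+\.?\d*)"')


def unquote_numbers(text):
    return PATTERN.sub(r"\1\2", text)


print(unquote_numbers('"rank": "99", "price_usd": "175.029"'))
# "rank": 99, "price_usd": 175.029

# Negative numbers are not covered by this pattern and are left untouched.
print(unquote_numbers('"percent_change_7d": "-23.43"'))
# "percent_change_7d": "-23.43"
```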

Border cases

The above doesn't cover ".125", which has no digits before the point.

And if you changed it to "\d*\.?\d*", that would also match ".", since it means "any digits, an optional point, any digits".

I think the only practicable way is

 (\d+\.?\d*|\.\d+)

with | meaning "or": either one or more digits optionally followed by a point and any digits (this matches "17."), or a point followed by at least one digit. Simply requiring digits on both sides does not work either: "\d+\.?\d+" does not match "5".

Or you specify all three cases:

 (\d+|\d+\.\d+|\.\d+)

First integers (\d+), then floats with digits on both sides of the point, then decimal parts alone without a leading zero.
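
The border cases can be checked mechanically with `re.fullmatch`. In this sketch the labels, the sample list, and the helpers `accepted` and `parses_as_json` are made up for illustration. It also checks which samples `json.loads` accepts once unquoted, since JSON itself requires digits on both sides of the decimal point.

```python
import json
import re

PATTERNS = {
    "simple": r"\d+\.?\d*",
    "too loose": r"\d*\.?\d*",
    "alternation": r"\d+\.?\d*|\.\d+",
}
SAMPLES = ["5", "17.5", "17.", ".125", "."]


def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [sample for sample in SAMPLES if re.fullmatch(pattern, sample)]


for name, pattern in PATTERNS.items():
    print(f"{name:12} {accepted(pattern)}")
# simple       ['5', '17.5', '17.']
# too loose    ['5', '17.5', '17.', '.125', '.']
# alternation  ['5', '17.5', '17.', '.125']


def parses_as_json(text):
    """True if the text, without quotes, is accepted by json.loads."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print([sample for sample in SAMPLES if parses_as_json(sample)])
# ['5', '17.5']
```

So a value such as "17." or ".125" would match the wider patterns, but removing its quotes would produce text that `json.loads` rejects.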

LSerni