I've been trying to extract decimal numbers from strings in sparklyr, but it does not work with the regular syntax you would normally use outside of Spark.

I have tried using regexp_extract but it returns empty strings.

regexp_extract($170.5M, "[[:digit:]]+\\.*[[:digit:]]*")

I'm trying to get 170.5 as a result.

J.C.

2 Answers

You could use regexpr from base R:

v <- "$170.5M"
regmatches(v, regexpr("\\d*\\.\\d", v))
# [1] "170.5"
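
If the values can carry more than one decimal digit, or no decimal part at all, a slightly broader pattern may be safer. The following is a minimal base R sketch, not Spark code; the input vector is hypothetical (only "$170.5M" comes from the question), and it assumes the first number in each string is the one you want.

```r
# Hypothetical inputs; only "$170.5M" appears in the question.
v <- c("$170.5M", "$12.75K", "$3B")

# \\d+ after the dot keeps every decimal digit, and the optional group
# lets plain integers such as "3" match as well.
pattern <- "\\d+(\\.\\d+)?"

# regexpr() finds the first match per element; regmatches() extracts it.
nums <- as.numeric(regmatches(v, regexpr(pattern, v)))
nums
# [1] 170.50  12.75   3.00
```

Note that regmatches() drops elements with no match, so the result can be shorter than the input when some strings contain no digits.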
jay.sf

You may use

regexp_extract(col_value, "[0-9]+(?:[.][0-9]+)?")

Or

regexp_extract(col_value, "\\p{Digit}+(?:\\.\\p{Digit}+)?")

Your [[:digit:]]+\.*[[:digit:]]* regex does not work because regexp_extract expects a Java-compatible regex pattern, and that engine does not support POSIX character classes in the [:classname:] syntax. You may write the POSIX digit class as \p{Digit} instead; see the Java regex documentation.

See regexp_extract documentation:

Extract a specific(idx) group identified by a java regex, from the specified string column.

Wiktor Stribiżew
  • regexp_extract(col_value, "[0-9]+(?:[.][0-9]+)?") doesn't work when you have a value like .85 – Neels Jul 05 '23 at 19:10
  • @Neels It depends on the numeric format you need to extract. If `.` can come with no digit in front, use `"[0-9]*[.]?[0-9]+"`. There are a [ton of number matching regexps](https://stackoverflow.com/q/14550526/3832970), this answer shows how to use them. – Wiktor Stribiżew Jul 05 '23 at 20:07
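
The difference between the two patterns discussed in these comments can be checked locally before running anything in Spark. This is a rough sketch in base R rather than sparklyr: it assumes that these simple constructs (bracketed classes, quantifiers, a non-capturing group) behave the same in PCRE as in the Java engine, and the sample strings are hypothetical apart from the `.85` case.

```r
# Hypothetical sample values; ".85" is the case raised in the comment.
x <- c("$170.5M", "$.85M", "42 units")

# Requires at least one digit before an optional fractional part.
strict <- "[0-9]+(?:[.][0-9]+)?"
# Lets the integer part be missing, so ".85" is kept whole.
loose <- "[0-9]*[.]?[0-9]+"

# perl = TRUE selects PCRE, which accepts the (?:...) group syntax.
strict_hits <- regmatches(x, regexpr(strict, x, perl = TRUE))
loose_hits <- regmatches(x, regexpr(loose, x, perl = TRUE))

strict_hits
# [1] "170.5" "85"    "42"
loose_hits
# [1] "170.5" ".85"   "42"
```

With the stricter pattern the leading dot of `.85` is lost and only `85` comes back, which is the behaviour reported above; the looser pattern keeps `.85` while still matching whole integers.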