https://regex101.com/ <- for those who want to test regex.

I'm working on an Indonesian price parser.

Say I have the examples below:

1) 150 k
2) 150 kilobyte
3) 150 ka
4) 150 k2
5) 150 k)
6) 150 k.

We know that 1), 5), and 6) can be prices, while the rest obviously cannot.
My real regex is a bit more complicated, but for simplicity,

let's say my regex is: [0-9]+(\s*[k])

This matches all of them, 1) through 6).

So I added [^0-9a-zA-Z] to the regex: [0-9]+(\s*[k])[^0-9a-zA-Z]

Now I get only 1), 5), and 6), which is fine.
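To make the behaviour concrete, here is a minimal Go sketch that runs the second pattern over the six examples, assuming they arrive as one newline-separated string. The names `samples`, `withSuffix`, and `matchesWithSuffix` are made up for this illustration and are not part of the real parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// samples holds the six example lines, one per line (hypothetical input).
const samples = "150 k\n150 kilobyte\n150 ka\n150 k2\n150 k)\n150 k.\n"

// withSuffix is the second pattern: the trailing character class
// consumes one extra character after the "k".
var withSuffix = regexp.MustCompile(`[0-9]+(\s*[k])[^0-9a-zA-Z]`)

// matchesWithSuffix returns every full match of withSuffix in text.
func matchesWithSuffix(text string) []string {
	return withSuffix.FindAllString(text, -1)
}

func main() {
	// %q makes the trailing character of each match visible.
	for _, m := range matchesWithSuffix(samples) {
		fmt.Printf("%q\n", m)
	}
}
```

Each match carries the extra character that the negated class consumed: a newline for 1), a `)` for 5), and a `.` for 6).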

However, the problem is that the matches include an unnecessary suffix like [ ) , ]

How can I match just '150 k', without a suffix like [ ) , ] that is not part of the price information?

Do I need one more step after matching 5) and 6) to strip those suffixes manually?

Thank you in advance for any ideas.

Kobi
kispi
  • @Adrian - You shouldn't have removed the Go tag. The Regex description asks for the programming language - "Since regular expressions are not fully standardized, all questions with this tag should also include a tag specifying the applicable programming language or tool". Many things are not implemented in re2, so the Go tag is useful. Thanks! – Kobi Sep 09 '17 at 05:59
  • Then that should be mentioned in the question body, not just in a tag. With no mention whatsoever in the body, the tag is pointless. – Adrian Sep 09 '17 at 12:00

2 Answers

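You can use a word boundary - \b. It is a zero-width assertion: it checks that the k is not followed by a letter, digit, or underscore, but it does not include the next character in the match. You can also use one at the start, instead of the space: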

\b[0-9]+\s*k\b

Working example: https://regex101.com/r/RAF2Vg/3
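Since the question concerns Go, here is a minimal sketch of the same pattern with the standard `regexp` package, which supports `\b` as an ASCII word boundary. The names `samples`, `pricePattern`, and `extractPrices` are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// samples holds the six example lines from the question (hypothetical input).
const samples = "150 k\n150 kilobyte\n150 ka\n150 k2\n150 k)\n150 k.\n"

// pricePattern requires a word boundary on both sides. \b is zero-width,
// so the character after the "k" is never part of the match.
var pricePattern = regexp.MustCompile(`\b[0-9]+\s*k\b`)

// extractPrices returns every price-like match found in text.
func extractPrices(text string) []string {
	return pricePattern.FindAllString(text, -1)
}

func main() {
	for _, m := range extractPrices(samples) {
		fmt.Printf("%q\n", m)
	}
}
```

With the six examples this yields three matches, each exactly `150 k`, coming from 1), 5), and 6).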

Kobi

I think (\d+\s*k)\b will serve your purpose. It checks that a word boundary has been reached after the 'k'. A word boundary matches before any non-word character, yes, even a ), and also at the end of the input. Look at this example
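A minimal Go sketch of this pattern, reading the price from the capture group. The function name `extractPrice` and its return shape are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// groupPattern captures the number and the "k"; the trailing \b only
// asserts a boundary and adds nothing to the match.
var groupPattern = regexp.MustCompile(`(\d+\s*k)\b`)

// extractPrice returns the first captured price in s, and false when
// s contains no match.
func extractPrice(s string) (string, bool) {
	m := groupPattern.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true // m[0] is the whole match, m[1] the first group
}

func main() {
	for _, s := range []string{"150 k)", "150 k.", "150 kilobyte"} {
		price, ok := extractPrice(s)
		fmt.Printf("%q -> %q, %v\n", s, price, ok)
	}
}
```

Because `\b` is zero-width, the whole match and the group are identical here, so the group is mainly useful if more context is later added around it. Note that without a leading `\b` this pattern would also match the `150 k` inside something like `a150 k`.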

Marc Lambrichs