From the following text I want to extract the number and the unit of measurement.

I have 2 possible cases:

This is some text 14.56 kg and some other text

or

This is some text kg 14.56 and some other text

I used | to match both cases. My problem is that it produces empty submatches, which gives me an incorrect number of matches.

This is my code:

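#include <iostream>
#include <regex>
#include <string>

int main() {
    std::smatch m;
    std::string myString = "This is some text kg 14.56 and some other text";

    // Alternative 1: number then unit. Alternative 2: unit then number.
    const std::regex myRegex(
            R"(([\d]{0,4}[\.,]*[\d]{1,6})\s+(kilograms?|kg|kilos?)|\s+(kilograms?|kg|kilos?)(\s+[\d]{0,4}[\.,]*[\d]{1,6}))",
            std::regex_constants::icase
    );

    if (std::regex_search(myString, m, myRegex)) {
        // m.size() counts every capture group, including the ones that did not participate.
        std::cout << "Size: " << m.size() << std::endl;
        for (std::size_t i = 0; i < m.size(); i++)
            std::cout << m[i].str() << std::endl;
    }
    else
        std::cout << "Not found!\n";
}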

OUTPUT:

Size: 5
kg 14.56


kg
14.56

I want an easy way to extract those 2 values, so my guess is that I want the following output:

WANTED OUTPUT:

Size: 3
kg 14.56
kg
14.56

This way I can always directly extract the 2nd and 3rd submatches, but in this case I would also need to check which one is the number. I know how to do it with 2 separate searches, but I want to do it the right way: with a single search, without using C++ to check whether a submatch is an empty string.

Boy

2 Answers

2

Using this regex, you just need the contents of Group 1 and Group 2

((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))\s*((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))

Explanation:

  • ((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
    • (?:kilograms?|kilos?|kg) - matches kilograms or kilogram or kilos or kilo or kg
    • | - OR
    • (?:\d{0,4}(?:\.\d{1,6})) - matches 0 to 4 digits followed by a dot and a decimal part of 1 to 6 digits
  • \s* - matches 0+ whitespaces
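Because either group can hold the unit or the number, the caller still has to tell them apart. A minimal C++ sketch of that check follows, assuming the regex above is used unchanged with std::regex. The helper name splitQuantity and the "starts with a digit or a dot" test are illustrative assumptions, not part of the answer.

```cpp
#include <cassert>
#include <cctype>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {number, unit} for the first match,
// or two empty strings if there is no usable match.
std::pair<std::string, std::string> splitQuantity(const std::string& text) {
    static const std::regex re(
        R"(((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))\s*((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6}))))",
        std::regex_constants::icase);

    std::smatch m;
    if (!std::regex_search(text, m, re)) return {};

    const std::string a = m[1].str();
    const std::string b = m[2].str();
    // Both groups always participate in a match, so neither is empty.
    assert(!a.empty() && !b.empty());

    // Assumption: a group is the number if it starts with a digit or a dot.
    auto looksNumeric = [](const std::string& s) {
        return s[0] == '.' || std::isdigit(static_cast<unsigned char>(s[0])) != 0;
    };
    const bool aNum = looksNumeric(a);
    const bool bNum = looksNumeric(b);

    // The pattern also accepts "kg kg" or "1.5 2.5"; reject those here.
    if (aNum == bNum) return {};
    return aNum ? std::make_pair(a, b) : std::make_pair(b, a);
}
```

Calling splitQuantity on either of the two example sentences from the question should yield the number in .first and the unit in .second. Note that only the first match in the text is inspected, and that this pattern requires a decimal part, so plain integers are not matched.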
Gurmanjot Singh
  • Hi, is there some way to return the groups in the same order? I read that ECMAScript does not allow named groups: https://stackoverflow.com/questions/16886992/c11-regex-capture-groups-by-name – Boy Jan 16 '19 at 03:56
  • By "same order", do you mean that for both the examples, group 1 should contain the numeric value and group 2 should contain the unit? – Gurmanjot Singh Jan 16 '19 at 05:28
  • Yes, exactly that. But my bigger problem is that I forgot to mention that it should also allow integers, not just floats :( I tried many things, but it always catches some other numbers that aren't next to a unit. If you are willing to help, I'll update the question. Anyway, thanks a lot for helping, man. – Boy Jan 16 '19 at 06:00
  • That integer problem can be solved with a slight modification, as shown [HERE](https://regex101.com/r/lYmx7l/3) – Gurmanjot Singh Jan 16 '19 at 06:15
  • Also, for the first problem, I think you can just programmatically check the contents of each group. If the contents are non-numeric, you have captured the unit; otherwise you have captured the value. – Gurmanjot Singh Jan 16 '19 at 06:19
  • Omg, 0 or 1 for a group, can't believe I haven't tried that! Thank you! Anyway, I encountered another problem, check this: https://regex101.com/r/kLnnOg/1 For the ordering problem, yes, I know I can do it fairly easily in code; I wanted to know if it is possible with regex. – Boy Jan 16 '19 at 06:33
  • I understand that it is happening because of the OR: it is not strict enough to ensure that each alternative inside the capturing groups appears only once. But thank you anyway, man, you taught me a lot. If I could, I would give you more points :) – Boy Jan 16 '19 at 06:56
  • No worries, brother. I will update the answer if I come up with a better solution that handles your negative test cases too. Will give it another try after my office hours. Glad that the solution helped you :) – Gurmanjot Singh Jan 16 '19 at 07:13
1

You can try this out:

((?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)))

As shown here: https://regex101.com/r/9O99Fz/3

USAGE -

As I've shown in the 'substitution' section, to reference the numeral part of the quantity, you have to write $2$5, and for the unit, write: $3$4

Explanation -

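The pattern consists of two alternatives, each containing two capturing groups: the first one, (?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg))), matches the number followed by the unit (groups 2 and 3),
and the other, (?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)), matches the unit followed by the number (groups 4 and 5). Only one alternative takes part in any given match, so the groups of the other alternative are empty; that is why $2$5 always yields the number and $3$4 always yields the unit.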
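The same idea can be sketched in C++ by concatenating the paired submatches, since sub_match::str() returns an empty string for a group that did not participate. One caveat: the ECMAScript grammar used by std::regex has no lookbehind, so this sketch swaps (?<!\d) for \b, which is close but not an identical condition. That substitution, the [.,] spelling of the character class, and the helper name extractQuantity are assumptions made for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {number, unit}, or two empty strings if
// nothing matches. Group numbering follows the pattern above:
// 2 = number, 3 = unit (number first); 4 = unit, 5 = number (unit first).
std::pair<std::string, std::string> extractQuantity(const std::string& text) {
    // Assumption: \b stands in for the unsupported lookbehind (?<!\d).
    static const std::regex re(
        R"(((?:\b(\d{1,4}(?:[.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[.,]\d{1,6})?))))",
        std::regex_constants::icase);

    std::smatch m;
    if (!std::regex_search(text, m, re)) return {};

    // Whole match plus five capture groups.
    assert(m.size() == 6);

    // Non-participating groups yield "", so each sum has exactly one
    // non-empty term: the equivalent of $2$5 and $3$4.
    return {m[2].str() + m[5].str(), m[3].str() + m[4].str()};
}
```

With this approach the caller never inspects which alternative matched: extractQuantity(text).first is the number and .second is the unit for both orderings, and integers such as "7 kilos" are accepted as well.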

Robo Mop