I want to parse number words into numbers and get an error when the string does not fully express a valid number, for example:

"Twenty two" => 22
"One hundred forty four" => 144
"Twenty bla bla" => error
"One hundred forty thousand one" => error

I tried com.ibm.icu.text.RuleBasedNumberFormat, but its parse() method parses only the beginning of the string rather than the whole string. This is mentioned in the javadoc:
"Parses text from the beginning of the given string to produce a number. The method might not use the entire text of the given string."

The javadoc also mentions that a special rule set can be used, in combination with a RuleBasedCollator, to change the lenient-parsing behavior, but I'm struggling to achieve this.

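import com.ibm.icu.text.RuleBasedNumberFormat;

import java.text.ParseException;
import java.util.Locale;

public class NumFormatter {
    // Returns the parsed value, or -1 if parse() throws.
    public static int numberFromString(String number, Locale locale) {
        RuleBasedNumberFormat numberFormat = new RuleBasedNumberFormat(locale, RuleBasedNumberFormat.SPELLOUT);

        try {
            return numberFormat.parse(number).intValue();
        } catch (ParseException e) {
            return -1;
        }
    }
}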

import org.junit.Test;

import java.util.Locale;

import static org.junit.Assert.assertEquals;

public class NumFormatterTest {
    @Test
    public void formatNumber_fromString() {
        Locale locale = new Locale("en");
        assertEquals(22, NumFormatter.numberFromString("twenty two", locale));
        assertEquals(-1, NumFormatter.numberFromString("three blablabla ", locale)); // fails: it returns 3, not -1
    }
}

pom.xml
<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>60.2</version>
</dependency>

Has anyone had to deal with this before? Thank you in advance.

Ermal

1 Answer

The relevant part of the documentation is as follows:
To see how these rules actually work in practice, consider the following example: Formatting 25,340 with this rule set would work like this:

<< thousand >>  [the rule whose base value is 1,000 is applicable to 25,340]
twenty->> thousand >>   [25,340 over 1,000 is 25. The rule for 20 applies.]
twenty-five thousand >> [25 mod 10 is 5. The rule for 5 is "five."]
twenty-five thousand << hundred >>  [25,340 mod 1,000 is 340. The rule for 100 applies.]
twenty-five thousand three hundred >>   [340 over 100 is 3. The rule for 3 is "three."]
twenty-five thousand three hundred forty    [340 mod 100 is 40. The rule for 40 applies. Since 40 divides evenly by 10, the hyphen and substitution in the brackets are omitted.]
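
The walk-through above can be mirrored with a small recursive function. This is only a hand-written sketch of the divide/mod steps for English and the range 0 to 999,999; it is not ICU's rule engine, and the class and method names (SpelloutSketch, spell) are made up for illustration.

```java
class SpelloutSketch {
    private static final String[] ONES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"
    };
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"
    };

    // Hypothetical helper: spells out n by picking the largest applicable base
    // (1,000, then 100, then 10), spelling the quotient, then the remainder.
    static String spell(int n) {
        if (n < 0 || n >= 1_000_000) {
            throw new IllegalArgumentException("sketch only covers 0..999999");
        }
        if (n < 20) {
            return ONES[n];
        }
        if (n < 100) {
            // The hyphen and the ones digit are omitted when n divides evenly by 10.
            return TENS[n / 10] + (n % 10 == 0 ? "" : "-" + ONES[n % 10]);
        }
        if (n < 1000) {
            return spell(n / 100) + " hundred" + (n % 100 == 0 ? "" : " " + spell(n % 100));
        }
        return spell(n / 1000) + " thousand" + (n % 1000 == 0 ? "" : " " + spell(n % 1000));
    }

    public static void main(String[] args) {
        System.out.println(spell(25340)); // twenty-five thousand three hundred forty
    }
}
```

For 25,340 the recursion first spells 25 (the quotient by 1,000), then 340 (the remainder), which matches the order of the steps listed above.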

import com.ibm.icu.text.RuleBasedNumberFormat;

import java.text.ParseException;
import java.util.Locale;

public class NumberFormat {

    public static void main(String[] args) {
        Locale locale = new Locale("en");
        // Hyphenated input instead of "twenty two"
        int result = numberFromString("twenty-two", locale);
        System.out.println(result);
    }

    public static int numberFromString(String number, Locale locale) {
        RuleBasedNumberFormat numberFormat = new RuleBasedNumberFormat(locale, RuleBasedNumberFormat.SPELLOUT);

        try {
            return numberFormat.parse(number).intValue();
        } catch (ParseException e) {
            return -1;
        }
    }
}

You need to replace the spaces with hyphens (-).

huifer
  • Where can I find the rules you mentioned? Did you just copy and paste the algorithm described in the link I mentioned in the question? Anyway, adding a dash on the Java side does not give the result I am looking for. `int twenty = numberFromString("twenty-two-blablabla ", locale);` returns 22, but I want -1 (or anything indicating that the number is not valid). – Ermal May 27 '19 at 14:28
  • `"twenty-two-two".split("-");` You can split the string and filter out the tokens that are not number words. – huifer May 27 '19 at 23:50
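
One way to reject trailing garbage such as "three blablabla" is to parse with a ParsePosition and check that the whole (trimmed) string was consumed. The sketch below is a minimal illustration under a few assumptions: the helper name parseFully and the class StrictParse are made up, and the demo uses the JDK's java.text.NumberFormat on digit strings so that it runs without any dependency. It takes a java.text.Format because ICU's formats extend that class.

```java
import java.text.Format;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;
import java.util.OptionalInt;

class StrictParse {
    // Hypothetical helper: returns the parsed int only if the format consumed
    // the entire trimmed input; otherwise returns an empty OptionalInt.
    static OptionalInt parseFully(Format format, String text) {
        String trimmed = text.trim();
        ParsePosition pos = new ParsePosition(0);
        Object parsed = format.parseObject(trimmed, pos);
        boolean failed = !(parsed instanceof Number) || pos.getErrorIndex() >= 0;
        boolean leftover = pos.getIndex() != trimmed.length();
        if (failed || leftover) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(((Number) parsed).intValue());
    }

    public static void main(String[] args) {
        // JDK format used here only so the demo has no external dependency.
        Format format = NumberFormat.getIntegerInstance(Locale.ENGLISH);
        System.out.println(parseFully(format, "22"));           // OptionalInt[22]
        System.out.println(parseFully(format, "3 blablabla ")); // OptionalInt.empty
    }
}
```

With icu4j on the classpath, passing `new RuleBasedNumberFormat(locale, RuleBasedNumberFormat.SPELLOUT)` as the format should apply the same check to spelled-out numbers, since its parse position is documented to end at the first unconsumed character. This has not been tested against version 60.2.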