How to get numbers out of string?

Question

I'm using a Java StreamTokenizer to extract the various words and numbers of a String but have run into a problem where numbers which include commas are concerned, e.g. 10,567 is being read as 10.0 and ,567.

I also need to remove all non-numeric characters from numbers where they might occur, e.g. $678.00 should be 678.00 or -87 should be 87.

I believe these can be achieved via the whitespaceChars and wordChars methods, but does anyone have any idea how to do it?

The basic StreamTokenizer code at present is:

        BufferedReader br = new BufferedReader(new StringReader(text));
        StreamTokenizer st = new StreamTokenizer(br);
        st.parseNumbers();
        st.wordChars(44, 46); // ASCII comma, - , dot.
        st.wordChars(48, 57); // ASCII 0 - 9.
        st.wordChars(65, 90); // ASCII upper case A - Z.
        st.wordChars(97, 122); // ASCII lower case a - z.
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {                    
                System.out.println("String: " + st.sval);
            }
            else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                System.out.println("Number: " + st.nval);
            }
        }
        br.close();

Or could someone suggest a REGEXP to achieve this? I'm not sure if REGEXP is useful here given that any parsing would take place after the tokens are read from the string.

Thanks

Mr Morgan.

What should happen to `1,2,3,4`? – polygenelubricants Jul 17 '10 at 18:37

score 9 · Accepted Answer · edited Mar 28 '12 at 15:43

StreamTokenizer is outdated; it is better to use Scanner. Here is sample code for your problem:

    // Requires: import java.util.Scanner;
    String s = "$23.24 word -123";
    Scanner fi = new Scanner(s);
    // anything other than alphanumeric characters,
    // comma, dot or negative sign is skipped
    fi.useDelimiter("[^\\p{Alnum},\\.-]");
    while (true) {
        if (fi.hasNextInt())
            System.out.println("Int: " + fi.nextInt());
        else if (fi.hasNextDouble())
            System.out.println("Double: " + fi.nextDouble());
        else if (fi.hasNext())
            System.out.println("word: " + fi.next());
        else
            break;
    }

If you want the comma to be treated as the decimal separator, use fi.useLocale(Locale.FRANCE);
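To make the locale point concrete, here is a minimal sketch. The class name `ScannerNumbers` and the helper `tokens` are made up for this example and are not part of any library. It sets the locale explicitly instead of relying on the JVM default, and it adds a `+` to the delimiter (my change, not in the code above) so that runs of skipped characters do not produce empty tokens. With `Locale.US` the comma in `10,567` is read as a thousands separator; with `Locale.FRANCE` the comma in `12,5` is read as the decimal separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerNumbers {
    // Hypothetical helper: labels each token of text as an int, a double or a word,
    // using the number format of the given locale.
    static List<String> tokens(String text, Locale locale) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useLocale(locale);
            // Skip anything that is not alphanumeric, comma, dot or minus sign.
            sc.useDelimiter("[^\\p{Alnum},\\.-]+");
            while (sc.hasNext()) {
                if (sc.hasNextInt()) {
                    out.add("Int: " + sc.nextInt());
                } else if (sc.hasNextDouble()) {
                    out.add("Double: " + sc.nextDouble());
                } else {
                    out.add("word: " + sc.next());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("$23.24 word -123 10,567", Locale.US));
        System.out.println(tokens("12,5", Locale.FRANCE));
    }
}
```

Note that a sign such as the minus in `-123` is kept by the Scanner, so dropping it (as the question asks for `-87`) would still need something like `Math.abs` afterwards.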

This is extremely helpful. And I've already added a few other characters to it. Many thanks. – Mr Morgan Jul 17 '10 at 19:01

score 5 · Answer 2 · edited Jul 17 '10 at 17:58

Try this:

    String sanitizedText = text.replaceAll("[^\\w\\s\\.]", "");

sanitizedText will contain only word characters (letters, digits, underscore), whitespace and dots; tokenizing it after that should be a breeze.

EDIT

Edited to retain the decimal point as well (at the end of the bracket). Outside a bracket, . is "special" to regexp and needs a backslash escape; inside a bracket the escape is optional but harmless.
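A small sketch of what this pattern does in practice. The class and method names are illustrative only. The second method is an assumption on my part about which extra characters one might want to keep (hyphen and apostrophe, for names); adjust the bracket contents to taste.

```java
class TextSanitizer {
    // The pattern from this answer: drop everything except word characters,
    // whitespace and dots. Commas and hyphens are removed too.
    static String sanitize(String text) {
        return text.replaceAll("[^\\w\\s\\.]", "");
    }

    // Assumed variant: additionally keep apostrophes and hyphens.
    // The hyphen goes last in the bracket so it is taken literally.
    // A leading minus on a number survives and must be handled separately.
    static String sanitizeKeepingNamePunctuation(String text) {
        return text.replaceAll("[^\\w\\s.'-]", "");
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign, written as an escape to avoid encoding issues.
        System.out.println(sanitize("\u00A3345.67 Albany-Caxton 10,567"));
        System.out.println(sanitizeKeepingNamePunctuation("\u00A3345.67 Albany-Caxton O'Finnegan -87"));
    }
}
```

The first call turns `10,567` into `10567` and `Albany-Caxton` into `AlbanyCaxton`; the second keeps the name punctuation but also keeps the minus in `-87`.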

edited Jul 17 '10 at 17:58

answered Jul 17 '10 at 17:51

Carl Smotricz

Thanks. Seems to work but with a number of £345.67, it returns 34567.00. – Mr Morgan Jul 17 '10 at 17:56
Easy. Just add inside the brackets any other characters you'd like to keep. I'll fix that up for you... – Carl Smotricz Jul 17 '10 at 17:58
This just might have solved a major problem. And after this parsing is done, I can just call the StreamTokenizer as above. Thanks. – Mr Morgan Jul 17 '10 at 18:00
I do notice though that double barrelled names are altered, e.g. Albany-Caxton becomes AlbanyCaxton. Can this be prevented? – Mr Morgan Jul 17 '10 at 18:13
Certainly, if you add a '-' at the end of the bracket. However, you may encounter negative numbers if you do. But then you can fix those with a simple `if` test. – Carl Smotricz Jul 17 '10 at 18:25
I have - working and can test for negative numbers. But what about an apostrophe in a name like O'Finnegan? – Mr Morgan Jul 17 '10 at 18:39
This is an incorrect solution. It will not handle decimals or 1000s separators in currency correctly. Scanner as @tulskiy suggested is a correct and easier solution built into the JavaSE library. – Alain O'Dea Jul 17 '10 at 18:55
My hat is off to @tulskiy, his solution is much easier to localize than mine. I'm giving him an upvote for his better solution. – Carl Smotricz Jul 17 '10 at 19:44

score 4 · Answer 3 · edited May 08 '17 at 13:58

This worked for me:

    String onlyNumericText = text.replaceAll("\\D", "");
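One caveat worth seeing concretely: `\D` matches every non-digit character, so the decimal point and the minus sign disappear along with currency symbols and commas. A small sketch (the class and method names are illustrative only):

```java
class DigitsOnly {
    // Removes every character that is not a digit.
    static String stripNonDigits(String text) {
        return text.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonDigits("10,567"));  // thousands separator removed
        System.out.println(stripNonDigits("-87"));     // sign removed
        System.out.println(stripNonDigits("$678.00")); // decimal point removed as well
    }
}
```

So this suits integers such as `10,567` or `-87`, but `$678.00` comes out as `67800`, which is probably not what the question wants.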

edited May 08 '17 at 13:58

Kirill Kulakov

answered Dec 20 '12 at 08:49

mordekhai

After the edit a \ too much has sneaked in. Should be \\D. – Michael Chatiskatzi Feb 27 '21 at 11:19

score 1 · Answer 4 · answered Aug 06 '10 at 15:40

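    // Keeps only the digit characters of str, e.g. "1,222" becomes "1222".
    static String digitsOnly(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            if (Character.isDigit(str.charAt(i))) {
                sb.append(str.charAt(i));
            }
        }
        return sb.toString();
    }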

answered Aug 06 '10 at 15:40

ankushb

score 0 · Answer 5 · answered Jul 17 '10 at 17:49

Sure, this can be done with regexp:

    s/[^\d\.]//g

However, notice that it eats all commas, which is probably what you want if you are using the American number format, where the comma only separates thousands. In some languages the comma is used instead of the point as the decimal separator, so take care when parsing international data.

I leave it to you to translate this to Java.
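For what it's worth, one possible Java translation is sketched below. The class and method names are my own, and the second method is an addition rather than part of the regexp above: it uses `java.text.NumberFormat` to let the locale decide what the comma and the point mean, which is one way to handle the international case.

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class NumberCleaner {
    // Java equivalent of s/[^\d\.]//g: keep only digits and dots.
    static String keepDigitsAndDots(String text) {
        return text.replaceAll("[^\\d.]", "");
    }

    // Locale-aware alternative: the locale determines which character is the
    // grouping separator and which is the decimal separator.
    static double parseLocalized(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new BigDecimal(keepDigitsAndDots("$10,567.25")));
        System.out.println(parseLocalized("10,567.25", Locale.US));
        System.out.println(parseLocalized("10.567,25", Locale.GERMANY));
    }
}
```

The regexp version assumes the input uses the point as the decimal separator; `parseLocalized` expects the text to start with the number, so currency symbols would still have to be stripped first.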

answered Jul 17 '10 at 17:49

gorn

That's why I want to leave the commas in place. – Mr Morgan Jul 17 '10 at 17:56
I thought you needed the number itself, not its string representation. Never mind. – gorn Jul 17 '10 at 20:18
