I'm using Java and Jsoup to extract the content of the div tag. I need to extract only the numbers.

String html = "";
Document document = Jsoup.parse(html);
Elements divs = document.select("div");
for (Element div : divs) {
    // ownText() returns only the text directly inside the div, not text of child elements
    System.out.println(div.ownText());
}

The output looks something like this:

Adidas, 45-46 Nike, 25 shoes, phone, keyboard, 1–2, 4–5, 7, 9, 12, 13, 32, 35,

How can I extract only the numbers from the text of the div tag? The items are separated by commas. How can I do this with a regex? Thank you.

Update: How can I extract both the numbers and the Roman numerals?

Adidas, 45-46 Nike, 25 shoes, phone, keyboard, 1–2, 4–5, 7, 9, 12, 13, 32, 35, V, VI, IX, 

This is not a duplicate of the question linked above, because I also need to extract Roman numerals.

Ro Yo Mi
audrey ruaburo

2 Answers

Description

This regex does the following:

  • Matches all numeric strings such as 2, 3977, 432, 5
  • Matches all ranges of numeric strings such as 2-4, 553-999, 1234-9876
  • Matches all valid Roman numerals in the range 1-4999 (M{0,4} allows up to four leading M's)
  • Returns only these values, with no additional capture groups

The Regex

\b(?:\d+(?:-\d+)?|(?=[MCDLXVI]+\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\b

Note that this is the raw regex. In languages such as Java, where the pattern is written as a string literal, each \ must be escaped as \\ for it to work correctly.

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      -                        '-'
----------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [MCDLXVI]+               any character of: 'M', 'C', 'D', 'L',
                               'X', 'V', 'I' (1 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    M{0,4}                   'M' (between 0 and 4 times (matching the
                             most amount possible))
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      CM                       'CM'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      CD                       'CD'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      D?                       'D' (optional (matching the most
                               amount possible))
----------------------------------------------------------------------
      C{0,3}                   'C' (between 0 and 3 times (matching
                               the most amount possible))
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      XC                       'XC'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      XL                       'XL'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      L?                       'L' (optional (matching the most
                               amount possible))
----------------------------------------------------------------------
      X{0,3}                   'X' (between 0 and 3 times (matching
                               the most amount possible))
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      IX                       'IX'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      IV                       'IV'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      V?                       'V' (optional (matching the most
                               amount possible))
----------------------------------------------------------------------
      I{0,3}                   'I' (between 0 and 3 times (matching
                               the most amount possible))
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
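The look-ahead in the table above matters because every component of the Roman-numeral branch is optional, so without it that branch can succeed on an empty string at any word boundary. The sketch below compares the branch with and without the look-ahead. The class and helper names (`LookaheadDemo`, `findAll`) are hypothetical and only serve the illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are not from the original answer.
public class LookaheadDemo {
    // Roman-numeral branch of the regex: every component may match nothing.
    static final String ROMAN_BODY =
        "M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})";

    // With the look-ahead, at least one numeral letter must follow.
    static final Pattern WITH_LOOKAHEAD =
        Pattern.compile("\\b(?=[MCDLXVI]+\\b)" + ROMAN_BODY + "\\b");

    // Without it, the branch can match the empty string at a word boundary.
    static final Pattern WITHOUT_LOOKAHEAD =
        Pattern.compile("\\b" + ROMAN_BODY + "\\b");

    // Collects every match of the pattern in the text, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "shoes, IX";
        System.out.println(findAll(WITH_LOOKAHEAD, text));
        System.out.println(findAll(WITHOUT_LOOKAHEAD, text));
    }
}
```

With the look-ahead, only `IX` is reported for this input. Without it, the result also contains empty strings picked up at word boundaries such as the start of `shoes`.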

Examples

Live Demo

http://fiddle.re/pvjzra

Sample Text

Adidas, 45-46 Nike, 25 shoes, phone, keyboard, 1-2, 4-5, 7, 9, 12, 13, 32, 35, V, VI, IX

Java Code Example

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    String sourcestring = "Adidas, 45-46 Nike, 25 shoes, phone, keyboard, 1-2, 4-5, 7, 9, 12, 13, 32, 35, V, VI, IX";
    // All groups are non-capturing, so only group 0 (the whole match) exists.
    // Note: CASE_INSENSITIVE also lets lowercase words that form valid numerals (e.g. "mix") match.
    Pattern re = Pattern.compile("\\b(?:\\d+(?:-\\d+)?|(?=[MCDLXVI]+\\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\\b", Pattern.CASE_INSENSITIVE);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    // find() advances to the next match on each call
    while (m.find()) {
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

Matched Array

$matches Array:
(
    [0] => Array
        (
            [0] => 45-46
            [1] => 25
            [2] => 1-2
            [3] => 4-5
            [4] => 7
            [5] => 9
            [6] => 12
            [7] => 13
            [8] => 32
            [9] => 35
            [10] => V
            [11] => VI
            [12] => IX
        )

)
Ro Yo Mi
  • I tried running your code, but it gives me the wrong output: it splits 1-2 and 4-5 at the hyphen. [0][0] = 45-46 [1][0] = 25 [2][0] = 1 [3][0] = 2 [4][0] = 4 [5][0] = 5 [6][0] = 7 [7][0] = 9 [8][0] = 12 [9][0] = 13 [10][0] = 32 [11][0] = 35 [12][0] = V [13][0] = VI [14][0] = IX – audrey ruaburo Apr 27 '16 at 06:15
  • @audreyruaburo, this happens because the character in your string looks like a hyphen but is subtly different (it is an en dash). I reran this and saw my instance of Java having a hard time parsing it. I'll update my answer and replace the odd hyphens `–` with real hyphens `-`. – Ro Yo Mi Apr 27 '16 at 19:12
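If the input cannot be cleaned up beforehand, one option is to let the range separator be either an ASCII hyphen or an en dash (U+2013) and normalize the matches afterwards. The following is a minimal sketch under that assumption. The class and method names (`DashTolerantExtractor`, `extract`) are hypothetical, and unlike the answer's code it omits `CASE_INSENSITIVE`, so only uppercase numerals match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the original answer.
public class DashTolerantExtractor {
    // Same regex as the answer, except the range separator is a character
    // class accepting '-' or an en dash (written as a regex escape).
    static final Pattern NUMBERS = Pattern.compile(
        "\\b(?:\\d+(?:[-\\u2013]\\d+)?"
        + "|(?=[MCDLXVI]+\\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\\b");

    // Returns every number, range, and Roman numeral found in the text.
    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = NUMBERS.matcher(text);
        while (m.find()) {
            // Normalize en dashes so every range uses an ASCII hyphen.
            result.add(m.group().replace('\u2013', '-'));
        }
        return result;
    }

    public static void main(String[] args) {
        // The en dashes from the question are written as Unicode escapes.
        String text = "Adidas, 45-46 Nike, 25 shoes, phone, keyboard, "
            + "1\u20132, 4\u20135, 7, 9, 12, 13, 32, 35, V, VI, IX, ";
        System.out.println(extract(text));
    }
}
```

Normalizing inside the loop keeps the original text untouched. Replacing the en dashes in the whole input before matching would work as well.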

You can use this Regex:

\b(\d+(-\d+)?|(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})))\b

Demo: https://regex101.com/r/rW1mY1/3

Explanation:

  1. \b matches a word boundary.
  2. (M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})) is a Roman numeral validator. I got it from here: How do you match only valid roman numerals with a regular expression?
  3. \d+(-\d+)? matches a number, optionally followed by a hyphen and a second number (a range).
Aminah Nuraini
  • How can I apply it in Java? I tried, but it doesn't work as expected: String pattern = "\\b(\\d+(-\\d+)?|(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})))\\b"; – audrey ruaburo Apr 27 '16 at 00:51
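One way to apply this pattern in Java is to loop with `Matcher.find()` and read only group 0 (the whole match). The inner capture groups belong to individual alternatives and can be `null` for any given match. The sketch below assumes the input uses ASCII hyphens; with en dashes, a range such as 1–2 would come back as two separate numbers. The names `SecondRegexDemo` and `extract` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
public class SecondRegexDemo {
    // The regex from this answer, split across lines for readability only.
    static final Pattern NUMBERS = Pattern.compile(
        "\\b(\\d+(-\\d+)?|("
        + "M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
        + "|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
        + "|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})"
        + "|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})"
        + "))\\b");

    // Returns the full text of every match, ignoring the inner groups.
    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = NUMBERS.matcher(text);
        while (m.find()) {
            result.add(m.group()); // group() with no argument is group 0
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Adidas, 45-46 Nike, 25 shoes, phone, keyboard, "
            + "1-2, 4-5, 7, 9, 12, 13, 32, 35, V, VI, IX";
        System.out.println(extract(text));
    }
}
```

Using `find()` rather than `matches()` is the important choice here: `matches()` requires the entire input to be a single number or numeral, so it would return false for a comma-separated line like this one.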