I have a regex which I wrote:

value='[A-Za-z]+\\,[0-9]+\\,([A-Za-z0-9]+)\\,([A-Za-z0-9]+)'>[A-Za-z0-9]+\\s-\\s(.*)?\\s\\(

It works fairly well, but the very end of it keeps matching everything.

For example, it is supposed to work on books and I'm testing it on the following:

value='C,201301,F110,JEWL1050'>JEWL1050 - Industry Skills I (F110)</option>
value='C,201301,F114,JEWL1050'>JEWL1050 - Industry Skills I (F114)</option>
value='C,201301,F114,JEWL1054'>JEWL1054 - Jewellery Rendering & Illustra (F114)</option>
value='C,201301,F110,JEWL2029'>JEWL2029 - Production Techniques B (F110)</option>
value='C,201301,F114,JEWL2029'>JEWL2029 - Production Techniques B (F114)</option>
value='C,201301,LIAD,LANG9066'>LANG9066 - Italian For Beginners (LIAD)</option>
value='C,201301,T302,LAW1151'>LAW1151 - Canandian & Environmental Law (T302)</option>
value='C,201301,T305,LAW1151'>LAW1151 - Canandian & Environmental Law (T305)</option>
value='C,201301,F402,LAW1152'>LAW1152 - International Law & Agreements (F402)</option>
value='C,201301,T302,LAW3201'>LAW3201 - Protection Legislation (T302)</option>
value='C,201301,T303,LAW3201'>LAW3201 - Protection Legislation (T303)</option>
value='C,201301,T304,LAW3201'>LAW3201 - Protection Legislation (T304)</option>

So for the first book, it should capture F110 as group 1, JEWL1050 as group 2, and Industry Skills I as group 3.

It captures the first two groups correctly, but not the last one. It captures - Industry Skills I (F110)</option> instead.

Any ideas how I can fix my regex? I can't seem to get it to capture the last group at all. Please help me. Thank you in advance.

icedwater
Brandon
  • Are you sure it's capturing `- Industry Skills I (F110)`? It doesn't even match the `-`. Are you printing the correct group? And what is the `?` for in the last capturing group? – justhalf Oct 18 '13 at 06:14
  • use an html parser not regex – Anirudha Oct 18 '13 at 06:14
  • @Anirudh - an html parser doesn't help with this... – libik Oct 18 '13 at 06:16
  • Do **not** parse html with regex is the answer. See [here](http://stackoverflow.com/questions/677038/how-to-use-regular-expressions-to-parse-html-in-java), and for a legendary thread, [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1) – Mena Oct 18 '13 at 06:17
  • Maybe you can remove the tags before using the regex. – Frank59 Oct 18 '13 at 06:18
  • There seems to be some redundancy here; is the `C,201301,` part needed? Most of the information is already between the tags. You could remove those first.. – icedwater Oct 18 '13 at 06:20
  • Hmm I see. Thank you all for the answers and no the C, 201301, is not needed. I only had problems getting the book title without the dash and space at the end. – Brandon Oct 18 '13 at 06:22

3 Answers

In theory, that should be working as-is.

Here's your proposed regex (with \\ changed to \, because the tool takes a raw regex rather than a Java string literal) applied to your sample input: http://regex101.com/r/hL8pZ8

This tool provides a "Java" checkbox as well, and even the corresponding Java code, although there's no permalink, so you'll have to input your regex (again with \ instead of \\) and sample data yourself: http://www.myregextester.com/index.php

That said, for posterity, here's its output:

Raw Match Pattern:

  value='[A-Za-z]+\,[0-9]+\,([A-Za-z0-9]+)\,([A-Za-z0-9]+)'>[A-Za-z0-9]+\s-\s(.*)?\s\(

Java Code Example:

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
    String sourcestring = "source string to match with pattern";
    Pattern re = Pattern.compile("value='[A-Za-z]+\\,[0-9]+\\,([A-Za-z0-9]+)\\,([A-Za-z0-9]+)'>[A-Za-z0-9]+\\s-\\s(.*)?\\s\\(");
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()){
      for (int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

$matches Array:
(
  [0] => Array
    (
      [0] => value='C,201301,F110,JEWL1050'>JEWL1050 - Industry Skills I (
      [1] => value='C,201301,F114,JEWL1050'>JEWL1050 - Industry Skills I (
      [2] => value='C,201301,F114,JEWL1054'>JEWL1054 - Jewellery Rendering & Illustra (
      [3] => value='C,201301,F110,JEWL2029'>JEWL2029 - Production Techniques B (
      [4] => value='C,201301,F114,JEWL2029'>JEWL2029 - Production Techniques B (
      [5] => value='C,201301,LIAD,LANG9066'>LANG9066 - Italian For Beginners (
      [6] => value='C,201301,T302,LAW1151'>LAW1151 - Canandian & Environmental Law (
      [7] => value='C,201301,T305,LAW1151'>LAW1151 - Canandian & Environmental Law (
      [8] => value='C,201301,F402,LAW1152'>LAW1152 - International Law & Agreements (
      [9] => value='C,201301,T302,LAW3201'>LAW3201 - Protection Legislation (
      [10] => value='C,201301,T303,LAW3201'>LAW3201 - Protection Legislation (
      [11] => value='C,201301,T304,LAW3201'>LAW3201 - Protection Legislation (
    )

  [1] => Array
    (
      [0] => F110
      [1] => F114
      [2] => F114
      [3] => F110
      [4] => F114
      [5] => LIAD
      [6] => T302
      [7] => T305
      [8] => F402
      [9] => T302
      [10] => T303
      [11] => T304
    )

  [2] => Array
    (
      [0] => JEWL1050
      [1] => JEWL1050
      [2] => JEWL1054
      [3] => JEWL2029
      [4] => JEWL2029
      [5] => LANG9066
      [6] => LAW1151
      [7] => LAW1151
      [8] => LAW1152
      [9] => LAW3201
      [10] => LAW3201
      [11] => LAW3201
    )

  [3] => Array
    (
      [0] => Industry Skills I
      [1] => Industry Skills I
      [2] => Jewellery Rendering & Illustra
      [3] => Production Techniques B
      [4] => Production Techniques B
      [5] => Italian For Beginners
      [6] => Canandian & Environmental Law
      [7] => Canandian & Environmental Law
      [8] => International Law & Agreements
      [9] => Protection Legislation
      [10] => Protection Legislation
      [11] => Protection Legislation
    )
)
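To reproduce that result without the online tool, here is a minimal Java sketch that runs the unchanged pattern from the question against two of the sample lines. The class name `CourseRegexDemo` and the helper `extract` are made up for illustration; only the pattern and the sample data come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the extract() helper are illustrative only.
class CourseRegexDemo {
    // The pattern from the question, unchanged, written as a Java string literal.
    static final Pattern COURSE = Pattern.compile(
        "value='[A-Za-z]+\\,[0-9]+\\,([A-Za-z0-9]+)\\,([A-Za-z0-9]+)'>[A-Za-z0-9]+\\s-\\s(.*)?\\s\\(");

    // Returns {group 1, group 2, group 3} for every match found in the input.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = COURSE.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two lines copied from the sample data in the question.
        String sample =
            "value='C,201301,F110,JEWL1050'>JEWL1050 - Industry Skills I (F110)</option>\n"
          + "value='C,201301,T302,LAW1151'>LAW1151 - Canandian & Environmental Law (T302)</option>\n";
        for (String[] row : extract(sample)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Note that `m.group(0)` is the whole match, so the title is `m.group(3)`. If you see the dash or the trailing `(F110)</option>` in your output, it is worth checking which group index you are printing and whether the pattern in your source really matches the one above character for character.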
DreadPirateShawn
  • :o works when I copy paste it from your code.. Odd. Thank you!! =) I bookmarked that site as well. – Brandon Oct 18 '13 at 06:40

Here's a more complex regular expression for this.

value='(?:[^,]+,){2}([^,]+),([^,]+)'>[^-]+-\s+([^(]+)(?=\s)

See live demo
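As a rough sketch of how that expression could be used from Java (the backslashes are doubled for the string literal; the class name `NegatedClassDemo` and the `parse` helper are hypothetical), something like this should work on a single line of the sample input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer above.
class NegatedClassDemo {
    // Negated character classes ([^,], [^-], [^(]) stop at the next delimiter,
    // and the lookahead (?=\s) trims the space before the opening parenthesis.
    static final Pattern OPTION = Pattern.compile(
        "value='(?:[^,]+,){2}([^,]+),([^,]+)'>[^-]+-\\s+([^(]+)(?=\\s)");

    // Returns {group 1, group 2, group 3} for the first match, or null if none.
    static String[] parse(String line) {
        Matcher m = OPTION.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line =
            "value='C,201301,LIAD,LANG9066'>LANG9066 - Italian For Beginners (LIAD)</option>";
        String[] groups = parse(line);
        System.out.println(groups == null ? "no match" : String.join(" | ", groups));
    }
}
```

One design point to be aware of: `[^-]+-` assumes the course code before the separator contains no hyphen, and `[^(]+` assumes the title contains no opening parenthesis. Both hold for the sample data in the question.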

hwnd
I've checked that C,201301 is not needed. So a simple solution would be to treat everything between < and > as junk and focus only on the text from > to <:

<option value='C,201301,T302,LAW3201'>LAW3201 - Protection Legislation (T302)</option>
<option value='C,201301,T303,LAW3201'>LAW3201 - Protection Legislation (T303)</option>
<option value='C,201301,T304,LAW3201'>LAW3201 - Protection Legislation (T304)</option>

Which would suggest:

>([A-Z]+[0-9]+)\\s-\\s(.*?)\\s\\(([A-Z0-9]+)\\)<

as a sufficient expression for the three groups.
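Here is a small Java sketch of this tag-content approach. It writes the pattern so that the digits are quantified inside the first group and the parentheses around the section code are matched literally; that is my reading of the intent, so treat it as an assumption. The class name `TagContentDemo` and the `parse` helper are made up for the example. Note the group order differs from the question: group 1 is the course code, group 2 the title, group 3 the section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for matching only the text between '>' and '<'.
class TagContentDemo {
    // Group 1: course code, group 2: title (lazy), group 3: section in parentheses.
    static final Pattern CONTENT = Pattern.compile(
        ">([A-Z]+[0-9]+)\\s-\\s(.*?)\\s\\(([A-Z0-9]+)\\)<");

    // Returns {code, title, section} for the first match, or null if none.
    static String[] parse(String line) {
        Matcher m = CONTENT.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line =
            "<option value='C,201301,T302,LAW3201'>LAW3201 - Protection Legislation (T302)</option>";
        String[] groups = parse(line);
        System.out.println(groups == null ? "no match" : String.join(" | ", groups));
    }
}
```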

icedwater