Java and regexp: how to match a string with literal parentheses?

Question

I have these three pieces of text and one regexp. (OK, it's HTML, but please don't focus on that!)

<h3 class="pubAdTitleBlock "><a href="/it/pubblicazioni/libri/Che-speranza-cè-per-i-morti/1101987030/" title="Che speranza c’è per i morti?">Che speranza c’è per i morti? (volantino N. 16)</a></h3>

<h3 class="pubAdTitleBlock "><a href="/it/pubblicazioni/libri/cosa-insegna-la-bibbia/È-questo-che-Dio-voleva/" title="È questo che Dio voleva?">Cosa insegna realmente la Bibbia?</a></h3>

<h3 class="pubAdTitleBlock">Cantiamo a Geova</h3>

This is the regexp

regexp = "<h3[^>]*>(<a[^>]*>)?([^<]+)(</a>)?</h3>";

I have three groups:

the opening <a> tag (optional)
the text (it's a book title, and capturing it is the goal of the regexp)
the closing </a> tag (optional)

Problem: the second row is matched and the third row is matched, but the first is not. Why?

Matching code:

pattern = Pattern.compile(regexp);
matcher = pattern.matcher(fullString);
idx = 0;
while (matcher.find()) {
  ...
}

matcher.find() simply skips the first row. (It is the first row of this example, not of the file, where it is the 10th.)

Could the literal parentheses be the problem? How can I fix the regexp?

EDIT: I've tried

String regexp = "<h3[^>]*>(.+)</h3>";

But this regexp also skips the first row... I really can't understand why!

EDIT 2:

I have a doubt: could the accented characters be the problem?

EDIT 3:

I'm trying to scrape data from here: http://www.jw.org/it/pubblicazioni/libri/?contentLanguageFilter=it&sortBy=3

I have an input stream, which I convert to a single string using this code:

// copied from http://stackoverflow.com/questions/309424/read-convert-an-inputstream-to-a-string
public static String convertStreamToString(InputStream is) {
    try {
        return new java.util.Scanner(is, "UTF-8").useDelimiter("\\A").next();
    } catch (java.util.NoSuchElementException e) {
        // an empty stream has no token to return
        return "";
    }
}

Then I apply the regexp ...
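For comparison, here is a minimal sketch of the same conversion that decodes the bytes explicitly as UTF-8, which makes it easy to rule out an encoding problem with the accented characters. It assumes Java 9 or later (for `InputStream.readAllBytes`); the class name `StreamToString` and the method `readUtf8` are made up for this illustration and are not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the original post.
class StreamToString {

    // Reads the whole stream, decodes it as UTF-8, and closes the stream.
    static String readUtf8(InputStream is) {
        try (InputStream in = is) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // \u2019 is the typographic apostrophe and \u00e8 is "e grave";
        // escapes keep the example independent of the source file encoding.
        String title = "Che speranza c\u2019\u00e8 per i morti? (volantino N. 16)";
        byte[] bytes = title.getBytes(StandardCharsets.UTF_8);
        String decoded = readUtf8(new ByteArrayInputStream(bytes));
        System.out.println(decoded.equals(title)); // prints "true"
    }
}
```

If the decoded string compares equal to the expected title, the accented characters survived the conversion, so the cause of a failed match is more likely elsewhere.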

Please show the code that does the matching. Calling `matcher(str).find()` returns `true` in all three cases ([link](http://ideone.com/rCjLJP)). — Sergey Kalinichenko, Oct 28 '12 at 14:01
It works fine for me; your problem must be somewhere else. But whatever it is, the problem has nothing to do with those parentheses. — Alan Moore, Oct 28 '12 at 14:48
Could the problem be chars like 'è,é, È' ? They're italian utf-8 chars — realtebo, Oct 28 '12 at 14:50
But those characters are present in the second line, too. I tried mucking about with the character encodings anyway, but I couldn't get it to fail the way you described. — Alan Moore, Oct 28 '12 at 15:12
Works fine for me too. As Alan and Pshemo have correctly pointed out - _there is nothing wrong with the regex and code you've posted so far._ (It does not behave the way you describe - See Pshemo's answer for a working example.) If you want more help, you'll need to post the actual code that produces the error. — ridgerunner, Oct 28 '12 at 15:45
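A minimal sketch of the point made in the comments above: parentheses in the input text need no special treatment, because `[^<]+` matches any character except `<`. Escaping only matters when a parenthesis appears in the pattern itself. The class name `LiteralParens`, the method `firstTitle` and the shortened `href` are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original thread.
class LiteralParens {

    // The regexp from the question, unchanged.
    static final Pattern TITLE =
            Pattern.compile("<h3[^>]*>(<a[^>]*>)?([^<]+)(</a>)?</h3>");

    // Returns group 2 (the title text) of the first match, or null if nothing matches.
    static String firstTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // \u2019 and \u00e8 stand for the apostrophe and the accented e of the original row.
        String row = "<h3 class=\"pubAdTitleBlock \"><a href=\"/x/\" title=\"t\">"
                + "Che speranza c\u2019\u00e8 per i morti? (volantino N. 16)</a></h3>";

        // The parentheses in the input are matched by [^<]+ like any other character.
        System.out.println(firstTitle(row));

        // A literal parenthesis inside the pattern must be escaped or quoted.
        System.out.println(Pattern.compile("\\(volantino").matcher(row).find()); // prints "true"
        System.out.println(
                Pattern.compile(Pattern.quote("(volantino N. 16)")).matcher(row).find()); // prints "true"
    }
}
```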

Pshemo · Accepted Answer · 2012-10-30T16:43:27.333

I'm not sure, but maybe this is what you are looking for:

String data = "<h3 class=\"pubAdTitleBlock \"><a href=\"/it/pubblicazioni/libri/Che-speranza-cè-per-i-morti/1101987030/\" title=\"Che speranza c’è per i morti?\">Che speranza c’è per i morti? (volantino N. 16)</a></h3>"
        + "<h3 class=\"pubAdTitleBlock \"><a href=\"/it/pubblicazioni/libri/cosa-insegna-la-bibbia/È-questo-che-Dio-voleva/\" title=\"È questo che Dio voleva?\">Cosa insegna realmente la Bibbia?</a></h3>"
        + "<h3 class=\"pubAdTitleBlock\">Cantiamo a Geova</h3>";

Pattern pattern = Pattern
        .compile("<h3[^>]*>(?:<a[^>]*>)?([^<]+)(?:</a>)?</h3>");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) 
    System.out.println(matcher.group(1));

Output:

Che speranza c’è per i morti? (volantino N. 16)
Cosa insegna realmente la Bibbia?
Cantiamo a Geova

A short explanation:

Groups written as (?:someregex) are non-capturing: the regex engine does not number them. As a result, in (?:a)(b)(?:c)(d), group (b) has index 1 and group (d) has index 2.
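The numbering rule can be checked with a small sketch like the one below. The class name `GroupNumbering` and the helper `groups` are made up for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answer.
class GroupNumbering {

    // Returns the contents of groups 1..groupCount() for a full match,
    // or an empty array if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Non-capturing groups are skipped when groups are numbered.
        System.out.println(Arrays.toString(groups("(?:a)(b)(?:c)(d)", "abcd"))); // prints "[b, d]"
        // With plain parentheses every group gets an index.
        System.out.println(Arrays.toString(groups("(a)(b)(c)(d)", "abcd"))); // prints "[a, b, c, d]"
    }
}
```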

Edit1

(I know it's blasphemy to use regex to parse HTML, but since the OP wants it...)
You forgot to mention that the HTML being parsed contains whitespace, such as tabs and newlines, inside <h3>. Try it this way:

String data = convertStreamToString(new URL(
        "http://www.jw.org/it/pubblicazioni/libri/?contentLanguageFilter=it&sortBy=3")
        .openStream());

// the closing </a> is optional, and whitespace is allowed before </h3>
Pattern pattern = Pattern
        .compile("<h3[^>]*>\\s*(?:<a[^>]*>)?([^<]+)(?:</a>)?\\s*</h3>");
Matcher matcher = pattern.matcher(data);
int counter=0;
while (matcher.find())
    System.out.println(++counter +")"+matcher.group(1));

Output:

1)Accostiamoci a Geova
2)Accostiamoci a Geova — caratteri grandi
....
11)Cosa insegna realmente la Bibbia?
12)Cosa insegna realmente la Bibbia? — caratteri grandi
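The effect of whitespace inside <h3> can be reproduced without downloading the page. The sketch below is an illustration under assumptions: the sample HTML is invented, the class name `WhitespaceInH3` and the method `titles` are made up, and the whitespace-tolerant pattern is a variant of the one above in which the closing `</a>` is optional.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answer.
class WhitespaceInH3 {

    // No whitespace allowed between the tags.
    static final Pattern STRICT =
            Pattern.compile("<h3[^>]*>(?:<a[^>]*>)?([^<]+)(?:</a>)?</h3>");

    // Whitespace allowed after <h3 ...> and before </h3>.
    static final Pattern TOLERANT =
            Pattern.compile("<h3[^>]*>\\s*(?:<a[^>]*>)?([^<]+)(?:</a>)?\\s*</h3>");

    // Collects group 1 of every match, trimmed of surrounding whitespace.
    static List<String> titles(Pattern pattern, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented sample: the first heading has a newline and a tab around the link.
        String html = "<h3 class=\"pubAdTitleBlock \">\n\t<a href=\"/x/\">Titolo (volantino N. 16)</a>\n</h3>"
                + "<h3 class=\"pubAdTitleBlock\">Cantiamo a Geova</h3>";

        System.out.println(titles(STRICT, html));   // prints "[Cantiamo a Geova]"
        System.out.println(titles(TOLERANT, html)); // prints "[Titolo (volantino N. 16), Cantiamo a Geova]"
    }
}
```

The strict pattern skips the first heading because `[^<]+` consumes the newline and tab and then meets `<a` where it expects `</a>` or `</h3>`. This would explain a row being skipped in the real page while the same row matches when pasted as a single line.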

It doesn't work. The first and second rows are not caught, only the third! What does '?:' mean in a regexp? — realtebo, Oct 28 '12 at 15:00
Could you show us a [code sample](http://sscce.org/) that reproduces your problem? Without it, it will be very hard to help you. — Pshemo, Oct 28 '12 at 15:02
It's always a good idea to use non-capturing groups when you don't need to capture, but it doesn't solve the problem. All it does is change which substrings get captured in which groups. — Alan Moore, Oct 28 '12 at 15:07

score 2 · Answer 2 · edited May 23 '17 at 12:03

Don't do it with a parser or a regexp; try Jerry. Something like this (not tested):

Jerry doc = jerry(html);
doc.$("a").each(new JerryFunction() {
    public boolean onNode(Jerry $this, int index) {
        String href = $this.attr("href");
        System.out.println(href);
        return true; // keep iterating over the remaining nodes
    }
});

or any HTML-friendly query language. If you cannot use external libraries, see "Trying to parse links in an HTML directory listing using Java".

(Copy-pasted from my answer to: How do you parse links from html using Java?)

EDIT: try

<h3.*?>(<a.*)?+(.*?)(</a>)?</h3>

and get group(2)

EDIT 2: Just for the book title try:

(.*>)?([^<]+?)<.*

EDIT 3: your regexp

<h3[^>]*>(<a[^>]*>)?([^<]+)(</a>)?</h3>

seems to work for me.

answered Oct 28 '12 at 14:01 by Anton Bessonov

I don't want to use Jerry. I want a regexp! I'm studying regexps – realtebo Oct 28 '12 at 14:06
Yes, the original regex works fine, so please stop trying to solve the problem by flinging metacharacters at it. (Just kidding, but you really should test your solutions before you post them.) – Alan Moore Oct 28 '12 at 15:25
Alan, you won't believe it, but I did test my regexp solutions. Your comment is not useful. – Anton Bessonov Oct 28 '12 at 15:38
When you tested them, did you look at what was in each of the capture groups? – Alan Moore Oct 28 '12 at 15:58
You can try it yourself: http://www.regexplanet.com/advanced/java/index.html It also shows the captured groups. – Anton Bessonov Oct 28 '12 at 16:03
