Grabbing text from an HTML tag using Regular Expressions

Question

I'm trying to read something from within HTML tags and I'm completely stupid when it comes to Regular Expressions (I've thought of a few patterns and none seem to work).

I'm reading a web page, looking for this line: <td title='Visit Page for Demilict'><a href='personal.php?name=Demilict&c=s' class='idk' rel='Demilict' style='color: teal;'>Demilict</a></td>

I need to extract 'Demilict' from there, and there are three places to do so, as you can see.

Which would be the best position to extract it from and how would I achieve that?

I'm using this to find the names, as there are around 60 different names I need to extract and they all use the same format; a name can only contain letters, numbers, and underscores.

public void parse(String list) {
    try {
        URL url = new URL(list);
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        StringBuilder stringBuilder = new StringBuilder();
        while ((line = bufferedReader.readLine()) != null) {
            stringBuilder.append(line).append("\n");
        }
        System.out.println(stringBuilder.toString());
        Matcher matcher = namePattern.matcher(stringBuilder.toString());
        if (matcher.find()) {
            System.out.println("matched: " + matcher.group());
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Don't use RegEx to parse HTML; use a parser (see: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – macf00bar, Aug 23 '11 at 11:33
We can't really help you without the actual pattern. Can you show us how you create `namePattern`? Also, consider using an API for HTML/XML processing (if it's XHTML, an XML API would do). Using regular expressions for such extraction is very error-prone. Any unforeseen situation could create an issue. – G_H, Aug 23 '11 at 11:34
How can I tell if it is XHTML? I am using `namePattern` as `private Pattern namePattern = Pattern.compile("\\?obj=([A-Za-z0-9_]*)");` – , Aug 23 '11 at 11:35
Just as a side note, the \w regex special char is the same as A-Za-z0-9_ – smitec, Aug 23 '11 at 11:40
Alright, well, I wasn't aware that RegEx was a bad thing to use here, I'm still learning! I've used DOM before for XML documents but I have no idea how I could use it for XHTML documents. – , Aug 23 '11 at 11:44
One word: JSoup. It's fast, easy and effective in extracting information from HTML, whether it be XHTML or not. – Hovercraft Full Of Eels, Aug 23 '11 at 11:58
@Ron: Regex is perfectly fine for simple problems like yours. – tchrist, Aug 23 '11 at 12:54

smitec · Accepted Answer · 2011-08-23T13:03:45.493

1

<a.*?>(\w+)</a> will grab the text between the <a ...> and the </a> and put it into the first group; but, as others have said, regex probably isn't the best option here.

Edit: changed the first + to * as 0 chars is valid. Also removed the second ? as per the comment below.
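
Since the question needs around 60 names rather than one, here is a minimal sketch of how the pattern above might be applied in a loop. The class name `AnchorNames`, the method `extract`, and the second row with the name `Some_User1` are made up for illustration; only the pattern and the first row come from this thread. Calling `find()` repeatedly moves through the input, and `group(1)` holds the captured link text each time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorNames {
    // The pattern from this answer, written as a Java string literal
    // (the backslash in \w has to be doubled).
    private static final Pattern ANCHOR_TEXT = Pattern.compile("<a.*?>(\\w+)</a>");

    // Collects the text of every <a> element whose content is a single word.
    static List<String> extract(String html) {
        List<String> names = new ArrayList<>();
        Matcher matcher = ANCHOR_TEXT.matcher(html);
        while (matcher.find()) {          // "while", not "if": visit every match
            names.add(matcher.group(1));  // group 1 is the captured link text
        }
        return names;
    }

    public static void main(String[] args) {
        // First row is the line from the question; the second row is a hypothetical example.
        String html =
              "<td title='Visit Page for Demilict'><a href='personal.php?name=Demilict&c=s' class='idk' rel='Demilict' style='color: teal;'>Demilict</a></td>\n"
            + "<td title='Visit Page for Some_User1'><a href='personal.php?name=Some_User1&c=s' class='idk' rel='Some_User1' style='color: teal;'>Some_User1</a></td>\n";
        System.out.println(extract(html)); // prints [Demilict, Some_User1]
    }
}
```

Note that this collects every single-word link on the page, not only the name links, so a real page may need a stricter pattern or a check on the surrounding markup.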

edited Aug 23 '11 at 13:03

answered Aug 23 '11 at 11:34

Nice and simple, but it could be simpler: you don't need the second question mark, because left angle bracket is not a word character. – tchrist Aug 23 '11 at 12:54
Ah, very true, thanks. I also realised a * should replace the first + as 0 extra chars is valid as well (although that doesn't apply in this case). – smitec Aug 23 '11 at 12:59

score 1 · Answer 2 · answered Aug 23 '11 at 11:41

If you really want to use a regular expression to extract the name, this regexp should capture the name in group 1 (it is written as a Java string literal, hence the doubled backslash):

<td[^>]*?><a[^>]*?>(\\w+)</a></td>
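
As a rough sketch of how this pattern could be used from Java: the class name `CellNameFinder` and the method `firstName` are hypothetical, and the non-matching `<p>` input is invented to show the difference from a looser pattern. Because the expression requires the link to be the only content of a `<td>` cell, links elsewhere on the page are skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellNameFinder {
    // The pattern from this answer: a <td> whose only content is an <a> wrapping one word.
    private static final Pattern CELL_LINK =
            Pattern.compile("<td[^>]*?><a[^>]*?>(\\w+)</a></td>");

    // Returns the name from the first matching cell, or null if nothing matches.
    static String firstName(String html) {
        Matcher matcher = CELL_LINK.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "<td title='Visit Page for Demilict'><a href='personal.php?name=Demilict&c=s' class='idk' rel='Demilict' style='color: teal;'>Demilict</a></td>";
        System.out.println(firstName(row));                            // prints Demilict
        System.out.println(firstName("<p><a href='x'>Other</a></p>")); // prints null (link is not inside a td)
    }
}
```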

Dragon8

JJ. · Answer 3 · 2011-08-23T11:46:42.973

Here is one method: grab the text from the rel='XXX' attribute.

String val = "<td title='Visit Page for Demilict'><a href='personal.php?name=Demilict&amp;c=s' class='idk' rel='Demilict' style='color: teal;'>Demilict</a></td>";
String newVal = val.replaceFirst("^.*rel='([a-zA-Z0-9_]+)'.*$", "$1");
System.out.println("Result: " + newVal);

Basically it just looks for rel='XXX' and throws everything except the XXX away. It allows the rel value to contain the chars a-z, A-Z, 0-9 and underscore.
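
One caveat with the `replaceFirst` approach: it works on a single line at a time, and when the pattern does not match, `replaceFirst` returns the input unchanged, so a miss is hard to tell apart from a hit. A possible variation, assuming the whole page is held in one string as in the question, is to search for the rel attribute with a `Matcher` and collect every hit. The names `RelNames` and `extract` and the second row with `Some_User1` are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelNames {
    // Same character class as in the answer above, without the ^.* and .*$ anchors,
    // so it can be used with find() on multi-line input.
    private static final Pattern REL = Pattern.compile("rel='([a-zA-Z0-9_]+)'");

    // Returns the value of every rel='...' attribute found in the input.
    static List<String> extract(String html) {
        List<String> names = new ArrayList<>();
        Matcher matcher = REL.matcher(html);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // First row is from the question; the second row is a hypothetical example.
        String html =
              "<td title='Visit Page for Demilict'><a href='personal.php?name=Demilict&c=s' class='idk' rel='Demilict' style='color: teal;'>Demilict</a></td>\n"
            + "<td title='Visit Page for Some_User1'><a href='personal.php?name=Some_User1&c=s' class='idk' rel='Some_User1' style='color: teal;'>Some_User1</a></td>\n";
        System.out.println(extract(html)); // prints [Demilict, Some_User1]
    }
}
```

An empty list here means nothing matched, which is easier to check than comparing the result of `replaceFirst` against the original input.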
