2

I'm trying to read something from within HTML tags and I'm completely stupid when it comes to regular expressions (I've thought of a few patterns and none seem to work).

I'm reading a web page, looking for this line: <td title='Visit Page for Demilict'><a href='personal.php?name=Demilict&amp;c=s' class='idk' rel='Demilict' style='color: teal;'>Demilict</a></td>

I need to extract 'Demilict' from there, and there are 3 opportunities to do so, as you can see.

Which would be the best position to extract it from and how would I achieve that?

I'm using the code below to find the name(s). There are around 60 different names I need to extract and they all use the same format; a name can only contain letters, numbers and underscores.

public void parse(String list) {
    try {
        URL url = new URL(list);
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        StringBuilder stringBuilder = new StringBuilder();
        while ((line = bufferedReader.readLine()) != null) {
            stringBuilder.append(line).append("\n");
        }
        System.out.println(stringBuilder.toString());
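        // namePattern is a Pattern field defined elsewhere in the class
        Matcher matcher = namePattern.matcher(stringBuilder.toString());
        // call find() in a loop so every match is reported, not just the first
        while (matcher.find()) {
            System.out.println("matched: " + matcher.group());
        }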
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
  • why don't you use DOM instead? – Heisenbug Aug 23 '11 at 11:32
  • don't use RegEx to parse HTML - use a parser (see: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – macf00bar Aug 23 '11 at 11:33
  • We can't really help you without the actual pattern. Can you show us how you create `namePattern`? Also, consider using an API for HTML/XML processing (if it's XHTML, XML would do). Using regular expressions for such extraction is very error prone. Any unforeseen situation could create an issue. – G_H Aug 23 '11 at 11:34
  • How can I tell if it is XHTML? I am using `namePattern` as `private Pattern namePattern = Pattern.compile("\\?obj=([A-Za-z0-9_]*)");` –  Aug 23 '11 at 11:35
  • just as a side note the \w regex special char is the same as A-Za-z0-9_ – smitec Aug 23 '11 at 11:40
  • Alright well I wasn't aware that RegEx was a bad thing to use here, I'm still learning! I've used DOM before for XML documents but I have no idea as to how I could use it for XHTML documents. –  Aug 23 '11 at 11:44
  • One word: JSoup. It's fast, easy and effective in extracting information from HTML, whether it be XHTML or not. – Hovercraft Full Of Eels Aug 23 '11 at 11:58
  • @Ron: Regex is perfectly fine for simple problems like yours. – tchrist Aug 23 '11 at 12:54

3 Answers

1

<a.*?>(\w+)</a> will grab the text between the <a ...> and the </a> and put it into the first group; but as others have said, regex probably isn't the best option here.

Edit: changed the first + to * as 0 chars is valid. Also removed the second ? as per the comment below.
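
A minimal sketch of how this pattern might be applied in Java to collect every name on the page rather than only the first. The class name `AnchorTextExtractor`, the helper `extractNames`, and the second sample name are made up for illustration; the markup is shortened from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorTextExtractor {
    // Group 1 captures the word characters between <a ...> and </a>.
    private static final Pattern ANCHOR_TEXT = Pattern.compile("<a.*?>(\\w+)</a>");

    // Hypothetical helper: returns the text of every matching anchor in the input.
    public static List<String> extractNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher matcher = ANCHOR_TEXT.matcher(html);
        while (matcher.find()) {          // keep searching until no match is left
            names.add(matcher.group(1));  // group(1) is the name, group() the whole tag
        }
        return names;
    }

    public static void main(String[] args) {
        String html =
              "<td><a href='personal.php?name=Demilict&amp;c=s' rel='Demilict'>Demilict</a></td>\n"
            + "<td><a href='personal.php?name=Some_User1&amp;c=s' rel='Some_User1'>Some_User1</a></td>";
        System.out.println(extractNames(html));
    }
}
```

Note that `<a.*?>` would also accept other tags whose names start with "a", so this is only reasonable when the page layout is known and stable.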

smitec
  • Nice and simple, but it could be simpler: you don't need the second question mark, because left angle bracket is not a word character. – tchrist Aug 23 '11 at 12:54
  • ah very true thanks. also realised a * should replace the first + as 0 extra chars is valid as well (although that doesn't apply in this case). – smitec Aug 23 '11 at 12:59
1

If you really want to use a regular expression to extract the name, this regexp should store the name in group 1:

<td[^>]*?><a[^>]*?>(\\w+)</a></td>
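
As a rough illustration, the pattern above (already escaped as a Java string literal, hence the doubled backslash) could be used like this. `CellNameExtractor` and `firstName` are hypothetical names, not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CellNameExtractor {
    // Requires an <a> directly inside a <td>, with only word characters as link text.
    private static final Pattern CELL_NAME =
        Pattern.compile("<td[^>]*?><a[^>]*?>(\\w+)</a></td>");

    // Hypothetical helper: returns the first name found, or null if nothing matches.
    public static String firstName(String html) {
        Matcher matcher = CELL_NAME.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "<td title='Visit Page for Demilict'>"
                   + "<a href='personal.php?name=Demilict&amp;c=s' class='idk' rel='Demilict' style='color: teal;'>"
                   + "Demilict</a></td>";
        System.out.println(firstName(row));
    }
}
```

Because the pattern spells out the surrounding `<td>`, it is stricter than a bare anchor match: any whitespace or extra markup between the `<td>` and the `<a>` makes it fail to match.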
Dragon8
0

Here is one method to grab the text in the rel='XXX' attribute.

String val = "<td title='Visit Page for Demilict'><a href='personal.php?name=Demilict&amp;c=s' class='idk' rel='Demilict' style='color: teal;'>Demilict</a></td>";
String newVal = val.replaceFirst("^.*rel='([a-zA-Z0-9_]+)'.*$", "$1");
System.out.println("Result: " + newVal);

Basically it just looks for rel='XXX' and throws everything except the XXX away. It allows rel to contain the characters a-z, A-Z, 0-9 and underscore.
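
One caveat with the `replaceFirst` approach is that `.` does not match line breaks by default, so on a multi-line page the `^.*…​.*$` pattern finds nothing and the input comes back unchanged. The sketch below shows that, along with a `Matcher.find()` alternative that only looks at the attribute itself. `RelNameExtractor`, `firstRel`, and the surrounding `<tr>` markup are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelNameExtractor {
    // Same character class as above, but matching only the rel attribute itself.
    private static final Pattern REL = Pattern.compile("rel='([a-zA-Z0-9_]+)'");

    // Hypothetical helper: returns the first rel value, or null if there is none.
    public static String firstRel(String html) {
        Matcher matcher = REL.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<tr>\n"
                    + "<td><a href='personal.php?name=Demilict&amp;c=s' rel='Demilict'>Demilict</a></td>\n"
                    + "</tr>";
        // No match on multi-line input, so replaceFirst returns the page as-is.
        String replaced = page.replaceFirst("^.*rel='([a-zA-Z0-9_]+)'.*$", "$1");
        System.out.println("unchanged: " + replaced.equals(page));
        // find() is not anchored to the whole input, so line breaks do not matter.
        System.out.println("Result: " + firstRel(page));
    }
}
```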

JJ.