
My task is to read an HTML file, match every <a> tag together with all of its attributes, and print them out. For example, for the tag:
<a href="https://www.facebook.com" alt="Facebook icon" title="Facebook" target="_blank"></a>

the following should be printed:

href -  https://www.facebook.com   
alt -  Facebook icon  
title -  Facebook  
target -  _blank  
text -  not found  

I have basic knowledge of regex and no experience reading from a file in Java. Can someone give me some hints, advice, and explanations on how to do this efficiently?
In my opinion, the regular expression for matching the <a> tag with all its attributes and the closing </a> might be:

"\<[aA]\w\>\w\<\/[aA]\>*"

Leo Zhekov
  • Why don't you try parsers? – Avinash Raj Apr 18 '15 at 09:51
  • You may want to have a look at this question and its top answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – benzonico Apr 18 '15 at 09:53
  • Don't use regexes for that. Use an HTML parser (jsoup for instance). – fge Apr 18 '15 at 09:54
  • I MUST use regex. It's a homework assignment to practice regexes. – Leo Zhekov Apr 18 '15 at 09:57
  • Show this to your teacher: http://stackoverflow.com/a/1732454/1393766 and ask for a task where regex really should be used. HTML and regex are not a good combination: in HTML the order of tag attributes can change at any time, and an attribute value is not guaranteed to be surrounded with `"`. It can also be surrounded with `'`, which makes a potential regex even more complex. For parsing HTML we should use... a parser. – Pshemo Apr 18 '15 at 10:04

1 Answer


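As others have said, don't parse HTML files with regex. If you really have to, you can try the \G-anchor-based regex below. The first alternative, <a, starts a match at the opening of the tag; the second, \G, continues from the exact position where the previous match ended, so each call to find() picks up the next name="value" pair of the same tag.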

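import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<a href=\"https://www.facebook.com\" alt=\"Facebook icon\" title=\"Facebook\" target=\"_blank\"></a>";
// Group 1 is the attribute name, group 2 the double-quoted value.
// (?<!^) keeps \G from matching at the very start of the input.
Matcher m = Pattern.compile("(?:<a|(?<!^)\\G)\\s+(\\w+)=\"([^\"]*)\"")
        .matcher(s);
while (m.find()) {
    System.out.println(m.group(1) + "\t-\t" + m.group(2));
}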

Output:

href    -   https://www.facebook.com
alt     -   Facebook icon
title   -   Facebook
target  -   _blank
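
The question also asks about reading the HTML from a file and printing the link text, which the snippet above does not cover. The following is only a rough sketch of one way to do that, and it swaps the \G approach for a two-step match: first find each whole <a ...>...</a> element, then pull the attributes out of the opening tag. The class name AnchorAttributes, the method describe, and the default path page.html are made up for illustration. It assumes attribute values are quoted with " or ' and contain no > character, and that <a> elements are not nested; unquoted values are skipped, and any markup inside the link is printed as-is.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorAttributes {

    // Step 1: one whole <a ...>text</a> element.
    // Group 1 = raw attribute string, group 2 = content between the tags.
    // The lookahead keeps tags such as <abbr> or <area> from matching.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a(?=[\\s>])([^>]*)>(.*?)</a\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Step 2: one name="value" or name='value' pair inside the opening tag.
    // Group 2 holds a double-quoted value, group 3 a single-quoted one.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([\\w-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns one "name - value" line per attribute, followed by a
    // "text - ..." line, for every <a> element found in the input.
    static List<String> describe(String html) {
        List<String> lines = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            Matcher attr = ATTRIBUTE.matcher(anchor.group(1));
            while (attr.find()) {
                String value = attr.group(2) != null ? attr.group(2) : attr.group(3);
                lines.add(attr.group(1) + " - " + value);
            }
            String text = anchor.group(2).trim();
            lines.add("text - " + (text.isEmpty() ? "not found" : text));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "page.html" is a placeholder; pass the real path as the first argument.
        String path = args.length > 0 ? args[0] : "page.html";
        // Read the whole file into one string (assumes UTF-8 encoding).
        String html = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        for (String line : describe(html)) {
            System.out.println(line);
        }
    }
}
```

Reading the whole file into a single string, rather than line by line, lets a tag that spans several lines still match. For anything beyond an exercise, the parser suggested in the comments remains the more robust choice.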


Avinash Raj