I am trying to split a string in Java that contains several pieces of information. The text looks something like this:

<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>

I am thinking of using the `.split` method, which takes a regular expression. I want to split this string into the URL without quotes (http://...... .com) and the text between the tags, in this case HootSuite.

I would appreciate any help. Thank you.

robert_x44
AhmadAssaf
  • Why don't you use an HTML parser to extract the `href` attribute? Easier and much less brittle. – Anon. Jan 07 '11 at 01:45

1 Answer

5

You don't want to do this. You want to use an XML or HTML parsing library, such as the DOM API in org.w3c.dom. Why, you ask? Because you can't parse HTML with regex.
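
As a rough illustration of the parser approach, here is a minimal sketch using the JDK's built-in XML parser together with org.w3c.dom. It assumes the input is a single, well-formed `<a>` element like the one in the question; the class name `AnchorExtractor` and the method `extract` are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class AnchorExtractor {

    // Returns {href, link text}. Assumes the input is one well-formed <a> element.
    static String[] extract(String html) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(html)));
        Element anchor = doc.getDocumentElement(); // the <a> element is the document root
        return new String[] { anchor.getAttribute("href"), anchor.getTextContent() };
    }

    public static void main(String[] args) throws Exception {
        String[] parts = extract("<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">HootSuite</a>");
        System.out.println(parts[0]); // the URL, without quotes
        System.out.println(parts[1]); // the text between the tags
    }
}
```

Because this is a strict XML parser, it throws an exception on markup that is not well-formed (for example an unescaped `&` in the URL), so a lenient HTML parser is probably a better fit for arbitrary web pages.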

Community
asthasr
  • +1, I will say though that every time I see this response I know the writer has not written a spider because so much of the internet is severely broken HTML that would not get through even the most lax parsers. – Mike Axiak Jan 07 '11 at 01:51
  • The thing that might make this task easier is that I am always parsing HTML with the same structure. It is a URL sent back by the Twitter API, always in the same structure. But I think a parser will be the best choice. – AhmadAssaf Jan 07 '11 at 01:56
  • And the infamous diatribe has saved another soul. – robert_x44 Jan 07 '11 at 01:57
  • @AhmadAssaf That fills me with a warm glow of happiness. – asthasr Jan 07 '11 at 01:57
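
For the narrow fixed-structure case raised in the comments, a regular expression with capture groups (rather than `.split`) could look roughly like the sketch below. It assumes `href` is always the first attribute, is double-quoted, and that the link text contains no nested tags; anything outside those assumptions will not match. The class name `AnchorRegex` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; brittle by design and only suited to one known input shape.
class AnchorRegex {

    // Group 1: the href value. Group 2: the text between <a ...> and </a>.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+href=\"([^\"]*)\"[^>]*>([^<]*)</a>");

    // Returns {href, link text}, or null when the input does not match the expected shape.
    static String[] extract(String html) {
        Matcher m = ANCHOR.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = extract("<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">HootSuite</a>");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

This only illustrates why the approach is fragile: reordering the attributes or switching to single quotes is enough to break it, which is the point the answer makes.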