1

I would like to extract the following String:

http://www.01net.com/images/article/mea/150.100.790233.jpg

This string is the URL in the `src` attribute of the first `<img>` tag in the following Java string:

<img src="http://www.01net.com/images/article/mea/150.100.790233.jpg" width="150" height="100" border=0 alt="" align=left style="margin-right:10px;margin-bottom:5px;">A en croire CNet US, le gouvernement américain aurait cherché à obtenir les master keys de plusieurs acteurs du Web pour pouvoir déchiffrer les communications de leurs utilisateurs, protégées par le protocole SSL.<img width='1' height='1' src='http://rss.feedsportal.com/c/629/f/502199/s/2f34155b/mf.gif' border='0'/><div class='mf-viral'><table border='0'><tr><td valign='middle'><a href="http://share.feedsportal.com/share/twitter/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net" target="_blank"><img src="http://res3.feedsportal.com/social/twitter.png" border="0" /></a> <a href="http://share.feedsportal.com/share/facebook/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net" target="_blank"><img src="http://res3.feedsportal.com/social/facebook.png" border="0" /></a> <a href="http://share.feedsportal.com/share/linkedin/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net" target="_blank"><img src="http://res3.feedsportal.com/social/linkedin.png" border="0" /></a> <a 
href="http://share.feedsportal.com/share/gplus/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net" target="_blank"><img src="http://res3.feedsportal.com/social/googleplus.png" border="0" /></a> <a href="http://share.feedsportal.com/share/email/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net" target="_blank"><img src="http://res3.feedsportal.com/social/email.png" border="0" /></a></td><td valign='middle'></td></tr></table></div><br/><br/><a href="http://da.feedsportal.com/r/172449334514/u/218/f/502199/c/629/s/2f34155b/kg/342/a2.htm"><img src="http://da.feedsportal.com/r/172449334514/u/218/f/502199/c/629/s/2f34155b/kg/342/a2.img" border="0"/></a><img width="1" height="1" src="http://pi.feedsportal.com/r/172449334514/u/218/f/502199/c/629/s/2f34155b/kg/342/a2t.img" border="0"/>
wawanopoulos
  • What have you tried? You have 108 reputation, you should know that _questions asking for code must demonstrate a minimal understanding of the problem being solved_ – BackSlash Jul 28 '13 at 12:42
  • Yes, of course, I have tried several approaches, using `substring` for example, but they don't work very well. It's complicated to retrieve only the first occurrence of a string. – wawanopoulos Jul 28 '13 at 12:44
  • _Yes, of course, i have tried multiple code_ So post the code you tried. – BackSlash Jul 28 '13 at 12:46
  • It sounds to me like this would be a good candidate for an HTML parser. – Trae Moore Jul 29 '13 at 10:46

2 Answers

15

You can use a regular expression for this:

    String str = "<img src=\"http://www.01net.com/images/article/mea/150.100.790233.jpg\" width=\"150\" height=\"100\" border=0 alt=\"\" align=left style=\"margin-right:10px;margin-bottom:5px;\">A en croire CNet US, le gouvernement américain aurait cherché à obtenir les master keys de plusieurs acteurs du Web pour pouvoir déchiffrer les communications de leurs utilisateurs, protégées par le protocole SSL.<img width='1' height='1' src='http://rss.feedsportal.com/c/629/f/502199/s/2f34155b/mf.gif' border='0'/><div class='mf-viral'><table border='0'><tr><td valign='middle'><a href=\"http://share.feedsportal.com/share/twitter/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net\" target=\"_blank\"><img src=\"http://res3.feedsportal.com/social/twitter.png\" border=\"0\" /></a> <a href=\"http://share.feedsportal.com/share/facebook/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net\" target=\"_blank\"><img src=\"http://res3.feedsportal.com/social/facebook.png\" border=\"0\" /></a> <a href=\"http://share.feedsportal.com/share/linkedin/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net\" target=\"_blank\"><img src=\"http://res3.feedsportal.com/social/linkedin.png\" border=\"0\" /></a> <a 
href=\"http://share.feedsportal.com/share/gplus/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net\" target=\"_blank\"><img src=\"http://res3.feedsportal.com/social/googleplus.png\" border=\"0\" /></a> <a href=\"http://share.feedsportal.com/share/email/?u=http%3A%2F%2Fwww.01net.com%2Feditorial%2F600625%2Fchiffrement-sur-le-web-fbi-et-nsa-voulaient-obtenir-les-cles-ssl-de-geants-du-net%2F%23%3Fxtor%3DRSS-16&t=Chiffrement+sur+le+Web%2C+FBI+et+NSA+voulaient+obtenir+les+cl%C3%A9s+SSL+de+g%C3%A9ants+du+Net\" target=\"_blank\"><img src=\"http://res3.feedsportal.com/social/email.png\" border=\"0\" /></a></td><td valign='middle'></td></tr></table></div><br/><br/><a href=\"http://da.feedsportal.com/r/172449334514/u/218/f/502199/c/629/s/2f34155b/kg/342/a2.htm\"><img src=\"http://da.feedsportal.com/r/172449334514/u/218/f/502199/c/629/s/2f34155b/kg/342/a2.img\" border=\"0\"/></a><img width=\"1\" height=\"1\" src=\"http://pi.feedsportal.com/r/172449334514/u/218/f/502199/c/629/s/2f34155b/kg/342/a2t.img\" border=\"0\"/>";
    Pattern p = Pattern.compile("src=\"(.*?)\"");
    Matcher m = p.matcher(str);
    if (m.find()) {
        System.out.println(m.group(1)); // prints http://www.01net.com/images/article/mea/150.100.790233.jpg
    }
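
The pattern above matches the first `src="..."` anywhere in the text, on any tag, and only with double quotes. As a minimal sketch (the class name `RegexImgSrc` and the method `firstImgSrc` are made up for illustration, not part of the original answer), the same idea can be anchored to `<img` tags and made to accept either quoting style:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexImgSrc {
    // Group 1 captures the opening quote, group 2 the URL; \1 requires the same closing quote.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns the src value of the first <img> tag, or empty if none is found. */
    static Optional<String> firstImgSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<img src=\"http://www.01net.com/images/article/mea/150.100.790233.jpg\" width=\"150\">";
        // prints http://www.01net.com/images/article/mea/150.100.790233.jpg
        System.out.println(firstImgSrc(html).orElse("(no image found)"));
    }
}
```

This is still a pattern match over text, not a parse, so it assumes quoted attribute values and would also match an `<img>` that sits inside an HTML comment or a script.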
micha
  • Why waste resources on patterns and matchers if you can do it with substring and indexOf? – paulgavrikov Jul 28 '13 at 12:48
  • *You can use regular expression* -> [No You Shouldn't](http://stackoverflow.com/q/1732348/1679863) – Rohit Jain Jul 28 '13 at 13:02
  • @Rohit: There is a difference between complete tag matching and extracting a simple attribute value. – micha Jul 28 '13 at 18:38
  • @bluewhile: Because it is very easy to read. I think everyone who has a solid understanding of regular expressions understands this code very fast. Faster than reading a mix of `indexOf()`, `substring()` calls with some index computations. – micha Jul 28 '13 at 18:40
11

Here is a quick-and-dirty solution that uses the String API instead of regular expressions, which should be faster because no pattern has to be compiled.

Working principle:

  1. `all` is the whole text you want to search

  2. `s` is the start pattern to look for, in this case the opening of the first `<img ...` tag. If you have multiple `img` tags, consider iterating, or extend the pattern with an `id=""` or `class=""` attribute if the tag has one

  3. `ix` is the position of the URL in `all`

  4. the last line takes the substring of `all` from `ix` up to the next `"` it finds

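    String all = "<img src=\"http://www.01net.com/images/article/mea/150.100.790233.jpg\""; // shortened
    String s = "<img src=\"";               // start pattern: opening of the first <img> tag
    int ix = all.indexOf(s) + s.length();   // index of the first character of the URL (assumes s occurs in all)
    System.out.println(all.substring(ix, all.indexOf("\"", ix))); // everything up to the closing quote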
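
The four steps can also be wrapped in a small method that checks for the two cases the snippet does not handle: the start pattern being absent and the closing quote being missing. This is only a sketch; the names `IndexOfImgSrc` and `extract` are hypothetical.

```java
import java.util.Optional;

class IndexOfImgSrc {
    /** Returns the src value of the first tag that starts exactly with <img src=", if any. */
    static Optional<String> extract(String all) {
        String s = "<img src=\"";            // start pattern
        int start = all.indexOf(s);          // -1 when the pattern does not occur
        if (start < 0) {
            return Optional.empty();
        }
        int ix = start + s.length();         // first character of the URL
        int end = all.indexOf('"', ix);      // closing quote of the attribute
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(all.substring(ix, end));
    }

    public static void main(String[] args) {
        String all = "<img src=\"http://www.01net.com/images/article/mea/150.100.790233.jpg\" width=\"150\">";
        // prints http://www.01net.com/images/article/mea/150.100.790233.jpg
        System.out.println(extract(all).orElse("(not found)"));
    }
}
```

Like the snippet, it only recognises a tag written exactly as `<img src="`, so an `img` whose first attribute is something else, or whose value is single-quoted, is skipped.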
    

EDIT: A bit more detail for advanced readers. As stated in other answers and comments, you should not use the String API to parse HTML, as there are many special cases that are hard to catch. Note that regex won't help you either: regular expressions describe type-3 (regular) languages in the Chomsky hierarchy, which are a strict subset of the type-2 (context-free) languages, and HTML's nested structure needs at least a type-2 grammar (see Wiki). In production, use a DOM parser like jsoup. For quick hacking, or when the markup style is known, a String API solution will probably work just fine and add less overhead.
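
To illustrate the parser route without an extra dependency, here is a rough sketch that swaps jsoup for the HTML parser shipped with the JDK (`javax.swing.text.html.parser.ParserDelegator`). It is not the jsoup API, that parser targets an old HTML dialect, and the names `ParserImgSrc` and `firstImgSrc` are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParserImgSrc {
    /** Returns the src attribute of the first <img> element, or null if there is none. */
    static String firstImgSrc(String html) {
        final String[] result = {null};
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <img> has no closing tag, so the parser reports it as a "simple" tag.
                if (tag == HTML.Tag.IMG && result[0] == null) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        result[0] = src.toString();
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String html = "<img width='1' src='http://example.com/a.gif'><p>text</p>";
        System.out.println(firstImgSrc(html));
    }
}
```

Because the parser tokenizes the markup itself, attribute order and quoting style no longer matter, which is the main advantage over both string-based approaches.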

paulgavrikov
  • This solution works very well! Thanks a lot for your help. – wawanopoulos Jul 28 '13 at 12:55
  • What do you mean by *Wasting resources with Patterns*? You should not even use String API for parsing HTML. – Rohit Jain Jul 28 '13 at 13:01
  • This solution should be updated to handle the case where the src value is quoted with `'` – Paul May 15 '17 at 14:34
  • @Paul No. This is not valid xhtml anymore. – paulgavrikov May 20 '17 at 23:02
  • I don't think this is a solid solution. an img tag can be `` and ``. Your solution does not work for the second case. – yiksanchan Mar 13 '20 at 17:43
  • @YikSanChan This is a valid answer to the specifics of the question. In general you cannot parse HTML reliably with regex, and probably not even with the String API, due to the Chomsky hierarchy: see https://stackoverflow.com/questions/6687305/reliably-parsing-html-elements-using-regex – paulgavrikov Jun 21 '21 at 14:57