
I have a list of images, and some of them are used on the web. I need statistics on which images are used on the website, on which pages, and so on.

How can I "match" my images? The rules are:

  1. I have only the filename, e.g. "mypic.png".
  2. Here is the regex I want to build: <img[anything]src=("or')[anything]mypic.png[anything]("or')[anything]>

Here is a dump of the HTML I have:

<figure class="gr_col gr_2of3">
    <div class="mll mrm mbs md_pic_wrap1">
        <a href="http://mydomain/nice-page" title="title test">
            <img alt="alt text" class="mbm" src="http://mydomain/file-pic2/mypic.png" width="95" height="95">
        </a>
    </div>
</figure>

Thanks!

Dmytro Pastovenskyi
  • 1
    So, you want to *extract* the image name from the web page (in this case mypic.png) and check whether it is in your list, right? Edit: I completely *agree* with *Rawing's* comment :P – TheLostMind Jan 01 '15 at 14:14
  • 3
    [Don't use regex to parse HTML.](http://stackoverflow.com/a/1732454/1222951) – Aran-Fey Jan 01 '15 at 14:14
  • What would you suggest instead? – Dmytro Pastovenskyi Jan 01 '15 at 14:18
  • 2
    @Dmytro - Use an HTML parser like JSoup – TheLostMind Jan 01 '15 at 14:18
  • You are not parsing html with regex, you are extracting bits of data. Parsing involves breaking down each character of a string using rules of a formal grammar language. There's nothing wrong with using regex to get image names as you are not parsing anything, IMO. – gwillie Jan 02 '15 at 02:27

2 Answers

2

HTML and regex are terrible together in almost all cases. Use a tool that was built for the job, e.g. JSoup.

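import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Parse the page, then check the src of every <img> element
Document document = Jsoup.parse(htmlStringOrFile);
for (Element img : document.select("img")) {
    if (img.attr("src").contains("mypic.png")) {
        System.out.println(img.attr("alt"));
    }
}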

This will print the value of the alt attribute of all img elements containing mypic.png in their src. Replace alt with name or id or whatever is the most appropriate for your case.

[As noted by Pshemo]

The selector can be any CSS selector, so you can drop the condition check inside the loop by selecting img[src*=mypic.png] instead of img. That selector returns only the img elements whose src contains mypic.png, which is essentially the same semantics.

kaqqao
  • 1
    This is a comment, not an answer. It is good that you describe which tools are appropriate for this problem, but an answer should also describe *how* to use these tools to solve the problem in the question. In its current state your answer is similar to [link-only-answers](http://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers), but without even a link. – Pshemo Jan 01 '15 at 16:59
  • I believe an answer is good if it gives the person enough tools to help themselves, and in this case it would take at most 5 mins of Googling to get the exact code. Spoon feeding is not helpful in the long run. – kaqqao Jan 01 '15 at 17:05
  • 4
    The information in your post is good, but simply not enough for an answer. If you don't want to spoon-feed, then don't; give pseudocode instead, or even better, write out the steps needed to solve this problem so the OP can turn them into code himself. As you noticed, people already suggested Jsoup as the correct tool. They posted it as a comment instead of an answer because answers are meant to contain a *full* (or very close to full) solution, not just general info. – Pshemo Jan 01 '15 at 17:09
  • Ok, I've added sample code. I still maintain that this question was too simple to require it and that all the asker needed was a nudge in the right direction as we're talking about the most basic use case for a well documented tool. I mean, the very home page of JSoup contains almost all the necessary code for this. – kaqqao Jan 01 '15 at 17:39
  • 1
    Instead of `contains` it would probably be better to use `endsWith`, or, to avoid this `if(..)`, simply use the *ends with* syntax in `select`, like `doc.select("img[src$=mypic.png]")`. Anyway, since your answer now contains an example of how this tool can be used, I have no more objections to it. – Pshemo Jan 01 '15 at 17:40
  • 1
    I was thinking the same about endsWith, but the question stated [anything]mypic.png[anything], so I guessed they might indeed have some image URLs where there's more at the end. – kaqqao Jan 01 '15 at 17:42
  • I see your point. Anyway to avoid additional `if` we can still use `[attr*=value]` selector which will require `attr` to contain `value`. – Pshemo Jan 01 '15 at 17:45
0

To match an image use:

(?i)<img.*?src=["'].*?(mypic\.png).*?["'].*?>

Capturing group 1 contains the name of the matched image.
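As a rough illustration of how such a pattern might be applied with java.util.regex, here is a minimal sketch. The class name ImgRegexDemo and the method findMatches are made up for this example. Note that the pattern is a tightened variant, not the exact regex above: it uses [^>] and [^"'] instead of .*? so that a match cannot spill out of the img tag or out of the src value. That tightening is an assumption about what is wanted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgRegexDemo {
    // Assumption: a stricter variant of the pattern above.
    // [^>] keeps the match inside one tag, [^"'] keeps it inside the src value.
    static final Pattern IMG = Pattern.compile(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"'][^\"']*?(mypic\\.png)[^\"']*[\"'][^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the text captured by group 1 for every matching <img> tag.
    static List<String> findMatches(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://mydomain/nice-page\" title=\"title test\">"
            + "<img alt=\"alt text\" class=\"mbm\" "
            + "src=\"http://mydomain/file-pic2/mypic.png\" width=\"95\" height=\"95\">"
            + "</a>";
        System.out.println(findMatches(html)); // prints [mypic.png]
    }
}
```

With the looser .*? form, input such as `<img src="other.png"><a href="mypic.png">` would also match, because .*? is free to run past the end of the img tag. The stricter character classes avoid that case, though an HTML parser remains the more robust choice.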


import java.util.regex.Pattern;

// Builds one regex that matches an <img> tag whose src contains any of the given file names.
public String buildRegex(String... nameList) {
    StringBuilder regex = new StringBuilder();
    regex.append("(?i)<img.*?src=[\"'].*?(");
    for (int i = 0; i < nameList.length; i++) {
        if (i > 0) {
            regex.append("|");
        }
        // Pattern.quote escapes every regex metacharacter, not only the dot
        regex.append(Pattern.quote(nameList[i]));
    }
    regex.append(").*?[\"'].*?>");
    return regex.toString();
}
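
To get from a multi-name regex to the per-page statistics asked about in the question, something along these lines might work. This is only a sketch under stated assumptions: the class ImageUsageReport, the methods buildPattern and usage, and the htmlByPage map (page URL to already-downloaded HTML) are hypothetical names, and the pattern is a stricter variant that confines the match to one tag and one src value.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ImageUsageReport {
    // Joins all file names into one alternation; Pattern.quote escapes metacharacters.
    static Pattern buildPattern(List<String> names) {
        String alternatives = names.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        return Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"'][^\"']*?(" + alternatives + ")[^\"']*[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);
    }

    // Maps each (lower-cased) image name to the sorted set of pages that use it.
    static Map<String, Set<String>> usage(Map<String, String> htmlByPage, List<String> names) {
        Map<String, Set<String>> pagesByImage = new LinkedHashMap<>();
        for (String name : names) {
            // Unused images keep an empty set, so they still show up in the report
            pagesByImage.put(name.toLowerCase(Locale.ROOT), new TreeSet<>());
        }
        if (names.isEmpty()) {
            return pagesByImage; // an empty alternation would match every <img>
        }
        Pattern pattern = buildPattern(names);
        for (Map.Entry<String, String> page : htmlByPage.entrySet()) {
            Matcher m = pattern.matcher(page.getValue());
            while (m.find()) {
                String key = m.group(1).toLowerCase(Locale.ROOT);
                pagesByImage.computeIfAbsent(key, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return pagesByImage;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://mydomain/nice-page",
            "<img alt=\"alt text\" src=\"http://mydomain/file-pic2/mypic.png\">");
        System.out.println(usage(pages, List.of("mypic.png", "unused.jpg")));
    }
}
```

Because the match is a substring match, as in the question's [anything]mypic.png[anything] rule, a name like pic.png would also be counted for a src ending in mypic.png. Anchoring on a preceding slash is one way to tighten that if it matters.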
Andie2302