7

Hi everyone.

I'm trying to get the URLs of all images on the current page in a UIWebView.

Here is my code:

- (void)webViewDidFinishLoad:(UIWebView*)webView {
    NSString *firstImageUrl = [self.webView stringByEvaluatingJavaScriptFromString:@"var images = document.getElementsByTagName('img');images[0].src.toString();"];
    NSString *imageUrls = [self.webView stringByEvaluatingJavaScriptFromString:@"var images= document.getElementsByTagName('img');var imageUrls = "";for(var i = 0; i < images.length; i++){var image = images[i];imageUrls += image.src;imageUrls += \\’,\\’;}imageUrls.toString();"];
    NSLog(@"firstUrl : %@", firstImageUrl);
    NSLog(@"images : %@",imageUrls);
}

The first NSLog prints the correct image src, but the second NSLog prints nothing.

2013-01-25 00:51:23.253 WebDemo[3416:907] firstUrl: https://www.paypalobjects.com/en_US/i/scr/pixel.gif
2013-01-25 00:51:23.254 WebDemo[3416:907] images :

I don't know why. Please help me...

Thanks.

tsk

4 Answers

14

Perrohunter pointed out one NSRegularExpression solution, which is great. If you don't want to enumerate the array of matches, you can use the block-based enumerateMatchesInString method, too:

NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(<img\\s[\\s\\S]*?src\\s*?=\\s*?['\"](.*?)['\"][\\s\\S]*?>)+?"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];

[regex enumerateMatchesInString:yourHTMLSourceCodeString
                        options:0
                          range:NSMakeRange(0, [yourHTMLSourceCodeString length])
                     usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {

                         NSString *img = [yourHTMLSourceCodeString substringWithRange:[result rangeAtIndex:2]];
                         NSLog(@"img src %@",img);
                     }];

I've also updated the regex pattern to deal with the following issues:

  • there can be attributes between the start img tag and the src attribute;
  • there can be attributes after the src attribute and before the >;
  • there can be newline characters in the middle of an img tag (the . matches any character except a newline);
  • the src attribute value can be quoted with ' as well as "; and
  • there can be spaces between src and the = as well as between the = and the subsequent value.
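
The five cases in the list above can be checked quickly outside the app. The sketch below ports the pattern to a JavaScript regular expression (the same regex, minus the Objective-C string escaping) and runs it over a few made-up img tags. The sample HTML and the extractSrcs helper name are illustrative assumptions, not part of the original code.

```javascript
// JavaScript port of the pattern above: same regex, without the Objective-C string escaping.
// Flags: g = find every match, i = case-insensitive (like NSRegularExpressionCaseInsensitive).
const IMG_SRC_PATTERN = /(<img\s[\s\S]*?src\s*?=\s*?['"](.*?)['"][\s\S]*?>)+?/gi;

// Hypothetical helper: returns the src value (capture group 2) of every img tag found.
function extractSrcs(html) {
  return Array.from(html.matchAll(IMG_SRC_PATTERN), (m) => m[2]);
}

// Made-up sample markup, one tag per case from the list.
const sampleHtml = [
  '<img class="logo" src="a.png">',   // attribute before src
  '<img src="b.png" alt="b">',        // attribute after src
  '<img\n  id="c"\n  src="c.png"\n>', // newlines inside the tag
  "<img src='d.png'>",                // single-quoted value
  '<img src = "e.png">',              // spaces around the =
].join("\n");

// Logs the five src values, a.png through e.png.
console.log(extractSrcs(sampleHtml));
```

Capture group 2 here corresponds to rangeAtIndex:2 in the Objective-C version.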

I freely recognize that reading regex patterns is painful for the uninitiated, and perhaps other solutions might make more sense (the JSON suggestion by Joris, using scanners, etc.). But if you wanted to use regex, the above pattern might cover a few more permutations of the img tag, and enumerateMatchesInString might be ever so slightly more efficient than matchesInString.

Rob
  • Thanks!! This regex is more useful. You are great. So, I changed "Accepted answer". Sorry, perrohunter... I need to study regex. – tsk Jan 26 '13 at 13:09
11

I don't like regular expressions, so here's my answer without them.

The JavaScript, indented for clarity:

// javascript to execute:
(function() {
    var images=document.querySelectorAll("img");
    var imageUrls=[];
    [].forEach.call(images, function(el) {
        imageUrls[imageUrls.length] = el.src;
    }); 
    return JSON.stringify(imageUrls);
})()

You'll notice I return a JSON string here. To read this back in Objective-C:

NSString *imageURLString = [self.webView stringByEvaluatingJavaScriptFromString:@"(function() {var images=document.querySelectorAll(\"img\");var imageUrls=[];[].forEach.call(images, function(el) { imageUrls[imageUrls.length] = el.src;}); return JSON.stringify(imageUrls);})()"];

// parse json back into an array
NSError *jsonError = nil;
NSArray *urls = [NSJSONSerialization JSONObjectWithData:[imageURLString dataUsingEncoding:NSUTF8StringEncoding] options:0 error:&jsonError];

if (!urls) {
    NSLog(@"JSON error: %@", jsonError);
    return;
}

NSLog(@"Images : %@", urls);
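
One reason to return JSON rather than a delimiter-joined string (which is what the code in the question attempts) is that URLs may themselves contain commas. Below is a minimal sketch of the difference, runnable in any JavaScript engine. The fakeImages array stands in for the document.querySelectorAll("img") result, and its URLs are made up.

```javascript
// Stand-in for document.querySelectorAll("img"); the URLs are made-up examples.
const fakeImages = [
  { src: "https://example.com/a.png" },
  { src: "https://example.com/thumb?size=10,20" }, // a URL that contains a comma
];

// Same collection step as the function above, written with map.
const imageUrls = Array.prototype.map.call(fakeImages, (el) => el.src);

// Joining with commas loses the boundary between URLs...
const joined = imageUrls.join(",");
console.log(joined.split(",").length); // 3 pieces for 2 URLs

// ...while a JSON round trip gives back exactly the original array.
const json = JSON.stringify(imageUrls);
console.log(JSON.parse(json).length); // 2
```

On the Objective-C side, NSJSONSerialization plays the role of JSON.parse in this sketch.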

Joris Kluivers
  • For those of you who don't like regex to parse html, refer to this classic: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Rob Jan 24 '13 at 17:10
  • Thanks. That's actually more for those who still like regex. Anyone that doesn't like regex has probably already read/experienced that :) – Joris Kluivers Jan 24 '13 at 17:13
  • @JorisKluivers Thanks! Your code worked well. Because I am not good at regex, I feel an affinity for your suggestion. – tsk Jan 24 '13 at 18:41
  • I have nothing against using regex but this javascript method is lightning fast compared to any kind of string parsing I have benchmarked. – John Estropia Aug 28 '13 at 04:08
6

You could achieve this by running a regex on the loaded web view's HTML source:

// Grab the rendered HTML of the page body
NSString *yourHTMLSourceCodeString = [webView stringByEvaluatingJavaScriptFromString:@"document.body.innerHTML"];

NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(<img src=\"(.*?)\">)+?"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];

NSArray *matches = [regex matchesInString:yourHTMLSourceCodeString
                                  options:0
                                    range:NSMakeRange(0, [yourHTMLSourceCodeString length])];

// count is an NSUInteger, so cast it and use %lu rather than %d
NSLog(@"total matches %lu", (unsigned long)[matches count]);

for (NSTextCheckingResult *match in matches) {
    // capture group 2 holds the src value
    NSString *img = [yourHTMLSourceCodeString substringWithRange:[match rangeAtIndex:2]];
    NSLog(@"img src %@", img);
}

This is a pretty basic regex that only matches tags written exactly as <img src="...">; it would need more work if your images have other attributes such as class or id.
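
To make that limitation concrete, here is a small JavaScript sketch of the same basic pattern run over three made-up tags. The sample markup and the basicSrcs helper are assumptions for illustration only.

```javascript
// JavaScript form of the basic pattern above (g = all matches, i = case-insensitive).
const BASIC_PATTERN = /(<img src="(.*?)">)+?/gi;

// Hypothetical helper: returns capture group 2 for every match.
function basicSrcs(html) {
  return Array.from(html.matchAll(BASIC_PATTERN), (m) => m[2]);
}

// Works when src is the only attribute.
console.log(basicSrcs('<img src="a.png">'));

// An attribute before src: the literal '<img src="' never matches, so nothing is found.
console.log(basicSrcs('<img class="logo" src="b.png">'));

// An attribute after src: the lazy group runs on to the final '">' and captures too much.
console.log(basicSrcs('<img src="c.png" alt="c">'));
```

The third case is the easiest to miss, because the pattern still reports a match; only the captured value is wrong.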

perrohunter
  • sorry, I missed that, it's supposed to be match – perrohunter Jan 24 '13 at 16:30
  • By the way, if you want to go nuts, you should be more careful about the `img` tag and use a regular expression like `@"(<img\\s[\\s\\S]*?src=['\"](.*?)['\"][\\s\\S]*?>)+?"` because (a) you can have attributes between the start of the `img` tag and the `src` attribute; (b) you might have attributes after the `src` attribute and before the `>`; (c) you can have newline characters in the middle of an `img` tag; (d) I think the `src` attribute value can be quoted with `'` as well as `"`; etc. – Rob Jan 24 '13 at 17:07
  • @Rob That's a great suggestion Rob! I know my regex needed more work, I just wrote a pretty simple one :P – perrohunter Jan 24 '13 at 17:11
  • @Rob Your regular expression is very good. This is what I need. Last time, you helped me. Thank you very much. – tsk Jan 24 '13 at 18:34
  • @user1928537 very good (I didn't recognize your handle ... you always could change your "display name" to something more memorable by going to [your page](http://stackoverflow.com/users/1928537/) and hitting "edit"). Anyway, I noticed the regex wasn't handling spaces before and after the `=`, a correction I've added to my answer in which I point out another regex method you could use. I don't want you to change your "accepted answer", but I also didn't feel right editing perrohunter's excellent answer, either. – Rob Jan 24 '13 at 19:01
2

Given the HTML, you can use the SwiftSoup library (Swift 3):

do {
    let doc: Document = try SwiftSoup.parse(html)
    let srcs: Elements = try doc.select("img[src]")
    let srcsStringArray: [String?] = srcs.array().map { try? $0.attr("src").description }
    // do something with srcsStringArray
} catch Exception.Error(_, let message) {
    print(message)
} catch {
    print("error")
}

kamil3