I am trying to extract the content of the og:image property from this Japanese web page, which uses UTF-8 text encoding.

The desired result is:

http://www.macotakara.jp//blog/archives/001/201701/5871de9fb4929.jpg

But I get:

jp//blog/archives/001/201701/5871bd1be125c.jpg" />

I believe the issue is related to how I use ranges.

You can refer to this for the regex: https://regex101.com/r/F29INt/1

The html code snippet is as follows:

<meta name="description" content="CES2017において、OtterBoxが、様々なモジュールを装着出来るモジュール式iPhoneケース「uniVERSE」の展示を行っていました。 背面にあるスライド式「uniVERSEケースシステム」を使用して、背面の下半分を変更す..." />
<meta property="og:image" name="og:image" content="http://www.macotakara.jp//blog/archives/001/201701/5871de9fb4929.jpg" />
<meta name="twitter:image" 

I have my regex class as follows:

public class Regex {
    let regex: NSRegularExpression
    let pattern: String

    public init(_ pattern: String) {
        self.pattern = pattern
        regex = try! NSRegularExpression(pattern: pattern, options: [.caseInsensitive])
    }

    public func matches(_ input: String) -> [NSTextCheckingResult] {
        let matches = regex.matches(in: input, options: [], range:NSRange(location:0, length:input.characters.count))
        return matches
    }
}

And the code I use as follows:

let pattern = "<meta[^>]+property=[\"']\(property)[\"'][^>]+content=[\"']([^\"']*)[\"'][^>]*>"
let regex = Regex(pattern)
let matches = regex.matches(html)

for match in matches {
    // range at index 0: full match
    // range at index 1: first capture group
    var text = ""
    text += "+++StoryPreviewCache.getMetaPropertyContent(): with pattern=\(pattern) for prop=\(property)"
    for j in 1..<match.numberOfRanges {
        text += "+++StoryPreviewCache.getMetaPropertyContent(): Groups \(j), range=\(match.rangeAt(j)), is \(html[match.rangeAt(j)])"
    }
    // text is declared inside the loop, so it must be printed here
    print(text)
}

And I get:

+++StoryPreviewCache.getMetaPropertyContent():
with pattern=<meta[^>]+property=["']og:image["'][^>]+content=["']([^"']*)["'][^>]*> 
for prop=og:image
+++StoryPreviewCache.getMetaPropertyContent(): 
Groups 1, 
range=__C._NSRange, 
is jp//blog/archives/001/201701/5871bd1be125c.jpg" />
Leo Dabus
Stéphane de Luca
  • Compare http://stackoverflow.com/questions/27880650/swift-extract-regex-matches: You must not create the NSRange from characters.count because NSString and String count characters differently. – Martin R Jan 08 '17 at 22:59
  • You need to replace your `NSRange(location:0, length:input.characters.count)` with `NSRange(0.. – OOPer Jan 08 '17 at 23:23
  • I haven't tried your suggestion yet, but how do you know when to use input.utf16.count vs input.utf8.count? Also, which do you think is the better solution, yours or Martin R's? – Stéphane de Luca Jan 08 '17 at 23:36
  • Please address me with `@OOPer` when you have questions about my comments, so that I am notified. Once you read text into a Swift `String`, it is stored in a hidden, Unicode-compliant form. But in `NSString`, all APIs work on the UTF-16 representation. So, if you want to work with `NSRange`s, which are based on `NSString`, you always need to use `utf16`. The encoding the text had before it was converted to a Swift `String` is irrelevant. I don't know which is better, but you should update your useless `String` extension. – OOPer Jan 08 '17 at 23:46
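
To illustrate the point made in the comments, here is a minimal sketch of how a Swift character count and a UTF-16 length can disagree. The sample strings are hypothetical and not taken from the page in the question, and the sketch assumes Swift 4 or later (where `String.count` replaces `characters.count`).

```swift
import Foundation

// Hypothetical sample strings, not taken from the page in the question.
let emoji = "👍"        // one Character, stored as a surrogate pair in UTF-16
let crlf = "a\r\nb"     // "\r\n" is a single Character in Swift

// Character counts (grapheme clusters) versus UTF-16 code unit counts.
let emojiCharacters = emoji.count
let emojiUTF16 = emoji.utf16.count
let crlfCharacters = crlf.count
let crlfUTF16 = crlf.utf16.count

// NSString.length is measured in UTF-16 code units, like the utf16 view.
let crlfNSLength = (crlf as NSString).length

print(emojiCharacters, emojiUTF16, crlfCharacters, crlfUTF16, crlfNSLength)
```

Whenever the two counts differ, an `NSRange` built from the character count covers the wrong span of the `NSString` that `NSRegularExpression` actually searches.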

1 Answer

Following the Stack Overflow question suggested by Martin R, I wrote this extension:

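extension NSTextCheckingResult {
    /// Returns the text matched by the given capture group.
    /// The NSRange is expressed in UTF-16 units, so it is applied to the NSString view of `text`.
    public func capture(group: Int, in text: String) -> String {
        let range = self.rangeAt(group)
        // substring(with:) already returns a Swift String
        return (text as NSString).substring(with: range)
    }
}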

And altered my code in Regex as follows:

public func matches(_ input: String) -> [NSTextCheckingResult] {
    let nsString = input as NSString
    let matches = regex.matches(in: input, range: NSRange(location: 0, length: nsString.length))
    // former code as follows
    //let matches = regex.matches(in: input, options: [], range:NSRange(location:0, length:input.characters.count))
    return matches
}
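
As an alternative to bridging to `NSString`, the full search range can be built directly from the `String`. This is only a sketch, assuming Swift 4 or later, where the `NSRange(_:in:)` initializer is available; `allMatches(of:in:)` is a hypothetical free function used for illustration, not part of the Regex class above.

```swift
import Foundation

// Hypothetical helper: runs `regex` over the whole of `input`.
func allMatches(of regex: NSRegularExpression, in input: String) -> [NSTextCheckingResult] {
    // NSRange(_:in:) converts String indices to UTF-16 offsets,
    // so the range always covers the full NSString.
    let fullRange = NSRange(input.startIndex..<input.endIndex, in: input)
    return regex.matches(in: input, options: [], range: fullRange)
}

// The emoji takes two UTF-16 units, so "bb" starts at UTF-16 offset 3.
let sampleRegex = try! NSRegularExpression(pattern: "b+", options: [])
let sampleMatches = allMatches(of: sampleRegex, in: "👍abb")
print(sampleMatches.map { $0.range })
```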

And now I use it like this:

for match in matches {
    var text = ""
    text += "+++StoryPreviewCache.getMetaPropertyContent(): with pattern=\(pattern) for prop=\(property)"
    for j in 1..<match.numberOfRanges {
        text += "+++StoryPreviewCache.getMetaPropertyContent(): Groups \(j), is \(match.capture(group:j, in: html))"
    }
}
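
For reference, here is a self-contained sketch of the whole extraction that converts the capture group's `NSRange` back to a `Range<String.Index>` instead of going through `NSString`. It assumes Swift 4 or later (`range(at:)`, `Range(_:in:)`); `metaContent(for:in:)` is a hypothetical helper name, and the description text in the sample HTML is made up so that it contains an emoji and CRLF line endings.

```swift
import Foundation

// Hypothetical helper: returns the content attribute of the first <meta> tag
// whose property attribute equals `property`, or nil if nothing matches.
func metaContent(for property: String, in html: String) -> String? {
    // Escape the property name in case it contains regex metacharacters.
    let escaped = NSRegularExpression.escapedPattern(for: property)
    let pattern = "<meta[^>]+property=[\"']\(escaped)[\"'][^>]+content=[\"']([^\"']*)[\"'][^>]*>"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return nil
    }
    // Build the search range in UTF-16 units from the String itself.
    let fullRange = NSRange(html.startIndex..<html.endIndex, in: html)
    // Convert the capture group's NSRange back to String indices.
    guard let match = regex.firstMatch(in: html, options: [], range: fullRange),
          let captureRange = Range(match.range(at: 1), in: html) else {
        return nil
    }
    return String(html[captureRange])
}

// Sample input: the description is invented; the og:image tag is the one from the question.
let sampleHTML = "<meta name=\"description\" content=\"👍\r\n日本語\" />\r\n"
    + "<meta property=\"og:image\" name=\"og:image\" content=\"http://www.macotakara.jp//blog/archives/001/201701/5871de9fb4929.jpg\" />"
let imageURL = metaContent(for: "og:image", in: sampleHTML)
print(imageURL ?? "no match")
```

Returning an optional avoids a crash when the tag is missing, which the force-unwrapping style of the code above would not handle.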
Stéphane de Luca