How can I use SwiftSoup to scrape a particular website that redirects?

Question

I am trying to scrape websites in Swift using SwiftSoup. However, some links, such as https://apple.news/AQZXxg8mUQfKrEaM9MRBpxw, redirect automatically using JavaScript, so SwiftSoup scrapes the opening page instead of the actual article that I want. How should I scrape this link so that I get the actual article rather than the cover page that redirects?

I tried checking the status code, but this website does not return 301 or 302; it returns 200. I also tried scraping the JavaScript portion of the page's HTML, but I don't know what to do with it.

Salman500 · Answer 1 · 2019-06-20T07:15:30.940

0

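import Foundation
import SwiftSoup

func redirectUrl() {

    let url = URL(string: "https://apple.news/AQZXxg8mUQfKrEaM9MRBpxw")!

    // Download the cover page; the article link is inside its HTML.
    URLSession.shared.dataTask(with: url) { (data, response, error) in

        // Bail out instead of force-unwrapping when the request or decoding fails.
        guard let data = data, let html = String(data: data, encoding: .utf8) else {
            print(error?.localizedDescription ?? "Could not decode the response as UTF-8")
            return
        }
        self.parse(html: html)

    }.resume()

}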

func parse(html: String) {

    do {

        let doc = try SwiftSoup.parse(html)
        // For this page, the first <a> element is the link to the original article.
        guard let link = try doc.select("a").first() else {
            print("No link found")
            return
        }
        let linkHref = try link.attr("href")

        print(linkHref)
    } catch {
        print(error.localizedDescription)
    }

}

This prints:

https://www.npr.org/2019/06/18/733401736/npr-identifies-fourth-attacker-in-civil-rights-era-cold-case

For links that redirect at the HTTP level (301 or 302), this returns the final URL:

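func redirectLink(url: URL, completion: @escaping (URL?) -> Void) {

    // A HEAD request is enough: only the final URL is needed, not the body.
    var request = URLRequest(url: url, cachePolicy: .reloadIgnoringLocalCacheData, timeoutInterval: 15.0)
    request.httpMethod = "HEAD"

    URLSession.shared.dataTask(with: request) { (data, response, error) in

        // URLSession follows HTTP redirects itself, so response.url is the final URL.
        // Always call completion, passing nil when the request failed.
        completion(response?.url)

    }.resume()

}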
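The two functions above can be combined into a single request. The following is only a sketch under stated assumptions: `resolveArticleURL`, `isCoverPage`, `coverHosts`, and `extractLink` are hypothetical names that do not appear in the answer, and the list of cover-page hosts is something the caller has to supply (for example `apple.news` and `news.apple.com`). If the request ends on a host that is not a known cover page, the final URL is returned; otherwise the HTML is handed to `extractLink`.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives in this module on Linux
#endif

/// Hypothetical helper: true when `url` is on a host the caller has listed
/// as serving a cover page that redirects with JavaScript.
func isCoverPage(_ url: URL?, coverHosts: Set<String>) -> Bool {
    guard let host = url?.host else { return false }
    return coverHosts.contains(host)
}

/// Hypothetical helper: resolves `url` to the article URL.
/// `extractLink` receives the cover page's HTML and returns the article link.
func resolveArticleURL(from url: URL,
                       coverHosts: Set<String>,
                       extractLink: @escaping (String) -> URL?,
                       completion: @escaping (URL?) -> Void) {
    URLSession.shared.dataTask(with: url) { data, response, _ in
        // URLSession follows 301/302 redirects itself; response.url is the final URL.
        let finalURL = response?.url
        guard isCoverPage(finalURL, coverHosts: coverHosts) else {
            completion(finalURL)  // nil if the request failed
            return
        }
        // Still on a cover page: fall back to the link inside the HTML.
        guard let data = data, let html = String(data: data, encoding: .utf8) else {
            completion(nil)
            return
        }
        completion(extractLink(html))
    }.resume()
}
```

For `extractLink`, one option is a closure that wraps the SwiftSoup lookup from `parse(html:)` and returns `URL(string: linkHref)` instead of printing it.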

edited Jun 20 '19 at 07:15

answered Jun 18 '19 at 18:53


Will this work for any site that uses JavaScript to redirect? – WannaInternet Jun 18 '19 at 20:00
I have tested this for Apple News, with the link given above: https://apple.news/AQZXxg8mUQfKrEaM9MRBpxw – Salman500 Jun 19 '19 at 09:03
I know you tested it for Apple News. Did you test it for other links, though? – WannaInternet Jun 20 '19 at 00:53
I tested apple.news/AQZXxg8mUQfKrEaM9MRBpxw on redirectLink and it doesn't work at all. – WannaInternet Jun 22 '19 at 23:10
Yes, it will not work; that's why I added the redirectUrl method. – Salman500 Jun 24 '19 at 07:39
I'm still confused. So how am I supposed to use both redirectUrl and redirectLink together? – WannaInternet Jun 26 '19 at 02:45
Apple News does redirect automatically: it first loads the HTML page and, after a few seconds, goes to the redirect URL. – Salman500 Jun 26 '19 at 07:49
Run this command in your terminal: curl -v https://news.apple.com/A-oPQmJNfTyi9oHKs1xCY3w – Salman500 Jun 26 '19 at 07:59
They use a timeout function: function redirectToUrlAfterTimeout(url, timeout) { setTimeout(function() { redirectToUrl(url) }, timeout); } – Salman500 Jun 26 '19 at 07:59
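
Since the redirect is triggered from JavaScript, another option is to read the target URL out of the script text instead of relying on the first link. The sketch below is an assumption, not something confirmed in this thread: it supposes the page calls `redirectToUrlAfterTimeout` with the target URL as a quoted string literal in the first argument, and `redirectTarget(inScript:)` is a hypothetical helper name. If the page passes the URL in some other way, the pattern will not match and the function returns nil.

```swift
import Foundation

/// Hypothetical helper: returns the first quoted string passed as the first
/// argument of a `redirectToUrlAfterTimeout(...)` call, parsed as a URL.
/// Assumes the call site looks like redirectToUrlAfterTimeout("https://...", 0).
func redirectTarget(inScript script: String) -> URL? {
    // The function definition takes a bare identifier, not a quoted string,
    // so it is skipped; only a call with a string literal matches.
    let pattern = #"redirectToUrlAfterTimeout\(\s*["']([^"']+)["']"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
    let fullRange = NSRange(script.startIndex..., in: script)
    guard let match = regex.firstMatch(in: script, range: fullRange),
          let urlRange = Range(match.range(at: 1), in: script) else { return nil }
    return URL(string: String(script[urlRange]))
}
```

Because the pattern only looks for the call itself, the function can be run over the raw HTML string returned by the data task, without isolating the script element first.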
