SwiftSoup parsing is not finding all HTML classes

Question

I have a method that parses a website with SwiftSoup to get the price of a product:

@objc func actionButtonTapped() {
    // Page where the method works as expected
    let url = "https://www.overkillshop.com/de/c2h4-interstellar-liaison-panelled-zip-up-windbreaker-r001-b012-vanward-black-grey.html"
    // Page where the price element is not found
    let url2 = "https://www.asos.com/de/asos-design/asos-design-schwarzer-backpack-mit-ringdetail-und-kroko-muster/prd/14253083?clr=schwarz&colourWayId=16603012&SearchQuery=&cid=4877"

    do {
        let html: String = getHTMLfromURL(url: url2)
        let doc: Document = try SwiftSoup.parse(html)

        // Select every element whose class attribute matches "price", case-insensitively
        let priceClasses: Elements = try doc.select("[class~=(?i)price]")

        for priceClass: Element in priceClasses.array() {
            let priceText: String = try priceClass.text()
            print(try priceClass.className())
            print("pricetext: \(priceText)")
        }
    } catch Exception.Error(_, let message) {
        print(message)
    } catch {
        print("error")
    }
}

The method works fine for url, but for url2 it does not print all the class names, even though they match the regex. This is where the price actually is:

<span data-id="current-price" data-bind="text: priceText(), css: {'product-price-discounted' : isDiscountedPrice }, markAndMeasure: 'pdp:price_displayed'" class="current-price">36,99 €</span>

The output of the function is this:

product-price
pricetext:
stock-price-retry-oos
pricetext:
stock-price-retry
pricetext:

It is not printing class=current-price. Is something wrong with my regex, or why does it not find that class?

EDIT:

I found out that the price is not actually inside the HTML of url2. Only the classes that are actually printed out are inside. What's the reason for that and how can I solve that?

You are using a css selector, not a regex, in `[class~=(?i)price]` — Wiktor Stribiżew, Apr 26 '20 at 13:52

score 1 · Answer 1 · answered Apr 26 '20 at 17:45

The HTML is not static; it can change over time. If you make a GET request to the site's URL, you get the initial HTML for that page. In a browser, however, JavaScript can keep changing the page's HTML after it has loaded. This is quite common:
- The site is first loaded with some JavaScript
- The JavaScript (written by the site's developer) then runs and does its work
- The content changes dynamically as that JavaScript calls some API

You can't get that content by scraping the HTML returned for the base URL.
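A quick way to confirm this is to check whether the class you are looking for appears anywhere in the raw string you hand to SwiftSoup. The sketch below is only an illustration: `staticHTMLContains` is a hypothetical helper, and both HTML strings are made-up stand-ins for what the server sends and what the browser shows after the scripts have run.

```swift
import Foundation

/// Hypothetical helper: true when `marker` occurs anywhere in `html` (case-insensitive).
func staticHTMLContains(_ marker: String, in html: String) -> Bool {
    return html.range(of: marker, options: .caseInsensitive) != nil
}

// Made-up stand-in for the HTML returned by a plain GET request.
let serverHTML = """
<div class="product-price"></div>
<span class="stock-price-retry-oos"></span>
<span class="stock-price-retry"></span>
"""

// Made-up stand-in for the DOM after the page's JavaScript has filled in the price.
let renderedHTML = serverHTML + "\n<span class=\"current-price\">36,99 €</span>"

print(staticHTMLContains("current-price", in: serverHTML))
print(staticHTMLContains("current-price", in: renderedHTML))
```

If the check fails on the string you downloaded, no selector can find the element, because it is simply not in the input.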

If you ask me how I'd do it anyway: I'd look through the site's HTTP requests to find where it gets the content, look at that API, and call it myself. I'd fetch the data and store it on my own server, then have the client call my server's API to get it. I'm also not really sure that's legal.
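As a rough sketch of the client side of that idea, assuming the API answers with JSON: the `PriceResponse` shape, its field names and the sample payload below are all hypothetical. The real endpoint and fields would have to be read from the browser's network inspector.

```swift
import Foundation

// Hypothetical response shape; the real field names depend on the API you find.
struct PriceResponse: Decodable {
    let productId: Int
    let currentPrice: Double
    let currency: String
}

/// Decodes a price payload that has already been downloaded.
func parsePrice(from data: Data) throws -> PriceResponse {
    return try JSONDecoder().decode(PriceResponse.self, from: data)
}

// Made-up sample payload standing in for the API response body.
let sample = Data("""
{"productId": 14253083, "currentPrice": 36.99, "currency": "EUR"}
""".utf8)

do {
    let price = try parsePrice(from: sample)
    print("\(price.currentPrice) \(price.currency)")
} catch {
    print("decoding failed: \(error)")
}
```

Decoding structured JSON like this tends to be less brittle than matching class names, since it does not depend on the page's markup.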

But as far as I understood from your last couple of questions, you don't want to do that.

If you really need to do that on the client, you can use WKWebView, load the page, wait for the content to show up, and then get the current HTML of the page by doing something like this:

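webView.evaluateJavaScript("document.documentElement.outerHTML") { result, error in
    // Report a failed evaluation instead of silently printing nil
    if let error = error {
        print("JavaScript evaluation failed: \(error)")
        return
    }
    // outerHTML is already a string, so cast the result before using it
    if let html = result as? String { print(html) }
}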

Look at this answer for more about this.
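To make the "load the page, wait, then read the HTML" steps concrete, here is a minimal sketch. `RenderedHTMLLoader` and `NoHTMLError` are hypothetical names, and the one-second delay is an arbitrary assumption, since `didFinish` only says the main document has loaded, not that every script has finished. The returned string could then be passed to `SwiftSoup.parse` as in the question.

```swift
import WebKit

/// Hypothetical error for the case where the script returns something other than a string.
struct NoHTMLError: Error {}

/// Hypothetical helper: loads a page in an off-screen WKWebView and hands back
/// the HTML as it looks after the page's JavaScript has had a chance to run.
final class RenderedHTMLLoader: NSObject, WKNavigationDelegate {
    let webView = WKWebView(frame: .zero, configuration: WKWebViewConfiguration())
    private var completion: ((Result<String, Error>) -> Void)?

    override init() {
        super.init()
        webView.navigationDelegate = self
    }

    func load(_ url: URL, completion: @escaping (Result<String, Error>) -> Void) {
        self.completion = completion
        webView.load(URLRequest(url: url))
    }

    // Called once the main document has finished loading.
    func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
        // Assumption: one second is enough for the scripts to insert the price.
        DispatchQueue.main.asyncAfter(deadline: .now() + 1.0) { [weak self] in
            webView.evaluateJavaScript("document.documentElement.outerHTML") { result, error in
                if let error = error {
                    self?.finish(.failure(error))
                } else if let html = result as? String {
                    self?.finish(.success(html))
                } else {
                    self?.finish(.failure(NoHTMLError()))
                }
            }
        }
    }

    func webView(_ webView: WKWebView, didFail navigation: WKNavigation!, withError error: Error) {
        finish(.failure(error))
    }

    private func finish(_ result: Result<String, Error>) {
        completion?(result)
        completion = nil
    }
}
```

The caller has to keep a strong reference to the loader until the completion handler fires; otherwise the web view is deallocated before the page finishes loading. A fixed delay is the simplest option; polling from JavaScript until the element exists would be more robust.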

I hope this solves all of your problems, because I don't think I have much more time to help you :D

thanks for all your help. The problem is that I am actually using a `ShareExtension` and I am getting the current URL from there and work with that. Is there a way to call `evaluateJavascript` inside my `ShareExtension` ?? — Chris, Apr 27 '20 at 16:31
