Inside my app I would like to scrape the price of any product (the user types in the URL).

I have searched quite a bit and found that there are a couple of web scrapers; I think I will use SwiftSoup for now. However, I couldn't find a single tutorial that teaches how to scrape elements with "dynamic" tags. For example, the markup for a product's price is different on every website:

Example 1:

<div class="price">82 EUR</div>

Example 2:

<span class="gl-price__value">€ 139,95</span>

Example 3:

<span id="priceblock_ourprice" class="a-size-medium a-color-price priceBlockBuyingPriceString">79,99&nbsp;€</span>

I know I can scrape elements like this:

let html: String = "<a id=1 href='?foo=bar&mid&lt=true'>One</a> <a id=2 href='?foo=bar&lt;qux&lg=1'>Two</a>";
let els: Elements = try SwiftSoup.parse(html).select("a");
for element: Element in els.array(){
    print(try element.attr("href"))
}

But what is the best way to scrape dynamically? I couldn't find anything on this, so I am grateful for any help :)

Update

I managed to get the right 'price' if I know the exact 'class-name' :

let url = "https://www.adidas.de/adistar-trikot/CV7089.html"
let className = "gl-price__value"

do {
    let html: String = getHTMLfromURL(url: url)
    let doc: Document = try SwiftSoup.parse(html)

    // first() returns nil when no element has this class, so don't force-unwrap it
    if let price: Element = try doc.getElementsByClass(className).first() {
        result.text = try price.text()
    }
} catch Exception.Error(_, let message) {
    print(message)
} catch {
    print("error")
}

However, I would like this to work for all three examples above. Right now I am struggling to find a regex that covers all three. Does anyone have an idea?

Chris

1 Answer

I don't think there is a way to scrape arbitrary sites "dynamically". You cannot anticipate every way people might write the HTML that displays a price.

What you could do, though I don't think it would be easy, is train a machine learning model to detect the price most of the time. But that's probably outside the scope of this question.

Another approach is to look at a large sample of sites and write several "generic" scraping algorithms. If one doesn't work, you try the next until you either succeed or give up. By avoiding hard-coded class names and the like, you will at least be able to scrape every site whose structure resembles one that your generic scrapers handle.
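A minimal sketch of that "try one, then the next" idea, using only Foundation. The `PriceScraper` type, the `regexScraper` helper, and both patterns are hypothetical names and examples made up for illustration, not part of SwiftSoup or any other library:

```swift
import Foundation

/// A scraper takes raw HTML and returns a price string, or nil if it finds nothing.
typealias PriceScraper = (String) -> String?

/// Builds a scraper that returns capture group 1 of `pattern` (hypothetical helper).
func regexScraper(_ pattern: String) -> PriceScraper {
    return { html in
        guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]),
              let match = regex.firstMatch(in: html, options: [],
                                           range: NSRange(html.startIndex..<html.endIndex, in: html)),
              let range = Range(match.range(at: 1), in: html) else { return nil }
        return String(html[range]).trimmingCharacters(in: .whitespacesAndNewlines)
    }
}

/// Tries each scraper in order and returns the first non-nil result.
func scrapePrice(from html: String, using scrapers: [PriceScraper]) -> String? {
    for scraper in scrapers {
        if let price = scraper(html) { return price }
    }
    return nil
}

// Illustrative patterns only; real sites will need more of them.
let scrapers: [PriceScraper] = [
    regexScraper("itemprop=\"price\"[^>]*content=\"([^\"]+)\""),
    regexScraper("class=\"[^\"]*price[^\"]*\"[^>]*>([^<]+)<"),
]

print(scrapePrice(from: "<div class=\"price\">82 EUR</div>", using: scrapers) ?? "gave up")
// prints "82 EUR"
```

Keeping each scraper as a plain closure means a new strategy can be appended to the array without touching the others.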

One way I would approach implementing a "generic" scraper (though you can probably think of better ones) is to keep a list of regexes for the class names used for prices, try them all, and then validate the text you get back from the HTML (e.g. does it contain a number? Does it contain a symbol like € or $?). I would start with something like .*price.* and add similar regexes that you find by looking at common sites.
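Roughly, the matching and validation steps could look like the sketch below. It assumes you have already pulled (class name, text) pairs out of the page with whatever parser you use. The pattern list, the currency codes, and all function names are assumptions for illustration:

```swift
import Foundation

// Hypothetical list of class-name patterns; extend it as new sites turn up.
let classPatterns = [".*price.*", ".*amount.*", ".*cost.*"]

/// True if `className` fully matches any pattern (case-insensitive).
func isPriceClass(_ className: String) -> Bool {
    classPatterns.contains { pattern in
        className.range(of: "^(?:\(pattern))$",
                        options: [.regularExpression, .caseInsensitive]) != nil
    }
}

/// Heuristic check: at least one digit plus a currency symbol or code.
func looksLikePrice(_ text: String) -> Bool {
    let hasDigit = text.rangeOfCharacter(from: .decimalDigits) != nil
    let hasCurrency = text.range(of: "[€$£¥]|\\b(EUR|USD|GBP)\\b",
                                 options: .regularExpression) != nil
    return hasDigit && hasCurrency
}

/// Returns the text of the first candidate whose class matches and whose text validates.
func firstPrice(in candidates: [(className: String, text: String)]) -> String? {
    candidates.first { isPriceClass($0.className) && looksLikePrice($0.text) }?.text
}

let candidates = [
    (className: "product-title", text: "Adistar Trikot"),
    (className: "price-label", text: "Price:"),
    (className: "gl-price__value", text: "€ 139,95"),
]
print(firstPrice(in: candidates) ?? "no price found")
// prints "€ 139,95"
```

Note that the `price-label` element is rejected even though its class matches, because its text has no digit; that is the point of the validation step.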

You will definitely run into sites you didn't think of. When the client fails to find a price on a site, it can report that back to you, and you can then inspect the site yourself. If a new regex solves the issue, add it to your list (which should probably be maintained server-side and downloaded by the client whenever it changes). Otherwise, add another scraper algorithm or generalize one of your existing ones to cover the new case (but this requires a new app release).

I'm sorry if this answer is not very specific, but your question was so broad that it was nearly impossible to be more specific.

PS: I'm not sure this is the best approach (maybe a parser is better suited for this), but one regex I could quickly think of that matches all 3 of your examples is <[^>]*class=".*price.*"[^>]*>([^<]*)<. There is probably something more clever, but with this regex you automatically get the text inside the HTML element in the first capturing group. Then you just need to sanitize it (remove unwanted characters, etc.) and maybe validate it.
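Here is a sketch of how that could be applied in Swift with `NSRegularExpression`. I swapped the `.*` inside the class attribute for `[^"]*` so the match cannot run past the closing quote; that is my own tweak, and the function name `priceCandidates(inHTML:)` is made up. The sanitizing is deliberately minimal (it only handles `&nbsp;` and surrounding whitespace):

```swift
import Foundation

// Tightened variant of the pattern above: [^"]* keeps the match inside the
// class attribute instead of letting .* run across the rest of the line.
let pricePattern = "<[^>]*class=\"[^\"]*price[^\"]*\"[^>]*>([^<]*)<"

/// Returns the sanitized text of every element whose class attribute contains "price".
func priceCandidates(inHTML html: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pricePattern,
                                               options: [.caseInsensitive]) else {
        return []
    }
    let fullRange = NSRange(html.startIndex..<html.endIndex, in: html)
    return regex.matches(in: html, options: [], range: fullRange).compactMap { match -> String? in
        // Capture group 1 is the text between the opening tag and the next "<".
        guard let range = Range(match.range(at: 1), in: html) else { return nil }
        let cleaned = String(html[range])
            .replacingOccurrences(of: "&nbsp;", with: " ")
            .trimmingCharacters(in: .whitespacesAndNewlines)
        return cleaned.isEmpty ? nil : cleaned
    }
}

let samples = """
<div class="price">82 EUR</div>
<span class="gl-price__value">€ 139,95</span>
<span id="priceblock_ourprice" class="a-size-medium a-color-price priceBlockBuyingPriceString">79,99&nbsp;€</span>
"""
print(priceCandidates(inHTML: samples))
// prints ["82 EUR", "€ 139,95", "79,99 €"]
```

This only sees text that sits directly inside the matched tag; a price wrapped in nested elements would need a real parser.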

Enricoza
  • thanks for your answer! Machine learning sounds very interesting but is way out of scope :D Here is what I was thinking on how to solve this: I need a way to scrape for `classes`, specifically `class names`. And then I could add some sort of `regex` so I can search like this: `if class == matchRegex("price") { return value }` – Chris Apr 25 '20 at 12:03
  • do you know what I mean? And could you maybe help me out with this? – Chris Apr 25 '20 at 12:03
  • I didn't understand what you specifically need. Anyway if you just want the regex for the class there is this post to help you out: https://stackoverflow.com/questions/45759496/regex-for-finding-classes-in-html-files otherwise if you want to know how to use the regex in swift look at this: https://www.hackingwithswift.com/articles/108/how-to-use-regular-expressions-in-swift – Enricoza Apr 25 '20 at 15:09
  • right now I am struggling to get the 'regex' done. Do you maybe know the right 'regex' for the 3 examples above??? – Chris Apr 25 '20 at 20:31
  • thanks! I got it working with the three examples above. You seem like you know your stuff, could you maybe have a look at this question? https://stackoverflow.com/questions/61432613/regex-for-finding-html-classes-with-jsoup I would like to get all classes that match that `price-regex` into an array so I can go through that and get one I need – Chris Apr 26 '20 at 12:39
  • oh and one more thing, what exactly do you mean by *maybe some parser is better suited for this* ? – Chris Apr 26 '20 at 12:42
  • I mean that maybe using a parser (like SwiftSoup) is better than using a simple regex on the whole HTML (but I'm not sure about it). Anyway, the answer you already got on that other question is pretty much perfect. I didn't test it, but it should just work: once you have the `Elements` object, you can call `array()` on it to get all nodes that have a class matching that regex. – Enricoza Apr 26 '20 at 12:56
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212547/discussion-between-chris-and-enricoza). – Chris Apr 26 '20 at 13:10