1

I am pretty new to Swift and have an app that performs a simple URL data task to parse the HTML contents of a website. I was trying to load certain elements but wasn't getting the content that I see on the website when I inspect it manually. I don't really know what the problem is.

I guess my question is: is there a way to load the content as it would appear if I visited this website manually?

Here is the relevant code:

import Foundation
import SwiftSoup

// `link` is the page address (a String) defined elsewhere in the app.
let config = URLSessionConfiguration.default
config.httpAdditionalHeaders = ["User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"]

let session = URLSession(configuration: config)

let url = URL(string: link)

let task = session.dataTask(with: url!) { (data, response, error) in
    // Bail out early instead of force-unwrapping a missing or undecodable body.
    guard let data = data, error == nil,
          let htmlContent = String(data: data, encoding: .utf8) else {
        print(error?.localizedDescription ?? "no data")
        return
    }
    do {
        let doc: Document = try SwiftSoup.parse(htmlContent)

        // Work with `elements` here.
        let elements = try doc.getAllElements().array()

    } catch Exception.Error(type: let type, Message: let message) {
        print(type)
        print(message)
    } catch {
        print("error")
    }
}
// A data task does nothing until it is resumed.
task.resume()
            

Please let me know if there is any way to do this, even if it involves using a different package to parse the data. It is very important for my app. I would highly appreciate any help possible!

Thanks.

aadi sach

2 Answers

0

I suspect the issue may be the user agent being sent to the website whose response you are parsing.

The user agent is a string that is sent with the request to the url (as an additional header). It identifies what sort of thing you are so that an appropriate response can be sent.

For example, if you are requesting from Safari on Mac on Big Sur the user agent might be:

"Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15"

Whereas from iPad it might be:

"Mozilla/5.0 (iPad; CPU OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1"

The site serving the request uses the user agent to determine what kind of response to return and what features to include (full site, mobile site, text site, etc).

For a URLSession in a Swift app, the default user agent is not a browser string; it is built from the app's name along with CFNetwork and Darwin version details. So the site may be getting confused by that and returning something different from what you see when you visit it in a browser.

Some options:

Explore the site; it might have a better URL to use to get the info you are after.

Change the user-agent string you are sending. The basic steps are:

let config = URLSessionConfiguration.default
config.httpAdditionalHeaders = ["User-Agent": "User-Agent String Here"]
let session = URLSession(configuration: config)

You may need to adapt your use of the shared session to support this (e.g. either create a session with your config and use that, as above, or check whether there is a way to override the header for your request using the shared session).
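As a minimal sketch of the second route, a header can also be set on an individual URLRequest, which then works with the shared session. The helper names `makeRequest` and `fetchHTML` are hypothetical, chosen only for illustration, and the user-agent string is whatever you decide to send.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest and URLSession live here on Linux
#endif

/// Builds a request that carries its own User-Agent header.
/// `makeRequest` is a hypothetical helper, not part of any library.
func makeRequest(url: URL, userAgent: String) -> URLRequest {
    var request = URLRequest(url: url)
    // Set the header on this one request only.
    request.setValue(userAgent, forHTTPHeaderField: "User-Agent")
    return request
}

/// Fetches the raw HTML with the shared session (also a hypothetical helper).
/// The completion receives nil if the request fails or the body is not UTF-8.
func fetchHTML(from url: URL, userAgent: String,
               completion: @escaping (String?) -> Void) {
    let request = makeRequest(url: url, userAgent: userAgent)
    let task = URLSession.shared.dataTask(with: request) { data, _, error in
        guard let data = data, error == nil else {
            completion(nil)
            return
        }
        completion(String(data: data, encoding: .utf8))
    }
    task.resume()
}
```

Printing the string handed to the completion and comparing it with the browser's page source is a quick way to check whether the header made any difference.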

Najinsky
  • Thanks for the reply! I appreciate your help. I made some adjustments to the code, to incorporate your suggestions, yet it still doesn't seem to work. Am I missing something? I've edited the code in my initial question so please have a look. Thanks! – aadi sach Aug 18 '21 at 13:27
  • You need to compare htmlContent with the page source you are getting from the browser to narrow down the problem. If the htmlContent and page source are roughly the same, then the problem is in the parsing, but if the content seems very different and doesn't have the elements you need, then the problem is with the response. You need to isolate the type of problem you are trying to fix. There is no way we can tell which it is from the current question. – Najinsky Aug 18 '21 at 15:41
  • I understand. The htmlContent is extremely different from the page source. Pretty much none of the elements I see on the page source come up in the htmlContent. What can I do to change the response? Thanks again. – aadi sach Aug 19 '21 at 01:12
  • Update: I have narrowed down the problem. It is definitely not a problem with the parsing nor the response. When I view the page source, I have realised that it is exactly the same as what comes up when I parse it from Swift. Yet on the page itself, I can inspect the various elements that I want, which do not show up in the htmlContent. Why is this? How do I access all the elements that come up when I inspect the elements manually? And why do they not appear in the htmlContent? – aadi sach Aug 19 '21 at 02:09
  • Without knowing anything about the site, it's really just guesswork. It really could be many reasons. For example, the site might be expecting a cookie which it's getting from the browser but not from the URLSession request. If it were me, I'd use something like Charles ( https://www.charlesproxy.com ) to investigate the requests and responses in each case. – Najinsky Aug 19 '21 at 02:09
  • Okay, big update. I've learnt that 'View Source' shows the HTML that was delivered from the web server to your browser, whereas 'Inspect Element' is a developer tool that shows the state of the DOM tree after the browser has applied its error correction and after any JavaScript has manipulated the DOM. I think my code is getting the source code and not the manipulated DOM. I think my question is now: how can I parse the HTML as it is after the DOM has been modified? Thank you for your help thus far, I have learnt a lot. – aadi sach Aug 19 '21 at 02:19
0

I found a solution that works for me. Here is the relevant code:

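import UIKit
import WebKit
import SwiftSoup

private let webView: WKWebView = {
    let prefs = WKPreferences()
    // Deprecated since iOS 14, where JavaScript is already enabled by default.
    prefs.javaScriptEnabled = true
    let config = WKWebViewConfiguration()
    config.preferences = prefs
    let webView = WKWebView(frame: .zero, configuration: config)
    return webView
}()

override func viewDidLoad() {
    super.viewDidLoad()

    view.addSubview(webView)
    webView.navigationDelegate = self
    // Load the page here with webView.load(_:); the URL is not shown in this snippet.
}

// WKNavigationDelegate: called once the main frame has finished loading.
func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
    parseData()
}

func parseData() {
    // Give the page's JavaScript time to finish modifying the DOM before reading it.
    DispatchQueue.main.asyncAfter(deadline: .now() + 5.0) { [weak self] in
        self?.webView.evaluateJavaScript("document.body.innerHTML") { result, error in
            // The rendered body arrives as a String; avoid a forced cast.
            guard let htmlContent = result as? String, error == nil else {
                print("error")
                return
            }

            do {
                let doc = try SwiftSoup.parse(htmlContent)
                // Work with `allProducts` here.
                let allProducts = try doc.getAllElements().array()
            } catch {
                print("error")
            }
        }
    }
}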

Using a WebView to load the website first, then parsing the data after a delay, is a working solution for me. It might not be the best idea to have a fixed delay, so if anyone has any other suggestions they would be highly appreciated!
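One possible alternative to the fixed delay, sketched here under assumptions rather than tested against any particular site, is to poll the page until a known element exists and only then read the HTML. The helpers `elementExistsScript` and `waitForElement` are hypothetical names, and the CSS selector passed in is a placeholder for whatever element the page adds once its JavaScript has finished.

```swift
import Foundation
import WebKit

/// Builds a JavaScript expression that is true once an element matching
/// `selector` exists in the DOM. Hypothetical helper for illustration.
func elementExistsScript(selector: String) -> String {
    // Escape backslashes and single quotes so the selector stays a valid JS string literal.
    let escaped = selector
        .replacingOccurrences(of: "\\", with: "\\\\")
        .replacingOccurrences(of: "'", with: "\\'")
    return "document.querySelector('\(escaped)') !== null"
}

extension WKWebView {
    /// Checks for `selector` every `interval` seconds, up to `attemptsLeft` times.
    /// Calls `completion` with the rendered HTML, or nil if the element never appears.
    func waitForElement(_ selector: String,
                        attemptsLeft: Int = 20,
                        interval: TimeInterval = 0.5,
                        completion: @escaping (String?) -> Void) {
        evaluateJavaScript(elementExistsScript(selector: selector)) { [weak self] result, _ in
            guard let self = self else { return }
            if (result as? Bool) == true {
                // The element is present, so read the DOM as it currently stands.
                self.evaluateJavaScript("document.documentElement.outerHTML") { html, _ in
                    completion(html as? String)
                }
            } else if attemptsLeft > 1 {
                // Not there yet: try again after a short pause.
                DispatchQueue.main.asyncAfter(deadline: .now() + interval) {
                    self.waitForElement(selector,
                                        attemptsLeft: attemptsLeft - 1,
                                        interval: interval,
                                        completion: completion)
                }
            } else {
                completion(nil)
            }
        }
    }
}
```

This could be called from webView(_:didFinish:) in place of the delayed parseData(), passing the resulting string to SwiftSoup.parse. The attempt count and interval are arbitrary defaults, and a selector that never matches simply ends in a nil result instead of waiting forever.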

aadi sach
  • Glad you got it working. There are so many ways sites use to render their pages you have to treat it on a case by case basis. A couple of thoughts: 1, WKWebView may be a bit of overkill for this task, it's obviously good that it works but there may be a lighter weight API to use without the view overhead, but I'm a bit rusty in that area and can't recall if there is one. 2: Rather than a fixed delay, perhaps you can watch the page load (in Safari's timeline recorder for example) and identify when the page has loaded (eg: some element appears), and then monitor for that before parsing. – Najinsky Aug 20 '21 at 10:58