
I'm trying to scrape all links from yahoo.com and get the size of the page itself. If I set User-Agent = "Mozilla/5.0" on my HTTP request, I can scrape all the links, but the content length comes back as 0.

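let client = reqwest::blocking::Client::new();
// Build the request with a browser-like User-Agent.
let response = client.get(link)
    .header("User-Agent", "Mozilla/5.0");

match response.send() {
    Ok(rep) => {
        // content_length() returns None when the size is unknown, so this unwrap can panic.
        Some((rep.content_length().unwrap(), rep.text().unwrap()))
    },
    Err(_e) => {
        None
    }
}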

Here's the result from the terminal:

[the list of links scraped from www.yahoo.com appears here; omitted for brevity] url:https://www.yahoo.com/; size:0

The terminal shows that I scraped https://www.yahoo.com, with a reported content length of 0.

However, with the same code, if I remove the line .header("User-Agent", "Mozilla/5.0");, I do receive a content length (something like 183174), but I can't scrape any links from yahoo.com.

If I cheat by calling len() on the HTML text I received, I get a size of about 600000.
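For reference, here is a minimal sketch of the len() fallback described above, assuming the goal is simply "the number of bytes in the page". The helper name `page_size` and the sample body are hypothetical and not part of the original code. It uses the declared length when one is reported and non-zero, and otherwise counts the bytes of the body that was actually received. Note that `str::len()` counts UTF-8 bytes, not characters, and that a declared length may describe a compressed body, so the two numbers need not match.

```rust
/// Hypothetical helper: pick a page size in bytes.
/// `declared` is what `rep.content_length()` reported (None if unknown);
/// `body` is the text that was actually received.
fn page_size(declared: Option<u64>, body: &str) -> u64 {
    match declared {
        // Trust the reported length only when it is present and non-zero.
        Some(n) if n > 0 => n,
        // Otherwise fall back to the byte length of the received body.
        _ => body.len() as u64,
    }
}

fn main() {
    // Stand-in for the HTML returned by `rep.text()`.
    let body = "<html><a href=\"https://www.yahoo.com/\">Yahoo</a></html>";

    // No declared length: the size comes from the body itself.
    println!("size without header: {}", page_size(None, body));

    // A declared length of 0 is treated the same way.
    println!("size with zero header: {}", page_size(Some(0), body));

    // A non-zero declared length is used as-is.
    println!("size with header: {}", page_size(Some(183174), body));
}
```

In the request code above, this would mean reading `rep.content_length()` into a variable first (because `rep.text()` consumes the response) and then passing both values to the helper.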

  • You'd have to ask Yahoo! about the details of their HTTP implementation. – cdhowie Aug 12 '22 at 05:20
  • When I make request to yahoo.com from browser, I don't get content-length header too, so this is probably something you can't fix on your side. – Cerberus Aug 12 '22 at 05:46
  • Because in general [`Content-Length`](https://httpwg.org/specs/rfc7230.html#header.content-length) is not mandatory. – rodrigo Aug 12 '22 at 05:57
