I do not want to pull the whole page, just the first 40 KB of it, the way the Facebook Debugger tool does.

My goal is to grab social media metadata, e.g. og:image.

The solution can be in any programming language, such as PHP or Python.

I already have code using phpQuery that fetches the page with file_get_contents/cURL, and I know how to parse the received HTML. My question is: how do I fetch only the first n KB of a page without fetching the whole page?

Umair Ayub
  • Maybe this will help https://stackoverflow.com/a/12014561/661872 – Lawrence Cherone Sep 16 '17 at 11:10
  • @LawrenceCherone I do have code in phpQuery that uses file_get_contents/cURL and I know how to parse the received HTML, my question is **"How to fetch only first nKB of a page without fetching whole page"** – Umair Ayub Sep 16 '17 at 11:12
  • This seems already answered [here](https://stackoverflow.com/questions/2032924/how-to-partially-download-a-remote-file-with-curl). – Dardan Iljazi Sep 16 '17 at 11:15
  • The `--range` curl command-line option seems to be a good fit, but the man page doesn't say much about the specifics: https://curl.haxx.se/docs/manpage.html – Calimero Sep 16 '17 at 11:16
  • Fair enough, you could look into using curl with `CURLOPT_WRITEFUNCTION` which aborts after reading 40KB, you could also abort before once you hit `` – Lawrence Cherone Sep 16 '17 at 11:16
  • any idea how to `abort before once you hit ` – Umair Ayub Sep 16 '17 at 11:18

2 Answers

This is not specific to Facebook or any other social media site, but you can get the first 40 KB with Python like this:

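from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
your_link = "https://example.com/"  # placeholder URL, replace with the page you want
with urlopen(your_link) as response:
    start = response.read(40000)  # read at most 40000 bytes of the body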
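
Since the goal in the question is to pull og: tags out of that truncated chunk, here is a minimal sketch of how the bytes returned by read(40000) might be parsed with only the standard library. The names OpenGraphParser and extract_og_tags are made up for this illustration, and the sketch assumes the page is UTF-8 and declares its tags as `<meta property="og:..." content="...">`.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..."> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        content = attrs.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, content)


def extract_og_tags(partial_html: bytes) -> dict:
    """Parse a possibly truncated HTML prefix and return its og: tags."""
    parser = OpenGraphParser()
    # errors="replace" copes with a multi-byte character cut in half
    # at the truncation point. Assumes UTF-8, which may not hold.
    parser.feed(partial_html.decode("utf-8", errors="replace"))
    # close() is deliberately not called, so a tag that was cut off
    # mid-way stays in the parser's buffer and is simply ignored.
    return parser.tags
```

Calling extract_og_tags on the downloaded prefix should return a dict such as {"og:image": "..."} for every og: tag that fits entirely inside the first 40 KB.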
mdegis

This could be used:

curl -r 0-40000 -o 40k.raw https://www.keycdn.com/support/byte-range-requests/

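The -r option stands for range. Note that 0-40000 is inclusive, so it requests 40,001 bytes, and a server that does not support range requests will ignore it and send the whole page.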

From the curl man page:

-r, --range <range>
          (HTTP FTP SFTP FILE) Retrieve a byte range (i.e. a partial document) from a HTTP/1.1, FTP or SFTP server or a local FILE. Ranges can be
          specified in a number of ways.

          0-499     specifies the first 500 bytes

          500-999   specifies the second 500 bytes

          -500      specifies the last 500 bytes

          9500-     specifies the bytes from offset 9500 and forward

          0-0,-1    specifies the first and last byte only(*)(HTTP)

More info can be found in this article: https://www.keycdn.com/support/byte-range-requests/
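
The same byte-range idea can be sketched in Python by sending a Range header yourself. This is only an illustration under some assumptions: build_range_request and fetch_prefix are hypothetical names, the 10-second timeout is arbitrary, and the read(limit) call is there as a safety net for servers that ignore the header and answer with the full page.

```python
from urllib.request import Request, urlopen


def build_range_request(url: str, limit: int = 40000) -> Request:
    """Build a request asking for the first `limit` bytes only."""
    # HTTP byte ranges are inclusive, so 0-(limit-1) is exactly `limit` bytes.
    return Request(url, headers={"Range": "bytes=0-%d" % (limit - 1)})


def fetch_prefix(url: str, limit: int = 40000) -> bytes:
    """Return at most `limit` bytes from the start of `url`."""
    request = build_range_request(url, limit)
    with urlopen(request, timeout=10) as response:
        # A server that honors the header replies 206 Partial Content.
        # One that ignores it replies 200 with the whole body, so the
        # read is capped as well and the rest is dropped on close.
        return response.read(limit)
```

For example, fetch_prefix("https://www.keycdn.com/support/byte-range-requests/") should return no more than 40000 bytes whether or not the server supports range requests.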

Just in case, here is a basic example of how to do it with Go:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    response, err := http.Get("https://google.com")
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()
    // Read at most 40000 bytes of the body (io.ReadAll needs Go 1.16+;
    // older versions use ioutil.ReadAll).
    data, err := io.ReadAll(io.LimitReader(response.Body, 40000))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("data = %s\n", data)
}
nbari