I'm trying to get all the video results from a web page:

$ curl -qs https://ok.ru/video/c335170 | pup '.video-card_lk attr{href}' | wc -l
24

Another method returns the same result:

$ wget --config="/dev/null" -qO- https://ok.ru/video/c335170 | grep -oP '/video/\d+' | sort -u | wc -l
24

EDIT 1: I scrolled the web page to the end in Firefox, saved it as c335170.html, and I get the same result:

$ cat c335170.html | grep -oP '/video/\d+' | sort -u | wc -l
24

However, after scrolling to the end, the web browser shows 81 results.

The same problem occurs with YouTube, where the "Load more" button hides results from command-line HTTP clients:

$ curl -qs https://www.youtube.com/user/impacttvouaga/videos | grep -oP "/watch\?v=[\w-]+" | uniq | wc -l
21

EDIT 2: I've just saved this web page with Firefox as "Web Page, HTML only" into RMC_IMPACTV__YouTube.html, and then:

$ cat RMC_IMPACTV__YouTube.html | grep -oP "/watch\?v=[\w-]+" | uniq | wc -l
21

How can I get the remote HTTP server to give me all the results?

SebMa
  • See https://stackoverflow.com/questions/14417994/how-can-be-scraped-using-php-curl-a-webpage-with-infinite-scroll – peak Apr 30 '19 at 21:55
  • @peak Wow, this is getting much more complicated than I thought. Does this mean I have to write a `https://ok.ru`-specific PHP script to retrieve what I want? – SebMa Apr 30 '19 at 22:20
  • I'd try to find out whether ok.ru has an API, so you can avoid all the complexities of simulating "on scroll" triggers. (The simulation would not have to be done in PHP ...) – peak Apr 30 '19 at 22:25
  • @peak First, I'd like to try something much simpler: saving `https://ok.ru/video/c335170` with Firefox as `Web Page, HTML only` into `c335170.html`. But somehow Firefox does not save all the results it shows into this file. Any idea why? – SebMa Apr 30 '19 at 22:48
  • You could try: browse, scroll to end, and then save. – peak Apr 30 '19 at 23:18
  • @peak I did so, but it does not work and I don't understand why. Take a look at my EDIT 1. – SebMa Apr 30 '19 at 23:29
  • See also https://unix.stackexchange.com/questions/440965/how-to-curl-full-web-page-content – peak May 01 '19 at 00:40
  • @peak I think I'll give console browsers (lynx, w3m, ...) a try before digging any deeper. – SebMa May 01 '19 at 11:17
  • I was able to download the expanded HTML using the Chrome "Developer Tools" interface, but it involves several steps ... – peak May 01 '19 at 18:53
  • @peak I found this Firefox add-on: [Save Page WE](https://addons.mozilla.org/en-US/firefox/addon/save-page-we/). It works :-) – SebMa May 01 '19 at 19:58
  • @peak This add-on is also available for Chrome: [Save Page WE](https://chrome.google.com/webstore/detail/save-page-we/dhhpefjklgkmgeafimnjhojgjamoafof) – SebMa May 03 '19 at 01:36
  • @peak I've tried a few add-ons, and [Scroll it!](https://chrome.google.com/webstore/detail/scroll-it/nlndoolndemidhlomaokpfbicfnjeeed/related) seems to be the most suitable one I've found so far for automatically scrolling down. – SebMa May 04 '19 at 19:29

1 Answer

To scroll down to the end of the page I installed the Scroll it! add-on, and to download the expanded HTML I installed Save Page WE.
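
Once the expanded page has been saved with Save Page WE, the number of links it contains can be checked with the same kind of pipeline as in the question. This is only a minimal sketch: `count_video_links` is a helper name made up for illustration, `c335170_expanded.html` is an assumed file name, and `grep -oE` with `[0-9]+` is used in place of `grep -oP` with `\d+` so it also works where grep has no PCRE support.

```shell
# Count the distinct /video/<id> links found in a saved HTML file.
# count_video_links is a hypothetical helper, not part of any tool.
count_video_links() {
    # -o prints only the matched text, -E enables extended regexes;
    # sort -u removes duplicate links before wc -l counts them.
    grep -oE '/video/[0-9]+' "$1" | sort -u | wc -l
}

# Example call (the file name is an assumption):
# count_video_links c335170_expanded.html
```

If the number printed matches what the browser shows after scrolling to the end, the saved file holds the full list.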

SebMa