0

cURL: I'm trying to fetch and save the HTML page of a Blogspot blog that uses one of the "dynamic" templates, such as:

http://jackturf.blogspot.fr/

My simple attempt on the DOS command line:

"D:\EXE_UTIL\CURL\curl.exe"  -o "d:\temp.html" "http://jackturf.blogspot.fr/"

Received=21597 bytes  

But saving the same page from Google Chrome with CTRL-S (HTML, complete page) gives 160 KB!

I have been using curl for many years and it has always worked, even with cookies, but with this Google "dynamic" template I don't know how to get the full HTML page.

My cURL version (I also tried a few earlier versions):

curl 7.39.0 (i386-pc-win32) libcurl/7.39.0 OpenSSL/1.0.0o zlib/1.2.8 libidn/1.18 libssh2/1.4.3 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap pop3 pop3s rtmp rtsp scp sftp smtp smtps telnet tftp 
Features: AsynchDNS IDN Largefile SSPI SPNEGO NTLM SSL libz 

Does anybody have a solution that works from the DOS command line?

ebo
steve

2 Answers

0

The difference in size is caused by curl not executing the JavaScript inside the page, while your browser does execute the JavaScript (and thus changes the HTML) before you save it with CTRL-S.

To get the same result you would have to execute the JavaScript inside the page before you save it. This is not possible with curl, which only downloads what the server sends and does not run scripts, so you might want to look into alternatives.

Community
ebo
  • Thanks. Alternatives... yes, does anyone have a simple idea? On the command line if possible, or otherwise if not... – steve Dec 06 '14 at 09:29
0

A simple traffic analysis reveals that a JSON feed is available for parsing. Try this:

"D:\EXE_UTIL\CURL\curl.exe" -o "d:\temp.json" "http://jackturf.blogspot.fr/feeds/posts/default?alt=json&orderby=published"
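Once the feed is saved, it still has to be parsed. The following is only a minimal Python sketch of that step, not part of the original answer. It assumes the file follows the usual Blogger JSON feed layout (a top-level `feed` object with an `entry` list, text values stored under a `$t` key, and the post URL in the link whose `rel` is `alternate`). The function name `summarize_feed` is made up for illustration.

```python
import json


def summarize_feed(feed_text):
    """Return a list of (published, title, url) tuples from a Blogger JSON feed.

    Assumed layout: {"feed": {"entry": [{"title": {"$t": ...}, ...}]}}.
    Missing fields are returned as empty strings instead of raising.
    """
    data = json.loads(feed_text)
    posts = []
    for entry in data.get("feed", {}).get("entry", []):
        # Text values sit under the "$t" key in this feed format.
        title = entry.get("title", {}).get("$t", "")
        published = entry.get("published", {}).get("$t", "")
        # The human-readable post URL is the link with rel="alternate".
        url = next(
            (link.get("href", "") for link in entry.get("link", [])
             if link.get("rel") == "alternate"),
            "",
        )
        posts.append((published, title, url))
    return posts
```

With the file written by the curl command above, this could be called as `summarize_feed(open(r"d:\temp.json", encoding="utf-8").read())`. The post body, if needed, should be under the entry's `content` key in the same `$t` form, but check that against the downloaded file.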
user2243670
  • Edit: changed \temp.html to \temp.json – user2243670 Dec 06 '14 at 09:42
  • Yes, thanks, it works. The file is about 10 times bigger... but I guess I can manage with this solution. Unless someone else has other solutions to review... – steve Dec 06 '14 at 10:27
  • Analyse the traffic to find out about the API structure. For example, this URL would yield a 4.6 MB file: http://jackturf.blogspot.fr/feeds/posts/default?alt=json&orderby=published&max-results=2500 – user2243670 Dec 06 '14 at 10:35
  • I see... I'm learning... thanks. Also, is there a way to ask the JSON feed for posts by date? Something like .../feeds/posts/default?alt=json&orderby=published& ... 2014_12_01_archive, to get only the current month, then the previous month if needed, and so on... Thanks – steve Dec 06 '14 at 11:05
  • @steve If the API was written to support this type of request, then yes, there is. You'll need either to contact Blogger about it or do some more network traffic analysis. Tools like Chrome's Developer Tools come in very handy. Please accept the answer if it was helpful. – user2243670 Dec 06 '14 at 14:41
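For building feed URLs like the ones in these comments, here is a hedged Python sketch. The `alt`, `orderby` and `max-results` parameters come from the URLs quoted above. The `published-min` and `published-max` date bounds are an assumption based on my recollection of the Blogger Data API documentation and were not confirmed in this thread, so verify them against the live feed before relying on them. The helper name `feed_url` is hypothetical.

```python
from urllib.parse import urlencode


def feed_url(blog, max_results=None, year=None, month=None):
    """Build a Blogger JSON feed URL for the given blog host name.

    max_results maps to the max-results parameter seen in the thread.
    year/month map to published-min / published-max, which are ASSUMED
    to be supported by the feed (not confirmed in this discussion).
    """
    params = {"alt": "json", "orderby": "published"}
    if max_results is not None:
        params["max-results"] = max_results
    if year is not None and month is not None:
        # Range runs from the first day of the month to the first day
        # of the following month.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        params["published-min"] = f"{year:04d}-{month:02d}-01T00:00:00"
        params["published-max"] = f"{next_year:04d}-{next_month:02d}-01T00:00:00"
    return f"http://{blog}/feeds/posts/default?" + urlencode(params)
```

For example, `feed_url("jackturf.blogspot.fr", year=2014, month=12)` produces a URL that can be passed to curl in double quotes, exactly like the commands above. If the server ignores the date parameters, the response will simply be the unfiltered feed, which is an easy way to check whether the assumption holds.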