
I need a way to save content from a website running java, like: https://www.betfair.com/exchange/plus/tennis

I need a function that can do this: Ctrl+A, Ctrl+C, create a text document, Ctrl+V, save the document.

I know wget and curl, but I can't make them do exactly this. Any help?

    Welcome to SO ;) Please read this article on [how to ask a good question](https://stackoverflow.com/help/how-to-ask). This would include a proper description of what you are trying to achieve, your code (or the relevant snippets) as well as your efforts showing what you have tried so far and possible error messages. – iLuvLogix Oct 20 '18 at 09:02
  • Thx - actually I have tried a lot of different things with Wget and Curl, but none of them seems to save the content that I need. I haven't written any code yet, this is the basic start of my project. I thought this was "an easy one" for a linux expert, so I didn't think I needed to put a lot of words on it :) – Søren Wedel Christensen Oct 20 '18 at 09:05
  • I've posted an answer to your question - let me know if that solves your issue ;) – iLuvLogix Oct 20 '18 at 09:21
  • I guess you mean a website using JavaScript, not one running Java (or some website *served* by some Java program) – Basile Starynkevitch Oct 21 '18 at 06:21
  • That is exactly what I mean! – Søren Wedel Christensen Oct 21 '18 at 18:00
  • You really should [edit](https://stackoverflow.com/posts/52904005/edit) your question to improve it a lot. Tell more about the motivation of your question, and the *actual* website you are interested in. If possible, give the URL of that website. BTW, it could happen that what you want to achieve may be illegal, even if technically possible – Basile Starynkevitch Oct 22 '18 at 04:44

2 Answers


If you want to download and save the contents of a certain page to a file, you can use the -O file option:

 wget "https://www.betfair.com/exchange/plus/tennis" -O tennis.txt 

Please be aware that on some systems (CentOS and others) the order of command-line parameters matters:

wget -O FILE URL

works, while

wget URL -O FILE

does not (at least on CentOS).

If you want to download a whole site using wget, you can do the following:

$ wget \
 --recursive \
 --no-clobber \
 --page-requisites \
 --html-extension \
 --convert-links \
 --restrict-file-names=windows \
 --domains betfair.com \
 --no-parent \
     www.betfair.com/

Note: if you would like to suppress wget's progress and tracing output, add the -q (quiet) option.

For more information, see the wget man page:

$ man wget
iLuvLogix
  • The problem is when I use Wget or Curl, I only get the "back-site" and not the content as the players or the odds, which is what I need. If you try to run "wget -O test.txt betfair.com/exchange/plus/tennis" you will see that none of the odds or the players will be saved. That is the big issue :) – Søren Wedel Christensen Oct 20 '18 at 11:59
  • The big problem here is that the content is generated by javascript, and when I use wget, no content is generated. At least that is what I have found out so far :/ – Søren Wedel Christensen Oct 20 '18 at 12:30

The mention of Ctrl+A and Ctrl+V suggests the involvement of the clipboard (and/or some selection). That makes sense only when a display server is running with some desktop environment, which is not always the case: many web servers run in datacenters under Linux and don't have clipboards, and you can also use a Linux system from a virtual console, running a Unix shell without any display server.

This answer explains how to deal with the clipboard in shell scripts; adapt that approach to use wget or curl.

See xclip(1), wget(1), curl(1) for more and combine them cleverly, perhaps in your shell script using a pipeline.
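The pipeline idea above can be sketched as a tiny shell function. save_page is a hypothetical helper name, and this assumes a running X session with xclip and wget installed:

```shell
#!/bin/sh
# Sketch: combine wget and xclip in one pipeline.
# save_page downloads URL ($1), saves it to FILE ($2),
# and also copies the content to the X clipboard.
save_page() {
    wget -qO- "$1" | tee "$2" | xclip -selection clipboard
}

# Usage (requires a running X display):
# save_page "https://www.betfair.com/exchange/plus/tennis" tennis.txt
```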

I need a way to save content from a website

Be sure to understand in detail the HTTP exchanges (requests and replies, with their headers) involved in your particular case. You might need to deal with HTTP cookies.

Probably, your main issue is to have the JavaScript (not Java as mentioned in your question) interpreted on the HTTP client side (e.g. in some modern browser, or something mimicking it); this requires a different approach. Look into Selenium.
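A lighter-weight alternative to full Selenium, sketched below, is a headless browser dump: recent Chromium/Chrome builds can execute the page's JavaScript and print the resulting DOM with --dump-dom. This assumes such a binary is installed; the loop just picks whichever common binary name is available:

```shell
#!/bin/sh
# Sketch: let a headless browser run the JavaScript, then save the DOM.
# Assumes some Chromium/Chrome binary is installed under one of these names.
for b in chromium chromium-browser google-chrome; do
    command -v "$b" >/dev/null 2>&1 || continue
    "$b" --headless --disable-gpu --dump-dom \
        "https://www.betfair.com/exchange/plus/tennis" > tennis-rendered.html
    break
done
```

The saved tennis-rendered.html then contains the markup after script execution, which is what Ctrl+A in a browser would have selected.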

Some websites also provide a web API to query programmatically (perhaps using JSON and even REST) their content. A good example is the github REST API. You need to ask the maintainer of your target website for more.

Basile Starynkevitch
  • good point! I thought they just wanted to save the content to a file using wget or curl, in a similar behaviour to Ctrl A, Ctrl C & Ctrl V - upvote for your detailed info on that topic ;) – iLuvLogix Oct 20 '18 at 09:55
  • This is exactly what I want :) The Ctrl+A, Ctrl+V thing was only to describe my needs. The problem is when I use Wget or Curl, I only get the "back-site" and not the content as the players or the odds, which is what I need. If You try to run "wget -O test.txt https://www.betfair.com/exchange/plus/tennis" you will see that none of the odds or the players will be saved. That is the big issue :) – Søren Wedel Christensen Oct 20 '18 at 11:58