5

As a new user of w3m I am trying to do something basic like:

w3m -dump_source nytimes.com > nytimes.html

The output produced gives crazy characters and symbols. However, when I browse using w3m nytimes, it loads properly, and I can even view the HTML using v.

Further, when I try:

w3m -dump_extra nytimes.com > nytimes.html

I get all the extra info associated with the site perfectly, except for the HTML source.

Any help would be appreciated.

Ruslan Osmanov
user1311034

1 Answer

7

By default, w3m requests compressed output from the server by sending the following HTTP header:

Accept-Encoding: gzip, compress, bzip, bzip2, deflate

The exact value may vary with the version of w3m, but recent versions do ask the host for compressed output through the Accept-Encoding header. With -dump_source, w3m writes the response body as received, without decompressing it, which is why the output looks like garbage. You can find out the exact headers with the following command:

w3m -dump_source -reqlog nytimes.com > /dev/null

The request and response headers will be logged to the file ~/.w3m/request.log.
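To pick out just the encoding-related headers from that log, a small helper along these lines may be enough. This is only a sketch: show_encodings is a made-up name, and the exact layout of the log is an assumption, so it simply matches the header names case-insensitively.

```shell
# show_encodings: hypothetical helper that prints the encoding-related
# header lines found in a w3m request log.
# Usage: show_encodings [logfile]
show_encodings() {
    # Default to the log location used by -reqlog; the log layout is an
    # assumption, so match on the header names only.
    grep -i -E '(accept|content)-encoding:' "${1:-$HOME/.w3m/request.log}"
}
```

Calling show_encodings with no argument after the -reqlog command should show both what w3m asked for and what the server answered with.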

You can request an uncompressed version by overriding the header as follows:

w3m -dump_source nytimes.com -o accept_encoding='identity;q=0'

Or even

w3m -dump_source nytimes.com -o accept_encoding='*;q=0'

Alternatively, decompress the output through a pipe:

w3m -dump_source nytimes.com | gunzip -f

The -f option makes gunzip copy the input unchanged to standard output if the input is not in a format gunzip recognizes, so the command also works when the server sends an uncompressed response. According to the documentation, this pass-through behaviour also requires the --stdout option, but when gunzip reads from a pipe it prints the result to standard output even without it.

Note that the server may respond with content compressed with bzip2. In this case, pipe the output through bunzip2 -f instead.
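If you do not know in advance which compression the server will pick, the choice between gunzip and bunzip2 can be made from the first bytes of the response. The following is a minimal sketch, not something w3m provides: decode_body is a hypothetical helper name, it buffers the whole response in a temporary file, and it assumes head -c, od, and mktemp are available.

```shell
# decode_body: hypothetical helper. Reads a possibly compressed stream on
# stdin and writes the decoded bytes to stdout.
decode_body() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    # First three bytes as a hex string, e.g. "1f8b08" for gzip.
    magic=$(head -c 3 "$tmp" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        1f8b*)  gunzip -c < "$tmp" ;;    # gzip magic number 1f 8b
        425a68) bunzip2 -c < "$tmp" ;;   # "BZh", the bzip2 magic number
        *)      cat "$tmp" ;;            # anything else is passed through as-is
    esac
    status=$?
    rm -f "$tmp"
    return $status
}
```

With the function defined in the current shell, the original command becomes:

    w3m -dump_source nytimes.com | decode_body > nytimes.html

Other encodings from the Accept-Encoding list, such as compress or deflate, are not handled here and would be passed through unchanged.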

Ruslan Osmanov
  • 1
    Thank you very much Ruslan. Everything worked perfectly. A follow up question...is there a way to pause w3m from loading the page source for say 5 seconds. Some websites don't load fast enough to capture the html. Tx. – user1311034 Jan 22 '17 at 16:57
  • 1
    @user1311034, do you mean the connect/read timeout? I couldn't find such an option for w3m. If you need to set a timeout, you can use the `wget` tool, which supports the `--timeout`, `--dns-timeout`, `--read-timeout`, and `--connect-timeout` options. If the answer solves the problem, please accept it. – Ruslan Osmanov Jan 24 '17 at 07:42