2

If I use wget to download this page:

wget http://www.aqr.com/ResearchDetails.htm -O page.html

and then attempt to view the page in less, less reports the file as being a binary.

less page.html 
"page.html" may be a binary file.  See it anyway? 

These are the response headers:

Accept-Ranges:bytes
Cache-Control:private
Content-Encoding:gzip
Content-Length:8295
Content-Type:text/html
Cteonnt-Length:44064
Date:Sun, 25 Sep 2011 12:15:53 GMT
ETag:"c0859e4e785ecc1:6cd"
Last-Modified:Fri, 19 Aug 2011 14:00:09 GMT
Server:Microsoft-IIS/6.0
X-Powered-By:ASP.NET

Opening the file in vim works fine.

Any clues as to why less cannot handle it?

palacsint
Joel

4 Answers

2

Because it is UTF-16 encoded, as can be seen from the byte order mark (BOM) ff fe in the first two octets (od -x on a little-endian machine prints them as the single 16-bit word feff):

$ od -x page.html | head -1
0000000 feff 003c 0021 0044 004f 0043 0054 0059

vim is smarter about it than less (it is more of a Unicode-era program).
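The BOM check above can also be scripted. This is only a minimal sketch: `bom_of` and `detect_bom` are hypothetical helper names, not part of any tool mentioned here, and the sketch assumes `head -c` is available. It does not distinguish a UTF-32LE BOM (ff fe 00 00) from UTF-16LE.

```shell
# Hypothetical helper: print the first three bytes of a file as lowercase hex,
# e.g. "fffe3c" for a little-endian UTF-16 file that starts with "<".
bom_of() {
    head -c 3 "$1" | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: guess the encoding from the byte order mark, if any.
detect_bom() {
    case "$(bom_of "$1")" in
        fffe*)   echo "UTF-16LE" ;;   # bytes ff fe
        feff*)   echo "UTF-16BE" ;;   # bytes fe ff
        efbbbf*) echo "UTF-8" ;;      # bytes ef bb bf
        *)       echo "no BOM" ;;
    esac
}

# Usage (assuming page.html is the downloaded file):
#   detect_bom page.html
```

For the file in the question this should report UTF-16LE, since its first two bytes are ff fe.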

added:

See Convert UTF-16 to UTF-8 under Windows and Linux, in C for what to do about it. Or use vim to write it back out with UTF-8 encoding.

Community
msw
  • Nice guess, but I think `wget` is not so stupid to produce uncompressed binary data to output. – dma_k Sep 25 '11 at 12:34
  • @dma_k: right you were, which is why I fixed my answer (that is, your comment applies to a prior edit and I didn't want people to point and laugh because it now makes little sense ;) – msw Sep 25 '11 at 12:41
  • Annoyingly in the HTML meta it is reported as "charset=iso-8859-1" - presumably this is just wrong? – Joel Sep 25 '11 at 12:47
  • @msw: No, the answer didn't look ridiculous. When I answered I was still looking at the original (different) version, and now all the answers are the same – that was funny to observe :) Anyway, this question was an easy nut for the community to crack. – dma_k Sep 25 '11 at 21:57
  • @Joel: Yes, if the HTML `<meta>` tag gives this information, it is wrong. Web designers sometimes don't know the details of the data they deal with. – dma_k Sep 25 '11 at 21:59
2

It's a UTF-16 encoded file (check with the W3C Validator). You can convert it to UTF-8 with this command:

wget http://www.aqr.com/ResearchDetails.htm -q -O - | iconv -f utf-16 -t utf-8 > page.html

less usually handles UTF-8.
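If the page has already been downloaded, the same conversion can be applied to the local file instead of fetching it again. A minimal sketch, assuming the file starts with a BOM so that iconv can work out the byte order; `to_utf8` is a hypothetical helper name and the `.utf8` suffix is an arbitrary choice:

```shell
# Hypothetical helper: convert a UTF-16 file to UTF-8, writing the result
# next to the original as "<name>.utf8" and leaving the original untouched.
to_utf8() {
    iconv -f UTF-16 -t UTF-8 "$1" > "$1.utf8"
}

# Usage (assuming page.html is the downloaded file):
#   to_utf8 page.html
#   less page.html.utf8
```

To view the page without writing a second file, piping works as well: iconv -f UTF-16 -t UTF-8 page.html | less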

edit:

As @Stephen C reported, less in Red Hat supports UTF-16. It looks to me as if Red Hat patched less to add UTF-16 support. On the official less site, UTF-16 support is currently an open issue (ref number 282).

palacsint
  • Annoyingly in the HTML meta it is reported as "charset=iso-8859-1" - presumably this is just wrong? – Joel Sep 25 '11 at 12:46
  • It's definitely not ISO-8859-1. Maybe it comes from a template or the file was saved accidentally with UTF-16. – palacsint Sep 25 '11 at 12:50
1

Firstly, it works for me. When I download the file using that command, I get a file that "less" shows me without any questions / problems. (I use Red Hat Fedora 14.)

Second, the "file" command reports "page.html" as:

page.html: Little-endian UTF-16 Unicode HTML document text, with very long lines, with CRLF line terminators

Maybe the UTF-16 encoding is the cause of the problems. (But why ... I don't know why it would work with my version of "less" and not yours.)


@palacsint's solution works for me:

wget http://www.aqr.com/ResearchDetails.htm -q -O - | \
     iconv -f utf-16 -t utf-8 > page.html
Stephen C
  • "I don't know why it would work with my version of "less" and not yours" - It's Red Hat specific, they patched `less` and [`file` too to support UTF-16](https://bugzilla.redhat.com/show_bug.cgi?id=235420). Take a look at my updated answer. On my old Debian UTF-16 is not supported by `less` nor `file`. – palacsint Sep 25 '11 at 13:21
0

Very likely this HTML file contains UTF-8 characters and your locale is not set correctly (export LANG=en_US.UTF-8 LESSCHARSET=utf-8). It may also be that the HTML contains invalid characters.

EDIT: After checking the file I can clearly see that it is UTF-16. So you need to correct your terminal settings accordingly (although I was able to see the text correctly with a UTF-8 setting; perhaps my terminal program is smart).

dma_k