4

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the download file names, and threading would be a bonus.

The platform is Linux.

Charles Stewart
Cammel

6 Answers

5

wget | html2ascii

Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).

See also: lynx.

dsummersl
dsm
  • Does html2text have an option to strip whitespace? I couldn't find one. – Cammel Jan 12 '09 at 17:55
  • Not that I am aware of, but you can use awk, sed, perl, etc. to strip whitespace – dsm Jan 13 '09 at 08:44
  • Keep an eye out for limitations in the tools. `lynx`, for example, won't render things like tables. If `html2ascii` is anything like `pdftotext`, it's able to keep tables intact, but it limits output to 80 characters per line. Given a modestly wide table that would fit comfortably in, say, 150 characters per line, it'll insert newlines, stack the text vertically, and completely destroy readability and/or grepability (if that's a word). – Brian Vandenberg Jul 31 '13 at 17:25
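
For the whitespace question in the comments, here is a rough Python sketch of the kind of post-processing that awk/sed/perl would otherwise do. The function name `squeeze_whitespace` is made up for illustration, and the rules (collapse runs of spaces and tabs, keep at most one blank line in a row) are an assumption about what "strip whitespace" should mean here.

```python
import re


def squeeze_whitespace(text):
    """Collapse runs of spaces/tabs and limit consecutive blank lines to one."""
    # Normalise each line: squeeze inner runs of blanks, trim both ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    out = []
    for line in lines:
        # Keep non-empty lines, and keep a blank line only after a non-blank one.
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip()


if __name__ == "__main__":
    import sys
    # Filter mode: read converted text on stdin, write the tidied text to stdout.
    sys.stdout.write(squeeze_whitespace(sys.stdin.read()) + "\n")
```

Saved as a script (say `squeeze.py`, a hypothetical name), it could sit at the end of a pipeline after the HTML-to-text step.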
3

Python Beautiful Soup allows you to build a nice extractor.
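
A minimal sketch of what such an extractor might look like, assuming the `beautifulsoup4` package is installed. The function name `extract_text` and the choice to drop `<script>` and `<style>` blocks are illustrative assumptions, not something the library prescribes. If `bs4` is not available, the sketch falls back to the standard library's `html.parser`, which is cruder but needs no install.

```python
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None  # fall back to the stdlib parser below


class _TextOnly(HTMLParser):
    """Stdlib fallback: collect text that is not inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth of script/style elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, pieces joined by spaces."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        # Remove elements whose content is code or styling, not page text.
        for tag in soup(["script", "style"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Calling `extract_text(open("page.html", encoding="utf-8").read())` on a downloaded file should give its text content; `page.html` is just a placeholder name.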

matan h
S.Lott
0

I know that w3m can be used to render an HTML document and put the text content in a text file, for example `w3m www.google.com > file.txt`.

For the remainder, I'm sure that wget can be used.

Jean Azzopardi
0

Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element has a "plaintext" attribute, which should give you only the text. I have used this combination very successfully in a lot of applications for quite some time.

Robert Elwell
0

Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules that provide the required functionality.

olle
0

Use wget to download the required HTML and then run html2text on the output files.
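
If you also want the threading and file-name control mentioned in the question, the download step could be sketched in Python with only the standard library, as below. This is an alternative to wget, not what the answer itself uses. The names `filename_for`, `fetch_url`, and `download_all`, the `NNN_host_path.html` naming scheme, and the default of 4 worker threads are all assumptions made for illustration.

```python
import os
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def filename_for(url, index):
    """Build a predictable file name from the URL's host and path (assumed scheme)."""
    parsed = urlparse(url)
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", parsed.netloc + parsed.path).strip("_")
    return f"{index:03d}_{stem or 'page'}.html"


def fetch_url(url, timeout=30):
    """Download one URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, out_dir, workers=4, fetch=fetch_url):
    """Fetch every URL on a thread pool; return (url, path, error) tuples in order."""
    os.makedirs(out_dir, exist_ok=True)

    def job(item):
        index, url = item
        path = os.path.join(out_dir, filename_for(url, index))
        try:
            data = fetch(url)
        except Exception as exc:  # record the failure and keep going
            return url, None, exc
        with open(path, "wb") as fh:
            fh.write(data)
        return url, path, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, enumerate(urls, start=1)))


if __name__ == "__main__":
    import sys
    # Usage: python fetch.py urls.txt out_dir   (one URL per line)
    with open(sys.argv[1]) as fh:
        url_list = [line.strip() for line in fh if line.strip()]
    for url, path, err in download_all(url_list, sys.argv[2]):
        print(url, "->", path if err is None else f"FAILED: {err}")
```

The saved `.html` files can then be fed to html2text as described above. The `fetch` parameter exists only so the download function can be swapped out, for example when trying the script without network access.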

Krishna Gopalakrishnan