I need to spell check a large number of big HTML and XML documents (more than 30,000). I also need a custom dictionary and sophisticated checking algorithms. I am trying to use Bash and Linux utilities (`sed`, `grep`, ...) together with `hunspell`. `hunspell` has the `-H` option, which forces it to treat the document as HTML (the option also works for XML). But there is one problem: it outputs offsets rather than line numbers. Feeding it the document line by line does not work either, because then it looks inside tags (it cannot find the closing tag).
So what is the right way to do this?
9

MaXal
- What exactly are you missing in plain `aspell`? – Šimon Tóth Apr 06 '11 at 13:15
- I'd recommend that you add an XML tag to the post. There are a fair number of advanced XML users at S.O. Good luck! – shellter Apr 06 '11 at 13:43
- I can't find how to force `aspell` to output line numbers instead of the strange and useless offset (as in `hunspell`). – MaXal Apr 06 '11 at 13:56
- Hunspell now has the `-X` option for XML. – jww Mar 16 '18 at 00:59
2 Answers
7
I just had a similar problem. You should be able to get good output by using the undocumented switches `-u` or `-U`. But be careful: these features seem to be experimental right now, and I only found out about their existence by looking at the hunspell sources.
So essentially:
hunspell -H -u my-file.html
should do it.
Alternatively, there are also the switches `-u1`, `-u2` and `-u3` that you can play around with.
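If the undocumented switches do not give usable line numbers, another route is to extract the text yourself and only ask hunspell which words it rejects. The following is a minimal sketch of that idea, not something taken from the hunspell sources: it assumes Python 3, `hunspell` on the `PATH`, and an installed `en_US` dictionary. The names `TextWords`, `words_with_lines`, `misspelled` and `report` are made up for illustration. It relies on the standard library `HTMLParser`, which is lenient enough for most XML but ignores CDATA sections, so treat XML support as an assumption to verify on your own files.

```python
import re
import subprocess
from html.parser import HTMLParser

# Runs of letters, optionally joined by apostrophes (e.g. "don't").
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


class TextWords(HTMLParser):
    """Collect (line, word) pairs from text nodes only, never from tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        # getpos() reports the 1-based line where this text chunk starts.
        start_line, _ = self.getpos()
        for offset, text in enumerate(data.split("\n")):
            for word in WORD.findall(text):
                self.words.append((start_line + offset, word))


def words_with_lines(markup):
    """Return every word found in text content together with its line number."""
    parser = TextWords()
    parser.feed(markup)
    parser.close()
    return parser.words


def misspelled(words, dictionary="en_US", personal=None):
    """Ask hunspell (list mode, -l) which of the given words it rejects."""
    command = ["hunspell", "-d", dictionary, "-l"]
    if personal:
        command += ["-p", personal]  # personal (custom) dictionary file
    unique = "\n".join(sorted(set(words))) + "\n"
    result = subprocess.run(command, input=unique, capture_output=True,
                            text=True, check=True)
    return set(result.stdout.split())


def report(markup, bad_words):
    """Return (line, word) pairs for the words contained in bad_words."""
    return [(line, word) for line, word in words_with_lines(markup)
            if word in bad_words]
```

A typical use would be to read one file into `markup`, call `misspelled(word for _, word in words_with_lines(markup))`, and pass the resulting set to `report`. Sending each distinct word to hunspell only once keeps the external call cheap, which matters with tens of thousands of documents.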
1
Have you tried using tidy?
I have not used it on such a large number of files, but it worked fine for finding issues in 100+ HTML pages. You can also use it on XML files, and it accepts a configuration file with many options, which I have not yet explored.

Victor
- I can't find any option for specifying a custom dictionary. Is it possible? And how fast and reliable is it for spell checking? – MaXal Apr 10 '11 at 20:09
- If it can't be set in the configuration file, I'm not sure it can be done in tidy. A single HTML file is parsed instantly, but I am not sure how long it will take to parse thousands. You'll also need a script or something similar to parse the results, because they can be very verbose. – Victor Apr 11 '11 at 07:46
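For the "script to parse the results" part, here is a rough sketch in Python. It assumes tidy reports problems on stderr in the form `line N column M - Severity: message`, and that `-q` (quiet) and `-e` (errors and warnings only) are available in your tidy build. Check both against your installed version. The function names `parse_tidy_output` and `run_tidy` are hypothetical.

```python
import re
import subprocess

# Assumed message shape: line 12 column 5 - Warning: <img> lacks "alt" attribute
TIDY_MESSAGE = re.compile(r"^line (\d+) column (\d+) - (\w+): (.*)$")


def parse_tidy_output(output):
    """Return (line, column, severity, message) tuples from tidy's report."""
    results = []
    for raw in output.splitlines():
        match = TIDY_MESSAGE.match(raw.strip())
        if match:
            line, column, severity, message = match.groups()
            results.append((int(line), int(column), severity, message))
    return results


def run_tidy(path):
    """Run tidy on one file and parse whatever it reports."""
    # tidy uses a non-zero exit status for warnings and errors,
    # so check=True is deliberately not used here.
    proc = subprocess.run(["tidy", "-q", "-e", path],
                          capture_output=True, text=True)
    return parse_tidy_output(proc.stderr)
```

Lines that do not match the assumed shape (summary and info lines, for example) are dropped, which removes most of the verbosity.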