4

I need to create an index for a book. The task looks easy at first glance (group the words by their first letter, then sort them), but this obvious solution works only for English. The real world is, however, more complex. See http://en.wikipedia.org/wiki/Collation :

The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
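
To make the quoted point concrete, here is a minimal Python sketch (not part of the Wikipedia text) that contrasts plain code-point sorting with a hand-written Spanish order. The names `SPANISH_ORDER`, `RANK` and `spanish_key` are hypothetical, chosen for illustration only, and the sketch assumes lowercase words without accented vowels and the post-1994 rule. A real collation library covers far more cases.

```python
# Post-1994 Spanish alphabet: ñ is a letter of its own, placed after n.
SPANISH_ORDER = "abcdefghijklmnñopqrstuvwxyz"
RANK = {letter: position for position, letter in enumerate(SPANISH_ORDER)}

def spanish_key(word):
    """Map each letter to its rank in SPANISH_ORDER (hypothetical helper).

    Assumes lowercase input with no accented vowels; unknown characters
    are ranked after all letters.
    """
    return [RANK.get(letter, len(SPANISH_ORDER)) for letter in word]

words = ["nube", "zorro", "ñu", "nota"]

code_point_sorted = sorted(words)                 # compares Unicode code points
spanish_sorted = sorted(words, key=spanish_key)   # compares alphabet ranks

print(code_point_sorted)
print(spanish_sorted)
```

In the first list "ñu" lands after "zorro" because ñ has a higher code point than z; in the second it follows the n-words, as a Spanish reader would expect.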

I tried to find an existing solution.

The DocBook stylesheets do not address the problem.

The best match I found is xindy ( http://xindy.sourceforge.net/ ), but this tool is too tightly coupled to LaTeX.

Any other suggestions?

olpa
  • Can you clarify what you mean by index? Do you want every word, just important words, etc? Do you want to have page/doc locations? – Tim Barrass Dec 09 '10 at 11:13
  • I suggest hiring a competent indexer (see http://en.wikipedia.org/wiki/Index_(publishing)#Indexer_roles) to build your index. It'll be more expensive, sure, but when I've gone looking for information in books, it is usually pretty clear whether a human or a computer program did the job. (Kurt Vonnegut (http://en.wikipedia.org/wiki/Kurt_Vonnegut) even went so far as to say an author should never index his or her own book.) – sarnold Dec 09 '10 at 11:22
  • By index I mean a part of the book, just like a table of contents or a glossary. For example, http://www.amazon.com/gp/reader/0596006624/ref=sib_dp_pt#reader-link page 251. The book already has index markers in it; now I need to process these markers to create an index section. And I have no idea how to put the words in alphabetical order. – olpa Dec 09 '10 at 11:30

2 Answers

0

Naively, you could examine every word in the text and create a hash, using the words as a key, and building up an array of locations (page numbers?) as values.
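
The naive approach can be sketched as follows. This is only an illustration: the function name `build_word_locations`, the `{page_number: text}` input shape and the `\w+` tokenizer are assumptions, not something taken from the book's tooling.

```python
from collections import defaultdict
import re

def build_word_locations(pages):
    """Map every word to the sorted list of page numbers it appears on.

    `pages` is assumed to be a dict of {page_number: page_text}.
    """
    locations = defaultdict(set)
    for page_number, text in pages.items():
        # \w+ is a crude tokenizer: it keeps letters, digits and underscores.
        for word in re.findall(r"\w+", text.lower()):
            locations[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in locations.items()}

pages = {1: "Collation orders text.", 2: "Text collation varies by language."}
index = build_word_locations(pages)

print(index["collation"])
print(index["language"])
```

A set is used per word so that a word repeated on one page is recorded only once.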

But indexes are generally a bit more focused than that.

Tim Barrass
  • I edited the question to describe the problem more precisely. Data structures are not a problem for me; the linguistic details are. – olpa Dec 09 '10 at 11:17
0

Well, after replying to the comments, I realized that I don't need a tool to generate indexes, but a library that can sort according to a locale's rules. My first experiments suggest that I'm going to use ICU and its Python bindings, PyICU. For example:

import icu
words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]
# Collator that follows French ordering rules
collator = icu.Collator.createInstance(icu.Locale.getFrance())
# Python 3 removed sorted(cmp=...); getSortKey returns bytes usable as a sort key
for word in sorted(words, key=collator.getSortKey):
    print(word)
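
Once a locale-aware sort key is available, the remaining step from the question (grouping entries under heading letters) might look roughly like the sketch below. Everything here is an assumption for illustration: `base_letter`, `group_for_index` and `fallback_key` are hypothetical helpers, and the rule "strip diacritics to find the heading letter" suits French but would be wrong for Spanish, where ñ gets its own section. `fallback_key` uses only the standard library so the sketch runs without ICU; PyICU's `collator.getSortKey` could be passed in its place.

```python
import itertools
import unicodedata

def base_letter(word):
    """Return the uppercase first letter with diacritics stripped (é -> E)."""
    decomposed = unicodedata.normalize("NFD", word[0])
    return decomposed[0].upper()

def group_for_index(words, sort_key):
    """Sort words with sort_key, then group them under heading letters."""
    ordered = sorted(words, key=sort_key)
    return [(letter, list(group))
            for letter, group in itertools.groupby(ordered, key=base_letter)]

def fallback_key(word):
    """Stand-in sort key: accent-insensitive first, original spelling second."""
    stripped = "".join(char for char in unicodedata.normalize("NFD", word)
                       if not unicodedata.combining(char))
    return (stripped.casefold(), word)

sections = group_for_index(["lichen", "élan", "liche", "écu", "eau"], fallback_key)
for letter, entries in sections:
    print(letter, entries)
```

`itertools.groupby` only merges adjacent items, so this relies on the sort key keeping all words with the same heading letter next to each other.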
olpa