
Background

Split database column names into equivalent English text to seed a data dictionary. The English dictionary is created from a corpus of corporate documents, wikis, and email. The dictionary (lexicon.csv) is a CSV file of words and their probabilities. Thus, the more often someone writes the word "therapist" (in email or on a wiki page), the higher the chance that "therapistname" splits into "therapist name" as opposed to something else. (The lexicon probably won't even include the word "rapist".)

Source Code

Data Files

Problem (updated 2011-01-03)

When the following problem is encountered:

dependentrelationship::end depend ent dependent relationship
end=0.86
ent=0.001
dependent=0.8
relationship=0.9

These possible solutions exist:

dependentrelationship::dependent relationship
dependentrelationship::dep end ent relationship
dependentrelationship::depend ent relationship

The lexicon contains words with their relative probabilities (based on word frequency): dependent 0.8, end 0.86, relationship 0.9, depend 0.3, and ent 0.001.

Eliminate the solution dep end ent relationship because dep is not in the lexicon (only 75% of its words are lexicon words), whereas the other two solutions cover 100% of their words. Of the remaining solutions, the probability of dependent relationship is 0.8 × 0.9 = 0.72, whereas depend ent relationship is 0.3 × 0.001 × 0.9 = 0.00027. We can therefore select dependent relationship as the correct solution.
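
To make that selection step concrete, here is a minimal Java sketch (not the actual implementation) that ranks candidate splits by lexicon coverage first and by the product of word probabilities second. The class name, method names, and hard-coded lexicon are illustrative stand-ins for the data read from lexicon.csv.

import java.util.*;

// Illustrative sketch: rank candidate splits by lexicon coverage, then by the
// product of the word probabilities. The hard-coded lexicon stands in for the
// data read from lexicon.csv.
public class SegmentationChooser {

  public static List<String> choose( List<List<String>> candidates,
                                     Map<String, Double> lexicon ) {
    List<String> best = null;
    double bestCoverage = -1.0;
    double bestProbability = -1.0;

    for( List<String> candidate : candidates ) {
      int known = 0;
      double probability = 1.0;

      for( String word : candidate ) {
        Double p = lexicon.get( word );

        if( p != null ) {
          known++;
          probability *= p;
        }
      }

      double coverage = (double)known / candidate.size();

      // Higher coverage wins; ties fall back to the probability product.
      if( coverage > bestCoverage ||
          (coverage == bestCoverage && probability > bestProbability) ) {
        best = candidate;
        bestCoverage = coverage;
        bestProbability = probability;
      }
    }

    return best;
  }

  public static void main( String[] args ) {
    Map<String, Double> lexicon = new HashMap<>();
    lexicon.put( "dependent", 0.8 );
    lexicon.put( "end", 0.86 );
    lexicon.put( "relationship", 0.9 );
    lexicon.put( "depend", 0.3 );
    lexicon.put( "ent", 0.001 );

    List<List<String>> candidates = Arrays.asList(
      Arrays.asList( "dependent", "relationship" ),
      Arrays.asList( "dep", "end", "ent", "relationship" ),
      Arrays.asList( "depend", "ent", "relationship" ) );

    // Prints: dependent relationship
    System.out.println( String.join( " ", choose( candidates, lexicon ) ) );
  }
}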

Related

Question

Given:

// The concatenated phrase or database column (e.g., dependentrelationship).
String concat;

// All words (String) in the lexicon found within concat, in left-to-right order,
// paired with the ranked probability of each word (Double). (E.g., {end, 0.97},
// {dependent, 0.86}, {relationship, 0.95}.)
List<Map.Entry<String, Double>> words;

How would you implement a routine that generates the most likely solution based on lexicon coverage and probabilities? For example:

StringBuilder result = new StringBuilder();

for( Map.Entry<String, Double> word : words ) {
  result.append( word.getKey() ).append( ' ' );

  // What goes here?

  System.out.printf( "%s=%f\n", word.getKey(), word.getValue() );
}

Thank you!

Dave Jarvis
  • Are you certain that you want to exclude any words that are not in your lexicon? There will always be 'interesting' words that are not in the lexicon. – Stompchicken Jan 05 '11 at 10:21
  • @StompChicken: 1) Technical manuals + business documents -> corpus. 2) Corpus + dictionary (w/lemmas) -> probability lexicon. 3) Algorithm( lexicon + columns ) -> split words. I have since tested the solution against 3300+ column names. The software correctly split 87% of the words (based on a random sample of 100). The remaining 13% that I saw could be accurately split if (a) the dictionary included lemmas; and (b) the corpus had more data. Both of those issues can be easily and quickly resolved. – Dave Jarvis Jan 05 '11 at 14:27
  • @Dave Jarvis I'm not sure if that's a yes or a no. My point is that word distributions are long-tailed and the amount of data required to gain those extra percentage points increases exponentially. – Stompchicken Jan 05 '11 at 14:32
  • @StompChicken: Words not in the lexicon are parsed correctly (e.g., ageconrolnoticeperiod -> age **conrol** notice period). The closest answer to your question is *yes*, but that does not mean that the 'interesting' words are discarded from the solution. Keep in mind that the goal is not 100% conversion. Attaining 95% will suffice: the rest can be fixed manually. – Dave Jarvis Jan 05 '11 at 15:13

3 Answers

Peter Norvig has written some stuff in Python.

http://norvig.com/ngrams/ngrams.py

contains a function called segment. It computes the Naive Bayes probability of a sequence of words and works well. It can be a good basis for what you're trying to accomplish in Java.
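
For a rough idea of how that might translate to Java, here is a sketch in the spirit of Norvig's segment: recursive splitting with memoization, scored by summed log-probabilities. The lexicon map and the small probability assigned to unknown words are assumptions for illustration; they are not taken from ngrams.py.

import java.util.*;

// Sketch in the spirit of Norvig's segment(): try every prefix split, score the
// words by summed log-probability, and memoize sub-results. The unknown-word
// probability below is an assumed penalty, not part of ngrams.py.
public class Segmenter {

  private final Map<String, Double> lexicon;
  private final Map<String, List<String>> memo = new HashMap<>();

  public Segmenter( Map<String, Double> lexicon ) {
    this.lexicon = lexicon;
  }

  public List<String> segment( String text ) {
    if( text.isEmpty() ) {
      return new ArrayList<>();
    }

    if( memo.containsKey( text ) ) {
      return memo.get( text );
    }

    List<String> best = null;
    double bestScore = Double.NEGATIVE_INFINITY;

    for( int i = 1; i <= text.length(); i++ ) {
      List<String> candidate = new ArrayList<>();
      candidate.add( text.substring( 0, i ) );
      candidate.addAll( segment( text.substring( i ) ) );

      double score = score( candidate );

      if( score > bestScore ) {
        bestScore = score;
        best = candidate;
      }
    }

    memo.put( text, best );
    return best;
  }

  // Sum of log-probabilities; unknown words are allowed but heavily penalised.
  private double score( List<String> words ) {
    double total = 0.0;

    for( String word : words ) {
      total += Math.log( lexicon.getOrDefault( word, 1e-9 ) );
    }

    return total;
  }

  public static void main( String[] args ) {
    Map<String, Double> lexicon = new HashMap<>();
    lexicon.put( "dependent", 0.8 );
    lexicon.put( "end", 0.86 );
    lexicon.put( "relationship", 0.9 );
    lexicon.put( "depend", 0.3 );
    lexicon.put( "ent", 0.001 );

    // Prints: [dependent, relationship]
    System.out.println( new Segmenter( lexicon ).segment( "dependentrelationship" ) );
  }
}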

I would approach the problem slightly differently. It is significant that "end" and "dependent" overlap, yet that information is lost in your word Map. If, instead of a single word map, you were to create a set of word maps, each one representing a possible segmentation of the column name into non-overlapping words, you could compute a score for each segmentation based on the words' probabilities and lengths. The score for a segmentation would be the average of the scores of its individual words. The score for an individual word would be some function of the word's length (l) and probability (p), something like:

score = a*l + b*p

where a and b are weights that you can tweak to get the right mix. Average the scores for each word to get the score for the segmentation, and pick the segmentation with the highest score. The score function need not be a linear weighting, either; you could experiment with logarithmic, exponential, or higher-order polynomial functions (squares, for instance).
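
A small sketch of that idea, assuming placeholder weights and a lexicon map of word probabilities (both are tunable assumptions, not values given in the answer):

import java.util.*;

// Sketch of the length/probability scoring described above. The weights A and B
// are arbitrary placeholders to tune; the lexicon map stands in for lexicon.csv.
public class SegmentationScore {

  private static final double A = 0.5;  // weight on word length (l)
  private static final double B = 10.0; // weight on word probability (p)

  // score = a*l + b*p for a single word; words missing from the lexicon get p = 0.
  static double wordScore( String word, Map<String, Double> lexicon ) {
    return A * word.length() + B * lexicon.getOrDefault( word, 0.0 );
  }

  // A segmentation's score is the average of its word scores.
  static double segmentationScore( List<String> words, Map<String, Double> lexicon ) {
    double total = 0.0;

    for( String word : words ) {
      total += wordScore( word, lexicon );
    }

    return total / words.size();
  }

  public static void main( String[] args ) {
    Map<String, Double> lexicon = new HashMap<>();
    lexicon.put( "dependent", 0.8 );
    lexicon.put( "end", 0.86 );
    lexicon.put( "relationship", 0.9 );
    lexicon.put( "depend", 0.3 );
    lexicon.put( "ent", 0.001 );

    // "dependent relationship" scores higher than "depend ent relationship".
    System.out.println( segmentationScore( Arrays.asList( "dependent", "relationship" ), lexicon ) );
    System.out.println( segmentationScore( Arrays.asList( "depend", "ent", "relationship" ), lexicon ) );
  }
}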
killdash9

Your problem is a very common one in NLP. Don't start by reinventing the wheel; it will take you a long time and the result will not be as good as what is out there already.

You should certainly start by seeing what the NLP libraries have to offer: http://en.wikipedia.org/wiki/Natural_language_processing and http://en.wikipedia.org/wiki/Category:Natural_language_processing_toolkits. Your problem is a common one and there are different approaches which you will need to explore for your corpus.

Your word splitting may be found under hyphenation routines. Two possible approaches are n-grams (where the frequency of, say, 4-character substrings is used to predict boundaries) and tries, which capture common starts or ends of words. Some of these may help with misspellings.
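
To illustrate the trie approach, here is a small sketch that, given a start position in a column name, lists every lexicon word beginning there; those matches would become the candidate boundaries for a splitter. The class and method names are illustrative and not taken from any particular toolkit.

import java.util.*;

// Illustrative trie over lexicon words: walking the trie from a given position in
// the column name yields every lexicon word that starts there.
public class LexiconTrie {

  private final Map<Character, LexiconTrie> children = new HashMap<>();
  private boolean isWord;

  public void add( String word ) {
    LexiconTrie node = this;

    for( char c : word.toCharArray() ) {
      node = node.children.computeIfAbsent( c, k -> new LexiconTrie() );
    }

    node.isWord = true;
  }

  // Returns every lexicon word that starts at position 'start' in 'text'.
  public List<String> prefixesAt( String text, int start ) {
    List<String> matches = new ArrayList<>();
    LexiconTrie node = this;

    for( int i = start; i < text.length(); i++ ) {
      node = node.children.get( text.charAt( i ) );

      if( node == null ) {
        break;
      }

      if( node.isWord ) {
        matches.add( text.substring( start, i + 1 ) );
      }
    }

    return matches;
  }

  public static void main( String[] args ) {
    LexiconTrie trie = new LexiconTrie();

    for( String word : Arrays.asList( "dependent", "depend", "end", "ent", "relationship" ) ) {
      trie.add( word );
    }

    // Prints: [depend, dependent]
    System.out.println( trie.prefixesAt( "dependentrelationship", 0 ) );
  }
}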

But there is no trivial answer - find what works best for you.

peter.murray.rust
  • We parse chemical names - e.g. "chloromethylbenzene" - and use a large vocabulary of fragments for both DFAs and regexes. We had to write our own code, which took ca. 2 person-years. The parsing can take place from both ends. We also use n-grams, and I have used tries, again starting from both ends. We have made some limited progress with detecting unknown lexemes (there are an infinite number, as chemists are allowed to invent them) and misspellings. I don't know immediately what packages do English word splitting, but I would start with hyphenation. In German this will be an essential feature. – peter.murray.rust Jan 03 '11 at 10:30
  • This problem is *much* simpler, and the solution need not be 100% perfect. I don't need to analyze words based on n-grams. The last paragraph in the revised problem statement describes the algorithm required to pick the most probable answer. I doubt that devising a solution for such a problem would take years... :-) I appreciate your feedback. – Dave Jarvis Jan 03 '11 at 14:53