CFStringTokenizer not tokenizing lower-case sentences

Question

I'm trying to use CFStringTokenizer with kCFStringTokenizerUnitSentence to split a string into sentences. The first problem I'm having is that sentences need to be capitalized in order for them to be recognized as sentences. If not, it just thinks it's part of the previous sentence.

I'm splitting user-entered text so I'm expecting the text to be very unclean.

Is there something else I can do with CFStringTokenizer to have it detect uncapitalized sentences? Or will I have to use another method of splitting altogether?

I followed the answer on this SO question for my implementation: How to get an array of sentences using CFStringTokenizer?

NOTE: After testing a bit more it seems that with kCFStringTokenizerUnitSentence, if a '!' or a '?' is followed by an uncapitalized sentence, it will recognize the sentence. Also, if one of those punctuation marks is followed by a sentence without a space between the '!' and the first word, it will still separate.

So the one case I need to work around is a '.' followed by an uncapitalized sentence.

ANOTHER OPTION I found, if you're getting the text from a textField, is to use this:

textField.autocapitalizationType = UITextAutocapitalizationTypeSentences;

It will automatically capitalize sentences so you don't have to worry about converting for CFStringTokenizer. It still doesn't account for edge cases like abbreviations, but at least in my case the user will have an option to delete the auto-capitalization if it's wrong.

Do you require language-independent parse? If not, you could approximate with [sentence componentsSeparatedByString:@" "]; — danh, Mar 28 '13 at 04:42
@danh I do need language-independent parse. Also, I need something pretty robust, as the strings are going to be all over the place. I'd really like an out-of-the-box sentence tokenizer that covers all cases (if it exists). — OdieO, Mar 28 '13 at 16:12

score 0 · Accepted Answer · answered Mar 28 '13 at 06:22

You can convert the input string to all uppercase first, run the uppercased copy through CFStringTokenizer, and then use the token ranges to take substrings of the original input string. Be careful, though: some characters become more than one character when converted to uppercase (for example, 'ß' becomes "SS"), and in that case the ranges no longer line up with the original string.
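
A minimal sketch of that approach, assuming ARC and a Foundation environment. The helper name SentencesIgnoringCase is hypothetical (it is not from the answer), and the length comparison is only a coarse check for the case-expansion problem: the helper simply gives up and returns nil when uppercasing changes the length.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: tokenizes an uppercased copy of `text` into sentences
// and maps each token range back onto the original string.
// Returns nil when uppercasing changes the length (e.g. "ß" -> "SS"),
// because the ranges would then no longer match the original text.
static NSArray<NSString *> *SentencesIgnoringCase(NSString *text) {
    NSString *upper = [text uppercaseString];
    if (upper.length != text.length) {
        return nil; // coarse guard; the caller has to handle this case
    }

    NSMutableArray<NSString *> *sentences = [NSMutableArray array];
    CFStringRef cfUpper = (__bridge CFStringRef)upper;
    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfUpper,
        CFRangeMake(0, CFStringGetLength(cfUpper)),
        kCFStringTokenizerUnitSentence, locale);

    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange r = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        // Same offsets, but the substring comes from the original text.
        [sentences addObject:[text substringWithRange:NSMakeRange(r.location, r.length)]];
    }

    CFRelease(tokenizer);
    CFRelease(locale);
    return sentences;
}
```

Calling SentencesIgnoringCase(@"hello there. how are you? fine.") returns the sentences with their original casing. This only works around the capitalization issue; it does nothing for abbreviations.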

fumoboy007


I've been putting off really learning about Unicode. Are the characters to watch out for non-English characters, such as accented characters? I'm sure I can find an already compiled character set of them somewhere online... – OdieO Mar 28 '13 at 16:28
Found it: http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt. Looks like they're all Greek, Latin, Lithuanian or Turkish. German has one character: 'ß'. I'm not implementing any of these languages, so looks like this will be a non-issue for this project. – OdieO Mar 28 '13 at 16:47
So I implemented this, but now I'm realizing that a sentence such as "An m.d. named Dr. Jum." will be split wrong whether or not I capitalize the words before tokenizing. I'm going to accept your answer because it answered my question, but I still need to find a good tokenizer that accounts for edge cases like this. I feel like I've seen sentence tokenizers for other programming languages that are pretty robust. – OdieO Mar 28 '13 at 17:43
