From a remote data source I get text snippets (usually no longer than about 100 characters) that are all upper case. They are mostly natural language, but with interspersed acronyms and punctuation (like + and -). I would like to convert this text into a readable form: make most of it lower case, but keep acronyms upper case and properly capitalize nouns and names (this is for German, where many more words are capitalized than in, say, English).

I'd prefer a solution for Cocoa (OS X), but any other approach is welcome too. I read about NSLinguisticTagger (e.g. in this question), but it seems that tagging words depends heavily on the words already being properly capitalized.

Mike Lischke

1 Answer

I’d do it in two passes. First convert it to all lowercase (except the beginning of sentences), then run spell-check on it. That should hopefully turn most of the proper nouns and acronyms into uppercase.

That’s just if you want to use existing Cocoa frameworks.
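For illustration only, here is a rough Python sketch of the two-pass idea. It is not Cocoa code, and it swaps the spell checker for a plain lookup table, so treat it as a stand-in for whatever dictionary you actually have. The names `KNOWN_ACRONYMS`, `CAPITALIZED_WORDS` and `recase` are hypothetical, and the tiny word lists are made up for the example.

```python
import re

# Hypothetical lookup tables. In practice these would come from a German
# dictionary or a spell checker, not from hard-coded sets.
KNOWN_ACRONYMS = {"EU", "USA", "ADAC"}
CAPITALIZED_WORDS = {"haus": "Haus", "berlin": "Berlin", "auto": "Auto"}

# Runs of letters only (Unicode-aware, so umlauts are included).
WORD = re.compile(r"[^\W\d_]+")


def recase(text):
    """Pass 1: lowercase everything. Pass 2: restore known capitalization."""
    lowered = text.lower()
    sentence_start = True
    pieces = []
    pos = 0
    for match in WORD.finditer(lowered):
        gap = lowered[pos:match.start()]
        # Sentence-ending punctuation between words starts a new sentence.
        if re.search(r"[.!?]", gap):
            sentence_start = True
        pieces.append(gap)
        word = match.group()
        if word.upper() in KNOWN_ACRONYMS:
            word = word.upper()                   # acronym: back to upper case
        elif word in CAPITALIZED_WORDS:
            word = CAPITALIZED_WORDS[word]        # noun or name from the table
        elif sentence_start:
            word = word[0].upper() + word[1:]     # first word of a sentence
        sentence_start = False
        pieces.append(word)
        pos = match.end()
    pieces.append(lowered[pos:])
    return "".join(pieces)


print(recase("DAS HAUS IN BERLIN GEHÖRT DER EU. ES IST GROSS - UND ALT."))
# prints: Das Haus in Berlin gehört der EU. Es ist gross - und alt.
```

The obvious weak spot is the same one a spell checker has: anything missing from the tables stays lower case, and an acronym that is also an ordinary word cannot be told apart without context.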

Wil Shipley
  • This is actually what I do now (except for POS tagging). However, this has problems (e.g. acronyms stay lower case), and since the linguistic tagger needs proper capitalization to detect nouns, it is a classic chicken-and-egg problem. – Mike Lischke Jan 13 '14 at 08:35
  • I'd guess that most acronyms don't pass spell check or maybe aren't too meaningful (kind of like [this question](http://stackoverflow.com/a/6298193/583834)) - maybe checking something like that could work? If not, are you expecting acronyms within a specific set, or are there always new acronyms coming up? – arturomp Jan 13 '14 at 12:10