R spell checker / tokenizer

Question

I'm not sure whether R is the right place to try this, but here's my situation: I have a character vector full of strings.

id    Words
 1    'The'
 2    'victory'
 3    'wasgreat'
...   ...

The original data had some encoding problems, and some of the strings are concatenations of several words:

 (i.e. 'My name is' -> 'Mynameis').

I need to leave the correct words alone and split the misspelled concatenations into their correct substrings.

I'm curious whether there's any setup in R to handle this type of problem. I think there are several Python programs that would handle this much better, but my Python skills are substantially weaker (bordering on non-existent). However, I'd be willing to consider it as an alternative.

Any suggestions?

http://stackoverflow.com/questions/6897214/breaking-a-string-into-individual-wordspython — fraxel, Mar 20 '12 at 15:52
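For what it's worth, the splitting described in the question can be sketched in base R as dictionary-based segmentation. This is only a minimal sketch under stated assumptions: the `dictionary` vector is a tiny hypothetical word list (a real run would need a full one), and `segment()` and `fix_words()` are illustrative helper names, not functions from any package. Split words come back in lower case, and when a string can be split in more than one way, only the first split found is returned.

```r
# Hypothetical word list; a real run would need a much larger dictionary.
dictionary <- c("the", "victory", "was", "great", "my", "name", "is")

# Split `s` into dictionary words using dynamic programming.
# Returns a character vector of words, or NULL if no full split exists.
segment <- function(s, dict) {
  chars <- tolower(s)
  n <- nchar(chars)
  if (n == 0) return(character(0))
  # best[[i + 1]] holds one segmentation of the first i characters, or NULL
  best <- vector("list", n + 1)
  best[[1]] <- character(0)
  for (end in seq_len(n)) {
    for (start in seq_len(end)) {
      prefix <- best[[start]]
      piece <- substr(chars, start, end)
      if (!is.null(prefix) && piece %in% dict) {
        best[[end + 1]] <- c(prefix, piece)
        break
      }
    }
  }
  best[[n + 1]]
}

# Leave known words alone; split the rest if a full split exists.
fix_words <- function(words, dict) {
  unlist(lapply(words, function(w) {
    if (tolower(w) %in% dict) return(w)   # already a known word
    parts <- segment(w, dict)
    if (is.null(parts)) w else parts      # keep unsplittable strings as-is
  }))
}

fix_words(c("The", "victory", "wasgreat"), dictionary)
```

Strings that cannot be fully covered by dictionary words are returned unchanged, so the quality of the result depends almost entirely on the word list supplied.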

score 6 · Accepted Answer · answered Mar 20 '12 at 15:58


The most recent issue of the R Journal has an article by Hornik and Murdoch on spell-checking in R which, recursion to the rescue, they apply to the R sources themselves.


Dirk Eddelbuettel


I've spent all day trying to figure out how to get aspell to work with custom dictionaries on Windows 7 x64. I tried the **saveRDS()** function and the **aspell_write_personal_dictionary_file()** function. With the former I receive this error: **"The word "UTF-8" is invalid. The character '-' may not appear at the middle of a word."** along with a warning. With the latter, **aspell** can't find my custom dictionary. Any idea how to tackle this? – Diego May 24 '14 at 23:37
