I'm not sure whether R is the right tool for this, but here's my situation. I have a character vector of strings:

id    Words
 1    'The'
 2    'victory'
 3    'wasgreat'
...   ...

The original data had some encoding problems, and some of the strings are concatenations of several words (e.g. 'My name is' -> 'Mynameis').

I need to leave the correct words alone and split the concatenated strings into their component words.

I'm curious whether there's anything in R that handles this type of problem. I suspect there are several Python programs that would handle it much better, but my Python skills are substantially weaker (bordering on non-existent). Still, I'd be willing to consider Python as an alternative.

Any suggestions?

screechOwl

1 Answer

The most recent issue of the R Journal has an article by Hornik and Murdoch on R for spell-checking which, recursion to the rescue, they apply to the R sources themselves.
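
A spell checker only tells you which strings are not words; splitting them is a separate step. Since the question mentions Python as an option, here is a minimal sketch of dictionary-based segmentation. It is not taken from the article: `WORDS` is a tiny hypothetical word list standing in for a real dictionary, and `segment` and `fix` are illustrative names. It keeps known words as they are and, for anything else, looks for the split into dictionary words that uses the fewest pieces.

```python
from functools import lru_cache

# Hypothetical toy dictionary (an assumption); in practice load a real word list.
WORDS = {"the", "victory", "was", "great", "my", "name", "is"}


def segment(text, words=WORDS):
    """Split text into dictionary words; return None if no complete split exists."""
    lower = text.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Fewest-word split of lower[start:], as a tuple of (start, end) index pairs.
        if start == len(lower):
            return ()
        result = None
        for end in range(start + 1, len(lower) + 1):
            if lower[start:end] in words:
                rest = best(end)
                if rest is not None:
                    candidate = ((start, end),) + rest
                    if result is None or len(candidate) < len(result):
                        result = candidate
        return result

    spans = best(0)
    if spans is None:
        return None
    # Slice the original string so the input's capitalisation is preserved.
    return [text[a:b] for a, b in spans]


def fix(token, words=WORDS):
    """Leave known words alone; otherwise try to split, falling back to the input."""
    if token.lower() in words:
        return [token]
    parts = segment(token, words)
    return parts if parts else [token]


print(fix("The"))       # ['The']
print(fix("wasgreat"))  # ['was', 'great']
print(fix("Mynameis"))  # ['My', 'name', 'is']
```

Preferring the fewest pieces is a crude tie-breaker. With a real dictionary, many strings have several valid splits, so weighting candidates by word frequency would probably work better.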

Dirk Eddelbuettel
  • I've spent all day trying to get aspell to work with custom dictionaries on Windows 7 x64. I tried the **saveRDS()** function and the **aspell_write_personal_dictionary_file()** function. With the former I get the error **"The word "UTF-8" is invalid. The character '-' may not appear at the middle of a word."** and a warning. With the latter, **aspell** can't find my custom dictionary. Any idea how to approach this? – Diego May 24 '14 at 23:37