Is there a pre-existing function to sanitize a data.frame's character columns for Mechanical Turk? Here's an example of a line that it's getting hung up on:
x <- "Duke\U3e32393cs B or C, no concomittant malignancy, ulcerative colitis, Crohn\U3e32393cs disease, renal, heart or liver failure"
I assume those are Unicode characters, but MT will not let me proceed with them in there. I could regex these particular sequences out easily enough, but I use MT fairly often and was hoping for a more generic solution that removes all non-ASCII characters.
Edit
I can substitute the offending bytes as follows:
> iconv(x,from="UTF-8",to="latin1",sub=".")
[1] "Duke......s B or C, no concomittant malignancy, ulcerative colitis, Crohn......s disease, renal, heart or liver failure"
But that still leaves me without a generic solution for vectors in which only some elements contain non-ASCII characters, so the elements do not all share one declared encoding:
> dput(vec)
c("Colorectal cancer patients Duke\U3e32393cs B or C, no concomittant malignancy, ulcerative colitis, Crohn\U3e32393cs disease, renal, heart or liver failure",
"Patients with Parkinson\U3e32393cs Disease not already on levodopa",
"hi")
Note that plain ASCII text is marked with encoding "unknown", which has no conversion to "latin1", so simple solutions that use iconv fail. I have one attempt at a more nuanced solution below, but I'm not very happy with it.
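To make the goal concrete, here is a rough sketch of the shape of helper I'm after. It assumes every element is either plain ASCII or can be normalised to UTF-8 with enc2utf8(); the names strip_non_ascii and sanitize_df are hypothetical, not existing functions, and the sample strings use \u2019 (a right single quote) as a stand-in for the bytes in my real data.

```r
# Hypothetical helper: drop (or substitute) every byte that cannot be
# represented in ASCII. Assumes enc2utf8() can normalise the input first.
strip_non_ascii <- function(x, sub = "") {
  iconv(enc2utf8(x), from = "UTF-8", to = "ASCII", sub = sub)
}

# Hypothetical wrapper: apply the helper to every character column of a
# data.frame and leave all other columns untouched.
sanitize_df <- function(df, sub = "") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], strip_non_ascii, sub = sub)
  df
}

# Stand-in data: one element with a non-ASCII character, one pure ASCII.
df <- data.frame(
  id  = 1:2,
  txt = c("Patients with Parkinson\u2019s Disease", "hi"),
  stringsAsFactors = FALSE
)
clean <- sanitize_df(df)
print(clean$txt)
```

Passing sub = "." instead of the default would mark each removed byte rather than silently dropping it, which may be easier to audit before uploading to MT.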