I need help cleaning excel files in R.
These are excel files made by different people, they are supposed to contain the same identical text. My task is to compare the snippets of text and make sure they match (sometimes people type, sometimes people copy paste, it is a mess).
My particular issue is that there is no standard formatting, and some of them have been extracted from a PDF.
To give you an idea, the text might look like this:
File A:
"The monkey must eat:
·Bananas, or
·6 bananas."
File B:
"The monkey must eat:
Bananas, or
5 bananas."
File C:
"The monkey must eat:
· Bananas, or
· 6 bananas."
So far, I have used a combination of the functions below, but in the end my comparison is still FALSE.
monkeyr$txtcp <- stri_enc_toascii(monkeyr$txtcp)
monkeyr$txtcp <- removeNumbers(monkeyr$txtcp)
#bad idea as I want to compare the number of bananas
monkeyr$txtcp <- tolower(monkeyr$txtcp)
monkeyr$txtcp <- stripWhitespace(monkeyr$txtcp)
monkeyr$txtcp <- removePunctuation(monkeyr$txtcp)
monkeyr$txtcp <- trimws(monkeyr$txtcp)
monkeyr$txtcp <- stri_replace_all_charclass(monkeyr$txtcp,
"\t", " ", merge = T)
#This above was specially because I wanted to remove the "tab" in File 3.
#Does not work. This is some sort of "invisible" tab that gets turned into
#a series of ->->-> when saved as csv.
Additional info:
This is how FILE C, looks like after being stripped and opened in excel:
Screenshot of "invisible" tabs, which turn into arrows
Any suggestions to normalize the text somehow?
Caveats: No packages that use Java.
Thanks in advance
EDIT
Example input:
monkeyr <- structure(list(id = c("MON1", "MON2", "MON3"), txtcp = c("The monkey must be fed a combination of:\r\n<U+F0B7> Bananas, or\r\n<U+F0B7> 6 Bananas.",
"The monkey must:\r\n· Be active\r\n· Be petted\r\n· Be inactive.",
"The monkey must:\r\nbe tame\r\njump")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L), .Names = c("id",
"txtcp"))
Expected output:
cleanmonkey <- structure(list(id = c("MON1", "MON2", "MON3"), txtcp = c("the monkey must be fed a combination of bananas or 6 bananas",
"the monkey must be active be petted be inactive", "the monkey must be tame jump"
)), .Names = c("id", "txtcp"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L))