0

I need help cleaning excel files in R.

These are excel files made by different people, they are supposed to contain the same identical text. My task is to compare the snippets of text and make sure they match (sometimes people type, sometimes people copy paste, it is a mess).

My particular issue is that there is no standard formatting, and some of them have been extracted from a PDF.

To give you an idea, the text might look like this:

File A:

"The monkey must eat:

·Bananas, or

·6 bananas."

File B:

"The monkey must eat:

Bananas, or

5 bananas."

File C:

"The monkey must eat:

·       Bananas, or

·       6 bananas."

So far, I have used a combination of the functions below, but in the end my comparison is still FALSE.

monkeyr$txtcp <- stri_enc_toascii(monkeyr$txtcp)

monkeyr$txtcp <- removeNumbers(monkeyr$txtcp) 
#bad idea as I want to compare the number of bananas

monkeyr$txtcp <- tolower(monkeyr$txtcp)

monkeyr$txtcp <- stripWhitespace(monkeyr$txtcp)

monkeyr$txtcp <- removePunctuation(monkeyr$txtcp)

monkeyr$txtcp <- trimws(monkeyr$txtcp)

monkeyr$txtcp <- stri_replace_all_charclass(monkeyr$txtcp,
 "\t", " ", merge = T)

#This above was specially because I wanted to remove the "tab" in File 3.
#Does not work. This is some sort of "invisible" tab that gets turned into
#a series of ->->-> when saved as csv.

Additional info:

This is how FILE C, looks like after being stripped and opened in excel:

Screenshot of "invisible" tabs, which turn into arrows

Any suggestions to normalize the text somehow?

Caveats: No packages that use Java.

Thanks in advance

EDIT

Example input:

monkeyr <- structure(list(id = c("MON1", "MON2", "MON3"), txtcp = c("The monkey must be fed a combination of:\r\n<U+F0B7>     Bananas, or\r\n<U+F0B7>     6 Bananas.", 
"The monkey must:\r\n·   Be active\r\n·   Be petted\r\n·   Be inactive.", 
"The monkey must:\r\nbe tame\r\njump")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -3L), .Names = c("id", 
"txtcp"))

Expected output:

cleanmonkey <- structure(list(id = c("MON1", "MON2", "MON3"), txtcp = c("the monkey must be fed a combination of bananas or 6 bananas", 
"the monkey must be active be petted be inactive", "the monkey must be tame jump"
)), .Names = c("id", "txtcp"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L))
  • Please add reproducible example and expected output – Sotos Dec 20 '16 at 09:09
  • Hi Sotos, thanks for your comment. How do I make a reproducible example without linking a copy of the excel workbook? The characters that show on excel, seem to be invisible, in R and in notepad++. Can I link a file somehow? – dra.explota Dec 20 '16 at 09:33
  • [see this Q&A](http://stackoverflow.com/questions/5963269) – Jaap Dec 20 '16 at 09:37
  • Have a look [at this link](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Sotos Dec 20 '16 at 09:37

1 Answers1

1

Nothing elegant, but how about this. The code replaces the non-ASCII elements, then the "\r's" and "\n's" and finally cleans up extra spaces.

monkeyr$clean <- str_replace_all(string = monkeyr$txtcp, pattern = "<U.*>", replacement = "")
monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\\r", replacement = "")
monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\\n", replacement = "")
monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "[[:punct:]]", replacement = "")
monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\s{2}", replacement = "")

monkeyr$clean
[1] "The monkey must be fed a combination of Bananas or 6 Bananas"
[2] "The monkey must Be active Be petted Be inactive"             
[3] "The monkey mustbe tamejump" 

The combination of "mustbe" and "tamejump" should be noted.

lawyeR
  • 7,488
  • 5
  • 33
  • 63