Task

I am looking for a better approach (a loop or vectorized code) to split a larger list into 26 (maybe 27) smaller lists based on the first letter of each entry (i.e. the first list contains all entries of the larger list that start with the letter A, the second list those that start with the letter B, and so on; the possible 27th list contains all remaining entries that start with numbers or other characters).

I am then attempting to ID which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled. e.g. Companyy A needs to be corrected to Company A).

Code thus far

#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)

#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)

write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)

Explanation

I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each pair of entries (returned as a matrix).

Ask One

I can copy and paste this code 26 times and work my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!

Ask Two

To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.

Any help on functions / methods to further automate this / better strategies using R would be great.

Ed Morton
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos Aug 29 '18 at 14:46
  • It would really help to have a reproducible example, but something like this may help with your first question: `"^[%s%s].*" %>% sprintf(letters, LETTERS) %>% map(~grep(.x, uniqueID, value=TRUE)) %>% map(~adist(.x, ignore.case=TRUE))`. The code requires `dplyr` and `purrr`. – Vlad C. Aug 29 '18 at 14:52

1 Answer

It would help to have an example of your data, but this might work.

EDIT: I am assuming your data is in a data.frame named df

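for (i in 1:26) {
  # keep the rows whose uniqueID starts with the i-th letter (either case)
  stake <- subset(df, grepl(paste0('^[', letters[i], LETTERS[i], '].*'), uniqueID))$uniqueID
  # pairwise edit distances between those names
  stakeDist <- adist(stake, ignore.case = TRUE)
  write.table(stakeDist, paste0("stake_", LETTERS[i], ".csv"), quote = TRUE, sep = ',')
}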

Using a combination of paste0 and the built-in letters and LETTERS vectors, this creates your grep expression.

Using subset, the matching IDs are extracted.

paste0 also creates a unique filename for write.table().

And it is all tied together using a for() loop.
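If you would rather avoid the loop index, here is a minimal sketch of the same idea using split() and lapply(), which also gives you the "27th" list for names that do not start with a letter. It assumes uniqueID is a plain character vector. The sample names and the identifiers stakeList, distList, closePairs and maxDist are made up for illustration, and the threshold of 2 is only a guess that you would need to tune. The last part is one possible way to replace the Excel highlighting step: it lists the pairs whose distance falls in a small range so you only review those.

```r
# Hypothetical sample data; replace with your own vector of names.
uniqueID <- c("Company A", "Companyy A", "apple inc", "Beta LLC",
              "Beta LCC", "3M", "Bank One")

# First character of each name, upper-cased; anything outside A-Z goes to "other".
first <- toupper(substr(uniqueID, 1, 1))
bucket <- ifelse(first %in% LETTERS, first, "other")

# One character vector per bucket (only buckets that actually occur are returned).
stakeList <- split(uniqueID, bucket)

# One distance matrix per bucket, labelled with the names it compares.
distList <- lapply(stakeList, function(x) {
  d <- adist(x, ignore.case = TRUE)
  dimnames(d) <- list(x, x)
  d
})

# List each pair once (upper triangle) whose distance is between 1 and maxDist.
# maxDist is a hypothetical threshold, not something adist() prescribes.
closePairs <- function(d, maxDist = 2) {
  idx <- which(d >= 1 & d <= maxDist & upper.tri(d), arr.ind = TRUE)
  data.frame(name1 = rownames(d)[idx[, "row"]],
             name2 = colnames(d)[idx[, "col"]],
             distance = d[idx],
             stringsAsFactors = FALSE)
}

# Candidate misspellings across all buckets, ready for manual review.
candidates <- do.call(rbind, lapply(distList, closePairs))
print(candidates)
```

If you still want one CSV per letter, something like write.csv(distList[["A"]], "stake_A.csv") should keep the header aligned with the row names, because write.csv leaves a blank header cell above the row-name column.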

P1storius
  • Thank you for the reply. I will work with the information you gave me to better understand it and implement a solution. – Carson Sherwood Aug 29 '18 at 15:12
  • FYI I made a small edit in the `subset(grep( ... ))` command, where I left something I use for trying out the code myself – P1storius Aug 29 '18 at 15:16
  • Thank you MKBakker. Using the code you provided I was able to adjust my code. MrFlick edited my original post, but I am more or less a beginner programmer and R user, so thank you very much for the help. If you are keeping up with this thread, is there any chance you can point me to a resource to better understand the syntax around: `stake<-grep(paste0('^[',letters[i],LETTERS[i],'].*'),uniqueID,value=TRUE)`? I do not yet understand why the single quotation marks, the `^` and the `.*` are used. Thanks! – Carson Sherwood Aug 29 '18 at 19:29
  • Dear @CarsonSherwood, certainly. The single quotes `'` are a personal preference. In this example you can use the double quotes `"` as well. About the grep characters: `grep()` uses regular expressions to match the strings you pass to the command. Here, `^` means "the beginning of a string", and `.` means "any character". Finally, `*` means "zero or more repetitions of the preceding element". In other words: "find strings that begin with `letters[i]` or `LETTERS[i]`, followed by any number of any characters". (http://www.robelle.com/smugbook/regexpr.html) – P1storius Aug 30 '18 at 07:27
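A quick way to see what each piece of the pattern does is to try it on a few strings. This is only a sketch: the vector x and its contents are made up, and the variable names are just for illustration.

```r
# Hypothetical names to test the patterns against.
x <- c("Acme", "apple", "Beta", "3M", "banana")

# ^ anchors the match at the start of the string;
# [aA] matches exactly one character, either "a" or "A".
startsWithA <- grepl("^[aA]", x)

# Without ^ the character class may match anywhere in the string,
# so names that merely contain an "a" match as well.
containsA <- grepl("[aA]", x)

# .* matches zero or more of any character, so for grep()/grepl()
# adding it after the class does not change which strings match.
sameResult <- identical(grepl("^[aA].*", x), startsWithA)

# Building the pattern for the i-th letter, as in the loop in the answer.
i <- 2
pattern <- paste0("^[", letters[i], LETTERS[i], "]")
startsWithB <- grep(pattern, x, value = TRUE)

print(startsWithA)
print(containsA)
print(sameResult)
print(startsWithB)
```

Since the trailing `.*` does not affect which strings match here, the shorter pattern `^[aA]` can be used in its place.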