2

I have multiple large data tables in R. Some column names appear twice having a nearly duplicate name: they are the same except for the last character.

For example:

[1] "Genre_Romance" (correct) 
[2] "Genre_Sciencefiction" (correct) 
[3] "Genre_Sciencefictio" (wrong)
[4] "Genre_Fables" (correct)
[5] "Genre_Fable" (wrong) 

Genre_Romance <- c(1, 0, 1, 0, 1) 
Genre_Sciencefiction <- c(0, 1, 0, 0, 0)
Genre_Sciencefictio <- c(1, 0, 1, 1, 0)
Genre_Fables <- c(0, 0, 1, 0, 0)
Genre_Fable <- c(0, 0, 0, 0, 1)
dt <- data.table(Genre_Romance, Genre_Sciencefiction, Genre_Sciencefictio,   Genre_Fables, Genre_Fable) 

Now I want to add the column values with nearly the same column name. I want to save this sum under the correct column name while removing the incorrect column. The solution here would be:

dt[,"Genre_Sciencefiction"] <- dt[,2] + dt[, 3]
dt[,"Genre_Fables"] <- dt[,4] + dt[, 5]
dt[,"Genre_Sciencefictio"] <- NULL
dt[,"Genre_Fable"] <- NULL
dt

Genre_Romance    Genre_Sciencefiction   Genre_Fables
    1                   1                   0       
    0                   1                   0       
    1                   1                   1       
    0                   1                   0       
    1                   0                   1   

As you can see, not every column name has a nearly duplicate one (such as "Genre_Romance"). So we just keep the first column like that.

I tried to solve this problem with a for loop to compare column names one by one and use substr() function to compare the longest column name with the shorter column name and take sum if they are the same. But it does not work correctly and is not very R-friendly.

The post below also helped me a bit further, but I cannot use 'duplicated' since the column names are not exactly the same. how do I search for columns with same name, add the column values and replace these columns with same name by their sum? Using R

Thanks in advance.

Community
  • 1
  • 1
equuz
  • 21
  • 2
  • It might help if you gave a more concrete example, with code to reproduce it and (even though it's fairly obvious) the corresponding desired output. See http://stackoverflow.com/a/28481250 for guidance. – Frank Mar 14 '17 at 15:45
  • Thanks for your comment @Frank. I edited my post. – equuz Mar 14 '17 at 16:05
  • Would taking the max of the two columns always give the same result as taking the sum? That is, might it ever have 1 in both Sciencefictio and Sciencefiction? It doesn't seem like it would be meaningful to end up with a 2 for that (apparently 1/0 column). – Frank Mar 14 '17 at 16:25
  • stringdist::stringdistmatrix ? – Henk Mar 14 '17 at 16:35
  • 1
    Yes, taking the max of the two columns should always give the same result. The range of final columns should also be [0,1]. – equuz Mar 14 '17 at 16:40
  • Do you know an alternative solution? @Frank – equuz Mar 16 '17 at 16:22
  • @equuz If you mean to take max instead of sum: something like replacing `Reduce("+", x)` with `do.call(pmax, x)` should do that. Re working with string comparisons, I'm no expert. – Frank Mar 16 '17 at 16:25

1 Answers1

2

Here is a more-or-less base R solution that relies on agrep to find similar names. agrep allows for close string matches, based on the "generalized Levenshtein edit distance."

# find groups of similar names
groups <- unique(lapply(names(dt), function(i) agrep(i, names(dt), fixed=TRUE, value=TRUE)))
# choose the final names as those that are longest
finalNames <- sapply(groups, function(i) i[which.max(nchar(i))])

I chose to keep the longest variable names in each groups that matched the example, you could easily switch to the shortest with which.min or you could maybe do some hard-coding depending on what you want.

Next, Reduce is given "+" operator and is fed matching groups with lapply. To calculate the maximum instead, use max in place of "+". The variables are selected using .SDcols from data.table with a data.frame, you could directly feed it the group vectors.

# produce a new data frame
setNames(data.frame(lapply(groups, function(x) Reduce("+", dt[, .SD, .SDcols=x]))),
         finalNames)

@Frank's comment notes that this can be simplified in newer (1.10+, I believe) versions of data.table to avoid .SD, .SDcols with

# produce a new data frame
setNames(data.frame(lapply(groups, function(x) Reduce("+", dt[, ..x]))), finalNames)

To make this a data.table, just replace data.frame with as.data.table or wrap the output in setDT.


To turn the final line into a data.table solution, you could use

dtFinal <- setnames(dt[, lapply(groups, function(x) Reduce("+", dt[, .SD, .SDcols=x]))],
                    finalNames)

or, following @Frank's comment

dtFinal <- setnames(dt[, lapply(groups, function(x) Reduce("+", dt[, ..x]))], finalNames)

which both return

dtFinal
   Genre_Romance Genre_Sciencefiction Genre_Fables
1:             1                    1            0
2:             0                    1            0
3:             1                    1            1
4:             0                    1            0
5:             1                    0            1
lmo
  • 37,904
  • 9
  • 56
  • 69
  • 1
    Thanks for the tip, @Frank. I saw that syntax in the update notices, but thought it had to do with direct reference to objects outside of the data.table. I'll have to go back and re-read. – lmo Mar 14 '17 at 17:09
  • Thank you for your response @lmo! It does more or less what I was looking for, however there are still some problems. For example, when I try it on the complete data set “Genre_Sciencefiction” is left out, but the column “Genre_Sciencefictionfilms” is in the new data frame twice. A lot of genres are also missing (probably because a lot of genres contain similar strings such as 'Genre_Detective' and 'Genre_Detectivefilms'). How could we solve this? – equuz Mar 14 '17 at 17:36
  • A simple fix that may work is to tweek the `max.distance` argument for the `agrep` function. It is currently set at 0.1, which means that less than 1 tenth of the letters need to be transformed. You could increase this value. Probably a better solution would be to remove common lengthy differences such as "films". Something like `setnames(dt, names(dt), gsub("films?$", "", names(dt)))` to removes "film" and "films" from the end of the variable name. There are many SO posts that expand this example. – lmo Mar 14 '17 at 18:10
  • Unfortunately, tweaking the `max.distance` does not eliminate the duplicate columns appearing in the final data frame (they also have a different column sum) and does not bring back the ones that are removed (even the ones that are entirely different) @lmo. When I try to remove the "film" and "films" an error occurs while running the code because I am creating duplicate column names. I also prefer to keep the distinction :) So I am still stuck... Maybe the code should focus more on the fact that the two columns are only different for the last character (for example using `substr()`) – equuz Mar 16 '17 at 10:27
  • 1
    You should probably ask a new question that links to this question and provide a reproducible example of the problems that are not present in your current example. Otherwise, we'll just be going in circles. – lmo Mar 16 '17 at 11:46