I have multiple large data tables in R. Some column names appear twice having a nearly duplicate name: they are the same except for the last character.
For example:
[1] "Genre_Romance" (correct)
[2] "Genre_Sciencefiction" (correct)
[3] "Genre_Sciencefictio" (wrong)
[4] "Genre_Fables" (correct)
[5] "Genre_Fable" (wrong)
Genre_Romance <- c(1, 0, 1, 0, 1)
Genre_Sciencefiction <- c(0, 1, 0, 0, 0)
Genre_Sciencefictio <- c(1, 0, 1, 1, 0)
Genre_Fables <- c(0, 0, 1, 0, 0)
Genre_Fable <- c(0, 0, 0, 0, 1)
dt <- data.table(Genre_Romance, Genre_Sciencefiction, Genre_Sciencefictio, Genre_Fables, Genre_Fable)
Now I want to add the column values with nearly the same column name. I want to save this sum under the correct column name while removing the incorrect column. The solution here would be:
dt[,"Genre_Sciencefiction"] <- dt[,2] + dt[, 3]
dt[,"Genre_Fables"] <- dt[,4] + dt[, 5]
dt[,"Genre_Sciencefictio"] <- NULL
dt[,"Genre_Fable"] <- NULL
dt
Genre_Romance Genre_Sciencefiction Genre_Fables
1 1 0
0 1 0
1 1 1
0 1 0
1 0 1
As you can see, not every column name has a nearly duplicate one (such as "Genre_Romance"). So we just keep the first column like that.
I tried to solve this problem with a for loop to compare column names one by one and use substr() function to compare the longest column name with the shorter column name and take sum if they are the same. But it does not work correctly and is not very R-friendly.
The post below also helped me a bit further, but I cannot use 'duplicated' since the column names are not exactly the same. how do I search for columns with same name, add the column values and replace these columns with same name by their sum? Using R
Thanks in advance.