
I am attempting to merge all columns that have different names but identical variable labels (imported from an SPSS file). My approach is to run a few checks to make sure the labels are not NA and the column names are not identical, then paste column j onto column i and delete column j. However, this appears to change nothing whatsoever in my data frame. What am I doing wrong here?

A note: mergedSet consists of the rows of set1 and set2 bound together; set1 and set2 each contain the labels.

for(i in colnames(set1)) {
    for(j in colnames(set2)){
        if(!is.na(attributes(set1)$variable.labels[i]) && 
           !is.na(attributes(set2)$variable.labels[j])) {
                if(attributes(set1)$variable.labels[i] == 
                   attributes(set2)$variable.labels[j]) {
                     if(i != j) {
                       mergedSet <- within(mergedSet, i <- paste(i,j))
                       mergedSet <- within(mergedSet, rm(j))
                       }
                  }
            } 

         }
    }
eli-k
Sam S.
  • Have you tried merge(df1, imported_df, by = "key variable", all.x = TRUE)? – Cristóbal Alcázar Jul 06 '17 at 20:17
  • I agree with @CristóbalAlcázar on that there must be a better solution to this. Besides, why don't you use "normal" R syntax to assign the variables, i.e. `mergedSet$i <- paste(mergedSet$i, mergedSet$j)` and `mergedSet$j <- NULL` ? did you include a check that you actually "arrive" in the code block? E.g. you could add a print("here") before the two crucial lines to see that you're not kicked out by any of the checks (which are hard to check without a toy dataset btw!). – friep Jul 06 '17 at 20:49
  • I suppose your problem is that there are many columns and you can't directly say that column x_2 is equal to column y_23. Maybe, for each categorical variable (to reduce the problem first), iterate over all the categorical variables in the other data set, and if the unique labels are the same, create a pair of columns. Then normalize the names. – Cristóbal Alcázar Jul 06 '17 at 21:34
  • @CristóbalAlcázar That was a bit helpful in the initial merging, but we are specifically considering columns with non-identical names that have identical variable labels in the metadata. friep I do have code in there that verifies that I am in the block, and it indicates that the column j simply isn't removed. I need to use the bracket notation because the $ notation looks for a variable named "j", rather than the value pointed to by j. CristobalAlcazar (again) that's what I'm trying to do. Both of you, I know that there HAS to be a better solution. – Sam S. Jul 07 '17 at 13:07
  • I strongly recommend that you add a reproducible example; the answer to this [question](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) is very useful. It doesn't have to be extensive, only enough to show your problem with a toy data set. – Cristóbal Alcázar Jul 07 '17 at 13:12
  • How would you like that to be conveyed? I have a script that produces the proper schema, I can put that on Github proper or Gist, or any other service. @CristóbalAlcázar – Sam S. Jul 07 '17 at 13:30
  • I agree with both comments above. To use `merge` would be my first option. EDIT: now I saw the other comments... sorry. I'll try to think about your situation. – Luís Telles Jul 07 '17 at 13:34
  • Is it necessary to include all the columns to represent your problem? And the rows? Can you give two mini data sets with the minimum information needed to represent the problem you describe? Use examples like this [type](https://stackoverflow.com/questions/44930149/replace-a-subset-of-a-data-frame-with-dplyr-join-operations). It's easier for us to help you if you give an example we can run directly in the console, but keep it quick and easy, with only the details that make up your problem. You can use dput(data_frame) to share output that lets us rebuild your data easily. – Cristóbal Alcázar Jul 07 '17 at 13:37
  • @CristóbalAlcázar [here](https://gist.github.com/sternj/9413dc97da02f1b0f57848aa7ef15286) is a script which constructs the situation which I am describing. I have it all documented in the code. – Sam S. Jul 07 '17 at 13:44

1 Answer


If I am understanding your question correctly, this code should merge the columns whose variable.labels match but whose column names differ.

mergedSet <- data.frame(c(1,3,5),c("a","b","c"))
mergedSet <- data.frame(mergedSet,c("s","","h"))
attributes(mergedSet)$variable.labels["gas"] <- "three"
attributes(mergedSet)$variable.labels["xhs"] <- "three"
attributes(mergedSet)$variable.labels["hhh"] <- "notSame"
names(mergedSet) <- c("gas","hhh","xhs")


set1 <- data.frame(c(2),c(4))
names(set1) <- c("gas","factpr")
attributes(set1)$variable.labels["gas"] <- "three"
attributes(set1)$variable.labels["factpr"] <- "nah"


set2 <- data.frame(c("asd"),c("pqr"))
names(set2) <- c("non","hhh")
attributes(set2)$variable.labels["non"] <- "something"
attributes(set2)$variable.labels["hhh"] <- "three"


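# Compare every column of set1 with every column of set2
for(i in colnames(set1)) {
  for(j in colnames(set2)){
    # Skip pairs where either column has no variable label
    if(!is.na(attributes(set1)$variable.labels[i]) && 
       !is.na(attributes(set2)$variable.labels[j])) {
      # Same label but different column names: merge j into i
      if(attributes(set1)$variable.labels[i] == 
         attributes(set2)$variable.labels[j]) {
        if(i != j) {
          # Bracket indexing uses the values of i and j as column names
          mergedSet[, i] <- paste(mergedSet[,i], mergedSet[,j])
          mergedSet[, j] <- NULL
        }
      }
    } 
  }
}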

mergedSet
#   gas xhs
# 1 1 a   s
# 2 3 b    
# 3 5 c   h
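
As a rough illustration of why the `within()` calls in the question do not touch the intended columns, here is a minimal sketch on a made-up data frame (the name `df` and its values are assumptions for the example, not the asker's data). As far as I can tell, `within()` evaluates `i` and `j` as literal symbols rather than as the column names they hold, whereas bracket indexing uses their values:

```r
# Toy data; names and values are hypothetical, for illustration only.
df <- data.frame(gas = c(1, 3, 5), hhh = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
i <- "gas"
j <- "hhh"

# `i <- ...` inside within() assigns to a variable literally called "i",
# so a new column "i" is added and the "gas" column is left unchanged.
w <- within(df, i <- paste(i, j))

# rm(j) looks for a column literally called "j"; there is none, so it only
# warns and the "hhh" column stays in place.
w2 <- suppressWarnings(within(df, rm(j)))

# Indexing with [[ ]] uses the values stored in i and j ("gas" and "hhh").
fixed <- df
fixed[[i]] <- paste(fixed[[i]], fixed[[j]])
fixed[[j]] <- NULL
```

Here `names(w)` should include an extra column `i`, `w2` should still contain `hhh`, and `fixed` should be left with the single merged column `gas`.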
Matt Jewett