0

Basically I have 2 tables with the same column names and want to do calculations across tables. Ideally, I would have taken data from the two tables and created a third, but I could only find a way to do that if the data tables are the same dimensions because it would be by cell position. Instead, I'd like to do it by column name after having done a join so that I know that the calculations are taking from the correct values.

I am trying to loop through column names to do calculations between various associated columns in the same data table (I have 2 lists of column names, that I am using to call columns from a table where I've joined the two tables. I've adjusted the column name list to add the "_A" and "_B" which were added during the join as the columns had the same names). I'm trying to call the column names using [[i]] (in this case I am using [[1]] to test it).

Does anyone know why I can't call the column name in the name$colname format? If I replace the variable with the name, it works, and if I take just the variable (colnameslistInf[[1]]) it shows the right column name, but once I put it together it says "Unknown or uninitialised column".

> joininfsup$colnameslistInf[[1]]
NULL
Warning message:
Unknown or uninitialised column: `colnameslistInf`.

> colnameslistInf[[1]]
[1] "newName.x"

> joininfsup$newName.x
  [1] 5 5 5 5 5 5 5 5 5 5 5
 [12] 5 5 5 5 5 5 5 5 5 5 5
 [23] 5 5 5 5 5 5 5 5 5 5 5
 [34] 5 5 5 5 5 5 5 5 5 5 5
 [45] 5 5 5 5 5 5 5 5 5 5 5

I am also getting this error:

Error in `[[<-.data.frame`(`*tmp*`, col, value = integer(0)) : 
  replacement has 0 rows, data has 264

The code I am trying to run is here. joininfsup is the joined table, and I use mutate to create new columns with the calculations across each of the 200+ columns and its associated column.

joined_day_inf_numeric <-select_if(joined_day_inf, is.numeric) joined_day_sup_numeric<-select_if(joined_day_sup, is.numeric) joininfsup<- left_join(joined_day_inf_numeric, joined_day_sup_numeric, "JOININF", suffix = c("_A", "_B"))

#take colnames from original tables and add _A and _B as those are added during the join

colnameslistInf <- paste0(colnames(joined_day_inf_numeric), "_A") 
colnameslistSup <- paste0(colnames(joined_day_sup_numeric), "_B")

for (i in 1:length(colnameslistInf)) { #245 cols, for example
  name <- paste0(colnames(joined_day_inf_numeric)[[i]]) #names of new columns as loops through
  
  joininfsup2 <-joininfsup %>% 
    mutate(!!name := ((joininfsup[[ colnameslistInf[[i]] ]])-joininfsup[[colnameslistSup[[i]] ]]))*joininfsup$proportion_A+joininfsup[[ colnameslistInf[[i]] ]]
  
  write_csv(joininfsup2, paste0("test/finalcalc.csv"))
  
}

I think this might be the key but am having trouble applying it: Use dynamic name for new column/variable in `dplyr`

UPDATE: I replaced name in the mutate function with !!name := and the code ran! But gave me the same output as the original joined table because I'm still getting the "Unknown or uninitialised column: colnameslistInf." warning.

UPDATE2: added missing join code, needed to save variable in for loop, added [[]] acording to @Parfait 's suggestion-- but the code still does not work (does not add any new columns).

UPDATE3: I tried @Parfait's common_columns method but got an error: Error: Can't subset columns that don't exist. x Columns 8_, 50_, 51_, 55_, 78_, etc. don't exist. These columns were removed at the is.numeric step so not sure why it is pulling from the original dataset. Also, using match deletes a bunch of other columns that have characters as names

mtango
  • 3
  • 2
  • I see your formula uses the `proportion` column from first data frame but this column is also included in both colnames. – Parfait May 11 '22 at 13:40
  • True, I will change to joininfsup$proportion since the numbers should be the same. Thanks. – mtango May 11 '22 at 13:46
  • Actually, *one* new column should have been added -the *very* last since you re-assign data frame with each iteration. Hence, avoid looping. O/W initialize new data frame before loop and assign to itself with `mutate` inside loop. – Parfait May 11 '22 at 21:14
  • No, weirdly there is no new column at all – mtango May 12 '22 at 13:38

1 Answers1

0

In R, when referencing names with the $ operator, identifiers are interpreted literally requiring a column named "colnameslistInf[[1]]" (but even this will fail without backticks). However, the extract operator, [[, can interpret dynamic variables:

joininfsup[[ colnameslistInf[[1]] ]]

Additionally, mutate also takes identifiers literally. Hence, in each iteration of loop, you are assigning and re-assigning to a variable named, name. But you resolved it with the double bang operator, !!.


However, consider avoiding the loop by columns and calculate your formula on block of columns in matrix-style arithmetic. Specifically, adjust the default suffix in dplyr::inner_join (or suffixes argument in base::merge) and then reassign non-underscored columns, finally remove underscored columns. Below assumes your join operation. Adjust type of join and by arguments as needed.

joined_day_inf_numeric <- select_if(joined_day_inf, is.numeric) 
joined_day_sup_numeric <- select_if(joined_day_sup, is.numeric) 

common_columns <- intersect(
    colnames(joined_day_inf_numeric), colnames(joined_day_sup_numeric)
)
common_columns <- common_columns[common_columns != "JOININF"]

joininfsup <- left_join(
    joined_day_inf_numeric, joined_day_sup_numeric, by = "JOININF", suffix = c("", "_")
)

# ASSIGN NON-UNDERSCORED COLUMNS
joininfsup[common_columns] <- (
    (
     joininfsup[common_columns] - joininfsup[paste0(common_columns, "_")]
    ) * 
    joininfsup$proportion + joininfsup[common_columns]
)

# REMOVE UNDERSCORED COLUMNS
joininfsup[paste0(common_columns, "_")] <- NULL

write_csv(joininfsup, paste0("test/finalcalc.csv"))
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Run `match` *after* your `select_if` but *before* `left_join`. This solution does not know you ran `select_if`. Also, remove the *by* column, `JOINIF`, from common columns. See edit. – Parfait May 11 '22 at 21:10
  • It looks like match is just listing which columns match without maintaining the colnames (just finds by position). It is listing 1:245. Therefore I'm still getting the "Can't subset columns that don't exist" error. But all of the columns should match anyway so I can probably just take the column names of one joined_day_inf_numeric and carry out the same process – mtango May 12 '22 at 14:00
  • Yup, it worked. Just substituted the match part with `common_columns <- colnames(joined_day_inf_numeric)`. Thanks a ton!! – mtango May 12 '22 at 14:03
  • My apologies! Don't use `match` which returns positions. Instead use `intersect` which returns actual values. – Parfait May 12 '22 at 14:45
  • Ah, great, I had looked for an equivalent but wasn't able to find it. Thank you so much! – mtango May 12 '22 at 16:01