I want the elements referenced in my data frame to be replaced with the name of the argument I pass to the function. However, at the moment they are replaced with the parameter name I used when defining the function (I'm finding it hard to explain - hopefully my code and pictures will clarify this a bit!).

Project_assign <- function(prjct) {
  Truth_vector <- is.element((giraffe[,1]),(prjct[,1]))
  giraffe[which(Truth_vector),5] <- 'prjct'
  assign('giraffe' , giraffe , envir= .GlobalEnv)
}
Project_assign(spine_hlfs)

This mostly works; however, the elements get replaced with 'prjct' instead of 'spine_hlfs': https://i.stack.imgur.com/uuPnv.png

If I can get this to work as intended, I will next create a vector with all the project names and use lapply with this function, saving me a lot of manual work every few months. I am relatively new to R, so any explanations are much appreciated.

Maharero
  • Can you please include some sample data to make your code reproducible, and also provide your expected output. We don't know anything about `giraffe`, `prjct`, `spine_hlfs` etc. Posting images/screenshots doesn't usually help much. Lastly, using `assign` is generally not a good idea (see e.g. [here](https://stackoverflow.com/questions/17559390/why-is-using-assign-bad) for more details). – Maurits Evers Dec 06 '17 at 03:17
  • I will throw together an example (I can't use this one as the data is being extracted from a SQL database). However, giraffe is a dataframe with runkeys ranging from 1 to 500; spine_hlfs is also a dataframe but only contains certain runkeys that are a subset of giraffe's (e.g. 44, 260, 478). prjct is the function's argument. I want the 'project' column in the giraffe dataframe to be updated with the specific name (in this case spine_hlfs) only where the runkeys match. – Maharero Dec 06 '17 at 03:25

2 Answers

Sounds like a simple replace based on matching entries between a (list of) query dataframes and a subject dataframe.

Here is an example based on some simulated data.

I first simulate data for the subject dataframe:

# Sample data
giraffe <- data.frame(
    runkeys = 1:500,
    col1 = runif(500),
    col2 = runif(500),
    col3 = runif(500),
    col4 = runif(500))  # column 5; matched rows are overwritten with names below

I then simulate runkeys data for 2 query dataframes:

spine_hlfs <- data.frame(
    runkeys = c(44, 260, 478));
ir_dia <- data.frame(
    runkeys = c(10, 20, 30))

The query dataframes are stored in a list:

lst.runkeys <- list(
    spine_hlfs = spine_hlfs,
    ir_dia = ir_dia);

To flag runkeys entries present in any of the query dataframes, we can use a for loop to match runkeys entries from every query dataframe:

# This is the critical loop: it iterates over the list of query dataframes
# and flags matching runkeys in giraffe with the name of the query dataframe
for (i in seq_along(lst.runkeys)) {
    giraffe[match(lst.runkeys[[i]]$runkeys, giraffe$runkeys), 5] <- names(lst.runkeys)[i]
}
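
One caveat with the loop above: `match()` returns `NA` for any query runkey that is missing from `giraffe`, and R rejects `NA` row indices in a data frame assignment. A minimal sketch of a more defensive variant follows. It assumes a dedicated character column named `project` (not part of the sample data above), and the small data frames are made up for illustration, with `999` as a deliberately unmatched key:

```r
# Toy data, invented for illustration only
giraffe <- data.frame(
    runkeys = 1:10,
    col1 = runif(10),
    project = NA_character_,   # assumed dedicated label column
    stringsAsFactors = FALSE)

lst.runkeys <- list(
    spine_hlfs = data.frame(runkeys = c(2, 5, 999)),  # 999 is not in giraffe
    ir_dia = data.frame(runkeys = c(3, 7)))

# %in% yields a logical vector without NAs, so unmatched keys are skipped
for (nm in names(lst.runkeys)) {
    hit <- giraffe$runkeys %in% lst.runkeys[[nm]]$runkeys
    giraffe$project[hit] <- nm
}

# Show only the rows that received a label
giraffe[!is.na(giraffe$project), c("runkeys", "project")]
```

Looping over `names(lst.runkeys)` rather than over an index keeps the label and the data frame lookup tied to the same name.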

This is the output of the subject dataframe after matching runkeys entries. I'm only showing rows where entries in column 5 were replaced.

giraffe[grep("(spine_hlfs|ir_dia)", giraffe[, 5]), ];
    runkeys      col1        col2      col3       col4
10       10 0.7401977 0.005703928 0.6778921     ir_dia
20       20 0.7954076 0.331462567 0.7637870     ir_dia
30       30 0.5772808 0.183716142 0.6984193     ir_dia
44       44 0.9701355 0.655736489 0.4917452 spine_hlfs
260     260 0.1893012 0.600140166 0.0390346 spine_hlfs
478     478 0.7655976 0.910946623 0.9779205 spine_hlfs
Maurits Evers
  • Sorry for the late reply. Your code does what I need (except that at the #replace line I would assign "spine_hlfs" instead of "prjct"). However, in my situation 'spine_hlfs' is just one of many data frames I need to match with 'giraffe', meaning I would have to copy that code out many times, changing only the dataframe name each time (i.e. not just 'spine_hlfs', but 'spine_moe', 'spine_dia', .... and so on). I'm trying to create a function that generalises this code so that it covers all dataframes at once - I'm 99% there, except that it assigns 'prjct', not 'spine_hlfs', when I use that argument – Maharero Dec 06 '17 at 06:37
  • For further clarification: I want it so that when I call 'Project_assign(spine_hlfs) ' it will update the 'project' column in giraffe with 'spine_hlfs' at every row where the Runkeys match. Then if I call 'Project_assign(ir_dia)' it will update the 'project' column in giraffe with 'ir_dia' where the Runkeys match and so on. Each subset dataframe has unique runkeys to the other subset dataframes. Once I get this function working I will try to create a list containing all of the subset dataframes and then lapply the list with this function, in theory making the code much shorter and saving time – Maharero Dec 06 '17 at 07:12
  • @Maharero I still see no need for a function. Just store all query `dataframes` in a list, and loop over them to replace entries where you have matching `runkeys`. **There is definitely no need for `attach`!** I've updated my example to show you a working minimal example with 2 query `dataframes`. – Maurits Evers Dec 06 '17 at 07:27
  • I'm reading up on attach and will try to avoid it if I can (I don't need it for a for loop like the one you're using, though I think it is necessary for a function). Your method is very nearly what I'm after - the only thing I require is that in your example, inside col 4, runkeys 10, 20, 30 should say 'ir_dia' instead of 'prjct' and runkeys 44, 260, 478 should say 'spine_hlfs' instead of 'prjct'. This is why I thought a function was necessary, as the line inside my function # giraffe[which(Truth_vector),5] <- 'prjct' # would dynamically change 'prjct' to whatever I passed as the function's argument – Maharero Dec 06 '17 at 07:44
  • @Maharero No, there is definitely no need for a function, despite your insistence. A `for` loop is the way to go here. Perhaps you've read about `for` loops and how to avoid them generally in R. That's mostly true, but not universally (the story is more complex and dates back to old S-plus). This is an excellent example of a "good use" of a `for` loop. I've updated my example to replace entries with the name of the query dataframe. This should be what you're after. `attach` is evil, and should be avoided. – Maurits Evers Dec 06 '17 at 08:50
  • @Maharero PS. There is another reason why using a function doesn't make much sense in this case: R will create another copy of the `dataframe` as soon as it is modified inside the function (that's because R uses copy-on-modify semantics rather than passing objects by reference). So you will need (at least) twice as much memory while modifying one `dataframe`. – Maurits Evers Dec 06 '17 at 09:02
  • @MauritsEvers, I think, you mean **assign**. `fortunes::fortune(236)` says: *The only people who should use the assign function are those who fully understand why you should never use the assign function. -- Greg Snow R-help (July 2009)* – Uwe Dec 06 '17 at 09:17
  • Thank you both for your methods and your patience. For loops are much more versatile than I previously thought, and now I'm aware of the data.table package / method too (is there no way I can signify that there is more than 1 accepted answer?) – Maharero Dec 06 '17 at 09:53
  • @Maharero, you can only accept one answer as most convenient to your question but you can upvote both answers (click on the up triangle above the check mark) at your discretion. – Uwe Dec 06 '17 at 10:01
  • @Uwe Yes, my mistake, I meant `assign`. Thanks for the clarification. I already posted a link to a [SO post](https://stackoverflow.com/questions/17559390/why-is-using-assign-bad) discussing the dangers of `assign` above. – Maurits Evers Dec 06 '17 at 10:02
  • @Maharero No worries, happy to help & I'm glad it worked out in the end. – Maurits Evers Dec 06 '17 at 10:03
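
Regarding the behaviour discussed in the comments above: the literal `'prjct'` appears because a quoted string is never replaced by the argument; R simply stores the text prjct. If a function is wanted regardless, one possible sketch captures the name of the supplied object with `deparse(substitute())` and returns the modified data frame instead of calling `assign`. The function name `project_assign`, its parameter `target`, and the toy data are all hypothetical:

```r
# Hypothetical helper: labels rows of `target` whose first column matches
# the first column of `prjct`, using the name of the object passed as `prjct`
project_assign <- function(target, prjct) {
    label <- deparse(substitute(prjct))   # the argument as written by the caller
    hit <- target[[1]] %in% prjct[[1]]
    target$project[hit] <- label
    target                                # return the result instead of assign()
}

# Toy data, invented for illustration only
giraffe <- data.frame(runkey = 1:6, project = NA_character_,
                      stringsAsFactors = FALSE)
spine_hlfs <- data.frame(runkey = c(2, 4))
ir_dia <- data.frame(runkey = 5)

giraffe <- project_assign(giraffe, spine_hlfs)
giraffe <- project_assign(giraffe, ir_dia)
giraffe
```

Because `substitute()` captures the expression as written at the call site, this approach is fragile when the function is called indirectly (for example through `lapply`), since the captured expression is then no longer the data frame's name. Looping over a named list, as in the answer above, avoids that problem.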

As far as I understand the OP's intentions from the many comments, they want to update the giraffe data frame with the names of many other data frames wherever runkey matches.

This can be achieved by combining the other data frames into one data.table object, treating the data frame names as data, and finally updating giraffe in a join.

Sample Data

According to the OP, giraffe consists of 500 rows and 5 columns including runkey and project. project is initialized here as a character column for the subsequent join with the data frame names.

set.seed(123L) # required for reproducible data
giraffe <- data.frame(runkey = 1:500,
                      X2 = sample.int(99L, 500L, TRUE),
                      X3 = sample.int(99L, 500L, TRUE),
                      X4 = sample.int(99L, 500L, TRUE),
                      project = "",
                      stringsAsFactors = FALSE)

Then there are a number of data frames which contain only one column, runkey. According to the OP, the runkey sets are disjoint, i.e., the combined set of all runkey values does not contain any duplicates.

spine_hlfs <- data.frame(runkey = c(1L, 498L, 5L))
ir_dia     <- data.frame(runkey = c(3L, 499L, 47L, 327L))
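
Since the update relies on the runkey sets being disjoint, it may be worth verifying that before joining. A small base R sketch, assuming the two data frames above (the variable name `all_keys` is introduced here for illustration):

```r
spine_hlfs <- data.frame(runkey = c(1L, 498L, 5L))
ir_dia     <- data.frame(runkey = c(3L, 499L, 47L, 327L))

# Collect every runkey from all query data frames into one vector
all_keys <- unlist(lapply(list(spine_hlfs, ir_dia), `[[`, "runkey"))

# anyDuplicated() returns 0 when no value occurs more than once
stopifnot(anyDuplicated(all_keys) == 0)
```

With overlapping keys, a row of giraffe could only keep one of the competing project names, so a failed check here would point to ambiguous input.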

Proposed solution

# specify names of data frames
df_names <- c("spine_hlfs", "ir_dia")
# create named list of data frames 
df_list <- mget(df_names)
# update on join 
library(data.table)
setDT(giraffe)[rbindlist(df_list, idcol = "df.name"), on = "runkey", project := df.name][]
     runkey X2 X3 X4    project
  1:      1  2 44 63 spine_hlfs
  2:      2 73 99 77           
  3:      3 43 20 18     ir_dia
  4:      4 73 12 40           
  5:      5  2 25 96 spine_hlfs
 ---                           
496:    496 75 45 84           
497:    497 24 63 43           
498:    498 33 53 81 spine_hlfs
499:    499  1 33 16     ir_dia
500:    500 99 77 41

Explanation

setDT() coerces giraffe to data.table. rbindlist(df_list, idcol = "df.name") creates a combined data.table from the list of data frames, thereby filling the df.name column with the names of the list elements:

      df.name runkey
1: spine_hlfs      1
2: spine_hlfs    498
3: spine_hlfs      5
4:     ir_dia      3
5:     ir_dia    499
6:     ir_dia     47
7:     ir_dia    327

This intermediate result is joined on runkey with giraffe. The project column is updated with the contents of df.name only for matching rows.
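
For readers less familiar with data.table, the effect of the update join can be approximated in base R with `match()`. This is only a sketch of the same idea on a reduced toy version of the data (10 rows instead of 500, invented here), not a description of what data.table does internally; `lookup`, `idx` and `matched` are names introduced for illustration:

```r
# Toy data, invented for illustration only
giraffe <- data.frame(runkey = 1:10, project = "", stringsAsFactors = FALSE)
df_list <- list(
    spine_hlfs = data.frame(runkey = c(1L, 5L)),
    ir_dia     = data.frame(runkey = c(3L, 9L)))

# Stack the query data frames and keep the list names as a column,
# similar in spirit to rbindlist(df_list, idcol = "df.name")
lookup <- data.frame(
    df.name = rep(names(df_list), vapply(df_list, nrow, integer(1))),
    runkey  = unlist(lapply(df_list, `[[`, "runkey"), use.names = FALSE),
    stringsAsFactors = FALSE)

# For each giraffe row, find its position in lookup (NA if absent)
idx <- match(giraffe$runkey, lookup$runkey)
matched <- !is.na(idx)

# Update project only for matching rows, leaving the others untouched
giraffe$project[matched] <- lookup$df.name[idx[matched]]
giraffe
```

Unlike the data.table version, this assigns into a copy of the data frame rather than updating it by reference.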

Alternative solution

This loops over df_names and performs repeated joins, which update giraffe in place:

setDT(giraffe)
for (x in df_names) giraffe[get(x), on = "runkey", project := x]
giraffe[]
Uwe