I am working in R trying to generate several distinct vectors using a for loop.

First I created a small reproducible example data frame called df.

cluster.assignment <- c("1 Unknown", "1 Unknown", "2 Neuron", "3 PBMC", "4 Basket")
Value1 <- c("a","b","c","d","e")
Value2 <- c("191","234","178","929","123")
df <- data.frame(cluster.assignment,Value1,Value2)

df

  cluster.assignment Value1 Value2
1          1 Unknown      a    191
2          1 Unknown      b    234
3           2 Neuron      c    178
4             3 PBMC      d    929
5           4 Basket      e    123

Next I create a variable named clusters that includes keys to the datasets that I am interested in.

clusters <- c("1 ","4 ")

Here is my attempt to extract rownames of the data of interest in df using a for loop.

for (COI in clusters) { 
  name2 <- c(gsub(" ","", paste("Cluster", COI, sep = "_")))
  assign(Cluster_1, name2, envir = parent.frame())
  name2 <- grep(COI, df$cluster.assignment)
}

Desired output is two vectors called Cluster_1 and Cluster_4.

Cluster_1 would contain the values 1 and 2

Cluster_4 would contain the value 5

I can't seem to figure out how to assign the name of the COI variable to be the name of the output vector.

Paul
  • `COI` takes the value of each element of `clusters`, that is, first it is `"1 "` and then it is `"4 "`. A number with a space is an exceptionally bad variable name. Is this really what you want, to assign the name of the COI variable to be the name of the output? – Gregor Thomas Sep 04 '18 at 19:00
  • In this case yes because I am mining an existing dataset generated by someone else. – Paul Sep 04 '18 at 19:03

2 Answers

I would suggest against using assign. Instead, I'll create a named list. See this answer for a long discussion of why lists are better than sequentially named variables. If, at any point, you decide you want to convert the list to objects in the global environment, you can use list2env, but doing so will probably just make more work.

## subset the data to the parts we care about, use `split` to separate it
## into a list
subdf = df[grepl(paste(clusters, collapse = "|"), df$cluster.assignment), ]
result = split(subdf, subdf$cluster.assignment, drop = TRUE)
result
# $`1 Unknown`
#   cluster.assignment Value1 Value2
# 1          1 Unknown      a    191
# 2          1 Unknown      b    234
# 
# $`4 Basket`
#   cluster.assignment Value1 Value2
# 5           4 Basket      e    123

## name the list as desired
names(result) = paste("Cluster", trimws(clusters), sep = "_")
result
# $Cluster_1
#   cluster.assignment Value1 Value2
# 1          1 Unknown      a    191
# 2          1 Unknown      b    234
# 
# $Cluster_4
#   cluster.assignment Value1 Value2
# 5           4 Basket      e    123

## if only the row names are needed, use lapply
result = lapply(result, row.names)
result
# $Cluster_1
# [1] "1" "2"
# 
# $Cluster_4
# [1] "5"
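
If separate objects really are needed, the `list2env` route mentioned above might look like the following. This is only a minimal sketch: the `result` list here is a hand-written stand-in for the one built above, and `target` is a hypothetical environment name chosen for illustration.

```r
# Stand-in for the named list of row names produced above
result <- list(Cluster_1 = c("1", "2"), Cluster_4 = "5")

# Copy each list element into its own variable in a chosen environment.
# A fresh environment is used here; passing globalenv() instead would
# create Cluster_1 and Cluster_4 as global variables.
target <- new.env()
list2env(result, envir = target)

ls(target)
# [1] "Cluster_1" "Cluster_4"
get("Cluster_1", envir = target)
# [1] "1" "2"
```

Keeping the list and indexing it as `result[["Cluster_1"]]` avoids this extra step entirely.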

A few other notes: I assume you are including the spaces in clusters to prevent, e.g., "1" from matching "12 foo". You might consider using the regex word boundary "\\b1\\b" instead, as "1 " will still match, say, "11 foo" or "21 bar". Better yet, you could use strsplit or similar to create a new column with just the numeric key you want to match.
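
To illustrate the matching problem described above, here is a small sketch. The labels "11 foo" and "21 bar" are made-up test values, and `key` is a hypothetical name for the extracted key vector.

```r
# Made-up labels, including ones that a trailing-space pattern matches by accident
cluster.assignment <- c("1 Unknown", "11 foo", "21 bar", "4 Basket")

# "1 " matches any label containing a 1 followed by a space
grepl("1 ", cluster.assignment)
# [1]  TRUE  TRUE  TRUE FALSE

# Word boundaries restrict the match to a standalone 1
grepl("\\b1\\b", cluster.assignment)
# [1]  TRUE FALSE FALSE FALSE

# Alternatively, extract the leading key once and match it exactly
key <- vapply(strsplit(cluster.assignment, " ", fixed = TRUE), `[`, character(1), 1)
key
# [1] "1"  "11" "21" "4"
which(key %in% c("1", "4"))
# [1] 1 4
```

Exact matching on an extracted key column tends to be easier to reason about than a regular expression, since no escaping or boundary handling is involved.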

Gregor Thomas
  • Oh my, I see now why the spaces are so bad. Thanks for your suggestions and very informative answer. I will give them a try! – Paul Sep 04 '18 at 19:27

I don't see the necessity to create a for loop for this unless you have your own reasons, but the following code gives you what you want:

library(data.table)  # provides %like%
# keep the rows whose cluster label contains each key
Cluster_1 <- df[df$cluster.assignment %like% "1 ", c("Value1", "Value2")]
Cluster_4 <- df[df$cluster.assignment %like% "4 ", c("Value1", "Value2")]
View(Cluster_1); View(Cluster_4)

You can remove or alter c("Value1", "Value2") to choose which columns appear in the final output.
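
If there are many keys, one way to avoid writing a line per cluster is to apply the same subsetting over the whole `clusters` vector. The following is a rough sketch in base R rather than data.table. It assumes each key sits at the start of the label, and `row_idx` is a hypothetical name for the resulting list.

```r
df <- data.frame(
  cluster.assignment = c("1 Unknown", "1 Unknown", "2 Neuron", "3 PBMC", "4 Basket"),
  Value1 = c("a", "b", "c", "d", "e"),
  Value2 = c("191", "234", "178", "929", "123")
)
clusters <- c("1 ", "4 ")

# One list element per key: indices of the rows whose label starts with that key.
# as.character() guards against the column being a factor in older R versions.
row_idx <- lapply(clusters, function(COI) {
  which(startsWith(as.character(df$cluster.assignment), COI))
})
names(row_idx) <- paste("Cluster", trimws(clusters), sep = "_")

row_idx$Cluster_1
# [1] 1 2
row_idx$Cluster_4
# [1] 5
```

Replacing `which(...)` with `df[..., c("Value1", "Value2")]` would return the matching rows instead of their indices.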

Pang
Shirin Yavari
  • I should have specified that this is a small portable example. Unfortunately in real life I need to repeat this over hundreds of different COI values. So a loop to iterate the process and make it portable across datasets is required. The heart of the question really is how do we do this in a for loop or some other high throughput way. – Paul Sep 04 '18 at 19:10