1

I need a way to filter my data based on the target_id. Because I have a set of 1600 target_id values that have no consistent name and another set that all contain the word 'comp', I thought it might be easiest to create a new column whose value is based on target_id. I have a data frame with a million rows that looks like this (I just grabbed random rows to show the gist of it):

          sample_id           target_id length eff_length est_counts      tpm
159     SRR3884838C             CR1_Mam   2204       2005          0        0
160     SRR3884838C           CYRA11_MM    617        418          0        0
161     SRR3884838C            DERV2a_I   5989       5790         19 0.734541
162     SRR3884838C          DERV2a_LTR    335        136          7  11.5213
1094236 SRR3884878C comp78901_c0_seq3_1   1115        916      113.4  32.3604
1094237 SRR3884878C comp85230_c0_seq1_1   1201       1002        514  134.088
1094238 SRR3884878C comp56944_c0_seq1_1   2484       2285       10.5  1.20115

I need to create a new column ("class") that has a value of 1 for target_ids that contain 'comp' and 0 for all others. Is this possible? The data has 40 samples (SRR3884838 --> SRR3884878), and each sample has the same set of target_ids: one set of non-uniform target names and another set that all contain 'comp'. Example (with the tpm column removed for formatting reasons):

          sample_id           target_id length eff_length est_counts class
159     SRR3884838C             CR1_Mam   2204       2005          0     0
160     SRR3884838C           CYRA11_MM    617        418          0     0
161     SRR3884838C            DERV2a_I   5989       5790         19     0
162     SRR3884838C          DERV2a_LTR    335        136          7     0
1094236 SRR3884878C comp78901_c0_seq3_1   1115        916      113.4     1
1094237 SRR3884878C comp85230_c0_seq1_1   1201       1002        514     1
1094238 SRR3884878C comp56944_c0_seq1_1   2484       2285       10.5     1

I tried using the merge function. I first created a new data frame with a class column holding the correct value for one set of target_ids, with the (probably incorrect) expectation that merging would add the new column to every row where one of those target_ids is listed. Instead, it deleted the eff_length column and messed up the format of the data. All the examples I've found where users create a new column based on another column's value used numbers, and I'm not sure how to do it using the string 'comp'. Here's what I did:

total <- merge(dfA, dfB, by = "target_id")

where dfA was my original data and dfB looked like the above example with the class column.
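
For reference, `merge` can add a column this way when the second data frame is a lookup table with exactly one row per `target_id` and only the column to add. The sketch below assumes that setup and uses a few made-up rows; the names `dfA` and `lookup` are hypothetical, and this is an illustration rather than a reconstruction of the original attempt.

```r
# Toy stand-in for the original data (values are illustrative only).
dfA <- data.frame(
  sample_id = c("SRR3884838C", "SRR3884838C", "SRR3884878C", "SRR3884878C"),
  target_id = c("CR1_Mam", "comp78901_c0_seq3_1", "CR1_Mam", "comp78901_c0_seq3_1"),
  length    = c(2204, 1115, 2204, 1115),
  stringsAsFactors = FALSE
)

# Lookup table: one row per distinct target_id, plus the column to add.
ids <- unique(dfA$target_id)
lookup <- data.frame(
  target_id = ids,
  class     = as.integer(grepl("comp", ids)),
  stringsAsFactors = FALSE
)

# merge() joins on target_id; by default it also sorts the result by the key.
total <- merge(dfA, lookup, by = "target_id")
print(total)
```

If the second data frame shares other column names with the first, `merge` keeps both copies with `.x` and `.y` suffixes, and repeated keys on both sides multiply rows. Either effect may explain output that looks reshuffled.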

ZincFingers
  • `df$class <- grepl('comp', df$target_id)` will give a logical vector; wrap the `grepl` part in `as.numeric` or `as.integer` to get a vector of zeros and ones. – Jaap Jul 11 '17 at 16:34

2 Answers

1

Using:

df$class <- as.integer(grepl('comp', df$target_id))

gives:

> df
          sample_id           target_id length eff_length est_counts class
159     SRR3884838C             CR1_Mam   2204       2005        0.0     0
160     SRR3884838C           CYRA11_MM    617        418        0.0     0
161     SRR3884838C            DERV2a_I   5989       5790       19.0     0
162     SRR3884838C          DERV2a_LTR    335        136        7.0     0
1094236 SRR3884878C comp78901_c0_seq3_1   1115        916      113.4     1
1094237 SRR3884878C comp85230_c0_seq1_1   1201       1002      514.0     1
1094238 SRR3884878C comp56944_c0_seq1_1   2484       2285       10.5     1
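
Since the stated goal is filtering, here is a minimal self-contained sketch of the same `grepl` approach followed by a split on the new column. The four rows are a cut-down copy of the example data, and the names `comp_rows` and `other_rows` are hypothetical.

```r
# Small example data frame with the two columns that matter here.
df <- data.frame(
  target_id  = c("CR1_Mam", "DERV2a_I", "comp78901_c0_seq3_1", "comp85230_c0_seq1_1"),
  est_counts = c(0, 19, 113.4, 514),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE/FALSE per row; as.integer() turns that into 1/0.
df$class <- as.integer(grepl("comp", df$target_id))

# Filter on the new column.
comp_rows  <- df[df$class == 1, ]
other_rows <- df[df$class == 0, ]
print(comp_rows)
```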
Jaap
  • If I got it right, OP wants to get 1 for a sample id if there's `"comp"` in any row of that id. Not sure tho. – M-- Jul 11 '17 at 16:41
  • The output looks like what I want, but when I copied and pasted what you wrote I got the error: Error in `$<-.data.frame`(`*tmp*`, "class", value = integer(0)) : replacement has 0 rows, data has 1094241. My dataframe is called df, so I'm not sure where I'm going wrong. Sorry, I am quite new to this and also quite stupid – ZincFingers Jul 11 '17 at 16:42
  • @ZincFingers could you include some example data that reproduces the problem? – Jaap Jul 11 '17 at 16:47
  • I created my dataframe from an imported .tsv file. The data you called df in your answer is example data. I copied and pasted the table in your answer into a new .tsv file and created a testdf like so: `testdf <- read.table("filelocation/Test.tsv")` `testdf$class <- as.integer(grepl('comp', df$target_id))` – ZincFingers Jul 11 '17 at 16:55
  • @ZincFingers if you called it `testdf` you are referring to the wrong data inside `grepl` (`df$target_id` instead of `testdf$target_id`); consequently, you need to use: `testdf$class <- as.integer(grepl('comp', testdf$target_id))` – Jaap Jul 11 '17 at 16:59
  • OK, it appears that even with the error message my data now includes a column called class with the correct values in testdf, but not in my original df – ZincFingers Jul 11 '17 at 17:02
  • @ZincFingers [Here are some guidelines on how to give a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). Looking at the image, the main problem I see is how you read the data: the column names are on the first row and the columns are now named from `V1` to `V6`. So, your `target_id` column is now called `V2`. Did you use `header = TRUE` in `read.table`? `read.table` uses `header = FALSE` by default. Consequently you need: `df <- read.table("file.tsv", header = TRUE)` – Jaap Jul 11 '17 at 17:07
  • D'oh. For some reason I thought header = TRUE was the default. Your fix worked, I now have the data formatted properly. Thank you for your help. I will make sure to review the link you provided as well. – ZincFingers Jul 11 '17 at 17:21
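
The header issue discussed in these comments can be reproduced with a small sketch. It assumes a tab-separated input supplied inline through the `text` argument of `read.table`; the three-column contents are made up for illustration.

```r
# Inline TSV with a header row (illustrative values only).
tsv <- "sample_id\ttarget_id\tlength\nSRR3884838C\tCR1_Mam\t2204\nSRR3884878C\tcomp78901_c0_seq3_1\t1115\n"

# Without header = TRUE the header line is read as data
# and the columns get default names.
no_header <- read.table(text = tsv, sep = "\t", stringsAsFactors = FALSE)
print(names(no_header))             # "V1" "V2" "V3"
print(is.null(no_header$target_id)) # TRUE: no such column, so grepl() gets NULL

# With header = TRUE the first line supplies the column names.
with_header <- read.table(text = tsv, sep = "\t", header = TRUE,
                          stringsAsFactors = FALSE)
with_header$class <- as.integer(grepl("comp", with_header$target_id))
print(with_header)
```

Because `no_header$target_id` is `NULL`, `grepl` returns a zero-length vector, which is what leads to the "replacement has 0 rows" error quoted above.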
-1

How about:

sample$class <- as.numeric(grepl("^comp", sample$target_id))

John M.
  • Edited. I thought that given the interchangeability between TRUE/FALSE and 1/0, there's no real need to change to numeric. – John M. Jul 11 '17 at 16:43
  • Error in sample$target_id : object of type 'closure' is not subsettable – ZincFingers Jul 11 '17 at 16:46
  • I tried pasting your df and I don't see any issue with my code: `> test$class <- as.numeric(grepl("^comp", test$target_id))` `> test` `sample_id target_id length eff_length est_counts class` `1 SRR3884838C CR1_Mam 2204 2005 0.0 0` – John M. Jul 11 '17 at 16:56
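
Two details from this exchange can be checked in a short sketch. First, `sample` is also the name of a base R function, so if no data frame called `sample` exists, `sample$target_id` tries to subset that function, which is the likely source of the 'closure' error. Second, the pattern `"^comp"` matches only at the start of the string, while `"comp"` matches anywhere. The ID `DERV_comp_like` is a made-up value used only to show the difference.

```r
# With no data frame named `sample`, the name resolves to base::sample,
# a function, and `$` cannot subset a function.
msg <- tryCatch(sample$target_id, error = function(e) conditionMessage(e))
print(msg)

# Anchored vs. unanchored pattern on a hypothetical set of IDs.
ids <- c("CR1_Mam", "comp78901_c0_seq3_1", "DERV_comp_like")
print(grepl("comp", ids))   # FALSE  TRUE  TRUE
print(grepl("^comp", ids))  # FALSE  TRUE FALSE
```

For the IDs shown in the question both patterns give the same result; the anchored form is only stricter.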