I need a way to filter my data based on the target_id. Because I have a set of 1600 target_id values that have no consistent name and another set that all contain the word 'comp', I thought it might be easiest to create a new column with a value based on the value in target_id. I have a data frame with a million rows that looks like this (just grabbed random rows to show the gist of it):
        sample_id   target_id           length eff_length est_counts tpm
159     SRR3884838C CR1_Mam             2204   2005       0          0
160     SRR3884838C CYRA11_MM           617    418        0          0
161     SRR3884838C DERV2a_I            5989   5790       19         0.734541
162     SRR3884838C DERV2a_LTR          335    136        7          11.5213
1094236 SRR3884878C comp78901_c0_seq3_1 1115   916        113.4      32.3604
1094237 SRR3884878C comp85230_c0_seq1_1 1201   1002       514        134.088
1094238 SRR3884878C comp56944_c0_seq1_1 2484   2285       10.5       1.20115
I need to create a new column ("class") that has a value of 1 for rows whose target_id contains 'comp' and 0 for all others. Is this possible? The data has 40 samples (SRR3884838 --> SRR3884878), and each sample has the same set of target_ids: one set of non-uniform target names and another set that all contain 'comp'. Example (with the tpm column removed for formatting reasons):
        sample_id   target_id           length eff_length est_counts class
159     SRR3884838C CR1_Mam             2204   2005       0          0
160     SRR3884838C CYRA11_MM           617    418        0          0
161     SRR3884838C DERV2a_I            5989   5790       19         0
162     SRR3884838C DERV2a_LTR          335    136        7          0
1094236 SRR3884878C comp78901_c0_seq3_1 1115   916        113.4      1
1094237 SRR3884878C comp85230_c0_seq1_1 1201   1002       514        1
1094238 SRR3884878C comp56944_c0_seq1_1 2484   2285       10.5       1
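To make the goal concrete, here is a minimal sketch of the kind of result I'm after, assuming base R's grepl() can be used to test each target_id for the substring 'comp'. The data frame name `df` and the four toy rows are made up for illustration and stand in for the real data:

```r
# Toy data frame standing in for the real one (name and rows are illustrative only)
df <- data.frame(
  sample_id  = c("SRR3884838C", "SRR3884838C", "SRR3884878C", "SRR3884878C"),
  target_id  = c("CR1_Mam", "DERV2a_LTR", "comp78901_c0_seq3_1", "comp85230_c0_seq1_1"),
  est_counts = c(0, 7, 113.4, 514),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE where target_id contains the literal string "comp";
# as.integer() turns TRUE/FALSE into 1/0
df$class <- as.integer(grepl("comp", df$target_id, fixed = TRUE))

# Filtering is then a plain comparison on the new column
comp_rows <- df[df$class == 1, ]
```

This only checks that 'comp' appears anywhere in the name; if it should match only at the start, a pattern such as "^comp" (without fixed = TRUE) would be the variant to try.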
I tried using the merge function: I first created a new data frame that had a class column with the correct value for one set of target_ids, with the probably incorrect expectation that merging would fill in the new column for every row where one of those target_ids is listed. When I did that, though, it deleted the eff_length column and messed with the format of the data. All the examples I've found where users create a new column based on another column's value used numbers, and I'm not sure how to do it using the string 'comp'. Here's what I did:
total <- merge(dfA, dfB, by="target_id")
where dfA was my original data and dfB looked like the above example with the class column.
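For comparison, here is a sketch of a merge variant that might behave better, assuming the second data frame is reduced to a lookup table with only target_id and class and one row per target_id. The names `dfA` and `lookup` and the toy rows are hypothetical, and I have not confirmed that this explains the missing eff_length column:

```r
# Toy stand-in for the original data (illustrative rows only)
dfA <- data.frame(
  sample_id  = c("SRR3884838C", "SRR3884878C", "SRR3884838C", "SRR3884878C"),
  target_id  = c("CR1_Mam", "CR1_Mam", "comp78901_c0_seq3_1", "comp78901_c0_seq3_1"),
  eff_length = c(2005, 2005, 916, 916),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per target_id, carrying only the class
lookup <- data.frame(
  target_id = c("CR1_Mam", "comp78901_c0_seq3_1"),
  class     = c(0L, 1L),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of dfA even if a target_id is missing from lookup
total <- merge(dfA, lookup, by = "target_id", all.x = TRUE)
```

Note that merge() moves the key column to the front and, by default, sorts the result by that key, which may be part of why the output format looked different from the input.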