I have a dataset from an RNA-seq experiment with the following dimensions:
> dim(expression)
[1] 149 39879
It looks like this:
> expression[1:5, 1:5]
# A tibble: 5 × 5
  sample_id   ENSG00000004059 ENSG00000003056 ENSG00000173153 ENSG00000004478
  <chr>                 <dbl>           <dbl>           <dbl>           <dbl>
1 123 (Colon)        6.518498        7.141934        5.766983        5.471909
2 121 (Colon)        6.983914        7.078940        5.909575        5.911879
3 004 (Ileum)        6.403912        7.131915        6.191672        5.771549
4 045 (Colon)        6.890916        7.233934        6.019052        6.272799
5 010 (Ileum)        6.674921        7.645998        5.859013        5.322049
The first column is called "sample_id" and holds ids such as "123 (Colon)", "142 (Ileum)" and "123 (Ileum)", where 123 is the patient id and Colon or Ileum is the site the sample was taken from. The remaining columns are genes, and each cell holds that gene's expression value for the sample. Some patients have only one sample (colon or ileum), with the other one missing.
I want to reshape the data so that each patient has a single row instead of two, for example one row for patient 123 that combines 123 (Colon) and 123 (Ileum). Something like: "123 colon.gene1 colon.gene2...ileum.gene1 ileum.gene2..."
So far I have been able to select the rows for one patient (both samples, or just one if that is all there is) with this code:
# Select every row whose sample_id starts with the patient id 010
ptn <- '^010'
ndx <- grep(ptn, expression$sample_id)
selected_rows <- expression[ndx, ]
selected_rows
However, this only selects the rows I want, like this:
> selected_rows
# A tibble: 2 × 39,879
sample_id ENSG00000004099 ENSG00000003956 ENSG00000973153 ENSG00000004498 ENSG00000003139
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 010 (Ileum) 6.674229 7.645929 5.850019 5.322049 0.6259249
2 010 (Colon) 6.861709 6.768619 5.950409 5.752779 0.3727669
# ... with 39873 more variables: ENSG00000003509 <dbl>, ENSG00000001036 <dbl>,
But I can't figure out where to go from here. I need to concatenate the two rows while still keeping track of which expression value belongs to which organ. Thank you.
The expected result would be something in the shape of:
sample_id ENSG1-Ileum ENSG2-Ileum ENSG3-Ileum ENSG4-Ileum ENSG5-Ileum… ENSG1-Colon ENSG2-Colon ENSG3-Colon ENSG4-Colon ENSG5-Colon…
010 6.674229 7.645929 5.850019 5.322049 0.625924… 6.861709 6.768619 5.950409 5.752779 0.3727669…
To put it another way (setting the biology aside), how do I transform this:
p_id g1 g2 g3
p1_a vn vn vn
p1_b vn vn vn
p2_a vn vn vn
p2_b vn vn vn
into this:
p_id g1pa g2pa g3pa g1pb g2pb g3pb
p1 vn vn vn vn vn vn
p2 vn vn vn vn vn vn
Here vn stands for a floating-point value; the values may or may not be equal.
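For reference, here is a minimal reproducible sketch of the toy case in base R. The data frame `toy`, its values, and the extra patient `p3` (who has only an "a" sample) are made up for illustration, and the ids are assumed to split cleanly on "_"; the real sample_id column would need a different split. It uses `reshape()` with `direction = "wide"`, which is only one possible approach and has not been tried on the full 39,879-column data. `reshape()` is written for plain data frames, so a tibble would probably need `as.data.frame()` first.

```r
# Toy input: one row per sample, ids of the form "<patient>_<site>".
# All values are made up for illustration.
toy <- data.frame(
  p_id = c("p1_a", "p1_b", "p2_a", "p2_b", "p3_a"),  # p3 has no "b" sample
  g1   = c(1.1, 1.2, 2.1, 2.2, 3.1),
  g2   = c(1.3, 1.4, 2.3, 2.4, 3.3),
  g3   = c(1.5, 1.6, 2.5, 2.6, 3.5),
  stringsAsFactors = FALSE
)

# Split the id into a patient part and a site part.
parts <- strsplit(toy$p_id, "_", fixed = TRUE)
toy$patient <- vapply(parts, `[`, character(1), 1)
toy$site    <- vapply(parts, `[`, character(1), 2)
toy$p_id    <- NULL

# One row per patient; every gene column becomes <gene>.<site>.
# A patient with a missing site gets NA in that site's columns.
wide <- reshape(toy, idvar = "patient", timevar = "site",
                direction = "wide", sep = ".")
rownames(wide) <- NULL
print(wide)
```

The resulting columns are named g1.a, g2.a, g3.a, g1.b, g2.b, g3.b, so each value stays tied to the site it came from.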