0

I have a dataset that comes from an RNASeq experiment.

> dim(expression)
[1]   149 39879

like this:

> expression[1:5, 1:5]
# A tibble: 5 × 5
       sample_id ENSG00000004059 ENSG00000003056 ENSG00000173153 ENSG00000004478
           <chr>           <dbl>           <dbl>           <dbl>           <dbl>
1 123 (Colon)        6.518498        7.141934        5.766983        5.471909
2 121 (Colon)        6.983914        7.078940        5.909575        5.911879
3 004 (Ileum)        6.403912        7.131915        6.191672        5.771549
4 045 (Colon)        6.890916        7.233934        6.019052        6.272799
5 010 (Ileum)        6.674921        7.645998        5.859013        5.322049

The first column is called "sample_id" and in that column I have ids that look like this: "123(colon)", "142(ileum)", "123(ileum) etc. where 123 is the id of the patient and colon and ileum is where the samples were taken from. The rest of the columns are gene names and their expression values. Sometimes one patient has only one sample: colon or ileum, the other one missing. The rows begin with 123(colon) and then the rest of the values for each gene. I want to modify my data in such way that I don`t have two rows for one patient, for example 123.colon and 123.ileum but a single row combining the two. Something like: "123 colon.gene1 colon.gene2...ileum.gene1 ileum.gene2..."

so far I was able to select the data from one patient (the two samples or just one if the case) with this code:

ptn = '^010.*?'
ndx = grep(ptn, expression$sample_id, perl=T)
selected_rows = expression[ndx,]
selected_rows

This however just selects the information I want, like this:

> selected_rows
# A tibble: 2 × 39,879
       sample_id ENSG00000004099 ENSG00000003956 ENSG00000973153 ENSG00000004498 ENSG00000003139
           <chr>           <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
1 010 (Ileum)        6.674229        7.645929        5.850019        5.322049       0.6259249
2 010 (Colon)        6.861709        6.768619        5.950409        5.752779       0.3727669
# ... with 39873 more variables: ENSG00000003509 <dbl>, ENSG00000001036 <dbl>,

but I can`t figure out how to go from here. I need to concatenate but still keep track of which gene expression belongs to which organ. Thank you.

Expected result something in the shape of:

sample_id   ENSG1-Ileum ENSG2-ileum ENSG3-Ileum ENSG4-Ileum ENSG5-Ileum… ENSG1-Colon ENSG2-Colon ENSG3-Colon ENSG4-Colon ENSG5-Colon…
010         6.674229 7.645929 5.850019 5.322049 0.625924… 9.861709 6.768619 5.950409 5.752779 0.3727669…
# ... with 39873 more variables: ENSG00000003509 <dbl>, ENSG00000001036 <dbl>,
#

To say it otherwise (removing the biological factor): How do I transform this:

p_id   g1   g2   g3
p1_a   vn   vn   vn
p1_b   vn   vn   vn
p2_a   vn   vn   vn
p2_b   vn   vn   vn

into this:

p_id   g1pa    g2pa    g3pa   g1pb    g2pb    g3pb
p1     vn      vn      vn     vn      vn      vn 
p2     vn      vn      vn     vn      vn      vn

vn are just floating points that may or may not be equal

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
Megiddo
  • 53
  • 6
  • Maybe you want to reshape wide? See [this post](http://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format). – lmo Dec 30 '16 at 17:00
  • Can you add an example of how you would like the output. This is almost certainly just a case of reshaping the data (perhaps into long format using `melt`). But without precise definition of what you want, it's not possible to give a precise answer. Depending on the format you want, there is almost certainly already an answer on this site showing how to do it. – dww Dec 30 '16 at 17:15
  • I want it to look something like this: 123 colon.gene1 colon.gene2...ileum.gene1 ileum.gene2... and then the next patient, say 124 with colon.gene1 colon.gene2...ileum.gene1 ileum.gene2... where 123 and 124 takes place of the old sample_id. Basically the word colon goes away from sample_id and replaces itself after every gene that comes from "colon" sample; the same for the "ileum" – Megiddo Dec 30 '16 at 18:19
  • Why not add a column "tissue," and populate it with "Colon" or "Ileum? What you are proposing will almost certainly cause headaches for later analyses down the road. – emilliman5 Dec 30 '16 at 18:26
  • I have edited the post with expected results. I want to cluster patients and this way colon groups together and ileum groups together. I want to find some patient profile based on both ileum and colon and so far this is what I have considered a good start. Thank you for your input; I`ll take your suggestions into account and as soon as time permits I`ll come back with the solution (if any). – Megiddo Jan 01 '17 at 10:45
  • Converting a set of rows of one column into aone row of set of columns is called a **pivot** operation. – Has QUIT--Anony-Mousse Jan 01 '17 at 15:42

1 Answers1

0

With lots of help, here is the solution:

library(pbapply)

expressions <- lapply(selected.patients, function(pat) {
  expr.ileum <- expression[sample.info$patient == pat & sample.info$origin == "Ileum",-1]
  expr.colon <- expression[sample.info$patient == pat & sample.info$origin == "Colon",-1]
  expr.comb <- data.frame(Patient = pat, Ileum = expr.ileum, Colon = expr.colon)
  expr.comb
})

library(dplyr)
expression.comb <- bind_rows(expressions)
dim(expression.comb)
expression.comb[1:6, 1:6]
rm(expressions)

and here is the output:

> expression.comb[1:6, 1:6]
  Patient Ileum.ENSG58 Ileum.ENSG47 Ileum.ENSG36 Ileum.ENSG56 Ileum.ENSG54
1  SV-121    5.635079    5.538374     3.543571    2.579083     5.2418536
2  07-004    6.403112    2.131415     1.171672    5.771545     0.9711039
3  01-010    1.674221    3.645928     8.850013    2.322047     0.6259241
4  KD-081    6.714947    6.643261     6.328018    9.709639     4.4346171
5  KV-247    6.771593    2.798890     6.370311    1.830785     5.5326661
6  MK-101    6.312121    2.570204     2.632856    3.402819     5.6695553
Megiddo
  • 53
  • 6