I am stuck on a problem. I have a matrix in the following format:

Sequence      Position   Raw Binding (log ratio)
UC001AOZ.3      146     -0.746
UC001AOZ.3      147     -1.27
UC001AOZ.3      148     -1.66
UC001AOZ.3      149     -2.16
UC001AOZ.3      150     -2.08
...             ...      ...
UC222AOF.2     5000      1.22
UC222AOF.2      146     -1.12
UC222AOF.2      147     -1.41
...             ...      ...
UC222AOF.2     5000      5.13
...             ...      ...

The first column (Sequence) describes genes by these cryptic names. The second column is a position within the human genome and the third column refers to a value for an event.

The position goes up to 5000 and then starts at 146 again for the next gene (see the format above, second gene name "UC222AOF.2"). In total there are 250 genes, each with 4854 positions and corresponding Raw Binding values.

I want to get mean values of all Raw Binding (log ratio) values at the positions between 146 and 5000.

One possibility could look like this (values might be different than above):

            146      147     148     149     ...    5000
UC001AOZ.3  -0.746   -1.27   -1.66   -2.16   ...     1.22
UC222AOF.2  -1.12    -1.41   -1.31   -1.81   ...     5.13
UC002BW1.1  -0.112   -0.31   -0.51   -1.01   ...     1.01

I am no R regular but know some basics. Thank you in advance!

S.Baum
  • Good start for getting solid answers: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – s_baldur Jul 27 '16 at 20:16
  • It seems you can use `data.table` and do ``DT[Position >= 146, mean(`Raw Binding (log ratio)`)]`` – s_baldur Jul 27 '16 at 20:18
  • Thank you for that helpful link! I uploaded a sample here http://www.filehosting.org/file/details/589167/sample.tab – S.Baum Jul 27 '16 at 20:27
  • Please realize that all of us are volunteers, and often we look quickly at a question and move on. First, by posting a link, you make it harder for us to see anything. If that's not enough, you post it in a comment, making it that much less likely that somebody will take the time to try to help you. I strongly urge you to add the output from `dput(x)` (where `x` is a *small and representative* portion of your data or sample data to recreate what you need). – r2evans Jul 27 '16 at 20:29
  • Do you want the mean over positions for each gene or the mean over genes for each position? – aichao Jul 27 '16 at 20:37
  • Thank you for the qualified enhancements. Next time I will respect these suggestions. – S.Baum Jul 27 '16 at 21:05

1 Answer

The dcast() function of the reshape2 package might be of use.

library(reshape2)
df
#     Sequence Position Binding
# 1 UC001AOZ.3      146  -0.746
# 2 UC001AOZ.3      147  -1.270
# 3 UC001AOZ.3      148  -1.660
# 4 UC001AOZ.3      149  -2.160
# 5 UC001AOZ.3      150  -2.080
# 6 UC222AOF.2     5000   1.220
# 7 UC222AOF.2      146  -1.120
# 8 UC222AOF.2      147  -1.410

dcast(df, Sequence ~ Position, value.var = "Binding")
#     Sequence    146   147   148   149   150 5000
# 1 UC001AOZ.3 -0.746 -1.27 -1.66 -2.16 -2.08   NA
# 2 UC222AOF.2 -1.120 -1.41    NA    NA    NA 1.22

Essentially, you "swing up" the Position column so that its values become a set of columns, and tell R to use the values in the Binding column to fill in your newly minted columns.

http://seananderson.ca/images/dcast-illustration.png from http://seananderson.ca/2013/10/19/reshape.html is a great visual representation of the dcast() function.
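To get from the reshaped table to the means asked about in the question, something like the following might work. It is only a sketch: it rebuilds the small example data by hand (your real data would come from your file), assumes the third column is named Binding as above, and the object names `wide`, `m`, `position_means` and `gene_means` are made up for illustration. Since the question does not say which mean is wanted, both are shown.

```r
library(reshape2)

# Toy data copied from the example above; "Binding" is the column name used in this answer.
df <- data.frame(
  Sequence = c(rep("UC001AOZ.3", 5), rep("UC222AOF.2", 3)),
  Position = c(146, 147, 148, 149, 150, 5000, 146, 147),
  Binding  = c(-0.746, -1.27, -1.66, -2.16, -2.08, 1.22, -1.12, -1.41),
  stringsAsFactors = FALSE
)

# Wide format: one row per gene, one column per position.
wide <- dcast(df, Sequence ~ Position, value.var = "Binding")

# Drop the Sequence column and keep the gene names as row names.
m <- as.matrix(wide[, -1])
rownames(m) <- wide$Sequence

# Mean over genes for each position (one value per column).
# na.rm = TRUE skips positions that are missing for a gene.
position_means <- colMeans(m, na.rm = TRUE)

# Mean over positions for each gene (one value per row).
gene_means <- rowMeans(m, na.rm = TRUE)

# The same per-position means straight from the long data, base R only.
position_means_long <- tapply(df$Binding, df$Position, mean)
```

If only the per-position means are needed, the `tapply()` line alone should be enough and the reshaping step can be skipped.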

tluh