
I have a CSV-formatted input matrix like this:

         x     y      z
a1_b1   0.6   0.44   0.6
a1_b2   0.9   0.1    0.04
a2_b1   0.7   0.02   0.7
a2_b2   0.5    0.4   0.11

From this, I want a row-wise comparison of the b# values for each combination of a# and variable. The expected output for the above matrix would look like:

    *_b1    *_b2    *_b1-*_b2   Calculate_Positives (for each variable)
x   0.6     0.9     -0.3        = # of positive values for x (in 4th column) / # of x rows
x   0.7     0.5      0.2
y   0.44    0.1      0.34       = # of positive values for y / # of y rows
y   0.02    0.4     -0.38
z   0.6     0.04     0.56       = # of positive values for z / # of z rows
z   0.7     0.11     0.59

I would highly appreciate it if anyone could suggest awk/sed/R code for the above.

Please see the linked model (CSV-formatted) input and output files.

  • Welcome to SO. Stack Overflow is a question and answer site for professional and enthusiast programmers. The goal is that you add some code of your own to your question to show at least the research effort you made to solve this yourself. – Cyrus Nov 01 '18 at 21:46
  • Please include any code or data in your question, rather than linking externally to it. – Anonymous coward Nov 01 '18 at 21:51
  • You are seeking to _transpose_ a matrix. You can do that with `awk`, but line-oriented tools aren't best-suited for this, because you need to read the entire input first, no matter what... Your question is answered [here](https://stackoverflow.com/questions/1729824) – Mikhail T. Nov 01 '18 at 21:51

1 Answer

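Here is a tidyverse solution, assuming `df` holds the matrix with the `a#_b#` labels as row names:

library(tidyverse)

df %>%
    rownames_to_column("id") %>%              # move row names (a1_b1, ...) into a column
    gather(row, value, -id) %>%               # long format: one row per (id, variable)
    separate(id, into = c("tmp", "col")) %>%  # split "a1_b1" into tmp = "a1", col = "b1"
    spread(col, value) %>%                    # one column each for b1 and b2
    select(-tmp) %>%                          # drop the a# label
    arrange(row) %>%                          # sort by variable (x, y, z)
    mutate(`b1-b2` = b1 - b2)                 # difference per row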
#  row   b1   b2 b1-b2
#1   x 0.60 0.90 -0.30
#2   x 0.70 0.50  0.20
#3   y 0.44 0.10  0.34
#4   y 0.02 0.40 -0.38
#5   z 0.60 0.04  0.56
#6   z 0.70 0.11  0.59
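
To run the pipeline, `df` has to exist with the `a#_b#` labels as row names, since `rownames_to_column("id")` reads them from there. A minimal sketch, assuming the sample values from the question and a header cell named `id` for the label column (the real file may leave that cell empty):

```r
# Sample data from the question as CSV text; the first column holds the row labels.
csv_text <- "id,x,y,z
a1_b1,0.6,0.44,0.6
a1_b2,0.9,0.1,0.04
a2_b1,0.7,0.02,0.7
a2_b2,0.5,0.4,0.11"

# row.names = 1 turns the first column into row names instead of a data column.
df <- read.csv(text = csv_text, row.names = 1)

print(df)
```

For a file on disk, the same call with a path should work, e.g. `read.csv("input.csv", row.names = 1)`, where `input.csv` is a hypothetical file name.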

Update: the proportion of positive `b1-b2` values per variable

df %>%
    rownames_to_column("id") %>%
    gather(row, value, -id) %>%
    separate(id, into = c("tmp", "col")) %>%
    spread(col, value) %>%
    select(-tmp) %>%
    arrange(row) %>%
    mutate(`b1-b2` = b1 - b2) %>%
    group_by(row) %>%
    summarise(no_positives = sum(`b1-b2` > 0) / n())
## A tibble: 3 x 2
#  row   no_positives
#  <chr>        <dbl>
#1 x              0.5
#2 y              0.5
#3 z              1
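
If the proportion is wanted as an extra column next to the per-row differences, rather than as a separate summary table, one possible variation is to swap `summarise()` for a grouped `mutate()`. This is only a sketch: the column name `prop_positive` is made up for illustration, and `df` is rebuilt here from the question's sample values so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Sample matrix from the question, with the a#_b# labels as row names.
df <- data.frame(
    x = c(0.6, 0.9, 0.7, 0.5),
    y = c(0.44, 0.1, 0.02, 0.4),
    z = c(0.6, 0.04, 0.7, 0.11),
    row.names = c("a1_b1", "a1_b2", "a2_b1", "a2_b2")
)

res <- df %>%
    rownames_to_column("id") %>%
    gather(row, value, -id) %>%
    separate(id, into = c("tmp", "col")) %>%
    spread(col, value) %>%
    select(-tmp) %>%
    arrange(row) %>%
    mutate(`b1-b2` = b1 - b2) %>%
    group_by(row) %>%
    # mean() of a logical vector is the share of TRUE values, i.e.
    # (# of positive differences) / (# of rows) within each variable.
    mutate(prop_positive = mean(`b1-b2` > 0)) %>%
    ungroup()

print(res)
```

Every row keeps its `b1`, `b2`, and `b1-b2` values, and `prop_positive` repeats the per-variable share on each row: 0.5 for x and y, and 1 for z, matching the summary above.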
Maurits Evers
  • Excellent suggestion. I will look more into the tidyverse library. Can you please elaborate on how to add a fifth column holding the number of positive values for each variable divided by the total number of rows for that variable? I can count the positives in the 4th column with ``sum(df$`b1-b2` > 0)``, but that does not give me x-, y-, and z-specific counts. – Abhijit Barerjee Nov 02 '18 at 02:49
  • @AbhijitBarerjee I'm not sure what you mean. My code reproduces your expected output. I recommend running the code line-by-line to understand what each line does. I've updated my answer to summarise the number of positive `b1-b2` values per `row`. Perhaps that is what you're after. – Maurits Evers Nov 02 '18 at 03:02
  • Sorry, I might not have been very specific. The code works fine, but I wanted to add one extra column. In that column, for x, it would calculate 1/2 = 0.5 [because the number of positive values for x is 1 (as found in the fourth column) and the total number of x rows is 2 (as found in the first column)]. Likewise for y and z, and for any other variables I might have. – Abhijit Barerjee Nov 02 '18 at 03:21
  • @AbhijitBarerjee Just divide `no_positives` by `n()`, see my update. – Maurits Evers Nov 02 '18 at 03:23