1

I failed about 10 approaches on this:

I have a data formatted in 3 columns:

H7  200,3   2227649,5
H13 200,3   1084585
H15 200,3   873123,1
Zdrowy  200,3   2245707
Z26 202,7   1742185,9
Zdrowy  202,7   2420348,5
K4  203,5   10251587
Z18 203,5   6904013
H7  203,5   5765803
H13A    203,5   4219047,5
H13 203,5   1791847,1
H15 203,5   1260959,3
K3  203,5   9878843
K5  203,5   3406241
K35 203,5   4746493
K37 203,5   6303874
Z11 203,5   1415927,8
Z16 203,5   6245137,5
Z26 203,5   7330417,5

columns 1 and 2 do contain values that are not unique. What I need to do with this is to get a matrix with column1 as row names, column2 as column names and column3 as the corresponding values in the cells. If the value is duplicated (since columns 1,2 have some non unique values) cell value should be an average.

Any help with that?

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
redaktroll
  • 63
  • 1
  • 1
  • 9
  • if col 2 is column names and col 3 is values but col 1 becomes row names, does that mean the rest of the rows apart from row 1 will be blanks / NAs? – vagabond Oct 27 '16 at 21:03
  • I didn't quite catch your comment. NAs will be those cells where given column1 across all column2 values did not have any column3. pass the data aboe into: library(tidyr) ee <- spread(df, V2,V3, fill=0) and this is what I want in the end, just don't know how do deal with duplicated rows – redaktroll Oct 27 '16 at 21:07

1 Answers1

3

It sounds like you are trying to convert your data from long to wide format. You can do this with the package reshape2. Here is a mini-example:

# Create dummy data
df <- data.frame(col1=sample(c('a','b','c'), 10, replace=T), col2=sample(1:3, 10, replace=T), col3=rnorm(10))
df
   col1 col2       col3
1     b    2  0.1514541
2     c    3 -0.2566596
3     a    3 -0.8939474
4     a    1 -0.2174930
5     a    1 -0.1739861
6     b    1 -0.4525370
7     b    2 -0.5592760
8     b    3  0.5206133
9     a    2 -1.9239337
10    c    2 -0.1581582

# Load the library
library(reshape2)

# Cast the library into wide format
df.long <- dcast(df, col1~col2, fun.aggregate=mean)
df.long
col1          1          2
1    a -0.1957395 -1.9239337
2    b -0.4525370 -0.2039110
3    c        NaN -0.1581582
           3
1 -0.8939474
2  0.5206133
3 -0.2566596

If no value exists for some of the combinations of factors in columns 1 and 2 this will insert an NaN.

gfgm
  • 3,627
  • 14
  • 34