0

Question 1)

I have an R data table with three columns (The actual dataset is bigger but simplifying for better understanding)

Column_One, Column_Two, Column_Three

A, 1, 4
A, 2, 3
A, 3, 77
B, 1, 44
B, 2, 32
B, 3, 770
C, 1, 43
C, 2, 310
C, 3, 68

I want to create a new matrix (data table) from the above as shown below.

A, B, C
4, 44, 43
3, 32, 310
77, 770, 68

Please note in the actual data table there are hundreds of different values for column one and two. Hence a generic solution would be needed.

Any questions, please let me know. Much appreciative of any suggestions.

Question 2)

There could be another level as in a fourth column, column zero, which links a few of the column ones. In this case we need to create new data tables based on column zero and then apply the solution to column one on each sub data table. Please suggest the quickest / simplest way possible.

Column_Zero, Column_One, Column_Two, Column_Three

XX,A, 1, 4
XX,A, 2, 3
XX,A, 3, 77
XX,B, 1, 44
XX,B, 2, 32
XX,B, 3, 770
XX,C, 1, 43
XX,C, 2, 310
XX,C, 3, 68       
YY,A1, 1, 4
YY,A1, 2, 3
YY,A1, 3, 77
YY,B1, 1, 44
YY,B1, 2, 32
YY,B1, 3, 770
YY,C1, 1, 43
YY,C1, 2, 310
YY,C1, 3, 68 
YY,D2, 1, 4
YY,D2, 2, 5
YY,D2, 3, 6 

--------- And so on -----

We then need to create,

------ Data Table one ------

A, B, C
4, 44, 43
3, 32, 310
77, 770, 68

------ Data Table Two ------

A1, B1, C1, D2
4, 44, 43,4
3, 32, 310,5
77, 770, 68,6

------ and so on -----

Related Question:

Once this matrix is split and recast, it becomes important to know the dimensions of the new data structure and its components and also how to access them individually, which is discussed here:

R Finding Multidimension Array Dimension Sizes

Community
  • 1
  • 1
texmex
  • 151
  • 9
  • And please note, a `data.table` in R is very different from a matrix. – Heroka Jan 25 '16 at 14:25
  • @Heroka … Thanks for your answer. I am using a data table as a store for a matrix. Is there another structure that would be suited better. Clearly you can tell, I know very little R. – texmex Jan 25 '16 at 14:26
  • 2
    a `data.table` is a data-table object, capable of holding different data-types (and `data.table` offers a lot of speed and ease in grouped operations). A matrix can only hold one datatype. Do yourself a favour, and read up on data-types and their properties in R. It will make learning a lot easier for you. – Heroka Jan 25 '16 at 14:28
  • @Heroka Much appreciated .. I have been looking at the online guides mostly … But R has too many things … Will certainly brush up more. Appreciate you guys tolerating a newbie and helping out ... – texmex Jan 25 '16 at 14:31
  • @akrun Certainly will do … I will check it out shortly and accept … I do not have enough points to vote up the answers, but I am much appreciative of the pointers. But someone voted the answers down and I am sorry for any loss that created .. – texmex Jan 25 '16 at 14:33
  • @akrun how do you know? (it wasn't me) – Jaap Jan 25 '16 at 14:37
  • @akrun … I am much appreciative of the suggestions received. I have modified the question to be slightly more involved and actual case. I was planning to do this manually, but looks like there might be an easy solution for question 2 as well. – texmex Jan 25 '16 at 14:45
  • @Jaap Please note I have modified the question …. Is there any way to tackle the question 1 and question 2. (Question 1 is essentially the same as before) Question 2 is the overall problem to be solved … Looking to do without big loops or manually ... – texmex Jan 25 '16 at 14:48
  • @texmex Updated my solution. Though, now it is not a clear dupe, it is too bad that it will remain as dupe.. – akrun Jan 25 '16 at 14:51
  • 3
    If question 2 is solved, will there be a question 3 too? – mtoto Jan 25 '16 at 14:51
  • 2
    @texmex If you have a new question, you should post a new question, not edit an existing one. – Jaap Jan 25 '16 at 14:53
  • @Jaap sorry for creating confusion .... but once the question was marked duplicate, the system asked me to edit it to modify the question .... the second one being a natural extensions of the first... I edited it ... Happy to make it a new question, if that is more appropriate ... – texmex Jan 26 '16 at 01:32
  • @mtoto Please note a third question did come up ... I have made it into a new one ... If any suggestions, please let me know. If this present question is no longer a duplicate, I am happy to repost it and delete this one if better for the forum ... – texmex Jan 26 '16 at 03:32

2 Answers2

2

We can use acast to convert from 'long' to 'wide' format. The resulting dataset will be a matrix.

library(reshape2)
acast(df1, Column_Two~Column_One, value.var="Column_Three")
#   A   B   C
#1  4  44  43
#2  3  32 310
#3 77 770  68

For the second dataset, we can split by "Column_Zero" and then loop over the list and do the acast as before

 lst <- lapply(split(df2[-1], df2$Column_Zero), function(x) 
         acast(x, Column_Two~Column_One,value.var="Column_Three"))

lst
#$XX
#   A   B   C
#1  4  44  43
#2  3  32 310
#3 77 770  68

#$YY
#  A1  B1  C1 D2
#1  4  44  43  4
#2  3  32 310  5
#3 77 770  68  6
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Could you please clarify why you have a -1 in df2[-1] ? – texmex Jan 26 '16 at 01:34
  • I tried with and without the -1 seems to work ... with the -1 I get a warning "in split.default... data length is not a multiple of split variable " – texmex Jan 26 '16 at 02:33
  • @texmex The `df2[-1]` is removing the first column ie. 'Column_Zero' as it is not needed in the result. Do you have a `matrix` or `data.frame` as initial dataset? If you have a `matrix`, it is better to convert to `data.frame` before doing the `split`. i.e. `df2 <- as.data.frame(df2)` and the warning will go away. – akrun Jan 26 '16 at 04:33
  • @texmex If it is important to know the dimensions in the `lst`, try `lapply(lst, dim)` . To extract the individual components, use `[[` i.e. `lst[[1]]` gets the first output dataset, `lst[[2]]` 2nd and so on.. – akrun Jan 26 '16 at 04:46
  • How do I find out that there are two matrices here, $XX and $YY. Also, how do I get those names into variables? Much appreciative as always of your help. Please see related question: http://stackoverflow.com/questions/35006572/r-finding-multidimension-array-dimension-sizes – texmex Jan 26 '16 at 06:37
  • Also, How do I find the minimum number of columns among all the sub matrices that are created by split? Thanks again – texmex Jan 26 '16 at 06:43
  • @texmex You can do `min(sapply(lst, ncol))` – akrun Jan 26 '16 at 11:45
  • … Please note I have created a separate question with a few related themes on multidimensional stuff. Thanks for your inputs. http://stackoverflow.com/questions/35006572/r-finding-multidimension-array-dimension-sizes – texmex Jan 26 '16 at 12:45
1

A possible tidyr / dplyr solution:

library(dplyr)
library(tidyr)

df %>% spread(Column_One, Column_Three) %>% select(2:4)

#   A   B   C
#1  4  44  43
#2  3  32 310
#3 77 770  68
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
mtoto
  • 23,919
  • 4
  • 58
  • 71