-1

I've got a df (300000 obs by 15 var). A simplified version looks like this:

    Col1        Col2 
1   TRGV3*01    YCAVWIGSTSWVKIFA

2   TRGV7*02    YCASWAGQSGSGFHKVFA

3   TRGV7*02    YCASWAGRGSGFHKVFA

4   TRGV3*01    YCAVWINLGTSWVKIFA

5   TRGV3*01    YCAVWSGTSWVKIFA

6   TRGV7*02    YCASWAGRESSGFHKVFA

7   TRGV7*02    YCASWAVYSSGFHKVFA

8   TRGV7*02    YCASWSSSGFHKVFA

9   TRGV2*01    YCAVWICGTSWVKIFA

I want to subset the df and create a new df such that for a particular value in col1 (TRGV7*02), all possible values of Col2 are subsetted and the output is something like this:

TRGV7*02                 TRGV3*01                     TRGV2*01

YCASWAGQSGSGFHKVFA       YCAVWIGSTSWVKIFA             YCAVWICGTSWVKIFA

YCASWAGRGSGFHKVFA        YCAVWSGTSWVKIFA    

YCAVWINLGTSWVKIFA       

YCASWAGRESSGFHKVFA      

YCASWAVYSSGFHKVFA       

YCASWSSSGFHKVFA     

Once I do this I need to do some analysis and I can't figure out how to do that either... 1. Spit out another df from the one just created above with the same format but only with all unique values in that particular column (lot of repeats in the original dataset) 2. Compare the values generated in each column above and figure out which values in each column are unique compared to the values in all the columns or when just trying to compare it with one other column.

Hope that made sense. Thank you so much in advance!

NicE
  • 21,165
  • 3
  • 51
  • 68
Nina
  • 19
  • 3
  • see : [Here][1] - Extracting specific columns from a data frame [1]: http://stackoverflow.com/questions/10085806/extracting-specific-columns-from-a-data-frame – nAPL Mar 01 '15 at 20:55
  • That link refers to just splitting out whole columns. I need to make 1 column into the row titles and list all values for that row. Alos, any idea how to do the next steps I mentioned? – Nina Mar 01 '15 at 21:01

2 Answers2

1

Here's a possible data.table solution (if the column order doesn't metter, you can skip the setcolorder part)

library(data.table)
setcolorder(dcast(setDT(df), Col2 ~ Col1)[, 
          lapply(.SD, sort, na.last = TRUE), .SDcols = -c("Col2")], 
          c("TRGV7*02", "TRGV3*01", "TRGV2*01"))[]
#              TRGV7*02          TRGV3*01         TRGV2*01
# 1: YCASWAGQSGSGFHKVFA  YCAVWIGSTSWVKIFA YCAVWICGTSWVKIFA
# 2: YCASWAGRESSGFHKVFA YCAVWINLGTSWVKIFA               NA
# 3:  YCASWAGRGSGFHKVFA   YCAVWSGTSWVKIFA               NA
# 4:  YCASWAVYSSGFHKVFA                NA               NA
# 5:    YCASWSSSGFHKVFA                NA               NA
# 6:                 NA                NA               NA
# 7:                 NA                NA               NA
# 8:                 NA                NA               NA
# 9:                 NA                NA               NA
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Thanks! I tried this but I keep getting an error: col1 = v.TRV for me and col2 = x.CDR3 > setcolorder(dcast(setDT(gamma.TRVfilt), x.CDR3 ~ v.TRV)[, + lapply(.SD, sort, na.last = TRUE), .SDcols = -c("x.CDR3")], + c("TRGV1*02", "TRGV1*04", "TRGV1*05", "TRGV1*06", "TRGV1*07", "TRGV2*01", "TRGV2*02", "TRGV4*01", "TRGV4*04", "TRGV5*01", "TRGV6*02", "TRGV7*02"))[] Using num as value column: use value.var to override. Aggregation function missing: defaulting to length Error in `[.data.frame`(dcast(setDT(gamma.TRVfilt), x.CDR3 ~ v.TRV), , : unused argument (.SDcols = -c("x.CDR3")) – Nina Mar 01 '15 at 21:11
  • Did you load the `data.table` package? Also, does just `dcast(setDT(gamma.TRVfilt), x.CDR3 ~ v.TRV)[, lapply(.SD, sort, na.last = TRUE), .SDcols = -c("x.CDR3")]` works? – David Arenburg Mar 01 '15 at 21:20
  • Get the following error. Using num as value column: use value.var to override. Aggregation function missing: defaulting to length Error in `[.data.frame`(dcast(setDT(gamma.TRVfilt), x.CDR3 ~ v.TRV), , : unused argument (.SDcols = -c("x.CDR3")) By doing: > format= dcast(gamma.TRVfilt, x.CDR3 ~ v.TRV, value.var="num", fun.aggregate=sum, na.rm=TRUE) I get a matrix with values in the cell using some values off the original df. Don't knw how you got the actual values of Col2 (x.CDR3) instead of the numeric values its picking up. Something to do with .SDcols...? that I don't get :/ – Nina Mar 01 '15 at 21:32
  • Yes. THis is what I get Get the following error. Using num as value column: use value.var to override. Aggregation function missing: defaulting to length Error in [.data.frame(dcast(setDT(gamma.TRVfilt), x.CDR3 ~ v.TRV), , : unused argument (.SDcols = -c("x.CDR3")) – Nina Mar 01 '15 at 21:50
  • The data that you provided don't have these columns. It only has `Col1` and `Col2`. – David Arenburg Mar 01 '15 at 21:52
  • Same issue. setcolorder(dcast(setDT(df), Col2 ~ Col1)[,lapply(.SD, sort, na.last = TRUE), .SDcols = -c("Col2")], + c("TRGV7*02", "TRGV3*01", "TRGV2*01"))[] Using Col2 as value column: use value.var to override. Error in `[.data.frame`(dcast(setDT(df), Col2 ~ Col1), , lapply(.SD, sort, : unused argument (.SDcols = -c("Col2")) – Nina Mar 01 '15 at 22:17
  • What `data.table` version do you have? – David Arenburg Mar 01 '15 at 22:22
  • I'm updating on R studio on mac. It didn't show that there was a newer version available but will let you know if it installs regardless in a bit. In the meantime, could you please tell me how you got the col2 values to fill the matrix instead of a numeric variable value? – Nina Mar 01 '15 at 22:35
  • I've got 1.9.4(had some trouble) But getting the same error. Using num as value column: use value.var to override. Aggregation function missing: defaulting to length Error in `[.data.frame`(dcast(setDT(gamma.TRVfilt), x.CDR3 ~ v.TRV), , : unused argument (.SDcols = -c("x.CDR3")) – Nina Mar 02 '15 at 01:10
  • Did you `library(data.table)` ? – David Arenburg Mar 02 '15 at 01:11
  • Just doing format=dcast(df, Col2~Col1) works fine - but I end up with NAs. Using your code, keep getting the "unused argument (.SDcols = -c("Col2")) error. Thank you so much btw! – Nina Mar 02 '15 at 01:44
  • I don't know. This works perfectly fine on the data you provided. If you can provide a data that this code fails on, it could be usefull – David Arenburg Mar 02 '15 at 09:19
0

I would suggest splitting one column by the other, and then using stri_list2matrix from the "stringi" package to convert the resulting list to a matrix. Once the data are in a list, you can also use lapply to do any filtering or sorting you might want to do on each list item.

Here's an example (assuming your source data.frame is called "mydf"). I've separated it out into 4 steps so you can see what's happening, but it's possible with more compact code.

library(stringi)
A <- with(mydf, split(Col2, Col1))          # (1) Split col1 by col2
B <- lapply(A, function(x) sort(unique(x))) # (2) Extract unique and sort
C <- stri_list2matrix(B)                    # (3) Convert to matrix
colnames(C) <- names(B)                     # (4) Add column names
C
#      TRGV2*01           TRGV3*01            TRGV7*02            
# [1,] "YCAVWICGTSWVKIFA" "YCAVWIGSTSWVKIFA"  "YCASWAGQSGSGFHKVFA"
# [2,] NA                 "YCAVWINLGTSWVKIFA" "YCASWAGRESSGFHKVFA"
# [3,] NA                 "YCAVWSGTSWVKIFA"   "YCASWAGRGSGFHKVFA" 
# [4,] NA                 NA                  "YCASWAVYSSGFHKVFA" 
# [5,] NA                 NA                  "YCASWSSSGFHKVFA"
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485