2

I have checked out the following questions on StackOverFlow and links to other R help books:

R: Reshape Data Long to Wide - understanding reshape parameters

Reshape long to wide with multiple groupings

How to reshape data from long to wide format?

http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

I want to take my data which is in long format (two columns, with 6200 entries) with non-unique names in column "id" with different sequences in "sequence" and reshape into a wide format where the column headers are now the "id" with all of the sequences from "sequence" listed underneath each id.

          id    sequence
1   CK1alpha TPSIAsDISLP
2   CK1alpha IASDIsLPIAT
3       CDK1 SVPSSsPGTSV
4       CDK1 EGCQGsPQRRG
5   CK1alpha DICEDsDIDGD
6 PKCepsilon IHGSDsVKSAE

What I would like to get:

id          CK1alpha    CDK1        PKCepsilon
sequence    TPSIAsDISLP SVPSSsPGTSV
            IASDIsLPIAT EGCQGsPQRRG
            DICEDsDIDGD 

I have tried to use reshape

kinase_sub_wide <- reshape(kinase_substrate, idvar = "id", timevar = "sequence", direction = "wide")

However I get warning messages saying multiple rows match:

Warning messages:
1: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=TPSIAsDISLP: first taken
2: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=IASDIsLPIAT: first taken
3: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=RSQSRsNSPLP: first taken

I have also tried to use spread

kinase_substrate_wide <- spread(kinase_substrate, id, sequence)

but get errors with duplicate identifiers:

> kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
Error: Duplicate identifiers for rows (1812, 1813, 4469), (906, 3349), (253,     285, 2114, 2174, 3022, 4385, 4501), (155, 203, 218, 261, 316, 542, 682, 1021, 1123, 1238, 1492, 1919, 1938, 1997, 2064, 2139, 2323, 2387, 2597, 2826, 3058, 3377, 3899, 4024, 4135, 4241, 4314, 4617, 4733, 5055, 5289, 5467, 5726, 5952, 6165), (72, 272, 749, 1100, 2792, 3573, 3858, 4254, 4257), (209, 548, 637, 653, 1034, 1038, 1213, 1387, 1445, 1475, 1476, 1692, 1735, 2635, 3180, 4005, 4661, 4988, 5672, 5870, 6042), (21, 1802), (23, 24, 30, 49, 60, 86, 122, 127, 137, 177, 182, 227, 250, 260, 268, 270, 299, 347, 356, 361, 400, 424, 425, 448, 483, 488, 494, 509, 510, 512, 522, 523, 524, 540, 559, 572, 612, 614, 616, 622, 720, 750, 774, 794, 816, 820, 829, 866, 868, 912, 916, 918, 940, 946, 955, 962, 984, 992, 1004, 1013, 1054, 1055, 1070, 1073, 1083, 1086, 1105, 1140, 1154, 1164, 1179, 1222, 1228, 1230, 1284, 1295, 1316, 1318, 1333, 1334, 1348, 1356, 1375, 1383, 1389, 1390, 1406, 1421, 1444, 1458, 1473, 1474, 1490

How can I use either of the functions above in order to get the data into wide format and bring each sequence that it corresponds under the column of the id?

Thanks in advance.

EDIT #1

Using the suggestion to include an index from David's comment gets me there

reshape(transform(df, indx = ave(as.character(id), id, FUN = seq)), idvar = "indx", timevar = "id", direction = "wide")

resulting in this:

   indx sequence.CK1alpha sequence.CDK1 sequence.PKCepsilon sequence.GRK2 sequence.ICK sequence.CDK5 sequence.PKCbeta sequence.PAK1 sequence.GSK3beta
1     1       TPSIAsDISLP   SVPSSsPGTSV         IHGSDsVKSAE   DIDESsPGTEW  VDRLQsEPESI   AQAPSsPRVTE      GAQAPsSPRVT   AQERPsQAAPA       NIDNLsPKASH
2     2       IASDIsLPIAT   EGCQGsPQRRG         KLSGLsFKRNR   EKKEEsEESDD  DNRVPsPPPTG   PAEVKsPEKAK      DESTGsIAKRL   RSRTPsASNDD       FNYNPsPRKSS
5     3       DICEDsDIDGD   TLNSGsPEKTC         TALAPsTMKIK   EESEEsDDDMG  PDTKDsPVCPH   QKPAAsPRPRR      IVENLsSRCSW   KQKVDsLLENL       SSGAKsPSKSG
7     4       TFEDLsDVEGG   HVAVSsPTPET         VAKRLsLTMGG   MNSSIsSGSGS  LKVEGsPTEEA   DFTCGsPTAAG      YPVSPsDKVLI   RALRAsESGI_       FPDDLsLDHSD
16    5       PRSGRsPTGNT   TEVPRsPKHAH         EKLVLsKLYEE   RPTSIsWDGLD  ESERGsGSQSS   SDTVTsPQRAG      EKKVVsLNGEL   PGSPLsSQPVL       YSDSIsPFNKS
29    6       MSDTGsPGMQR   KYSPTsPTYSP         EILNRsPRNRK   KNRPTsISWDG         <NA>   GRGAEsPFEEK      LVNSAsAQKRS   SSKTAsLPGYG       PSRTAsFSESR

EDIT #2

Is there are way in the reshape function to avoid putting in "sequence." in front of each name? Or will I have to turn to a regex to rename all of the column names?

EDIT #3

Used gsub to remove "sequence." from the column names and assigned it to a variable:

new_col_names <- names(DF) <- gsub("sequence.", "", names(DF))

followed by applying the new_col_names to the data frame

colnames(DF) <- new_col_names

Thanks for helping me out everyone!

Community
  • 1
  • 1
  • You have to try `dcast(df, sequence ~ id, value.var = "sequence")`, of the `reshape2` package. The result will be slightly different than the expected but to be honest it is a great thing because your expected output it is not only obscure but also non sense. – SabDeM Jan 26 '16 at 22:51
  • 2
    If you want to use `reshape`, you will probably need to add an index, for instance `reshape(transform(df, indx = ave(as.character(id), id, FUN = seq)), idvar = "indx", timevar = "id", direction = "wide")` – David Arenburg Jan 26 '16 at 23:03
  • I should note that I am stuck with the output format that I would like. The program that I am trying to prepare the data for input takes the data in this way... – Adam Rabalski Jan 26 '16 at 23:03
  • @SabDeM I tried using `dcast` in this case but the result is a data frame with 128 columns (id) and 5233 rows (sequence) – Adam Rabalski Jan 26 '16 at 23:11
  • @DavidArenburg thanks David that did the trick! the next step would be to remove "sequence." from the beginning of the each id. is there a way that i can incorporate this into the reshape function or will i have to use regex in order to remove this from the column header? thanks again. – Adam Rabalski Jan 26 '16 at 23:22
  • You have to manually rename after the reshaping. – A5C1D2H2I1M1N2O1R2T1 Jan 27 '16 at 02:38
  • 1
    To remove sequence in all column names: `names(reshapedf) <- gsub("sequence.", "", names(reshapedf))` – Parfait Jan 27 '16 at 03:03
  • What do you intend on doing with this? A list might be more useful for further analysis. – JeremyS Jan 27 '16 at 04:05
  • @JeremyS I will have a wide format data frame with a list of potential sequences that match an enzyme...this is the format for a database file that is used for a gene-set enrichment analysis type application and that's what I have to work with. – Adam Rabalski Jan 27 '16 at 04:34

0 Answers0