I have checked out the following questions on StackOverFlow and links to other R help books:
R: Reshape Data Long to Wide - understanding reshape parameters
Reshape long to wide with multiple groupings
How to reshape data from long to wide format?
http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
I want to take my data which is in long format (two columns, with 6200 entries) with non-unique names in column "id" with different sequences in "sequence" and reshape into a wide format where the column headers are now the "id" with all of the sequences from "sequence" listed underneath each id.
id sequence
1 CK1alpha TPSIAsDISLP
2 CK1alpha IASDIsLPIAT
3 CDK1 SVPSSsPGTSV
4 CDK1 EGCQGsPQRRG
5 CK1alpha DICEDsDIDGD
6 PKCepsilon IHGSDsVKSAE
What I would like to get:
id CK1alpha CDK1 PKCepsilon
sequence TPSIAsDISLP SVPSSsPGTSV
IASDIsLPIAT EGCQGsPQRRG
DICEDsDIDGD
I have tried to use reshape
kinase_sub_wide <- reshape(kinase_substrate, idvar = "id", timevar = "sequence", direction = "wide")
However I get warning messages saying multiple rows match:
Warning messages:
1: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=TPSIAsDISLP: first taken
2: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=IASDIsLPIAT: first taken
3: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=RSQSRsNSPLP: first taken
I have also tried to use spread
kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
but get errors with duplicate identifiers:
> kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
Error: Duplicate identifiers for rows (1812, 1813, 4469), (906, 3349), (253, 285, 2114, 2174, 3022, 4385, 4501), (155, 203, 218, 261, 316, 542, 682, 1021, 1123, 1238, 1492, 1919, 1938, 1997, 2064, 2139, 2323, 2387, 2597, 2826, 3058, 3377, 3899, 4024, 4135, 4241, 4314, 4617, 4733, 5055, 5289, 5467, 5726, 5952, 6165), (72, 272, 749, 1100, 2792, 3573, 3858, 4254, 4257), (209, 548, 637, 653, 1034, 1038, 1213, 1387, 1445, 1475, 1476, 1692, 1735, 2635, 3180, 4005, 4661, 4988, 5672, 5870, 6042), (21, 1802), (23, 24, 30, 49, 60, 86, 122, 127, 137, 177, 182, 227, 250, 260, 268, 270, 299, 347, 356, 361, 400, 424, 425, 448, 483, 488, 494, 509, 510, 512, 522, 523, 524, 540, 559, 572, 612, 614, 616, 622, 720, 750, 774, 794, 816, 820, 829, 866, 868, 912, 916, 918, 940, 946, 955, 962, 984, 992, 1004, 1013, 1054, 1055, 1070, 1073, 1083, 1086, 1105, 1140, 1154, 1164, 1179, 1222, 1228, 1230, 1284, 1295, 1316, 1318, 1333, 1334, 1348, 1356, 1375, 1383, 1389, 1390, 1406, 1421, 1444, 1458, 1473, 1474, 1490
How can I use either of the functions above in order to get the data into wide format and bring each sequence that it corresponds under the column of the id?
Thanks in advance.
EDIT #1
Using the suggestion to include an index from David's comment gets me there
reshape(transform(df, indx = ave(as.character(id), id, FUN = seq)), idvar = "indx", timevar = "id", direction = "wide")
resulting in this:
indx sequence.CK1alpha sequence.CDK1 sequence.PKCepsilon sequence.GRK2 sequence.ICK sequence.CDK5 sequence.PKCbeta sequence.PAK1 sequence.GSK3beta
1 1 TPSIAsDISLP SVPSSsPGTSV IHGSDsVKSAE DIDESsPGTEW VDRLQsEPESI AQAPSsPRVTE GAQAPsSPRVT AQERPsQAAPA NIDNLsPKASH
2 2 IASDIsLPIAT EGCQGsPQRRG KLSGLsFKRNR EKKEEsEESDD DNRVPsPPPTG PAEVKsPEKAK DESTGsIAKRL RSRTPsASNDD FNYNPsPRKSS
5 3 DICEDsDIDGD TLNSGsPEKTC TALAPsTMKIK EESEEsDDDMG PDTKDsPVCPH QKPAAsPRPRR IVENLsSRCSW KQKVDsLLENL SSGAKsPSKSG
7 4 TFEDLsDVEGG HVAVSsPTPET VAKRLsLTMGG MNSSIsSGSGS LKVEGsPTEEA DFTCGsPTAAG YPVSPsDKVLI RALRAsESGI_ FPDDLsLDHSD
16 5 PRSGRsPTGNT TEVPRsPKHAH EKLVLsKLYEE RPTSIsWDGLD ESERGsGSQSS SDTVTsPQRAG EKKVVsLNGEL PGSPLsSQPVL YSDSIsPFNKS
29 6 MSDTGsPGMQR KYSPTsPTYSP EILNRsPRNRK KNRPTsISWDG <NA> GRGAEsPFEEK LVNSAsAQKRS SSKTAsLPGYG PSRTAsFSESR
EDIT #2
Is there are way in the reshape function to avoid putting in "sequence." in front of each name? Or will I have to turn to a regex to rename all of the column names?
EDIT #3
Used gsub
to remove "sequence."
from the column names and assigned it to a variable:
new_col_names <- names(DF) <- gsub("sequence.", "", names(DF))
followed by applying the new_col_names
to the data frame
colnames(DF) <- new_col_names
Thanks for helping me out everyone!