Updated answer (21 Oct 2013)
These steps can be combined if you use concat.split.multiple
from my "splitstackshape" package. This function makes use of count.fields
automatically, so it shouldn't suffer from the problem mentioned in the comments:
library(splitstackshape)
out <- concat.split.multiple(mydf, "Ids", seps=",", "long")
head(out)
# v1 v2 v3 v4 v5 time Ids
# 1 9186 639 9045 316 28396 1 548|14721
# 2 9041 598 8897 283 28054 1 548|14721
# 3 8799 588 8669 246 27433 1 548|14722
# 4 8914 614 8765 273 27347 1 548|14721
# 5 4113 497 4050 270 27267 1 716|18316
# 6 8829 589 8713 254 25270 1 548|14722
tail(out)
# v1 v2 v3 v4 v5 time Ids
# 31 9186 639 9045 316 28396 6 752|18716
# 32 9041 598 8897 283 28054 6 752|18716
# 33 8799 588 8669 246 27433 6 752|18716
# 34 8914 614 8765 273 27347 6 752|18715
# 35 4113 497 4050 270 27267 6 <NA>
# 36 8829 589 8713 254 25270 6 752|18716
Original Answer (27 Feb 2013)
You need to do a combination of splitting the "Ids" strings followed by "reshaping" your data (if I understand your goal correctly).
Instead of splitting the way you did, I've gone ahead and taken advantage of read.csv
and the argument fill = TRUE
. Note that if the values in your "Ids" column are currently factors, you'll need to use text = as.character(mydf$Ids)
instead.
Here's how I'd proceed:
mydf2 <- cbind(read.csv(text = mydf$Ids, fill = TRUE, header = FALSE), mydf[-1])
mydf2
# V1 V2 V3 V4 V5 V6 v1 v2 v3 v4 v5
# 1 548|14721 678|17604 716|18316 732|18505 745|18626 752|18716 9186 639 9045 316 28396
# 2 548|14721 678|17603 716|18316 732|18507 745|18626 752|18716 9041 598 8897 283 28054
# 3 548|14722 678|17603 716|18316 732|18507 745|18626 752|18716 8799 588 8669 246 27433
# 4 548|14721 678|17603 716|18316 732|18505 745|18626 752|18715 8914 614 8765 273 27347
# 5 716|18316 745|18626 4113 497 4050 270 27267
# 6 548|14722 678|17604 716|18316 732|18507 745|18626 752|18716 8829 589 8713 254 25270
That data is currently in a "wide" format. Let's make it into a "long" format. The reshape
function needs a little bit of information on how to proceed. In particular, it needs to know:
- Which columns are indicative of "id" variables. Counter-intuitively to the names from your dataset, these are not the values from your original "Ids" variable, but the other variables that were present. ("v1" - "v5" (lower-case "v") at positions 7 through 11 in the "mydf2" dataset). Obviously, for your actual data, you need to specify the actual columns that are to be treated as
idvar
s.
- Which columns "vary" and need to be "stacked" in the "long" format. In this case, it's the new variables that were created when we used
read.csv
, and by manually checking their indexes, we can see that they are in positions 1 through 6. Obviously, you would need to specify the column numbers from your actual dataset.
For the example dataset you provided, we would use reshape
as follows:
mydf3 <- reshape(mydf2, direction = "long", idvar=7:ncol(mydf2),
varying=1:6, sep = "")
rownames(mydf3) <- NULL
Here's the head
and tail
of the resulting data.frame
:
> head(mydf3)
v1 v2 v3 v4 v5 time V
1 9186 639 9045 316 28396 1 548|14721
2 9041 598 8897 283 28054 1 548|14721
3 8799 588 8669 246 27433 1 548|14722
4 8914 614 8765 273 27347 1 548|14721
5 4113 497 4050 270 27267 1 716|18316
6 8829 589 8713 254 25270 1 548|14722
> tail(mydf3)
v1 v2 v3 v4 v5 time V
31 9186 639 9045 316 28396 6 752|18716
32 9041 598 8897 283 28054 6 752|18716
33 8799 588 8669 246 27433 6 752|18716
34 8914 614 8765 273 27347 6 752|18715
35 4113 497 4050 270 27267 6
36 8829 589 8713 254 25270 6 752|18716
This is all assuming we are starting with an object named "mydf" that looks like this:
mydf <- structure(list(Ids = c("548|14721,678|17604,716|18316,732|18505,745|18626,752|18716",
"548|14721,678|17603,716|18316,732|18507,745|18626,752|18716",
"548|14722,678|17603,716|18316,732|18507,745|18626,752|18716",
"548|14721,678|17603,716|18316,732|18505,745|18626,752|18715",
"716|18316,745|18626", "548|14722,678|17604,716|18316,732|18507,745|18626,752|18716"
), v1 = c(9186L, 9041L, 8799L, 8914L, 4113L, 8829L), v2 = c(639L,
598L, 588L, 614L, 497L, 589L), v3 = c(9045L, 8897L, 8669L, 8765L,
4050L, 8713L), v4 = c(316L, 283L, 246L, 273L, 270L, 254L), v5 = c(28396L,
28054L, 27433L, 27347L, 27267L, 25270L)), .Names = c("Ids", "v1",
"v2", "v3", "v4", "v5"), class = "data.frame", row.names = c(NA,
-6L))