0

I currently have a table set up in a basket format so that an irregular amount of data is associated with each row of the table. Such as:

01,item1,item2,item3
02,item1,item2,
03,item1,item2,item3,item4
04,item1

However, I need to change it to a normalized transactional format with only one item on each row. Such as:

01,item1
01,item2
01,item3
02,item1
02,item2
03,item1

...and so on. Is there an easy automated or programmatic way to do this? The data is currently in a MySQL database that I can export in a variety of file types, and I also have access to RStudio, and Microsoft Excel to try to do this. All the transactional resources I could find for RStudio assume that the data was already in the second format, which is what I'm trying to get to.

blahdiblah
  • 33,069
  • 21
  • 98
  • 152
  • 1
    Have you searched SO for this? Pretty sure this is a duplicate. http://stackoverflow.com/questions/13773770/split-comma-separated-column-into-separate-rows and http://stackoverflow.com/questions/8464312/r-converting-comma-separated-entry-to-columns and http://stackoverflow.com/questions/5413842/extracting-values-after-pattern/5426837#5426837 AND I will bet a bunch more. – IRTFM Feb 11 '13 at 20:15
  • Neither of those are my issue and I did search for it. – user1424324 Feb 11 '13 at 20:21
  • 1
    All you need to do is paste( . , . sep=",") to the results of those operations in the other citations. – IRTFM Feb 11 '13 at 20:23
  • 1
    If neither of those questions describe your problem, then I don't think you've fully described the issue you're facing. Can you provide a specific example data frame like the one you're trying to reshape? – joran Feb 11 '13 at 20:25

2 Answers2

1

I am assuming I understand the way your data set will look once you read it into R, i.e., it will be a rectangular data frame where NA's are filled in to make the rows the same length. So this should solve the problem:

#Create your dataset (this step is not for you)
row1 = c("01","item1","item2","item3",NA)
row2 = c("02","item1","item2",NA,NA)
row3 = c("03","item1","item2","item3","item4")
row4 = c("04","item1",NA,NA,NA)

Data = rbind(row1,row2,row3,row4)

#Now do the reconstruction (this step is for you)
col1 = NULL
col2 = NULL

for(i in 1:nrow(Data)){
    col1 = c(col1,rep(Data[i],ncol(Data)-1))
    col2 = c(col2,Data[i,-1])
}

NewData = cbind(col1,col2)[!is.na(col2),]

So, what you get is the following

> Data
     [,1] [,2]    [,3]    [,4]    [,5]   
row1 "01" "item1" "item2" "item3" NA     
row2 "02" "item1" "item2" NA      NA     
row3 "03" "item1" "item2" "item3" "item4"
row4 "04" "item1" NA      NA      NA     
> 
> NewData
      col1 col2   
 [1,] "01" "item1"
 [2,] "01" "item2"
 [3,] "01" "item3"
 [4,] "02" "item1"
 [5,] "02" "item2"
 [6,] "03" "item1"
 [7,] "03" "item2"
 [8,] "03" "item3"
 [9,] "03" "item4"
[10,] "04" "item1"

So hopefully that helps.

0

This question is really similar to this one. As @DWin mention in his comment you need to apply paste( . , . sep=",") to a splitted list.

xx <- read.table(text ='01,item1,item2,item3
02,item1,item2,
03,item1,item2,item3,item4
04,item1')

unlist(lapply(strsplit(as.character(xx$V1), ','), 
       function(x) paste(x[1],x[-1],sep=',')))

[1] "01,item1" "01,item2" "01,item3" "02,item1" "02,item2" "03,item1" "03,item2" "03,item3" "03,item4" "04,item1"
Community
  • 1
  • 1
agstudy
  • 119,832
  • 17
  • 199
  • 261