-1

I asked a similar question about this at R - generate dynamic number of columns and substring column values but the details in my question have now changed and so I am reposting as I would require a different solution.

I have attached a picture which illustrates my starting dataset and the end point that I am trying to achieve. I need an R solution, which uses base R as the platform I will be using cannot utilise other packages.

The original data set has multiple columns. For some of the columns i.e. L1, L2, L3, I want to;

1) Generate a dynamic number of columns based on the maximum length of any string in the column e.g. L1 max length = 6, therefore 6 new columns each labelled 'L1_1' to 'L1_6'

2) Separate the original string into substrings, each containing 3 characters starting from the left. the penultimate column will contain 2 characters, the final column will contain 1 character. (different to original question)

3) perform a calculation on these substrings i.e. (number of 'a' * 1) + (number of 'b' * 3) + (number of 'c'*7) and return the value of this calculation in the new column.

Does anybody have any ideas about how to do this?

Thanks in advance.

dput(original_data):
    structure(list(ID = 1:5, L1 = structure(c(3L, 2L, 4L, 1L, 5L), .Label = c("",                                                                          "AAAAAA", "AABBCC", "BBACB", "BCBDAB"), class = "factor"), L2 = structure(c(3L,                                                                        
4L, 3L, 1L, 2L), .Label = c("", "ACAA", "BACA", "BACBA"), class = "factor"),                                                                           L3 = structure(c(1L, 3L, 2L, 1L, 4L), .Label = c("", "CABAC",                                                                                     "CACCC", "CBABA"), class = "factor")), .Names = c("ID", "L1",                                                                                      
"L2", "L3"), class = "data.frame", row.names = c(NA, -5L))   

dput(interim_data):
structure(list(ID = 1:5, L1 = structure(c(3L, 2L, 4L, 1L, 5L), .Label = c("",                                                                          
"AAAAAA", "AABBCC", "BBACB", "BCBDAB"), class = "factor"), L2 = structure(c(3L,                                                                        
4L, 3L, 1L, 2L), .Label = c("", "ACAA", "BACA", "BACBA"), class = "factor"),                                                                           
    L3 = structure(c(1L, 3L, 2L, 1L, 4L), .Label = c("", "CABAC",                                                                                      
    "CACCC", "CBABA"), class = "factor"), L1_1 = structure(c(3L,                                                                                       
    2L, 4L, 1L, 5L), .Label = c("", "AAA", "AAB", "BBA", "BCB"                                                                                         
    ), class = "factor"), L1_2 = structure(c(3L, 2L, 4L, 1L,                                                                                           
    5L), .Label = c("", "AAA", "ABB", "BAC", "CBD"), class = "factor"),                                                                                
    L1_3 = structure(c(4L, 2L, 3L, 1L, 5L), .Label = c("", "AAA",                                                                                      
    "ACB", "BBC", "BDA"), class = "factor"), L1_4 = structure(c(3L,                                                                                    
    2L, 4L, 1L, 5L), .Label = c("", "AAA", "BCC", "CB", "DAB"                                                                                          
    ), class = "factor"), L1_5 = structure(c(5L, 2L, 4L, 1L,                                                                                           
    3L), .Label = c("", "AA", "AB", "B", "CC"), class = "factor"),                                                                                     
    L1_6 = structure(c(4L, 2L, 1L, 1L, 3L), .Label = c("", "A",                                                                                        
    "B", "C"), class = "factor"), L2_1 = structure(c(3L, 3L,                                                                                           
    3L, 1L, 2L), .Label = c("", "ACA", "BAC"), class = "factor"),                                                                                      
    L2_2 = structure(c(2L, 3L, 2L, 1L, 4L), .Label = c("", "ACA",                                                                                      
    "ACB", "CAA"), class = "factor"), L2_3 = structure(c(3L,                                                                                           
    4L, 3L, 1L, 2L), .Label = c("", "AA", "AC", "CBA"), class = "factor"),                                                                             
    L2_4 = structure(c(2L, 3L, 2L, 1L, 2L), .Label = c("", "A",                                                                                        
    "BA"), class = "factor"), L2_5 = structure(c(1L, 2L, 1L,                                                                                           
    1L, 1L), .Label = c("", "A"), class = "factor"), L3_1 = structure(c(1L,                                                                            
    3L, 2L, 1L, 4L), .Label = c("", "CAB", "CAC", "CBA"), class = "factor"),                                                                           
    L3_2 = structure(c(1L, 3L, 2L, 1L, 4L), .Label = c("", "ABA",                                                                                      
    "ACC", "BAB"), class = "factor"), L3_3 = structure(c(1L,                                                                                           
    4L, 3L, 1L, 2L), .Label = c("", "ABA", "BAC", "CCC"), class = "factor"),                                                                           
    L3_4 = structure(c(1L, 4L, 2L, 1L, 3L), .Label = c("", "AC",                                                                                       
    "BA", "CC"), class = "factor"), L3_5 = structure(c(1L, 3L,                                                                                         
    3L, 1L, 2L), .Label = c("", "A", "C"), class = "factor")), .Names = c("ID",                                                                        
"L1", "L2", "L3", "L1_1", "L1_2", "L1_3", "L1_4", "L1_5", "L1_6",                                                                                      
"L2_1", "L2_2", "L2_3", "L2_4", "L2_5", "L3_1", "L3_2", "L3_3",                                                                                        
"L3_4", "L3_5"), class = "data.frame", row.names = c(NA, -5L))  

Example

EDIT: Code provided by @Onyambu;

interim=sapply(df, as.character)
interim[,1]=as.numeric(interim[,1]
funfun = function(u){
  if(is.numeric(u)) return(u)
  s = unique(unlist(strsplit(u,"")))
  w = sapply(s,function(x)length(unlist(gregexpr(x,u))))
 ifelse(length(s)>0,sum(w["A"]*1,w["B"]* 3,w["C"]*7,na.rm = T),NA)
}

ADD_char=function(x) mapply(funfun,x)
sapply(interim,ADD_char)
dat1 <- cbind(interim[,1:4],sapply(interim[,-(1:4)],ADD_char))

Results in parse error

Ash_23S
  • 115
  • 1
  • 11
  • Posting picture of data really isn't that helpful. See how to share a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – MrFlick Aug 03 '17 at 17:39
  • apologies, I've included dput(data) now – Ash_23S Aug 03 '17 at 17:44
  • Do you really need the interrim data or do you just need the final column? – ekstroem Aug 03 '17 at 21:21
  • thanks, just the final column, the interim data table is just to show how the strings should be separated.. – Ash_23S Aug 03 '17 at 21:34

2 Answers2

0

First before trying this code, ensure you check the classes of your columns: sapply(Interim,class) If the code above gives you "factor" instead of "character" then you have to change your dataframe to be in characters instead of factors. simply byinterim=data.frame(t(t(interim)),stringsAsFactors = F) or by interim=sapply(Interim, as.character) should do the job. then change the ID column to numeric.ie interim[,1]=as.numeric(interim[,1]). After you have ensured that the data is now in characters, you can run the following code:

funfun = function(u){
  if(is.numeric(u)) return(u)
  s = unique(unlist(strsplit(u,"")))
  w = sapply(s,function(x)length(unlist(gregexpr(x,u))))
 ifelse(length(s)>0,sum(w["A"]*1,w["B"]* 3,w["C"]*7,na.rm = T),NA)
}

 ADD_char=function(x) mapply(funfun,x)

 sapply(Interim,ADD_char)
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0

First before trying this code, ensure you check the classes of your columns: sapply(Interim,class) If the code above gives you "factor" instead of "character" then you have to change your dataframe to be in characters instead of factors. simply byinterim=data.frame(t(t(interim)),stringsAsFactors = F) or by interim=sapply(Interim, as.character) should do the job. then change the ID column to numeric.ie interim[,1]=as.numeric(interim[,1]). After you have ensured that the data is now in characters, you can run the following code:

funfun = function(u){
  if(is.numeric(u)) return(u)
  s = unique(unlist(strsplit(u,"")))
  w = sapply(s,function(x)length(unlist(gregexpr(x,u))))
 ifelse(length(s)>0,sum(w["A"]*1,w["B"]* 3,w["C"]*7,na.rm = T),NA)
}

 ADD_char=function(x) mapply(funfun,x)


 funfun1=function(u){
   if(nchar(u)<2) return(u)
   a = unlist(strsplit(u,""))
   m = length(a)
   if(m>0)
   sapply(1:m, function(i)paste0(a[i:(ifelse(i+2<m,i+2,m))],collapse = ""))
 }  

 funfun2=function(data){
   char_split = function(x) mapply(funfun1,x)
   s = lapply(apply(data,1,char_split),unlist)
   nam = lapply(s,names)
   slen = sapply(nam,length)
   ans=`names<-`(do.call(rbind.data.frame,
              lapply(s,function(i){length(i)=max(slen);i})),
                 nam[[which.max(slen)]])
   ans=data.frame(t(t(ans)),stringsAsFactors=FALSE)
   fn=sapply(ans,function(j) ifelse(is.na(j), "",j))
  as.data.frame(fn,stringsAsFactors=FALSE)
 }

k=funfun2(interim[,1:4])
mapply(class,k)
k[,1]=as.numeric(k[,1])
sapply(k,ADD_char)

     ID L11 L12 L13 L14 L15 L16 L21 L22 L23 L24 L25 L31 L32 L33 L34 L35
[1,]  1   5   7  13  17  14   7  11   9   8   1  NA  NA  NA  NA  NA  NA
[2,]  2   3   3   3   3   2   1  11  11  11   4   1  15  15  21  14   7
[3,]  3   7  11  11  10   3  11   9   8   1  11   5  11   8   7  NA  NA
[4,]  4  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
[5,]  5  13  10   4   4   4   3   9   9   2   1  11   7   5   4   1  NA
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • thanks! I've tried the suggestion and get this error: ' 'Error in parse(text = script) : parse error in text argument: parse error at 'funfun' (line 3, characters 1 through 6)'.' I've put the code in the edit of the qu – Ash_23S Aug 04 '17 at 13:45
  • I have posted the results that I do get. On using the function above. I cant really tell why you get the error. But is seems there must be a problem somewhere. Is interim data your whole data?. Let me check and see whether I can be able to help. Thank you – Onyambu Aug 04 '17 at 14:04
  • thanks, original data is my whole data, the interim data is just to show how they should be substring. I have shown this in the picture – Ash_23S Aug 04 '17 at 14:05
  • Can you try running `ADD_char(interim[,2])` what do you get?? – Onyambu Aug 04 '17 at 14:06
  • ohh okay so your data has not yet been subset?According to the `interim` data you presented above, there was no need to subset it since it was already subset. I thought you have already done the subsetting. Let me alsotry and include the subbsetting in the function. – Onyambu Aug 04 '17 at 14:09
  • does the code above run on your interim data?? Does it produce the required results? or it still has an error? – Onyambu Aug 04 '17 at 14:41
  • thanks, no I still get a parse error at the same line – Ash_23S Aug 04 '17 at 15:05
  • What of `ADD_char (interim[,2])`. Try running this alone and tell me whether you still get an error. For this line alone. – Onyambu Aug 04 '17 at 15:58
  • or letting `u="AABBC"` can you run the variables `s ` and `w` in the function `funfun` line by line and see whether you get results? and run the `ifelse` statement to see whether you get 13??.. I have the split code. But it wont be of any help if This cannot work. – Onyambu Aug 04 '17 at 16:08
  • thanks for your help, I think I'm going to go about this a different way but I have incorporated your function and got it to work. – Ash_23S Aug 04 '17 at 21:39