2

So I have this kind of data:

a <- data.table("a"=c("1    42","     84","2    10","3    12","     24"))
        a
1: 1    42
2:      84
3: 2    10
4: 3    12
5:      24

The data is a single sorted vector consisting of strings with IDs 1, 2 and 3 and data on the IDs.

For further processing I need to fill in the ID numbers in the missing places. The number of spaces between ID and the data, as well as the number of spaces before the data when there is no ID is fixed. In the example there are 4 spaces between ID and the data, and 5 spaces if there is no ID. The result should look like this:

a <- data.table("a"=c("1    42","1    84","2    10","3    12","3    24"))
        a
1: 1    42
2: 1    84
3: 2    10
4: 3    12
5: 3    24

I have solved this in Excel by copying the first cell A1 into B1 and using this formula in B2: =IF(LEFT(A2,5)=" ",LEFT(B1,5)&A2,A2). I am aware that this adds additional spaces, but that doesn't matter for the further processing of the data.

I am struggling to find a solution in R, but I found this and this on how to refer to a previous value in a data.table. However, I'm stuck on how to use those commands to check the if-condition and paste the strings together.

Community
  • 1
  • 1
iraserd
  • 669
  • 1
  • 8
  • 26
  • 2
    You have a single column in the input dataset. I think it is better to have it in two columns in the output – akrun Jun 16 '15 at 11:12
  • 1
    Also, in your Excel example you also have `B` column as opposed to the example above. – David Arenburg Jun 16 '15 at 11:13
  • @akrun, Well, that is what I did in Excel, but I don't know how to start in R. I got `a[, B := c(NA, a[seq_len(.N-1)])][]` that gives me column `B` with the previous values of `a`. But I feel that is not the right starting point. – iraserd Jun 16 '15 at 11:13
  • Try `a[, tstrsplit(a, '\\s+')][, V1:=cumsum(V1!='')]` – akrun Jun 16 '15 at 11:14

1 Answers1

4

Try

library(data.table)#v1.9.5+
a[, c("a", "b") := tstrsplit(a, "\\s+")][, a:=a[a!=''], cumsum(a!='')]
#    a  b
#1:  1 42
#2:  1 84
#3:  2 10
#4:  3 12
#5:  3 24

Or a modification suggested by @David Arenburg

 a[, c("a", "b") := tstrsplit(a, "\\s+", type.convert = TRUE)][,
                 a := cumsum(!is.na(a))]

Or using na.locf and base R

 library(zoo)
 m1 <- do.call(rbind,lapply(strsplit(a$a, '\\s+'), as.numeric))
 m1[,1] <- na.locf(m1[,1])

Or

 m1[,1] <- ave(m1[,1], cumsum(!is.na(m1[,1])), FUN=function(x) x[!is.na(x)])

 do.call(paste, as.data.frame(m1))
 #[1] "1 42" "1 84" "2 10" "3 12" "3 24"
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    I was thinking the same, but maybe it's better to add `type.convert` and columns names, something like `a[, c("a", "b") := tstrsplit(a, "\\s+", type.convert = TRUE)][, a := cumsum(!is.na(a))]`. Also, I'm wondering it `cumsum` will always work... – David Arenburg Jun 16 '15 at 11:18
  • Well, I need the result to be a character vector again for the rest of the program, but I think I could just concatenate the columns back together. It seems my main problem at the moment is that I cannot get an updated version of R and data.table (still at 3.0.2 and 1.8.10). – iraserd Jun 16 '15 at 11:21
  • @DavidArenburg Thanks, the only reason I used `cumsum` to act as a grouping variable is because of the doubt you mentioned. If the values are random, it could be a problem. – akrun Jun 16 '15 at 11:21
  • btw, your solution isn't modifying the data in place. If you type `a` afterwords, you are back with the original data set. – David Arenburg Jun 16 '15 at 11:22
  • @Iraserd The data.table version you are using is old and R also needs to be upgraded. Do you need a base R solution? – akrun Jun 16 '15 at 11:29
  • @akrun I would appriciate a base R / old data.table solution. But only if it's not a lot of additional work for you. I requested an update for R for when the new update is out on Thursday. – iraserd Jun 16 '15 at 11:31
  • @Iraserd The reason I suggested a base R is I am not sure which functions won't work in your version as there were lots of improvements in data.table – akrun Jun 16 '15 at 11:32
  • @Iraserd I updated with another option. Please check if that works for you – akrun Jun 16 '15 at 11:39
  • @akrun Yes, I managed to adapt your alternative to my full data. As there is lots of additional data with spaces to the right of the data values I gave in the example, I had to insert a custom delimiter after the first 5 characters and `strsplit` on that. `a <- tbl a <- sub("(.{5})","\\1Æ",a$a) m1 <- do.call(rbind,lapply(strsplit(a, 'Æ'), as.character)) m1[,1] <- sub("\\s{5}",NA,m1[,1]) m1[,1] <- na.locf(m1[,1]) b <-do.call(paste, as.data.frame(m1)) ` – iraserd Jun 16 '15 at 12:13
  • 1
    @akrun Yes, thanks a lot! Now I have a full R script that I can run and no longer need to do that step manually in Excel. Comparing both ways with `identical()` is TRUE. As soon as I get my R updated, I'll implement the `data.table` routine of course. – iraserd Jun 16 '15 at 12:20