2

I have some data in a form similar to this:

xmpl <- data.frame(x = c("022406391116","034506611298", "015410661242"))
xmpl

    X
  1 022406391116
  2 034506611298
  3 015410661242

Every value is made up of pairs of numbers (two digit each):
Item number, item value, item number, item value.

So for the first line in the example, I have value 24 for item #2, value 39 for item #6, value 16 for item #11. in line two I have item #3 with value 45 etc'. In the example maximum item number is 12.

I want to "unfold" the data so I have a new column for each item number that comes up, with it's value in the corresponding line. In the example it should look like this:

  X             item1  item2  item3  item6  item11  item12
1 022406391116    NA     24     NA     39     16      NA
2 034506611298    NA     NA     45     61     NA      98
3 015411161242    54     NA     NA     NA     16      42

To get to that I tried to use a double loop:

for (nq in c(0,1,2)) {
  for (qs in 1:12) {
    if (as.numeric(substr(xmpl$x, 4 * nq + 1, 4 * nq + 2)) == qs) {
      xmpl[[paste0("item", qs)]] <- as.numeric(substr(xmpl$x, 4 * nq + 3, 4 * nq + 4))
    }
  }
}

I get this warning for every time the if is run in the loop:

In if (as.numeric(substr(xmpl$x, 4 * nq + 1, 4 * nq + ... : the condition has length > 1 and only the first element will be used

And sure enough the (bad) result is:

> xmpl
             x item2 item6 item11
1 022406391116    24    39     16
2 034506611298    45    61     98
3 015410661242    54    66     42

New columns are created only for the first line, while the rest of the values are interpreted accurately but only put in the existing columns, defined for the first line.

How can I get this to work on each line separately? Or if it can't be done this way (please explain why) - what is a better strategy?

EDIT: Just to clarify - I already have this working, but only through a longish process (splitting, reshaping to long and back to wide). This loop is my attempt at shortening the process, and I need help in understanding why the loop won't work.

eli-k
  • 10,898
  • 11
  • 40
  • 44

6 Answers6

2

Here is a flavour you could also consider:

library(magrittr) # for %>% which I use just for readability
library(data.table) # for dcast()

xmplsp <- gsub("(\\d{2})", "\\1 ", xmpl$x) %>% strsplit(" ")
xmpl2 <- data.frame(
  x       = rep(xmpl$x, each = 3),
  item_no = lapply(xmplsp, function(x) x[c(1,3,5)]) %>% unlist(),
  value   = lapply(xmplsp, function(x) x[-c(1,3,5)]) %>% unlist() %>% as.integer()
)
xmpl2
             x item_no value
1 022406391116      02    24
2 022406391116      06    39
3 022406391116      11    16
4 034506611298      03    45
5 034506611298      06    61
6 034506611298      12    98
7 015410661242      01    54
8 015410661242      10    66
9 015410661242      12    42

dcast(xmpl2, x ~ paste0("item", item_no))

             x item01 item02 item03 item06 item10 item11 item12
1 015410661242     54     NA     NA     NA     66     NA     42
2 022406391116     NA     24     NA     39     NA     16     NA
3 034506611298     NA     NA     45     61     NA     NA     98

So the logic builds on strsplit() instead of substr(), but I first used gsub() to add spaces between values.

s_baldur
  • 29,441
  • 4
  • 36
  • 69
  • Apparently others have thought deeply about how the first step can be done: https://stackoverflow.com/questions/2247045/chopping-a-string-into-a-vector-of-fixed-width-character-elements – s_baldur Jul 27 '18 at 08:38
  • Those are some nice insights about working with the string, but my main problem is with the loop strategy. I actually have a similar restructure solution right now, but I'm trying to improve on it with the loop I described - can you help with that? – eli-k Jul 27 '18 at 10:38
  • My suggestion was to drop the looping strategy altogether. – s_baldur Jul 27 '18 at 10:40
  • Ok, this would indeed make the process somewhat shorter than my present method, but I imagined this could be done in one step like in my loop, without needing a restructure – eli-k Jul 27 '18 at 10:59
  • @eli-k Added another answer focusing on the loop – s_baldur Jul 27 '18 at 11:22
1

In base R:

xmpl <- data.frame(x = c("022406391116","034506611298", "015410661242"))

want <- do.call(rbind, lapply(strsplit(as.character(xmpl$x), ""), 
                              function(x) {
                                res <- t(matrix(unlist(x), nrow = 4))
                                items <- paste0(res[,1], res[,2])
                                values <- paste0(res[,3], res[,4])
                                id <- paste(x, collapse = "")
                                res <- data.frame(x = id, items = items,
                                                  values = as.numeric(values))
                            res
                          }))

library(reshape2)
want <- dcast(want, x ~ paste0("item", items), value.var = "values")
want

#             x item01 item02 item03 item06 item10 item11 item12
#1 022406391116     NA     24     NA     39     NA     16     NA
#2 034506611298     NA     NA     45     61     NA     NA     98
#3 015410661242     54     NA     NA     NA     66     NA     42

# modified:

xmpl <- data.frame(x = c("022406391116","034506611298", "015410661242"))

dummy <- matrix(strsplit(paste(as.character(xmpl$x), collapse = ""), "")[[1]], nrow = 4)
want <- data.frame(x = rep(as.character(xmpl$x), each = 3), 
                   items = paste0(dummy[1,], dummy[2,]),
                   values = paste0(dummy[3,], dummy[4,]))
library(reshape2)
(want <- dcast(want, x ~ paste0("item", items), value.var = "values"))

#              x item01 item02 item03 item06 item10 item11 item12
#1 015410661242     54   <NA>   <NA>   <NA>     66   <NA>     42
#2 022406391116   <NA>     24   <NA>     39   <NA>     16   <NA>
#3 034506611298   <NA>   <NA>     45     61   <NA>   <NA>     98
r.user.05apr
  • 5,356
  • 3
  • 22
  • 39
  • Thanks for the reply but please look at the question again - you have reached exactly the BAD results that I have reached, and not the wanted results.... – eli-k Jul 27 '18 at 10:42
  • Sorry, rectified now. – r.user.05apr Jul 27 '18 at 10:54
  • So as with @snoram's answer, this is a nice improvement on my present method (see edit at bottom of question) (thanks!), but I imagined this could be done in one step with a double loop without then needing a restructure - why won't that loop work? – eli-k Jul 27 '18 at 11:11
  • I think restructuring in this case has its benefits. It's possible to get rid of all loops. – r.user.05apr Jul 27 '18 at 11:38
  • Guess you may be right, being an SPSS veteran I may be stuck in my old ways - will look more into the loop avoiding idea. Thanks! – eli-k Jul 31 '18 at 07:54
1

Convoluted in some places, but this works:

require(tidyverse)
require(stringr)

xmpl <- data_frame(x = c("022406391116","034506611298", "015410661242"))

fn <- function(x, strt, end) {str_sub(x, strt, end) %>% as.integer()}

tmp <- xmpl %>% 
  mutate(
  key_1 = str_sub(x, 1,2),
  val_1 = fn(x, 3,4),
  key_2 = str_sub(x, 5,6),
  val_2 = fn(x, 7,8),
  key_3 = str_sub(x, 9,10),
  val_3 = fn(x, 11,12)
)

long <- reduce(
  .x = list(
   tmp %>% select(x, key = key_1, val = val_1),
   tmp %>% select(x, key = key_2, val = val_2),
   tmp %>% select(x, key = key_3, val = val_3)  ),
  bind_rows
) 

long %>% 
  transmute(x = x ,item = paste0("item_", key), val = val) %>% 
  spread(item, val)

# A tibble: 3 x 8 x item_01 item_02 item_03 item_06 item_10 item_11 item_12 <chr> <int> <int> <int> <int> <int> <int> <int> 1 015410661242 54 NA NA NA 66 NA 42 2 022406391116 NA 24 NA 39 NA 16 NA 3 034506611298 NA NA 45 61 NA NA 98

JonMinton
  • 1,239
  • 2
  • 8
  • 26
  • Thanks. I've edited to show the output. The above solution does show the NAs by item. Is that what you want? – JonMinton Jul 27 '18 at 15:39
1

To answer the question (1) what is wrong with the current loop and (2) how this can be done via a loop (although this is probably not an optimal solution).

(1)

if() only takes a single value not a vector so you are forcing each value into the column that the first value belongs in.

(2)

Here is an example of a loop that does the job. The logic is dealing with row-by-row and then each item_number-value pair in that row.

# Preset the vector
xmpl[1 + 1:12] <- vector(mode = "integer", length = 3)
names(xmpl) <- c(names(xmpl)[1], paste0("item", 1:12))


# Iterate through the df row by row
for (row in seq_len(nrow(xmpl))) { 
  # Iterate through each entry which has 3 item_number-value pairs
  for (pair in seq_len(3)) {
    item_number <- as.integer(
      substr(xmpl[["x"]][row], 4 * (pair - 1) + 1, 4 * (pair - 1) + 2)
    )
    value <- as.integer(
      substr(xmpl[["x"]][row], 4 * (pair - 1) + 3, 4 * (pair - 1) + 4)
    )
    xmpl[row, paste0("item", item_number)] <- value 
  }
}
xmpl
             x item1 item2 item3 item4 item5 item6 item7 item8 item9 item10 item11 item12
1 022406391116     0    24     0     0     0    39     0     0     0      0     16      0
2 034506611298     0     0    45     0     0    61     0     0     0      0      0     98
3 015410661242    54     0     0     0     0     0     0     0     0     66      0     42
s_baldur
  • 29,441
  • 4
  • 36
  • 69
  • Thanks for both answers! – eli-k Jul 31 '18 at 08:08
  • I first selected this answer as it did exactly as I required in the question. But now I understand better the suggestion to avoid the loop, and for future readers' benefit I will change my selection to the best non-loop answer. – eli-k Sep 07 '18 at 09:12
1

Why another answer?

The OP has stated:

I already have this working, but only through a longish process (splitting, reshaping to long and back to wide). This loop is my attempt at shortening the process [...]

If "longish" and "shortening the process" refer to run times, then the approach below is much faster and less memory consuming than the loop approach which is verified by a benchmark.

Reshaping with tstrsplit(), melt(), dcast()

xmpl <- data.frame(x = c("022406391116","034506611298", "015410661242"))

library(data.table)
library(magrittr)
setDT(xmpl) %>% 
  .[, c(tstrsplit(x, "(?<=[0-9]{2})", perl = TRUE, names = TRUE, type.convert = TRUE), 
        .(x = x))] %>% 
  melt(id.var = "x", measure.vars = list(seq(1, ncol(.) - 1, 2), seq(2, ncol(.) - 1, 2)), 
       value.name = c("item", "val")) %>% 
  dcast(x ~ sprintf("Item%02i", item), value.var = "val")
              x Item01 Item02 Item03 Item06 Item10 Item11 Item12
1: 015410661242     54     NA     NA     NA     66     NA     42
2: 022406391116     NA     24     NA     39     NA     16     NA
3: 034506611298     NA     NA     45     61     NA     NA     98
  • tstrsplit() splits after each 2 digits using a regular expression with lookbehind, then transposes the result to create columns which are converted to integer.
  • melt() reshapes two measure variables simultaneously from wide to long form. The odd numbered columns are the items, the even numbered columns are the values.
  • Finally, dcast() is used to reshape back to wide form. The new column names are created using sprintf() to ensure that the column numbers below 10 have a leading 0 to ensure proper column order.

Benchmark

For benchmarking, the given dataset is too small. So I have created dummy data for a varying range of parameters:

  • The number of pairs which determines the length of x can vary from 3 to 10.
  • The number of rows varies from 10 to 1000.

I have tested separately (not shown here) that the maximum number of items has less impact on the benchmark timings, so it is fixed at 15.

Most of the codes posted so far had the parameters hardcoded and could not be modified to work with other parameters. So, three different approaches are included:

The codes were slightely modified to deal with varying parameters.

library(bench)
bm <- press(
  n_pair = c(3, 5, 10),
  n_row = 10^(1:3),
  {
    set.seed(1)
    max_items <- 15L
    xmpl0 <- 
      sapply(seq_len(n_row), function(x) {
        sprintf("%02i%02i", 
                sample(max_items, n_pair, FALSE),
                sample(99, n_pair, TRUE)) %>% 
          paste0(collapse = "") 
      }) %>% 
      data.frame(x = ., stringsAsFactors = FALSE)
    mark(
      snoram_loop = {
        xmpl <- copy(xmpl0)
        nc <- max_items
        xmpl[1 + 1:nc] <- vector(mode = "integer", length = 3)
        names(xmpl) <- c(names(xmpl)[1], sprintf("item%02i", 1:nc))
        np <- max(nchar(xmpl$x)) / 4
        # Iterate through the df row by row
        for (row in seq_len(nrow(xmpl))) { 
          # Iterate through each entry which has 3 item_number-value pairs
          for (pair in seq_len(np)) {
            item_number <- as.integer(
              substr(xmpl[["x"]][row], 4 * (pair - 1) + 1, 4 * (pair - 1) + 2)
            )
            value <- as.integer(
              substr(xmpl[["x"]][row], 4 * (pair - 1) + 3, 4 * (pair - 1) + 4)
            )
            xmpl[row, sprintf("item%02i", item_number)] <- value 
          }
        }
        xmpl
      },
      snoram_reshape = {
        xmpl <- copy(xmpl0)
        xmplsp <- gsub("(\\d{2})", "\\1 ", xmpl$x) %>% strsplit(" ")
        np <- max(lengths(xmplsp)) / 2
        xmpl2 <- data.frame(
          x       = rep(xmpl$x, each = np),
          item_no = lapply(xmplsp, function(x) x[seq(1, 2*np, 2)]) %>% unlist(),
          value   = lapply(xmplsp, function(x) x[-seq(1, 2*np, 2)]) %>% unlist() %>% as.integer()
        )
        result <- dcast(xmpl2, x ~ paste0("item", item_no))
        result
      },
      uwe_reshape = {
        xmpl <- copy(xmpl0)
        result <- setDT(xmpl) %>% 
          .[, c(tstrsplit(x, "(?<=[0-9]{2})", perl = TRUE, names = TRUE, type.convert = TRUE), 
                .(x = x))] %>% 
          melt(id.var = "x", measure.vars = list(seq(1, ncol(.) - 1, 2), seq(2, ncol(.) - 1, 2)), 
               value.name = c("item", "val")) %>% 
          dcast(x ~ sprintf("item%02i", item), value.var = "val")
        result
      },
      check = FALSE
    )
  })

The check has been turned off because the loop approach creates columns also for non-existing items and uses 0 instead of NA.

ggplot2::autoplot(bm)

enter image description here

The approach using tstrsplit(), melt(), dcast() is almost always faster and the loop approach almost always slower than the other approaches - except for cases with 10 rows. Please, note the logarithmic time scale.

The table below shows also the memory allocation. The loop approach allocates up to 20 times more memory than the reshape approaches.

tail(bm, 9)
# A tibble: 9 x 16
  expression n_pair n_row      min     mean   median     max `itr/sec` mem_alloc  n_gc n_itr total_time result
  <chr>       <dbl> <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:t>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>
1 snoram_lo~      3  1000 145.04ms 148.78ms 148.67ms 152.8ms      6.72   12.27MB     4     4      595ms <data~
2 snoram_re~      3  1000  49.18ms  57.54ms  53.49ms  82.6ms     17.4     1.63MB     3     9      518ms <data~
3 uwe_resha~      3  1000   8.11ms   9.09ms   8.87ms  13.9ms    110.    925.19KB     0    56      509ms <data~
4 snoram_lo~      5  1000 246.04ms 248.31ms 247.39ms 251.5ms      4.03   19.96MB     5     3      745ms <data~
5 snoram_re~      5  1000  54.67ms  59.71ms  58.14ms  69.5ms     16.7     2.41MB     2     9      537ms <data~
6 uwe_resha~      5  1000  11.43ms  12.84ms  12.55ms  21.1ms     77.9     1.12MB     1    39      501ms <data~
7 snoram_lo~     10  1000 500.29ms 500.29ms 500.29ms 500.3ms      2.00   39.33MB     3     1      500ms <data~
8 snoram_re~     10  1000  65.59ms   69.1ms  66.53ms  77.4ms     14.5     4.41MB     2     8      553ms <data~
9 uwe_resha~     10  1000  18.41ms  20.71ms  20.61ms    29ms     48.3     1.88MB     1    25      518ms <data~
# ... with 3 more variables: memory <list>, time <list>, gc <list>
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

I am not sure I complete understand how you want the numbers distributed, but if it is just pairs of numbers, I would do something like this:

xmpl <- data.frame(x = c("022406391116","034506611298", "015410661242"))
mytable <- do.call(rbind, lapply(xmpl$x, substring, seq(1,11,2), seq(2,12,2)))
colnames(mytable) <- paste("Item",1:6)
cbind(xmpl, mytable)

             x Item 1 Item 2 Item 3 Item 4 Item 5 Item 6
1 022406391116     02     24     06     39     11     16
2 034506611298     03     45     06     61     12     98
3 015410661242     01     54     10     66     12     42
Esben Eickhardt
  • 3,183
  • 2
  • 35
  • 56
  • Indeed you misunderstood: in each line, the first two digits in the string describe the item number where I need to put the value of the next two digits. The same goes for the next pairs of numbers. So each line has three values that need to go into item columns. – eli-k Jul 27 '18 at 10:46