5

I have a dataset that has widths at every pixel position along a central skeleton. The width is output as a single string that is comma delimited.

cukeDatatest <- read.delim("https://gist.githubusercontent.com/bhive01/e7508f552db0415fec1749d0a390c8e5/raw/a12386d43c936c2f73d550dfdaecb8e453d19cfe/widthtest.tsv")
str(cukeDatatest) # or dplyr::glimpse(cukeDatatest)

I need to keep the File and FruitNum identifiers with the widths.

The output I want has three columns File, FruitNum, ObjectWidth, but File and FruitNum are repeated for the length of ObjectWidth for that fruit. Position is important so sorting these vectors would be really bad. Also, every fruit is a different length (if that matters for your method).

I've used str_split() before to dissect a few elements from a string, but never something this large, nor so many of them (I have 8000 of them). Processing time is a concern, but would wait for correct result.

I'm more used to dplyr than data.table, but I see that there are some efforts from Arun in this: R split text string in a data.table columns

bhive01
  • 251
  • 1
  • 12

3 Answers3

5

Using splitstackshape package

library(splitstackshape)
res <- cSplit(cukeDatatest, splitCols = "ObjectWidth", sep = ",", direction = "long")

# result
head(res)
#                            File FruitNum ObjectWidth
# 1: IMG_7888.JPGcolcorrected.jpg        1           4
# 2: IMG_7888.JPGcolcorrected.jpg        1          10
# 3: IMG_7888.JPGcolcorrected.jpg        1          14
# 4: IMG_7888.JPGcolcorrected.jpg        1          15
# 5: IMG_7888.JPGcolcorrected.jpg        1          22
# 6: IMG_7888.JPGcolcorrected.jpg        1          26
zx8754
  • 52,746
  • 12
  • 114
  • 209
4

A Hadleyverse option, with some sensible type-conversion tacked on:

library(dplyr)
library(tidyr)

cukeDatatest %>% 
    # split ObjectWidth into a nested column containing a vector
    mutate(ObjectWidth = strsplit(as.character(.$ObjectWidth), ',')) %>% 
    # unnest nested column, melting data to long form
    unnest() %>% 
    # convert data to integer
    mutate(ObjectWidth = as.integer(ObjectWidth))

# Source: local data frame [39,830 x 3]
# 
#                            File FruitNum ObjectWidth
#                          (fctr)    (int)       (int)
# 1  IMG_7888.JPGcolcorrected.jpg        1           4
# 2  IMG_7888.JPGcolcorrected.jpg        1          10
# 3  IMG_7888.JPGcolcorrected.jpg        1          14
# 4  IMG_7888.JPGcolcorrected.jpg        1          15
# 5  IMG_7888.JPGcolcorrected.jpg        1          22
# 6  IMG_7888.JPGcolcorrected.jpg        1          26
# 7  IMG_7888.JPGcolcorrected.jpg        1          26
# 8  IMG_7888.JPGcolcorrected.jpg        1          28
# 9  IMG_7888.JPGcolcorrected.jpg        1          34
# 10 IMG_7888.JPGcolcorrected.jpg        1          35
# ..                          ...      ...         ...

Edit

Here's an equivalent version with a more typical tidyr approach. One issue with this approach here is the irregular number of terms in ObjectWidth makes making column names somewhat difficult, as separate annoyingly contains no defaults for its into parameter.

A simple workaround here is to deliberately create more columns than you need (the rest of which will be filled with NAs, which will be subsequently deleted by gather). While less than ideally efficient, the code still runs instantaneously, so it's not enough of a performance hit to matter. If it bugs you, figure out the length of the longest row with max(sapply(strsplit(as.character(cukeDatatest$ObjectWidth), ','), length)).

cukeDatatest %>%
    # tbl_df conversion is unnecessary, but nice for printing purposes
    tbl_df() %>%
    # split ObjectWidth on commas into individual columns
    separate(ObjectWidth, into = paste0('X', 1:2500), 
        sep = ',', fill = 'right', convert = TRUE) %>% 
    # gather into long form
    gather(var, ObjectWidth, starts_with('X'), na.rm = TRUE) %>% 
    # remove key column identifying term number within initial ObjectWidth string
    select(-var)

If you have a fixed number of terms in each ObjectWidth string, plain old read.csv called upon ObjectWidth pasted together is a good way to go. read.csv estimates the number of columns from the first five rows, which is fine if the number is constant.

If that doesn't work (like for this data, where the longest row is the seventh), you face the same names issue as above, which can be sorted by offering col.names a set of names of the proper length. The same workaround as above works here, too, if necessary.

read.csv(text = paste(as.character(cukeDatatest$ObjectWidth), collapse = '\n'), 
         header = FALSE, col.names = paste0('V', 1:2179)) %>% 
    bind_cols(cukeDatatest[,-3]) %>% 
    gather(var, ObjectWidth, starts_with('V'), na.rm = TRUE) %>% 
    select(-var)

Both approaches return a tbl_df exactly equivalent to the original approach above.

alistaire
  • 42,459
  • 4
  • 77
  • 117
  • I'm gonna end up using this probably due to wanting to stick with hadleyverse (it fits in my current pipeline). Thanks. – bhive01 May 17 '16 at 03:16
  • Thanks for taking it further. As you say, I have vectors of variable length. The initial approach seems the most flexible in this regard. I have one suggested edit (since I don't have the cred to do it), change the strsplit to .$ObjectWidth. This way if someone copies and pastes they only need to change the variable name. – bhive01 May 17 '16 at 18:47
4

I'd normally start with simple strsplit:

dt[, strsplit(ObjectWidth, ",", fixed = T)[[1]], by = .(File, FruitNum)]

If that's too slow, I'd run strsplit on the entire column and then rearrange the data to my liking:

l = strsplit(dt$ObjectWidth, ",", fixed = T)

dt[inverse.rle(list(lengths = lengths(l), values = seq_along(l))),
   .(File, FruitNum)][, col := unlist(l)][]
eddi
  • 49,088
  • 6
  • 104
  • 155