A Hadleyverse option, with some sensible type-conversion tacked on:
library(dplyr)
library(tidyr)
cukeDatatest %>%
# split ObjectWidth on commas into a nested column of character vectors
mutate(ObjectWidth = strsplit(as.character(ObjectWidth), ',')) %>%
# unnest the list column, melting the data to long form
unnest(ObjectWidth) %>%
# convert the split strings to integer
mutate(ObjectWidth = as.integer(ObjectWidth))
# Source: local data frame [39,830 x 3]
#
# File FruitNum ObjectWidth
# (fctr) (int) (int)
# 1 IMG_7888.JPGcolcorrected.jpg 1 4
# 2 IMG_7888.JPGcolcorrected.jpg 1 10
# 3 IMG_7888.JPGcolcorrected.jpg 1 14
# 4 IMG_7888.JPGcolcorrected.jpg 1 15
# 5 IMG_7888.JPGcolcorrected.jpg 1 22
# 6 IMG_7888.JPGcolcorrected.jpg 1 26
# 7 IMG_7888.JPGcolcorrected.jpg 1 26
# 8 IMG_7888.JPGcolcorrected.jpg 1 28
# 9 IMG_7888.JPGcolcorrected.jpg 1 34
# 10 IMG_7888.JPGcolcorrected.jpg 1 35
# .. ... ... ...
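Since cukeDatatest itself isn't reproduced here, a minimal self-contained sketch of the same pipeline, using a hypothetical two-row toy data frame in its place, looks like this:

```r
library(dplyr)
library(tidyr)

# hypothetical stand-in for cukeDatatest
toy <- tibble(File = c('a.jpg', 'b.jpg'),
              FruitNum = c(1L, 2L),
              ObjectWidth = c('4,10,14', '22,26'))

toy %>%
    # nest each comma-separated string as a character vector
    mutate(ObjectWidth = strsplit(as.character(ObjectWidth), ',')) %>%
    # one row per element of each vector
    unnest(ObjectWidth) %>%
    # the split pieces are still strings, so convert
    mutate(ObjectWidth = as.integer(ObjectWidth))
# a 5 x 3 tbl_df: one row per comma-separated width
```

The toy frame has rows of unequal length (three widths, then two), mirroring the irregular structure that motivates the approach.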
Here's an equivalent version with a more typical tidyr approach. One issue here is that the irregular number of terms in ObjectWidth makes picking column names somewhat difficult, as separate annoyingly has no default for its into parameter.

A simple workaround is to deliberately create more columns than you need; the extras will be filled with NAs, which gather subsequently deletes. While less than ideally efficient, the code still runs instantaneously, so it's not enough of a performance hit to matter. If it bugs you, figure out the length of the longest row with max(sapply(strsplit(as.character(cukeDatatest$ObjectWidth), ','), length)).
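On a hypothetical toy version of the ObjectWidth column, that longest-row expression works like so:

```r
# hypothetical stand-in for cukeDatatest$ObjectWidth
widths <- c('4,10,14', '22,26,26,28', '34')

# split each string on commas, count the pieces, take the maximum
max(sapply(strsplit(as.character(widths), ','), length))
# 4 -- the longest row has four terms, so four columns suffice
```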
cukeDatatest %>%
# tbl_df conversion is unnecessary, but nice for printing purposes
tbl_df() %>%
# split ObjectWidth on commas into individual columns
separate(ObjectWidth, into = paste0('X', 1:2500),
sep = ',', fill = 'right', convert = TRUE) %>%
# gather into long form
gather(var, ObjectWidth, starts_with('X'), na.rm = TRUE) %>%
# remove key column identifying term number within initial ObjectWidth string
select(-var)
If you have a fixed number of terms in each ObjectWidth string, plain old read.csv called on the ObjectWidth values pasted together is a good way to go. read.csv estimates the number of columns from the first five rows, which is fine if the number is constant.

If that doesn't work (as for this data, where the longest row is the seventh), you face the same naming issue as above, which can be sorted by passing col.names a set of names of the proper length. The same workaround as above works here, too, if necessary.
read.csv(text = paste(as.character(cukeDatatest$ObjectWidth), collapse = '\n'),
header = FALSE, col.names = paste0('V', 1:2179)) %>%
bind_cols(cukeDatatest[,-3]) %>%
gather(var, ObjectWidth, starts_with('V'), na.rm = TRUE) %>%
select(-var)
Both approaches return a tbl_df exactly equivalent to the result of the original approach above.
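If you want to verify that equivalence yourself, here's a sketch (again using a hypothetical toy data frame in place of cukeDatatest) that runs the first two pipelines and compares them after sorting. Sorting matters because gather stacks the split columns column-by-column, so its rows come out in a different order than unnest's:

```r
library(dplyr)
library(tidyr)

# hypothetical stand-in for cukeDatatest
toy <- tibble(File = c('a.jpg', 'b.jpg'),
              FruitNum = c(1L, 2L),
              ObjectWidth = c('4,10,14', '22,26'))

# strsplit + unnest version
a1 <- toy %>%
    mutate(ObjectWidth = strsplit(as.character(ObjectWidth), ',')) %>%
    unnest(ObjectWidth) %>%
    mutate(ObjectWidth = as.integer(ObjectWidth))

# separate + gather version (three columns cover the longest toy row)
a2 <- toy %>%
    separate(ObjectWidth, into = paste0('X', 1:3),
             sep = ',', fill = 'right', convert = TRUE) %>%
    gather(var, ObjectWidth, starts_with('X'), na.rm = TRUE) %>%
    select(-var)

# identical up to row order
identical(as.data.frame(arrange(a1, File, FruitNum, ObjectWidth)),
          as.data.frame(arrange(a2, File, FruitNum, ObjectWidth)))
```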