
From a CSV file I loaded data into an R data frame that looks like this:

> head(mydata)
  row lengthArray                         sports num_runs percent_runs
1   0           4               [24, 18, 24, 18]        0            0
2   1          10 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]        0            0
3   2           4                   [0, 0, 0, 0]        0            0
4   3           2                         [0, 0]        0            0
5   4           2                       [18, 18]        0            0
6   5           1                            [0]        0            0

I can access the integer columns and get their types with no problem, but I can't figure out how to access sports:

> class(mydata[4,3])
[1] "factor" 
>  string_factor = mydata[1,3]
> string_factor
[1] [24, 18, 24, 18]
6378 Levels: [0] [0, 0] [0, 0, 0] [0, 0, 0, 0] ... [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9]
> class(string_factor)
[1] "factor"
> string_factor_numeric = as.numeric(string_factor)
> string_factor_numeric
[1] 5181

I guess the best R response would be "don't do this", but this is how the data is coming, so I am wondering how I can get those numbers out of the array so that I can use them.

I should also mention that the approach from "Convert data.frame columns from factors to characters" gave no error message but had no effect, as the array column continued to be classed as a factor.

UPDATE: from the comments, you can see this can get you somewhere:
mydata[,3]  <- as.character(mydata[,3])

However this still does not get you to an array with individually accessible elements.
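
A minimal sketch of what is going on with the factor, using a small made-up vector in place of the real column (the values below are assumptions for illustration): as.numeric() returns the internal level codes, as.character() returns the labels, and assigning into a single element leaves the class unchanged.

```r
# Toy factor standing in for mydata$sports (values are illustrative only)
sports <- factor(c("[24, 18, 24, 18]", "[0, 0]", "[18, 18]"))

# as.numeric() on a factor returns the internal level codes, not the text
codes <- as.numeric(sports)

# as.character() returns the labels themselves
labels <- as.character(sports)

# Assigning a string to one element keeps the factor class,
# which is why the whole column has to be converted at once
one_cell <- sports
one_cell[1] <- as.character(one_cell[1])

class(one_cell)
class(labels)
```

This would explain why 5181 came back above: it is the position of that label among the 6378 levels, not a value from the array.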

sunny
  • Convert the sports column into character.`mydata[4,3]<-as.character(mydata[4,3])` – user227710 Jun 15 '15 at 22:32
  • @user227710 thanks for the suggestion, but that had no effect > mydata[[1,3]<-as.character(mydata[1,3]) Error: unexpected assignment in "mydata[[1,3]<-" >mydata[1,3]<-as.character(mydata[1,3]) > class(mydata[1,3]) [1] "factor" – sunny Jun 15 '15 at 22:37
  • `mydata[[1,3]<-` should be `mydata[1,3]<-` – user227710 Jun 15 '15 at 22:39
  • @user227710 that was a typo while transcribing, I think. Here's pasted directly:> mydata[1,3] <- as.character(mydata[1,3]) > class(mydata[1,3]) [1] "factor" – sunny Jun 15 '15 at 22:42
  • You can't make just the first row a character, you have to make the whole column character: `mydata[, 3] <- as.character(mydata[, 3])`. – Gregor Thomas Jun 15 '15 at 22:43
  • Also "did not work" isn't informative. Did it give an error message? A warning message? You would also do well to give your desired outcome. Do you want to turn the numbers in `sports` into columns? Do you want to reshape the wide sports column to long? – Gregor Thomas Jun 15 '15 at 22:45
  • @Gregor you are correct, I tried to make my question more descriptive. My goal is to create a column for each distinct integer and then for each row that column's value will be the number of times that particular integer appeared in the array. So yes, I do want to reshape the wide sports column to long if I understand that correctly. – sunny Jun 15 '15 at 22:47
  • @sunny: You should follow the advice of Gregor. It must work. – user227710 Jun 15 '15 at 22:48
  • @user227710 it did work, you are correct. – sunny Jun 15 '15 at 22:48
  • @user227710 this in some ways leads me to another puzzle because this character type because now I have > f = mydata[1,3] > f [1] "[{u'sport': 24}, {u'sport': 18}, {u'sport': 24}, {u'sport': 18}]" > 24 %in% f [1] FALSE > 18 %in% f [1] FALSE > class(f) [1] "character" – sunny Jun 15 '15 at 22:50
  • @user227710 I am happy to delete if it's not helpful, but I still cannot access the individual members of the array. – sunny Jun 15 '15 at 22:50
  • It's not an array, it's just a string. – Gregor Thomas Jun 15 '15 at 22:55
  • @Gregor if it were a string, I should be able to access individual characters? I don't think it's just a string because the output looks funny. The as.character converted this:[24,18,24,18] to this: {u'sport': 24}, {u'sport': 18}, {u'sport': 24}, {u'sport': 18} – sunny Jun 15 '15 at 22:57

3 Answers


Here's another idea using splitstackshape:

library(splitstackshape)
library(dplyr)
mydata %>% 
  mutate(sports = gsub("\\[|\\]", "", sports)) %>%
  cSplit("sports", sep = ",", direction = "wide")

Which gives:

   row lengthArray num_runs percent_runs sports_01 sports_02 sports_03 sports_04 sports_05 sports_06 sports_07 sports_08 sports_09 sports_10
1:   0           4        0            0        24        18        24        18        NA        NA        NA        NA        NA        NA
2:   1          10        0            0         2         2         2         2         2         2         2         2         2         2
3:   2           4        0            0         0         0         0         0        NA        NA        NA        NA        NA        NA
4:   3           2        0            0         0         0        NA        NA        NA        NA        NA        NA        NA        NA
5:   4           2        0            0        18        18        NA        NA        NA        NA        NA        NA        NA        NA
6:   5           1        0            0         0        NA        NA        NA        NA        NA        NA        NA        NA        NA

Or as per @thelatemail comment, you could also store a list as a column:

library(stringi)
# "[0-9]+" matches whole numbers; a single-digit pattern would split "24" into "2" and "4"
df <- mydata %>%
  mutate(sports = stri_extract_all(as.character(sports), regex = "[0-9]+"))

Which will give you the following data structure:

> str(df)
#'data.frame':  6 obs. of  5 variables:
# $ row         : int  0 1 2 3 4 5
# $ lengthArray : int  4 10 4 2 2 1
# $ sports      :List of 6
#  ..$ : chr  "24" "18" "24" "18"
#  ..$ : chr  "2" "2" "2" "2" ...
#  ..$ : chr  "0" "0" "0" "0"
#  ..$ : chr  "0" "0"
#  ..$ : chr  "18" "18"
#  ..$ : chr "0"
# $ num_runs    : int  0 0 0 0 0 0
# $ percent_runs: int  0 0 0 0 0 0

You can then access the elements of the list like this:

> df$sports[[1]][1] #first element of first list
#[1] "24"
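
The goal stated in the comments (one column per distinct integer, holding how often it appears in each row) could be reached from a list column along these lines. This is only a sketch in base R: the toy list below is hypothetical, it assumes each element holds whole numbers as strings, and the `sport_` column prefix is made up.

```r
# Hypothetical list column: one character vector of whole numbers per row
sports <- list(c("24", "18", "24", "18"), c("0", "0"), c("18", "18"), "0")

# Every distinct value seen in any row, in numeric order
all_values <- sort(unique(as.integer(unlist(sports))))

# One row per original row, one column per distinct value;
# table() on a factor with fixed levels also reports zero counts
counts <- do.call(rbind, lapply(sports, function(x) {
  as.vector(table(factor(as.integer(x), levels = all_values)))
}))
colnames(counts) <- paste0("sport_", all_values)

counts
```

The resulting matrix can be attached to the original data with cbind() if the counts are wanted as regular columns.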
Steven Beaupré

Here's your data with dput:

mydata = structure(list(row = 0:5, lengthArray = c(4L, 10L, 4L, 2L, 2L, 
1L), sports = structure(c(6L, 5L, 1L, 2L, 4L, 3L), .Label = c("[0, 0, 0, 0]", 
"[0, 0]", "[0]", "[18, 18]", "[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]", 
"[24, 18, 24, 18]"), class = "factor"), num_runs = c(0L, 0L, 
0L, 0L, 0L, 0L), percent_runs = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("row", 
"lengthArray", "sports", "num_runs", "percent_runs"), class = "data.frame", row.names = c(NA, 
-6L))

First we convert the sports column to character:

mydata$sports = as.character(mydata$sports)

Now I'll get rid of the brackets and spaces (leaving the commas)

library(stringr)
mydata$sports = str_replace_all(mydata$sports, pattern = "\\[|\\]| ", "")

And lastly separate the sports column into multiple columns

library(tidyr)
mydata = separate(mydata, sports, into = paste0("sport", 1:max(mydata$lengthArray)), sep = ",", extra = "drop")

mydata
#  row lengthArray sport1 sport2 sport3 sport4 sport5 sport6 sport7 sport8 sport9 sport10 num_runs percent_runs
#1   0           4     24     18     24     18   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>        0            0
#2   1          10      2      2      2      2      2      2      2      2      2       2        0            0
#3   2           4      0      0      0      0   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>        0            0
#4   3           2      0      0   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>        0            0
#5   4           2     18     18   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>        0            0
#6   5           1      0   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>        0            0
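
The `<NA>` entries in the printout indicate that the new columns are still character. A small sketch of converting them to integer in base R follows; the toy data frame and its column names are assumptions that mimic the shape of the result above. (tidyr's separate() also has a `convert` argument that may make this step unnecessary.)

```r
# Toy stand-in for the separated result (illustrative values only)
wide <- data.frame(
  row = 0:2,
  sport1 = c("24", "2", "0"),
  sport2 = c("18", "2", NA),
  stringsAsFactors = FALSE
)

# Pick the columns produced by the split and convert them to integer
sport_cols <- grep("^sport[0-9]+$", names(wide), value = TRUE)
wide[sport_cols] <- lapply(wide[sport_cols], as.integer)

str(wide)
```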
Gregor Thomas
  • It's also fine to store a list as a column in a data.frame - e.g. - `mydata$sports <- strsplit(gsub("^\\[|\\]$","",as.character(mydata$sports)),", |\\[|\\]")` - which you can then access subcomponents of. – thelatemail Jun 15 '15 at 23:10

Recreating your data:

text = "
row lengthArray                            sports num_runs percent_runs
   0           4               '[24, 18, 24, 18]'        0            0
   1          10 '[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]'        0            0
   2           4                   '[0, 0, 0, 0]'        0            0
   3           2                         '[0, 0]'        0            0
   4           2                       '[18, 18]'        0            0
   5           1                            '[0]'        0            0"

data <- read.table(text = text, header= TRUE)

You should probably take the values in sports and create new columns, but if you want to keep the vectors inside the sports column, you can actually do that:

data$sports <- as.character(data$sports)
data$sports <- lapply(data$sports, function(x) eval(parse(text = paste0("c(", gsub("\\[|\\]", "", x),")"))))

Now, for example, if you want to get the third value in the first row of sports:

data$sports[[1]][[3]]
[1] 24
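
If evaluating text from a file with eval(parse()) feels uncomfortable, the same list of vectors can probably be built without it. A sketch under the assumption that every entry is a bracketed, comma-separated list of integers; `parse_array` is a hypothetical helper name, not part of any package:

```r
# Toy strings in the same format as the sports column
sports <- c("[24, 18, 24, 18]", "[0, 0]", "[0]")

# Strip the brackets, split on commas, and convert each piece to integer
parse_array <- function(x) {
  pieces <- strsplit(gsub("\\[|\\]", "", x), ",", fixed = TRUE)[[1]]
  as.integer(trimws(pieces))
}

parsed <- lapply(sports, parse_array)

# Third value of the first entry
parsed[[1]][3]
```

Unlike eval(), this never executes the file's contents as R code; anything that is not a number simply becomes NA with a warning.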
Carlos Cinelli