1

My data set looks like this:

                    key      date   census  
    1: 01_35004_10-14_+_M 11NOV2001 2.934397
    2: 01_35004_10-14_+_M 06JAN2002 3.028231
    3: 01_35004_10-14_+_M 07APR2002 3.180712
    4: 01_35004_10-14_+_M 02JUN2002 3.274546
    5: 01_35004_10-14_+_M 28JUL2002 3.368380
    6: 01_35004_10-14_+_M 22SEP2002 3.462214
    7: 01_35004_10-14_+_M 22DEC2002 3.614694
    8: 01_35004_10-14_+_M 16FEB2003 3.708528
    9: 01_35004_10-14_+_M 13JUL2003 3.954843
    10:01_35004_10-14_+_M 07SEP2003 4.048677

Certain characters within the column "key" correspond to different variables. For instance: 01 is the State, 35004 is the Zip Code, 10-14 is the Age Group, + is the Race, M is the Gender

I want to extract each of these characters to create separate variables for them (i.e. a column for state filled with 01, a column for Zip Code filled with 35004, etc)

Here is my code:

    Var = c("State","Zip_Code", "Age_Group", "Race", "Gender")
    for(j in Var){
    play$j = gsub("_.*$","",play$key) 
    }

Clearly this is not correct. I would like the loop to iterate through each observation in the "key" column and produce a variable with the extracted character associated with the variable.

Please Help

Keith
  • 103
  • 1
  • 9
  • You should take a look at `?strsplit`. I think `strsplit(key, "_")` will give you a matrix with all of your variables sorted into columns. – Benjamin May 27 '16 at 13:01
  • `cSplit(df, "key", sep = "_",direction="wide")` using `splitstackshape` library. – mtoto May 27 '16 at 13:03
  • Is there a way to put this in a loop like a for loop where it iterativiley scans each observation and then extracts and puts into the column? – Keith May 27 '16 at 14:37
  • 3
    I think you already asked this: http://stackoverflow.com/q/37469652/ You might want to look into linking your two accounts. – Frank May 27 '16 at 14:38
  • They work, and believe me, I much prefer them. But the way it is, I am working for a consulting group and they asked me to specifically create a function that could do such a thing. Also, because my remote computer -which i use to work on confidential stuff - does not have access to internet, I have to download packages externally and save them on the remote computer and downloading several packages for R to do tasks is very time consuming. – Keith May 27 '16 at 15:03
  • 1
    @Keith, I have edited my answer and provide an answer involving an implicit loop, although I think this is really not what you would want to use. – coffeinjunky May 27 '16 at 15:53

2 Answers2

2

Try

library(tidyr)
df_sep <- separate(df, key, into=c("State","Zip_Code", "Age_Group", "Race", "Gender"), sep="_")

   State Zip_Code Age_Group Race Gender      date   census
1     01    35004     10-14    +      M 11NOV2001 2.934397
2     01    35004     10-14    +      M 06JAN2002 3.028231
3     01    35004     10-14    +      M 07APR2002 3.180712
4     01    35004     10-14    +      M 02JUN2002 3.274546
5     01    35004     10-14    +      M 28JUL2002 3.368380
6     01    35004     10-14    +      M 22SEP2002 3.462214
7     01    35004     10-14    +      M 22DEC2002 3.614694
8     01    35004     10-14    +      M 16FEB2003 3.708528
9     01    35004     10-14    +      M 13JUL2003 3.954843
10    01    35004     10-14    +      M 07SEP2003 4.048677

Edit: Alright, in your comments you have made it clear that you really want to have a solution that loops through observations, which is an inefficient approach and for a good reason typically considered bad practice. Having expressed my objections, let me show you one approach:

First, we need to populate the dataframe with the columns. To use your approach, this would be:

Var = c("State","Zip_Code", "Age_Group", "Race", "Gender")
for(j in Var){
  df <- within(df, assign(j, NA))
}

However, a more efficient approach would be:

df[, Var]<- NA

Both give:

head(df)
                 key      date   census State Zip_Code Age_Group Race Gender
1 01_35004_10-14_+_M 11NOV2001 2.934397    NA       NA        NA   NA     NA
2 01_35004_10-14_+_M 06JAN2002 3.028231    NA       NA        NA   NA     NA
3 01_35004_10-14_+_M 07APR2002 3.180712    NA       NA        NA   NA     NA
4 01_35004_10-14_+_M 02JUN2002 3.274546    NA       NA        NA   NA     NA
5 01_35004_10-14_+_M 28JUL2002 3.368380    NA       NA        NA   NA     NA
6 01_35004_10-14_+_M 22SEP2002 3.462214    NA       NA        NA   NA     NA

Now, for each observation, we want to split key into components and fill columns 4 to 8 with the corresponding elements. This will be achieved with the following:

df[, Var] <- t(sapply(df$key, function(x) unlist(strsplit(as.character(x[1]), "_"))))

Here, sapply loops through the elements of df$key and passes each element as argument the the function that I have defined, and collects the result in an array.

See:

sapply(df$key, function(x) unlist(strsplit(as.character(x[1]), "_")))
     [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]    [,10]  
[1,] "01"    "01"    "01"    "01"    "01"    "01"    "01"    "01"    "01"    "01"   
[2,] "35004" "35004" "35004" "35004" "35004" "35004" "35004" "35004" "35004" "35004"
[3,] "10-14" "10-14" "10-14" "10-14" "10-14" "10-14" "10-14" "10-14" "10-14" "10-14"
[4,] "+"     "+"     "+"     "+"     "+"     "+"     "+"     "+"     "+"     "+"    
[5,] "M"     "M"     "M"     "M"     "M"     "M"     "M"     "M"     "M"     "M"    

Transposing it t() makes sure that it "fits" into the dataframe df[, Var], and here you see that the results are identical:

identical(df[,Var], df_sep[Var])
[1] TRUE

I assume that some of the entries in df$key differ in their format, which is why you may want to check each value first. To do so, you can just embellish the function in the sapply call.

coffeinjunky
  • 11,254
  • 39
  • 57
  • 1
    Awesome, man. You're the man! Thank you for your help! – Keith Jun 01 '16 at 13:38
  • Just one small thing, in: "as.character(x[1])" , what does the x[1] signify? Thanks! – Keith Jun 01 '16 at 13:39
  • That is actually redundant. I think I changed the code from supplying the entire row as an argument, to only supplying the key. So, you can probably ignore it. – coffeinjunky Jun 01 '16 at 13:43
  • That is, you can ignore the `[1]`. Just use `x`. – coffeinjunky Jun 01 '16 at 13:49
  • Since I've got you here, I figured I would ask you about this: I would like to use the statement in a function in the form: f = function(x,y,z) { x[,y] = NA x[,y] = sapply(x$z, function(x) unlist(strsplit(as.character(x), "_"))} – Keith Jun 01 '16 at 15:07
  • Then have the user input the there own values: f(df,var,key) to essentially create a data frame just as your code did. I want to do this because the code should be able to extract from a column from any data set depending on the users choice. The df I was using was just for practice. – Keith Jun 01 '16 at 15:09
  • However, when I create that function, then try to input the arguments in the function, I get the error : " Error in `[<-.data.table`(`*tmp*`, , y, value = list()) : Supplied 5 columns to be assigned an empty list (which may be an empty data.table or data.frame since they are lists too). To delete multiple columns use NULL instead. To add multiple empty list columns, use list(list()). – Keith Jun 01 '16 at 15:10
  • Is there any advice or recommendation you could give me to create this "universal function" with your statement? – Keith Jun 01 '16 at 15:12
  • You should post that as a new question. – coffeinjunky Jun 01 '16 at 15:20
1

Here is a base R method that should get you started:

demogVars <- data.frame(do.call(rbind, strsplit(df$key, split="_")))
names(demogVars) <- c("State", "ZipCode", "AgeGroup", "Race", "Gender")

Note that this requires the "key" variable to be of character type. You can easily convert this using

df$key <- as.character(df$key)

This method returns a data.frame where all of the variables are factors. Again, You can convert them to other types, such are character using the as. family of functions and can do it in one go if desired using sapply.

demogVars[] <- sapply(demogVars, as.character)

Since these datasets are arranged in the same order, you can use cbind to put them together:

df <- cbind(demogVars, df[, -1])

data

df <- read.table(header=T, text="key      date   census  
01_35004_10-14_+_M 11NOV2001 2.934397
01_35004_10-14_+_M 06JAN2002 3.028231
01_35004_10-14_+_M 07APR2002 3.180712
01_35004_10-14_+_M 02JUN2002 3.274546
01_35004_10-14_+_M 28JUL2002 3.368380
01_35004_10-14_+_M 22SEP2002 3.462214
01_35004_10-14_+_M 22DEC2002 3.614694
01_35004_10-14_+_M 16FEB2003 3.708528
01_35004_10-14_+_M 13JUL2003 3.954843
01_35004_10-14_+_M 07SEP2003 4.048677", as.is=T)
coffeinjunky
  • 11,254
  • 39
  • 57
lmo
  • 37,904
  • 9
  • 56
  • 69
  • 1
    Thanks. I'll add that caveat to the answer – lmo May 27 '16 at 13:41
  • Thank you!!!while it works, Is there a way to put this in a loop like a for loop where it iterativiley scans each observation and then extracts and puts into the column? – Keith May 27 '16 at 14:38
  • 1
    Why would you want to put it in a loop that scans each observation? That will be a lot less efficient both in terms of time and memory. – lmo May 27 '16 at 14:41