How to create columns/variables by extracting characters from given column in R

Question

My data set looks like this:

                    key      date   census  
    1: 01_35004_10-14_+_M 11NOV2001 2.934397
    2: 01_35004_10-14_+_M 06JAN2002 3.028231
    3: 01_35004_10-14_+_M 07APR2002 3.180712
    4: 01_35004_10-14_+_M 02JUN2002 3.274546
    5: 01_35004_10-14_+_M 28JUL2002 3.368380
    6: 01_35004_10-14_+_M 22SEP2002 3.462214
    7: 01_35004_10-14_+_M 22DEC2002 3.614694
    8: 01_35004_10-14_+_M 16FEB2003 3.708528
    9: 01_35004_10-14_+_M 13JUL2003 3.954843
    10:01_35004_10-14_+_M 07SEP2003 4.048677

Certain characters within the column "key" correspond to different variables. For instance: 01 is the State, 35004 is the Zip Code, 10-14 is the Age Group, + is the Race, M is the Gender

I want to extract each of these characters to create separate variables for them (i.e. a column for state filled with 01, a column for Zip Code filled with 35004, etc)

Here is my code:

    Var = c("State","Zip_Code", "Age_Group", "Race", "Gender")
    for(j in Var){
    play$j = gsub("_.*$","",play$key) 
    }

Clearly this is not correct. I would like the loop to iterate through each observation in the "key" column and produce a variable with the extracted character associated with the variable.

Please Help

You should take a look at `?strsplit`. I think `strsplit(key, "_")` will give you a matrix with all of your variables sorted into columns. — Benjamin, May 27 '16 at 13:01
`cSplit(df, "key", sep = "_",direction="wide")` using `splitstackshape` library. — mtoto, May 27 '16 at 13:03
Is there a way to put this in a loop like a for loop where it iterativiley scans each observation and then extracts and puts into the column? — Keith, May 27 '16 at 14:37
I think you already asked this: http://stackoverflow.com/q/37469652/ You might want to look into linking your two accounts. — Frank, May 27 '16 at 14:38
They work, and believe me, I much prefer them. But the way it is, I am working for a consulting group and they asked me to specifically create a function that could do such a thing. Also, because my remote computer -which i use to work on confidential stuff - does not have access to internet, I have to download packages externally and save them on the remote computer and downloading several packages for R to do tasks is very time consuming. — Keith, May 27 '16 at 15:03
@Keith, I have edited my answer and provide an answer involving an implicit loop, although I think this is really not what you would want to use. — coffeinjunky, May 27 '16 at 15:53

coffeinjunky · Accepted Answer · 2016-05-27T16:04:57.560

Try

library(tidyr)
df_sep <- separate(df, key, into=c("State","Zip_Code", "Age_Group", "Race", "Gender"), sep="_")

   State Zip_Code Age_Group Race Gender      date   census
1     01    35004     10-14    +      M 11NOV2001 2.934397
2     01    35004     10-14    +      M 06JAN2002 3.028231
3     01    35004     10-14    +      M 07APR2002 3.180712
4     01    35004     10-14    +      M 02JUN2002 3.274546
5     01    35004     10-14    +      M 28JUL2002 3.368380
6     01    35004     10-14    +      M 22SEP2002 3.462214
7     01    35004     10-14    +      M 22DEC2002 3.614694
8     01    35004     10-14    +      M 16FEB2003 3.708528
9     01    35004     10-14    +      M 13JUL2003 3.954843
10    01    35004     10-14    +      M 07SEP2003 4.048677

Edit: Alright, in your comments you have made it clear that you really want to have a solution that loops through observations, which is an inefficient approach and for a good reason typically considered bad practice. Having expressed my objections, let me show you one approach:

First, we need to populate the dataframe with the columns. To use your approach, this would be:

Var = c("State","Zip_Code", "Age_Group", "Race", "Gender")
for(j in Var){
  df <- within(df, assign(j, NA))
}

However, a more efficient approach would be:

df[, Var]<- NA

Both give:

head(df)
                 key      date   census State Zip_Code Age_Group Race Gender
1 01_35004_10-14_+_M 11NOV2001 2.934397    NA       NA        NA   NA     NA
2 01_35004_10-14_+_M 06JAN2002 3.028231    NA       NA        NA   NA     NA
3 01_35004_10-14_+_M 07APR2002 3.180712    NA       NA        NA   NA     NA
4 01_35004_10-14_+_M 02JUN2002 3.274546    NA       NA        NA   NA     NA
5 01_35004_10-14_+_M 28JUL2002 3.368380    NA       NA        NA   NA     NA
6 01_35004_10-14_+_M 22SEP2002 3.462214    NA       NA        NA   NA     NA

Now, for each observation, we want to split key into components and fill columns 4 to 8 with the corresponding elements. This will be achieved with the following:

df[, Var] <- t(sapply(df$key, function(x) unlist(strsplit(as.character(x[1]), "_"))))

Here, sapply loops through the elements of df$key and passes each element as argument the the function that I have defined, and collects the result in an array.

See:

sapply(df$key, function(x) unlist(strsplit(as.character(x[1]), "_")))
     [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]    [,10]  
[1,] "01"    "01"    "01"    "01"    "01"    "01"    "01"    "01"    "01"    "01"   
[2,] "35004" "35004" "35004" "35004" "35004" "35004" "35004" "35004" "35004" "35004"
[3,] "10-14" "10-14" "10-14" "10-14" "10-14" "10-14" "10-14" "10-14" "10-14" "10-14"
[4,] "+"     "+"     "+"     "+"     "+"     "+"     "+"     "+"     "+"     "+"    
[5,] "M"     "M"     "M"     "M"     "M"     "M"     "M"     "M"     "M"     "M"

Transposing it t() makes sure that it "fits" into the dataframe df[, Var], and here you see that the results are identical:

identical(df[,Var], df_sep[Var])
[1] TRUE

I assume that some of the entries in df$key differ in their format, which is why you may want to check each value first. To do so, you can just embellish the function in the sapply call.

Just one small thing, in: "as.character(x[1])" , what does the x[1] signify? Thanks! — Keith, Jun 01 '16 at 13:39
That is actually redundant. I think I changed the code from supplying the entire row as an argument, to only supplying the key. So, you can probably ignore it. — coffeinjunky, Jun 01 '16 at 13:43
Since I've got you here, I figured I would ask you about this: I would like to use the statement in a function in the form: f = function(x,y,z) { x[,y] = NA x[,y] = sapply(x$z, function(x) unlist(strsplit(as.character(x), "_"))} — Keith, Jun 01 '16 at 15:07
Then have the user input the there own values: f(df,var,key) to essentially create a data frame just as your code did. I want to do this because the code should be able to extract from a column from any data set depending on the users choice. The df I was using was just for practice. — Keith, Jun 01 '16 at 15:09
However, when I create that function, then try to input the arguments in the function, I get the error : " Error in `[<-.data.table`(`*tmp*`, , y, value = list()) : Supplied 5 columns to be assigned an empty list (which may be an empty data.table or data.frame since they are lists too). To delete multiple columns use NULL instead. To add multiple empty list columns, use list(list()). — Keith, Jun 01 '16 at 15:10
Is there any advice or recommendation you could give me to create this "universal function" with your statement? — Keith, Jun 01 '16 at 15:12

score 1 · Answer 2 · edited May 27 '16 at 14:14

Here is a base R method that should get you started:

demogVars <- data.frame(do.call(rbind, strsplit(df$key, split="_")))
names(demogVars) <- c("State", "ZipCode", "AgeGroup", "Race", "Gender")

Note that this requires the "key" variable to be of character type. You can easily convert this using

df$key <- as.character(df$key)

This method returns a data.frame where all of the variables are factors. Again, You can convert them to other types, such are character using the as. family of functions and can do it in one go if desired using sapply.

demogVars[] <- sapply(demogVars, as.character)

Since these datasets are arranged in the same order, you can use cbind to put them together:

df <- cbind(demogVars, df[, -1])

data

df <- read.table(header=T, text="key      date   census  
01_35004_10-14_+_M 11NOV2001 2.934397
01_35004_10-14_+_M 06JAN2002 3.028231
01_35004_10-14_+_M 07APR2002 3.180712
01_35004_10-14_+_M 02JUN2002 3.274546
01_35004_10-14_+_M 28JUL2002 3.368380
01_35004_10-14_+_M 22SEP2002 3.462214
01_35004_10-14_+_M 22DEC2002 3.614694
01_35004_10-14_+_M 16FEB2003 3.708528
01_35004_10-14_+_M 13JUL2003 3.954843
01_35004_10-14_+_M 07SEP2003 4.048677", as.is=T)

Thank you!!!while it works, Is there a way to put this in a loop like a for loop where it iterativiley scans each observation and then extracts and puts into the column? — Keith, May 27 '16 at 14:38
Why would you want to put it in a loop that scans each observation? That will be a lot less efficient both in terms of time and memory. — lmo, May 27 '16 at 14:41

How to create columns/variables by extracting characters from given column in R

2 Answers2