0

My data set looks like this:

                  key      date   census  j
1: 01_35004_10-14_+_M 11NOV2001 2.934397 01
2: 01_35004_10-14_+_M 06JAN2002 3.028231 01
3: 01_35004_10-14_+_M 07APR2002 3.180712 01
4: 01_35004_10-14_+_M 02JUN2002 3.274546 01
5: 01_35004_10-14_+_M 28JUL2002 3.368380 01
6: 01_35004_10-14_+_M 22SEP2002 3.462214 01
7: 01_35004_10-14_+_M 22DEC2002 3.614694 01
8: 01_35004_10-14_+_M 16FEB2003 3.708528 01
9: 01_35004_10-14_+_M 13JUL2003 3.954843 01
10: 01_35004_10-14_+_M 07SEP2003 4.048677 01

Certain characters within the column "key" correspond to different variables. For instance: 01 is the State, 35004 is the Zip Code, 10-14 is the Age Group, + is the Race, M is the Gender

I want to extract each of these characters to create separate variables for them (i.e. a column for state filled with 01, a column for Zip Code filled with 35004, etc)

Here is my code:

Var = c("State","Zip_Code", "Age_Group", "Race", "Gender")
for(j in Var){
play$j = gsub("_.*$","",play$key) 
}

Clearly this is not correct. I would like the loop to iterate through each observation in the "key" column and produce a variable with the extracted character associated with the variable.

Sotos
  • 51,121
  • 6
  • 32
  • 66
  • 1
    Please create a [mcve] instead of link to images of your data and code – talat May 26 '16 at 19:54
  • Welcome to StackOverflow. Please read the following tips on creating a [minimum example](http://stackoverflow.com/help/mcve) and this post on producing a [reproducible R example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Then edit your question accordingly. – lmo May 26 '16 at 19:54
  • @docendo Please reopen. Thank you!!! – Keith Siopes May 27 '16 at 12:54
  • @KeithSiopes, I voted to reopen (but more votes required). You can use for example `library(tidyr); newcols <- c("State", "ZipCode", "AgeGroup", "Race", "Gender"); DF2 <- separate(DF, col = key, into = newcols, sep = "_")` – talat May 27 '16 at 13:04
  • How "big" is your data set (an performance requirements?)? – R Yoda May 28 '16 at 22:55
  • You can find a similar question + answers here using `strplit`: [http://stackoverflow.com/questions/33585493/repeated-rows-in-r-with-different-sequence] – R Yoda May 28 '16 at 23:03
  • I guess the "Age_Group" column has no "association" (lookup table) but should be splitted into two columns (upper and lower age)? – R Yoda May 28 '16 at 23:18
  • I remember answering this question... isn't `splitstackshape::cSplit(df, 'key', '_', 'wide')` what you need? – Sotos May 29 '16 at 11:48

1 Answers1

0

The basic solution (without expecting a good performance) uses read.csv:

# excerpt of your data (only the "coordinate" column containing the model point coordinates)
x <- c("01_35004_10-14_+_M", "01_35004_10-14_+_M")

# simple way is treating the string as CSV row :-)
y <- read.csv(text = x, sep="_", header=FALSE)

# Fix the wrong column names
names(y) <- c("State","Zip_Code", "Age_Group", "Race", "Gender")

# Now recode one example column by using translation ("lookup") table
gender.lookup <- data.frame( gender.code=c("M", "F"), gender.name=c("Male", "Female"))

# Add the recoded value as new column. Note: Lookup failures are ignored
y$GenderName <- gender.lookup$gender.name[match(y$Gender, gender.lookup$gender.code)]

I am leaving the implementation of the loop to your imagination since I don't have more lookup data in your question... (e. g. use lapply and a list of lookup tables with the same index positions as the column indices).

R Yoda
  • 8,358
  • 2
  • 50
  • 87