I need to write an R code, that will do the following:
- go through the column using a loop
- split each value by the comma and assign those into a variable
- compare the values in that variable to existing column names
- if the column name does not exist, create a new column, one for each comma-separated values
- populate '1' into the observation for that new column
- if the column name exists, add '1' into that observation's value for the existing column with that name
The data(column) before manipulations looks like this:
jobTitle
1 <NA>
2 <NA>
3 <NA>
4 Functional Architect, Business Technology
5 <NA>
6 <NA>
7 <NA>
8 <NA>
9 <NA>
10 Founder and President
11 Product Manager
12 <NA>
13 <NA>
14 <NA>
15 Head of Customer Experience & Online Sales
16 <NA>
17 <NA>
18 Founder and President
19 <NA>
20 <NA>
21 Product Manager
22 <NA>
23 Customer Value Manager
24 <NA>
25 Lead Software Developer
...
The output I need is:
Founder and President Product Manager
0 1
1 0
0 1
1 0
The output I am getting is:
Founder and President Product Manager Founder and President Product Manager
0 1 0 0
1 0 0 0
0 0 1 0
0 0 0 1
The code I have is:
library(plyr)
library(stringr)
library(gdata)
library(readxl)
train <- read_excel("data.xlsx")
#looping through the jobTitle column
for(i in 1:sum(nrow(train[4]))){
if ((!is.na(train[i,4])) {
#split every value by the comma, convert to lower case
list2char <- strsplit(tolower(train$jobTitle[i]),",", fixed = T)
for(j in 1:length(list2char[[1]])) {
#populate the current observation for the newly created column with 1
if(!(list2char[[1]][j] %in% names(train))){
#if the name does not match existing column name, create a new column and assign 1
train[i, str_trim(list2char[[1]][j])] <- 1
}else{
#if the name matches an existing column name, assign 1 to that column
}
}
}
}
#replace all NAs with 0s
train[is.na(train)] <- 0