R: How do I compare a string to an existing column names list?

Question

I need to write R code that does the following:

go through the column using a loop
split each value by the comma and assign those into a variable
compare the values in that variable to existing column names
if the column name does not exist, create a new column, one for each comma-separated value
populate '1' into the observation for that new column
if the column name exists, add '1' into that observation's value for the existing column with that name

The data (column) before manipulation looks like this:

                                     jobTitle
1                                        <NA>
2                                        <NA>
3                                        <NA>
4   Functional Architect, Business Technology
5                                        <NA>
6                                        <NA>
7                                        <NA>
8                                        <NA>
9                                        <NA>
10                      Founder and President
11                            Product Manager
12                                       <NA>
13                                       <NA>
14                                       <NA>
15 Head of Customer Experience & Online Sales
16                                       <NA>
17                                       <NA>
18                      Founder and President
19                                       <NA>
20                                       <NA>
21                            Product Manager
22                                       <NA>
23                     Customer Value Manager
24                                       <NA>
25                    Lead Software Developer
  ...

The output I need is:

Founder and President  Product Manager
       0                       1        
       1                       0      
       0                       1
       1                       0

The output I am getting is:

Founder and President  Product Manager  Founder and President  Product Manager
       0                       1                   0                 0      
       1                       0                   0                 0     
       0                       0                   1                 0      
       0                       0                   0                 1

The code I have is:

library(plyr)
library(stringr)
library(gdata) 
library(readxl)

train <- read_excel("data.xlsx")

#looping through the jobTitle column
for(i in 1:sum(nrow(train[4]))){ 
        if (!is.na(train[i,4])) {
            #split every value by the comma, convert to lower case
            list2char <- strsplit(tolower(train$jobTitle[i]),",", fixed = T)
            for(j in 1:length(list2char[[1]])) {
                    #populate the current observation for the newly created column with 1
                    if(!(list2char[[1]][j] %in% names(train))){
                            #if the name does not match existing column name, create a new column and assign 1
                            train[i, str_trim(list2char[[1]][j])] <- 1
                    }else{
                            #if the name matches an existing column name, assign 1 to that column

                    }

            }

    }
}

#replace all NAs with 0s
train[is.na(train)] <- 0

data.frames aside, lists don't have rows or columns. What does your desired output look like? — alistaire, Dec 15 '16 at 22:54
In order for us to better understand your question (and for you to have a better chance of it getting answered) it is best to include a reproducible example. To start with, you can share your data by typing `dput(myVariable)` into the console and copying and pasting the output into your question. Similarly, you should include an example of the desired output. Complete instructions for creating a reproducible example in R can be found [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Barker, Dec 16 '16 at 00:03
Worth noting `strsplit()` is vectorized so you can call it on columns of your data, thus the loops may be unnecessary. — gfgm, Dec 16 '16 at 00:31
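
To illustrate the point about vectorization, here is a minimal sketch that splits a whole column in one call. The `jobTitle` vector is a hypothetical sample standing in for the asker's column, and `parts` and `new_cols` are names made up for this example.

```r
# Hypothetical sample standing in for the jobTitle column from the question
jobTitle <- c(NA, "Functional Architect, Business Technology",
              "Founder and President", "Product Manager", NA)

# strsplit() takes the whole vector at once and returns one list element per row
parts <- strsplit(tolower(jobTitle), ",", fixed = TRUE)

# trim the whitespace left around each piece; NA rows stay NA
parts <- lapply(parts, trimws)

# distinct non-missing values, i.e. the candidate column names
vals <- unlist(parts)
new_cols <- unique(vals[!is.na(vals)])
print(new_cols)
```

Trimming before any comparison matters here: splitting "a, b" on "," leaves " b" with a leading space, which would not match a column named "b".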

score 0 · Answer 1 · answered Dec 16 '16 at 02:26

I think you're trying to count the frequency of each value in a comma-delimited string?

    s<-data.frame(A=c("A1,B", "A2,C1"),B=c("B1,B2","C1,A1"), C=c("C1,C2,C3","C4"))
    #      A     B        C
    #1  A1,B B1,B2 C1,C2,C3
    #2 A2,C1 C1,A1       C4

    table( unlist(apply(s,1, function(s.row) {
       strsplit(s.row,",")
    })) )

    #A1 A2  B B1 B2 C1 C2 C3 C4 
    #2  1  1  1  1  3  1  1  1
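
If the goal is one 0/1 column per value for every row, rather than overall counts, the same splitting idea can be extended. This is only a sketch under assumptions: `train` here is a small made-up data frame with a single `jobTitle` column, not the asker's real data, and `parts`, `vals` and `titles` are names introduced for illustration.

```r
# Hypothetical sample standing in for the asker's data
train <- data.frame(
  jobTitle = c(NA, "Founder and President", "Product Manager",
               "Founder and President",
               "Functional Architect, Business Technology"),
  stringsAsFactors = FALSE
)

# split every row once; lower-case and trim so equal titles compare equal
parts <- lapply(strsplit(tolower(train$jobTitle), ",", fixed = TRUE), trimws)

# one column per distinct non-missing value
vals <- unlist(parts)
titles <- unique(vals[!is.na(vals)])

# 1 if the row contains the title, 0 otherwise (NA rows get 0)
for (title in titles) {
  train[[title]] <- vapply(parts, function(p) as.integer(title %in% p),
                           integer(1))
}

print(train)
```

Because each name in `titles` is unique and is assigned with `[[<-`, a title that occurs in several rows fills the same column instead of creating a second one.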
