0

I need to write an R code, that will do the following:

  • go through the column using a loop
  • split each value by the comma and assign those into a variable
  • compare the values in that variable to existing column names
  • if the column name does not exist, create a new column, one for each comma-separated values
  • populate '1' into the observation for that new column
  • if the column name exists, add '1' into that observation's value for the existing column with that name

The data(column) before manipulations looks like this:

                                     jobTitle
1                                        <NA>
2                                        <NA>
3                                        <NA>
4   Functional Architect, Business Technology
5                                        <NA>
6                                        <NA>
7                                        <NA>
8                                        <NA>
9                                        <NA>
10                      Founder and President
11                            Product Manager
12                                       <NA>
13                                       <NA>
14                                       <NA>
15 Head of Customer Experience & Online Sales
16                                       <NA>
17                                       <NA>
18                      Founder and President
19                                       <NA>
20                                       <NA>
21                            Product Manager
22                                       <NA>
23                     Customer Value Manager
24                                       <NA>
25                    Lead Software Developer
  ...

The output I need is:

Founder and President  Product Manager
       0                       1        
       1                       0      
       0                       1
       1                       0

The output I am getting is:

Founder and President  Product Manager  Founder and President  Product Manager
       0                       1                   0                 0      
       1                       0                   0                 0     
       0                       0                   1                 0      
       0                       0                   0                 1

The code I have is:

library(plyr)
library(stringr)
library(gdata) 
library(readxl)

train <- read_excel("data.xlsx")

#looping through the jobTitle column
for(i in 1:sum(nrow(train[4]))){ 
        if ((!is.na(train[i,4])) {
            #split every value by the comma, convert to lower case
            list2char <- strsplit(tolower(train$jobTitle[i]),",", fixed = T)
            for(j in 1:length(list2char[[1]])) {
                    #populate the current observation for the newly created column with 1
                    if(!(list2char[[1]][j] %in% names(train))){
                            #if the name does not match existing column name, create a new column and assign 1
                            train[i, str_trim(list2char[[1]][j])] <- 1
                    }else{
                            #if the name matches an existing column name, assign 1 to that column

                    }

            }

    }
}

#replace all NAs with 0s
train[is.na(train)] <- 0
Feyzi Bagirov
  • 1,292
  • 4
  • 28
  • 46
  • data.frames aside, lists don't have rows or columns. What does your desired output look like? – alistaire Dec 15 '16 at 22:54
  • In order for us to better understand your question (and for you to have a better chance of it getting answered) it is best to include a reproducible example. To start with, you can share your data by typing `dput(myVariable)` into the console and copying and pasting the output into your question. Similarly, you should include an example of the desired output. Complete instructions for creating a reproducible example in R can be found [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Barker Dec 16 '16 at 00:03
  • Worth noting `strsplit()` is vectorized so you can call it on columns of your data, thus the loops may be unnecessary. – gfgm Dec 16 '16 at 00:31
  • Barker, I edited the question with more details, see above – Feyzi Bagirov Dec 16 '16 at 15:05

1 Answers1

0

I think you're trying to count the frequency of each variable in a comma-delimited string?

    s<-data.frame(A=c("A1,B", "A2,C1"),B=c("B1,B2","C1,A1"), C=c("C1,C2,C3","C4"))
    #      A     B        C
    #1  A1,B B1,B2 C1,C2,C3
    #2 A2,C1 C1,A1       C4

    table( unlist(apply(s,1, function(s.row) {
       strsplit(s.row,",")
    })) )

    #A1 A2  B B1 B2 C1 C2 C3 C4 
    #2  1  1  1  1  3  1  1  1
Jean
  • 1,480
  • 15
  • 27