I have a data frame with one column that contains strings of different lengths. For each row, I need to split the long string on the ', ' separator into individual string elements. Then, for each distinct string, I need to create a new column that contains a 1 if that string is present in the row and a 0 otherwise.

I've done it using the loops below. However, maybe there is a more elegant way of doing it, e.g., using an existing data wrangling package? Thanks a lot! Here is my code:

# Create an example data frame with one column with strings:
df = data.frame(a = c("one, two, three",
                      "one, three",
                      "two, three, four, five",
                      "one, four, five",
                      "two"), stringsAsFactors = FALSE)
df
str(df$a)

# Split column 'a' into individual strings:
library(stringr)
split_list <- str_split(df$a, ", ")
split_list  # the result is a list of strings

# Grab unique values of all strings:
unique_strings <- sort(unique(unlist(split_list)))
unique_strings

# For each string in unique_strings create a variable with zeros:
df[unique_strings] <- 0
df

# Replace a zero with a 1 in a column if that row contains that string:
for(row in seq_len(nrow(df))){        # loop through rows (safe even if df has 0 rows)
  for(string in split_list[[row]]){   # loop through the strings found in this row
    df[row, string] <- 1              # mark the matching column
  }
}
df
  • Your question is a bit open IMO because each row does not have the same number of strings. Do you know already how many columns total you will need to support every row? – Tim Biegeleisen Oct 15 '17 at 15:38
  • Following from the dupe shared, try `library(splitstackshape); cSplit_e(df, "a", ",", type = "character", fill = 0)`. Look also at `mtabulate` from the "qdapTools" package. – A5C1D2H2I1M1N2O1R2T1 Oct 15 '17 at 15:45
  • Possibly `uniq <- unique(trimws(unlist(strsplit(df$a,",")))); df[,uniq] <- lapply(uniq, function(x) as.numeric(grepl(x, df$a)))`? – Mike H. Oct 15 '17 at 15:47
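For comparison, here is a minimal base R sketch of the same split-then-flag idea without explicit row loops or extra packages. It is only one possible approach, not a definitive answer: the names `indicators` and `result` are made up for this illustration, the indicator columns come out as integers rather than the doubles the original loop produces, and it uses `%in%` for exact matching instead of the `grepl` pattern matching suggested in the comments.

```r
# Example data from the question.
df <- data.frame(a = c("one, two, three",
                       "one, three",
                       "two, three, four, five",
                       "one, four, five",
                       "two"), stringsAsFactors = FALSE)

# Split each row on ", " (fixed = TRUE treats the separator literally).
split_list <- strsplit(df$a, ", ", fixed = TRUE)

# All distinct strings, sorted; these become the indicator column names.
unique_strings <- sort(unique(unlist(split_list)))

# For each row, flag which of the unique strings it contains (exact match),
# then stack the per-row 0/1 vectors into a rows-by-strings matrix.
indicators <- do.call(rbind, lapply(split_list, function(parts) {
  as.integer(unique_strings %in% parts)
}))
colnames(indicators) <- unique_strings

# 'result' is a hypothetical name: the original column plus the 0/1 columns.
result <- cbind(df, indicators)
result
```

Because the match is on whole elements, a string such as "one" would not be flagged merely because another element contains it as a substring, which can happen with a `grepl`-based approach.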