I need some help setting up code in R to solve a problem:

I want to give R some string data as input, where each string contains several words (phrases, tweets, whatever you want). The strings may also contain multiple " " or "," characters as separators.

Sample input data (screenshot not reproduced here)

I want R to set up a variable for each unique word across all input strings and set it to 1 (or TRUE, or anything else) when the string contains that word.

So my desired output looks something like this:

Sample output (screenshot not reproduced here)

The empty cells in the columns should contain 0; I left them out for easier reading.

To be honest, I am no expert with loops and think there could be an easier solution using a package. I appreciate any support on this topic, as I have several different projects where a solution could save me a lot of time.

Edit: I want to keep the original ID & String for further processing.

Ronak Shah
AMWiedl
  • Can you explain a bit more about how you might do this in R? Are you familiar with dataframes? – Chris Feb 25 '20 at 22:29
  • I am familiar with dataframes in general, but not with how I could apply a simple dataframe here. I thought there could be a two-step solution: first, identify all unique values and make them columns; second, fill the columns by testing whether the word is present in the string. But I hope that there's a ready-to-use solution which also saves runtime. – AMWiedl Feb 25 '20 at 22:37

2 Answers

First off, for future posts please provide sample data in a reproducible and copy&paste-able format. Screenshots are not a good idea because we can't easily extract data from an image. For more details, please review how to provide a minimal reproducible example/attempt.

That aside, here is a tidyverse solution

library(tidyverse)
df %>%
    separate_rows(Text, sep = " ") %>%
    mutate(n = 1) %>%
    pivot_wider(names_from = "Text", values_from = "n", values_fill = list(n = 0))
## A tibble: 5 x 6
#  ID      Peanut Butter Jelly Storm  Wind
#  <fct>    <dbl>  <dbl> <dbl> <dbl> <dbl>
#1 ID-0001      1      1     1     0     0
#2 ID-0002      1      0     0     0     0
#3 ID-0003      0      1     0     0     0
#4 ID-0004      0      0     0     1     0
#5 ID-0005      0      1     0     1     1

Explanation: We use separate_rows to split the entries in Text on white space, which reshapes the data into long format; we then add a count column n; finally we reshape the data from long to wide with pivot_wider and fill missing values with 0.
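The pipeline above drops the original Text column and splits only on single spaces, while the question mentions repeated " " or "," separators and asks to keep the original string. A minimal sketch of a variant that covers both follows. The sample data and the helper column name Word are made up for illustration, and it assumes tidyr 1.0.0 or later; treat it as one possible adaptation rather than the answer's own method.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data with mixed separators and a repeated word
df <- data.frame(
  ID   = c("ID-0001", "ID-0002", "ID-0003"),
  Text = c("Peanut Butter Jelly", "Peanut,  Peanut", "Storm, Wind Butter"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(Word = Text) %>%                  # split a copy so that Text is kept
  separate_rows(Word, sep = "[ ,]+") %>%   # split on runs of spaces and/or commas
  filter(Word != "") %>%                   # drop empty tokens, if any
  distinct(ID, Text, Word) %>%             # one row per ID/word pair
  mutate(n = 1) %>%
  pivot_wider(names_from = Word, values_from = n, values_fill = list(n = 0))

wide
```

The distinct() step matters when a word occurs twice in the same string: without it, pivot_wider would find duplicate ID/word pairs and could not produce a single 0/1 value per cell.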


Or in base R using xtabs

df2 <- transform(df, Text = strsplit(as.character(Text), " "))
xtabs(n ~ ., data.frame(
    ID = with(df2, rep(ID, vapply(Text, length, 1L))),
    Text = unlist(df2$Text),
    n = 1))
#ID        Butter Jelly Peanut Storm Wind
#  ID-0001      1     1      1     0    0
#  ID-0002      0     0      1     0    0
#  ID-0003      1     0      0     0    0
#  ID-0004      0     0      0     1    0
#  ID-0005      1     0      0     1    1

Sample data

df <- read.table(text =
"ID Text
ID-0001   'Peanut Butter Jelly'
ID-0002   Peanut
ID-0003   Butter
ID-0004   Storm
ID-0005   'Storm Wind Butter'", header = TRUE)
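The xtabs result above is a contingency table, so the original Text column is no longer attached to it. One way to keep both ID and Text, sketched here under the assumption that every ID contains at least one word, is to convert the table to a data frame and bind it back onto the input. The names words, long, tab and wide are illustrative only.

```r
# Same sample data as above, built directly as character columns
df <- data.frame(
  ID   = c("ID-0001", "ID-0002", "ID-0003", "ID-0004", "ID-0005"),
  Text = c("Peanut Butter Jelly", "Peanut", "Butter", "Storm", "Storm Wind Butter"),
  stringsAsFactors = FALSE
)

# Split on runs of spaces and/or commas, then build a long ID/Word table
words <- strsplit(df$Text, "[ ,]+")
long <- data.frame(
  ID   = rep(df$ID, lengths(words)),
  Word = unlist(words),
  stringsAsFactors = FALSE
)
long <- unique(long[nzchar(long$Word), ])   # drop empty tokens and repeated pairs

# Cross-tabulate, convert to a plain data frame, and align rows with df
tab  <- xtabs(~ ID + Word, data = long)
wide <- cbind(df, as.data.frame.matrix(tab)[df$ID, , drop = FALSE])
rownames(wide) <- NULL

wide
```

Indexing the converted table by df$ID keeps the rows in the same order as the input, so the indicator columns line up with the original strings.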
Maurits Evers
  • The xtabs solution works best for me, as I receive an error when using "pivot_wider". – AMWiedl Feb 26 '20 at 10:36
  • Hi @AMWiedl; you may have to update `tidyr`; `pivot_wider` was introduced in `tidyr_1.0.0` in September 2019, and is meant to replace `spread` (in the same way that `pivot_longer` replaces `gather`). – Maurits Evers Feb 26 '20 at 21:13

In base R your desired two-step solution would look like this:

# Extract all words, keep only unique words, sort in alphabetic order:
all_words <- sort(unique(unlist(strsplit(df$strings, "\\W"))))

# Fill columns with 1 or 0 depending on whether the word is present in each string
cbind(df, sapply(all_words, function(x) 1 * grepl(x, df$strings)))
#>       ID             strings Butter Jelly Peanut Storm Wind
#> 1 ID0001 Peanut Butter Jelly      1     1      1     0    0
#> 2 ID0002              Peanut      0     0      1     0    0
#> 3 ID0003              Butter      1     0      0     0    0
#> 4 ID0004               Storm      0     0      0     1    0
#> 5 ID0005   Storm Wind Butter      1     0      0     1    1

Data used:

df <- structure(list(ID = c("ID0001", "ID0002", "ID0003", "ID0004", 
      "ID0005"), strings = c("Peanut Butter Jelly", "Peanut", "Butter", 
      "Storm", "Storm Wind Butter")), class = "data.frame", row.names = c(NA, -5L))

Created on 2020-02-25 by the reprex package (v0.3.0)
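One caveat with the grepl step above: it matches substrings, so a word such as "Pea" would also be flagged in any string containing "Peanut", and splitting on a single "\\W" can yield empty tokens when separators repeat. A hedged sketch of a variant that compares whole tokens instead is shown below; the sample data is invented to show the difference, and the names tokens and result are illustrative.

```r
# Hypothetical data: "Pea" is a substring of "Peanut", separators are mixed
df <- data.frame(
  ID      = c("ID0001", "ID0002", "ID0003"),
  strings = c("Peanut Butter Jelly", "Pea,  Peanut", "Storm, Wind Butter"),
  stringsAsFactors = FALSE
)

# Step 1: tokenise once, splitting on runs of spaces and/or commas
tokens    <- strsplit(df$strings, "[ ,]+")
all_words <- sort(unique(unlist(tokens)))
all_words <- all_words[nzchar(all_words)]   # guard against empty tokens

# Step 2: a word counts only if it equals one of the string's tokens
indicator <- sapply(all_words, function(w) {
  as.integer(vapply(tokens, function(tok) w %in% tok, logical(1)))
})

result <- cbind(df, indicator)
result
```

Here the Pea column is 1 only for the second row, whereas substring matching would also flag the first row.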

Allan Cameron