
I am doing topic modeling and need to remove certain characters. Specifically, bullet points remain in my list of terms.

USAID_stops <- c("performance", "final", "usaidgov", "kaves", "evaluation", "*", "[[:punct:]]", "U\2022")
#for (i in 1:length(chapters_1)) {

  a <- SimpleCorpus(VectorSource(chapters_1[1]))
  dtm_questions <- DocumentTermMatrix(a)
  report_topics <- LDA(dtm_questions, k = 4)
  topic_weights <- tidy(report_topics, matrix = "beta")
  top_terms <- topic_weights %>%
    group_by(topic) %>%
    slice_max(beta, n = 10) %>% 
    ungroup() %>%
    arrange(topic, -beta) %>%
    filter(!term %in% stop_words$word) %>%
    filter(!term %in% USAID_stops)
topic term    beta
<int> <chr>   <dbl>
1   chain     0.009267748       
2   •         0.009766040       
2   chain     0.009593995       
2   change    0.008294549       
3   nutrition 0.017117040       
3   related   0.009621772       
3   strategy  0.008523203       
4   •         0.021312755       
4   chain     0.010974153       
4   ftf       0.008146484   

The bullets remain in the output. How, and at which step, can I remove them?

AndrewGB
Matt Dietz
  • [See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with. It's probably better to clean the bullets and punctuation earlier in the process rather than from this output data – camille Dec 28 '21 at 17:49
  • I think you want to remove them from `a` after you create the corpus. You don't want to do it after running the analysis. Take a look at `tm_map()`. –  Dec 28 '21 at 17:51

1 Answer

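You can use `mutate()` and `str_remove()` to remove the bullets from the `term` column.

library(tidyverse)

df %>%
  # Only touch `term`; across(everything(), ...) would also coerce topic and beta to character
  mutate(across(term, ~ str_remove(., "•")))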

Output

   topic      term        beta
1      1     chain 0.009267748
2      2           0.009766040
3      2     chain 0.009593995
4      2    change 0.008294549
5      3 nutrition 0.017117040
6      3   related 0.009621772
7      3  strategy 0.008523203
8      4           0.021312755
9      4     chain 0.010974153
10     4       ftf 0.008146484

Or you can use gsub from base R.

df$term <- gsub("•","",as.character(df$term))
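If the goal is to drop the bullet rows entirely rather than leave an empty `term`, filtering on the Unicode escape is one option. The following is a minimal sketch in base R on a cut-down copy of the example data; the names `bullet` and `df_no_bullets` are made up for illustration. Note that `"\u2022"` is R's escape for the bullet character, whereas the `"U\2022"` entry in the question's stop list is a different string and will not match it.

```r
# Cut-down copy of the example data, with the bullet written as a Unicode escape
df <- data.frame(
  topic = c(1, 2, 2, 4),
  term  = c("chain", "\u2022", "change", "\u2022"),
  beta  = c(0.009267748, 0.00976604, 0.008294549, 0.021312755),
  stringsAsFactors = FALSE
)

# "\u2022" is the bullet character (U+2022)
bullet <- "\u2022"

# Keep only the rows whose term is not a bare bullet
df_no_bullets <- df[df$term != bullet, ]
df_no_bullets
```

The same comparison should also work inside the question's pipeline, for example as `filter(term != "\u2022")`, assuming the bullet is stored as a standalone term as in the output shown.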

You could also replace the bullets earlier, before running LDA, by editing the terms of the document-term matrix.

dtm_questions[["dimnames"]][["Terms"]] <- 
  gsub("•","NA",dtm_questions[["dimnames"]][["Terms"]])
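The comments on the question suggest going one step further back and cleaning the corpus itself with `tm_map()`, so the bullet never becomes a term. Here is a rough sketch of that idea, assuming the `tm` package is installed; `chapter_text` is a made-up stand-in for one element of `chapters_1`, and `strip_bullets` is a hypothetical helper name.

```r
library(tm)

# Made-up stand-in for one element of `chapters_1`
chapter_text <- "\u2022 value chain strategy \u2022 nutrition related change"

a <- SimpleCorpus(VectorSource(chapter_text))

# Replace the bullet (U+2022) with a space before the text is tokenised
strip_bullets <- content_transformer(
  function(x) gsub("\u2022", " ", x, fixed = TRUE)
)
a_clean <- tm_map(a, strip_bullets)

# Build the document-term matrix from the cleaned corpus
dtm_questions <- DocumentTermMatrix(a_clean)

# Check whether the bullet is still among the terms
"\u2022" %in% Terms(dtm_questions)
```

Doing the replacement on the corpus means the LDA never sees the bullet, so no topic weight is spent on it, whereas editing the output only hides the row after the fact.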

If you want to replace the bullets with something else, then you can do this:

df %>% 
  mutate(across(term, ~ str_replace(., "•", "NA")))

# Or in base R
df$term <- gsub("•","NA",as.character(df$term))

Output

   topic      term        beta
1      1     chain 0.009267748
2      2        NA 0.009766040
3      2     chain 0.009593995
4      2    change 0.008294549
5      3 nutrition 0.017117040
6      3   related 0.009621772
7      3  strategy 0.008523203
8      4        NA 0.021312755
9      4     chain 0.010974153
10     4       ftf 0.008146484

Data

df <-
  structure(list(
    topic = c(1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
    term = c(
      "chain", "•", "chain", "change", "nutrition",
      "related", "strategy", "•",  "chain", "ftf"
    ),
    beta = c(
      0.009267748, 0.00976604, 0.009593995, 0.008294549,
      0.01711704, 0.009621772, 0.008523203, 0.021312755,
      0.010974153, 0.008146484
    )
  ),
  class = "data.frame",
  row.names = c(NA, -10L))
AndrewGB
  • This could work, but I suspect you'd want to do this before the LDA. –  Dec 28 '21 at 17:53
  • @Adam I agree, but the OP didn't include a reproducible example. However, I updated my answer to replace the bullets after `dtm_questions <- DocumentTermMatrix(a)`, as a possibility. Thanks! – AndrewGB Dec 28 '21 at 18:11