Questions tagged [tidytext]

The tidytext package provides tools for text mining using tidy data principles in R.

The R tidytext package, developed by Julia Silge and David Robinson, provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. When text is in a tidy data structure, tools from the R tidyverse ecosystem can be used for effective data handling and analysis.
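As a rough illustration of the "to and from tidy formats" idea, here is a minimal sketch. The two-document corpus and the names `docs`, `tidy_words`, `word_counts`, and `doc_term` are made up for this example; only `unnest_tokens()`, `cast_sparse()`, and the dplyr verbs are real package functions. It assumes the dplyr, tidytext, and Matrix packages are installed.

```r
library(dplyr)
library(tidytext)

# A tiny made-up corpus: one row per document (illustrative data only).
docs <- tibble(
  doc_id = c(1, 2),
  text = c("Tidy data makes text mining easier.",
           "Text mining with tidy tools is fun!")
)

# To tidy format: one lowercased word per row, punctuation stripped.
tidy_words <- docs %>%
  unnest_tokens(word, text)

# Once the text is tidy, ordinary dplyr verbs apply, e.g. counting words.
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

# From tidy format: cast per-document counts to a sparse document-term
# matrix that non-tidy text mining tools can consume.
doc_term <- tidy_words %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)
```

`cast_sparse()` is used here only because it needs nothing beyond the Matrix package; tidytext also offers `cast_dtm()` and `cast_dfm()` when the target is tm or quanteda.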

294 questions
24
votes
3 answers

Having trouble viewing more than 10 rows in a tibble

First off - I am a beginner at programming and R, so excuse me if this is a silly question. I am having trouble viewing more than ten rows in a tibble that is generated from the following code. The code below is meant to find the most common words…
Meraj Shah
  • 243
  • 1
  • 2
  • 4
18
votes
1 answer

ggplot 'non-finite values' error

I have an R dataframe (df) that looks like this: blogger; word; n; total joe; dorothy; 17; 718 paul; sheriff; 10; 354 joe; gray; 9; 718 joe; toto; 9; 718 mick; robin; 9; 607 paul; robin; 9; 354 ... I want to use ggplot2 to plot n divided by total…
Simon Lindgren
  • 2,011
  • 12
  • 32
  • 46
10
votes
2 answers

Opposite of unnest_tokens

This is most likely a stupid question, but I've googled and googled and can't find a solution. I think it's because I don't know the right way to word my question to search. I have a data frame that I have converted to tidy text format in R to get…
Kate
  • 512
  • 4
  • 12
8
votes
2 answers

Preserve punctuations using unnest_tokens() in tidytext in R

I am using the tidytext package in R to do n-gram analysis. Since I analyze tweets, I would like to preserve @ and # to capture mentions, retweets, and hashtags. However, the unnest_tokens function automatically removes all punctuation and converts text…
JungHwan Yang
  • 181
  • 2
  • 5
6
votes
1 answer

TidyText Clustering

I want to cluster words that are similar using R and the tidytext package. I have created my tokens and would now like to convert it to a matrix in order to cluster it. I would like to try out a number of token techniques to see which provides the…
John Smith
  • 2,448
  • 7
  • 54
  • 78
6
votes
1 answer

tidytext, quanteda, and tm returning different tf-idf scores

I am trying to work on a tf-idf weighted corpus (where I expect tf to be a proportion by document rather than a simple count). I would expect the same values to be returned by all the classic text mining libraries, but I am getting different values. Is…
Radim
  • 455
  • 2
  • 11
6
votes
1 answer

How to use bigrams and trigrams using tidy text

I'm trying to use both bigrams and trigrams with tidytext. What code could I use for the token to look for 2 and 3 words? This is the code for using bigrams only: library(tidytext) library(janeaustenr) austen_bigrams <- austen_books() %>% …
Claudia
  • 105
  • 2
  • 7
6
votes
1 answer

How to cast data from long to wide format in H2O?

I have data in a normalised, tidy "long" data structure I want to upload to H2O and if possible analyse on a single machine (or have a definitive finding that I need more hardware and software than currently available). The data is large but not…
Peter Ellis
  • 5,694
  • 30
  • 46
5
votes
5 answers

Filter all rows with word next to a specified word in R

I have a column with string content temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price", "grocery offers today low price", "tide soap", "tide soap bar", "tide detergent powders 2kg", NA, "tide", "tide detergent…
Vaibhav Singh
  • 1,159
  • 1
  • 10
  • 25
5
votes
2 answers

Numbers of columns of arguments do not match

I am using this example to conduct sentiment analysis of a collection of txt documents in R. The code is: library(tm) library(tidyverse) library(tidytext) library(glue) library(stringr) library(dplyr) library(wordcloud) require(reshape2) files <-…
Michael
  • 159
  • 1
  • 2
  • 14
5
votes
3 answers

tidytext R in spanish - any alternative?

I'm doing sentiment analysis on Twitter data, but my tweets are in Spanish, so I can't use tidytext to classify the words. Does anyone know if there is a similar package for Spanish?
Suanbit
  • 471
  • 1
  • 4
  • 12
4
votes
2 answers

How to apply stopwords accurately in French using R

I'm trying to pull a book using the Gutenberg library and then remove French stopwords. I've been able to do this accurately in English by doing this: twistEN <- gutenberg_download(730) twistEN <- twistEN[118:nrow(twistEN),] twistEN <- twistEN %>% …
Litmon
  • 247
  • 3
  • 18
4
votes
3 answers

could not find function "unnest_tokens"

I'm trying to split a column into tokens using the tokenizers package but I keep receiving an error: could not find function "unnest_tokens". I am using R 3.5.3 and have installed and reinstalled dplyr, tidytext, tidyverse, tokenizers, tidyr, but…
GoodbyeJane
  • 63
  • 1
  • 5
4
votes
1 answer

Removing ngrams containing stopwords using tidytext

UPDATE: Thanks for the input so far. I have rewritten the question and added a better example to highlight the implicit requirements that were not covered in my first example. Question: I am looking for a general tidy solution to removing ngrams…
Benjamin Schwetz
  • 624
  • 5
  • 17
4
votes
1 answer

How to remove specific words in a column

I have a column consisting of several Country Offices associated with a company, which I would like to shorten, e.g. China Country Office and Bangladesh Country Office to just China or Bangladesh. In other words, removing the words "Office" and…
BloopFloopy
  • 139
  • 1
  • 2
  • 12