Questions tagged [tidytext]

The tidytext package provides tools for text mining using tidy data principles in R.

The R tidytext package, developed by Julia Silge and David Robinson, provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. When text is in a tidy data structure, tools from the R tidyverse ecosystem like dplyr can be used for effective data handling and analysis.

Repositories

Vignettes

Other resources

Text Mining with R: A Tidy Approach

Related tags

R's tm, quanteda, dplyr, tidyr, and broom packages

294 questions

3 answers

Having trouble viewing more than 10 rows in a tibble

First off - I am a beginner at programming and R, so excuse me if this is a silly question. I am having trouble viewing more than ten rows in a tibble that is generated from the following code. The code below is meant to find the most common words…

r dplyr tidyverse tibble tidytext

asked Mar 06 '18 at 02:05

Meraj Shah

1 answer

ggplot 'non-finite values' error

I have an R dataframe (df) that looks like this:

blogger; word; n; total
joe; dorothy; 17; 718
paul; sheriff; 10; 354
joe; gray; 9; 718
joe; toto; 9; 718
mick; robin; 9; 607
paul; robin; 9; 354
...

I want to use ggplot2 to plot n divided by total…

r ggplot2 tidyverse tidytext

asked Apr 18 '17 at 15:02

Simon Lindgren

2 answers

Opposite of unnest_tokens

This is most likely a stupid question, but I've googled and googled and can't find a solution. I think it's because I don't know the right way to word my question to search. I have a data frame that I have converted to tidy text format in R to get…

r tidyr tidyverse tidytext

asked Oct 13 '17 at 16:44

Kate

2 answers

Preserve punctuations using unnest_tokens() in tidytext in R

I am using the tidytext package in R to do n-gram analysis. Since I analyze tweets, I would like to preserve @ and # to capture mentions, retweets, and hashtags. However, the unnest_tokens function automatically removes all punctuation and converts text…

r twitter text-mining punctuation tidytext

asked Jun 12 '17 at 23:23

JungHwan Yang

1 answer

TidyText Clustering

I want to cluster similar words using R and the tidytext package. I have created my tokens and would now like to convert them to a matrix in order to cluster them. I would like to try out a number of token techniques to see which provides the…

r cluster-analysis tidytext

asked Feb 03 '21 at 15:48

John Smith

1 answer

tidytext, quanteda, and tm returning different tf-idf scores

I am trying to work on a tf-idf weighted corpus (where I expect tf to be a proportion by document rather than a simple count). I would expect the same values to be returned by all the classic text mining libraries, but I am getting different values. Is…

r text-mining tm quanteda tidytext

asked Feb 15 '18 at 11:56

Radim

1 answer

How to use bigrams and trigrams using tidy text

I'm trying to use both bigrams and trigrams with tidytext. What code could I use for the token to look for 2 and 3 words? This is the code for using bigrams only: library(tidytext) library(janeaustenr) austen_bigrams <- austen_books() %>% …

r token tidytext

asked Aug 13 '17 at 18:21

Claudia

1 answer

How to cast data from long to wide format in H2O?

I have data in a normalised, tidy "long" data structure I want to upload to H2O and if possible analyse on a single machine (or have a definitive finding that I need more hardware and software than currently available). The data is large but not…

r sparse-matrix tidyr h2o tidytext

asked Dec 27 '16 at 06:26

Peter Ellis

5 answers

Filter all rows with word next to a specified word in R

I have a column with string content temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price", "grocery offers today low price", "tide soap", "tide soap bar", "tide detergent powders 2kg", NA, "tide", "tide detergent…

r tidyverse tidyr tidytext

asked Feb 13 '20 at 12:48

Vaibhav Singh

2 answers

Numbers of columns of arguments do not match

I am using this example to conduct sentiment analysis of a collection of txt documents in R. The code is: library(tm) library(tidyverse) library(tidytext) library(glue) library(stringr) library(dplyr) library(wordcloud) require(reshape2) files <-…

r tidyverse sentiment-analysis tidytext

asked Jun 12 '18 at 15:40

Michael

3 answers

tidytext R in spanish - any alternative?

I'm doing sentiment analysis on tweets from Twitter, but my tweets are in Spanish, so I can't use tidytext to classify the words. Does anyone know if there is a similar package for Spanish?

r sentiment-analysis tidytext

asked Nov 02 '17 at 12:21

Suanbit

2 answers

How to apply stopwords accurately in French using R

I'm trying to pull a book using the Gutenberg library and then remove French stopwords. I've been able to do this accurately in English by doing this: twistEN <- gutenberg_download(730) twistEN <- twistEN[118:nrow(twistEN),] twistEN <- twistEN %>% …

r stop-words tidytext project-gutenberg

asked Sep 21 '19 at 01:04

Litmon

3 answers

could not find function "unnest_tokens"

I'm trying to split a column into tokens using the tokenizers package but I keep receiving an error: could not find function "unnest_tokens". I am using R 3.5.3 and have installed and reinstalled dplyr, tidytext, tidyverse, tokenizers, tidyr, but…

r tidytext unnest

asked Apr 19 '19 at 18:24

GoodbyeJane

1 answer

Removing ngrams containing stopwords using tidytext

UPDATE: Thanks for the input so far. I have rewritten the question and added a better example to highlight the implicit requirements that were not covered in my first example. Question: I am looking for a general tidy solution to removing ngrams…

r tidyverse tidytext

asked Mar 20 '19 at 15:11

Benjamin Schwetz

1 answer

How to remove specific words in a column

I have a column consisting of several Country Offices associated with a company, which I would like to shorten, e.g. China Country Office and Bangladesh Country Office to just China or Bangladesh. In other words, removing the words "Office" and…

r string tm tidytext

asked Apr 23 '18 at 16:19

BloopFloopy
