
I'm a student of literature lost in data science. I'm trying to analyse a corpus of 70 .txt files, which are all in one directory.

My final goal is to get a table containing the filename (or something similar), the sentence and word counts, a Flesch-Kincaid readability score and a MTLD lexical diversity score.

I've found the packages koRpus and tm (and tm.plugin.koRpus) and have tried to understand their documentation, but haven't gotten far. With the help of the RKWard IDE and the koRpus plugin I manage to get all of these measures for one file at a time and can copy that data into a table manually, but that is very cumbersome and still a lot of work.

What I've tried so far is this command to create a corpus of my files:

    simpleCorpus(dir = "/home/user/files/", lang = "en", tagger = "tokenize",
                 encoding = "UTF-8", pattern = NULL, recursive = FALSE,
                 ignore.case = FALSE, mode = "text", source = "Wikipedia",
                 format = "file", mc.cores = getOption("mc.cores", 1L))

But I always get the error:

    Error in data.table(token = tokens, tag = "unk.kRp") : column or argument 1 is NULL

If someone could help an absolute newbie to R I'd be incredibly grateful!

SamVimes
    Welcome to Stack Overflow, please take a time to go through [the welcome tour](https://stackoverflow.com/tour) to know your way around here (and also to earn your first badge), read how to [create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) and also check [How to Ask Good Questions](https://stackoverflow.com/help/how-to-ask) so you increase your chances to get feedback and useful answers. – Josef Adamcik Jul 24 '17 at 14:43
  • It seems like your question is more like a series of questions, most of which, I'd guess, already have answers. The first one, f.ex.: [how to read in multiple text files](https://stackoverflow.com/questions/3397885/how-do-you-read-in-multiple-txt-files-into-r). Start at the beginning and solve one problem at a time. If you hit a real snag, then ask a question. "Teach me how to code" type of questions has no place on StackOverflow. – AkselA Jul 24 '17 at 16:01
  • I'm sorry for that, I'll try to edit my post and make my problem a bit clearer. (The problem is not reading in texts per se, but getting them into koRpus.) – SamVimes Jul 24 '17 at 16:08
  • If you `setwd()` to the directory containing the files and run `corp <- simpleCorpus(lang="en", tagger="tokenize")`, what happens? – AkselA Jul 24 '17 at 16:37
  • I then also get the error: `Fehler in data.table(token = tokens, tag = "unk.kRp") : column or argument 1 is NULL` – SamVimes Jul 24 '17 at 16:39

3 Answers


This is a very comprehensive walkthrough; I would go through it step by step if I were you.

http://tidytextmining.com/tidytext.html

pyll
  • Thank you! I've already had a look at that, but I'm interested in analyses on the sentence/text level rather than on the word level, so I think it is not fitting for my use case. I have saved the link nevertheless, because it could really be interesting in the future. – SamVimes Jul 24 '17 at 16:04

I have found the solution with the help of unDocUMeantIt, the author of the package (thank you!). An empty file in the directory caused the error; after removing it I managed to get everything running.
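If anyone else runs into the same error, a quick check for empty files before building the corpus might save some debugging. This is a minimal sketch in base R, assuming the files live in `/home/user/files/` as in the question:

    # List all .txt files in the corpus directory and check their sizes;
    # zero-byte files are what triggered the data.table() error above.
    files <- list.files("/home/user/files/", pattern = "\\.txt$", full.names = TRUE)
    empty <- files[file.size(files) == 0]
    empty               # inspect which files are empty before deleting anything
    file.remove(empty)  # remove them, then rebuild the corpus

`file.size()` requires R >= 3.2.0; on older versions, `file.info(files)$size` gives the same information.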

SamVimes

I suggest you take a look at our vignette for quanteda, "Digital Humanities Use Case: Replication of analyses from Text Analysis with R for Students of Literature", which replicates Matthew Jockers' book of the same title.

For what you are looking for above, the following would work:

require(readtext)
require(quanteda)

# reads in all of your texts and puts them into a corpus
mycorpus <- corpus(readtext("/home/user/files/*"))

# sentence and word counts
(output_df <- summary(mycorpus))

# to compute Flesch-Kincaid readability on the texts
textstat_readability(mycorpus, "Flesch.Kincaid")

# to compute lexical diversity on the texts
textstat_lexdiv(dfm(mycorpus))

The textstat_lexdiv() function does not currently have MTLD, but we are working on it, and it does have a half dozen others.
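Since your final goal is a single table per file, the pieces above can be merged by document name. A sketch, with the caveat that the exact column names (`Text` in the `summary()` output, `document` in the `textstat_*()` outputs) may differ between quanteda versions, so check them with `names()` first:

    require(readtext)
    require(quanteda)

    mycorpus <- corpus(readtext("/home/user/files/*"))

    # Per-document sentence and token counts
    counts <- summary(mycorpus)

    # Per-document Flesch-Kincaid scores
    fk <- textstat_readability(mycorpus, "Flesch.Kincaid")

    # Merge on the document identifier to get one row per file
    result <- merge(counts, fk, by.x = "Text", by.y = "document")
    result

Until MTLD lands in textstat_lexdiv(), the MTLD() function from koRpus could fill that column, looping over the documents one at a time.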

Ken Benoit