
I wrote a small R script. The input is text files (thousands of journal articles). I generated the metadata (including the publication year) from the file names. Now I want to calculate the total number of tokens per year, but I am not getting anywhere.

# Metadata from filenames
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_", 
                        docvarnames = c("Unit", "Year", "Volume", "Issue")) 
# we add some more metadata columns to the data frame
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Corpus
SPARA_corp <- corpus(rawdata_SPARA)

Does anyone here know a solution?

I used the tokens_by() function of the quanteda package, but it seems to be outdated.

Peter
    "Outdated"? The package on CRAN was [updated a few days ago](https://cran.r-project.org/web/packages/quanteda/index.html), and its [repo activity](https://github.com/quanteda/quanteda/commits/master) looks somewhat regular (if not highly frequent). If you aren't running `quanteda-3.2.4`, have you tried to update it? I see your code but I see no warnings/error, and since the question is not reproducible, I don't know offhand how to figure it out myself. Could you make this more reproducible? Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info – r2evans Dec 11 '22 at 18:13
  • `tokens_by` was never in the namespace of either **readtext** or **quanteda**. Whether your code works will depend on the structure of the filenames (not provided in your question) and the name of the `text_field` (also not in the question). – Ken Benoit Dec 11 '22 at 18:46

2 Answers


Thanks! I could not get your script to work, but it inspired me to develop an alternative solution:

# Load the necessary libraries
library(readtext)
library(dplyr)
library(quanteda)

# Set the directory containing the text files
data_dir <- "/Textfiles/SPARA_paragraphs"

# Read in the text files; the metadata (docvars) are taken from the file names
rawdata_SPARA <- readtext(file.path(data_dir, "*.txt"),
                          docvarsfrom = "filenames", dvsep = "_",
                          docvarnames = c("Unit", "Year", "Volume", "Issue"))

# Extract the year from the file name
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)

# Group the data by year and summarize by tokens
rawdata_SPARA_grouped <- rawdata_SPARA %>% 
    group_by(Year) %>% 
    summarize(tokens = sum(ntoken(text)))

# Print the absolute number of tokens per year
print(rawdata_SPARA_grouped)
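
As a cross-check, the same totals can be obtained without dplyr, working directly from a quanteda corpus. This is only a sketch and assumes the Year docvar was created from the file names as above:

# Alternative sketch: aggregate per-document token counts over the Year docvar
SPARA_corp <- corpus(rawdata_SPARA)
tokens_per_doc <- ntoken(SPARA_corp)
tapply(tokens_per_doc, docvars(SPARA_corp, "Year"), sum)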
Peter

You do not need the substring step substr(rawdata_SPARA$Year, 0, 4). When calling the readtext function, it extracts the year from the file name: in the example below the file names have a structure like EU_euro_2004_de_PSE.txt, and 2004 is automatically inserted into the readtext object as a docvar. Since the readtext object inherits from data.frame, you can use standard data manipulation functions on it, e.g. from the dplyr package.

Then just group_by year and summarize the token counts; the number of tokens per document is calculated with quanteda's ntoken() function.

See the code below:

library(readtext)
library(quanteda)
library(dplyr)

# Prepare sample corpus
set.seed(123)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1")
# Randomize the year docvar for this toy example
rt$year <- sample(2005:2007, nrow(rt), replace = TRUE)


# Calculate tokens
rt$tokens <- ntoken(corpus(rt), remove_punct = TRUE)

# Find distribution by year
rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))

Output:

# A tibble: 3 × 2
   year total_tokens
  <int>        <int>
1  2005         5681
2  2006        26564
3  2007        24119
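
If you want to stay entirely within quanteda, a grouped dfm gives the same per-year totals. This is only a sketch based on the example data above and the year docvar it creates:

# Sketch of a quanteda-only variant: group the dfm by year and count tokens
dfmat <- dfm(tokens(corpus(rt), remove_punct = TRUE))
dfmat_year <- dfm_group(dfmat, groups = dfmat$year)
ntoken(dfmat_year)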
Artem