
I have a list of names (e.g. authors) and a PDF file that contains those names. I need to count how many times each author is mentioned in the PDF file.

Let's say my table of authors is named "author" and the PDF text is stored in "pdf" (I have already converted the PDF and stored it in R using pdf_text).

I've tried the following:

author$count <- 0
author$count <- for (i in author$name) { sum(str_count(pdf, i))}

But it didn't work. When I printed author$count, the results were NULL. Is there a way to fix this?

user438383
Emma
  • Unfortunately a `for` loop doesn’t have a value. You can use one of the *apply functions instead, e.g. `vapply`. Or the `map_dbl` function from ‘purrr’. – Konrad Rudolph Sep 29 '21 at 18:02
  • [See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with. Right now we don't have any data, so we can't run your code, and it's unclear what exactly you're trying to get or what hasn't worked – camille Sep 29 '21 at 18:08
  • Hi, thanks so much! I'm new, so I don't really know how to produce a sample data set, but my problem is this: I have a data frame listing many famous authors and a (very long) PDF file that introduces authors from around the world. My task is to count how many times each author in my data frame is mentioned in the PDF file and plot the counts in a graph. @KonradRudolph can you please tell me more about how to use sapply in this case, given that I need two functions, sum and str_count? – Emma Sep 29 '21 at 18:38

2 Answers


Unlike most other functions, for does not return a useful value in R (it always evaluates to NULL), which unfortunately makes it much less useful here. Instead, in most situations one of the vector mapping functions (lapply, vapply etc.) is more suitable to the task.
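To make the symptom from the question concrete, here is a minimal sketch with made-up names (the data frame below is an assumption, not the asker's data). It illustrates that the loop evaluates to NULL, and that assigning NULL to a data frame column removes the column, which would explain why author$count printed as NULL.

```r
# Toy data frame; the names are invented for illustration.
author <- data.frame(name = c("Austen", "Orwell"), stringsAsFactors = FALSE)
author$count <- 0

# The loop body runs, but the loop as a whole evaluates to NULL.
result <- for (i in author$name) nchar(i)
is.null(result)              # TRUE

# Assigning NULL to a column deletes it, so the earlier zeros are gone too.
author$count <- result
"count" %in% names(author)   # FALSE
```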

In your case, vapply does the trick:

author$count <- vapply(author$name, \(i) sum(str_count(pdf, i)), integer(1L))

(If you’re using a version of R older than 4.1, you need to replace \(i) with function (i).)

Note that you do not need to assign 0 to author$count beforehand. That value would be overwritten anyway.
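Since the question has no sample data, here is a small self-contained sketch of the same approach. The page text and author names are invented stand-ins, and two details are my own additions rather than part of the answer above: fixed() from stringr, so that names containing characters such as "." are matched literally instead of as regular expressions, and USE.NAMES = FALSE, so that the new column is a plain unnamed vector.

```r
library(stringr)

# Hypothetical stand-in for the output of pdf_text(): one string per page.
pdf <- c("Austen wrote novels. Orwell wrote essays.",
         "Orwell also wrote novels; Austen did not write essays.")
author <- data.frame(name = c("Austen", "Orwell", "Tolstoy"),
                     stringsAsFactors = FALSE)

# For each name, count the matches on every page and add them up.
author$count <- vapply(
  author$name,
  function(nm) sum(str_count(pdf, fixed(nm))),
  integer(1L),
  USE.NAMES = FALSE
)
author
```

A name that never occurs, such as "Tolstoy" here, simply gets a count of 0.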


A note on vapply vs. sapply

vapply ensures that the result of the function call actually conforms to the expected format (here: integer(1L), i.e. every element is a single integer). sapply doesn’t do this, which makes using sapply risky in non-interactive code, since it won’t notify you if there’s an error with the data. purrr::map_* behaves similarly to vapply.
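The difference can be illustrated with a deliberately misbehaving function. This is only a sketch: count_or_oops is a hypothetical helper that returns two values for one of its inputs.

```r
# Hypothetical helper: returns two values for "b", one value otherwise.
count_or_oops <- function(x) if (x == "b") c(1L, 2L) else 1L

# sapply() quietly falls back to a list when the result lengths differ.
loose <- sapply(c("a", "b"), count_or_oops)
is.list(loose)   # TRUE

# vapply() raises an error, because "b" does not yield a single integer.
strict <- tryCatch(
  vapply(c("a", "b"), count_or_oops, integer(1L)),
  error = function(e) "error"
)
strict           # "error"
```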

Konrad Rudolph

We need to assign within the loop. Also, loop over the sequence of indices so that each result can be assigned to its own position in author$count:

for(i in seq_along(author$name)) {
     author$count[i] <- sum(str_count(pdf, author$name[i]))
}
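
As a runnable sketch with invented data (the page text and names below are assumptions), note that i in this loop is a position (1, 2, ...), not a name, so the pattern has to be looked up as author$name[i]:

```r
library(stringr)

# Made-up page text standing in for the output of pdf_text().
pdf <- c("Austen 1813", "Orwell 1949, Orwell 1945")
author <- data.frame(name = c("Austen", "Orwell"), stringsAsFactors = FALSE)
author$count <- 0L

for (i in seq_along(author$name)) {
  # i is the row position; author$name[i] is the name to search for.
  author$count[i] <- sum(str_count(pdf, author$name[i]))
}
author
```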
akrun
  • Hi, thank you so much! When I did what you said, the NULL result was gone (yes!), but now all the results equal 0, whereas if I don't store the result in a variable, it displays different results. What could be the problem here? – Emma Sep 29 '21 at 18:32
  • @Emma It implies that there is no pattern in the `pdf` that matches the 'name' element – akrun Sep 29 '21 at 18:33
  • @Emma without a small reproducible example it is not clear why it wouldn't match – akrun Sep 29 '21 at 18:34
  • okay thanks so much, I will try to give a small reproducible example – Emma Sep 29 '21 at 18:40
  • Hi @akrun, thank you so much I found my problem! In my code str_count, I put pattern = i only, instead of pattern = author$name[i] as your suggestion. When I changed it to author$name[i], it worked! So could you please explain more for me the reason why we need to identify the author$name column here? – Emma Sep 30 '21 at 10:23