
I have a lot of repetitive code that I think I could make more efficient with a for loop, but I have been struggling with how to create the resulting objects in R.

A folder called input has 10 files named 2010.txt, 2011.txt, ..., 2019.txt.

LOOP ONE

files <- list.files("../input")

#Each Year File Path

y2010 <- read_file(glue("../input/", files[1], sep = ""))
y2011 <- read_file(glue("../input/", files[2], sep = ""))
...
y2019 <- read_file(glue("../input/", files[10], sep = ""))

From this I would like to do the following:

##Dataframe of each year's data
all_text <- rbind(y2010,y2011,y2012,y2013,y2014,y2015,y2016,y2017,y2018,y2019)

LOOP TWO

Now I would like to take each year and make new "tok201x" objects.

###Each year
tok2010 <- data_frame(text = y2010) %>%
  unnest_tokens(word, text)

tok2011 <- data_frame(text = y2011) %>%
  unnest_tokens(word, text)

...

tok2019 <- data_frame(text = y2019) %>%
  unnest_tokens(word, text)

LOOP THREE

Lastly, take the "tok201x" objects and feed them into the sentiment code.


#2010
nrc2010 <- tok2010 %>%
  inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
  count(sentiment) %>% # count each 
  spread(sentiment, n, fill = 0)# made data wide rather than narrow

#2011
nrc2011 <- tok2011 %>%
  inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
  count(sentiment) %>% # count each 
  spread(sentiment, n, fill = 0)# made data wide rather than narrow

...

#2019
nrc2019 <- tok2019 %>%
  inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
  count(sentiment) %>% # count each 
  spread(sentiment, n, fill = 0)# made data wide rather than narrow

And have these all stored in a list.

I was playing around with assign() but it was not working out the way I hoped.

EDIT: Using @desval's code with lapply(), I broke the function up into two. The goal is to combine each resulting list into one data frame. How do I accomplish this?

custom.function1 <- function(x){
  #debug x <- files[1]
  tmp <- read_file(x)
  tmp <- tibble(text = tmp)
return(tmp)
}

custom.function2 <- function(x){
tmp <- tmp %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
  count(sentiment) %>% # count each 
  spread(sentiment, n, fill = 0)
return(tmp)
}

out1 <- lapply(files, function1)

##Take all year data and combine into one dataframe, previously...
outYEAR <- matrix(unlist(out1), ncol = 10, byrow = TRUE)
outYEAR <- outYEAR %>% 
    pivot_longer(everything(), names_to = 'year', values_to = 'text') 
##This does not work....

out2 <- lapply(out1, function2)

##Again, combine to one dataframe, previously...
out2YEAR <- matrix(unlist(out2), ncol = 10, byrow = TRUE)
out2YEAR <- out2YEAR %>% 
    pivot_longer(everything(), names_to = 'year', values_to = 'text') 
#THIS DOES NOT WORK.

The collective df's need to be "matrix" not "tbl_df".

Johnny Thomas

1 Answer

I think you might be better off using lapply. I am not sure why it is necessary to read in all the files, rbind them, and then separate them again. If it is not, something along these lines could work:

library(janeaustenr)
library(tidytext)
library(textdata)
library(tidyverse)
library(data.table)

# some generated data in your directory
d <-  tibble(txt = prideprejudice[1:10])
writeLines(d$txt, "2010.txt")
writeLines(d$txt, "2011.txt")

# list of files
files <- list.files(pattern = "\\d{4}")

custom.function1 <- function(x){
  tmp <- read_file(x)
  tmp <- tibble(text = tmp)
  return(tmp)
}
out1 <- lapply(files, custom.function1)


custom.function2 <- function(x){
  tmp <- x %>% unnest_tokens(word, text) %>%
    inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
    count(sentiment) %>% # count each 
    spread(sentiment, n, fill = 0)
  setDT(tmp)  # convert to data.table by reference (the bare `tmp <- setDT` returned the function itself)
  return(tmp)
}
out2 <- lapply(out1, custom.function2)
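
If you want to look results up by year rather than by position, one option is to name the list elements before calling lapply, since lapply keeps the names of its input. The following is only a sketch under assumptions: it uses base R's readLines as a stand-in for read_file, and the `input_demo` directory and its two files are made up for illustration.

```r
# Sketch only: demo directory and file contents are invented for illustration.
input_dir <- file.path(tempdir(), "input_demo")
dir.create(input_dir, showWarnings = FALSE)
writeLines("it is a truth universally acknowledged",
           file.path(input_dir, "2010.txt"))
writeLines("that a single man in possession of a good fortune",
           file.path(input_dir, "2011.txt"))

# full.names = TRUE returns usable paths, so no paste()/glue() step is needed
files <- list.files(input_dir, pattern = "^[0-9]{4}\\.txt$", full.names = TRUE)

# Derive the year from each file name and use it as the element name
years <- tools::file_path_sans_ext(basename(files))
texts <- lapply(setNames(files, years), function(path) {
  paste(readLines(path), collapse = "\n")  # base R stand-in for read_file()
})

names(texts)     # one name per year
texts[["2011"]]  # look up a year by name instead of keeping a y2011 object
```

A named list like this replaces the separate y2010 ... y2019 objects, and any later lapply over it carries the year names along.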

Now row-bind them (this may be possible without data.table, but it is very convenient):

# idcol adds a column recording which list element each row came from
out1_all <- rbindlist(lapply(out1, setDT), idcol = "id_var")

out2_all <- rbindlist(lapply(out2, setDT), idcol = "id_var")
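
If the combined result has to be a matrix rather than a tbl_df or data.table, as.matrix() on the combined table should be enough, since it accepts data frames, tibbles and data.tables. Below is a minimal base R sketch of the whole combine step. `out2_demo` and its counts are hypothetical stand-ins for `out2`, and the zero-filling handles the case where one year lacks a sentiment that another year has.

```r
# Hypothetical per-year sentiment counts standing in for out2
# (one single-row data frame per year; the numbers are invented).
out2_demo <- list(
  "2010" = data.frame(anger = 3, joy = 5),
  "2011" = data.frame(anger = 1, joy = 2, trust = 4)
)

# Give every table the same columns, filling absent sentiments with 0
all_cols <- unique(unlist(lapply(out2_demo, names)))
aligned <- lapply(out2_demo, function(df) {
  for (nm in setdiff(all_cols, names(df))) {
    df[[nm]] <- 0
  }
  df[all_cols]
})

# Stack the rows into one data frame, then convert it to a matrix
sentiment_df <- do.call(rbind, aligned)
sentiment_mat <- as.matrix(sentiment_df)

is.matrix(sentiment_mat)
sentiment_mat
```

The matrix(unlist(...), ncol = 10) approach from the question loses the column names and depends on every year having the same number of values, which is why binding first and converting afterwards tends to be safer.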
desval
  • I also added a version with loops, and one object for each of the operations. – desval Apr 17 '20 at 17:28
  • Getting this error on second for loop: no applicable method for 'unnest_tokens_' applied to an object of class "function" – Johnny Thomas Apr 17 '20 at 17:36
  • Sorry, I pasted the wrong version; now it should work. The main difference was g2[[i]] vs g2[i]. That is one of the advantages of using lapply: you don't need to create any list, because it is created automatically when you apply the function to different objects. – desval Apr 17 '20 at 17:46
  • Okay, I see now; I think you are right, the lapply() route is better. I need to break these up into two distinct data frames though: one with all the text and one with the sentiments. I think I wrote the functions correctly, but I tried as.data.frame(out1) and it did not work correctly. See the edit to the post. – Johnny Thomas Apr 17 '20 at 17:56
  • setDT will not work it seems, error on last line :: Argument 'x' to 'setDT' should be a 'list', 'data.frame' or 'data.table' – Johnny Thomas Apr 17 '20 at 18:45
  • `lapply` is a loop. While it's possible to write loops in a bad way so that they are slow (like `rbind` inside the loop), loops are not generally slower than `lapply`. See the 10 year old FAQ [Is R's apply family more than just syntactic sugar](https://stackoverflow.com/a/2276001/903061), or the [Iteration Section in R for Data Science](https://r4ds.had.co.nz/iteration.html#the-map-functions), which says *"Some people will tell you to avoid for loops because they are slow. They’re wrong! (Well at least they’re rather out of date, as for loops haven’t been slow for many years.)"* – Gregor Thomas Apr 17 '20 at 18:46
  • @desval I've tried a few things like as.data.frame() but it needs to be a matrix not tbl_df – Johnny Thomas Apr 18 '20 at 00:04
  • @GregorThomas Thanks, I deleted that part of the answer. I was taught that some years ago and never really questioned it. – desval Apr 18 '20 at 08:30
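
The `g2[[i]]` versus `g2[i]` distinction raised in the comments can be shown with a small base R sketch. The loop version of the answer is no longer in the post, so `g2` and `inputs` here are hypothetical and only mirror the naming used in the comment.

```r
# Hypothetical data; g2 mirrors the list name mentioned in the comments.
inputs <- list(c(1, 2, 3), c(10, 20))
g2 <- vector("list", length(inputs))  # a for loop needs the list created up front

for (i in seq_along(inputs)) {
  # [[i]] extracts or assigns the element itself; [i] would give a sub-list
  g2[[i]] <- sum(inputs[[i]])
}

class(inputs[1])    # single brackets: still a list
class(inputs[[1]])  # double brackets: the numeric vector inside

# lapply builds the same result without pre-allocation or explicit indexing
g2_lapply <- lapply(inputs, sum)
identical(g2, g2_lapply)
```

Passing `inputs[i]` (a one-element list) to a function that expects the element itself is a typical source of "no applicable method" errors like the one reported above.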