I am processing a large file, I read in chucks of it and process it and save what I extract. Then after rm(list=ls())
to clear memory (sometime have to use .rs.restartR() as well but that is not of concern in this post), I run the same script after adding 1 in two numbers in my script.
This seemed like a opportunity to try writing a loop but - between trying to initialize all the object that are used in the loop and given that I am not very good with writing loops it got really confusing.
I posted this here to hear some suggestion, I apologize in advance if my question is too vague. Thanks.
####################### A:11
####################### B:12
# A I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*11, nrows = 166836, header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## First use colSums(), saves a numeric vector in `final_dfm_1`
## tib is the desired oject I will save with new name ea. time.
final_dfm_1 <- colSums(dfm_1)
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract 'the freq of each token'
# B Here I change the name `tib`` is saved uneder each time.
saveRDS(tib, file = "tiq12.Rda")
rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)
Below I will run the same script but change 11 to 12 in fread()
, and change 12 to 13 in saveRDS()
command.
####################### A:12
####################### b:13
# A I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*12, nrows = 166836, header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## Using colSums(), gives a numeric vector`final_dfm_1`
## tib is the desired oject I will save with new name each time.
final_dfm_1 <- colSums(dfm_1)
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract 'the freq of each token'
# B Here I change the name `tib`` is saved uneder each time.
saveRDS(tib, file = "tiq13.Rda")
rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)
Below is a list of all the objects (thanks this post) in my working environment, that are cleared from the working environment before running the the same chunk with A+1, and B+1.
Type Size Rows Columns
dfm_1 dfmSparse 174708600 166836 1731410
bi_tkn_one tokens 152494696 166836 NA
tib tbl_df 148109248 1731410 2
final_dfm_1 numeric 148108544 1731410 NA
text_tbl data.table 22485264 166836 1
I spent some time trying to figure out how to write this loop, found a post on SO about how to initialize a data.table
with a character
column, but there are still other objects that I think I need to initialize. I am unsure of how plausible it is to write such a loop.
I have copied and pasted the same script back-to-back as shown above and run it all at once. It's silly, since I am just adding one in two places.
Feel free comment on my approach, I would like to learn something out of this. Best
On a side note: I read about adding .rs.restartR()
to the loop, and came across post that suggested using batch-files or scheduling tasks in R, I will have to pass on learning those for now.