
I am processing a large file: I read in a chunk of it, process it, and save what I extract. Then, after rm(list=ls()) to clear memory (sometimes I have to use .rs.restartR() as well, but that is not the concern of this post), I run the same script again after adding 1 to two numbers in it.

This seemed like an opportunity to try writing a loop, but between trying to initialize all the objects used in the loop and not being very good at writing loops, it got really confusing.

I posted this here to hear some suggestions. I apologize in advance if my question is too vague. Thanks.

#######################         A:11
#######################         B:12

                # A    I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*11, nrows = 166836, header = FALSE, col.names = "text")



bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)

dfm_1 <- dfm(bi_tkn_one)

## First use colSums(), which saves a numeric vector in `final_dfm_1`
## tib is the desired object I will save with a new name each time.

final_dfm_1 <- colSums(dfm_1)


tib <- tbl_df(final_dfm_1) %>% add_rownames()  
# This is what I wanted to extract 'the freq of each token'


            # B    Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq12.Rda")

rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)

Below I run the same script, but change 11 to 12 in fread() and 12 to 13 in the saveRDS() command.

#######################         A:12
#######################         B:13

            # A    I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*12, nrows = 166836, header = FALSE, col.names = "text")



bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)

dfm_1 <- dfm(bi_tkn_one)

## Using colSums() gives a numeric vector, `final_dfm_1`
## tib is the desired object I will save with a new name each time.

final_dfm_1 <- colSums(dfm_1)


tib <- tbl_df(final_dfm_1) %>% add_rownames()  
# This is what I wanted to extract 'the freq of each token'


            # B    Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq13.Rda")

rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)

Below is a list of all the objects in my working environment (thanks to this post), which are cleared from the working environment before running the same chunk with A+1 and B+1.

                  Type      Size    Rows Columns
dfm_1        dfmSparse 174708600  166836 1731410
bi_tkn_one      tokens 152494696  166836      NA
tib             tbl_df 148109248 1731410       2
final_dfm_1    numeric 148108544 1731410      NA
text_tbl    data.table  22485264  166836       1  
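
A table like the one above can be produced with a few lines of base R. The following is only a minimal sketch and not necessarily the function from the post mentioned above; the helper name `list_objects` and the demo environment `demo_env` are hypothetical.

```r
# Hypothetical helper: summarise the objects in an environment by
# class, size (as reported by object.size()), rows, and columns.
list_objects <- function(env = globalenv()) {
  nms  <- ls(envir = env)
  objs <- mget(nms, envir = env)
  out <- data.frame(
    Type = vapply(objs, function(x) class(x)[1], character(1)),
    Size = vapply(objs, function(x) as.numeric(object.size(x)), numeric(1)),
    Rows = vapply(objs, function(x) {
      d <- dim(x)
      if (is.null(d)) length(x) else d[1]   # vectors: length counts as rows
    }, numeric(1)),
    Columns = vapply(objs, function(x) {
      d <- dim(x)
      if (is.null(d)) NA_real_ else d[2]    # vectors have no columns
    }, numeric(1)),
    row.names = nms
  )
  # Largest objects first
  out[order(out$Size, decreasing = TRUE), ]
}

# Small demonstration in a separate environment
demo_env <- new.env()
assign("m", matrix(0, nrow = 4, ncol = 3), envir = demo_env)
assign("v", 1:10, envir = demo_env)
print(list_objects(demo_env))
```

Calling `list_objects()` with no argument would summarise the global environment instead.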

I spent some time trying to figure out how to write this loop and found a post on SO about how to initialize a data.table with a character column, but there are still other objects that I think I need to initialize. I am unsure how feasible it is to write such a loop.

I have copied and pasted the same script back-to-back, as shown above, and run it all at once. It's silly, since I am just adding one in two places.

Feel free to comment on my approach; I would like to learn something out of this. Best

On a side note: I read about adding .rs.restartR() to the loop, and came across a post that suggested using batch files or scheduling tasks in R. I will have to pass on learning those for now.

Bhail
  • I think restarting R is pointless, this could all be done in a single session. So, I agree, what you are doing is silly. In particular, I doubt if clearing your namespace will do anything - you would get the same effect just by assigning new values to those variables. And, yes of course you should write a loop, and no, it's not complicated. In fact, you should not even think of learning R without learning how to write a loop! That's like learning to drive a car without learning how to turn left. So... why not post an attempt at it? –  Mar 23 '17 at 00:13
  • @dash2 - This was very simple, _I didn't have to initialize any objects_. I must have been doing something wrong when I first tried to run this. And I see your point that there is no point in clearing the namespace. – Bhail Mar 23 '17 at 03:19
  • That has me thinking: when do I need to initialize objects for loops? – Bhail Mar 23 '17 at 03:25

1 Answer


This was very simple: I didn't have to initialize any objects, which is what I had been trying to do. The only thing I had to do was load the required packages upon starting R and run the loop.

 ls()
    character(0)
From an empty environment, just a simple loop.

library(data.table)
library(quanteda)
library(dplyr)

    for (i in 4:19){
                    # A    I change the multiple each time here.
        text_tbl <- fread("tlm_s_words", skip = 166836*i, nrows = 166836, header = FALSE, col.names = "text")



        bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 3, concatenator =" ", verbose = TRUE)

        dfm_1 <- dfm(bi_tkn_one)

        ## Using colSums() gives a numeric vector, `final_dfm_1`
        ## tib is the desired object I will save with a new name each time.

        final_dfm_1 <- colSums(dfm_1)
        print(setNames(length(final_dfm_1), "no. N-grams in this batch"))
            # no. N-grams


        tib <- tbl_df(final_dfm_1) %>% add_rownames()  
        # This is what I wanted to extract 'the freq of each token'


             # B    Here I change the name `tib` is saved under each time.
        iplus = i+1
        saveRDS(tib, file = paste0("titr",iplus,".Rda"))

        rm(list=ls())
        Sys.sleep(10)
        gc()
        Sys.sleep(10)

    }

Without initializing any data.table or other objects, the result of the above loop was 16 files saved in my working directory.
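
The same pattern can also be sketched with the loop body wrapped in a function, so that the temporary objects go out of scope after each call and no `rm(list=ls())` is needed. This is only an illustration under stated assumptions: it uses base R on a small temporary file instead of the real `tlm_s_words` file, a simple word count stands in for the `tokens()`, `dfm()`, `colSums()` steps, and the names `input_file`, `chunk_size`, `out_dir`, and `process_chunk` are hypothetical.

```r
# Hypothetical stand-in for the real input: a small file, one text per line.
input_file <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:10), input_file)

chunk_size <- 3          # the script above uses 166836
out_dir    <- tempdir()  # the script above saves to the working directory

process_chunk <- function(i) {
  # Skip i full chunks, then read the next chunk_size lines.
  lines <- scan(input_file, what = character(), sep = "\n",
                skip = chunk_size * i, nlines = chunk_size, quiet = TRUE)
  # Stand-in for tokens() -> dfm() -> colSums(): count space-separated words.
  freq <- table(unlist(strsplit(lines, " ", fixed = TRUE)))
  # Same naming scheme as above: the file number is i + 1.
  out_file <- file.path(out_dir, paste0("titr", i + 1, ".Rda"))
  saveRDS(freq, file = out_file)
  out_file  # lines and freq are local and are released when the call returns
}

saved <- character(0)
for (i in 0:2) {
  saved <- c(saved, process_chunk(i))
}
print(basename(saved))
```

Only `saved` grows across iterations here; everything created inside `process_chunk()` is local to one call.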

That makes me wonder: when do we need to initialize the vectors, matrices, and other objects that are used in our loops?
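
As a rough illustration of that question (a minimal sketch, not part of the script above): a variable that is simply overwritten on each pass needs no initialization, while an object that the loop fills by index must exist before the first iteration. The names `n_batches`, `tmp`, `batch_sizes`, and `results` are hypothetical.

```r
n_batches <- 5

# Case 1: each iteration overwrites a plain variable, so nothing
# has to exist beforehand.
for (i in seq_len(n_batches)) {
  tmp <- i^2   # created on the first pass, overwritten afterwards
}

# Case 2: results are stored by index, so the containers must exist first.
# Without these two lines, `batch_sizes[i] <- ...` would fail with
# "object 'batch_sizes' not found".
batch_sizes <- numeric(n_batches)          # preallocated numeric vector
results     <- vector("list", n_batches)   # preallocated list

for (i in seq_len(n_batches)) {
  batch_sizes[i] <- i * 10
  results[[i]]   <- seq_len(i)
}

print(batch_sizes)
```

Preallocating to the final length, rather than growing an object inside the loop, also avoids repeated copying as the object gets larger.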

Bhail
  • An interesting question if it hasn't been asked already. In general, the only reason to do that is to save time/memory. Take a look at the R Inferno chapter 2 for more details: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf . BTW, why the `Sys.sleep(10)`? –  Mar 23 '17 at 12:17
  • Thanks for pointing me to 'Inferno'. My first impression is that I could learn a ton here. I have been confused about which resource to focus on to master R; at first glance, 'Inferno' will first help me defrag what I already know and then get me comfortable with advanced topics. – Bhail Mar 23 '17 at 17:11
  • `Sys.sleep(10)` was introduced because some chunks I ran were consuming 90% or more of my RAM, and I wanted to give memory time to clear after `rm()`. It was just a shot in the dark; to be frank, I was trying to be sensitive towards my laptop, letting it breathe before drowning it in another round of intense processing. – Bhail Mar 23 '17 at 17:15
  • You need a `Give.laptop.sugar.lump()` function... but seriously, what's wrong with using your RAM? That's what it is there for. Ditto with the `gc()` call unless you actually know you need it. –  Mar 23 '17 at 17:37
  • Given the number of mistakes I am making, I ought to step back and say Wuusaa more often. Best – Bhail Mar 24 '17 at 00:24