
I already have this working, but I'm looking to optimize it. Extracting the article data takes a really long time because my approach uses a for-loop: I go row by row, and each row takes a little more than a second to run. My actual dataset has about 10,000 rows, so the whole thing takes a very long time. Is there a way to extract the full articles other than a for-loop? I apply the same steps to every row, so I'm wondering whether R has a function for this similar to multiplying a column by a number, which is super quick.

Creation of dummy dataset:

date <- as.Date(rep('2020-06-25', 10))

text <- c('Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays', 
      'GMRC now a law; to be integrated in school curriculum',
      'QC to impose stringent measures to screen applicants for PWD ID',
      '‘Baka kalaban ka:’ Cops intimidate dzBB reporter',
      'Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so',
      'PNP records highest single-day COVID-19 tally as cases rise to 579',
      'IBP tells new lawyers: ‘Excel without sacrificing honor’',
      'Senators express concern over DepEd’s preparedness for upcoming school year',
      'Angara calls for probe into reported spread of ‘fake’ PWD IDs',
      'Grab PH eyes new scheme to protect food couriers vs no-show customers')
link<- c('https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays',  
     'https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum',                           
     'https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id',                 
     'https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter',                                  
     'https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so',
     'https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579',             
     'https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor',                         
     'https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year',                      
     'https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids',                   
     'https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers')

df <- data.frame(date, text, link)

Dummy dataset:

df
         date                                                                         text                                                 link
1  2020-06-25 Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays   https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays
2  2020-06-25                        GMRC now a law; to be integrated in school curriculum   https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum
3  2020-06-25              QC to impose stringent measures to screen applicants for PWD ID   https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id
4  2020-06-25                             ‘Baka kalaban ka:’ Cops intimidate dzBB reporter   https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter
5  2020-06-25      Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so   https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so
6  2020-06-25           PNP records highest single-day COVID-19 tally as cases rise to 579   https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579
7  2020-06-25                     IBP tells new lawyers: ‘Excel without sacrificing honor’   https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor
8  2020-06-25  Senators express concern over DepEd’s preparedness for upcoming school year   https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year
9  2020-06-25                Angara calls for probe into reported spread of ‘fake’ PWD IDs   https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids
10 2020-06-25        Grab PH eyes new scheme to protect food couriers vs no-show customers   https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers

Code to get article data for every link:

library(rvest)  # read_html(), html_nodes(), html_text()
library(dplyr)  # %>% pipe

now <- Sys.time()
for (i in 1:nrow(df)) {
  # Scrape the article body for row i and collapse the paragraphs into one string
  test_article <- read_html(df[i, 3]) %>% 
    html_nodes(".article_align div p") %>% 
    html_text() %>%
    toString()

  df[i, 4] <- test_article                   # store the article text in a 4th column
  print(paste(i, "/", nrow(df), sep = ""))   # progress indicator
}
finish <- Sys.time()
finish - now

So just for 10 articles it took about 10 seconds, which feels really long. I'm looking for a faster way to do this.
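For reference, the same scrape can be written with `sapply()` instead of an explicit for-loop (a sketch assuming the same rvest selectors; the helper name `scrape_article` is made up). This usually doesn't speed things up much, since each iteration still makes one blocking HTTP request:

```r
library(rvest)
library(dplyr)

# Hypothetical helper: download one article and collapse its paragraphs
scrape_article <- function(url) {
  read_html(url) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    toString()
}

# Apply it to every link; sapply() returns a character vector
df$article <- sapply(df$link, scrape_article, USE.NAMES = FALSE)
```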

nak5120
  • I'm not sure the for-loop is the reason for the delay: you'd save a few milliseconds, not seconds. You could try parallel processing to send multiple data queries at the same time. – Waldi Jun 26 '20 at 13:12
  • Even for 10,000 rows, would a different approach only save a few milliseconds? – nak5120 Jun 26 '20 at 13:13
  • I suspect that what is taking time is the HTML query, not the loop itself. To save time, you first need to address the part of the code that takes the longest. – Waldi Jun 26 '20 at 14:09
  • Is there a way to apply the HTML query to all rows at the same time, or does it have to be a for-loop that processes one row at a time? That's basically what I'm asking. – nak5120 Jun 26 '20 at 14:11
  • This is what I tried to explain in my first comment: you need to [parallelize the loop](https://stackoverflow.com/questions/38318139/run-a-for-loop-in-parallel-in-r). – Waldi Jun 26 '20 at 14:33
  • Thanks for sharing the link. This is what I don't know how to do, but I'll look through the documentation. If you have done this in the past, can you provide a parallelized loop for this specific example? – nak5120 Jun 26 '20 at 14:35
  • See my answer for parallelization on your example – Waldi Jun 26 '20 at 14:52
  • This is really quick! I don't see the result though in the 4th column. Where do I find the articles? – nak5120 Jun 26 '20 at 14:58
  • I didn't see that you wanted the result in the 4th column; right now the only result is the final paste. Give me a few minutes to edit the answer with what you need. – Waldi Jun 26 '20 at 15:03

1 Answer


You can parallelize the loop:

library(doParallel)  # also loads foreach and parallel
library(rvest)
library(dplyr)

# Set up a parallel backend using most of the available processors
cores <- detectCores()
cl <- makeCluster(cores - 1)  # leave one core free so you don't overload your computer
registerDoParallel(cl)

now <- Sys.time()
result <- foreach(i = 1:nrow(df), .combine = rbind, .packages = c('dplyr', 'rvest')) %dopar% { 
  test_article <- read_html(df[i, 3]) %>% 
    html_nodes(".article_align div p") %>% 
    html_text() %>%
    toString() 

  data.frame(test_article = test_article, ID = paste(i, "-", nrow(df), sep = ""))
}
finish <- Sys.time()
finish - now

# Stop the cluster when done
stopCluster(cl)

Note that you can't write into the original data frame from inside the foreach loop, because each task runs in a separate environment; the per-task rows are collected through `.combine` instead.
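To get the scraped text back into your original data frame afterwards, you can bind on the `result` that `foreach()` returns; with `.combine = rbind`, row `i` of `result` corresponds to row `i` of `df` (the column name `article` is just an example):

```r
# Sanity check: one row came back per article, in input order
stopifnot(nrow(result) == nrow(df))

# Attach the scraped text as a new column
df$article <- result$test_article
```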

Waldi