5

I would like to use R to extract all URLs that are currently opened in a web browser. Consider the following example:

How could I extract these two URLs from within R, to get the following output?

my_urls <- c("https://www.google.de/", "https://www.amazon.com/")
my_urls
### [1] "https://www.google.de/"  "https://www.amazon.com/"

After some research, I suspect that this may be possible with the RSelenium package, but unfortunately I couldn't figure out the appropriate R code.

Ritchie Sacramento
Joachim Schork

3 Answers

3

You can do this with the RSQLite package.

Get the path of your Firefox profile.

Go to %APPDATA%\Mozilla\Firefox\Profiles\ in your explorer. You will see the folder of your Firefox profile.

Open the folder and copy the location of the profile folder

Set db to the copied location, with 'places.sqlite' appended at the end. Once this is set, you don't have to change the db name next time.

db<- 'C:\\Users\\{user}\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\{profilefolder}\\places.sqlite'
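
The manual lookup above could also be scripted. The following is a minimal sketch in base R, assuming the default Windows profile location named above; find_places_db is a hypothetical helper introduced here for illustration, not part of any package.

```r
# Hypothetical helper: return the path of every places.sqlite file found
# in the profile folders directly under `profiles_root`.
# Assumption: the default root is the Windows location %APPDATA%\Mozilla\Firefox\Profiles.
find_places_db <- function(profiles_root = file.path(Sys.getenv("APPDATA"),
                                                     "Mozilla", "Firefox", "Profiles")) {
  # One sub-folder per Firefox profile
  profile_dirs <- list.dirs(profiles_root, full.names = TRUE, recursive = FALSE)
  # Candidate database path inside each profile folder
  candidates <- file.path(profile_dirs, "places.sqlite")
  # Keep only the paths that actually exist
  candidates[file.exists(candidates)]
}

# Usage (not run here): pick the first match as the database path
# db <- find_places_db()[1]
```

If several profiles exist, the function returns several paths, so you would still need to choose the profile you actually browse with.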

Then, proceed with the following:

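library(RSQLite)

# Open the Firefox history database
con <- dbConnect(drv = RSQLite::SQLite(), dbname = db)
# Optional: list the available tables
tables <- dbListTables(con)

# moz_places holds one row per known URL
dt <- dbGetQuery(con, 'select * from moz_places')

# Keep only URLs that were visited at least once
urls <- dt$url[dt$visit_count > 0]
urls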

Output:

[1] "https://duckduckgo.com/"                                        
[1] "http://linkedin.com/"                                           
[2] "https://linkedin.com/"                                          
[3] "https://www.linkedin.com/"                                      
[4] "https://www.sciencedirect.com/"                                 
[5] "http://stackexchange.com/"                                      
[6] "https://stackexchange.com/"

Edit:

If you only want the browsing history of the present day, use this:

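dt <- dbGetQuery(con, 'select * from moz_places')

# last_visit_date is stored in microseconds since 1970-01-01
dt$last_visit_date <- as.Date(as.POSIXct(dt$last_visit_date / 1000000,
                                         origin = "1970-01-01"))
# which() drops rows where last_visit_date is NA
urls <- dt$url[which(dt$visit_count > 0 & dt$last_visit_date == Sys.Date())]
urls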
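
To make the timestamp conversion above easier to check, here is a small self-contained sketch. It assumes, as the division by 1000000 in the code above implies, that last_visit_date holds microseconds since 1970-01-01; prtime_to_date is a hypothetical helper name, and the fixed UTC time zone is a choice made for this example.

```r
# Hypothetical helper: convert a microsecond timestamp to a Date.
# Assumption: the value counts microseconds since 1970-01-01 00:00:00 UTC.
prtime_to_date <- function(x) {
  seconds <- x / 1e6  # microseconds -> seconds
  as.Date(as.POSIXct(seconds, origin = "1970-01-01", tz = "UTC"), tz = "UTC")
}

# 1587081600 seconds after the epoch is midnight UTC on 17 April 2020
example_date <- prtime_to_date(1587081600000000)
example_date
```

Because the conversion here is done in UTC, a visit close to midnight local time may land on the neighbouring calendar day when compared with Sys.Date().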
Mohanasundaram
  • Thanks a lot for your help! When I run your code, R returns a huge vector of URLs (>14K). It seems to show my whole browser history. This is already very helpful, since I can just extract the tail of the URLs to get the currently opened URLs. However, is it possible to extract only the currently opened URLs? – Joachim Schork Apr 17 '20 at 05:35
  • @JSP Using this method, it is not possible to filter the websites of the current session. However, you can view the history of the present day. I have updated the code accordingly. – Mohanasundaram Apr 17 '20 at 08:58
  • Thanks for getting back to me and for the great extension! – Joachim Schork Apr 17 '20 at 10:13
  • FYI, I gave the bounty to H 1, since his answer is automatically extracting currently opened URLs. However, thanks again for your very helpful responses! – Joachim Schork Apr 17 '20 at 11:32
  • @JSP That is the perfect answer, after the latest update. I am upvoting it as well. – Mohanasundaram Apr 17 '20 at 11:37
3

Here is one way you can do this (on Windows, but the same idea is applicable to other platforms).

Firefox stores this information in a JSON recovery file in the user's profile directory. Extracting it would be straightforward, except that Firefox saves the file using a custom version of lz4 compression. I couldn't find a way to decompress the file with Firefox itself without causing a potential security issue, so this approach relies on a third-party tool, dejsonlz4 (https://github.com/avih/dejsonlz4). Once you've downloaded and extracted the tool, you can run the following. Keep in mind that there may be a small delay between opening or closing a tab and this information being written to the recovery file.

library(jsonlite)
library(dplyr)
library(purrr)

# Filepaths
recovery_filepath <- "C:/Users/{NAME}/appdata/Roaming/Mozilla/Firefox/Profiles/{PROFILE}/sessionstore-backups/recovery.jsonlz4"
filepath_to_tool <- "C:/Tools/dejsonlz4.exe"
output_file <- "rcvry.json"

# Uncompress recovery file into the same folder as the compressed one
output_filepath <- paste(dirname(recovery_filepath), output_file, sep = "/")
invisible(system(paste(filepath_to_tool, recovery_filepath, output_filepath)))

# Read uncompressed file
recovery_info <- read_json(output_filepath)

# Extract open tab information (expected result 2 pages)
recovery_info %>%
  pluck("windows", 1, "tabs") %>%
  map_df( ~ map_df(pluck(.x, "entries"),
                   ~ keep(.x, names(.) %in% c("url", "title")))[pluck(.x, "index"), ])

# A tibble: 2 x 2
  url                                                      title                                            
  <chr>                                                    <chr>                                            
1 https://stackoverflow.com/questions/61104900/create-vec~ webbrowser control - Create Vector of Currently ~
2 https://github.com/avih/dejsonlz4                        GitHub - avih/dejsonlz4: Decompress Mozilla Fire~
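
The extraction step above depends on the shape of the recovery data: each tab carries a list of history entries plus an index pointing at the entry currently shown. The following base R sketch illustrates that logic on a small mock list. The mock data and the helper name current_tab_urls are made up for illustration, and unlike the code above, the sketch loops over every window rather than only the first.

```r
# Hypothetical helper: for every tab in every window, return the URL of the
# history entry that the tab's `index` field points at.
current_tab_urls <- function(recovery_info) {
  urls <- character(0)
  for (window in recovery_info$windows) {
    for (tab in window$tabs) {
      current_entry <- tab$entries[[tab$index]]  # index is 1-based
      urls <- c(urls, current_entry$url)
    }
  }
  urls
}

# Mock data imitating the assumed structure (not real Firefox output)
mock_recovery <- list(windows = list(list(tabs = list(
  list(index = 2,
       entries = list(list(url = "https://example.com/a", title = "A"),
                      list(url = "https://example.com/b", title = "B"))),
  list(index = 1,
       entries = list(list(url = "https://example.org/", title = "Org")))
))))

open_urls <- current_tab_urls(mock_recovery)
open_urls
```

The first mock tab has two history entries and currently shows the second one, so only that URL is returned for it.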
Ritchie Sacramento
  • @H 1 Thank you for your response! When I run your code, I get the following error message: `Error: Argument 1 must have names`. This error occurs after applying the map_df function. I found [this thread](https://stackoverflow.com/questions/52505923/error-in-bind-rows-x-id-argument-1-must-have-names), which may be helpful, but unfortunately, I wasn't able to fix it myself. Do you have any idea why I get this error message? – Joachim Schork Apr 17 '20 at 10:20
  • Thank you so much, this is exactly what I was looking for! I used the version for the current dplyr package and this worked perfectly. – Joachim Schork Apr 17 '20 at 11:29
2

You could use the "Export Tabs URLs" add-on in Firefox and read the result from the clipboard in R.

Browser Addins:

Reading from Clipboard code in R:

  • Windows: readClipboard()
  • (Ubuntu) Linux: read.table(pipe("xclip -selection clipboard -o", open = "r")), see R Copy from Clipboard in Ubuntu Linux.
  • Ctrl+V (pasting) would yield the plain text.
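
Once the clipboard text is in R, it still has to be reduced to a vector of URLs. Here is a minimal base R sketch of that step. The exact format the add-on copies is not shown here, so the sample text below (titles mixed with URLs) is an assumption, and extract_urls is a hypothetical helper name.

```r
# Hypothetical helper: keep only the lines that look like http(s) URLs.
extract_urls <- function(lines) {
  lines <- trimws(lines)              # drop surrounding whitespace
  lines[grepl("^https?://", lines)]   # keep lines starting with http:// or https://
}

# Assumed sample of clipboard content; on Windows the real text would
# come from readClipboard()
clipboard_text <- c("Google", "https://www.google.de/", "",
                    "Amazon", "  https://www.amazon.com/  ")

my_urls <- extract_urls(clipboard_text)
my_urls
```

If the add-on is configured to copy URLs only, the filter simply passes every line through unchanged.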

Note that RSelenium drives its own separate browser session (often a headless one), so you would not have access to the ("non-headless") browser that you currently have open. The same holds for other interfaces like chromote.

Tonio Liebrand
  • Thanks a lot for your response. This works great and will save me a huge amount of time. Thank you very much! Do you happen to know a way to automate the manual clicking in the browser (i.e. the click on "Export Tabs URLs" and "Copy to clipboard")? It would be perfect if I could do everything within an R script without any manual interaction with the browser. – Joachim Schork Apr 11 '20 at 06:20
  • One could surely assign a keyboard shortcut to the add-in functionality, so that you could replace the two clicks with a keystroke. I'm not sure if it's part of the add-in or would have to be programmed. Controlling your "non-headless" browser via R would be against the security restrictions of browsers, I guess. Maybe there is some kind of dirty workaround playing with the clipboard: https://stackoverflow.com/questions/28964764/access-chrome-clipboard-events-with-extension, but I am not sure it's worth saving a keystroke. But maybe someone else finds a way; I would be sceptical but curious :) – Tonio Liebrand Apr 11 '20 at 06:51
  • With the [KeyboardSimulator package](https://cran.r-project.org/web/packages/KeyboardSimulator/KeyboardSimulator.pdf) I may apply a keystroke within my R script. Could you elaborate on how to create such a keystroke for "Export Tabs URLs" and "Copy to clipboard"? If I apply the method described [here](https://winaero.com/blog/assign-keyboard-shortcuts-extensions-firefox/), my Firefox Add-on manager says that Export Tabs URLs does not have shortcuts. – Joachim Schork Apr 11 '20 at 07:57
  • It does not seem to be part of the add-in yet. So you could file an issue and request it here: https://github.com/alct/export-tabs-urls/issues or fork the repo and try it yourself. See https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/manifest.json/commands and https://github.com/alct/export-tabs-urls/blob/master/manifest.json. As I have never done that for a Firefox add-in, I can't help much with it, but the add-in creators surely can. – Tonio Liebrand Apr 11 '20 at 08:09
  • Thanks again for your help! I opened a [new issue at github](https://github.com/alct/export-tabs-urls/issues/37). – Joachim Schork Apr 11 '20 at 08:33
  • FYI, I gave the bounty to H 1, since his answer is automatically extracting currently opened URLs without any manual clicking. However, thanks again for your very helpful responses and for getting back to me so quickly! – Joachim Schork Apr 17 '20 at 11:33