13

I have been using readLines() to scrape information from a website, following an R tutorial. I now want to extract data from my own website (specifically the awstats data), but the domain is password protected.

Is there a way to pass the URL for the specific awstats data I need together with a username and password?

The format of the URL is:

http://domain.name:port/awstats.pl?month=02&year=2011&config=domain.name&lang=en&framename=mainright&output=alldomains

Thanks.

John
  • Is this HTTP basic authentication? I.e., do you get a password prompt in a pop-up window, and possibly a 401 Unauthorized error when entering the wrong password? – Martin Mar 24 '11 at 14:26

4 Answers

8

If it is indeed HTTP basic access authentication, the documentation on connections provides some help:

URLs

Note that https:// connections are only supported if --internet2 or setInternet2(TRUE) was used (to make use of Internet Explorer internals), and then only if the certificate is considered to be valid. With that option only, the http://user:pass@site notation for sites requiring authentication is also accepted.

So your URL string should look like this:

http://username:password@domain.name:port/awstats.pl?month=02&year=2011&config=domain.name&lang=en&framename=mainright&output=alldomains

This might be Windows-only though.
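
One practical wrinkle with the user:pass@site notation is that a username or password containing characters such as "@" or ":" would be misread as URL delimiters. A minimal sketch of assembling the URL with percent-encoded credentials follows; build_auth_url() and its argument names are hypothetical helpers made up for illustration, not part of base R, and whether the resulting URL is accepted still depends on the platform caveats above.

```r
# Hypothetical helper (not part of base R): assemble an authenticated URL.
build_auth_url <- function(user, password, host_and_path) {
  # Percent-encode reserved characters so that "@" or ":" inside the
  # credentials cannot be mistaken for URL delimiters.
  user_enc <- utils::URLencode(user, reserved = TRUE)
  pass_enc <- utils::URLencode(password, reserved = TRUE)
  paste0("http://", user_enc, ":", pass_enc, "@", host_and_path)
}

# Example call with made-up credentials and a placeholder host.
auth_url <- build_auth_url("john", "p@ss", "example.com/awstats.pl?month=02")
print(auth_url)
```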

Hope this helps!

Martin
8

You can embed the username and password in the URL like this:

http://userid:passw@domain.name:port/...

You can try passing this directly to readLines(). If that doesn't work, you can try a workaround that uses url() to open the connection:

zz <- url("http://userid:passw@domain.name:port/...")
readLines(zz)
close(zz)
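
If the read fails partway, the close() call above is never reached and the connection object lingers. A small sketch of one way to guard against that is shown here; read_url_lines() and its open_con argument are hypothetical names introduced only for this illustration (open_con exists so the helper can be tried on a local file instead of a live URL).

```r
# Hypothetical helper: read all lines from an address and always
# release the connection, even if readLines() raises an error.
read_url_lines <- function(address, open_con = url) {
  con <- open_con(address)          # creates the connection object
  on.exit(close(con), add = TRUE)   # runs on normal exit and on error
  readLines(con, warn = FALSE)
}

# Local demonstration using file() in place of url(), so no network
# access or credentials are needed to try it out.
tmp <- tempfile()
writeLines(c("first line", "second line"), tmp)
print(read_url_lines(tmp, open_con = file))
```

With the default open_con = url, the call would be read_url_lines("http://userid:passw@domain.name:port/...") as in the answer above.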

You can also download the file and save it somewhere using download.file():

download.file("theurl","/path/to/file/filename",method="wget")

This saves the file on the local path that is specified.

EDIT:

As csgillespie said, you shouldn't include your username and password in the script. If you run scripts with source() or interactively, you could add, e.g.:

user <- readline("Give the username : ")
passw <- readline("Give the password : ")

# paste0() is used because paste() would insert spaces between the pieces
Url <- paste0("http://", user, ":", passw, "@domain.name...")
readLines(Url, ...)

When running from the command line, you could pass the arguments after --args and access them using commandArgs() (see ?commandArgs).
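
The command-line route could look roughly like the sketch below. It assumes the username and password are the first two trailing arguments; credentials_from_args() and the script name fetch_awstats.R are hypothetical and used only for illustration.

```r
# Hypothetical helper: pull a username and password out of the
# arguments that follow --args (or the script name under Rscript).
credentials_from_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) < 2) {
    stop("usage: Rscript fetch_awstats.R <username> <password>")
  }
  list(user = args[1], password = args[2])
}

# Demonstration with an explicit vector standing in for real arguments.
creds <- credentials_from_args(c("myuser", "mypass"))
print(creds$user)
```

It would be invoked as, for example, Rscript fetch_awstats.R myuser mypass. Arguments given this way can still end up in the shell history, so this moves the password out of the script rather than hiding it altogether.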

Joris Meys
3

Formatting the URL as http://username:password@domain... for use with download.file() didn't work for me, but R.utils provides the function downloadFile(), which works perfectly:

require(R.utils)
downloadFile(myurl, myfile, username = "myusername", password = "mypassword")

See @joris-meys' answer for a way to avoid including your username and password in plain text in your script.

EDIT: Except it looks like downloadFile() just reformats the URL to http://username:password@domain...? Hmm...

mikeck
3

If you have access to the box, you could always just read the awstats log files. If you can ssh into the box, then you could easily sync the latest file using rsync.

The slight snag with using

http://username:password@domain...

is that you are putting your password in an R script, which is best avoided. Of course you can secure the script, but it only takes one slip. For example,

csgillespie
  • +1 for the warning. Of course one should construct the URL after asking for the username and password using e.g. readline(), or by passing them as parameters to the script. But the "if you have access to the box" option requires a solution from outside R. – Joris Meys Mar 24 '11 at 14:51
  • @Joris: "solution from outside R" - I'm sure R must have a library for `ssh` ;) I suspect that the OP may have access to the box from the way the example url is constructed, but that's only a guess. – csgillespie Mar 24 '11 at 14:57
  • It's not that straightforward to get it done, and it depends heavily on the architecture of the machine and server involved. The only way I've seen it done is using `system()` in the R script, but that's far from an optimal solution, as you (again) run into trouble with passwords... – Joris Meys Mar 24 '11 at 15:04
  • Thanks @csgillespie. I also needed --internet2 option which I did not realise. – John Mar 24 '11 at 20:34