Using R to download gzipped data file, extract, and import data

Question

A follow up to this question: How can I download and uncompress a gzipped file using R? For example (from the UCI Machine Learning Repository), I have a file of insurance data. How can I download it using R?

Here is the data url: http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz.

In library(archive) there is also read_csv(archive_read("http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz", file = 1), col_types = cols()) or archive_extract("http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz", dir=XXX) - that worked very well for me and is faster than the built-in untar() – Tom Wenseleers, Jul 11 '22 at 15:55

score 22 · Accepted Answer · answered Aug 12 '11 at 19:33

I like Ramnath's approach, but I would use temp files like so:

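tmpdir <- tempdir()  # per-session temporary directory for the extracted files

url <- 'http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz'
file <- basename(url)  # "tic.tar.gz", saved in the current working directory
download.file(url, file, mode = 'wb')  # binary mode keeps the archive intact on Windows

untar(file, compressed = 'gzip', exdir = tmpdir)
list.files(tmpdir)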

The list.files() should produce something like this:

[1] "TicDataDescr.txt" "dictionary.txt"   "ticdata2000.txt"  "ticeval2000.txt"  "tictgts2000.txt"

which you could parse if you needed to automate this process for a lot of files.
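As a rough sketch of that kind of automation, the download and extraction steps could be wrapped in one function. The name `download_and_untar` and its arguments are hypothetical, not something from base R or from the answers here. Unlike the snippet above, it assumes you want the archive itself in a temporary file rather than the working directory.

```r
# Hypothetical helper (not part of base R): download a .tar.gz archive,
# extract it into its own temporary directory, and return the paths of
# the extracted files.
download_and_untar <- function(url, exdir = tempfile("untar_")) {
  dir.create(exdir, recursive = TRUE, showWarnings = FALSE)

  # Keep the archive outside exdir so the listing shows only its contents,
  # and delete it once the function returns.
  archive_path <- tempfile(fileext = ".tar.gz")
  on.exit(unlink(archive_path), add = TRUE)

  # mode = "wb" keeps the binary archive intact on Windows.
  download.file(url, archive_path, mode = "wb", quiet = TRUE)
  untar(archive_path, exdir = exdir)

  list.files(exdir, full.names = TRUE, recursive = TRUE)
}
```

Calling it as files <- download_and_untar(url) for each URL in a vector would give one character vector of extracted paths per archive, which can then be passed to whatever reader suits the files.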

answered Aug 12 '11 at 19:33

JD Long

+1, nice approach to automating the process. Maybe download + unzip should be a function in its own right, as it is a very common operation. – Ramnath Aug 12 '11 at 19:37
Yes, that's more or less what was in my answer to the question Zach already linked to: http://stackoverflow.com/questions/3053833/using-r-to-download-zipped-data-file-extract-and-import-data – Dirk Eddelbuettel Aug 12 '11 at 19:41
I thought the use of basename() and list.files() was worth illustrating. – JD Long Aug 12 '11 at 20:11

score 8 · Answer 2 · answered Aug 12 '11 at 19:00

Here is a quick way to do it.

# create download directory and set it
.exdir = '~/Desktop/tmp'
dir.create(.exdir)
.file = file.path(.exdir, 'tic.tar.gz')

# download file
url = 'http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz'
download.file(url, .file)

# untar it
untar(.file, compressed = 'gzip', exdir = path.expand(.exdir))

answered Aug 12 '11 at 19:00

Ramnath

As I said, that is virtually identical to what I wrote [in this SO question](http://stackoverflow.com/questions/3053833/using-r-to-download-zipped-data-file-extract-and-import-data) -- modulo the tar vs zip file content issue and the fact that you do not use a proper temp directory. I think the whole question could be closed as duplicate. – Dirk Eddelbuettel Aug 13 '11 at 15:44
Dirk, I still fail to understand how it is a duplicate. `unz` only works with zip files that contain a single file, so the difference between `untar` and `unz` is significant enough in my mind to merit a different question. Am I missing something completely here? – Ramnath Aug 13 '11 at 15:53
So now for the third time: downloading a remote file, expanding it in a temp location and working on the content is all the same between both answers. The only minor difference is what operation you use to extract the content, depending on whether it is a zip or tar file. Is that really that difficult to grasp? – Dirk Eddelbuettel Aug 13 '11 at 15:57
I understand that quite well Dirk. But by that count several questions on SO would have to be closed as duplicate if all that mattered was the underlying concept behind the answers. In my humble opinion, a reader wanting to extract a downloaded archive would not be able to achieve his purpose based on the other question. I don't want to prolong this discussion, but if there are several others who see this as a simple extension and a duplicate, please feel free to shut this question down. – Ramnath Aug 13 '11 at 16:07
+1. Quick question: Is `path.expand` necessary for the code to work or is it merely best practice to use the full path instead of relying on **R** doing the tilde expansion? – Steve S Mar 12 '15 at 04:13

score 2 · Answer 3 · answered Aug 12 '11 at 18:54

Please read the content of help(download.file) for that. If the file in question is merely a gzipped but otherwise readable file, you can feed the complete URL to read.table() et al. too.
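To illustrate the single-file case, here is a minimal local sketch. It writes a small gzip-compressed table to a temporary file and reads it back, because that can be run without a network connection. The file name and data are made up for the example, and it only demonstrates the local-path case, not reading from a URL.

```r
# Write a small gzip-compressed, whitespace-delimited table to a temp file.
gz_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_path, "w")
write.table(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)),
            con, row.names = FALSE)
close(con)

# read.table() opens the path with file(), which detects the gzip
# compression and decompresses while reading; no manual gunzip step.
dat <- read.table(gz_path, header = TRUE)
```

The same read.table() call works on a single gzipped file that was first saved locally with download.file().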

answered Aug 12 '11 at 18:54

Dirk Eddelbuettel

it is not just gzipped but a compressed folder of files – Ramnath Aug 12 '11 at 19:04
Nevertheless, it's good advice that you can just use read.table('myURL.gzip') on individual files. – John Aug 12 '11 at 20:26

score 1 · Answer 4 · answered Jul 11 '22 at 15:35

Using library(archive) one can also read a particular CSV file within an archive without having to extract it first (read_csv() and cols() come from readr): read_csv(archive_read("http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz", file = 1), col_types = cols())

This is quite a bit faster.

To unzip everything one can do archive_extract("http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz", dir=XXX).

That worked very well for me and is faster than the built-in untar(). It also works on all platforms. It supports 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.

edited Jul 11 '22 at 16:06

answered Jul 11 '22 at 15:35

Tom Wenseleers

Great answer with a fantastic tool. Have you ever seen "Timeout of 60 seconds was reachedError in file(archive, "rb")" warning message when loading multiple urls? It happened to me when I load the third zip file in a for loop. – Y. Z. Aug 28 '22 at 20:39
@Y.Z. No sorry - haven't seen that. Probably something server specific. Maybe add a Sys.sleep(XX) between each download? Or just add some error catching to retry until it succeeds? – Tom Wenseleers Aug 28 '22 at 21:36
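A minimal sketch of the "sleep and retry" idea from the comment above, in base R only. The helper name `with_retry` and its arguments are hypothetical, and whether retrying actually resolves that particular timeout is an assumption, not something confirmed in this thread.

```r
# Hypothetical helper: call f() up to max_tries times, pausing between
# failed attempts, and re-raise the last error if every attempt fails.
with_retry <- function(f, max_tries = 3, wait_seconds = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = f()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) return(result$value)
    last_error <- result$error
    # Only wait if another attempt is coming.
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop(last_error)
}
```

Inside a loop over URLs, each download or read could then be wrapped as with_retry(function() download.file(u, destfile, mode = "wb"), wait_seconds = 5). Note that this only catches errors; failures reported as warnings would need to be handled separately.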
