Downloading and extracting .gz data file using R

Question

I already tried to solve my problem by adapting the approach from this similar question. However, I get the following error for the URL (or the file) I want to use.

trying URL 'http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz'
Content type 'application/x-gzip' length 65933953 bytes (62.9 Mb)
opened URL
downloaded 62.9 Mb

Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
  cannot open zip file 'D:....'

Here is what I tried:

url_S_C <- "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
tmpFile <- tempfile()
fileName <- gsub(".gz","",basename(url_S_C))
download.file(url_S_C, tmpFile)
data <- read.table(unz(tmpFile, fileName))
unlink(tmpFile)

Maybe someone here can help me understand why this particular file is not working for me? Please note that this file is quite large (62.9 Mb), but I was not able to reproduce the error with the URL from the similar question.

Thank you!

3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux Description: Debian GNU/Linux 7.6 (wheezy) — MineSweeper, Mar 11 '15 at 12:55
In R, using `gzfile(tmpFile)` instead of `unz(tmpFile, fileName)` worked for me. Since you are on Linux I'm assuming you have the `wget` and `gunzip` command line utilities, so you could also download and unzip the `.gz` file and then read it into R like any other `.txt` file. — nrussell, Mar 11 '15 at 13:02
Could you write this as an answer, please? This sounds promising! — MineSweeper, Mar 11 '15 at 13:12

nrussell · Accepted Answer · 2015-03-11T14:06:38.847

Some additional options, with base R:

url <- "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
tmp <- tempfile()
##
download.file(url,tmp)
##
data <- read.csv(
  gzfile(tmp),
  sep="\t",
  header=TRUE,
  stringsAsFactors=FALSE)
names(data)[1] <- sub("X\\.","",names(data)[1])
##
R> head(data)
   mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id           mirna_alignment
1 MIMAT0000062 hsa-let-7a    5270    SERPINE2    uc002vnu.2         NM_006216   uuGAUAUGUUGGAUGAU-GGAGu
2 MIMAT0000062 hsa-let-7a  494188      FBXO47    uc002hrc.2      NM_001008777 uugaUA-UGUU--GGAUGAUGGAGu
3 MIMAT0000062 hsa-let-7a   80025       PANK2    uc002wkc.2         NM_153638   uugauaUGUUGG-AUGAUGGAgu
4 MIMAT0000062 hsa-let-7a   26036      ZNF451    uc003pdp.2          AK027074    uuGAUAUGUUGGAUGAUGGAGu
5 MIMAT0000062 hsa-let-7a     586       BCAT1    uc001rgd.3         NM_005504    uugaUAUGUUGGAUGAUGGAGu
6 MIMAT0000062 hsa-let-7a   22903       BTBD3    uc002wnz.2         NM_014962  uuGAUAUGUUGGAU-GAUGG-AGu
                  alignment            gene_alignment mirna_start mirna_end gene_start gene_end
1     | :|: ||:|| ||| ||||    aaCGGUGAAAUCU-CUAGCCUCu           2        21        495      516
2     || |||:  ::||||||||:  acaaAUCACAGUUUUUACUACCUUc           2        19        459      483
3         |::||: ||||||||     aauuucAUGACUGUACUACCUga           3        17         77       99
4      || || |   | |||||||     ccCUCUAGA---UUCUACCUCa           2        21       1282     1300
5        :|| |:   ||||||||     guagGUAAAGGAAACUACCUCa           2        19       6410     6431
6    || || ||| || ||||| ||   uaCUUUAAAACAUAUCUACCAUCu           2        21       2265     2288
              genome_coordinates conservation align_score seed_cat energy mirsvr_score
1 [hg19:2:224840068-224840089:-]       0.5684         122        0 -14.73      -0.7269
2  [hg19:17:37092945-37092969:-]       0.6464         140        0 -16.38      -0.1156
3    [hg19:20:3904018-3904040:+]       0.6522         139        0 -16.04      -0.2066
4   [hg19:6:56966300-56966318:+]       0.7627         144        7 -14.51      -0.8609
5  [hg19:12:24964511-24964532:-]       0.6775         150        7 -15.09      -0.2735
6  [hg19:20:11906579-11906602:+]       0.5740         131        0 -12.59      -0.2540
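To illustrate why `gzfile` succeeds where `unz` fails, here is a small offline sketch. It assumes a toy two-row table (not the real file) and a made-up member name `toy.txt`. A `.gz` file is a single gzip-compressed stream, whereas `unz` expects a zip archive containing named members.

```r
# Toy data standing in for the real download (assumption for illustration only).
toy <- data.frame(
  gene_id = c(5270L, 494188L),
  gene_symbol = c("SERPINE2", "FBXO47"),
  stringsAsFactors = FALSE
)

# Write the table through a gzip connection to get a small .txt.gz file.
gz_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_path, "wt")
write.table(toy, con, sep = "\t", row.names = FALSE, quote = FALSE)
close(con)

# gzfile() decompresses the gzip stream transparently while reading.
from_gz <- read.delim(gzfile(gz_path), stringsAsFactors = FALSE)

# unz() looks for a member inside a zip archive, so opening a gzip file fails.
# The error message is captured instead of stopping the script.
unz_result <- tryCatch(
  suppressWarnings(read.delim(unz(gz_path, "toy.txt"))),
  error = function(e) conditionMessage(e)
)

unlink(gz_path)
```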

Or if you are on a Unix-like system, you could obtain the .txt file (either outside of R or using system or system2 from within R) like this:

[nathan@nrussell tmp]$ url="http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
[nathan@nrussell tmp]$ wget "$url" && gunzip human_predictions_S_C_aug2010.txt.gz

and then proceed as above, where you are reading in human_predictions_S_C_aug2010.txt from wherever wget and gunzip were executed,

data <- read.csv(
  "~/tmp/human_predictions_S_C_aug2010.txt",
  stringsAsFactors=FALSE,
  header=TRUE,
  sep="\t")

in my case.
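If you prefer to stay inside R, the `gunzip` step can be driven with `system2`. The following is only a sketch: `gunzip_external` is a hypothetical helper name, and it assumes the `gunzip` utility is on the `PATH`.

```r
# Hypothetical helper: decompress a local .gz file with the external gunzip tool
# and return the path of the decompressed file.
gunzip_external <- function(gz_path) {
  if (!nzchar(Sys.which("gunzip"))) {
    stop("gunzip was not found on the PATH")
  }
  # -f overwrites an existing output file instead of prompting.
  status <- system2("gunzip", args = c("-f", shQuote(gz_path)))
  if (status != 0) {
    stop("gunzip exited with status ", status)
  }
  # gunzip replaces 'name.txt.gz' with 'name.txt' in the same directory.
  sub("\\.gz$", "", gz_path)
}

# Possible usage with the URL from the question (not run here):
# url <- "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
# gz_path <- file.path(tempdir(), basename(url))
# download.file(url, gz_path, mode = "wb")
# data <- read.delim(gunzip_external(gz_path), stringsAsFactors = FALSE)
```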

Why is it necessary to separate `data` and `names` when it's done like this? It works as it should now! — MineSweeper, Mar 11 '15 at 14:00
`read.table` was not catching the column names for some reason - thank you for pointing this out, I will update my answer. — nrussell, Mar 11 '15 at 14:04
`downloadAndDecompress <- function(url) { tmp <- tempfile() download.file(url,tmp,method='curl') data <- read.table(gzfile(tmp),header=TRUE) unlink(tmp) return(data) } ` This function works for my purpose. The `method=curl` addition enables **https** connections, if needed. See _RCurl_ package. Thank you! — MineSweeper, Mar 11 '15 at 14:35
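For readability, the helper from the comment above could be laid out roughly as follows. This is a sketch rather than the commenter's exact code: the name `download_and_read_gz` is made up here, it uses `read.delim` because the file is tab-separated, it passes `mode = "wb"` so the download is written in binary mode, and it exposes `method` so that `"curl"` can still be requested as in the comment.

```r
# Hypothetical helper: download a gzip-compressed, tab-separated file and
# read it into a data frame. Extra arguments are forwarded to read.delim().
download_and_read_gz <- function(url, method = "auto", ...) {
  tmp <- tempfile(fileext = ".gz")
  # Remove the temporary file when the function exits, even on error.
  on.exit(unlink(tmp), add = TRUE)
  download.file(url, tmp, method = method, mode = "wb", quiet = TRUE)
  # gzfile() decompresses on the fly; read.delim() opens and closes it.
  read.delim(gzfile(tmp), stringsAsFactors = FALSE, ...)
}

# Possible usage with the URL from the question (not run here):
# data <- download_and_read_gz(
#   "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
# )
```

Cleaning up through `on.exit` means the temporary file is removed even if the download or the parse step fails.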

Miha Trošt · Answer 2 · 2015-03-11T13:12:41.437

You can read the data from the file into R in the following way (tested on Windows):

library(stringr)
library(plyr)
library(dplyr)

# download and extract file from web

temp <- tempfile()
download.file("http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz", temp)
# pass the gzfile() connection to read.csv(), which opens and closes it
data <- read.csv(gzfile(temp),
                 stringsAsFactors = FALSE,
                 nrows = 20)
unlink(temp)

# column names

my_names <- 
  str_split(names(data), "\\.") %>% 
  unlist(.)

# toy example using only first 6 rows of dataset

mickey_mouse_data <- 
  head(data) %>% 
  unlist(.) %>% 
  str_split(., "\t") %>% 
  ldply(.)

names(mickey_mouse_data) <- my_names[-1]

tbl_df(mickey_mouse_data)

   mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id
1 MIMAT0000062 hsa-let-7a    5270    SERPINE2    uc002vnu.2         NM_006216
2 MIMAT0000062 hsa-let-7a  494188      FBXO47    uc002hrc.2      NM_001008777
3 MIMAT0000062 hsa-let-7a   80025       PANK2    uc002wkc.2         NM_153638
4 MIMAT0000062 hsa-let-7a   26036      ZNF451    uc003pdp.2          AK027074
5 MIMAT0000062 hsa-let-7a     586       BCAT1    uc001rgd.3         NM_005504
6 MIMAT0000062 hsa-let-7a   22903       BTBD3    uc002wnz.2         NM_014962
Variables not shown: mirna_alignment (chr), alignment (chr), gene_alignment (chr),
  mirna_start (chr), mirna_end (chr), gene_start (chr), gene_end (chr),
  genome_coordinates (chr), conservation (chr), align_score (chr), seed_cat (chr), energy
  (chr), mirsvr_score (chr)

This is a fine approach for reading in a text file, but I think the OP's main issue was with extracting the underlying `.txt` file from its `.gz` format, so maybe you could address this in your answer? — nrussell, Mar 11 '15 at 13:04
