
I want to read a large CSV (~8 GB) into my R environment, but I am running into problems.

My data is a publicly available dataset:

# CREATE A TEMP FILE TO STORE THE DOWNLOADED DATA
temp <- tempfile()

# DOWNLOAD THE FILE FROM THE CMS
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = temp)

This is where I'm running into difficulty: I am unfamiliar with Linux working directories and with where temp folders are created.

When I use list.dirs() or list.files() I don't see any reference to this temp file.

I am working in an R project and my working directory is as follows:

getwd()
[1] "/home/myName/myProjectName"

I'm able to read in the first part of the file but my system crashes after about 4Gb.

# UNZIP THE NPI FILE
npi <- unz(temp, "npidata_pfile_20050523-20220213.csv")

I then came across this post, which has a function for decompressing large zip files using the system2 unzip functionality. However, due to my limited R knowledge and Linux experience, I couldn't get the function to point to the downloaded file in the temp folder.

Checking the path for temp above, I get the following:

temp
[1] "/tmp/Rtmpl6SHIJ/file7e5e6c1fc693"

Using the system2 function from the link above I tried the following:

x <- decompress_file(directory = temp,
                     file = "NPPES_Data_Dissemination_February_2022.zip")

But I get an error about setting the working directory (the error was posted as a screenshot and is not reproduced here).

Any pointers on how I can get this file unzipped, given its size, and read it into memory would be much appreciated.

TheGoat
  • Might be a file permissions problem with `/tmp`. Can you download the file to `/home/myName/myProjectName` & decompress there instead? – mrhellmann Mar 11 '22 at 17:28
  • Just out of curiosity: does `/tmp` reside on its own filesystem, and if yes, how big is it? :) – tink Mar 11 '22 at 17:35
  • @mrhellmann thanks for the reply, I tried downloading to my project dir rather than the temp and I got a warning: "Warning messages: 1: In download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip", : URL https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip: cannot open destfile '/home/myName/myProjectName', reason 'Is a directory'" – TheGoat Mar 11 '22 at 17:42
  • @tink thanks for the reply, I am not sure if it's on a separate file system. Is there a linux command that I could use to check? – TheGoat Mar 11 '22 at 17:44
  • @TheGoat try `download.file("url from above", destfile = "/home/myName/myProjectName/npi.csv")` replace `url from above` with the actual url & destfile should be a full path with the filename you want. – mrhellmann Mar 11 '22 at 17:48
  • `df -h | grep tmp` – tink Mar 11 '22 at 18:12

2 Answers


temp is the path to the file, not just the directory. By default, tempfile does not add a file extension; you can add one with tempfile(fileext = ".zip").

Because temp is a file path rather than a directory, decompress_file cannot set the working directory to it. Try this:

x <- decompress_file(directory = dirname(temp), file = basename(temp))
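
As a small illustration of the point above, the sketch below shows how a tempfile() path splits into a directory part and a file-name part, and why list.files() with no arguments does not show it. The variable names (temp_dir, temp_name, same_dir, listed) are only illustrative and not part of the original question.

```r
# Build a temp file path with a .zip extension (tempfile() only returns a path,
# it does not create the file)
temp <- tempfile(fileext = ".zip")

# Split the full path into its directory and file-name parts
temp_dir  <- dirname(temp)
temp_name <- basename(temp)

# The directory part is the session temp directory reported by tempdir()
same_dir <- normalizePath(temp_dir) == normalizePath(tempdir())

# list.files() looks in the working directory by default, so a temp file
# only shows up when you point it at tempdir()
file.create(temp)
listed <- temp_name %in% list.files(tempdir())

print(c(same_dir = same_dir, listed = listed))
```

Passing dirname(temp) and basename(temp) separately is what lets a function that expects a directory and a file name work with a tempfile() path.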
Marcus

It might be a file permission issue. To get around it, work in a directory you're already in or one you know you have write access to.
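
One way to test the permission theory before downloading is to ask R whether the directories are writable. This is only a quick check using base R's file.access(); the dirs vector and its names are illustrative, and a writable directory does not rule out other problems such as a full /tmp filesystem.

```r
# Directories to check: the session temp directory and the current working directory
dirs <- c(temp = tempdir(), project = getwd())

# file.access() returns 0 when the requested access is allowed and -1 otherwise;
# mode = 2 asks about write permission
writable <- file.access(dirs, mode = 2) == 0

print(writable)
```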


# DOWNLOAD THE FILE
# to a directory you can access, and give it a file name. No need to overcomplicate this.
# The download is a zip archive, so name it .zip; mode = "wb" keeps the binary intact on Windows.

download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = "/home/myName/myProjectName/npi.zip",
              mode = "wb")

# use the decompress function if you need to, though unzip might work
x <- decompress_file(directory = "/home/myName/myProjectName/",
                     file = "npi.zip")

# remove .zip file if you need the space back
file.remove("/home/myName/myProjectName/npi.zip")
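
If decompress_file is not available, a rough stand-in can be sketched with system2. The helper name unzip_with_system below is hypothetical and is not the function from the linked post; it assumes an Info-ZIP `unzip` binary is on the PATH (typical on Linux) and uses its `-o` (overwrite) and `-d` (extraction directory) options instead of changing the working directory.

```r
# Hypothetical helper (not the decompress_file from the linked post):
# extract a zip archive with the system's unzip program
unzip_with_system <- function(zip_path, exdir = dirname(zip_path)) {
  # Fail early if the inputs or the external tool are missing
  stopifnot(file.exists(zip_path), dir.exists(exdir))
  if (!nzchar(Sys.which("unzip"))) {
    stop("no 'unzip' program found on the PATH")
  }

  # -o overwrites existing files without prompting, -d sets the output directory;
  # shQuote() protects paths that contain spaces
  status <- system2("unzip",
                    args = c("-o", shQuote(zip_path), "-d", shQuote(exdir)),
                    stdout = FALSE)

  # With stdout = FALSE, system2() returns the exit status of the command
  if (status != 0) {
    stop("unzip exited with status ", status)
  }

  # Return the extraction directory so the caller can list its contents
  invisible(exdir)
}

# Example call, using the path from the answer above (left commented out
# because it needs the downloaded archive):
# out_dir <- unzip_with_system("/home/myName/myProjectName/npi.zip")
# list.files(out_dir)
```

Taking the archive as a single full path avoids the directory-versus-file mix-up that caused the original error.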

mrhellmann