I have a month of daily TAQ data and am trying to unzip the files using SAS, without success. The folder is named EQY_US_ALL_TRADE_202107. It contains one gzipped (.gz) file per trading day, named EQY_US_ALL_TRADE_202210701, EQY_US_ALL_TRADE_202210702, EQY_US_ALL_TRADE_202210703, ..., EQY_US_ALL_TRADE_202210729.

I have tried the following code. As a first test I tried to unzip just two files (hence do _n_ = 1 to 2 in line 4). It does not work at all.

data "D:\EQY_US_ALL_TRADES_202107\MainDataset";
  rc=filename("folderef","D:\EQY_US_ALL_TRADES_202107");
  did = dopen("folderef");
  do _n_ = 1 to 2;
    filename = dread(did,_n_);
    if scan(filename,-1,'.') ne 'gz' then continue;
    fullname = pathname("folderef") || '/' || filename;
    do while(1);
      infile archive zip filevar=fullname gzip dlm='|' firstobs=2 eof=nextfile;
      OUTPUT;
    end;
nextfile:
  end;
  stop;
run;

Proc contents data = "D:\EQY_US_ALL_TRADES_202107\MainDataset";
run;
  • So you appear to have more than one GZIP files, and no ZIP files. Each GZIP file can only contain a single file, unlike an actual ZIP file which could contain multiple files. So what type of files are these GZIP files? Are they text files? SAS datasets? Something else? Is the goal just to expand the GZIP files back to their uncompressed versions? Or is the goal to convert them to something else? For example if they are GZIPped text files you might want to read them into a SAS dataset. – Tom Jan 05 '23 at 21:32
  • What version of SAS do you have? Later versions can process a zip file more easily. https://blogs.sas.com/content/sasdummy/2017/10/10/reading-writing-gzip-files-sas/ – Reeza Jan 05 '23 at 22:56
  • You have an INFILE but no INPUT statements so the file is never actually read.... – Reeza Jan 05 '23 at 23:04
  • @Tom: These GZIP files are downloaded from a Stock Trading Activity Dataset. I do not know what is the original format. I unzip one of the files manually - its properties show file type as "File". It opens in notepad++ though. My ultimate aim is to add all the GZIP files to a SAS dataset so that I start analyzing them. But I am unable to do so. I would like to unzip/load all these files to a SINGLE SAS dataset. – zannatus saba Jan 06 '23 at 03:41
  • @Reeza: I have SAS Studio 9.4. I read the link that you shared - I am very new to SAS. How do I modify that code to unzip all of them and load all of them in a single SAS dataset? I tried using the * wildcard (all files start with same letters except the last two digits) in the filename statement but it did not work out. Any tips? – zannatus saba Jan 06 '23 at 03:50
  • You're trying to build Rome in a day, you do it in steps, as Tom illustrates in his excellent solution below. First figure out how to read a single file. Then once you understand how to read that file, figure out how to get the list of files, then how to apply the code that works to all files. – Reeza Jan 06 '23 at 17:36

1 Answer


So you have three problems.

The primary one is understanding how to read ONE of the files. If you downloaded this from NYSE then they should be pipe delimited text files and the variable definitions are published. So first work on code that can read one of the files.

To read a pipe-delimited text file, just use a simple data step. Say, for example, you have the daily quotes file. The documentation says that file has 23 variables. Reading delimited files is simple: define the variables and then input them. Make sure to remove the summary line at the bottom.

data want;
  /* ZIP with the GZIP option reads the compressed file directly; FIRSTOBS=2 skips the header line */
  infile 'myfile.gz' zip gzip dsd dlm='|' termstr=lf truncover firstobs=2 ;
  attrib Time length=$15 label='Timestamp Time the quote was published by the SIP';
  attrib Exchange length=$1 label='The Exchange that issued the quote';
  attrib Symbol length=$17 label='Symbol Stock symbol';
  attrib BidPrice length=8 label='The highest price any buyer is willing to pay for shares of this security';
  attrib BidSize length=8 label='The maximum number of shares the highest bidder is willing to buy, in round lots';
/* you can type the rest */
  attrib SecurityStatus length=$2 label='The Security Status Indicator field is used to report trading suspensions';
  /* list input of every variable from Time through SecurityStatus, in definition order */
  input Time -- SecurityStatus ;
  /* drop the summary line at the bottom of the file */
  if time='END' then delete;
run;
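
If you are not sure what is inside the file, it can help to look at a few raw lines first. A minimal sketch, assuming the same placeholder name 'myfile.gz' and a SAS 9.4 release recent enough to support the GZIP option, that writes the first five lines to the log so you can check the header and the delimiter before defining any variables:

```sas
data _null_;
  /* OBS=5 stops after the first five lines of the uncompressed text */
  infile 'myfile.gz' zip gzip termstr=lf obs=5;
  input;          /* read the whole line into the input buffer */
  put _infile_;   /* write the raw line to the SAS log */
run;
```

Nothing is saved by this step; it only prints to the log.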

The second problem is how to get the list of files to be read.

Getting the list of files in a directory is a common question here and on SAS Communities. Your current code is close to doing that using the DOPEN() and DREAD() functions.

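data files;
  length fileref $8 filename fullname $256 ;
  /* FILEREF is blank, so FILENAME() generates a fileref and stores it in the variable */
  rc=filename(fileref,"D:\EQY_US_ALL_TRADES_202107");
  did = dopen(fileref);
  /* DNUM() returns the number of members in the directory */
  do _n_ = 1 to dnum(did);
    filename = dread(did,_n_);
    /* keep only the members whose extension is gz */
    if scan(filename,-1,'.') = 'gz' then do;
      fullname = catx('/',pathname(fileref),filename);
      output;
    end;
  end;
  rc = dclose(did);
  keep fullname;
run;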

Once you have solved those two problems you can move on to reading ALL of the files into one dataset. You can do that by driving the data step that reads the TAQ files with the dataset that holds the list of files, using the FILEVAR= option of the INFILE statement. So if you have a dataset named FILES with a variable named FULLNAME that holds the names of the GZIP files you want to read, the basic structure would look like this:

data want;
  /* one iteration per file name in FILES */
  set files ;
  /* DUMMY is a placeholder; the file actually opened is the one named in FULLNAME */
  infile dummy zip gzip filevar=FULLNAME end=eof dsd dlm='|' termstr=lf truncover firstobs=2 ;
  attrib Time length=$15 label='Timestamp Time the quote was published by the SIP';
  attrib Exchange length=$1 label='The Exchange that issued the quote';
  attrib Symbol length=$17 label='Symbol Stock symbol';
  attrib BidPrice length=8 label='The highest price any buyer is willing to pay for shares of this security';
  attrib BidSize length=8 label='The maximum number of shares the highest bidder is willing to buy, in round lots';
/* you can type the rest */
  attrib SecurityStatus length=$2 label='The Security Status Indicator field is used to report trading suspensions';
  /* read every line of the current file, skipping the summary line */
  do while (not eof);
    input Time -- SecurityStatus ;
    if time ne 'END' then output;
  end;
run;
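
One thing to be aware of: the variable named in FILEVAR= is not written to the output dataset. If you want each row to record which file it came from, copy the name to another variable. A minimal sketch, assuming the FILES dataset built earlier and reading only the first field to keep it short (the names CHECK and SOURCE are just illustrative):

```sas
data check;
  set files;
  length source $256;
  source = fullname;   /* FILEVAR= variables are dropped, so keep a copy */
  infile dummy zip gzip filevar=fullname end=eof dsd dlm='|' termstr=lf truncover firstobs=2;
  length Time $15;
  do while (not eof);
    input Time;        /* only the first field, for a quick check */
    if Time ne 'END' then output;
  end;
run;

/* number of rows read from each file */
proc freq data=check;
  tables source;
run;
```

A row count per file is a cheap way to confirm that every daily file was actually read before you run the full step.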
Tom