Parse a remote xml.gz file of a database without downloading

Question

I need to parse a PubChem database to search for certain clues on the compound pages (toxicity codes, to be exact; they look like 'H300') and then add the matching CIDs to the corresponding lists.

The database is here: https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/

But the xml.gz files there are so big that they can't be unpacked on my computer. So maybe there is a way to read these files directly on the PubChem server?

I tested one of the files: a 1 GB *gz file uncompressed to ~18 GB. If you plan to run multiple 'searches' against a given file, consider downloading the (e.g., 1 GB) *gz file and then running your unpack-in-memory + search operations locally. The main objective is to download the *gz file only once while still using local CPU and memory for the repeated unpack+search operations. Then again, if you plan on doing this type of operation on a regular basis, you may want to look at adding some (relatively cheap) disk space to your system. – markp-fuso, Apr 29 '23 at 18:31
Keep in mind that `parse ... without downloading` means running an operation on the remote host, something most (if not all) websites are not going to allow, so you will need to *download* the (compressed) gz file. The next question is: how many times will you need to download the same file, or do you have room to download the file *once* and then run your multiple unpack+search operations locally? – markp-fuso, Apr 29 '23 at 18:37

Ron · Accepted Answer · 2023-04-29T14:12:50.343

2

One way I would approach this is to use curl and gunzip and maybe grep:

Example:

curl -ks https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/Compound_000000001_000500000.xml.gz -o - | gunzip | grep someString

This streams the file down and decompresses it on the fly, so you can grep for what you need while the data is still arriving.
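For anyone who would rather stay in Python, the same stream, decompress, filter idea can be sketched with the standard library alone. This is only a sketch under assumptions, not part of the original answer: the helper names `grep_gzip_stream` and `grep_remote` are made up for illustration, and it assumes the file is UTF-8 text with line breaks and that the server sends the .gz bytes unchanged.

```python
import gzip
import re
import urllib.request


def grep_gzip_stream(raw, pattern):
    """Yield decoded lines from a gzip-compressed binary stream that match pattern.

    `raw` is any binary file-like object with a read() method, for example an
    HTTP response. Lines are decompressed one at a time, so the full
    uncompressed file is never held in memory or written to disk.
    """
    regex = re.compile(pattern)
    with gzip.GzipFile(fileobj=raw) as gz:
        for raw_line in gz:
            line = raw_line.decode("utf-8", errors="replace")
            if regex.search(line):
                yield line.rstrip("\r\n")


def grep_remote(url, pattern):
    """Hypothetical helper: stream a remote .gz file and grep it on the fly."""
    with urllib.request.urlopen(url) as response:
        yield from grep_gzip_stream(response, pattern)


# Example call (not run here, since it transfers the whole compressed file):
# url = ("https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/"
#        "Compound_000000001_000500000.xml.gz")
# for hit in grep_remote(url, r"someString"):
#     print(hit)
```

As with the curl pipeline, the compressed bytes still cross the network once per run; only the storage step is avoided.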

edited Apr 29 '23 at 14:12

answered Apr 29 '23 at 14:10


and, importantly, without having to store the data compressed or uncompressed anywhere. However, whether this is going to work for you depends on the processing you need to do on the file, and `grep` may not actually work for XML: since it can be structured in all sorts of different ways, you may have to actually parse the XML. – erik258 Apr 29 '23 at 14:13

Yep, `grep` was only my addition to the example, to illustrate the point. – Ron Apr 29 '23 at 14:14
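If line-based matching proves too crude, as the comments above suggest, the decompressed stream can be fed to an incremental XML parser instead, so that each compound record is inspected and then discarded. The following is a minimal sketch under assumptions, not a verified recipe for PubChem: the function name `cids_matching` is hypothetical, the default tag names `PC-Compound` and `PC-CompoundType_id_cid` are assumptions about the file layout that should be checked against a real file, and it has not been verified that these files contain the codes being searched for.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name, if present."""
    return tag.rsplit("}", 1)[-1]


def cids_matching(raw, needle,
                  record_tag="PC-Compound",           # assumed record element
                  cid_tag="PC-CompoundType_id_cid"):  # assumed CID element
    """Yield the CID of every record whose element text contains `needle`.

    `raw` is a binary file-like object holding gzip-compressed XML, such as
    an HTTP response or a local file opened in 'rb' mode.
    """
    with gzip.GzipFile(fileobj=raw) as gz:
        context = ET.iterparse(gz, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root element
        for event, elem in context:
            if event != "end" or local_name(elem.tag) != record_tag:
                continue
            cid = None
            found = False
            for node in elem.iter():
                if cid is None and local_name(node.tag) == cid_tag:
                    cid = node.text
                if node.text and needle in node.text:
                    found = True
            if found and cid is not None:
                yield int(cid)
            root.clear()  # drop finished records so memory use stays bounded
```

Passing the response object from `urllib.request.urlopen(url)` as `raw` keeps the whole pipeline streaming; passing a file opened with `open(path, "rb")` covers the download-once approach suggested in the question's comments.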
