
I'll explain my problem first, as it's important to understand what I want :-).

I'm working on a Python pipeline that uses several external tools to perform genomics data analyses. One of these tools works with very large FASTQ files, which in the end are just plain text files.

Usually these FASTQ files are gzipped, and since they're plain text the compression ratio is very high. Most data analysis tools can work with gzipped files, but a few of them can't. So what we're doing is unzip the files, work with them, and finally re-compress them.

As you may imagine, this process is:

  • Slow
  • Disk-intensive
  • Bandwidth-intensive (when working on an NFS filesystem)

So I'm trying to figure out a way of "tricking" these tools into working directly with gzipped files, without having to touch their source code.

I thought of using FIFO files and tried that, but it doesn't work if the tool reads the file more than once, or if the tool seeks around the file.
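
A minimal sketch of what that FIFO attempt looked like (the file and tool names are placeholders):

mkfifo /tmp/reads.fastq                    # named pipe that poses as a regular file
zcat reads.fastq.gz > /tmp/reads.fastq &   # background process feeds the pipe
./tool /tmp/reads.fastq                    # fine for one sequential read; re-reads and seeks hang or fail
rm /tmp/reads.fastq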

So basically I have two questions:

  • Is there any way to map a file into memory so that you can do something like the following? (See also the sketch after this list.)

    ./tool mapped_file

    (where mapped_file is not really a file, but a reference to a memory-mapped file)

  • Do you have any other suggestions on how I could achieve my goal?
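
For what it's worth, Bash process substitution gets close to the invocation sketched in the first question, but under the hood it is still a pipe, so it has the same single-pass, no-seek limitation as a FIFO (the file name is a placeholder):

./tool <(zcat reads.fastq.gz)   # bash expands <(...) to a /dev/fd/N pipe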

Thank you very much, everybody!

guillemch

4 Answers


As suggested in this answer, you can load the whole uncompressed file into RAM:

mkdir /mnt/ram
mount -t ramfs ram /mnt/ram   # note: mounting requires root privileges
# uncompress your file to that directory
./tool /mnt/ram/yourdata

This, however, has the drawback of loading everything into RAM: you'll need enough memory to hold your uncompressed data!

Use umount /mnt/ram when you're finished.

bernard paulus
  • Hi Bernard, that's really close to what I need! Just... I don't have root permissions :-( – guillemch Oct 12 '12 at 12:20
  • There is the workaround of adding an entry to `/etc/fstab` so that you could do that (see the sketch after these comments), but it requires the cooperation of your administrator. Or, if you are able to create your own virtual machine and run everything on top of it... but I think that becomes complex for what you're asking. – bernard paulus Oct 12 '12 at 12:38
  • Yes, and the problem is that our pipeline is used by external users, so I can't assume they'll have root access to tune /etc/fstab or the like. Thank you again! – guillemch Oct 12 '12 at 12:43
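
The `/etc/fstab` workaround mentioned in the comments above would look roughly like this (a sketch; the mount point is an assumption, and an administrator still has to add the line once):

none  /mnt/ram  ramfs  noauto,user  0  0

With the `user` option in place, an ordinary user can then run `mount /mnt/ram` and `umount /mnt/ram` without root.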

If your tool can read from standard input, then one possibility is to decompress and stream with zcat, piping the output into the tool.

Something like this:

zcat large_file.gz | ./tool

If you want to compress your results as well, then you can just pipe the output to gzip again:

zcat large_file.gz | ./tool | gzip - > output.gz

Otherwise, you can look at Python's support for memory mapping:

http://docs.python.org/library/mmap.html

Finally, you can convert the ASCII FASTQ files to BAM format, a compact binary representation that will save you space. See the following:

http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam

juniper-
  • Hi juniper, thank you for your answer. Some tools can't read from stdin; furthermore, they may need more than one file to read. This solution also doesn't solve the problem of a tool opening and reading the same file several times. Thanks anyway! – guillemch Oct 12 '12 at 11:51
  • I've read about Python's mmap, and that would be great if these tools were written in Python. But only the pipeline is written in Python; it runs the external tools through subprocess.check_call, so I can't see a way of using mmap. And converting the files to BAM has the same problem as gzipped files: the tools don't understand them. Thank you! – guillemch Oct 12 '12 at 12:10

Consider looking at the winning entries in the Pistoia Alliance Sequence Squeeze contest, which rated FASTQ compression tools. You may find a tool that reduces I/O overhead through random access and faster decompression.

Alex Reynolds

If you are on Linux, you can write a FUSE filesystem driver: http://pypi.python.org/pypi/fuse-python

The FUSE driver would need to compress and decompress the files transparently. Maybe something like this already exists.
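
One hedged example of "something like this already exists": archivemount is a FUSE filesystem built on libarchive that mounts archives as directories. It only helps if the data is bundled in an archive format it understands (e.g. a .tar.gz rather than a bare .gz of a single file), and the file names below are placeholders:

archivemount reads.tar.gz /mnt/reads    # expose the archive's members as regular files
./tool /mnt/reads/sample.fastq
fusermount -u /mnt/reads                # unmount when finished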

guettli