8

I am using MongoDB to store the raw HTML of web pages scraped with the Scrapy framework. One day of web scraping fills up 25 GB of disk space. Is there a way to store the raw data in a compressed format?

Binit Singh

3 Answers

7

There's nothing built in for compression. Some operating systems offer disk or file compression, but if you want more control, I'd suggest compressing the data yourself, using a library for whatever programming language you're using.

For example, Node.js offers simple convenience methods for this: http://nodejs.org/api/zlib.html#zlib_examples
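Since the question uses Scrapy, here is a minimal Python 3 sketch of the same idea with the standard `gzip` module. The helper names and the `html_gz` field are hypothetical, chosen only for illustration:

```python
import gzip


def compress_html(html: str) -> bytes:
    """Encode the page as UTF-8 and gzip it; the result is raw bytes."""
    return gzip.compress(html.encode("utf-8"))


def decompress_html(blob: bytes) -> str:
    """Reverse compress_html: gunzip, then decode back to text."""
    return gzip.decompress(blob).decode("utf-8")


def make_page_document(url: str, html: str) -> dict:
    # Hypothetical document layout; the field names are illustrative only.
    return {"url": url, "html_gz": compress_html(html)}


# Repetitive markup stands in for a scraped page.
page = "<html><body>" + "<p>hello world</p>" * 1000 + "</body></html>"
doc = make_page_document("http://example.com/", page)

# The round trip must give back the original text.
assert decompress_html(doc["html_gz"]) == page
print(len(page.encode("utf-8")), "->", len(doc["html_gz"]), "bytes")
```

With PyMongo on Python 3, a `bytes` value like `doc["html_gz"]` is stored as a BSON binary field, so the document could be passed to `collection.insert_one(doc)` as is. How much space this saves depends on the pages themselves.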

3.0 Update

If you choose to switch to the new WiredTiger storage engine, which ships with 3.0, you can choose between several types of compression as documented here. Of course, you'll want to test this change against production workloads to find out whether the additional CPU utilization is worth the space saved.
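As a rough sketch of what a per-collection compression setting might look like from Python: the helper below only builds the `storageEngine` option for the `create` command. The function name is hypothetical, and the allowed values assume the compressors that shipped with 3.0 (snappy and zlib):

```python
def wiredtiger_collection_options(compressor: str = "zlib") -> dict:
    """Build the storageEngine option for creating a WiredTiger collection."""
    # Assumption: only the compressors available in MongoDB 3.0 are accepted.
    allowed = {"snappy", "zlib"}
    if compressor not in allowed:
        raise ValueError("unsupported compressor: %r" % compressor)
    return {"wiredTiger": {"configString": "block_compressor=" + compressor}}


options = wiredtiger_collection_options("zlib")
print(options)

# With PyMongo and a mongod running WiredTiger, this could be passed as:
#   db.create_collection("pages", storageEngine=options)
# ("pages" is a placeholder collection name.)
```

The setting only applies to collections created with it; existing collections keep whatever compressor they were created with.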

WiredPrairie
  • http://stackoverflow.com/questions/8506897/how-do-i-gzip-compress-a-string-in-python – WiredPrairie Aug 02 '13 at 11:15
  • @binit No: https://jira.mongodb.org/browse/SERVER-164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel – Sammaye Aug 02 '13 at 12:53
  • @binit Why wouldn't you want to compress the data with Python? – WiredPrairie Aug 02 '13 at 15:23
  • There is no such facility in MongoDB as of today. The recommended way is to compress the data client side and write it as a binary field, for example by using something like gzip to compress it before writing. – christkv Aug 13 '13 at 12:00
  • You can try looking at TokuMX. They do compression on mongodb dataset and also replace the BTree storage engine with a (seemingly) better fractal tree datastructure. http://tokutek.com/products/tokumx-for-mongodb – Ankur Chauhan Aug 24 '13 at 22:45
7

Starting with version 2.8 of MongoDB (released as 3.0), you can use compression. The WiredTiger engine offers three compression options: none, snappy (the default), and zlib. The mmap engine, which is the default in 2.6, does not provide compression.

Here is an example of how much space you can save for 16 GB of data:

[Image: storage size comparison for 16 GB of data]

The data is taken from this article.

Salvador Dali
0

You can compress your string before storing it like this (Python 2 only): myhtml.encode('zlib')

nside
  • myhtml.encode('zlib') doesn't always generate unicode, which causes problems when inserting into MongoDB. – fccoelho Oct 15 '13 at 13:48
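On Python 3, `str.encode('zlib')` is not available, so a rough equivalent calls `zlib` directly. The sketch below assumes the page text is encoded as UTF-8 first; `myhtml` is sample data. The result is `bytes` and is meant to be stored as a binary field, not a string, which sidesteps the unicode problem raised in the comment above:

```python
import zlib

# Sample page text with a non-ASCII character to exercise the encoding step.
myhtml = "<html><body>caf\u00e9 " + "spam " * 200 + "</body></html>"

# Text -> UTF-8 bytes -> zlib-compressed bytes.
compressed = zlib.compress(myhtml.encode("utf-8"))

# Compressed bytes -> UTF-8 bytes -> original text.
restored = zlib.decompress(compressed).decode("utf-8")
assert restored == myhtml

# `compressed` is arbitrary binary data, not text. On Python 2, where the
# same data would be a plain str, wrapping it in bson.binary.Binary before
# inserting marks it as binary for PyMongo.
print(type(compressed).__name__, len(compressed))
```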