7

Description: I have a huge MySQL database table. The total size is about 10 terabytes. It only contains texts.

A sample text from this database table:

In other cases, some countries have gradually learned to produce the same products and services that previously only the U.S. and a few other countries could produce. Real income growth in the U.S. has slowed.

There are about 50 billion different texts.

What have I tried?

I've tried zipping them all. That actually worked and reduced the total size. However, I need to be able to search, and I can't search the data while it is inside a zip file.

I've tried PHP's base64 encoding. It turned my sample text into:

SW4gb3RoZXIgY2FzZXMsIHNvbWUgY291bnRyaWVzIGhhdmUgZ3JhZHVhbGx5IGxlYXJuZW QgdG8gcHJvZHVjZSB0aGUgc2FtZSBwcm9kdWN0cyBhbmQgc2VydmljZXMgdGhhdCBwcmV2 aW91c2x5IG9ubHkgdGhlIFUuUy4gYW5kIGEgZmV3IG90aGVyIGNvdW50cmllcyBjb3VsZC Bwcm9kdWNlLiBSZWFsIGluY29tZSBncm93dGggaW4gdGhlIFUuUy4gaGFzIHNsb3dlZC4=

What I'd like to accomplish?

I want to reduce the size of the texts before sending them to MySQL. First of all, I don't know how to do this. I'm thinking of encrypting and decrypting the data.

So, here is an example of what I want to do:

I want to encrypt the text data before storing it. Then, I want to fetch the encrypted data from MySQL and decrypt it.

Is there any way to reduce the size of the texts? Base64 does not work for me; is there any other way?

– Paraiba to Pusan

4 Answers

14

Please note: neither base64 nor encryption was designed for reducing string length. What you should be looking at is compression, and I think you should look at gzcompress and gzdeflate.

Example using the decoded version of your text:

$original = "In other cases, some countries have gradually learned to produce the same products and services that previously only the U.S. and a few other countries could produce. Real income growth in the U.S. has slowed.";

// base64 alone only inflates the text; the gz* functions actually compress it
$base64     = base64_encode($original);
$compressed = base64_encode(gzcompress($original, 9));
$deflate    = base64_encode(gzdeflate($original, 9));
$encode     = base64_encode(gzencode($original, 9)); // gzip container; computed for comparison, not echoed below

$base64Length     = strlen($base64);
$compressedLength = strlen($compressed);
$deflateLength    = strlen($deflate);
$encodeLength     = strlen($encode);

echo "<pre>";
echo "Using GZ Compress   =  ", 100 - number_format(($compressedLength / $base64Length) * 100, 2), "% improvement", PHP_EOL;
echo "Using Deflate       =  ", 100 - number_format(($deflateLength / $base64Length) * 100, 2), "% improvement", PHP_EOL;
echo "</pre>";

Output

Using GZ Compress   =  32.86% improvement
Using Deflate       =  35.71% improvement
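
For completeness: compression is reversible with the matching function from the same family, so you get the original text back when you read the row. A minimal sketch of the round trip (variable names are just examples):

$stored   = base64_encode(gzcompress($original, 9));   // what goes into MySQL
$restored = gzuncompress(base64_decode($stored));      // what you get back out

// gzcompress() pairs with gzuncompress(), gzdeflate() with gzinflate(),
// and gzencode() with gzdecode()
var_dump($restored === $original); // bool(true)

If the column is a BLOB rather than TEXT, you can store the raw binary output and skip the base64 step, which saves roughly another third.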
– Baba
4

Base64 is not compression or encryption, it is encoding. You can pass text data through the gzip compression algorithm (http://php.net/manual/en/function.gzcompress.php) before you store it in the database, but that will basically make the data unsearchable via MySQL queries.
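
A minimal sketch of that store-and-retrieve round trip, assuming a hypothetical table texts (id, body BLOB) and an existing PDO connection in $pdo:

// Compress before INSERT (hypothetical table: texts(id INT AUTO_INCREMENT, body BLOB))
$stmt = $pdo->prepare("INSERT INTO texts (body) VALUES (?)");
$stmt->execute([gzcompress($text, 9)]);

// Decompress after SELECT; the compressed body cannot be matched with LIKE or
// FULLTEXT, you can only filter on uncompressed columns such as id
$row  = $pdo->query("SELECT body FROM texts WHERE id = 1")->fetch();
$text = gzuncompress($row['body']);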

– monitorjbl
2

Okay, it's really challenging (at least for me)! You have 10 TB of text and you want to load it into your MySQL database and perform a fulltext search on the tables.

Maybe some clustering or some performance tricks on good hardware will work for you, but if that's not the case, you may find this interesting.

First, you need a script that loads these 50 billion pieces of text one after another, splits each into words, and treats those words as keywords, which means giving each a numeric id and saving it in a table. For example, I am piece of large text. would become something like this:

[1: piece][2: large][3: text]

and I'm the next large part! would be:

[4: next][2: large][5: part]

By the way, the words I, am, of, I'm, the plus the punctuation . and ! have been eliminated because they usually contribute nothing to a keyword-based search. However, you can also keep them in your keyword list if you wish.
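
A rough sketch of that splitting step, assuming an in-memory keyword map (the stop-word list, regex, and names below are only illustrative):

// Very rough tokenizer: lowercase, strip punctuation, drop stop words,
// and assign each new keyword an incrementing id.
$stopWords  = ['i', 'am', 'of', "i'm", 'the'];   // example list only
$keywordIds = [];                                // keyword => id
$nextId     = 1;

function tokenize($text, $stopWords, &$keywordIds, &$nextId) {
    $words = preg_split('/[^a-z\']+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $ids = [];
    foreach ($words as $word) {
        if (in_array($word, $stopWords, true)) {
            continue;                            // skip stop words
        }
        if (!isset($keywordIds[$word])) {
            $keywordIds[$word] = $nextId++;      // first time we see this keyword
        }
        $ids[] = $keywordIds[$word];
    }
    return $ids;
}

print_r(tokenize("I am piece of large text.", $stopWords, $keywordIds, $nextId)); // 1, 2, 3
print_r(tokenize("I'm the next large part!", $stopWords, $keywordIds, $nextId));  // 4, 2, 5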

Give the original text a unique id. You can calculate the md5 of the original text or simply assign a numeric id. Then store this id somewhere.

You will need a table to keep the relationships between texts and keywords. It would be a many-to-many structure like this:

[text_id][text]
1 -> I am piece of large text.
2 -> I'm the next large part!

[keyword_id][keyword]
1 -> piece
2 -> large
3 -> text
4 -> next
5 -> part

[keyword_id][text_id]
1 -> 1
2 -> 1
3 -> 1
4 -> 2
2 -> 2
5 -> 2
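
In MySQL, that could translate into three tables roughly like this (a hypothetical schema; names and column types are only examples):

$pdo->exec("CREATE TABLE texts (
    text_id BIGINT UNSIGNED PRIMARY KEY,
    body    LONGTEXT NOT NULL
)");

$pdo->exec("CREATE TABLE keywords (
    keyword_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    keyword    VARCHAR(255) NOT NULL UNIQUE
)");

$pdo->exec("CREATE TABLE keyword_text (
    keyword_id INT UNSIGNED NOT NULL,
    text_id    BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (keyword_id, text_id)
)");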

Now, imagine how much easier it would be (especially for MySQL!) if somebody searches for large text!
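
A sketch of how that lookup could be answered with the hypothetical tables above: find the ids of the searched keywords, then keep only the texts linked to all of them.

$terms        = ['large', 'text'];
$placeholders = implode(',', array_fill(0, count($terms), '?'));

// Texts that contain every searched keyword
$sql = "SELECT kt.text_id
        FROM keyword_text kt
        JOIN keywords k ON k.keyword_id = kt.keyword_id
        WHERE k.keyword IN ($placeholders)
        GROUP BY kt.text_id
        HAVING COUNT(DISTINCT k.keyword_id) = " . count($terms);

$stmt = $pdo->prepare($sql);
$stmt->execute($terms);
$textIds = $stmt->fetchAll(PDO::FETCH_COLUMN); // e.g. [1] for "large text" in the example above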

As far as I could find on the net, you would end up with about 50,000 or 60,000 words as your keywords, or at most 600,000-700,000 words if you keep everything as a keyword. So you can easily guess that 50,000 words takes far less space than 10 TB of text-based data.

I hope that helps, and if you need, I can explain more or help you make this work somehow! :)

– Mahdi
  • Wow, amazing!! I loved your way very much indeed. Can you please give us more information? BTW, I've already copied and pasted your message to my computer in case someone deletes it. Perfect answer, thanks, but I need to test it on my localhost. It will take some time to test everything. – Paraiba to Pusan Sep 22 '12 at 20:00
  • Thanks! Just let me know which parts are not clear enough ... Generally, I would say first test it on a few sentences, and then start loading your data into it, but slowly. If you see that it works correctly on, let's say, 200 MB of text, then you may write a C, Java, or Perl application to just parse your text and put it in your database. On the PHP side, you just need to perform the final stage of the search, which comes from the end user. – Mahdi Sep 22 '12 at 20:11
  • This would be a great answer for me if you gave me information with a sample. Am I going to store the 10 TB of text anyway? I could not understand your suggestion exactly; however, it sounds very interesting and well worth pursuing. – Paraiba to Pusan Sep 22 '12 at 22:29
  • Give me some time, and I will come back with working code ... :) – Mahdi Sep 23 '12 at 09:01
  • One more thing: do you have any problem with storing this 10 TB of data on your filesystem or not? I mean, is search functionality while keeping your original texts on the filesystem okay for you? Or do you NEED/HAVE to reduce the size of the texts on your filesystem? Give me an answer to this, because it's important, and after that I will come back with the code ... :) – Mahdi Sep 23 '12 at 09:46
  • Storing 10 TB of data is my main problem. I need to reduce the size of the texts on my system. I can work with any sort of filesystem or database system as long as it reduces the total size. – Paraiba to Pusan Sep 23 '12 at 14:53
  • Okay! Could you upload, for example, 20 MB of your texts somewhere so I can work on it and see how much it decreases the size? (I want to work on your actual data.) I would post the source code here then! ... probably in the next 1-2 days! :) – Mahdi Sep 23 '12 at 15:51
1

While both answers address the question and provide options for text compression, I don't think compression will solve your problem. Searching through large amounts of data was never the purpose of relational databases like MySQL.

You already got a very good tip about Apache Lucene, and there are other options like Sphinxsearch. Here's a quick thread with a comparison:

Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

– Ramon de la Fuente
  • Thank you mate, I'm going to check out Sphinxsearch. I'm sure it will help me. Thanks, because I didn't know about Sphinxsearch before. – Paraiba to Pusan Sep 22 '12 at 19:36