
I have a database with a few million SHA-256 hashes of files. I frequently receive millions of new files that I have to check against the database to avoid duplicates.

It takes ages to check a file's hash against the MySQL DB. I have already split the hashes into 16 tables (0 to F). I also tried Couchbase, but it needed more than 8 GB of my RAM and aborted the import with a few million hashes left because of excessive memory usage...

Can anyone suggest a way to store about 4.5 GB of hashes (size calculated when the hashes are dumped to a plain text file) in a database that is faster than MySQL?

The hashes are stored without any meta information: no filename, path, ID, or anything else.

kind regards, 3vilc00kie

Edit: Table definition:

-- phpMyAdmin SQL Dump
-- version 3.3.9
-- http://www.phpmyadmin.net
--
-- Host: 127.0.0.1
-- Generation Time: Jan 31, 2014 at 13:55
-- Server Version: 5.5.8
-- PHP-Version: 5.3.5

SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;

--
-- Database: `filehashes`
--

-- --------------------------------------------------------

--
-- Table structure for table `0`
--

CREATE TABLE IF NOT EXISTS `0` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `1`
--

CREATE TABLE IF NOT EXISTS `1` (
  `sha256` binary(32) NOT NULL,
  UNIQUE KEY `sha256` (`sha256`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `2`
--

CREATE TABLE IF NOT EXISTS `2` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `3`
--

CREATE TABLE IF NOT EXISTS `3` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `4`
--

CREATE TABLE IF NOT EXISTS `4` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `5`
--

CREATE TABLE IF NOT EXISTS `5` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `6`
--

CREATE TABLE IF NOT EXISTS `6` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `7`
--

CREATE TABLE IF NOT EXISTS `7` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `8`
--

CREATE TABLE IF NOT EXISTS `8` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `9`
--

CREATE TABLE IF NOT EXISTS `9` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `a`
--

CREATE TABLE IF NOT EXISTS `a` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `b`
--

CREATE TABLE IF NOT EXISTS `b` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `c`
--

CREATE TABLE IF NOT EXISTS `c` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `d`
--

CREATE TABLE IF NOT EXISTS `d` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `e`
--

CREATE TABLE IF NOT EXISTS `e` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- --------------------------------------------------------

--
-- Table structure for table `f`
--

CREATE TABLE IF NOT EXISTS `f` (
  `sha256` binary(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
  • Splitting the tables was probably counter-productive. Checking the hash shouldn't take 'years' if your table is properly indexed on the hash value. Post your table definition. – user207421 Jan 31 '14 at 12:13
  • This doesn't make sense. If you have indexed the hash column in the DB, checking a table of only a few dozen million rows should be near-instant. Tell us more: what kind of query do you perform, and what does the table look like (including indexes)? (What would take time in this scenario is actually calculating the hash of the file content.) – nos Jan 31 '14 at 12:31
  • Damn, I only used UNIQUE once... is it much faster if I add it to all tables? – Raphi Pour Jan 31 '14 at 13:29

5 Answers


You probably don't need a database.

A SHA-256 hash is only 32 bytes long. I generated a list of 50 million unique SHA-256 hashes, sorted them, and wrote them to a file (without hex-encoding them). That's about 1.5 GB, small enough to hold in RAM as a sorted array you can binary-search. That should be easy enough for just about any computer you can find.

So all you have to do is read or mmap the file and do a binary search for each hash you want to check.
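A minimal sketch of that idea in Python. The file layout (concatenated raw 32-byte digests, sorted) matches the description above; the function name and file handling are my own:

```python
import mmap

RECORD = 32  # each record is one raw 32-byte SHA-256 digest


def contains(mm: mmap.mmap, digest: bytes) -> bool:
    """Binary-search a memory-mapped file of sorted, fixed-width digests."""
    lo, hi = 0, len(mm) // RECORD
    while lo < hi:
        mid = (lo + hi) // 2
        rec = mm[mid * RECORD:(mid + 1) * RECORD]
        if rec == digest:
            return True
        if rec < digest:
            lo = mid + 1
        else:
            hi = mid
    return False
```

With the dump sorted once up front, every lookup is O(log n) reads against the page cache, with no database server involved at all.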

When the LinkedIn database of SHA-1 password hashes leaked, there was a site that tried to do something similar to what you're doing here: it put all of the hashes in a database server and let users check them from a web request.

It did not work reliably, so I built basically what I described above. If you grab the code in my gist here: https://gist.github.com/dustin/2885182 and modify it for SHA-256 (basically, set the hash size to 32 instead of 20), it should work pretty well. You can run the lookup inline with your file scanner for near-instant results.

Dustin

Just add an index on the 'sha256' field. In fact, as Dustin points out, this is only 1.5 GB worth of data. It will fit in a single table in MySQL if that's what you're comfortable with (and you clearly are). Just index.
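A sketch of what that could look like, reusing the `BINARY(32)` column from the dump above (`hashes` is an assumed name for the combined table; the `ALTER TABLE` variant indexes one of the existing per-nibble tables in place instead):

```sql
-- Single table holding every hash, clustered on the hash itself.
CREATE TABLE IF NOT EXISTS `hashes` (
  `sha256` BINARY(32) NOT NULL,
  PRIMARY KEY (`sha256`)
) ENGINE=InnoDB;

-- Alternatively, add a unique index to an existing table:
ALTER TABLE `0` ADD UNIQUE KEY `sha256` (`sha256`);
```

With an index in place, each membership check is a single B-tree lookup instead of a full table scan.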


Instead of handling multiple tables by hand, I would use MySQL partitioning. It splits the data across multiple physical partitions quite easily, along with an index, which simplifies both queries and maintenance.
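For instance, hash partitioning can replicate the existing 16-way split declaratively (the table name here is illustrative):

```sql
-- One logical table, physically split into 16 partitions by hash key.
CREATE TABLE `filehashes_part` (
  `sha256` BINARY(32) NOT NULL,
  PRIMARY KEY (`sha256`)
) ENGINE=InnoDB
PARTITION BY KEY (`sha256`)
PARTITIONS 16;
```

Queries address the one logical table, and MySQL prunes to the right partition automatically.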

But here is what's important: create an index on the hash column -- it doesn't have to be a primary key or a unique index. If you do things correctly, the index will be the only thing loaded into memory.

Second, be sure that MySQL is configured to use a generous amount of memory.

If the index fits in memory, then you are ok.

Your process of getting "millions of new files" suggests optimizations on the comparison side. If the "files" are in the application and you are comparing them one by one, then sort them by hash before doing the comparisons. Walking through the data in order does wonders for performance.

If they are in the database, put them in a temporary table with the hash as the primary key. This will keep those in order. Then the index lookups will be pretty efficient, even if the index does not fully fit in memory.
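A sketch of that staging approach (`filehashes` is an assumed name for the single combined table):

```sql
-- Stage the incoming hashes; the primary key keeps them in sorted order.
CREATE TEMPORARY TABLE `incoming` (
  `sha256` BINARY(32) NOT NULL,
  PRIMARY KEY (`sha256`)
) ENGINE=InnoDB;

-- (bulk-load the new hashes into `incoming` here)

-- Hashes that are already known:
SELECT i.`sha256`
FROM `incoming` AS i
JOIN `filehashes` AS f USING (`sha256`);
```

Because both sides are ordered on the same key, the join walks the indexes sequentially rather than doing random lookups.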

Gordon Linoff

You can use MySQL:

  • explore having an auto-increment primary key with a UNIQUE constraint on the hash, rather than using the hash as the primary key. Benchmark this with a fairly large dataset to know if this helps.

  • if using innodb, set the innodb_buffer_pool_size config setting to 50-80% of available RAM: http://docs.oracle.com/cd/E17952_01/refman-5.5-en/innodb-configuration.html

  • if you don't care to know what the duplicates are, use REPLACE INTO ... VALUES (hash1), (hash2), ... to insert 4,000 or 10,000 hashes at a time in a single statement. Experiment with the size of the bulk insert to find a good balance of performance.
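A multi-row statement along these lines, assuming a single `hashes` table with a unique key on `sha256` (the hex values here are placeholders, not real hashes):

```sql
REPLACE INTO `hashes` (`sha256`) VALUES
  (UNHEX('AA11...')),  -- placeholder; use the full 64 hex digits
  (UNHEX('BB22...'));
```

One round-trip per batch amortizes network and statement-parsing overhead across thousands of rows.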

Will
  • I don't see how adding a key is going to make it faster. – user207421 Jan 31 '14 at 12:14
  • @EJP because innodb clusters on the primary key, and sha256 is by design indistinguishable from random? http://stackoverflow.com/questions/9819271/why-is-mysql-innodb-insert-so-slow I am careful to say it wants benchmarking for this data – Will Jan 31 '14 at 12:40

I think in your case you should use HBase and Redis.

You can use HBase or Cassandra to store the files.

Since the total size of the files and hashes runs into gigabytes, MySQL won't be the best choice.

You can use Redis to store the hashes of the files.

I have had the opportunity to use Redis, and from my experience I can say that Redis is extremely fast. The reason is that Redis is an in-memory data store.

This may help you learn more about Redis.

So I think you should use:

HBase / Cassandra: data store

Redis: for storing the hashes.

Hope this helps. Thanks.

ashubhargave