We're currently working with Cassandra on a single-node cluster to test application development against it. We have a fairly large data set of approximately 70M lines of text that we would like to dump into Cassandra.
We have tried all of the following:
- Line-by-line insertion using the Python Cassandra driver (a sketch of this loader is shown after the sample line below)
- Cassandra's COPY command in cqlsh
- Setting the table's SSTable compression to none
We have also explored the sstableloader bulk-loading tool, but we don't have the data in the required SSTable (.db) format. The text file to be loaded has 70M lines that look like:
2f8e4787-eb9c-49e0-9a2d-23fa40c177a4 the magnet programs succeeded in attracting applicants and by the mid-1990s only #about a #third of students who #applied were accepted.
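For illustration, the line-by-line approach is essentially the following (the contact point, keyspace name, and file path are placeholders; the real script only adds logging and error handling):

import uuid
from cassandra.cluster import Cluster

# Connect to the single node (contact point and keyspace are placeholders).
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')

# Prepare the statement once so only the bound values are sent per row.
insert = session.prepare('INSERT INTO post (postid, posttext) VALUES (?, ?)')

with open('posts.txt') as f:
    for line in f:
        # Each line is "<uuid> <text>"; split on the first space only.
        postid, posttext = line.rstrip('\n').split(' ', 1)
        session.execute(insert, (uuid.UUID(postid), posttext))

cluster.shutdown()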
The column family that we're intending to insert into has this creation syntax:
CREATE TABLE post (
postid uuid,
posttext text,
PRIMARY KEY (postid)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={};
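The COPY attempt was along these lines, after converting the raw file into a two-column CSV (the file name is a placeholder):

COPY post (postid, posttext) FROM 'posts.csv' WITH HEADER = false;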
Problem: loading the data into even this simple column family is taking a very long time: about 5 hours for the first 30M rows inserted. We would like to expedite this if possible; for comparison, loading all 70M lines of the same data into MySQL takes approximately 6 minutes on the same server.
Have we missed something? Or could someone point us in the right direction?
Many thanks in advance!