How Cassandra deals with disk IO

I want to compare the read performance between PostgreSQL and Cassandra on a single node.
I have a table of 8 columns and 150,000 rows. To convert it to a column family, I made the primary key the row key in Cassandra and kept the remaining columns as they are in PostgreSQL. I also bulk-loaded the data into Cassandra SSTables, so the data in both systems is on disk.
To read the table from PostgreSQL:
select * from tableName;

It takes about 200 ms.
To read the column family (with keycache and rowcache enabled), I tried both the Thrift API (the get_range_slices method) and CQL 2.0. The former takes around 7,000 ms on average and the latter an unbearable 100,000 ms. I know reads could be very fast if they came from Cassandra's Memtables, but since both systems are reading from disk, why is Cassandra so much slower?
What underlying mechanisms are crucial here?
Here is the customer column family definition:
create column family customer
WITH comparator = UTF8Type
AND key_validation_class = UTF8Type
AND caching = all
AND column_metadata = [
{column_name: C_NAME, validation_class: UTF8Type},
{column_name: C_ADDRESS, validation_class: UTF8Type},
{column_name: C_NATIONKEY, validation_class: UTF8Type},
{column_name: C_PHONE, validation_class: UTF8Type},
{column_name: C_ACCTBAL, validation_class: UTF8Type},
{column_name: C_MKTSEGMENT, validation_class: UTF8Type},
{column_name: C_COMMENT, validation_class: UTF8Type}
];

Here's my Thrift query:
// customer is the column family of 150,000 rows
ColumnParent cf1 = new ColumnParent("customer");
// all columns (the rows have only 8, so a count of 100 covers them)
SlicePredicate predicate = new SlicePredicate();
predicate.setSlice_range(new SliceRange(ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 100));
// all keys
KeyRange keyRange = new KeyRange(150000);
keyRange.setStart_key(new byte[0]);
keyRange.setEnd_key(new byte[0]);
List<KeySlice> cf1_rows = client.get_range_slices(cf1, predicate, keyRange, ConsistencyLevel.ONE);

And here is my CQL 2.0 query:
select * from customer limit 150000;

I blame myself for the misleading title, and the data provided may stir up controversy. I'm not trying to pick a winner here.
Both are doing disk I/O (which is not a typical use case for Cassandra), yet their times differ greatly, so there must be a reason. I'm curious about how each of them deals with it.
So I would appreciate it if you could shed some light on the underlying mechanisms.
This is not an apples-to-apples comparison, but my interest is in the flavor: one is sourer, probably because it contains more vitamin C, and that's what matters to me.


Solution 1:

This is not a valid test for Cassandra, as Postgres and Cassandra are not designed to address the same problems. A full column-family scan is not a real-world query, and if you did this in a production system you would do it using Hadoop, not over Thrift. A more realistic Cassandra test for retrieving a lot of data would be a column slice, where you retrieve a range of columns from A to n for a given set of keys. This is a much more efficient operation and a more appropriate data model choice for Cassandra. Additionally, no one runs Cassandra on a single node; three nodes is a bare-minimum configuration.

If you want to test full-scan capabilities, Thrift (via CQL in your case) is not the way to do it, because all your results have to fit in RAM and be serialized over the wire at once (i.e. there are no cursors). If all your data can fit in RAM, Cassandra isn't the right choice for you. Using Hadoop with Cassandra lets you parallelize a full scan and answer questions about a theoretically infinite amount of data in seconds, something Postgres is not designed to do. If you want to see how this works in detail, check out the RangeClient in Cassandra's Hadoop package. It's also worth noting that a full scan requires disk reads, whereas many common read patterns make use of caches and never hit the disk.
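Because there are no cursors, clients that really must scan over Thrift emulate one by paging: fetch a small batch of rows, then reissue the range query with the last key returned as the new start key, discarding the duplicated boundary row. Here is a minimal sketch of that loop in plain Java, using an in-memory sorted map to stand in for the cluster; the names are illustrative, not the Thrift API:

```java
import java.util.*;

public class RangePager {
    // Stand-in for a key-ordered store; a real cluster orders rows by
    // token, but the paging logic is the same.
    static NavigableMap<String, String> rows = new TreeMap<>();

    /** Fetch one "page": up to count rows starting at startKey (inclusive). */
    static List<Map.Entry<String, String>> getRangeSlice(String startKey, int count) {
        List<Map.Entry<String, String>> page = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.tailMap(startKey, true).entrySet()) {
            if (page.size() == count) break;
            page.add(e);
        }
        return page;
    }

    /** Full scan by repeated paging, the way Thrift clients emulate a cursor. */
    static List<String> scanAll(int pageSize) {
        List<String> keys = new ArrayList<>();
        String start = "";            // empty start key = beginning of the range
        boolean first = true;
        while (true) {
            List<Map.Entry<String, String>> page = getRangeSlice(start, pageSize);
            if (page.isEmpty()) break;
            int from = first ? 0 : 1; // skip the duplicated boundary row on later pages
            for (int i = from; i < page.size(); i++) keys.add(page.get(i).getKey());
            if (page.size() < pageSize) break;
            start = page.get(page.size() - 1).getKey(); // last key -> next start
            first = false;
        }
        return keys;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) rows.put(String.format("key%04d", i), "row" + i);
        System.out.println(scanAll(100).size()); // 1000: every row seen exactly once
    }
}
```

Each page still materializes in memory on the server, but only pageSize rows at a time instead of the whole resultset.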

By contrast, Cassandra is very fast at column range queries and get-by-key(s). This is because keys are hashed to a specific node and columns are sorted by name on write. So if you know your keys and/or want a range of contiguous columns (a very common Cassandra read pattern), you get sequential I/O at worst and cached data at best, with no locking or indirection (i.e. indexes).
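The effect of that on-write sorting can be illustrated with a plain sorted map: because a row's columns are already in comparator order, a slice from one column name to another is a single contiguous scan rather than a seek per column. This is a toy illustration, not the Cassandra API; the names are made up:

```java
import java.util.*;

public class ColumnSlice {
    /** One row's columns, kept sorted by comparator (UTF8Type ~ string order). */
    static NavigableMap<String, String> row = new TreeMap<>();

    /** Return up to count columns in [start, finish], like a Thrift SliceRange. */
    static SortedMap<String, String> slice(String start, String finish, int count) {
        SortedMap<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : row.subMap(start, true, finish, true).entrySet()) {
            if (out.size() == count) break;
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        row.put("C_NAME", "Alice");
        row.put("C_PHONE", "555-0100");
        row.put("C_ADDRESS", "1 Main St");
        row.put("C_ACCTBAL", "712.50");
        // C_ACCTBAL..C_NAME are adjacent in sorted order, so this is one
        // contiguous walk over the map:
        System.out.println(slice("C_ACCTBAL", "C_NAME", 100).keySet());
        // prints [C_ACCTBAL, C_ADDRESS, C_NAME]
    }
}
```

On disk the same property means a slice reads one contiguous region of the row instead of scattering reads across it.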

Solution 2:

To add to your metrics, we did a performance run on a six-node cluster, where performance shines (i.e. more nodes). We ran with PlayOrm's Scalable SQL and queried for all Activities that matched our criteria, and it took 60 ms to return 100 rows from a table of 1,000,000 rows.

Generally, people page their results, so querying for the first 100 rows is the more typical website use case. Other automated programs "may" get all rows, but when you need all rows, you should generally be using map/reduce. Again, you are not comparing apples to apples if you run CQL over all rows, which is something you don't do in NoSQL.

Also, a fairer comparison is Cassandra on six or ten nodes, not one, as it gets faster when the disks work in parallel, something that is not really doable with Postgres, or that would at the very least have trouble with distributed transactions. This may be more apples to apples, as you are not going to run Cassandra on one node in production.

Solution 3:

Thrift, and CQL-over-Thrift, are RPC-based, not cursor-based. So Cassandra has to pull the entire result set into memory, then convert it into Thrift format and send it back (still in memory).

So, my educated guess is that most of the difference is from you clobbering the hell out of the JVM’s allocation/GC subsystem.
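The contrast between that one-shot RPC style and a cursor can be sketched in a few lines of plain Java: one path materializes every row before returning, the other hands rows to the consumer one small page at a time, so only one page is live per iteration. This is a toy model of the allocation behavior, not Cassandra code:

```java
import java.util.*;
import java.util.function.*;

public class FetchStyles {
    /** RPC-style: the server materializes every row before replying. */
    static List<String> fetchAll(int totalRows) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < totalRows; i++) result.add("row-" + i);
        return result; // the whole resultset lives in memory at once
    }

    /** Cursor-style: hand rows to the consumer one page at a time. */
    static int streamPages(int totalRows, int pageSize, Consumer<List<String>> consumer) {
        int pages = 0;
        for (int start = 0; start < totalRows; start += pageSize) {
            List<String> page = new ArrayList<>();
            for (int i = start; i < Math.min(start + pageSize, totalRows); i++)
                page.add("row-" + i);
            consumer.accept(page); // only pageSize rows reachable per iteration
            pages++;
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(150000).size());           // 150000 rows buffered at once
        System.out.println(streamPages(150000, 100, p -> {})); // 1500 short-lived pages
    }
}
```

With 150,000 rows the first path keeps every row (plus its Thrift serialization) reachable until the reply is sent, which is exactly the kind of allocation burst that stresses the GC.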