Cassandra multitenant configuration options
We are in the process of evaluating whether to move an multitenant EAV system built on PostGres to Cassandra and I wanted input on our schema approach to see if a test with Cassandra makes sense. Our multitenant system hierarchy consists of account->app where an account can run multiple apps. Queries need to be segregated by app or by account (aggregating all app data for the account). Accounts can create their own dataobjects with their own custom fields in our EAV model.
There are two approaches that I have considered taking with Cassandra. The first is to hold a certain number of apps (say 20) within 1 column family (to reduce the number of column families used). Each row would be identified by a composite column of accountid->appid->dataobjectid->recordid. Columns would be added on the fly for each app’s dataobject as needed by that app. This means if the column family had two apps, 1 row for the first app may have 20 columns defined while the second app might have 30 columns defined. This would mean there would be a total of 50 potential columns for those two apps. Right now the average number of columns for an app is 19. This means the avg number of columns in a column family would be 400. Seems reasonable and takes advantage of Cassandra’s wide column support. In fact, we could probably easily support more apps per columnfamily. The drawback is that secondary indexes would be difficult as we don’t allow user’s to create their own indexes so queries could not be made more efficient without .
The second approach is to have two columnfamilies to hold all data for say 1000 apps. The first column family would have the same composite column as above, but it would hold the entire dataobject for that row in a JSON document. A second column family would have the same composite key but would add another value to the key which is fieldid that represents the field within the json document (our apps metadata manager stores UUIDs to identify each “field” within a JSON doc), but would have a “fieldvalue” column for each datatype – string, number, decimal, float (dates and bools get converted to numbers). The nice feature here is that we could easily index each of those columns for search purposes and we are minimizing the number of columnfamilies we create.
What are the pros and cons of the two approaches above? Am I missing something obvious or misunderstanding Cassandra in the scenarios above (for example, can I have composite columns that are so wide in the first place)? Are there other, better schema suggestions for an app of this type?
I think the first question you need to answer when deciding on your data model is “how do I intend to query this data?” In general, you are nowhere near the limit in terms of CFs, columns, or number of components in a composite in either model, so I wouldn’t worry about that.
Considering that you are concerned about the lack of secondaries in your first model, this tells me that query-by-value functionality may be important. If so, the second model might serve you better. The caveat there is that secondaries work best in situations where you have low cardinality, and your data may not fit that case well. If not, you can create your own index quite easily, in which case either model will work.
My advice is to figure out how you intend to read your data, then plan your model to match your read patterns. If you’re unsure, play around with both models to see which works best. In my experience it often takes more than one iteration to work out a good model, and you should not be afraid to write your data more than one way. Normalization is not the objective here. If you want to discuss your model more in-depth, check out the Cassandra IRC channel on freenode (#cassandra).