After installing the relevant packages (detailed in the links), edit the /etc/exports file on each Apache Cassandra node and add the following on a single line: [Full path to snapshot epoch folder] [Scylla_IP](rw,sync,no_root_squash,no_subtree_check). Restart the NFS server: sudo systemctl restart nfs-kernel-server. Then create a new folder on one of the ScyllaDB nodes and use it as a mount point to the Apache Cassandra node: sudo mount [Cassandra_IP]:[Full path to snapshots epoch folder] /[ks]/[table].

(scylla-tools-java/issues #36.) Nodetool tablestats estimated partition key counts in Scylla, post migration from Apache Cassandra, can differ from the original Cassandra numbers by anywhere from 20% less to 120% more (issue-2545). Scylla 2.x uses the Apache Cassandra 2.x file format. Scylla and Apache Cassandra encrypted backup files are not compatible. To do that, you need to install the scylla-tools-core package (it includes the sstableloader tool). unirestore is for migrating from SSTable to SSTable.

However, because both clusters are effectively out of service during an offline migration, there is downtime: users will not be able to connect to the system during the migration process, and that's usually a constraint few organizations are willing to accept. Hence the online migration, the second strategy. Contrary to the offline migration, in an online migration both the legacy and the ScyllaDB deployments are fully operational and can serve traffic. There are steps that are common to both the online and the offline flows, and we will dive into them.

So, offline migration: let's look at a timeline where there are no dual writes. We have writes and reads coming to the old database. We migrate the schema (we will show how to do that in a second): you can use the CQL DESCRIBE command to describe your schema and pour it into a .cql file, which you can then use to move the schema from Cassandra to ScyllaDB. You dump the schema and make some adjustments, very minor ones; the needed adjustments are documented in this link. There are a few minor syntax differences between ScyllaDB and Cassandra, and then you can use the cqlsh [IP] -f [file] command to populate that schema in your ScyllaDB cluster (a short sketch of this follows below). As I mentioned, there are a few minor differences you might need to tweak, which is why there is an original and an adjusted schema.

After we have migrated the schema (we're still talking about the offline migration flow), we need to forklift the existing data and move it from Cassandra to ScyllaDB; there are various ways to do that. After forklifting the existing data, the databases are in sync. Now we get to a validation phase, where we make sure the information is in the new ScyllaDB cluster properly, and then we can fade out the old database and start serving the writes and reads from the new database. Again, this is the offline migration.

Online migration is how migrations are done perhaps 95 percent of the time. First of all, you need to think about dual writes: updating your application logic to send writes to both clusters. You have the Cassandra cluster, you have your ScyllaDB cluster, and you have populated the schema on the ScyllaDB cluster. Before you do anything else, you need to start doing dual writes to both clusters to take care of your real-time traffic.
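Going back to the schema step: a minimal sketch of dumping and loading the schema with cqlsh, where the keyspace name ks and the bracketed IPs are illustrative placeholders, and where the dumped file still needs the minor adjustments mentioned above before loading:

```bash
# Dump the schema of one keyspace from a Cassandra node into a .cql file
cqlsh [Cassandra_IP] -e "DESCRIBE KEYSPACE ks" > ks_schema.cql

# Apply the minor syntax adjustments documented in the link, then load it into ScyllaDB
cqlsh [Scylla_IP] -f ks_schema.cql
```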
You need to be able to compare the results and log any inconsistencies, and you need to make sure you are using client-side timestamps. You don't want some NTP server going crazy and then your entries in ScyllaDB not carrying the same timestamps they had in Cassandra, so use client-side timestamps. By the way, that is the default in the latest versions of most drivers, so make sure you're using the latest drivers; that's a general recommendation anyway. Each DB writer should let you stop and start writing to each of the databases at any point in time, and you will be able to apply the same logic to the dual reads later in the process.

Now let's look at the timeline of a live migration. Writes and reads are coming to the old database. We migrate the schema to the new database, and then we start the dual-writes period: we are writing to the new database while still writing to the old one. We then start handling the forklifting of the existing data, which is the second part of this session. At this point the databases are in sync, and we move to a validation phase consisting of dual reads: all the reads are still sent to the old database as they were all along, but we also start reading from the new database, while still relying on the legacy database as the source of truth until we are confident in the reads from the new database. Once we are confident and the validation phase is considered successful, we can start fading out the old database and consider ScyllaDB the source of truth.

Let's say you're already using Kafka and you have a Kafka layer in front of your application layer or the database. In that case it is recommended to buffer the writes while the migration is in process: all the writes to the Cassandra cluster stay intact, but the writes destined for the ScyllaDB cluster get buffered in a Kafka cluster, and after the migration is completed you write all of that buffered data to ScyllaDB as well. You will want to increase the TTL to add buffer time while the migration is in progress or about to start. You can use the Kafka Connect framework with the Cassandra connector to write the data directly into ScyllaDB, or use KSQL if you want to run a couple of transformations on your data, though that is still experimental. So Kafka is a good way of doing that streaming.

Another option is to do the dual writes in your own code, as I mentioned earlier. In our example we took the code and simply duplicated the writes.append call so that we now write to DB1 and DB2; it's basically the same logic and it's not hard. We have more detailed examples in our documentation.

Now let's talk about how to migrate your existing data; everything so far was about the real-time traffic. We have a lot of historical data to migrate, and there are a few options: CQL COPY, SSTableloader, and the Spark Migrator. An important thing to mention is what's written at the bottom: when performing an online migration, always choose the write strategy that preserves timestamps, unless all your keys are unique. You don't want your historical data to get registered with new timestamps in the new database; that would just mess up everything.
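As a hedged sketch of the dual-write idea mentioned above, the same write can be sent to both clusters with one explicitly chosen, client-generated timestamp, so both entries carry identical write times; the table ks.t, the column names, and the bracketed IPs are illustrative:

```bash
# One client-generated timestamp in microseconds, shared by both writes
TS=$(( $(date +%s%N) / 1000 ))
STMT="INSERT INTO ks.t (id, val) VALUES (1, 'x') USING TIMESTAMP ${TS};"

cqlsh [Cassandra_IP] -e "${STMT}"   # write to the legacy cluster
cqlsh [Scylla_IP] -e "${STMT}"      # duplicate the same write, same timestamp, to ScyllaDB
```

In a real application the driver's client-side timestamp generator does this for you; the point is simply that both clusters see the same timestamp for the same logical write.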
A little bit about what happens under the hood: each of these tools does the data migration differently. The ScyllaDB Spark Migrator uses native CQL; it is actually reading your data and writing it with CQL statements to the new cluster. SSTableloader takes the SSTable files, which are the on-disk basis of the data stored in Cassandra, parses them, and writes CQL statements to the new database. And the CQL COPY command lets you use arbitrary data files, CSV files, scanning through a CSV file and inserting all of that data.

Here is an example of a way you can migrate without doing any forklifting at all. Let's say you have highly volatile data with a very low TTL: not a TTL of years, but a TTL of 10 days or so. You can establish the dual writes as I explained earlier, keep them running until the last record in your legacy database has expired via its TTL, turn off the dual writes, and just phase out the old database. In that case you know you have all the needed information in the new database, because everything else has already been TTLed out of the old database and there is no need to keep pushing it to the new one. This is just a small corner case.

Now let's talk about the three options I mentioned, starting with CQL COPY. SSTableloader is the most commonly used tool and it is powerful, but sometimes you just have a small amount of historical data to migrate, something like 2 million rows or less, or you need to work from CSV files, for example if you're not moving from Cassandra to ScyllaDB but from Mongo or some other database, and you cannot just move the SSTable files or read the data with CQL and write it out. In that case you need to dump the data somewhere, so you create a CSV dump, and that can be used. The COPY command is used to export or import small amounts of data in CSV format: COPY TO exports the data from a table into a CSV file, and COPY FROM imports the data from a CSV file into an existing table. Each row in your data will be written as a line in the CSV file; you can determine what the delimiter is and how to parse it when uploading the data, you can drop columns you don't want so you don't have to dump all the information, and you can do a few more things.

A few things to keep in mind with COPY. File size matters: it is not trivial to import huge CSV dumps, so if you have a big data set, split it into multiple CSV files and make sure each one has fewer than two million rows. And again, sometimes this is the only available strategy, because you're not migrating from a cluster that supports CQL and SSTables; the other tools need CQL and/or the SSTable format, which means Cassandra, and that's why most of the migrations described here start from Cassandra. You can also skip unwanted columns and select which columns you want to import. As I mentioned, formatting is important and may differ between CSV files, so make sure you know what your formatting looks like; that's why you will need to specify some of these flags on the CQL COPY command. Use a separate host to do the loading; don't run it locally on a ScyllaDB node, as that adds extra overhead to a node that is already digesting a lot of the data you're loading. Even better, use multiple separate hosts in parallel to stream and import the data with CQL COPY.
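A sketch of that export/import flow with COPY, where the table ks.t, the column list, and the file and host names are illustrative:

```bash
# Export a table from the source cluster into a CSV file
cqlsh [Source_IP] -e "COPY ks.t (id, val) TO 'ks_t.csv' WITH HEADER = true"

# Import the CSV into the already-created table on ScyllaDB,
# running from a separate loader host rather than on a ScyllaDB node
cqlsh [Scylla_IP] -e "COPY ks.t (id, val) FROM 'ks_t.csv' WITH HEADER = true"
```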
There are a few other options: if HEADER = false and no column names are specified, the fields are imported in a deterministic order. There is a HELP for this command, and you can read there about all the various options you have around batching, TTL, CHUNKSIZE, and so on.

To summarize the pros and cons: COPY is a very simple way of migrating and a very simple command to use; the CSV format is transparent and well understood; it's easy to validate; the destination schema can have fewer columns than the original; it can be tweaked; and CSV is supported by plenty of languages, so you can dump data from various databases and then use the COPY command on the CSV file to migrate to ScyllaDB. It is of course compatible with Cassandra, ScyllaDB, and ScyllaDB Cloud. The cons: it is not made for huge data sets, and timestamps are not preserved, so if your use case is timestamp sensitive, this is probably not the way to go.

Next, SSTableloader. First of all, as written at the bottom, you must use the ScyllaDB sstableloader in order to migrate between Cassandra and ScyllaDB. An sstableloader ships as part of the tools in both Cassandra and ScyllaDB, but you need to use the ScyllaDB one to migrate to ScyllaDB; sometimes people don't do that and they hit issues. The usual process is: you create a snapshot on each of the Cassandra nodes, and you run the sstableloader from each of the Cassandra nodes, or preferably from an intermediate node, again so as not to load your Cassandra nodes, which are currently your live real-time database; you can use mount points or whatever works. You then use sstableloader to read the SSTable files from Cassandra and write them into ScyllaDB. There is the -t option for throttling: if you see that something is choking, either the Cassandra side because you're doing a lot of reads from it, or the ScyllaDB side, use throttling to limit the rate of the sstableloaders. And, as always, best practice, as I mentioned for CQL COPY as well: run several sstableloaders in parallel to utilize all the ScyllaDB nodes. ScyllaDB has a sharded architecture, and you want a lot of connections open to ScyllaDB from the sstableloaders; using only one, two, three, or four sstableloaders might not fully utilize ScyllaDB, which at this point is essentially an empty database that you want to ingest all the information as quickly as possible, so think about eight, ten, or twelve sstableloaders.

We also recommend starting with one keyspace and its underlying SSTable files, and only after finishing that keyspace moving on to the next one. Any migration should start small and then go big: don't go for the full-blown migration while you're still learning the tool. It's better to do something small and specific, a specific keyspace or even a specific table, make sure you understand how to use the tool, and then go for the full-blown migration. A few things are important during the forklifting: there should not be any schema updates on your source database. SSTableloader has support for simple column renames, which matters because column names can be a big factor in the metadata you're storing: if every column has a very long name, the metadata can get very big, and sometimes it's better to shorten the column names.
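A sketch of the snapshot-and-load flow described above; the keyspace ks, table t, paths, and the throttle value are illustrative placeholders:

```bash
# On each Apache Cassandra node: take a snapshot of the keyspace
nodetool snapshot ks

# From an intermediate loader host, with the snapshot directory mounted or copied locally:
# read the SSTables and stream them into ScyllaDB, throttled here to 100 Mbps.
# Run several of these in parallel, one per source node / table directory.
sstableloader -d [Scylla_IP] -t 100 /path/to/snapshot/ks/t/
```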
Simple column-rename operations like that are supported in SSTableloader, and Spark is even better in that respect. As I mentioned earlier, when you're assuming RF=3 you end up with nine copies of the data until compaction takes care of it, so you need to be able to accommodate that. As for failure handling: each loading job is per keyspace or table, which means that in any case of failure you will need to repeat the loading job for that SSTable. Because you are loading the same data, and because of the replication factor, compaction will take care of any duplication, so don't worry about that.

What happens if an Apache Cassandra node fails during the migration? If the failed node was one you were loading SSTables from, then the sstableloader job will also fail, since we can no longer read from that node. If you are using a replication factor larger than one, which is almost always the case, your data also exists on the other replicas, on the Cassandra nodes that are still up. You can therefore continue loading SSTables from all the other Cassandra nodes, and once that completes, all the data should still end up in ScyllaDB.

What should I do if one of the ScyllaDB nodes fails? If one of the destination ScyllaDB nodes failed and it was one of the nodes you were loading SSTables to, then the sstableloader job will fail and you will need to restart that job. And if you need to roll back and start from scratch: stop the dual writes to ScyllaDB, stop the ScyllaDB service, use CQL to truncate all the data that was already loaded into ScyllaDB, start the dual writes again, take new snapshots on your source Cassandra nodes, and then start the sstableloader job again from the new snapshot location.
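A sketch of that rollback sequence; the table ks.t, IPs, and paths are illustrative, and stopping or restarting the dual writes happens in your application or writer layer rather than on the command line:

```bash
# (1) stop dual writes to ScyllaDB in the application / writer layer

# (2) truncate whatever was already loaded into ScyllaDB
cqlsh [Scylla_IP] -e "TRUNCATE ks.t;"

# (3) re-enable dual writes, then take fresh snapshots on the source Cassandra nodes
nodetool snapshot ks

# (4) restart the load from the new snapshot location
sstableloader -d [Scylla_IP] /path/to/new/snapshot/ks/t/
```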
Now, the ScyllaDB Spark Migrator. The Spark Migrator is a very cool tool. It is highly resilient to failures: it retries failed reads and writes, and it saves savepoints, which means it can roll back to a savepoint and continue from there if it fails at some point. It has essentially unlimited streaming power; it just depends on how many nodes, or how big, the Spark cluster you bring up is, and Spark is a very powerful tool you should take advantage of. There is no interruption to the workload for either the Cassandra or the ScyllaDB cluster; they run side by side, accepting the new data traffic into both clusters. There is a lot of flexibility to change the data model: the Migrator can help when you need to make changes to your data model, and there is even an option to use it as an ETL tool, taking one data column and enriching it as you store it in ScyllaDB. All of these are great advantages of this tool. The most important one: with solutions like SSTableloader, remember, we were copying the number of replicas, so we were copying the data three times. With Spark you transfer only a single copy of the data, because it is actually reading the data with CQL and writing those CQL statements to ScyllaDB; it is essentially doing a full table scan, so you get only one copy of the data. It's not copying the same data over and over; it behaves like an actual application reading the data once and writing it to ScyllaDB. You can also make sure that no tombstones are transferred when you're using the Spark Migrator. Tombstones can take up a huge amount of storage space, and you usually don't want them transferred; by not transferring them you move less data, save space, save network bandwidth, and reduce your data transfer costs.

The Spark Migrator is very simple to install: you just need the standard Spark stack, Java JRE and JDK, and SBT. There is a configuration file you need to edit, and there is a link here on how to get that done. Connections are created from the Spark jobs to both the ScyllaDB and the Cassandra clusters. Size your Spark cluster based on the amount of data you have to transfer from the Cassandra nodes to ScyllaDB. It is recommended, if possible, that you run a compaction on your Cassandra cluster before you start the migration; it will accelerate the read process because there will be fewer tombstones. These are some of the settings you need to take into consideration. We care about the Spark and Spark Cassandra Connector versions you use: you need to use these versions or higher to take advantage of the latest improvements. The number of connections your Spark workers open towards the Cassandra and ScyllaDB clusters should be proportional to the number of cores you have in your ScyllaDB or Cassandra deployment, so make sure it's not too low; otherwise you won't have a connection to each shard, and with the ScyllaDB architecture we want at least one, or even a few, connections per shard. So make sure you know how many cores your ScyllaDB nodes have and adjust accordingly. You basically need to define the source and destination keyspaces and the tables you want to migrate. To make the transfer more efficient, we ask you to set the split count, which determines how many partitions are used for the Spark execution; ideally this should be the total number of cores in the Spark cluster. Everything I'm saying here is also explained in the link, so you don't have to memorize it all. This is basically how a ScyllaDB Spark Migrator configuration looks.
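Once the configuration file is edited, the job is submitted to the Spark cluster roughly as follows; this is a sketch based on the Migrator project's documentation, and the master URL, paths, and assembly jar name are illustrative, so check the linked repository for the exact class name and options:

```bash
# Submit the Migrator job, pointing it at the edited config file
spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://[Spark_Master_IP]:7077 \
  --conf spark.scylla.config=/path/to/config.yaml \
  /path/to/scylla-migrator-assembly.jar
```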