MySQL Architecture – Sharding – How to approach sharding?

In the previous blog post we started to look at sharding in MySQL. We discussed why someone may want to pursue sharding and when it is a good time to do so. In this blog post we will talk about the different approaches to sharding. What are the different methods you can use to shard your data? Which solution works best in which case? What are the pros and cons of each sharding method? We hope this blog post will give you some insight into the options you have.

Functional sharding

We are going to start with the most basic approach – splitting by function. Most of the time an application is not a monolith – it consists of different modules and functionalities. Not always, but sometimes those functions are already separated from each other in the database: you have a separate schema for the web forum, a separate schema for the e-commerce module, a separate schema for storing log entries and so on.

If you can separate the data into multiple schemas and you do not need to perform cross-schema queries, the next step can be to split those schemas into separate clusters. We would have a cluster responsible for logging, an e-commerce cluster and a cluster that acts as a backend for the web forum or the social media elements of our website. This way we can separate the database traffic generated by those modules and minimize the global impact of any single module. If there is data shared across some of the shards, we can try to denormalize it and modify the application to keep the same data in sync across multiple shards. An example might be user data – something required by both the e-commerce and the social media part of our application. After all, we may want a single source of truth for user data, login credentials and so on. In that case we would need to either run cross-shard queries or duplicate the user data and store it locally on each shard, making sure that the application updates it on all of the shards.
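To make this concrete, here is a minimal sketch of module-based routing, assuming the application already knows which module a query belongs to; the module names and connection strings are purely illustrative.

```python
# Functional sharding: each application module talks to its own cluster.
# Module names and DSNs are hypothetical examples.
MODULE_TO_DSN = {
    "forum":     "mysql://app@forum-cluster:3306/forum",
    "ecommerce": "mysql://app@shop-cluster:3306/shop",
    "logging":   "mysql://app@logs-cluster:3306/logs",
}

def dsn_for_module(module: str) -> str:
    """Return the connection string of the cluster that serves a given module."""
    try:
        return MODULE_TO_DSN[module]
    except KeyError:
        raise ValueError(f"unknown module: {module}") from None

# Example: the forum code connects to its own cluster, never to the shop's.
print(dsn_for_module("forum"))
```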

Pros and cons

Obviously, there are pros and cons to such an approach. One of the definite pros is that it is not hard to implement. If your application's modules already use separate schemas, tables or data in general, you should be able to split the data into shards according to the application functionality quite easily. There is no need for any complex design – you keep what you already had and only change the connection strings so they point to the correct shard.

There are also cons to such an approach. The most significant one is that not every module of the application generates the same level of load. We can certainly split the app into multiple parts, but to what extent will it really help? If one main functionality generates most of the data or write operations, splitting the data set into one big shard and a couple of small ones may only postpone the issue for a short time. Eventually you will have to figure out how to shard the main chunk of the data anyway.

Algorithm-based sharding

Another approach we can take is to use some sort of algorithm to determine how the data should be split. In its simplest form, it can be a very basic mathematical formula like modulo. The idea is to split the data into shards based on the outcome of the algorithm and then let the application determine where to connect for a given row by applying that same algorithm. Let's take a look at the following scenario. Say we have a column with some sort of integer ID and rows with values from 1 to 20. We can easily split them into, say, 5 shards by applying a "row ID modulo 5" function. The remainder of the division will be our shard ID.

Shard   | Row ID
Shard 0 | 5, 10, 15, 20
Shard 1 | 1, 6, 11, 16
Shard 2 | 2, 7, 12, 17
Shard 3 | 3, 8, 13, 18
Shard 4 | 4, 9, 14, 19

If the application needs to retrieve the row with ID 13, it can compute 13 mod 5 = 3 and it will know that such a row should be located on shard 3.
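A minimal sketch of that routing logic could look as follows; the shard count and the mapping of shard numbers to hosts are assumptions made for the example.

```python
# Algorithm-based sharding: the shard is derived from the row ID itself.
NUM_SHARDS = 5
SHARD_HOSTS = {i: f"mysql-shard-{i}.example.com" for i in range(NUM_SHARDS)}  # hypothetical hosts

def shard_for_row(row_id: int) -> int:
    """Return the shard number that holds the given row ID."""
    return row_id % NUM_SHARDS

row_id = 13
shard = shard_for_row(row_id)          # 13 mod 5 = 3
print(f"row {row_id} lives on shard {shard} ({SHARD_HOSTS[shard]})")
```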

Of course, this is the simplest example – we can use more complex algorithms to split the data. In some cases we might even be able to split the data in a non-uniform way, allowing us to isolate the more active parts of the dataset on a separate shard.
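One way to picture such a non-uniform split is a range-based function; the sketch below, with made-up ID ranges and shard names, keeps a small but busy range of IDs on its own shard.

```python
# Range-based sharding sketch: ranges and shard names are purely illustrative.
# A small but very active range of IDs gets a dedicated shard.
RANGES = [
    (1, 1_000, "shard_hot"),        # the busiest IDs, isolated on their own shard
    (1_001, 500_000, "shard_1"),
    (500_001, None, "shard_2"),     # open-ended range for everything above
]

def shard_for_row(row_id: int) -> str:
    for low, high, shard in RANGES:
        if row_id >= low and (high is None or row_id <= high):
            return shard
    raise ValueError(f"no shard covers row ID {row_id}")

print(shard_for_row(42))        # shard_hot
print(shard_for_row(750_000))   # shard_2
```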

Pros and cons

Again, there are pros and cons to such an approach to sharding. The main advantage is that we can identify the shard where the data is located simply by applying the function. There is no need to keep track of where every row lives or to do any bookkeeping for the rows that exist in the database. If applied at the row level, to some ID column, it should generate a more or less even split of the data across the shards – we don't have to worry that one shard will be significantly larger than the others.

The downside is that once the shards start to fill up, resharding is not easy. You have to come up with another sharding function – for example, instead of using "mod 5" we can use "mod 10". The problem is that this requires moving half of the data to another location, something which is less than ideal in a real-world scenario. In our example, the row distribution would look like this:

Shard   | Row ID
Shard 0 | 10, 20
Shard 1 | 1, 11
Shard 2 | 2, 12
Shard 3 | 3, 13
Shard 4 | 4, 14
Shard 5 | 5, 15
Shard 6 | 6, 16
Shard 7 | 7, 17
Shard 8 | 8, 18
Shard 9 | 9, 19
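To see why half of the data has to move, we can simply compare the old and the new shard number for every row in the example; this is just a back-of-the-envelope check, not production tooling.

```python
# Which of the example rows (IDs 1..20) change shards when we switch
# from "mod 5" to "mod 10"? Exactly those whose shard number differs.
moves = [
    (row_id, row_id % 5, row_id % 10)
    for row_id in range(1, 21)
    if row_id % 5 != row_id % 10
]

print(f"{len(moves)} of 20 rows must move")   # 10 of 20 - half of the data set
for row_id, old_shard, new_shard in moves:
    print(f"row {row_id}: shard {old_shard} -> shard {new_shard}")
```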

This becomes an even more serious problem when we face the "hot" shard problem. Let's say that for some reason some of the data is more frequently accessed. It can be a user who accesses the website more often, generates more content or something else. No matter what the reason is, the outcome is that the shard where such a user is located has to deal with more load than the others. It may even get to the point where other users, co-located with our "hot" user on the same shard, are reporting performance issues and temporary slowdowns. One solution would be to split that shard into a couple of smaller shards. You could even try to isolate the "hot" data and move it to a separate shard. In our scenario, where we used a simple modulo function, such an operation is impossible. With more complex functions that assign ranges of rows, it might be doable, but you would have to reshard at least that one particular shard using a custom range that contains just the "hot" data. What is impossible, no matter which function we use, is to extract some rows from a shard, move them somewhere else and leave everything else untouched.

Metadata-based sharding

If we are looking for more flexibility and we are willing to pay the price of increased complexity of the whole environment, we can keep track of the sharding information in a metadata database. The idea is quite simple. We want to be able to tell where every row is located and we want to be able to move data around freely. This is not achievable by means of algorithm-based sharding, but we can have a central database where we store the information about every single row and every single shard. Let's consider the following example:

Row ID | Shard
1      | Shard 1
2      | Shard 4
3      | Shard 1
4      | Shard 1
5      | Shard 2
6      | Shard 2
7      | Shard 3
8      | Shard 3
9      | Shard 5
10     | Shard 3
11     | Shard 3
12     | Shard 3
13     | Shard 3
14     | Shard 4
15     | Shard 4
16     | Shard 4
17     | Shard 1
18     | Shard 5
19     | Shard 5
20     | Shard 5

As you can see, we can freely allocate rows to shards. There is no need to keep the shards in any particular order, and there is no need to keep the row allocation even across the shards. If we notice that the load coming from rows 5 and 6 is enough to fill a whole shard, we simply don't add anything else to shard 2. If a new hot spot emerges on, let's say, shard 4, related to row 15, we can always create a new shard (shard 6) and move rows 2, 14 and 16 there, leaving only row 15 on shard 4. We can do pretty much anything we want. Yes, it will require moving the data across shards and you will probably need to build custom tooling for such operations, but it is definitely doable and, in the end, you will benefit from a great deal of flexibility when it comes to shard and data management.
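A minimal sketch of such a lookup is shown below; a Python dict stands in for the central metadata store, and in a real deployment this would be a query against a dedicated, highly available metadata database (the table and column names in the comment are hypothetical).

```python
# Metadata-based sharding: every row's location is stored explicitly.
# In production this dict would be replaced by something like
# "SELECT shard FROM row_shard_map WHERE row_id = %s" against the metadata store.
ROW_TO_SHARD = {
    1: "shard_1", 2: "shard_4", 3: "shard_1", 4: "shard_1",
    5: "shard_2", 6: "shard_2", 7: "shard_3", 8: "shard_3",
    15: "shard_4", 16: "shard_4",
}

def shard_for_row(row_id: int) -> str:
    shard = ROW_TO_SHARD.get(row_id)
    if shard is None:
        raise LookupError(f"row {row_id} is not registered in the metadata store")
    return shard

# Rebalancing is only a metadata update plus a data copy - no sharding
# function ever has to change.
ROW_TO_SHARD[2] = "shard_6"
print(shard_for_row(2))   # shard_6
```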

Pros and cons

As we mentioned above, this solution is very flexible and you should be able to move the data across the shards as you see fit. We can't stress enough how useful such functionality is. Whenever needed, you can rebalance the shards in a custom way, separating the hot parts of the data from the rest and making the load even across the shards.

There are significant disadvantages, though. First of all, the metadata store has to be available at all times. It can be a MySQL cluster, it can be some sort of key-value store like etcd or Consul. Whatever it is, we have to maintain it, which obviously adds overhead to database operations. The metadata store's availability and performance are critical – it is the source of truth that has to be queried every time the application needs to retrieve data. If the metadata store does not answer, the application cannot figure out where to look for the data it needs.

Another problem is the tooling. You will want to benefit from the flexibility we mentioned, and that pretty much means you have to build and maintain tools that will be used to move the data around. You have to design the process so that it does not impact the database's performance (both MySQL and the metadata store, whatever it is) and so that it does not compromise data consistency. You will have to move the data to the new shard, then keep it updated while you redirect the traffic to the new location. You need a rollback plan should the migration go south. Eventually you will want to be able to purge the old data. This pretty much means that quite a few lines of code will have to be written, thoroughly tested and maintained. Sure, it is not rocket science, but it is yet another piece of code to take care of.
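To illustrate the copy, verify, cut over and purge steps described above, here is an in-memory sketch of such a migration; all of the shard names, rows and helpers are invented for the example, and real tooling would of course work against MySQL and the metadata store, throttle the copy and keep the row in sync until the cutover.

```python
# In-memory sketch of moving a single row to another shard.
shards = {
    "shard_4": {2: {"user": "alice"}, 14: {"user": "bob"}, 15: {"user": "hot-user"}},
    "shard_6": {},
}
metadata = {2: "shard_4", 14: "shard_4", 15: "shard_4"}   # row_id -> shard

def migrate_row(row_id: int, target_shard: str) -> None:
    source_shard = metadata[row_id]
    row = shards[source_shard][row_id]

    # 1. Copy the row to the new shard (real tooling keeps it in sync until cutover).
    shards[target_shard][row_id] = dict(row)

    # 2. Verify the copy before redirecting traffic; roll back if it doesn't match.
    if shards[target_shard][row_id] != row:
        del shards[target_shard][row_id]
        raise RuntimeError(f"copy of row {row_id} failed, keeping it on {source_shard}")

    # 3. Cut over: from now on the application is routed to the new shard.
    metadata[row_id] = target_shard

    # 4. Purge the stale copy from the old shard.
    del shards[source_shard][row_id]

migrate_row(2, "shard_6")
migrate_row(14, "shard_6")
print(metadata)   # only row 15 is left on shard_4
```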

Vitess

Finally, an example of yet another approach. Instead of building a sharding solution on your own, why not benefit from an existing one like Vitess? We have a series of blogs on Vitess where we attempt to set up a cluster using Vitess and its Kubernetes operator. We also dug into some of the operational aspects of using Vitess. We are not going into details here – if you are interested, please check the blogs in the "Vitess" category. What is important is that Vitess is a platform for sharding MySQL. It consists of several elements, among others a proxy, a metadata store and database instances. Under the hood it is all MySQL, but there are layers of tooling that allow for fairly easy operations. Resharding, scaling up and down, backups, schema changes, query routing – all of it can be managed through tools exposed by Vitess. Vitess is plugin-friendly, so it is possible to extend it by developing new functionality on your own. An example would be vindexes (indexes that tell Vitess how to split the data across shards) – if you want to implement your own sharding algorithm, you can quite easily do so.

Pros and cons

As you can imagine, there are pros and cons to using Vitess as well. If you are happy with the tool and it fulfills your requirements, it might be a great way to reduce the burden of creating, testing and implementing a custom-built solution. Vitess is maintained, developed and, to some extent, supported, and it can be used as a base for a stable sharded environment. On the other hand, you are presented with a set of features that have been built into Vitess. If you would like to change some of them or improve the way some operations are performed, you cannot do much except file feature requests or, simply (or not so simply), modify the code on your own. This is pretty much the biggest limitation – if you are unhappy with the ready-made solution, you need to consider whether it makes sense to invest time and money into implementing Vitess or not. Vitess is also quite complex, especially when it comes to the type of queries it supports (as query planning and routing are handled by Vitess), so, again, if something doesn't work for your application, you have to modify the application to work with Vitess. So, on one hand you won't have to build the solution from scratch, on the other hand you are limited to what is feasible in the sharding platform that you use.

As you can see, there are different ways in which you can approach sharding. Based on your requirements and your application design you can pick from several options, including the very flexible approach of building the whole sharding environment yourself, along with the metadata store, tooling and so on, but also the option to utilize existing solutions that can be easily leveraged and, if needed, extended to work with your setup.