In the last blog we ended up with three shards for our newsbtest keyspace. This is great, but let's assume that we quickly noticed that our problem has not disappeared. One of the shards is generating the majority of the load and it impacts the whole application. How can we deal with such an issue? Let's take a look at the custom sharding available in Vitess.
If you would like to follow our steps you can clone this repository: vitesstests
The scenario
The scenario we are going to discuss doesn't make too much sense in our small setup, but it can still be used to highlight the process and the challenges that we might face in a real-world deployment.
Let's assume that we have our data organized using some sort of an ID. It could be, for example, a user ID. Let's say that some of the users generate more data and load. We could be talking about a ticketing system where some users are larger than others, or about social media where some users generate more content than others. It is not important. What is important is that we have a set of tables in our keyspace which are sharded using the ID column, just like our "newsbtest" keyspace. Unfortunately, while users are equally distributed across our three shards, the load they generate is not. Let us assume that one particular user is quite heavy and is responsible for the majority of the workload. Its ID is 653645.
What can we try to do? We could add more shards to the mix, but if that user is indeed responsible for 40% of the load, equal splitting doesn't make a difference: all rows for a given ID map to the same keyspace id, so they always land on a single shard. We would end up with more shards, probably underutilized, with the exception of the shard where our problematic user is located.
How to find a shard for a row?
As we have mentioned, sharding is based on the vindex that is used to access the rows and identify their location on the list of shards in the keyspace. Each vindex is, from the code standpoint, a plugin that defines how to locate the rows. Anyone can write such a plugin, and to determine how it works, the best place to look (outside of the documentation, if documentation exists) is the code.
We can check in the vschema which vindexes are defined:
root@k8smaster:~/vitesstests# vtctlclient GetVSchema newsbtest
{
  "sharded": true,
  "vindexes": {
    "id1": {
      "type": "reverse_bits"
    },
    "id2": {
      "type": "reverse_bits"
    }
  },
  "tables": {
    "sbtest1": {
      "columnVindexes": [
        {
          "column": "id",
          "name": "id1"
        }
      ],
      "autoIncrement": {
        "column": "id",
        "sequence": "sbtest1_seq"
      }
    },
    "sbtest2": {
      "columnVindexes": [
        {
          "column": "id",
          "name": "id2"
        }
      ],
      "autoIncrement": {
        "column": "id",
        "sequence": "sbtest2_seq"
      }
    }
  }
}
As you can see (or maybe you remember it from a previous blog), we used "reverse_bits" vindexes for our "newsbtest" keyspace. Let's take a look at the steps performed to decide to which shard a given row belongs. First, we take our ID value, which is an integer. Then we convert it to binary form, using a 64-bit format. Then we perform a reverse-bits operation on it. For example, if we have an ID value of 127, then the binary representation, padded with leading zeros, is:
0000000000000000000000000000000000000000000000000000000001111111
The reverse-bits operation results in:
1111111000000000000000000000000000000000000000000000000000000000
We simply read the bits from the end to the beginning.
Then, we convert this new value to hexadecimal format: 0xfe00000000000000
and this is all we need to locate the shard for the row. In our case, we have three shards: '-55', '55-aa' and 'aa-'. 0xfe falls into the third range, so the row with an ID of 127 will end up in the third shard.
We can easily check this by executing a query that will return that row and directing it to particular shards:
root@k8smaster:~/vitesstests# mysql newsbtest/-55 -e "SELECT id,k FROM sbtest1 WHERE id=127"
root@k8smaster:~/vitesstests# mysql newsbtest/55-aa -e "SELECT id,k FROM sbtest1 WHERE id=127"
root@k8smaster:~/vitesstests# mysql newsbtest/aa- -e "SELECT id,k FROM sbtest1 WHERE id=127"
+-----+---------+
| id  | k       |
+-----+---------+
| 127 | 1702230 |
+-----+---------+
As we calculated, the only result comes from the keyspace ‘newsbtest/aa-’.
We have included a very simple python script that does such calculations, reversebits.py.
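If you do not have the repository at hand, here is a minimal sketch of the same calculation in Python. It is only an illustration, so the actual reversebits.py may differ in its interface and output format.
# A sketch of the reverse-bits calculation described above (not the actual
# reversebits.py from the repository).
def reverse_bits(value: int, width: int = 64) -> int:
    """Reverse the bits of an integer within a fixed bit width."""
    binary = format(value, f"0{width}b")  # pad with leading zeros to 64 bits
    return int(binary[::-1], 2)           # read the bits back to front

keyspace_id = reverse_bits(127)
print(f"id=127 -> keyspace id {keyspace_id:#018x}")  # prints 0xfe00000000000000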
Sharding the hot shard
We can now put all of those things together. What happened is that we have a hot shard, 'aa-', that gives us a headache. We have identified that the majority of the load is caused by a particular account with the ID of 653645. Let's see how we can deal with this issue.
Vitess comes with functionality that allows us to define an arbitrary key range for a shard. It allows us to create a shard just for that particular user, keep it separated from the rest of the data and ensure it will not impact other users. Well, to actually make this work in a Kubernetes environment, we would also have to add another node to the Kubernetes cluster and then use nodeSelector and nodeAffinity to bind that shard to that separate node (and prevent other pods from being scheduled on it). We won't go into such details, focusing on the Vitess part for now. First, we have to figure out a shard range that will contain only our problematic ID. We will use our simple python script for that:
root@k8smaster:~/vitesstests# python3 reversebits.py
Pass the integer to convert or CTRL+C to exit
653645
binary representation of 653645: 0000000000000000000000000000000000000000000010011111100101001101
Reversed bits: 1011001010011111100100000000000000000000000000000000000000000000
The hex representation of 1011001010011111100100000000000000000000000000000000000000000000: 0xb29f900000000000
Pass the integer to convert or CTRL+C to exit
Ok, so according to the script we are talking about the hexadecimal value of b29f900000000000. Therefore, we are interested in a range: b29f900000000000 – b29f900000000001.
Now we can create a new yaml file with custom sharding. You can find it as 108_custom_sharding_of_newsbtest.yaml. The most important bits are those:
- custom:
shards:
- databaseInitScriptSecret:
name: example-cluster-config
key: init_db.sql
keyRange: {
start: "aa",
end: "b29f900000000000"
}
tabletPools:
- cell: zone1
type: replica
replicas: 2
vttablet:
extraFlags:
db_charset: utf8mb4
backup_storage_implementation: file
backup_engine_implementation: xtrabackup
xtrabackup_root_path: /usr/bin
xtrabackup_user: root
xtrabackup_stripes: '8'
restore_from_backup: 'true'
file_backup_storage_root: /mnt/backup
resources:
requests:
cpu: 500m
memory: 1Gi
mysqld:
resources:
requests:
cpu: 500m
memory: 1Gi
dataVolumeClaimTemplate:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
extraVolumes:
- name: backupvol
persistentVolumeClaim:
claimName: "backupvol"
accessModes: ["ReadWriteMany"]
resources:
requests:
storage: 100Gi
volumeName: backup
extraVolumeMounts:
- name: backupvol
mountPath: /mnt
We have added a whole section where we define custom shards. Instead of
partitionings:
- equal:
parts: 3
we are defining each shard separately, using different ranges:
keyRange: {
start: "b29f900000000000",
end: "b29f900000000001"
}
keyRange: {
start: "b29f900000000001",
end: ""
}
Basically, we are going to define shards with the following ranges (sanity-checked in the sketch after this list):
- -55
- 55-aa
- aa-b29f900000000000
- b29f900000000000-b29f900000000001
- b29f900000000001-
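To double-check which of these shards a given keyspace id will land in, a few lines of Python are enough. This helper is not part of the repository; it only mimics how Vitess compares keyspace ids against key ranges (start inclusive, end exclusive, an empty string meaning an open end).
# Not part of the repository: a quick check of which custom shard a keyspace id
# (in hex) falls into. Start is inclusive, end is exclusive, "" is open-ended.
SHARDS = [
    ("", "55"),
    ("55", "aa"),
    ("aa", "b29f900000000000"),
    ("b29f900000000000", "b29f900000000001"),
    ("b29f900000000001", ""),
]

def shard_for(keyspace_id_hex: str) -> str:
    kid = keyspace_id_hex.lower().ljust(16, "0")
    for start, end in SHARDS:
        after_start = kid >= start.ljust(16, "0")
        before_end = end == "" or kid < end.ljust(16, "0")
        if after_start and before_end:
            return f"{start}-{end}"
    raise ValueError("keyspace id outside all ranges")

print(shard_for("b29f900000000000"))  # b29f900000000000-b29f900000000001 (id 653645)
print(shard_for("fe00000000000000"))  # b29f900000000001- (id 127 from earlier)
Because reverse_bits produces exactly eight-byte keyspace ids, the b29f900000000000-b29f900000000001 range contains a single keyspace id, and therefore only the rows of the ID 653645.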
The first two shards will not change and the Vitess operator will reuse the existing setup. It will create three more sets of shards, each containing a primary and a replica.
Once this is done, you should see the new tablets defined:
root@k8smaster:~/vitesstests# vtctlclient listalltablets | grep newsbtest
zone1-0349226440 newsbtest 55-aa primary 10.244.2.3:15000 10.244.2.3:3306 [] 2021-09-25T17:20:38Z
zone1-0778238830 newsbtest b29f900000000000-b29f900000000001 primary 10.244.3.39:15000 10.244.3.39:3306 [] 2021-09-30T13:43:05Z
zone1-1463074389 newsbtest aa- primary 10.244.3.35:15000 10.244.3.35:3306 [] 2021-09-25T08:01:56Z
zone1-1504968304 newsbtest -55 primary 10.244.2.254:15000 10.244.2.254:3306 [] 2021-09-25T08:01:41Z
zone1-1572740937 newsbtest b29f900000000001- primary 10.244.2.5:15000 10.244.2.5:3306 [] 2021-09-30T13:35:45Z
zone1-1676955594 newsbtest -55 replica 10.244.1.49:15000 10.244.1.49:3306 [] <null>
zone1-2150384904 newsbtest b29f900000000000-b29f900000000001 replica 10.244.1.51:15000 10.244.1.51:3306 [] <null>
zone1-2214114162 newsbtest aa-b29f900000000000 primary 10.244.2.6:15000 10.244.2.6:3306 [] 2021-09-30T13:35:43Z
zone1-3262256522 newsbtest aa- replica 10.244.1.50:15000 10.244.1.50:3306 [] <null>
zone1-3429635106 newsbtest aa-b29f900000000000 replica 10.244.3.38:15000 10.244.3.38:3306 [] <null>
zone1-3839184014 newsbtest b29f900000000001- replica 10.244.3.37:15000 10.244.3.37:3306 [] <null>
zone1-4162850680 newsbtest 55-aa replica 10.244.1.48:15000 10.244.1.48:3306 [] <null>
Now, the same old story begins. We have to reshard our environment using the vtctlclient Reshard command. Please keep in mind that we will not touch the two existing shards (-55 and 55-aa). We only have to reshard the existing shard 'aa-' and split it into three parts.
root@k8smaster:~/vitesstests# vtctlclient Reshard -source_shards 'aa-' -target_shards 'aa-b29f900000000000,b29f900000000000-b29f900000000001,b29f900000000001-' Create newsbtest.reshard
As before, we can check the progress of the resharding process:
root@k8smaster:~/vitesstests# vtctlclient Reshard Progress newsbtest.reshard
Copy Progress (approx):
sbtest1: rows copied 277059/1618282 (17%), size copied 55721984/360611840 (15%)
sbtest2: rows copied 1/1614349 (0%), size copied 49152/383696896 (0%)
Following vreplication streams are running for workflow newsbtest.reshard:
id=1 on b29f900000000000-b29f900000000001/zone1-0778238830: Status: Running. VStream Lag: 0s.
id=1 on b29f900000000001-/zone1-1572740937: Status: Copying. VStream Lag: 0s.
id=1 on aa-b29f900000000000/zone1-2214114162: Status: Copying. VStream Lag: 0s.
Finally, it should complete:
root@k8smaster:~/vitesstests# vtctlclient Reshard Progress newsbtest.reshard
Copy Completed.
Following vreplication streams are running for workflow newsbtest.reshard:
id=1 on b29f900000000000-b29f900000000001/zone1-0778238830: Status: Running. VStream Lag: 0s.
id=1 on aa-b29f900000000000/zone1-2214114162: Status: Running. VStream Lag: 0s.
id=1 on b29f900000000001-/zone1-1572740937: Status: Running. VStream Lag: 0s.
Once this is done, we can verify that we got what we wanted. We will select every row from the sbtest1 table in the 'b29f900000000000-b29f900000000001' shard:
root@k8smaster:~/vitesstests# mysql newsbtest/b29f900000000000-b29f900000000001 -e "SELECT id,k FROM sbtest1\G"
*************************** 1. row ***************************
id: 653645
 k: 3184758
Looks good: this shard contains only one ID, the one of our customer generating the high load. From here we can proceed as before: switch the traffic to the new shards, drain the no longer needed pods (the ones from shard aa-), complete the workflow and remove the unused pods.
root@k8smaster:~/vitesstests# vtctlclient Reshard SwitchTraffic newsbtest.reshard
...
SwitchTraffic was successful for workflow newsbtest.reshard
Start State: Reads Not Switched. Writes Not Switched
Current State: All Reads Switched. Writes Switched
Looks good, the traffic has been redirected to the new shards. Now, the drain annotation:
root@k8smaster:~/vitesstests# for alias in $(vtctlclient listalltablets | grep "newsbtest aa- " | awk '{print $1}') ; do echo ${alias} ; pod=$(kubectl get pods | grep ${alias} | awk '{print $1}') ; kubectl annotate pod ${pod} drain.planetscale.com/started="Draining after reshard" ; done
zone1-1463074389
pod/example-vttablet-zone1-1463074389-a4c6b61f annotated
zone1-3262256522
pod/example-vttablet-zone1-3262256522-ea0a10a7 annotated
Then complete the workflow:
root@k8smaster:~/vitesstests# vtctlclient Reshard Complete newsbtest.reshard
I1001 13:44:29.685156 906611 main.go:67] I1001 13:44:29.419952 traffic_switcher.go:1509] Deleting shard newsbtest.aa-
I1001 13:44:29.694362 906611 main.go:67] I1001 13:44:29.429502 shard.go:197] Deleting all tablets in shard newsbtest/aa- cell zone1
I1001 13:44:29.694388 906611 main.go:67] I1001 13:44:29.429526 shard.go:201] Deleting tablet zone1-3262256522
I1001 13:44:29.740572 906611 main.go:67] I1001 13:44:29.475817 shard.go:201] Deleting tablet zone1-1463074389
I1001 13:44:29.772756 906611 main.go:67] I1001 13:44:29.507633 traffic_switcher.go:1515] Deleted shard newsbtest.aa-
I1001 13:44:29.772857 906611 main.go:67] I1001 13:44:29.507681 traffic_switcher.go:1546] Deleting reverse streams for workflow reshard db_name vt_newsbtest
I1001 13:44:29.882183 906611 main.go:67] I1001 13:44:29.617336 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
I1001 13:44:29.882221 906611 main.go:67] I1001 13:44:29.617386 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
I1001 13:44:29.882231 906611 main.go:67] I1001 13:44:29.617400 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
Complete was successful for workflow newsbtest.reshard
Start State: All Reads Switched. Writes Switched
Current State: Workflow Not Found
That's it. Now, apply the 109_remove_equal_sharding_from_newsbtest.yaml file to delete the unused pods:
root@k8smaster:~/vitesstests# kubectl apply -f 109_remove_equal_sharding_from_newsbtest.yaml
vitesscluster.planetscale.com/example configured
secret/example-cluster-config configured
This is it. In our case, the termination of the pods did not happen immediately and we had to bounce the vttablet containers (locate them on the nodes and then use docker stop to stop them). This triggered the termination of the pods that used to belong to the 'aa-' shard. If we figure out the correct steps to ensure that the pods are always terminated without the need to bounce vttablets, we'll definitely share them with you.
We ended up with split shards and we now have a separate shard for the heavily loaded ID. In the next blog we will show you how to merge shards back together if, let's say, the load has dropped and there is no longer a need for a separate shard for one particular user.