Vitess – Sharding the shard

In the last blog we ended up with three shards for our newsbtest keyspace. This is great, but let’s assume that we quickly noticed that our problem has not disappeared. One of the shards is generating the majority of the load and it impacts the whole application. How can we deal with such an issue? Let’s take a look at the custom sharding available in Vitess.

If you would like to follow our steps, you can clone this repository: vitesstests

The scenario

The scenario we are going to discuss doesn’t make too much sense in our setup, but it can still be used to highlight the process and the challenges that we might face in a real-world deployment.
Let’s assume that we have our data organized using some sort of an ID. It could be, for example, a user ID. Let’s say that some of the users generate more data and load than others. We could be talking about a ticketing system where some users are larger than others, or about social media where some users generate more content than others. The specifics are not important. What is important is that we have a set of tables in our keyspace which are sharded using the ID column, just like in our “newsbtest” keyspace. Unfortunately, while users are equally distributed across our three shards, the load they generate is not. Let us assume that one particular user is quite heavy and is responsible for the majority of the workload. Its ID is 653645.

What can we try to do? We could add more shards to the mix, but if that user is indeed responsible for 40% of the load, splitting evenly doesn’t make much of a difference. We will end up with more shards, probably underutilized, with the exception of the shard where our problematic user is located.
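A quick back-of-the-envelope calculation (with purely illustrative numbers) shows why: all rows for a given ID land on a single shard, so that shard keeps carrying the entire 40% no matter how many shards we add.

heavy_share = 0.40  # fraction of the load generated by our problematic user

for parts in (3, 6, 12):
    # The heavy user's rows always live on exactly one shard, so that shard
    # carries the whole 40% plus an even slice of the remaining 60%.
    busiest = heavy_share + (1 - heavy_share) / parts
    print('{} shards -> busiest shard handles ~{:.0f}% of the load'.format(parts, 100 * busiest))

Going from 3 to 6 shards only brings the busiest shard from roughly 60% down to roughly 50% of the total load, which is hardly worth doubling the fleet.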

How to find a shard for a row?

As we have mentioned, sharding is based on the vindex, which is used to access the rows and identify their location among the shards in the keyspace. Each vindex is, from the code standpoint, a plugin that defines how to map rows to keyspace IDs. Anyone can write such a plugin, and to determine how a given vindex works, the best place to look (outside of the documentation, if documentation exists) is the code.
We can check in the vschema which vindexes are defined:

root@k8smaster:~/vitesstests# vtctlclient GetVSchema newsbtest
{
  "sharded": true,
  "vindexes": {
    "id1": {
      "type": "reverse_bits"
    },
    "id2": {
      "type": "reverse_bits"
    }
  },
  "tables": {
    "sbtest1": {
      "columnVindexes": [
        {
          "column": "id",
          "name": "id1"
        }
      ],
      "autoIncrement": {
        "column": "id",
        "sequence": "sbtest1_seq"
      }
    },
    "sbtest2": {
      "columnVindexes": [
        {
          "column": "id",
          "name": "id2"
        }
      ],
      "autoIncrement": {
        "column": "id",
        "sequence": "sbtest2_seq"
      }
    }
  }
}

As you can see (or maybe you remember it from the previous blog), we used “reverse_bits” vindexes for our “newsbtest” keyspace. Let’s take a look at the steps performed to decide which shard a given row belongs to. First, we take our ID value, which is an integer. Then we convert it to binary form, using a 64-bit representation. Finally, we perform a reverse-bits operation on it. For example, if we have an ID value of 127, then the binary representation, with all the leading zeros, is:

0000000000000000000000000000000000000000000000000000000001111111

Reverse bits operation will result in:

1111111000000000000000000000000000000000000000000000000000000000

We simply read the bits from the end to the beginning.
Then we convert this new value to hexadecimal format: 0xfe00000000000000
and this is all we need to place the row in the correct shard. In our case, we have three shards: ‘-55’, ‘55-aa’ and ‘aa-’. ‘0xfe’ falls into the third range, so the row with an ID of 127 will end up in the third shard.
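To make this logic a bit more tangible, here is a minimal Python sketch of the calculation. This is only an illustration, not the actual Vitess code, and the shard ranges are hardcoded to the ones used in our setup:

def reverse_bits_64(n):
    # Convert to a zero-padded 64-bit binary string, reverse it, read it back.
    return int('{:064b}'.format(n)[::-1], 2)

def shard_for_id(row_id, ranges=(('', '55'), ('55', 'aa'), ('aa', ''))):
    # Return the shard (as "start-end") whose key range contains the keyspace ID.
    # An empty start means "from the beginning", an empty end means "to the end".
    ksid = '{:016x}'.format(reverse_bits_64(row_id))
    for start, end in ranges:
        if (start == '' or ksid >= start.ljust(16, '0')) and \
           (end == '' or ksid < end.ljust(16, '0')):
            return '{}-{}'.format(start, end)

print(hex(reverse_bits_64(127)))  # 0xfe00000000000000
print(shard_for_id(127))          # aa-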
We can easily check this by executing a query that returns that row, directing it at each shard in turn:

root@k8smaster:~/vitesstests# mysql newsbtest/-55 -e "SELECT id,k FROM sbtest1 WHERE id=127"
root@k8smaster:~/vitesstests# mysql newsbtest/55-aa -e "SELECT id,k FROM sbtest1 WHERE id=127"
root@k8smaster:~/vitesstests# mysql newsbtest/aa- -e "SELECT id,k FROM sbtest1 WHERE id=127"
+-----+---------+
| id  | k       |
+-----+---------+
| 127 | 1702230 |
+-----+---------+

As we calculated, the only result comes from the ‘aa-’ shard of the newsbtest keyspace.
We have included a very simple Python script that does such calculations: reversebits.py.

Sharding the hot shard

We can now put all of those pieces together. What happened is that we have a hot shard, ‘aa-’, that gives us a headache. We have identified that the majority of the load is caused by a particular account with the ID of 653645. Let’s see how we can deal with this issue.

Vitess comes with functionality that allows us to define an arbitrary key range for a shard. It allows us to create a shard just for that particular user, keep it separated from the rest of the data and ensure it will not impact other users. Well, to actually make it work in a Kubernetes environment, we would also have to add another node to the Kubernetes cluster and then use nodeSelector and nodeAffinity to bind that shard to that separate node (and prevent other pods from being allocated on it). We won’t go into such details here, focusing on the Vitess part for now. First, we have to figure out a shard range that will contain only our problematic ID. We will use our simple Python script for that:

root@k8smaster:~/vitesstests# python3 reversebits.py
Pass the integer to convert or CTRL+C to exit
653645
binary representation of 653645: 0000000000000000000000000000000000000000000010011111100101001101
Reversed bits: 1011001010011111100100000000000000000000000000000000000000000000
The hex representation of 1011001010011111100100000000000000000000000000000000000000000000: 0xb29f900000000000
Pass the integer to convert or CTRL+C to exit

Ok, so according to the script we are talking about the hexadecimal value of b29f900000000000. Since reverse_bits maps every ID to exactly one keyspace ID, a key range that spans just a single value will contain rows for this one ID only. Therefore, we are interested in the range b29f900000000000 – b29f900000000001.
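A quick Python sketch (again just an illustration built on top of the reverse-bits logic, not something that ships with Vitess) shows how such a single-ID range can be derived for any ID:

def reverse_bits_64(n):
    # Reverse the 64-bit binary representation of n.
    return int('{:064b}'.format(n)[::-1], 2)

def single_id_key_range(row_id):
    # The keyspace ID of the row is the start of the range; the end is
    # exclusive, so adding one covers exactly this single ID.
    ksid = reverse_bits_64(row_id)
    return '{:016x}'.format(ksid), '{:016x}'.format(ksid + 1)

print(single_id_key_range(653645))
# ('b29f900000000000', 'b29f900000000001')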
Now we can create a new YAML file with custom sharding. You can find it in the repository as 108_custom_sharding_of_newsbtest.yaml. The most important bits are these:

    - custom:
        shards:
        - databaseInitScriptSecret:
            name: example-cluster-config
            key: init_db.sql
          keyRange: {
            start: "aa",
            end: "b29f900000000000"
          }
          tabletPools:
          - cell: zone1
            type: replica
            replicas: 2
            vttablet:
              extraFlags:
                db_charset: utf8mb4
                backup_storage_implementation: file
                backup_engine_implementation: xtrabackup
                xtrabackup_root_path: /usr/bin
                xtrabackup_user: root
                xtrabackup_stripes: '8'
                restore_from_backup: 'true'
                file_backup_storage_root: /mnt/backup
              resources:
                requests:
                  cpu: 500m
                  memory: 1Gi
            mysqld:
              resources:
                requests:
                  cpu: 500m
                  memory: 1Gi
            dataVolumeClaimTemplate:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 100Gi
            extraVolumes:
            - name: backupvol
              persistentVolumeClaim:
                claimName: "backupvol"
                accessModes: ["ReadWriteMany"]
                resources:
                  requests:
                    storage: 100Gi
                volumeName: backup
            extraVolumeMounts:
            - name: backupvol
              mountPath: /mnt

We have added a whole section where we define custom shards. Instead of

partitionings:
- equal:
    parts: 3

we are defining each shard separately, using different ranges:

          keyRange: {
            start: "b29f900000000000",
            end: "b29f900000000001"
          }

          keyRange: {
            start: "b29f900000000001",
            end: ""
          }

Basically, we are going to define shards with the following ranges (we will sanity-check them right after the list):

  • -55
  • 55-aa
  • aa-b29f900000000000
  • b29f900000000000-b29f900000000001
  • b29f900000000001-
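One thing to keep in mind when defining shards by hand is that the ranges have to cover the whole keyspace ID space, contiguously and without gaps or overlaps; otherwise some rows would have no shard to land on. A tiny Python sketch (purely illustrative, nothing that Vitess provides) can be used to sanity-check the list before applying the YAML:

def ranges_are_contiguous(ranges):
    # Each range is a (start, end) pair of hex prefixes; '' means the very
    # beginning (as a start) or the very end (as an end) of the keyspace ID space.
    if ranges[0][0] != '' or ranges[-1][1] != '':
        return False
    # Every range must start exactly where the previous one ended.
    return all(prev[1] == curr[0] for prev, curr in zip(ranges, ranges[1:]))

shards = [('', '55'), ('55', 'aa'), ('aa', 'b29f900000000000'),
          ('b29f900000000000', 'b29f900000000001'),
          ('b29f900000000001', '')]
print(ranges_are_contiguous(shards))  # True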

The first two will not change and the Vitess operator will reuse the existing setup. It will create three more sets of shards, each containing a primary and a replica.
Once this is done, you should see the new tablets defined:

root@k8smaster:~/vitesstests# vtctlclient listalltablets | grep newsbtest
zone1-0349226440 newsbtest 55-aa primary 10.244.2.3:15000 10.244.2.3:3306 [] 2021-09-25T17:20:38Z
zone1-0778238830 newsbtest b29f900000000000-b29f900000000001 primary 10.244.3.39:15000 10.244.3.39:3306 [] 2021-09-30T13:43:05Z
zone1-1463074389 newsbtest aa- primary 10.244.3.35:15000 10.244.3.35:3306 [] 2021-09-25T08:01:56Z
zone1-1504968304 newsbtest -55 primary 10.244.2.254:15000 10.244.2.254:3306 [] 2021-09-25T08:01:41Z
zone1-1572740937 newsbtest b29f900000000001- primary 10.244.2.5:15000 10.244.2.5:3306 [] 2021-09-30T13:35:45Z
zone1-1676955594 newsbtest -55 replica 10.244.1.49:15000 10.244.1.49:3306 [] <null>
zone1-2150384904 newsbtest b29f900000000000-b29f900000000001 replica 10.244.1.51:15000 10.244.1.51:3306 [] <null>
zone1-2214114162 newsbtest aa-b29f900000000000 primary 10.244.2.6:15000 10.244.2.6:3306 [] 2021-09-30T13:35:43Z
zone1-3262256522 newsbtest aa- replica 10.244.1.50:15000 10.244.1.50:3306 [] <null>
zone1-3429635106 newsbtest aa-b29f900000000000 replica 10.244.3.38:15000 10.244.3.38:3306 [] <null>
zone1-3839184014 newsbtest b29f900000000001- replica 10.244.3.37:15000 10.244.3.37:3306 [] <null>
zone1-4162850680 newsbtest 55-aa replica 10.244.1.48:15000 10.244.1.48:3306 [] <null>

Now, the same old story begins. We have to reshard our environment using the vtctlclient Reshard command. Please keep in mind that we will not touch the two existing shards (-55 and 55-aa). We only have to reshard the existing shard ‘aa-’ and split it into three parts.

root@k8smaster:~/vitesstests# vtctlclient Reshard -source_shards 'aa-' -target_shards 'aa-b29f900000000000,b29f900000000000-b29f900000000001,b29f900000000001-' Create newsbtest.reshard

As before, we can check the progress of the resharding process:

root@k8smaster:~/vitesstests# vtctlclient Reshard Progress newsbtest.reshard

Copy Progress (approx):

sbtest1: rows copied 277059/1618282 (17%), size copied 55721984/360611840 (15%)
sbtest2: rows copied 1/1614349 (0%), size copied 49152/383696896 (0%)


Following vreplication streams are running for workflow newsbtest.reshard:

id=1 on b29f900000000000-b29f900000000001/zone1-0778238830: Status: Running. VStream Lag: 0s.
id=1 on b29f900000000001-/zone1-1572740937: Status: Copying. VStream Lag: 0s.
id=1 on aa-b29f900000000000/zone1-2214114162: Status: Copying. VStream Lag: 0s.

Finally, it should complete:

root@k8smaster:~/vitesstests# vtctlclient Reshard Progress newsbtest.reshard

Copy Completed.

Following vreplication streams are running for workflow newsbtest.reshard:

id=1 on b29f900000000000-b29f900000000001/zone1-0778238830: Status: Running. VStream Lag: 0s.
id=1 on aa-b29f900000000000/zone1-2214114162: Status: Running. VStream Lag: 0s.
id=1 on b29f900000000001-/zone1-1572740937: Status: Running. VStream Lag: 0s.

Once this is done, we can verify that we got what we wanted. We will select every row from the sbtest1 table in the ‘b29f900000000000-b29f900000000001’ shard:

root@k8smaster:~/vitesstests# mysql newsbtest/b29f900000000000-b29f900000000001 -e "SELECT id,k FROM sbtest1\G"
*************************** 1. row ***************************
id: 653645
 k: 3184758

Looks good; this shard contains only one ID, the one belonging to the customer generating the high load. From here we can proceed as before: switch the traffic to the new shards, drain the unneeded pods (those from the old ‘aa-’ shard), clean up the workflow and remove the unused pods.

root@k8smaster:~/vitesstests# vtctlclient Reshard SwitchTraffic newsbtest.reshard
.
.
.
SwitchTraffic was successful for workflow newsbtest.reshard
Start State: Reads Not Switched. Writes Not Switched
Current State: All Reads Switched. Writes Switched

Looks good, the traffic has been redirected to the new shards. Now, the drain annotation:

root@k8smaster:~/vitesstests# for alias in $(vtctlclient listalltablets | grep "newsbtest aa- " | awk '{print $1}') ; do echo ${alias} ; pod=$(kubectl get pods | grep ${alias} | awk '{print $1}') ;  kubectl annotate pod ${pod} drain.planetscale.com/started="Draining after reshard" ; done
zone1-1463074389
pod/example-vttablet-zone1-1463074389-a4c6b61f annotated
zone1-3262256522
pod/example-vttablet-zone1-3262256522-ea0a10a7 annotated

Then complete the workflow:

root@k8smaster:~/vitesstests# vtctlclient Reshard Complete newsbtest.reshard
I1001 13:44:29.685156  906611 main.go:67] I1001 13:44:29.419952 traffic_switcher.go:1509] Deleting shard newsbtest.aa-
I1001 13:44:29.694362  906611 main.go:67] I1001 13:44:29.429502 shard.go:197] Deleting all tablets in shard newsbtest/aa- cell zone1
I1001 13:44:29.694388  906611 main.go:67] I1001 13:44:29.429526 shard.go:201] Deleting tablet zone1-3262256522
I1001 13:44:29.740572  906611 main.go:67] I1001 13:44:29.475817 shard.go:201] Deleting tablet zone1-1463074389
I1001 13:44:29.772756  906611 main.go:67] I1001 13:44:29.507633 traffic_switcher.go:1515] Deleted shard newsbtest.aa-
I1001 13:44:29.772857  906611 main.go:67] I1001 13:44:29.507681 traffic_switcher.go:1546] Deleting reverse streams for workflow reshard db_name vt_newsbtest
I1001 13:44:29.882183  906611 main.go:67] I1001 13:44:29.617336 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
I1001 13:44:29.882221  906611 main.go:67] I1001 13:44:29.617386 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
I1001 13:44:29.882231  906611 main.go:67] I1001 13:44:29.617400 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
Complete was successful for workflow newsbtest.reshard
Start State: All Reads Switched. Writes Switched
Current State: Workflow Not Found

That’s it. Now, apply the 109_remove_equal_sharding_from_newsbtest.yaml file to delete the unused pods:

root@k8smaster:~/vitesstests# kubectl apply -f 109_remove_equal_sharding_from_newsbtest.yaml
vitesscluster.planetscale.com/example configured
secret/example-cluster-config configured

This is it. In our case the termination of the pods did not happen immediately and we had to bounce the vttablet containers (locate them on the nodes and then stop them with docker stop). This triggered the termination of the pods that used to belong to the ‘aa-’ shard. If we figure out the correct steps to ensure that the pods are always terminated without the need to bounce vttablets, we’ll definitely share them with you.

We ended up with split shards and we now have a separate shard for the heavily loaded ID. In the next blog we will show you how to merge shards back together if, let’s say, the load has dropped and there is no longer a need for a separate shard for one particular user.