Let’s say that we are running out of write capacity in our primary database. In simpler words, our application executes so many writes per second that the database cannot keep up with them. The bottleneck may be replication (even multithreaded), or simply a primary node overloaded with queries that modify the data in one way or another. Adding more replicas, obviously, won’t change anything. It is time to split the data into two or more parts, which, basically, splits the writes. Because writes are not necessarily uniformly distributed, one part of the data may still generate more write traffic than the others. In that case it might be required to split the data even further, but let’s not go too deep into details. For now we are just going to assume that we want to split the data into equal parts.
If you would like to follow our steps, you can clone the vitesstests repository.
The plan
We are going to shard our “newsbtest” keyspace, splitting it into three shards. This keyspace contains two tables, both with the same structure:
mysql> SHOW CREATE TABLE newsbtest.sbtest1\G
*************************** 1. row ***************************
Table: sbtest1
Create Table: CREATE TABLE `sbtest1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`k` int(11) NOT NULL DEFAULT '0',
`c` char(120) NOT NULL DEFAULT '',
`pad` char(60) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=5000001 DEFAULT CHARSET=utf8
1 row in set (0.01 sec)
Of course, real-world scenarios will be more complex, but for now this should work quite well as an example. Let’s say that we are going to use the ‘id’ column to shard the data. There are a couple of things we have to do:
Auto-increment doesn’t work well with sharding, therefore we should remove it and replace it with sequences managed by Vitess.
We have to create a vschema that Vitess will use to “understand” the keyspace structure when sharding.
We have to decide on the type and then create the vindex on the column we are going to use for sharding.
Spin up additional vttablets to accommodate the new shards.
Shard the data by migrating it to the new vttablets.
Finish the migration by switching traffic to the new set of vttablets and deleting the old, unsharded vttablets.
Preparations
As you can see, we have a couple of things to do. Let’s go through the steps one by one.
Migrate auto_increment columns to sequences
The first step would be to disable auto_increment for ‘id’ columns and start using sequences. There are several smaller steps we need to perform. First, we will have to run the schema change to disable AUTO_INCREMENT on ‘id’ columns. Then we will have to create sequence tables in non-sharded keyspace. Finally, we will define vschema with sequences and proper vindexes.
Online schema change
Vitess lets us perform schema changes in multiple ways. We can execute a “normal”, direct ALTER, just like on a regular replication setup; it comes with the same limitations and the impact is quite severe. We can use built-in tools to perform an online schema change. We can configure Vitess to use external online schema change tools like pt-online-schema-change and gh-ost to execute the schema modification. We can, finally, forget (to some extent) about Vitess, treat our setup as regular replication, connect to the vttablets directly and use whatever approach we prefer. In our case we will go with pt-osc managed by Vitess.
The ALTERs that we want to run look like this:
ALTER TABLE sbtest1 CHANGE id id BIGINT NOT NULL;
ALTER TABLE sbtest2 CHANGE id id BIGINT NOT NULL;
We can execute schema changes through Vitess using vtctlclient:
root@k8smaster:~/vitesstests# vtctlclient ApplySchema -allow_long_unavailability -ddl_strategy=pt-osc -sql="ALTER TABLE sbtest1 CHANGE id id BIGINT NOT NULL" newsbtest
cb2c7f54_18d1_11ec_8195_1601acba28b7
In this case we have scheduled an online migration using pt-osc. We can check the status of the migration:
root@k8smaster:~/vitesstests# vtctlclient OnlineDDL newsbtest show all
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
|      Tablet      | shard | mysql_schema | mysql_table | ddl_action |            migration_uuid            | strategy |  started_timestamp  | completed_timestamp | migration_status |
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
| zone1-1817996704 | -     | vt_newsbtest | sbtest1     | alter      | cb2c7f54_18d1_11ec_8195_1601acba28b7 | pt-osc   | 2021-09-18 22:43:12 |                     | running          |
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
If we want, we may even check exactly what is happening. For that we would have to find a docker container for vttablet with alias zone1-1817996704. We have shown in earlier posts how to do that but the gist is, we can see where the pod is scheduled using ‘kubectl describe nodes’ and then SSH to the node and connect to the correct Docker container. Once inside we can run:
vitess@example-vttablet-zone1-1817996704-e4f712d6:/$ tail -f /tmp/online-ddl-cb2c7f54_18d1_11ec_8195_1601acba28b7-3720010186/migration.log
Copying `vt_newsbtest`.`sbtest1`:   6% 01:18:42 remain
Copying `vt_newsbtest`.`sbtest1`:   6% 01:21:47 remain
Copying `vt_newsbtest`.`sbtest1`:   6% 01:24:49 remain
Copying `vt_newsbtest`.`sbtest1`:   7% 01:26:09 remain
Copying `vt_newsbtest`.`sbtest1`:   7% 01:28:28 remain
Copying `vt_newsbtest`.`sbtest1`:   7% 01:29:53 remain
Copying `vt_newsbtest`.`sbtest1`:   8% 01:31:20 remain
Copying `vt_newsbtest`.`sbtest1`:   8% 01:32:39 remain
Copying `vt_newsbtest`.`sbtest1`:   8% 01:33:44 remain
Copying `vt_newsbtest`.`sbtest1`:   9% 01:35:12 remain
to track the status of the migration. We can now proceed to schedule another migration:
root@k8smaster:~/vitesstests# vtctlclient ApplySchema -allow_long_unavailability -ddl_strategy=pt-osc -sql="ALTER TABLE sbtest2 CHANGE id id BIGINT NOT NULL" newsbtest
W0918 22:49:11.314024 1538096 main.go:67] W0918 22:49:11.051922 schema.go:107] Failed to reload schema on replica tablet zone1-3154448388 in newsbtest/- (use vtctl ReloadSchema to try again): rpc error: code = Unknown desc = TabletManager.ReloadSchema on zone1-3154448388 error: timed out waiting for position 0846df54-17b6-11ec-835d-c280334b0c29:1-6,0c526d17-178f-11ec-9d8d-b6bf3fb70a78:1-35,0f2cbc65-178f-11ec-9105-f20b4a03d898:1-4448: timed out waiting for position 0846df54-17b6-11ec-835d-c280334b0c29:1-6,0c526d17-178f-11ec-9d8d-b6bf3fb70a78:1-35,0f2cbc65-178f-11ec-9105-f20b4a03d898:1-4448
9e2f15af_18d2_11ec_8195_1601acba28b7
This time we got an error related to reloading the schema for our keyspace. This is sort of expected, given that one online migration is already running. It should not affect our second migration, though. If we take a look at the online migration status, we should see a queued migration:
root@k8smaster:~/vitesstests# vtctlclient OnlineDDL newsbtest show all
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
|      Tablet      | shard | mysql_schema | mysql_table | ddl_action |            migration_uuid            | strategy |  started_timestamp  | completed_timestamp | migration_status |
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
| zone1-1817996704 | -     | vt_newsbtest | sbtest1     | alter      | cb2c7f54_18d1_11ec_8195_1601acba28b7 | pt-osc   | 2021-09-18 22:43:12 |                     | running          |
| zone1-1817996704 | -     | vt_newsbtest | sbtest2     | alter      | 9e2f15af_18d2_11ec_8195_1601acba28b7 | pt-osc   |                     |                     | queued           |
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
That’s it. Now it is time to wait for the schema changes to complete on both the primary node and the replica.
It may happen that a migration fails, for whatever reason. By default, Vitess will retry the migration twice and then mark it as failed. This is exactly what happened for us:
root@k8smaster:~/vitesstests# vtctlclient OnlineDDL newsbtest show all
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
|      Tablet      | shard | mysql_schema | mysql_table | ddl_action |            migration_uuid            | strategy |  started_timestamp  | completed_timestamp | migration_status |
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
| zone1-1817996704 | -     | vt_newsbtest | sbtest1     | alter      | cb2c7f54_18d1_11ec_8195_1601acba28b7 | pt-osc   | 2021-09-18 22:43:12 | 2021-09-19 04:32:05 | complete         |
| zone1-1817996704 | -     | vt_newsbtest | sbtest2     | alter      | 9e2f15af_18d2_11ec_8195_1601acba28b7 | pt-osc   |                     | 2021-09-19 04:32:13 | failed           |
+------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
What can we do? We can investigate the migration log, hoping it will point us to the particular problem that prevented the migration from completing. As we mentioned earlier, those logs can be found on the vttablet container where the migration was scheduled. In our case, for both failed migration attempts, it was:
vitess@example-vttablet-zone1-1817996704-e4f712d6:/$ cat /tmp/online-ddl-9e2f15af_18d2_11ec_8195_1601acba28b7-1263821530/migration.log | more
Cannot connect to MySQL: DBI connect('vt_newsbtest;host=example-vttablet-zone1-1817996704-e4f712d6;port=3306;mysql_read_default_group=client','vt-online-ddl-internal',…) failed: Access denied for user 'vt-online-ddl-internal'@'10.244.2.11' (using password: YES) at /usr/bin/pt-online-schema-change line 2345.
To be honest, it seemed like a temporary fluke, as the first migration worked just fine and the second one was to be executed in exactly the same way. We just proceeded to re-schedule the migration, which you can do quite easily by running:
root@k8smaster:~/vitesstests# vtctlclient OnlineDDL newsbtest retry 9e2f15af_18d2_11ec_8195_1601acba28b7
+------------------+--------------+
|      Tablet      | RowsAffected |
+------------------+--------------+
| zone1-1817996704 |            1 |
+------------------+--------------+
We marked a particular migration (using its UUID) for retry. This time it started just fine and pt-online-schema-change continued to execute without any problems.
Create sequence tables
The migrations are running, but we don’t have to just sit and wait: we can proceed with the rest of the steps. First, we have to create sequence tables. Those tables have to be created in a non-sharded keyspace; we can use the ‘sbtest’ keyspace for that. We will also have to fill in the data for those tables: a single row that defines how the sequence should work. You can think of sequences in Vitess as centralized ID generators that take care of generating values for ‘id’ columns across all shards.
We have created a file, sbtest1_2.seq.sql which will create and populate sequence tables for both tables:
create table sbtest1_seq(id int, next_id bigint, cache bigint, primary key(id)) comment 'vitess_sequence';
insert into sbtest1_seq(id, next_id, cache) values(0, 50000001, 10);
create table sbtest2_seq(id int, next_id bigint, cache bigint, primary key(id)) comment 'vitess_sequence';
insert into sbtest2_seq(id, next_id, cache) values(0, 50000001, 10);
What we see here are two CREATE TABLE statements that create sequence tables as defined in the Vitess documentation, plus INSERTs that populate them with data. The ‘id’ column should always be set to 0. Then we have the value of the next ID that should be handed out; since we have 5 million rows in each table, we start at 50,000,001, safely above the current maximum ID. Finally there is the cache, which tells Vitess how many sequence values can at most be reserved in one batch. In our example Vitess will take 10 values for the sequence if needed, use as many as it requires and discard the rest if they are not needed at that moment. Setting it higher reduces the workload on the database (less frequent queries to the sequence table), but it increases the chance that some ID values won’t be used and will be, technically, “lost”.
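To make the cache semantics concrete, here is a rough Python sketch (our own illustration, not Vitess code) of how a cached sequence behaves: the backing table row is bumped by `cache` values at a time, and any values still held in memory when the serving tablet restarts are lost.

```python
class CachedSequence:
    """Toy model of a Vitess-style sequence backed by a single table row."""

    def __init__(self, next_id: int, cache: int):
        self.stored_next_id = next_id  # what the sequence table row holds
        self.cache = cache
        self.current = None            # next value to hand out from the cached batch
        self.batch_end = None          # first value NOT in the cached batch

    def next_value(self) -> int:
        if self.current is None or self.current >= self.batch_end:
            # Reserve a new batch: one write to the sequence row per `cache` values.
            self.current = self.stored_next_id
            self.batch_end = self.current + self.cache
            self.stored_next_id = self.batch_end
        value = self.current
        self.current += 1
        return value

    def restart(self):
        """Simulate a tablet restart: unused cached values are lost."""
        self.current = None
        self.batch_end = None


seq = CachedSequence(next_id=50000001, cache=10)
print([seq.next_value() for _ in range(3)])  # [50000001, 50000002, 50000003]
seq.restart()
print(seq.next_value())  # 50000011 -- values 50000004..50000010 are "lost"
```

A bigger `cache` means fewer writes to the sequence row per generated ID, at the cost of larger gaps after a restart.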
We can execute those queries via vtctlclient:
root@k8smaster:~/vitesstests# vtctlclient ApplySchema -sql "$(cat sbtest1_2.seq.sql)" sbtest
I0920 13:49:02.472689 2540047 main.go:67] I0920 13:49:02.192959 tablet_executor.go:277] Received DDL request. strategy=direct
I0920 13:49:03.168761 2540047 main.go:67] I0920 13:49:02.907278 tablet_executor.go:277] Received DDL request. strategy=direct
I0920 13:49:03.237284 2540047 main.go:67] I0920 13:49:02.975719 tablet_executor.go:277] Received DDL request. strategy=direct
I0920 13:49:03.869827 2540047 main.go:67] I0920 13:49:03.608305 tablet_executor.go:277] Received DDL request. strategy=direct
Create VSchemas
Then we should proceed and create proper VSchemas for our keyspaces. VSchema is a schema definition in Vitess that helps it to perform its tasks. That’s a short, vague definition. What we want is to tell Vitess that we have just created sequence tables. We also want to tell Vitess which tables will use those sequences we have just created. Finally, we want to tell Vitess to create a VIndex that is going to be used for sharding.
The first file to apply is vschema_sbtest_sequence.json:
root@k8smaster:~/vitesstests# cat vschema_sbtest_sequence.json
{
"tables": {
"sbtest1_seq": {
"type": "sequence"
},
"sbtest2_seq": {
"type": "sequence"
},
"sbtest1": {},
"sbtest2": {},
"sbtest3": {},
"sbtest4": {}
}
}
What we have here is the information that the keyspace ‘sbtest’ consists of two sequence tables (sbtest1_seq and sbtest2_seq) and four other tables (sbtest1 through sbtest4) that need no special handling.
The next step will be to create a VSchema for our newsbtest keyspace. You can find it in the file vschema_newsbtest_vindex.json:
root@k8smaster:~/vitesstests# cat vschema_newsbtest_vindex.json
{
"sharded": true,
"vindexes": {
"id1": {
"type": "reverse_bits"
},
"id2": {
"type": "reverse_bits"
}
},
"tables": {
"sbtest1": {
"column_vindexes": [{
"column": "id",
"name": "id1"
}],
"auto_increment": {
"column": "id",
"sequence": "sbtest1_seq"
}
},
"sbtest2": {
"column_vindexes": [{
"column": "id",
"name": "id2"
}],
"auto_increment": {
"column": "id",
"sequence": "sbtest2_seq"
}
}
}
}
As you can see, first of all we mark the keyspace as sharded. Technically it is not sharded yet, but it will be eventually. We also define two vindexes of the “reverse_bits” type. Vitess allows the use of different vindex types, depending on the requirements, column type and so on. A single vindex can be shared by several columns, so strictly speaking one would suffice, but to keep the option of changing one of them independently later, we decided to go with two, for now identical, vindexes.
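To see why `reverse_bits` is a reasonable choice for an auto-increment-style column, here is a small Python sketch (our own illustration of the idea, not the Vitess implementation): the keyspace ID is the 64-bit bit-reversal of the column value, and the owning shard is determined by the range the leading bytes fall into. Sequential IDs therefore scatter across shards instead of all landing on the last one.

```python
def reverse_bits(value: int) -> bytes:
    """Compute a reverse_bits-style keyspace ID: the 64-bit value with its bits reversed."""
    reversed_val = int(format(value & 0xFFFFFFFFFFFFFFFF, "064b")[::-1], 2)
    return reversed_val.to_bytes(8, "big")


def shard_for(keyspace_id: bytes) -> str:
    """Map a keyspace ID to one of our three shards (-55, 55-aa, aa-) by its first byte."""
    first = keyspace_id[0]
    if first < 0x55:
        return "-55"
    if first < 0xAA:
        return "55-aa"
    return "aa-"


for row_id in (1, 2, 3, 4):
    kid = reverse_bits(row_id)
    print(row_id, kid.hex(), shard_for(kid))
# ids 1, 2, 3, 4 land on shards 55-aa, -55, aa-, -55: neighbouring rows
# are spread out rather than all hitting the same shard.
```

The shard boundaries here match the ranges we end up with later in this post; the real vindex and key-range comparison in Vitess work on full 8-byte keyspace IDs.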
Finally we have the definition of both tables. We haven’t included all of the columns, even though doing so can help with performance, as Vitess would better understand the schema and the structure of the data. We listed only the minimum required: the vindex and the sequence.
Let’s go ahead and apply those changes:
root@k8smaster:~/vitesstests# vtctlclient ApplyVSchema -vschema="$(cat vschema_sbtest_sequence.json)" sbtest
New VSchema object:
{
"tables": {
"sbtest1": {},
"sbtest1_seq": {
"type": "sequence"
},
"sbtest2": {},
"sbtest2_seq": {
"type": "sequence"
},
"sbtest3": {},
"sbtest4": {}
}
}
If this is not what you expected, check the input data (as JSON parsing will skip unexpected fields).
As you can see, we can check how Vitess understood our vschema and what it applied. Let’s apply the other file, where we define the vschema for the newsbtest keyspace:
root@k8smaster:~/vitesstests# vtctlclient ApplyVSchema -vschema="$(cat vschema_newsbtest_vindex.json)" newsbtest
New VSchema object:
{
"sharded": true,
"vindexes": {
"id1": {
"type": "reverse_bits"
},
"id2": {
"type": "reverse_bits"
}
},
"tables": {
"sbtest1": {
"columnVindexes": [
{
"column": "id",
"name": "id1"
}
],
"autoIncrement": {
"column": "id",
"sequence": "sbtest1_seq"
}
},
"sbtest2": {
"columnVindexes": [
{
"column": "id",
"name": "id2"
}
],
"autoIncrement": {
"column": "id",
"sequence": "sbtest2_seq"
}
}
}
}
If this is not what you expected, check the input data (as JSON parsing will skip unexpected fields).
Once this is done and when the schema change has completed, we move on to the next step.
Spinning up the shard vttablets
At this point what we have to do is to spin up new vttablets that we will use to accommodate our sharded data. We will use 106_sharding_newsbtest.yaml file for that. In that file we have added a second set of vttablets for keyspace “newsbtest”.
The most important bit is here:
- equal:
parts: 3
What it means is that we want to create three shards, all covering equal ranges. Basically, we are going to split the “newsbtest” keyspace into three parts.
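The shard names we will see in a moment (-55, 55-aa, aa-) fall straight out of splitting the keyspace ID range into equal parts. A hedged sketch of the arithmetic in Python (using single-byte precision for readability; the operator may compute boundaries at finer granularity):

```python
def equal_shard_ranges(parts: int) -> list:
    """Split the 256-value first-byte space of keyspace IDs into `parts` equal ranges.

    An open edge of the keyspace is written as an empty string, so the
    first shard starts with '-' and the last one ends with '-'.
    """
    boundaries = [format(i * 256 // parts, "02x") for i in range(1, parts)]
    edges = [""] + boundaries + [""]
    return [f"{edges[i]}-{edges[i + 1]}" for i in range(parts)]


print(equal_shard_ranges(3))  # ['-55', '55-aa', 'aa-']
print(equal_shard_ranges(2))  # ['-80', '80-']
```

256 / 3 rounds down to 0x55 and 2 * 256 / 3 to 0xaa, which is exactly where the three shards meet.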
root@k8smaster:~/vitesstests# kubectl get pods,pv
NAME                                                 READY   STATUS     RESTARTS   AGE
pod/example-etcd-faf13de3-1                          1/1     Running    0          4d6h
pod/example-etcd-faf13de3-2                          1/1     Running    0          4d6h
pod/example-etcd-faf13de3-3                          1/1     Running    0          4d6h
pod/example-vttablet-zone1-0349226440-35dab1bc       0/3     Init:0/2   0          9s
pod/example-vttablet-zone1-1463074389-a4c6b61f       0/3     Init:0/2   0          9s
pod/example-vttablet-zone1-1504968304-96f9a1bf       0/3     Pending    0          9s
pod/example-vttablet-zone1-1676955594-dc39347b       0/3     Init:0/2   0          9s
pod/example-vttablet-zone1-1817996704-e4f712d6       3/3     Running    0          3d7h
pod/example-vttablet-zone1-2179083526-f3060bc1       3/3     Running    1          3d12h
pod/example-vttablet-zone1-2344898534-e9abaf0e       3/3     Running    1          4d6h
pod/example-vttablet-zone1-2646235096-9ba85582       3/3     Running    1          4d6h
pod/example-vttablet-zone1-3154448388-73c3b20a       3/3     Running    0          3d7h
pod/example-vttablet-zone1-3262256522-ea0a10a7       0/3     Init:0/2   0          9s
pod/example-vttablet-zone1-4162850680-b78f527c       0/3     Init:0/2   0          9s
pod/example-zone1-vtctld-1d4dcad0-64668cccc8-swmj4   1/1     Running    1          4d6h
pod/example-zone1-vtgate-bc6cde92-8665cd4df-kwgcn    1/1     Running    1          4d6h
pod/vitess-operator-f44545df8-l5kk9                  1/1     Running    0          4d8h
Once everything is up and running, we should see new vttablets listed:
root@k8smaster:~/vitesstests# vtctlclient listalltablets
zone1-0349226440 newsbtest 55-aa primary 10.244.2.3:15000 10.244.2.3:3306 [] 2021-09-25T17:20:38Z
zone1-1463074389 newsbtest aa- primary 10.244.3.35:15000 10.244.3.35:3306 [] 2021-09-25T08:01:56Z
zone1-1504968304 newsbtest -55 primary 10.244.2.254:15000 10.244.2.254:3306 [] 2021-09-25T08:01:41Z
zone1-1676955594 newsbtest -55 replica 10.244.1.49:15000 10.244.1.49:3306 [] <null>
zone1-1817996704 newsbtest - primary 10.244.2.241:15000 10.244.2.241:3306 [] 2021-09-24T08:51:32Z
zone1-2179083526 sbtest - replica 10.244.1.16:15000 10.244.1.16:3306 [] <null>
zone1-2344898534 sbtest - replica 10.244.3.8:15000 10.244.3.8:3306 [] <null>
zone1-2646235096 sbtest - primary 10.244.2.8:15000 10.244.2.8:3306 [] 2021-09-16T13:44:09Z
zone1-3154448388 newsbtest - replica 10.244.3.34:15000 10.244.3.34:3306 [] <null>
zone1-3262256522 newsbtest aa- replica 10.244.1.50:15000 10.244.1.50:3306 [] <null>
zone1-4162850680 newsbtest 55-aa replica 10.244.1.48:15000 10.244.1.48:3306 [] <null>
As you can see, for the “newsbtest” keyspace we have two sets of vttablets. One set, a primary and a replica, covers the range “-”; this is our current non-sharded setup. Then we have another set of vttablets covering three ranges (-55, 55-aa, aa-), each range with two tablets, a primary and a replica. This is our new setup, the one we want to migrate to.
Migrating to new shard setup
At this moment we have everything ready and we can proceed with the process of resharding our “newsbtest” keyspace. The process is quite simple and we can drive it using vtctlclient. Let’s start the job:
root@k8smaster:~/vitesstests# vtctlclient Reshard -source_shards '-' -target_shards '-55,55-aa,aa-' Create newsbtest.reshard
Reshard Error: rpc error: code = Unknown desc = Create ReshardWorkflow failed: ValidateVSchema(newsbtest, [-], [], true) failed: newsbtest/- has tables that are not in the vschema: [_sbtest2_old _vt_EVAC_cb2c7f5418d111ec81951601acba28b7_20210923073112]
E0920 21:07:33.658042 2728183 main.go:76] remote error: rpc error: code = Unknown desc = Create ReshardWorkflow failed: ValidateVSchema(newsbtest, [-], [], true) failed: newsbtest/- has tables that are not in the vschema: [_sbtest2_old _vt_EVAC_cb2c7f5418d111ec81951601acba28b7_20210923073112]
Ok, it failed spectacularly. What can we make out of this error? It seems like Vitess found tables located in our keyspace that are not included in the vschema definition and, as such, can’t be sharded. Based on the names it seems like we have some artefacts left by schema changes. Let’s drop those tables:
root@k8smaster:~# mysql newsbtest/-
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 34
Server version: 5.7.9-vitess-12.0.0-SNAPSHOT Version: 12.0.0-SNAPSHOT (Git revision eb17117ffd branch 'main') built on Thu Sep 16 07:29:11 UTC 2021 by vitess@buildkitsandbox using go1.17 linux/amd64

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show tables;
+----------------------------------------------------------+
| Tables_in_newsbtest                                      |
+----------------------------------------------------------+
| _sbtest2_old                                             |
| _vt_EVAC_cb2c7f5418d111ec81951601acba28b7_20210923073112 |
| sbtest1                                                  |
| sbtest2                                                  |
+----------------------------------------------------------+
4 rows in set (0.01 sec)

mysql> DROP TABLE _sbtest2_old;
Query OK, 0 rows affected (0.22 sec)

mysql> DROP TABLE _vt_EVAC_cb2c7f5418d111ec81951601acba28b7_20210923073112;
Query OK, 0 rows affected (0.28 sec)
Now we can retry the resharding:
root@k8smaster:~/vitesstests# vtctlclient Reshard -source_shards '-' -target_shards '-55,55-aa,aa-' Create newsbtest.reshard
Waiting for workflow to start: 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ...
E0920 21:10:59.337054 2729414 main.go:67] E0920 21:10:59.075055 vtctl.go:2260] workflow did not start within 30s
Reshard Error: rpc error: code = Unknown desc = workflow did not start within 30s
E0920 21:10:59.337192 2729414 main.go:76] remote error: rpc error: code = Unknown desc = workflow did not start within 30s
Ok, it still doesn’t look good: the workflow hasn’t started within 30 seconds. Keep in mind, though, that we are on a “poor man’s” K8s cluster, so let’s check whether it actually started at all:
root@k8smaster:~/vitesstests# vtctlclient Reshard Progress newsbtest.reshard
Copy Progress (approx):
sbtest1: rows copied 911300/4802916 (18%), size copied 208470016/1155530752 (18%)
sbtest2: rows copied 0/4929680 (0%), size copied 49152/1155530752 (0%)

Following vreplication streams are running for workflow newsbtest.reshard:
id=1 on -55/zone1-1504968304: Status: Copying. VStream Lag: 0s.
id=1 on aa-/zone1-1463074389: Status: Copying. VStream Lag: 0s.
id=1 on 55-aa/zone1-0349226440: Status: Copying. VStream Lag: 0s.
It actually started and is making progress on the first table. Now we wait. The next step, once the new shards are populated, will be to cut over the traffic to the new set of shards.
root@k8smaster:~/vitesstests# vtctlclient Reshard Progress newsbtest.reshard
Copy Completed.

Following vreplication streams are running for workflow newsbtest.reshard:
id=1 on aa-/zone1-1463074389: Status: Running. VStream Lag: -1s. Tx time: Tue Sep 21 01:06:30 2021.
id=1 on -55/zone1-1504968304: Status: Running. VStream Lag: -1s. Tx time: Tue Sep 21 01:06:30 2021.
id=1 on 55-aa/zone1-0349226440: Status: Running. VStream Lag: -1s. Tx time: Tue Sep 21 01:06:30 2021.
As we can see, the copy is now completed. We can cut over the writes:
root@k8smaster:~/vitesstests# vtctlclient Reshard SwitchTraffic newsbtest.reshard
. . .
SwitchTraffic was successful for workflow newsbtest.reshard
Start State: Reads Not Switched. Writes Not Switched
Current State: All Reads Switched. Writes Switched
The job generates long output, but what we are looking for are the last lines, which indicate that the traffic has been switched to the new shards.
Once everything is ready, we can remove the previous shards, but before we attempt that, we should mark the old vttablets for the drain process. This process, managed by the Vitess operator, is intended to clean up the vttablets, demote primary vttablets and so on. The process itself is fairly simple: we have to annotate the old pods before we can clean up the workflow. Here’s a command that does it:
root@k8smaster:~/vitesstests# for alias in $(vtctlclient listalltablets | grep "newsbtest - " | awk '{print $1}') ; do echo ${alias} ; pod=$(kubectl get pods | grep ${alias} | awk '{print $1}') ; kubectl annotate pod ${pod} drain.planetscale.com/started="Draining after reshard" ; done
zone1-1817996704
pod/example-vttablet-zone1-1817996704-e4f712d6 annotated
zone1-3154448388
pod/example-vttablet-zone1-3154448388-73c3b20a annotated
We can then clean up the workflow:
root@k8smaster:~/vitesstests# vtctlclient Reshard Complete newsbtest.reshard
I0926 22:14:31.086312 2254761 main.go:67] I0926 22:14:30.824644 traffic_switcher.go:1509] Deleting shard newsbtest.-
I0926 22:14:31.096583 2254761 main.go:67] I0926 22:14:30.835135 shard.go:197] Deleting all tablets in shard newsbtest/- cell zone1
I0926 22:14:31.096612 2254761 main.go:67] I0926 22:14:30.835167 shard.go:201] Deleting tablet zone1-3154448388
I0926 22:14:31.151426 2254761 main.go:67] I0926 22:14:30.889905 shard.go:201] Deleting tablet zone1-1817996704
I0926 22:14:31.183772 2254761 main.go:67] I0926 22:14:30.922262 traffic_switcher.go:1515] Deleted shard newsbtest.-
I0926 22:14:31.183994 2254761 main.go:67] I0926 22:14:30.922308 traffic_switcher.go:1546] Deleting reverse streams for workflow reshard db_name vt_newsbtest
I0926 22:14:31.306691 2254761 main.go:67] I0926 22:14:31.044941 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
I0926 22:14:31.306726 2254761 main.go:67] I0926 22:14:31.044961 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
I0926 22:14:31.306739 2254761 main.go:67] I0926 22:14:31.044985 traffic_switcher.go:1537] Deleting target streams for workflow reshard db_name vt_newsbtest
Complete was successful for workflow newsbtest.reshard
Start State: All Reads Switched. Writes Switched
Current State: Workflow Not Found
It looks like everything worked just fine. We can double-check that the vttablets have indeed been removed:
root@k8smaster:~/vitesstests# vtctlclient listalltablets
zone1-0349226440 newsbtest 55-aa primary 10.244.2.3:15000 10.244.2.3:3306 [] 2021-09-25T17:20:38Z
zone1-1463074389 newsbtest aa- primary 10.244.3.35:15000 10.244.3.35:3306 [] 2021-09-25T08:01:56Z
zone1-1504968304 newsbtest -55 primary 10.244.2.254:15000 10.244.2.254:3306 [] 2021-09-25T08:01:41Z
zone1-1676955594 newsbtest -55 replica 10.244.1.49:15000 10.244.1.49:3306 [] <null>
zone1-2179083526 sbtest - replica 10.244.1.16:15000 10.244.1.16:3306 [] <null>
zone1-2344898534 sbtest - replica 10.244.3.8:15000 10.244.3.8:3306 [] <null>
zone1-2646235096 sbtest - primary 10.244.2.8:15000 10.244.2.8:3306 [] 2021-09-16T13:44:09Z
zone1-3262256522 newsbtest aa- replica 10.244.1.50:15000 10.244.1.50:3306 [] <null>
zone1-4162850680 newsbtest 55-aa replica 10.244.1.48:15000 10.244.1.48:3306 [] <null>
Yes, all is good. We can now double-check that the drain process has completed. You should see annotations similar to these when you run kubectl describe pod against the pods that were used by the old shard:
Annotations: drain.planetscale.com/acknowledged: 2021-09-26 22:11:44.125940741 +0000 UTC
drain.planetscale.com/finished: 2021-09-26 22:11:51.019486217 +0000 UTC
drain.planetscale.com/started: Draining after reshard
drain.planetscale.com/supported: ensure that the tablet is not a master
Now we can delete the non-used pods by applying 107_delete_old_newsbtest_vttablets.yaml file:
root@k8smaster:~/vitesstests# kubectl apply -f 107_delete_old_newsbtest_vttablets.yaml
vitesscluster.planetscale.com/example configured
secret/example-cluster-config configured
If everything goes as planned, you should see two pods terminating. If something is not right, typically bouncing the vttablet containers using docker stop is enough; the operator will restart the vttablet and it should pick up the termination process.
In the next blog we are going to change the sharding pattern to deal with a hot shard.