Designing optimal primary keys YCQL

The Primary key is a column or a set of columns that uniquely identifies a row, such as a user ID or order number. You should choose the primary key based on the most common access pattern. Columns of data type string, number, serial, or UUID make good choices for primary keys.

Automatically generate the primary key

The best way to uniquely identify a record is to allow the database to assign a unique identifier to the row. YugabyteDB supports multiple schemes for generating identifiers that you can choose based on the needs of your application.

UUID

A UUID is a 128-bit number represented as a string of 36 characters, including hyphens. For example, 4b6aa2ff-53e6-44f5-8bd0-ef9de90a8095. YugabyteDB natively supports UUID generation as per RFC 4122 via the uuid-ossp extension. UUIDs have several advantages:

The likelihood of generating duplicate UUIDs is extremely low.
UUIDs can be independently generated on different nodes in the cluster without any coordination with other systems.
The randomness of UUIDs makes it hard to predict the next ID, providing an additional layer of security.

You can add a UUID to your schema as follows:

create keyspace if not exists yugabyte;
use yugabyte;

drop table if exists users;
create table users (
    id uuid,
    name text,
    primary key (id)
);

Insert some rows into the table:

insert into users (id, name) values (uuid(), 'John Wick');
insert into users (id, name) values (uuid(), 'Iron Man');
insert into users (id, name) values (uuid(), 'Harry Potter');
insert into users (id, name) values (uuid(), 'Kack Sparrow');

Select all the rows from the table:

select * from users;

Notice how the generated IDs are totally random.

 id                                   | name
--------------------------------------+--------------
 85a17586-317f-4ef1-b5dd-582a13ccc832 | Harry Potter
 b431bb80-b20d-42fe-900d-fed295de507a | Kack Sparrow
 7abae478-532c-40aa-9f81-42e85750fe01 |    John Wick
 2a151214-272d-4448-af3e-a343f434fa68 |     Iron Man

TimeUUID

TimeUUID is a special type of UUID that has a time factor integrated into it so the generated UUIDs have an order associated with them. They can be generated using the now() function. To do this, first create a table as follows:

create keyspace if not exists yugabyte;
use yugabyte;

drop table if exists users;
create table users (
    id timeuuid,
    name text,
    primary key(id)
);

Insert some rows into the table:

insert into users (id, name) values (now(), 'John Wick');
insert into users (id, name) values (now(), 'Iron Man');
insert into users (id, name) values (now(), 'Harry Potter');
insert into users (id, name) values (now(), 'Kack Sparrow');

Select all the rows from the table:

select * from users;

The generated ids are not very random:

 id                                   | name
--------------------------------------+--------------
 19f24446-2904-11ef-917b-6bf61abbc06e |     Iron Man
 13dc1460-2904-11ef-917b-6bf61abbc06e |    John Wick
 1a3f8ac6-2904-11ef-917b-6bf61abbc06e | Kack Sparrow
 19f29720-2904-11ef-917b-6bf61abbc06e | Harry Potter

Existing columns as primary keys

To illustrate how to choose existing columns as primary keys, first create a sample census schema.

Local
YugabyteDB Anywhere

Set up a local cluster

If a local universe is currently running, first destroy it.

Start a local three-node universe with an RF of 3 by first creating a single node, as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.1 \
                --base_dir=${HOME}/var/node1 \
                --cloud_location=aws.us-east-2.us-east-2a

On macOS, the additional nodes need loopback addresses configured, as follows:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Next, join more nodes with the previous node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.

Start the second node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.2 \
                --base_dir=${HOME}/var/node2 \
                --cloud_location=aws.us-east-2.us-east-2b \
                --join=127.0.0.1

Start the third node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.3 \
                --base_dir=${HOME}/var/node3 \
                --cloud_location=aws.us-east-2.us-east-2c \
                --join=127.0.0.1

After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:

./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone

This command can be executed on any node where you already started YugabyteDB.

To check the status of a running multi-node universe, run the following command:

./bin/yugabyted status --base_dir=${HOME}/var/node1

Setup

To set up a universe, refer to Set up a YugabyteDB Anywhere universe.

Create a census table as follows:

CREATE TABLE census(
   id int,
   name varchar,
   age int,
   zipcode int,
   employed boolean,
   PRIMARY KEY(id)
);

Add some data to the table as follows.

INSERT INTO census (id,name,age,zipcode,employed) VALUES (1,'Zachary',55,94085,True);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (2,'James',56,94085,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (3,'Kimberly',50,94084,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (4,'Edward',56,94085,True);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (5,'Barry',56,94084,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (6,'Tyler',45,94084,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (7,'James',47,94085,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (8,'Sarah',52,94084,True);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (9,'James',59,94084,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (10,'Diane',51,94083,False);

ID as the primary key

In the census table, the most likely way to look up a person is by their id, so the primary key has been set to id. This means that the data is distributed based on ID. This works well for point lookups on ID. For example, to look up ID 9, you can do the following:

select * from census where id=9;

You will see output similar to the following:

 id | name  | age | zipcode | employed
----+-------+-----+---------+----------
  9 | Nancy |  59 |   94084 |    False

One row matching ID 9 was quickly fetched with just one request.

Name as the primary key

Suppose your most common lookup is based on the name. In this case you would make the name column part of the primary key. Because the name alone may not be unique enough to be the primary key (the primary key has to be unique), you can choose a primary key with both name and ID as follows:

CREATE TABLE census2(
   id int,
   name varchar,
   age int,
   zipcode int,
   employed boolean,
   PRIMARY KEY(name, id)
) WITH CLUSTERING ORDER BY id ASC;

INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (1,'Zachary',55,94085,True);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (2,'James',56,94085,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (3,'Kimberly',50,94084,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (4,'Edward',56,94085,True);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (5,'Barry',56,94084,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (6,'Tyler',45,94084,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (7,'James',47,94085,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (8,'Sarah',52,94084,True);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (9,'James',59,94084,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (10,'Diane',51,94083,False);

When specifying the primary key, the name column is specified first, and id second. This ensures that the data is stored sorted based on name first, and for all matching names, the id is stored sorted in ascending order, ensuring all people with the same name will be stored in the same tablet. This allows you to do a fast lookup on name even though (name, id) is the primary key.

Retrieve all the people with the name James as follows:

select * from census2 where name = 'James';

You will see output similar to the following:

 name  | id | age | zipcode | employed
-------+----+-----+---------+----------
 James |  2 |  56 |   94085 |    False
 James |  7 |  47 |   94085 |    False
 James |  9 |  59 |   94084 |    False
(5 rows)

There are 3 people named James, and all of them can be quickly looked up as the data has been distributed by name.

Notice that the rows are ordered by id. This is because you specified CLUSTERING ORDER BY id ASC to ensure that the rows with the same name will be stored ordered in the order of the id column.