Designing optimal primary keys YCQL
The Primary key is a column or a set of columns that uniquely identifies a row, such as a user ID or order number. You should choose the primary key based on the most common access pattern. Columns of data type string, number, serial, or UUID make good choices for primary keys.
Automatically generate the primary key
The best way to uniquely identify a record is to allow the database to assign a unique identifier to the row. YugabyteDB supports multiple schemes for generating identifiers that you can choose based on the needs of your application.
UUID
A UUID is a 128-bit number represented as a string of 36 characters, including hyphens. For example, 4b6aa2ff-53e6-44f5-8bd0-ef9de90a8095
. YugabyteDB natively supports UUID generation as per RFC 4122 via the uuid-ossp extension. UUIDs have several advantages:
- The likelihood of generating duplicate UUIDs is extremely low.
- UUIDs can be independently generated on different nodes in the cluster without any coordination with other systems.
- The randomness of UUIDs makes it hard to predict the next ID, providing an additional layer of security.
You can add a UUID to your schema as follows:
create keyspace if not exists yugabyte;
use yugabyte;
drop table if exists users;
create table users (
id uuid,
name text,
primary key (id)
);
Insert some rows into the table:
insert into users (id, name) values (uuid(), 'John Wick');
insert into users (id, name) values (uuid(), 'Iron Man');
insert into users (id, name) values (uuid(), 'Harry Potter');
insert into users (id, name) values (uuid(), 'Kack Sparrow');
Select all the rows from the table:
select * from users;
Notice how the generated IDs are totally random.
id | name
--------------------------------------+--------------
85a17586-317f-4ef1-b5dd-582a13ccc832 | Harry Potter
b431bb80-b20d-42fe-900d-fed295de507a | Kack Sparrow
7abae478-532c-40aa-9f81-42e85750fe01 | John Wick
2a151214-272d-4448-af3e-a343f434fa68 | Iron Man
TimeUUID
TimeUUID is a special type of UUID that has a time factor integrated into it so the generated UUIDs have an order associated with them. They can be generated using the now()
function. To do this, first create a table as follows:
create keyspace if not exists yugabyte;
use yugabyte;
drop table if exists users;
create table users (
id timeuuid,
name text,
primary key(id)
);
Insert some rows into the table:
insert into users (id, name) values (now(), 'John Wick');
insert into users (id, name) values (now(), 'Iron Man');
insert into users (id, name) values (now(), 'Harry Potter');
insert into users (id, name) values (now(), 'Kack Sparrow');
Select all the rows from the table:
select * from users;
The generated ids are not very random:
id | name
--------------------------------------+--------------
19f24446-2904-11ef-917b-6bf61abbc06e | Iron Man
13dc1460-2904-11ef-917b-6bf61abbc06e | John Wick
1a3f8ac6-2904-11ef-917b-6bf61abbc06e | Kack Sparrow
19f29720-2904-11ef-917b-6bf61abbc06e | Harry Potter
Existing columns as primary keys
To illustrate how to choose existing columns as primary keys, first create a sample census schema.
Set up a local cluster
If a local universe is currently running, first destroy it.
Start a local three-node universe with an RF of 3
by first creating a single node, as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.1 \
--base_dir=${HOME}/var/node1 \
--cloud_location=aws.us-east-2.us-east-2a
On macOS, the additional nodes need loopback addresses configured, as follows:
sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3
Next, join more nodes with the previous node as needed. yugabyted
automatically applies a replication factor of 3
when a third node is added.
Start the second node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.2 \
--base_dir=${HOME}/var/node2 \
--cloud_location=aws.us-east-2.us-east-2b \
--join=127.0.0.1
Start the third node as follows:
./bin/yugabyted start \
--advertise_address=127.0.0.3 \
--base_dir=${HOME}/var/node3 \
--cloud_location=aws.us-east-2.us-east-2c \
--join=127.0.0.1
After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:
./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone
This command can be executed on any node where you already started YugabyteDB.
To check the status of a running multi-node universe, run the following command:
./bin/yugabyted status --base_dir=${HOME}/var/node1
Setup
To set up a universe, refer to Set up a YugabyteDB Anywhere universe.Create a census table as follows:
CREATE TABLE census(
id int,
name varchar,
age int,
zipcode int,
employed boolean,
PRIMARY KEY(id)
);
Add some data to the table as follows.
INSERT INTO census (id,name,age,zipcode,employed) VALUES (1,'Zachary',55,94085,True);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (2,'James',56,94085,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (3,'Kimberly',50,94084,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (4,'Edward',56,94085,True);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (5,'Barry',56,94084,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (6,'Tyler',45,94084,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (7,'James',47,94085,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (8,'Sarah',52,94084,True);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (9,'James',59,94084,False);
INSERT INTO census (id,name,age,zipcode,employed) VALUES (10,'Diane',51,94083,False);
ID as the primary key
In the census
table, the most likely way to look up a person is by their id
, so the primary key has been set to id
. This means that the data is distributed based on ID. This works well for point lookups on ID. For example, to look up ID 9, you can do the following:
select * from census where id=9;
You will see output similar to the following:
id | name | age | zipcode | employed
----+-------+-----+---------+----------
9 | Nancy | 59 | 94084 | False
One row matching ID 9 was quickly fetched with just one request.
Name as the primary key
Suppose your most common lookup is based on the name. In this case you would make the name column part of the primary key. Because the name alone may not be unique enough to be the primary key (the primary key has to be unique), you can choose a primary key with both name and ID as follows:
CREATE TABLE census2(
id int,
name varchar,
age int,
zipcode int,
employed boolean,
PRIMARY KEY(name, id)
) WITH CLUSTERING ORDER BY id ASC;
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (1,'Zachary',55,94085,True);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (2,'James',56,94085,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (3,'Kimberly',50,94084,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (4,'Edward',56,94085,True);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (5,'Barry',56,94084,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (6,'Tyler',45,94084,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (7,'James',47,94085,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (8,'Sarah',52,94084,True);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (9,'James',59,94084,False);
INSERT INTO census2 (id,name,age,zipcode,employed) VALUES (10,'Diane',51,94083,False);
When specifying the primary key, the name
column is specified first, and id
second. This ensures that the data is stored sorted based on name
first, and for all matching names, the id
is stored sorted in ascending order, ensuring all people with the same name will be stored in the same tablet. This allows you to do a fast lookup on name
even though (name, id)
is the primary key.
Retrieve all the people with the name James as follows:
select * from census2 where name = 'James';
You will see output similar to the following:
name | id | age | zipcode | employed
-------+----+-----+---------+----------
James | 2 | 56 | 94085 | False
James | 7 | 47 | 94085 | False
James | 9 | 59 | 94084 | False
(5 rows)
There are 3 people named James, and all of them can be quickly looked up as the data has been distributed by name.
Notice that the rows are ordered by id. This is because you specified CLUSTERING ORDER BY id ASC
to ensure that the rows with the same name will be stored ordered in the order of the id
column.