Ultimate Guide to Identifiers

Published

Welcome to the world of identifiers! Whether you're a developer working on a new project, a database administrator looking to optimize your system, or simply someone curious about the different types of identifiers, this guide has you covered.

In today's digital world, identifiers play a critical role in uniquely identifying and tracking entities such as users, products and transactions. With the explosion of data and the increasing need for efficient and secure identification, choosing the right type of identifier is crucial.

In this guide, we'll dive into the world of identifiers, exploring the different types of identifiers, their unique features, and the best use cases for each. We'll cover globally unique id, numerical and enigmatic ones and provide you with the knowledge and insights you need to make an informed decision when choosing an identifier for your needs.

I won't cover all possible ID formats (possibilities are actually endless) but few, the most interesting ones, were selected.

Table of contents

Requirements

Before we go further we need to clearly establish few requirements regard identifiers. Without it, we might miss the point whenever we chose right algorithm for us.

Identifier has to be immutable

The main feature of an identifier is immutability. This means that no matter how the entity state or data changes, its ID will never change.

Why can't it change? Because dealing with ID mutability can be challenging and, in certain cases, even impossible to handle.

If you update the ID of a user, then all other entities that reference it in the database have to be updated as well. Seems easy, right? But consider this:

  • What if a link to a user (that contains its ID) was sent via email? You cannot change that without the extra effort of handling redirects.
  • What if that ID is stored in the user's browser, like LocalStorage, cookies, or a bookmark? You cannot control that either.
  • What if it's printed?

Generally, dealing with ID mutability is not worth the effort, so a good rule is that the ID should never change.

You might ask, "What about slugs like this-is-amazing-guide-about-identifiers that are pretty good to identify the page?" That's a good question, and the answer is covered in slugs.

Identifier has to uniquely identify record

If you have a task to find entity by ID 10 you do not know what to do. What are you looking for? A user? A document? An article? That crucial information is missing.

This introduces another characteristic of IDs. You need to know which collection of entities a lookup is being performed on.

In the example above, the ID value is not enough for a task to succeed. But if we rephrase it to find User by ID 10, then context is provided. Perform a lookup on the collection of users.

Usually you do not think about it since most endpoints you are operating on are very explicit, like:

  • Rest endpoint GET /user/:id - Obvious
  • userService.findById(id) - Same - Operation on service for users

The situation looks different once the IDs already contain the context information.

For example every user ID is prepended by U-, every document ID by D- so on so forth. Then the task find entity by ID U-10 contains necessary information. The context and the ID.

This idea might sound irrational, but it is not when operating on a complex system, with a rich set of different collections.

For example, some resources created by AWS contain such prefixes:

  • vpc-1a2b3c4d5e6f1a2b3 - VPC
  • i-0a358cab6b00d2846 - EC2 instance
  • vol-003f462982c73f08d - Elastic Block Store Volume

Whenever you see an error from AWS like "Cannot find resource vpc-1a2b3c4d5e6f1a2b3", you immediately know it's about VPC. There is no need to look for the information about which AWS service is misconfigured.

Imagine a case of pasting such an ID in the search bar in your app. Since you know the ID formats for all entities, you can infer that whenever someone asks for the phrase "U-100", they mean a user with ID 100.

Monotonicity

This feature is optional, but it is often helpful.

A monotonic function in mathematics is a function that produces values that are either increasing, decreasing, or remain the same. If the values sometimes increase and sometimes decrease, then it is not a monotonic function.

In the world of identifiers, it means that every consecutive ID generation creates a value that is always greater than its predecessors. The simplest example of this might be an ID generator that produces integers in order: 1, 2, 3, etc. If the value is sometimes greater and sometimes not, then it is not monotonic. An example of this could be generating random 64-bit numbers.

Why does monotonicity matter?

Since a function that generates monotonic values is strictly correlated with the creation date of an entity, it greatly improves performance for cases like cursor pagination or performing a lookup on a table with a sort on the creation date. You no longer need to create a separate index for the creation date since you can just sort by ID, which is already indexed. This also means that if you need to find the oldest entity, you can look for an entity with the lowest ID, and vice versa.

Once all characteristics of IDs are outlined, we can talk about technical details.

Predictability

Does the generated ID need to be unpredictable? Or maybe the opposite is true?

It depends on the requirements and the goal you want to achieve. Let's take a peek at what the industry is doing.

Twitter has quite predictable IDs. You can't tell the next ID from the top of your head, but running a very narrow brute force scan might reveal a few.

Is that a problem for Twitter? Not at all. Even though Twitter has the concept of protected tweets, they are visible only to people who are logged in and those specified by the tweet author. Therefore, even if you try to brute-force the tweet ID, you won't see it until you're supposed to see it, so no harm is done.

Youtube is different. The ID algorithm is designed to be unpredictable. Why is that? Youtube has 3 video visibility settings: private for specific people, unlisted, and public.

Unlisted videos use the same ID format as the public ones, so anyone who guesses the ID is able to see the unlisted video. Youtube uses unpredictable IDs to prevent scanning for such unlisted videos since they are not supposed to be explicitly public. Even if you guess the ID or find it in the logs, browser history, or somewhere else, it is still not a big deal. It is up to the video publisher to care about who can access the video. If you want to make your video really private, then use the private visibility setting and decide who can see it. Again, no harm is done, and it requires the least amount of effort from Youtube's perspective.

To summarize, the decision on whether the ID algorithm should be predictable or not depends heavily on the use case and the data protection measures you need to take. Before making that decision, please make sure you have read the security concerns related to that.

Sequential numerical ID

1, 2, 3, 4 ... are the simplest kind of identifiers you can think of. Every time a new entity is created, it receives the next number from a sequence. Since it is a simple sequence, the generated values are automatically unique.

Sequentially generated numbers are also strictly monotonic, which means the next generated ID is always greater than the previous ID.

Sequentially generated numbers are also short (to a certain point), easy to pronounce, or remember. Let's use them everywhere... right?

Not so fast.

There are some characteristics of sequential IDs that need to be taken into account.

Statefulness

In order to generate next sequence value (for example 101) you need to know previous sequence value (100). This is a state. Dealing with state is a simple task for a single machine or process but much harder once more machines are involved. 1

There are ways to deal with it in standard RDBMS solutions but other databases does not support it all 2 and providing custom solution for them is not trivial.

If your system runs on a single-machine database then you're fine, otherwise consider using one of solutions from multiple factor stateless identifiers.

Revealing size of collection

Since numbers are generated in sequence they also reveal information how many entities are stored within a system.

Imagine you are making an order for a book via website. Shipping details provided, order paid, you receive an e-mail that says Your order number is 71.

That indicates there was only up to 70 orders before. You can take average order amount, multiply it by 70 and you've got an estimation about the revenue. This is not a kind of data that every e-commerce would be happy to reveal.

If you need to hide that information and still use sequential IDs take a look at enigmatic IDs.

Enigmatic IDs

You still want to use sequential IDs but not reveal its value to end user? hashids comes to help you.

Hashids encodes numerical value into a string and allows to decode it back to a number later. Here is a list of encoded values for numbers from 1 to 5 with standard configuration (default alphabet, no minimum length, no salt).

  1. jR
  2. k5
  3. l5
  4. mO
  5. nR

That doesn't seem particularly great but once we increase the minimum length to 8 it looks much more enigmatic.

  1. olejRejN
  2. pmbk5ezJ
  3. pnel5aKB
  4. MvbmOeYA
  5. openRe7A

You can test it out

const Hashids = require('hashids');

const setup = new Hashids(undefined, 8);
for (let i = 1; i <= 5; i++) {
    console.log(i, setup.encode(i));
}

But hold on. If you were able to generate those values and decode them, then everybody else also can do this? There is no hiding any information here? Correct.

But first - you need to know how to do this right? Standard internet user have no idea that MvbmOeYA is a value encoded by hashids. Regardless of this hashids supports custom salt that shuffles values in encoded value making it harder to guess 3.

By using salt great-guide-about-identifiers (min length 8, default alphabet) we get this:

  1. NjEelLle
  2. ZwLgvyQ0
  3. gKdP1yXm
  4. v7ER8yoX
  5. GpdVwyWe

If you try to decode it back, without knowing the salt, then hashids simply fails to decode the value.

Let us try to use hashids to generate order number.

Enigmatic order number

Order number suppose to be pretty short and easy to spell. Without these features it would be pretty hard to provide support via phone call for some order where order number is hard to spell.

In order to achieve it lets setup hashids in following way:

  • Alphabet: 123456789ABCDEFGHJKLMNPQRSTUVWXYZ (note that some characters are purposely removed to prevent confusing them like 0, O, l during spelling)
  • Min length: 8
  • Salt: great-guide-about-identifiers

Results (added space in the middle for readability):

  1. 39NK LAYQ
  2. 6EX7 QA5R
  3. 4QAM RNZP
  4. V9XJ VXPG
  5. VRXR PN3G

That looks awesome!

Should you be worried about the maximum number I can pass to such configured hashids? Rather not for two reasons.

  1. With such long alphabet and min length of 8 you can encode pretty lot of values. Even at 2 000 000 000 hashids still generates 8 characters long IDs.
  2. Note that this is a minimal length, so if encoded value won't fit into 8 characters it will be extended by adding 9-nth character.

Suggestions regard technical implementation

In order to work with hashids effectively you should follow one simple rule:

Do not store encoded values in database and always decode them as soon as possible

hashids is an encoding and decoding tool so you need to have a very good reason to store encoded values in database since you can decode them anytime.

Whenever you're performing a lookup for an order by order number (39NK LAYQ) you just pass it to the backend, backend decodes it to a numerical order id (1) and that is all you need to perform the lookup.

Same for exposing ID to the world - just encode it again.

So simple and elegant!

Multiple factor stateless identifiers

This section describes group of identifiers algorithms that I call multiple factor stateless identifiers (in short MFSI) because they combine more than one factor to generate final value. All of them are stateless or semi-stateless since for example they might need initial state to start with.

The most common factor is obviously timestamp. The timestamp you have on mind is not exactly the same timestamp you might think so keep reading to get to know differences between timestamps used between algorithms.

UUID

UUID is probably the most common ID format and looks like this:

  • 9e8488a0-c8c4-11ed-afa1-0242ac120002
  • f98fa1cf-96c6-48ef-b351-847fd6f92cd3
  • 63016bfe-4655-4b75-a0c8-ca8c955ebe0e
  • 2db3c793-2e84-4ec9-9f50-b93c8a92be07

UUID is 128 bit long, so it takes 16 bytes to store it. As a string its 36 characters long (including hyphens) encoded using hexadecimal values.

It is very complex algorithm, has multiple version and deep nuances that I will not cover in this guide but focus on summarization of possibilities.

General format

Depending on version UUID uses different factors to generate ID but the format is common for all versions. Values are grouped into layout of 8-4-4-4-12 characters

xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx

Value at position A indicates version (1-5). B stores information about variant, but I'm not going to focus on this parameter since it is not necessary for the purpose of this guide.

Version 1

Version 1 uses timestamp, Clock sequence, MAC address or Node ID.

Timestamp is a 60-bit number in precision of 100-nanoseconds since 15 October 1582 UTC.

Clock sequence is a misleading name, mostly it is just random 11-13-bit (depending on variant) long number.

Standard allows to use MAC Address or Node ID. Both of them are 48-bit long.

Example UUID in Version 1.

123e4567-e89b-12d3-a456-426614174000

Version 2

This one is reserved for "DCE security" and in practice the usage if it is very limited. I won't cover the details but it is very similar to version 1.

Version 3 and 5

This is special version od ID that does not work in the same way as others. It does not generate unique values just by calling the function over and over. Instead it takes name (any string) and namespace (another UUID value) and produces the same output.

Example for version 3.

const uuid = require('uuid');

console.log(uuid.v3('someid', 'c0bcd871-13eb-40ae-9499-5e329e400435'));
console.log(uuid.v3('someid', 'c0bcd871-13eb-40ae-9499-5e329e400435'));

Gives following output:

  • f8e5d693-fcc9-3361-bb0d-216488deac5e
  • f8e5d693-fcc9-3361-bb0d-216488deac5e

UUID produces output by hashing input using md5 (for version 3) or sha1 (for version 5).

In order to generate unique values you need to provide unique namespace or name so it is totally your responsibility to provide unique values.

One of the use case for that might be constant, predictable ID generation based on input. Imagine you want to produce cache key for some query.

You can do this:

const SQL_NAMESPACE = 'c0bcd871-13eb-40ae-9499-5e329e400436';
const cacheKey = uuid.v3('query content', SQL_NAMESPACE); // this will always generate the same UUID value
if (cache.has(cacheKey)) {
  return cache.get(cacheKey);
}

You should not use it for hiding information or other security purposes since md5 and sha1 are already broken and should not be used in modern applications.

If you still need generation of some form of hash from input and namespace just use regular hashing function and pick much safer hash functions like sha256.

Version 4

This one just generates completely random values.

  • f40b4ab2-512a-4b53-8883-643e84ec6684
  • d770f9c0-e70b-4442-baaa-c94a549afdd2
  • 9b9c18ef-8a79-49be-a339-2c75330a18ba
  • 81ec1448-a5fb-4e0e-b6c3-6d009c7f7b28
  • c96c3213-5630-4e93-868b-e2b90b0fa18c

That's it. There is no monotonicity consideration here, no coordination of some seed number, nothing.

Since there is no coordination how this version is even able to generate unique values all the time? Due to nature of randomness of big numbers.

122-bits of entropy is such a big number that even if you generate 70 368 744 177 664 (!) UUID's then collision chance is 0,000 000 000 4%. You are 3571 times more likely to be killed by meteorite so we can say it is practically impossible.

This version is also the most commonly used due to its fire-and-forget nature. You simply just care about uniqueness and nothing else.

UUID or GUID?

Famous question. Is it UUID or GUID?

These are slightly different things. At first glance for humans there is no difference but technically speaking there are few subtle ones that sometimes may strike back.

In short every UUID might be (after conversion to upper case) GUID but not every GUID is UUID.

First difference is format.

UUID requires version to be a digit between 1-5 and variant to meet specific criteria.

GUID from the other side does not have such requirement at all and allows any hex character at those positions.

Second difference is characters case. GUID is all uppercase, UUID all lowercase.

// GUID
// just any upper-cased hex characters
E88D580C-176E-8EF6-BF6C-9D7B2B2EE22C

// uuid
// lowercase
// version and variant is important
38746e11-2c8f-4d95-ac51-7e33368c1746

Annoying usability issue

Whenever working with UUID in real life you'll stumble upon very annoying usability issue. You cannot easily select it!

Mouse double click simply is unable to capture it.

uuid-with-mouse-click

You need to either manually select the text or use expand selection keyboard shortcut...

uuid-with-keyboard ... that is different for each IDE so... have fun!

UUID in URL

One of the next values is not a valid UUID. Can you tell which one?

433f0e0c-7ff7-4a29-8329-d20925ff3a2b
7ca15f66‐3ee1‐4825‐a946‐7622c6d523b0
22ffb6d8-a2bc-4620-bf8a-58e949d03b70

7ca15f66‐3ee1‐4825‐a946‐7622c6d523b0 is invalid!

Why? It has the same shape as the others.

The issue is with . is a UTF-8 hyphen with a 8208 code, which is not valid to use in a UUID. - is just an ASCII character (hyphen-minus) with a code of 45, which is valid for UUIDs.

While humans may not notice the difference, machines do, and this can cause issues. Some text editors might automatically replace - with , as they believe that the UTF-8 hyphen is more "accurate" or "better" for some unknown reason.

Imagine someone sending you a link with a UUID from a shady email client, or they might manually type it in using a word-like tool, resulting in a link that no longer works.

Issues like this is are rare but still possible and you might consider dealing with them by normalizing all user input UUIDs before working with them:

function normalize(input: string) {
    return input.replace(/\u2010/g, "-");
}

However, it can be difficult to remember to do this consistently.

My 2 cents about UUID

Personally, I find UUIDs to be overly complex for regular applications. They are also too long in string form, have usability issues, and can potentially cause issues when used in URLs. Additionally, since most people use version 4 UUIDs, they are not even slightly monotonic. Despite their wide adoption and ability to work in distributed systems, I hardly see any benefit in using them.

If you need a "fire-and-forget" algorithm for generating IDs, I would recommend using ULID or ObjectId (more on those later) instead of UUIDs.

ULID

ULID is 128-bit long (same as UUID) but only 26 characters as string and looks like this:

  • 01GXFEZPQ3NXSAHHBS6GTVETY7
  • 01GXFEZPQ3772ZV6WW7Y2F5Z9J
  • 01GXFEZPQ33SCBA3TBZK7YD34K
  • 01GXFEZPQ3V566TS5TPS0Q0WPH
  • 01GXFEZPQ3MZ00MR7A3R8ZERJ7

Format

How is it possible that ULID takes only 26 characters as string while UUID is 36-characters long and both are 128-bit long?

ULID uses different binary-to-text algorithm. Specifically base32 with Crockford's alphabet: 0123456789ABCDEFGHJKMNPQRSTVWXYZ (ILOU are removed to avoid confusion). UUID uses HEX representation of binary data and adds hyphens for readability purposes. Base32 is just more efficient (in terms of output length) algorithm than hex.

ULID uses multiple factors to generate value:

  1. Timestamp
    • 48 bits
    • Big endian
    • Computed for each ID
    • In milliseconds since UNIX epoch
  2. Random number
    • 80 bits
    • Big endian
    • Computed for each ID or incremented previous value

First factor is obvious and does not require explanation but second is special.

ULID and Monotonicity

Depending on the generation method selected the random value generation differs. For regular method the random value is generated each time.

Values generated previously were using regular method so lets sort them first.

Original orderSorted
01GXFEZPQ3NXSAHHBS6GTVETY701GXFEZPQ33SCBA3TBZK7YD34K
01GXFEZPQ3772ZV6WW7Y2F5Z9J01GXFEZPQ3772ZV6WW7Y2F5Z9J
01GXFEZPQ33SCBA3TBZK7YD34K01GXFEZPQ3MZ00MR7A3R8ZERJ7
01GXFEZPQ3V566TS5TPS0Q0WPH01GXFEZPQ3NXSAHHBS6GTVETY7
01GXFEZPQ3MZ00MR7A3R8ZERJ701GXFEZPQ3V566TS5TPS0Q0WPH

For monotonic method the random value is incremented by 1 if previous ID generation occurred at same timestamp, if not - generated again.

Let's try to generate values again using monotonic method and sort it.

GeneratedSorted
01GXFFGFV411RCD664K8ZH99PH01GXFFGFV411RCD664K8ZH99PH
01GXFFGFV5AWK9F51TB24819AP01GXFFGFV5AWK9F51TB24819AP
01GXFFGFV5AWK9F51TB24819AQ01GXFFGFV5AWK9F51TB24819AQ
01GXFFGFV5AWK9F51TB24819AR01GXFFGFV5AWK9F51TB24819AR
01GXFFGFV5AWK9F51TB24819AS01GXFFGFV5AWK9F51TB24819AS

The order before and after sorting is exactly the same.

Before using ULID you need to decide which method to use based on your specific needs.

ObjectId

ObjectId is a 96-bit long format (24 characters as a string) popularized by MongoDB and looks like this:

  • 642dc259c1ad1d4cf7019b75
  • 642dc259c1ad1d4cf7019b76
  • 642dc259c1ad1d4cf7019b77
  • 642dc259c1ad1d4cf7019b78
  • 642dc259c1ad1d4cf7019b79

Format

Similarly to the previous kind of IDs, ObjectId uses multiple factors to generate the value:

  1. Timestamp
  • 32 bits
  • Big endian
  • Computed for each ID
  • In seconds since UNIX epoch
  1. Random number
  • 40 bits
  • Little endian
  • Created once per process
  • Never changes
  1. Counter
  • 24 bits
  • Big endian
  • Initialized with random once per process
  • Increments by 1 for each ID

The random number generated once per process has 40 bits, which means there are 1,099,511,627,776 possible values to be selected. The chance of generating the same random number for two different processes is pretty low alone, but the timestamp and counter are also taken into account. Therefore, the chance of generating the same ID by two different processes is negligible.

Not only for MongoDB

Just because it is used mainly in MongoDB that does not mean you cannot use it somewhere else. Just take bson package (or similar one) and use ObjectId.

Is this a good idea in general?

Yes, since ObjectId is shorter than the most popular ID format UUID and does not share other UUID downsides.

Snowflake

Snowflake is one of the most interesting algorithms in my opinion. It is very simple, elegant and highly customizable.

Looks like this:

  • 1602955782399709185
  • 1587411966733303808
  • 1299530165463199747
  • 1357241340313141249
  • 1649052609590992901

It is 64-bit long and in string form it takes only 19 characters but instead of standard hex string representation uses decimal form 4. Not all languages support 64-bit unsigned integers so despite the fact that generated id is a number it is used in transport as string.

It was initially designed by Twitter but other companies implemented the same idea in a little bit different way.

Factors of snowflake id:

  1. Timestamp
    • 42 bit
    • In milliseconds since custom epoch
  2. Datacenter ID
    • 5 bit
  3. Worker ID
    • 5 bit
    • Assigned at the beginning of process
  4. Sequence number
    • 12 bit
    • starts from 0 for each generation millisecond

Custom timestamp

Timestamp is very interesting since it is not the most common timestamp since UNIX epoch (1970-01-01 00:00:00) but any date in the past that is suitable for you.

Why?

To extend possible space of unique IDs generated.

Think about it.

If you're creating a service in 2022 and use timestamp based algorithm for ID generation then you're never going to have ID generated with timestamp for a year before 2022. That means 2022 - 1970 = 57 years worth of milliseconds for ID generation.

42 bits of timestamp in milliseconds is equal roughly of 139 years for ID space. Without custom epoch the possible space of ID shrinks and ends at 2109. Twitter has custom epoch set at year 2010 (2010-11-04T01:42:54.657Z to be precise) that means extra 40 years of ID generation, so they can postpone the task "need new ID solution" till year 2149.

Initial coordination needed

Worker ID is only 5 bit which means it can encode only 32 different values. Attempt to randomly pick one during process startup has pretty high chance to pick the same machine id as other processes causing generation of duplicated ids.

To overcome this issue Twitter uses zookeeper to pick unique worker ID. That means Snowflake ID generation algorithm is not fully stateless but rather semi-stateless since require initial coordination for unique value at startup.

Variations of Snowflake ID

Other services implemented slightly different version of Snowflake.

Discord 5 uses custom epoch at start of 2015 (1420070400000), datacenter ID is internal worker ID and worker ID is internal process ID. The rest remains the same.

Instagram 6 uses 41 bits for timestamp since August 2011 (1314220021721), 13 bits for shard ID (computed depending on domain, for example by user ID) and rest remains the same.

Tips for your own custom solution.

You should not hesitate to have your own solution of Snowflake.

You don't have multiple datacenters? That is OK. Just ignore that param and use only worker ID.

Need initial coordination for worker id? Consider using database with strong consistency capabilities like AWS Dynamo DB that will store all worker id per instance ID or IP.

Custom epoch? Not a problem.

Generally speaking Snowflake is great. Easy to store as unsigned integer, small (only 64 bit long) and readable (only numbers in string form).

Binary to base64?

Formats above uses different binary-to-text algorithms to represent ID for humans: Base32, decimal, hex. How about different one like... the old base64?

Base64 is a binary-to-text encoding algorithm that also plays well for IDs. Example representation of snowflake id 1623404998397763585 in base64 is Fod+PrAXoAE= which is decent. Unfortunately value represented that way contains + and = characters that has to be url encoded before placing it in URL.

After url encoding looks like this: Fod%2BPrAXoAE%3D. Terrible.

Gladly there is a better way.

Base64 is just encoding/decoding algorithm that uses standardized alphabet for representing combination of 6 bits. You can use any alphabet you want and remove problematic characters. Fortunately there is no need to reinvent the wheel. There is already the alphabet suitable for URL usage and it is called base64url.

Encoding the value 1623404998397763585 with base64url gives this: Fod-PrAXoAE. Much better! Only 11 characters long and there is no need to encode special characters!

Youtube uses such encoding. Video ID is 11 characters long, 64-bit, and have exactly the same alphabet as defined in base64url.

Timestamp based algorithms, concurrency and monotonicity

All timestamp-based algorithms that aim to be monotonic are not truly monotonic. They might behave like this in the scope of a single process, but once multiple processes are involved, that changes.

Let's take a look at ObjectId first. It seems to be strictly monotonic since in the example, every generated ID is greater than its predecessors.

  • 642dc259c1ad1d4cf7019b75
  • 642dc259c1ad1d4cf7019b76
  • 642dc259c1ad1d4cf7019b77
  • 642dc259c1ad1d4cf7019b78
  • 642dc259c1ad1d4cf7019b79

That is because it was run on a single process. What if multiple processes are trying to generate IDs at the same time?

I've run three processes simultaneously that generate five ObjectIds each:

Process 1Process 2Process 3
642f25b06c8d8ff7e3f5653c642f25b04651da01988f75d8642f25b0075c5f6e66d8fcff
642f25b06c8d8ff7e3f5653d642f25b04651da01988f75d9642f25b0075c5f6e66d8fd00
642f25b06c8d8ff7e3f5653e642f25b04651da01988f75da642f25b0075c5f6e66d8fd01
642f25b06c8d8ff7e3f5653f642f25b04651da01988f75db642f25b0075c5f6e66d8fd02
642f25b06c8d8ff7e3f56540642f25b04651da01988f75dc642f25b0075c5f6e66d8fd03

Which value from first row was generated first? You can't tell with 100% certainty.

Moreover, the first ID for Process 1: 642f25b06c8d8ff7e3f5653c is greater than the last ID from Process 3: 642f25b0075c5f6e66d8fd03, but that does not mean that it was generated later (which is rather unlikely).

Even though the timestamps (642f25b) are the same, the random numbers are not. Therefore they're really not monotonic.

Does it really matter at the end of the day?

Now let's do the following:

  • Generate 5 ObjectIds along with the timestamp (in ms) of generation.
  • Introduce random delay (5-20 seconds) between ID generations
  • Run 3 simultaneous processes.
  • Sort them separately by ID and timestamp
Sorted by IDSorted by timestamp
642f2d2e65cdc59cd421fb90 1680813358574 pid: 2642f2d2e65cdc59cd421fb90 1680813358574 pid: 2
642f2d30dfc8f1788e4c2847 1680813360943 pid: 3642f2d30dfc8f1788e4c2847 1680813360943 pid: 3
642f2d345f59e92940c7523f 1680813364268 pid: 1642f2d345f59e92940c7523f 1680813364268 pid: 1
642f2d3c65cdc59cd421fb91 1680813372819 pid: 2642f2d3c65cdc59cd421fb91 1680813372819 pid: 2
642f2d445f59e92940c75240 1680813380466 pid: 1642f2d445f59e92940c75240 1680813380466 pid: 1
642f2d46dfc8f1788e4c2848 1680813382042 pid: 3642f2d46dfc8f1788e4c2848 1680813382042 pid: 3
642f2d5265cdc59cd421fb92 1680813394265 pid: 2642f2d5265cdc59cd421fb92 1680813394265 pid: 2
642f2d54dfc8f1788e4c2849 1680813396781 pid: 3642f2d54dfc8f1788e4c2849 1680813396781 pid: 3
642f2d5a5f59e92940c75241 1680813402493 pid: 1642f2d5a5f59e92940c75241 1680813402493 pid: 1
642f2d5f65cdc59cd421fb93 1680813407913 pid: 2642f2d5f65cdc59cd421fb93 1680813407913 pid: 2
642f2d6565cdc59cd421fb94 1680813413525 pid: 2642f2d6565cdc59cd421fb94 1680813413525 pid: 2
642f2d69dfc8f1788e4c284a 1680813417195 pid: 3642f2d69dfc8f1788e4c284a 1680813417195 pid: 3
642f2d6d5f59e92940c75242 1680813421841 pid: 1642f2d6d5f59e92940c75242 1680813421841 pid: 1
642f2d7e5f59e92940c75243 1680813438416 pid: 1642f2d7e5f59e92940c75243 1680813438416 pid: 1
642f2d81dfc8f1788e4c284b 1680813441135 pid: 3642f2d81dfc8f1788e4c284b 1680813441135 pid: 3

The results are exactly the same.

We can conclude the following:

  • ObjectId generated at different time, even on different machines, are strictly monotonic
  • ObjectId generated on one machine are guaranteed to be strictly monotonic
  • ObjectId generated on different machines at the same time are NOT monotonic

The same holds for all timestamp-based, supposed monotonic algorithms.

Timestamp precision and monotonicity

Depending on the timestamp precision, the window during which generated IDs are not monotonic differs. "Generated at the same time" means generated at the same timestamp.

ObjectId uses a timestamp precision of a second. That means if two or more processes are trying to generate the ID in a window of one second, they're not monotonic.

ULID uses a precision of a millisecond. Then, if two or more processes are trying to generate the ID in a window of a millisecond, they're not monotonic. However, the chance of having multiple processes generating IDs at the same milliseconds is 1000 times lower than in a precision of a second.

That being said, you need to estimate how often ID for entities have to be generated. For regular, simple applications, ObjectId seems to be fine. For data-intensive applications, ULID or even Snowflake makes more sense.

Slugs

Slug is mostly a document title, normalized in such a way that is suitable to put them in the URL.

Obviously almost anything can be put into URL but slug is using characters that does not need to be encoded for URL purposes.

For example document title in polish zażółć gęślą jaźń in URL would look like this:

https://example.com/za%C5%BC%C3%B3%C5%82%C4%87%20g%C4%99%C5%9Bl%C4%85%20ja%C5%BA%C5%84

At first glance you would hesitate to open such URL since it looks like a hack attempt.

Slugified version would look like this:

https://example.com/zazolc-gesla-jazn

More trustworthy isn't?

How does it work?

I highly recommend to look at docs of great library for generating slugs - slugify.

Slugify performs following operations on a title:

  • Trims leading and trailing whitespace characters
  • Replaces spaces with -
  • Transliterates special characters to ascii equivalent 'ł -> l'
  • Transliterates symbols to text version ♥ -> love
  • Converts to lowercase

Note that reverse operation is not possible. That means you cannot convert slug back to original title since during conversion certain details and irreversibly removed.

Lookup by slug

In order to perform a lookup by slug you need to have it stored in database alongside the entity. At this point task is simple, just look for an entity with given slug. The problem is... uniqueness.

Slugs that works with this lookup algorithm have to be unique in a certain scope (based on requirements like tenancy, domain or entire system). Without it you don't know which entity should be returned.

Maintaining uniqueness might be easy task on single machine database, but again is tougher for distributed systems 1. It is even harder to handle for automatic redirects created whenever slug gets updated.

Slug redirects

Slugs are usually subject to change. Once document title changes significantly then slug should also be updated. How about all the links you've posted to social media that contains old slug? Setting up redirect manually is trivial task for small scale applications but significantly harder for bigger ones.

Let's imagine you need to create automatic redirects for an entity that has slug changed.

  1. Collection has one entity (ID 1) with slug: great-day
  2. The entity's slug gets updated to great-day2
  3. In order to retain old links you need to set up a redirect. Let's store them in separate collection redirects and tell that great-day has to be redirected to great-day2 for entity with ID 1
  4. Try to add another entity with slug great-day - This should fail.

Why? It is unique since there are no other entities with slug great-day anymore right? Well, it is unique in entities collection but there is a redirect of great-day to great-day2 that will prevent opening url http://example.com/great-day either way!

But that is not the end of the story.

  1. Update entity (ID 1) with slug great-day-today
  2. Add redirect from great-day2 to great-day-today
great-day -> great-day2
great-day2 -> great-day-today
  1. Update old redirect (from great-day to great-day2) to redirect to great-day-today
great-day -> great-day-today
great-day2 -> great-day-today
  1. Update entity (ID 1) again with original slug great-day
  2. You need to remove all redirects from great-day and rewrite previous redirects
great-day2 -> great-day

This is complex and I really doubt you will have it done right at the first time.

Additionally, you still cannot have any other document with slug great-day2 since there is still redirect for it!

Instead of all that complexity I propose different solution.

Slug with hashids

As I have mentioned at the beginning of this guide - identifiers should be immutable. slug is mutable therefore causing problems. So instead of having only slug in the URL lets put there 2 things - slug (useful for humans) and hashids version ( or any other short id solution) of ID (useful for machines).

It might look like this where great-day is the slug and jR is hashed ID 1`

https://example.com/great-day-jR 

Now whenever you need to find a document for that URL you just perform a lookup by ID 1, recompute the slug from the title and optionally redirect to fresh slug if the one in URL differs.

No uniqueness checks! No complicated redirect setup! You can have as many documents with the same title and slug as you want! All of it for a price of a little bit longer URL with pseudo random characters in the end.

I believe that is a good trade-off.

Comparison table

At the end I'm leaving this comparison table that helps summarize the most critical characteristics.

SizeSerialized sizeMonotonicityStateless
Sequential IDUp to 64 bitsvariesYesNo
UUID12836No/Partially 7Yes
ULID12826Partially 8Yes
ObjectId9624PartiallyYes
Snowflake64Up to 19PartiallyPartially 9
Snowflake + base64url6411PartiallyPartially 9

Predictability and security

Sometimes people claim they use unpredictable IDs as a security layer. Guessing the next UUIDv4 is practically impossible, so that should be good for security, right?

Not exactly.

An attack in which a user gains access to data that they are not supposed to is called an Insecure Direct Object Reference (IDOR), and it is a very common security issue. While the IDs are practically unpredictable, the attacker does not really need to predict them. Joseph rez0 made a good list of potential sources for such vulnerable IDs. Some of them are: logs, browser history, JWT Tokens, screen share, and others.

What is the lesson for you?

Always protect confidential information for authenticated users only. If you really need to make an identifier public that is supposed to be shareable, ensure that no critical pieces of information are exposed without a proper authentication layer.

Example? Shipping tracking number. User can see a tracking status by entering the tracking number. You can display the package size and current status, but no address, name, or anything personal until the user gets authorized.

Are predictable IDs better then?

Not exactly. Sequential IDs reveal the size of collection, but timestamp-based ones reveal the creation date. This might disclose information about the date of certain actions or the creation date of entities, which is not always something you want to disclose.

Footnotes

  1. It is impossible to achieve strong consistency (C), high availability (A) and partition tolerance (P) at the same time in distributed systems. 2

  2. DynamoDB does not support auto generated sequential ID. Same for CouchDB

  3. While it is not straightforward to decode values encoded by hashids without knowing salt and alphabet it is not impossible.

  4. For environments where representing 64-bit unsigned integer is not easy special conversion algorithm gets applied

  5. https://discord.com/developers/docs/reference#snowflakes

  6. https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c

  7. Partially only for version 1

  8. Partially only for monotonic version

  9. Requires initial-coordination, becomes stateless later 2

Łukasz Kużyński - Wookieb
Thoughts, tips about programming and related subjects to make your job easier and more pleasant so you won't burnout quickly.
Copyright © Łukasz Kużyński - Wookieb 2023 • All rights reserved.