Atom IDs, or Unique Identifiers for Blog Posts

January 1st, 2024

Atom feeds add one burden to the publisher: if you intend to follow the specification, your entries require a lasting, well-formed unique identifier^[1]. Other formats are less strict. RSS and JSON Feed only require the ID to be unique, and JSON Feed relaxes this further to being only locally unique. Both usually advocate for the post’s URL (or permalink), but this isn’t necessarily the best option: although cool URIs don’t change, in practice there’s very little perma in permalinks.

There’s not really anything new in this post that wasn’t already covered 20 years ago in Mark Pilgrim’s post on “How to make a good ID in Atom”^[2].

It’s easy to generate a URL as an ID at run-time, but its quality is based on the assumption that your URL structure^[3] and domain never change. And this “never” part is hard, because URLs are not really persistent even if you handle everything on your end. Domains are rented after all.

As a quick refresher, URLs are web addresses that you can use in your browser to find a page. URIs are a superset of URLs that also includes other resource identifiers that do not (directly) dereference to a web location like a URL does. These were previously called URNs. However, in practice URL means a thing you can put into a web browser, and URI means a thing that does point to a thing but can’t necessarily be put into a web browser^[4].

So, what other options do we have for good Atom IDs, if URLs are not going to cut it?

The first option that comes to mind for easily creating something globally unique is to roll some dice and generate a UUID. These are entirely opaque. The good thing is that there is an existing namespace^[5] for UUID URIs (urn:uuid:fbffecb5-bb5c-4c4b-95e3-7e03eb807b18), so they pass Atom ID requirements. However, a random UUID can’t be regenerated later, so you need to store it somewhere.
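As a minimal sketch of the dice roll, Python’s standard `uuid` module can produce a random UUID already formatted as a `urn:uuid:` URI. The function name `new_entry_id` is just an illustration, not something from any feed library.

```python
import uuid


def new_entry_id() -> str:
    """Return a random (version 4) UUID formatted as a urn:uuid: URI."""
    return uuid.uuid4().urn


# Generate once when the post is created, then store the result with the post.
entry_id = new_entry_id()
print(entry_id)
```

The value is different on every call, which is exactly why it has to be stored alongside the post.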

And if you can store your ID, you can also use the post’s permalink at the time of creation as the ID for the post. Just remember to always use this stored URL instead of one generated at run-time, in case any metadata that affects the post’s URL has changed.
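The “generate once, then always read the stored value” idea could look roughly like this. The sketch assumes the post’s metadata is available as a plain dict (for example, parsed front matter) with a hypothetical `id` key; the example.org URLs are placeholders.

```python
def ensure_entry_id(front_matter: dict, current_permalink: str) -> str:
    """Return the stored ID, freezing the current permalink as the ID on first use."""
    if "id" not in front_matter:
        # First time we see this post: remember the permalink it had at creation.
        front_matter["id"] = current_permalink
    return front_matter["id"]


post = {"title": "Link rot"}
first = ensure_entry_id(post, "https://example.org/2019/11/link-rot.html")
# Later the URL structure changes, but the stored ID stays the same.
later = ensure_entry_id(post, "https://example.org/posts/link-rot/")
```

In a real setup the updated front matter would of course have to be written back to disk or to the database.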

In Pilgrim’s post, he advocated for tag URIs like tag:kalifi.org,2024-01-01:/2019/11/link-rot.html. They solve the problem of a domain’s (or other identification’s) validity over time by adding a date component. Essentially, it adds a claim that this object’s authority was valid at this time. You can then identify the object, as Pilgrim suggests, by creation timestamp^[6], or, to defeat the point a bit, with a relative permalink as in the example above. However, you probably again need to store this ID somewhere, or at least have the authority metadata available somewhere to regenerate it on the fly.
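Building such a tag URI is plain string assembly. Here is a small sketch that reproduces the example above; the function and parameter names are my own, and it does no validation of the authority or the specific part.

```python
from datetime import date


def tag_uri(authority: str, minted: date, specific: str) -> str:
    """Assemble a tag URI as tag:<authority>,<YYYY-MM-DD>:<specific>."""
    return f"tag:{authority},{minted.isoformat()}:{specific}"


example = tag_uri("kalifi.org", date(2024, 1, 1), "/2019/11/link-rot.html")
print(example)  # tag:kalifi.org,2024-01-01:/2019/11/link-rot.html
```

The date belongs to the authority claim, not to the post, so it can stay fixed for every entry as long as you held the domain on that day.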

Since Pilgrim’s post, there has been a new addition, the Named Information (NI) URI scheme. Like the magnet links of BitTorrent and other p2p networks, it identifies the resource by its hash. Content hashing is a bit infeasible for blogs, though, because any change to the content, like a typo fix, will change the hash^[7]. However, the RFC for Named Information doesn’t require the use of full content^[8] for hashes:

Other than in the aforementioned special case where public keys are used, we do not specify the hash function input here. Other specifications are expected to define this.

For blogging purposes, one could just hash some metadata, like creation date and title, and use that as the NI URI. However, the only difference from tag URIs would be that the authority part of the URI is optional here. Otherwise you’d essentially just be hashing a tag URI.
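To make the metadata-hashing idea concrete, here is a rough sketch that produces an `ni:` URI with a SHA-256 digest encoded as unpadded base64url. Which fields go into the hash, and how they are joined, is purely my assumption for illustration, since the RFC leaves the hash input to other specifications.

```python
import base64
import hashlib


def ni_uri(created: str, title: str, authority: str = "") -> str:
    """Build an ni: URI from a SHA-256 hash of (assumed) post metadata."""
    # Assumption: the hash input is the creation date and title joined by a newline.
    material = f"{created}\n{title}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Encode the digest as base64url and strip the trailing '=' padding.
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    # An empty authority yields the ni:///sha-256;... form.
    return f"ni://{authority}/sha-256;{value}"


without_authority = ni_uri("2019-11-05", "Link rot")
with_authority = ni_uri("2019-11-05", "Link rot", authority="example.org")
```

Note that a retitled post would get a new ID under this sketch, so the title would have to be the one stored at creation time.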

Anyway, we are bikeshedding a bit here. Ultimately, a URL is a natural and also globally unique identifier for the resource. It’s probably persistent for some number of years. Many feed processors assume that Atom’s ID is a permalink, just like in other feed formats. And ultimately, when your posts’ permalinks change, the effect is a minor nuisance: some posts are duplicated and wrongly marked as unread in your readers’ RSS readers^[9].
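The duplication happens because a reader typically remembers entries by their ID. A toy model of that, with a made-up function name and placeholder example.org URLs, and not how any particular reader is implemented:

```python
def unread_after_refresh(seen_ids: set[str], feed_ids: list[str]) -> list[str]:
    """Return the IDs a naive reader would treat as new, unread entries."""
    return [entry_id for entry_id in feed_ids if entry_id not in seen_ids]


# The reader has already seen the post under its old permalink-based ID.
seen = {"https://example.org/2019/11/link-rot.html"}

# After a permalink change, the same post reappears under a new ID.
changed = unread_after_refresh(seen, ["https://example.org/posts/link-rot/"])

# With an unchanged ID, nothing is reported as new.
unchanged = unread_after_refresh(seen, ["https://example.org/2019/11/link-rot.html"])
```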

If you can easily store the IDs of your posts somewhere, be it in a database or front matter, your choice of scheme hardly matters, as long as you generate the IDs on creation and let them be. If you need to regenerate them, a tag URI is probably your best bet, because it lets you identify the object any way you want, from relative permalink to post title or creation timestamp.

You’re screwed anyway if/when your permalink structure changes, because then you need logic to regenerate old-style IDs for older posts and new-style IDs for newer posts. Some ingenious use of the NI scheme might work around this problem, but what that would look like is beyond the scope of this blog post.
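That old-style/new-style logic tends to end up as a date cutoff baked into the ID generator. A sketch of what it might look like, where the cutoff date, the domain, and both URL patterns are hypothetical:

```python
from datetime import date

# Hypothetical date on which the permalink structure changed.
STRUCTURE_CHANGED = date(2024, 1, 1)


def entry_id(published: date, slug: str) -> str:
    """Regenerate a permalink-based ID in the style that was current at publication."""
    if published < STRUCTURE_CHANGED:
        # Old style: /YYYY/MM/slug.html
        return f"https://example.org/{published:%Y/%m}/{slug}.html"
    # New style: /posts/slug/
    return f"https://example.org/posts/{slug}/"


old_style = entry_id(date(2019, 11, 5), "link-rot")
new_style = entry_id(date(2024, 1, 1), "atom-ids")
```

Every further change to the structure adds another branch, which is a decent argument for storing the ID instead.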

[1] Technically, Atom IDs aren’t just URIs, they are IRIs.
[2] Mark Pilgrim’s “infosuicide” means that while any link to http://diveintomark.org/archives/2004/05/28/howto-atom-id is broken (i.e. not dereferenceable), it still uniquely identifies that post. Unfortunately, the Wayback Machine rewrites most of the page’s metadata, so this ID, URL, or URI is not available on that page anymore.
[3] Both your blog’s and your blog post’s. Changing the former can lead to hard-to-spot changes in the latter, even if you try to maintain the structure.
[4] In other words, normal people use URL, and fancy people use URI.
[5] Valid URNs have a namespace, which is usually controlled by some authority that handles registration of identifiers (for example, ISBNs), so you can’t just go and create your own. However, a workaround is to use the urn:uuid: namespace.
[6] This is often available at date resolution even on database-less blogging platforms, in the post’s filename or front matter. Just don’t post more than once a day.
[7] Ideally, a non-significant change should not change the ID of the document. On the other hand, a significant change to the document would probably necessitate a new URL as well.
[8] What full content would even mean is a bit ambiguous. Does it mean the source content, which is unavailable to readers, so they couldn’t verify the hash? Or does it mean the final HTML, which is available to the reader but might include dynamic content unrelated to the main content, so that any change to the layout, for example updating a stylesheet or script link, would invalidate the hash?
[9] However, expecting anyone to be using an RSS reader in 2024 is probably wishful thinking. Also, I assume most advanced RSS readers can handle changing permalinks, among a variety of other oddities, because 100% standards-compliant feeds are rare.