The Million Ways to Markup Your Content

April 26th, 2015

One of the more interesting directions the web has been taking is the so-called semantic web, which essentially means helping computers figure out what humans have written on the web. The most practical reason to do this is, naturally, to help the almighty search engine bot overlords better index one’s site in hopes of a better ranking on the only list that matters, Google’s search results.

There were other, more academic, reasons for the semantic web as well, but so far it looks like a bunch of acronyms and a boatload of XML documents^[1]. No surprise that a bunch of more practical solutions are in actual use today. Too bad about the plural there, but it’s getting better – I just noticed Instapaper uses OpenGraph these days instead of their own version of Microformats.

For a blog, that leaves at least the following relevant metadata serializations:

HTML5, or core markup
Schema.org, or Microdata
h-entry / hentry, or Microformats
OpenGraph / Twitter Cards
JSON-LD, or Linked Data

What follows is what a web page that tries to implement all of these looks like.

HTML5

In a normal case there’s a site, an author and a blog post. All of these have a name (or a title), a URL, and the post has some content. All of them probably have more relevant metadata but for now I’m going to focus on these. The bare-bones HTML5 setup for the post would look something like this:

<!DOCTYPE html>
<html>
<head>
  <title>My Awesome Blog Post</title>
  <meta name="author" content="Me">
  <link rel="canonical" href="http://example.com/post">
  <link rel="author" href="http://example.com/me" title="Me">
  <link rel="home" href="http://example.com/" title="My Awesome Blog">
</head>
<body>
  <article>
    <header>
      <h1><a rel="bookmark" href="http://example.com/post">My Awesome Blog Post</a></h1>
      <!-- <address> is flow content and may not sit inside a <p>, hence the <div> -->
      <div>Posted by <address><a rel="author" href="http://example.com/me">Me</a></address>
      at <time datetime="2015-01-01">a time</time></div>
    </header>
    <p>This is the best post in the universe.</p>
  </article>
</body>
</html>

Even here we already have some duplication, although the title in the “author” links is totally optional and probably no one relies on that. Also, a link with rel="author" is probably clear enough to tell that the person can be contacted this way and it doesn’t need to be wrapped in an <address>. This outer element is useful later on, though.

There’s probably very little difference between canonical and bookmark except that one is allowed in the <head> and the other within an <article>. One thing to note is that there seems to be no pure-HTML5 way to mark a time as the published or updated time of the document. There was a short-lived, and resurrected, pubdate attribute but it’s now dead for good.

Schema.org Microdata

Let’s duplicate the metadata with Schema.org tags!

<!DOCTYPE html>
<html>
<head>
  <title>My Awesome Blog Post</title>
  <meta name="author" content="Me">
  <link rel="canonical" href="http://example.com/post">
  <link rel="author" href="http://example.com/me" title="Me">
  <link rel="home" href="http://example.com/" title="My Awesome Blog">
</head>
<body>
  <article itemscope itemtype="http://schema.org/BlogPosting">
    <header>
      <h1 itemprop="name">
        <a itemprop="url" href="http://example.com/post">My Awesome Blog Post</a>
      </h1>
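      <!-- <div> rather than <p>: <address> may not sit inside a <p> -->
      <div>Posted by
        <address itemprop="author" itemscope itemtype="http://schema.org/Person">
          <!-- itemprop on an <a> yields its href, so the name gets its own element -->
          <a rel="author" href="http://example.com/me" itemprop="url"><span itemprop="name">Me</span></a>
        </address>
        at
        <time datetime="2015-01-01" itemprop="datePublished">a time</time>
      </div>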
    </header>
    <div itemprop="articleBody">
      <p>This is the best post in the universe.</p>
    </div>
  </article>
</body>
</html>

To implement Schema.org tags, one concession is to wrap the blog post content in a div. HTML5 defines the main content of an article (or a section) implicitly, as whatever is inside it but not in a header or a footer, whereas most metadata vocabularies want an explicit element to mark as the body content. Above we use the otherwise superfluous address to define a Person to which a name and a url belong. With Schema.org microdata the post also finally gets some markup telling when it was published^[2].

h-entry Microformat

Microdata is the cool kid, but we can go further and use Microformats and mark up the blog post as an h-entry. It’s like an Atom Entry, but inline and without the XML.

<!DOCTYPE html>
<html>
<head>
  <title>My Awesome Blog Post</title>
  <meta name="author" content="Me">
  <link rel="canonical" href="http://example.com/post">
  <link rel="author" href="http://example.com/me" title="Me">
  <link rel="home" href="http://example.com/" title="My Awesome Blog">
</head>
<body>
  <article itemscope itemtype="http://schema.org/BlogPosting" class="h-entry">
    <header>
      <h1 itemprop="name" class="p-name">
        <a itemprop="url" href="http://example.com/post" class="u-url">My Awesome Blog Post</a>
      </h1>
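      <!-- <div> rather than <p>: <address> may not sit inside a <p> -->
      <div>Posted by
        <address itemprop="author" itemscope itemtype="http://schema.org/Person" class="p-author h-card">
          <!-- itemprop on an <a> yields its href, so the name gets its own element -->
          <a rel="author" href="http://example.com/me" itemprop="url" class="u-url"><span itemprop="name" class="p-name">Me</span></a>
        </address>
        at
        <time datetime="2015-01-01" itemprop="datePublished" class="dt-published">a time</time>
      </div>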
    </header>
    <div itemprop="articleBody" class="e-content">
      <p>This is the best post in the universe.</p>
    </div>
  </article>
</body>
</html>

Now with class attributes! Because Microdata and Microformats have a lot in common, what really happens is that all tags that have an itemscope get a root class (h-entry, h-card) and all itemprops get a matching property in a class attribute. This might also be an indicator that the end is nigh for Microformats.

There is also an older version of h-entry called hentry, which just means that the class attributes use slightly different property names (e.g. entry-content instead of e-content), and you just need to add those as well if you want to support the older spec.
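Supporting both generations is mostly a matter of appending class names. Here is a minimal Python sketch of that bookkeeping. The mapping table is an assumption based on the classic hAtom and hCard class names, not a complete copy of the spec, and with_legacy_classes is a made-up helper name.

```python
# Assumed mapping from the newer class names to their classic hAtom/hCard
# counterparts. Illustrative only, not an exhaustive copy of the spec.
COMMON = {
    "h-entry": "hentry",
    "h-card": "vcard",
    "e-content": "entry-content",
    "p-summary": "entry-summary",
    "dt-published": "published",
    "dt-updated": "updated",
    "p-author": "author",
}

# p-name and u-url translate differently depending on the enclosing root.
BY_ROOT = {
    "h-entry": {"p-name": "entry-title"},
    "h-card": {"p-name": "fn", "u-url": "url"},
}


def with_legacy_classes(class_attr, root):
    """Return class_attr with the assumed legacy equivalents appended.

    root is the microformat the element belongs to ("h-entry" or "h-card").
    """
    mapping = {**COMMON, **BY_ROOT.get(root, {})}
    names = class_attr.split()
    for name in list(names):
        legacy = mapping.get(name)
        if legacy and legacy not in names:
            names.append(legacy)
    return " ".join(names)
```

For the example post, with_legacy_classes("p-author h-card", "h-entry") gives "p-author h-card author vcard". The post permalink gets no legacy class at all, because hAtom marked it with rel="bookmark", which the plain HTML5 example already carries.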

The awesomeness in hAtom and its successors is that they implement an Atom or RSS feed in your blog without any additional files.

OpenGraph

On the more proprietary front, Open Graph (and Twitter Cards) are entirely different beasts. With Microdata and Microformats, everything happened in the <body>. The good thing with these two was that no duplication of content was really necessary. However, Open Graph requires that the content is to some extent duplicated in the <head> element.

The point obviously being that an OpenGraph consumer only has to check the document’s <head> element to get all the metadata, unlike Microdata or Microformats consumers, which have to scan through the whole document.

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#">
<head>
  <title>My Awesome Blog Post</title>
  <meta name="author" content="Me">
  <link rel="canonical" href="http://example.com/post">
  <link rel="author" href="http://example.com/me" title="Me">
  <link rel="home" href="http://example.com/" title="My Awesome Blog">
  <meta property="og:title" content="My Awesome Blog Post">
  <meta property="og:type" content="article">
  <meta property="og:image" content="http://example.com/post/image">
  <meta property="og:url" content="http://example.com/post">
  <meta property="article:published_time" content="2015-01-01">
  <meta property="article:author" content="http://example.com/me">
  <meta name="twitter:card" content="summary">
  <meta name="twitter:site" content="@me">
  <meta name="twitter:creator" content="@me">
  <meta property="og:description" content="This is the best post in the universe.">
</head>
<body>
  <!-- snip -->
</body>
</html>

The first thing is to add some prefixes to the root html tag, as OGP extends HTML5 a bit. If these prefixes are not added, your friendly HTML5 ~~validator~~ conformance checker will not ~~validate~~ be happy.

The social graph requires a new additional piece of metadata, an image – the web is visual after all^[3]. This should not be a version of the site’s logo, but related to the blog posting. Some CMSes let you set a post’s featured image, so you might want to go with something like that – or a random picture off /r/blop. For our example post here the image will be http://example.com/post/image.

The first four og properties are the required OpenGraph properties. Note that the og:type is set to “article” instead of the default “website”, which gives additional metadata properties that are more suitable for blog posts (published_time, author).
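From the consumer’s side, the head-only lookup and the check for the four required properties could look roughly like this. It is a sketch using only Python’s standard library. HeadMetaParser, read_open_graph and missing_required are hypothetical names, and a real crawler would also have to deal with repeated properties and broken markup.

```python
from html.parser import HTMLParser

# The four required properties named above.
REQUIRED_OG = ("og:title", "og:type", "og:image", "og:url")


class HeadMetaParser(HTMLParser):
    """Collect <meta property=... content=...> pairs found in the <head>."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self.head_done = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            # Also covers documents that omit the closing </head>.
            self.head_done = True
        if self.head_done or tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property")
        if key and "content" in attr:
            # Keep only the first value of a repeated property.
            self.properties.setdefault(key, attr["content"])

    def handle_endtag(self, tag):
        if tag == "head":
            # Everything after </head> is ignored.
            self.head_done = True


def read_open_graph(html):
    """Return a dict of the property/content pairs in the document head."""
    parser = HeadMetaParser()
    parser.feed(html)
    parser.close()
    return parser.properties


def missing_required(properties):
    """List the required og: properties that are absent."""
    return [name for name in REQUIRED_OG if name not in properties]
```

Since Twitter’s tags use name rather than property, this sketch skips them. Reading them too would mean checking both attributes.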

Unfortunately, the author property’s content value is currently a mess. OGP defines it as an array of http://ogp.me/ns/profile# objects, Facebook defines it as a link to the user’s Facebook profile^[4], WordPress uses it as a link to the author’s local profile page^[5], and on top of that, Pinterest has defined it as the username^[6].

Twitter’s implementation is a bit lighter, as it can reuse existing OpenGraph tags. Twitter does require a description (either og:description or twitter:description). Also note that Twitter uses name and content where OGP uses property and content, and thus does not need the prefix declarations on the html tag.

However, Twitter requires both twitter:site and twitter:creator accounts. For a small blog, these are probably both just the author himself.
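The reuse of OpenGraph tags amounts to a simple fallback rule. The sketch below assumes the fallback covers title, description and image, which goes a little beyond the description case mentioned above. resolve_card is a hypothetical helper and not Twitter’s actual code.

```python
# Assumed fallback pairs: the Twitter-specific tag wins, otherwise the
# corresponding Open Graph tag is reused.
FALLBACKS = {
    "twitter:title": "og:title",
    "twitter:description": "og:description",
    "twitter:image": "og:image",
}


def resolve_card(meta):
    """Build the effective card fields from a dict of tag name -> content.

    meta may mix twitter:* and og:* keys, as the example page does.
    """
    card = {key: value for key, value in meta.items() if key.startswith("twitter:")}
    for twitter_name, og_name in FALLBACKS.items():
        if twitter_name not in card and og_name in meta:
            card[twitter_name] = meta[og_name]
    return card
```

Fed the tags of the example page, the card picks up its title, description and image from the og: values, while twitter:card, twitter:site and twitter:creator pass through untouched.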

Note that when implementing OGP and Twitter Cards, the document got additional metadata (image, Twitter username, …) that is not visible in the other metadata serializations. At this point, we would need to go back and add the newly introduced data to the other formats as well.

JSON-LD

JSON-LD is the latest cool thing, but I feel it is a step backwards. Essentially, JSON-LD is a JSON object hidden in a <script type="application/ld+json"> element on a web page. In a way, Linked Data is the latest reincarnation of the Semantic Web, and JSON is just a different serialization vehicle: not entirely unlike XML, and just as capable of delivering a load of something reminiscent of RDF or Yahoo!'s DataRSS.

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#">
<head>
  <!-- snip -->
  <script type="application/ld+json">
  { "@context": "http://schema.org",
    "@type": "BlogPosting",
    "name": "My Awesome Blog Post",
    "url": "http://example.com/post",
    "datePublished": "2015-01-01",
    "author": {
      "@type": "Person",
      "name": "Me",
      "url": "http://example.com/me"
    },
    "articleBody": "This is the best post in the universe."
  }
  </script>
</head>
<!-- snip -->

See, it’s a web page inside a web page! The ultimate in duplication. I thought the whole point was that the document metadata could be included in the original document.

JSON-LD is good for many other things, and its creator readily states that:

“I’ve heard many people say that JSON-LD is primarily about the Semantic Web, but I disagree, it’s not about that at all. JSON-LD was created for Web Developers that are working with data that is important to other people and must interoperate across the Web. The Semantic Web was near the bottom of my list of “things to care about” when working on JSON-LD, and anyone that tells you otherwise is wrong. :P”

I’m sure that for the googlebots, JSON-LD is a nice way to distill the data of a website, but I fail to see the benefits over Microdata for the current use case.
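For a crawler the attraction is easy to see: pulling the data out takes a script-tag lookup and a JSON parser. Here is a minimal sketch with Python’s standard library. extract_json_ld is a hypothetical helper, and no actual JSON-LD processing (context expansion and the like) is attempted.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every application/ld+json script."""

    def __init__(self):
        super().__init__()
        self.in_json_ld = False
        self.chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_json_ld = True
            self.chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self.in_json_ld:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_json_ld:
            self.documents.append(json.loads("".join(self.chunks)))
            self.in_json_ld = False


def extract_json_ld(html):
    """Return a list with one parsed JSON value per JSON-LD script element."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.documents
```

Run against the example page, this returns a single dict whose "@type" is "BlogPosting". Getting the same fields out of Microdata means walking the whole body and tracking itemscope nesting.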

Google and other search engines

For a while, Google encouraged people to point their author links to their Google+ profiles, like http://plus.google.com/<profile-id>/?rel=author. Google has since dropped support for this.

Also, as mentioned, the venerable Microformats seem to sadly be on the way out. Google’s Structured Data Testing Tool still seems to parse them but they are not listed on Google’s Structured Data Policies page.

Other uses

All the above markup is not just for search engines or social media sites. Throughout this post, I have linked to other sites (this is hypertext, after all), and wouldn’t it be awesome if your browser could show a short summary of the link when you hover over one? OpenGraph and Twitter Cards enable Facebook and Twitter, and the web site owners, to create more relevant summaries^[7] of the content people post. There are services like Embed.ly and code like Onebox that let people do nicer embeds on their sites – and with stuff like OpenGraph they don’t have to create per-site customizations for all sites.

So, it’s not just for the machines.

As the examples show, there’s quite a lot of duplication of data going on even though the basic HTML5 setup already had all the most relevant metadata. Obviously, the other specs and protocols give many more properties to be hung on the simple blog post. However, the nicest part in all these things is that nowhere were ~~monsters~~ things like XML or RDF mentioned^[8].

Notably absent is a sane implementation of Dublin Core with HTML5. One would expect their implementation to be somewhat close to OGP's or Twitter’s.

[1] Somehow, this all reminds me of the early days of MP3s when (probably the same) people were wondering how to correctly tag classical pieces in ID3 tags.
[2] Note how HTML5 and ISO 8601 both allow very coarse timestamps. There is no reason to add fake precision by pretending the post was posted at 2015-01-01T00:00:00+00:00. HTML5's time would be ok with 2015-W01 but some consumers downstream won’t be happy.
[3] Except for you folks out there on Links and Lynx.
[4] The inability of the original author to follow the protocol, and additionally, making their own implementation limited to their own service, is quite mind-boggling.
[5] This profile page, in turn, defines a http://ogp.me/ns/profile# object.
[6] Pinterest’s is unquestionably the only wrong implementation here.
[7] Hey, why don’t we call them Rich Site Summaries? Oh wait…
[8] Although, with JSON-LD we got really close.