We need to talk about tagging

Ian Piper
14 December 2020
Published in: Artificial intelligence

We need to talk about tagging. Specifically, we need to talk about assistive tagging: what it is, how you do it and why it works better than either manual tagging or auto-tagging. In this article I'm just introducing the different moving parts. Later I'll be talking about exactly what you need to do to make these moving parts work together. Impatient? Know this stuff already? Feel free to jump ahead to the more detailed articles: Using the Drupal PoolParty integration module (coming soon), Using the PoolParty Extractor for assistive tagging and Content classification without building an integration.

Why tagging matters

Let's begin with why classification of content (or tagging - these mean slightly different things but for our current purposes they are interchangeable) matters. Let's say you have a collection of content amounting to, say, 40 million separate items. Unlikely, you say? I worked with a publishing house that estimated their content at more than double this amount, and any large organisation would probably have content numbers well into the millions. I say estimated, because they simply didn't have any kind of handle on how much content they actually had. And that's the issue, more than the raw number. There are good, sound business reasons to have a clear understanding of the content you hold:

If you can find your content, you can re-use it
If you can understand your content, you can link it to other, related information in your business to enhance the value of both
If you know your content, you know more about your business - you can build a knowledge network

When I talk with businesses about this, it is usually around this point that someone says one of two things:

"Our content management system gives us all the structure we want. Problem solved."
"We have a great search engine; we can find everything we have. Problem solved."

It turns out that neither of those things is correct.

Many content management systems use a fixed folder structure for storing pieces of content. That means that a single piece of content tends to be placed in a single location within a folder hierarchy. To make that work there needs to be a guiding architectural vision that determines where in the hierarchy each new piece of content belongs. This may all sound fine, but in practice the decision about where content belongs in the rigid hierarchy requires a good understanding of the content (not always easy) and a good understanding of the business (also hard).

Worse, content is rarely about just one thing. If a piece of content is relevant to several different needs, you need to be able to find it from every place where it is relevant. Organisations tie themselves in knots trying to accommodate this requirement when they have to work in fixed folder structures.

Worse still, locking content into a rigid fixed hierarchy of storage may make sense initially, but of course the needs of business are continually changing, and that fixed storage quickly becomes outdated and progressively less tuned to the needs of the business. Content storage needs to be flexible.

I've talked about the manifest shortcomings of search engines elsewhere, so I'm not going to harp on about it here. Suffice it to say, relying on a search engine alone to solve your information discovery problems is futile. Improving precision and relevance in search requires structure in the content and classification of the content. Even Google, the ultimate full-text search engine, is now championing this approach through the Google Knowledge Graph.

Solving these problems requires two things: better content architecture and content tagging.

Better content architecture will give you a way to store your content items in such a way as to let you manage them effectively and discover and re-use them in the future as your developing needs dictate.
Content tagging gives you a way to unambiguously say what your content is about.

You can read more about effective content architecture for better content management elsewhere on this site. I'm focusing on tagging in this article, though I need to say in passing that tagging content gives you a way to organise it into logical parcels of related information, in turn allowing you to build different perspectives on your knowledge based on its tagging.

More on that in other articles; let's get back to the benefits of tagging. Tagging is a way of building up a profile around a piece of content that describes precisely what the content is about, within the context of the knowledge domain that it is intended for. This profile is achieved by creating a taxonomy of concepts that are relevant for that knowledge domain. For example, within a bank, a taxonomy of concepts for tagging its content needs to be focused on banking and finance. Because this taxonomy is aimed specifically at the banking knowledge domain, the content can confidently be tagged with those concepts. So, for example, suppose we have a piece of content like this fictional example:

"We provide customers with financial services that help people better manage their lives. As technology advances and competition increases, banks are offering different types of services to stay current and attract customers.

Whether you are opening your first bank account or have managed a current account for years, it helps to know the different types of banking services available. This ensures you get the most out of your current financial institution. Deciding which services are most important can lead you to the bank that best fits your needs."

We might tag this content with the concept Services. Because we have used a taxonomy built for the knowledge domain of banking, we know that this concept means services within that knowledge domain, and not religious ceremonies, shots in a game of tennis, the armed forces or the place where you stop for coffee on the motorway (all of which might also be tagged with a services concept). In case there were any doubt, we also have a definition, relations and other properties that spell out exactly what we mean by services. Sorry if this seems like I'm going on at undue length, but it's important to be clear about the advantages of tagging with a rich concept within a structured taxonomy, rather than a simple term in a keyword list.
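To make the contrast between a term and a concept concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Concept class, its fields and the example.org identifiers are hypothetical and are not taken from any particular taxonomy tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A taxonomy concept: an identified thing, not just a string."""
    uri: str             # unique identifier (hypothetical example.org URIs)
    label: str           # the human-readable name
    definition: str      # what the concept means in its knowledge domain
    broader: tuple = ()  # identifiers of parent concepts, if any


# Two concepts that share a label but mean entirely different things.
banking_services = Concept(
    uri="https://example.org/banking/services",
    label="Services",
    definition="Financial products and facilities a bank offers its customers.",
    broader=("https://example.org/banking/retail-banking",),
)
motorway_services = Concept(
    uri="https://example.org/travel/services",
    label="Services",
    definition="A rest area beside a motorway.",
)

# As plain keywords the two are indistinguishable...
same_term = banking_services.label == motorway_services.label
# ...but as concepts they are clearly different things.
same_concept = banking_services.uri == motorway_services.uri
print(same_term, same_concept)
```

A keyword list can only ever give you the first comparison. Tagging with the identifier rather than the label is what removes the ambiguity.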

So now we know, beyond any ambiguity, what this content is about. If we have many pieces of content tagged in the same way, then we know that there is a body of content about our services, and if we look for content based on those tags we can find that body of content. You may have come across the word metadata; well, tagging content using a taxonomy gives you high-quality metadata.

In passing I'll also mention corpus analysis. If you want to create a taxonomy that is tailored for a knowledge domain, corpus analysis is a good way to do it. Without going into detail for now, you start with a very basic hand-made taxonomy. Then you locate a representative body of content for that knowledge domain - this is the corpus. Then you feed the taxonomy and the corpus into PoolParty, and do the analysis. What comes out of this is a set of results that lets you fine-tune your taxonomy, adding in new concepts, retiring those that aren't relevant and refining those that are. Look out for a detailed article about this here, coming soon.
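PoolParty does the real analysis, and I'm not going to try to reproduce it here. Purely to show the shape of the idea, the toy Python sketch below takes a starter taxonomy and a small corpus and reports which labels never appear (candidates to retire) and which frequent words are missing from the taxonomy (candidates to add). The function name, the sample labels and the sample corpus are hypothetical, and plain word counting stands in for whatever PoolParty actually does.

```python
import re
from collections import Counter


def analyse_corpus(labels, corpus, min_count=2):
    """Toy corpus analysis (NOT PoolParty's algorithm): plain word counting.

    Only handles single-word labels. Returns (unused_labels, candidate_terms):
      unused_labels   - taxonomy labels that never occur in the corpus
      candidate_terms - words seen at least min_count times that are not
                        yet taxonomy labels
    """
    known = {label.lower() for label in labels}
    counts = Counter()
    for document in corpus:
        counts.update(re.findall(r"[a-z]+", document.lower()))
    unused = sorted(label for label in known if counts[label] == 0)
    # Skipping short words is a crude stand-in for a proper stop-word list.
    candidates = sorted(
        word for word, n in counts.items()
        if n >= min_count and word not in known and len(word) > 4
    )
    return unused, candidates


labels = ["Services", "Mortgage", "Account"]
corpus = [
    "Our services include a current account and a savings account.",
    "Customers use our services to manage savings and payments.",
]
unused, candidates = analyse_corpus(labels, corpus)
print("Possibly retire:", unused)
print("Possibly add:", candidates)
```

The output is only a list of suggestions. Deciding whether a missing label really should be retired, or a frequent word really deserves to become a concept, remains a job for someone who knows the knowledge domain.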

Different ways to tag

When you tag a piece of content, you don't actually change the content at all. The tag is stored separately. Where it is stored, and how the tagging itself happens, is the subject of the rest of this article. It's going to get a bit involved, so this may be a good point to grab a richly-deserved cup of coffee.

Traditional content management systems often include a taxonomy component. This usually provides some sort of tag list that you can use to specify the tagging for the content item you're working on. It's fairly simple: you may be the author or editor, and you are reading the content in some sort of editor view, and you decide what tags to apply. You choose one or more terms. Very straightforward. However, it's limited:

Content management system taxonomies are just lists of terms or keywords rather than collections of concepts. Tagging with terms rather than concepts means that you don't get the full benefit of rich semantic tagging. Take a look at our short note on Concepts versus Terms, and the longer article Why things are better than strings, for more.
Content management system taxonomies are internal to the CMS and proprietary. The tag list, and the fact that a particular piece of content is tagged with a particular term, are not visible to other systems running in your business. They're yet another information silo.
The tagging is a hard database link between two different entities in your CMS. This may not seem such a big deal, but it becomes a big deal when (not if) you decide to move to another CMS in future. Those proprietary links will almost certainly not be preserved in your next CMS, and all of your hard tagging work will be wasted.
Manual tagging using CMS taxonomies can be a painful process for authors and editors, adding another layer of effort on top of the work involved in just writing and editing. It is also a classic case of WIIFM ("what's in it for me?"): the author or editor who painstakingly classifies their content against controlled vocabularies is rarely the person who benefits from it. The result is depressingly predictable: people will find ways to subvert systems that don't add value to them or seem bureaucratically pointless.

So anything that we can do to improve the standardisation of tagging, to reduce the burden of classification and to make the process run more smoothly is likely to be beneficial, both tactically to the author/editor and strategically to the business.

The problems of proprietary tags and of linking content with wider information can be solved by using a separate taxonomy management system, and by storing the tagging fact itself independently. Holding the right information in the right systems (sometimes called separation of concerns), and allowing those systems to talk to each other through standard interfaces, all help organisations to address proprietary data issues. I'll come back to some practical ways to do this later in this series of articles.
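As a rough illustration of what storing the tagging fact independently could look like, here is a minimal Python sketch. The TagStore class, the content identifiers and the example.org concept identifiers are all hypothetical. A real setup would keep these facts in a shared store that other systems can query through a standard interface, not in memory.

```python
from collections import defaultdict


class TagStore:
    """Holds tagging facts outside both the CMS and the taxonomy system.

    Each fact simply links a content identifier to a concept identifier,
    so it can survive a change of CMS as long as the identifiers are kept.
    """

    def __init__(self):
        self._by_content = defaultdict(set)
        self._by_concept = defaultdict(set)

    def tag(self, content_id, concept_uri):
        """Record that a piece of content is about a concept."""
        self._by_content[content_id].add(concept_uri)
        self._by_concept[concept_uri].add(content_id)

    def concepts_for(self, content_id):
        """Return every concept a piece of content is tagged with."""
        return sorted(self._by_content.get(content_id, ()))

    def content_about(self, concept_uri):
        """Return every piece of content tagged with a concept."""
        return sorted(self._by_concept.get(concept_uri, ()))


store = TagStore()
services = "https://example.org/banking/services"
accounts = "https://example.org/banking/accounts"

# One piece of content can be about several things at once.
store.tag("article-17", services)
store.tag("article-17", accounts)
store.tag("article-42", services)

print(store.content_about(services))
print(store.concepts_for("article-17"))
```

Neither the content nor the taxonomy is touched here. The store holds only the link between the two, which is what lets each system do its own job.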

An increasingly common response to the manual tagging problem is auto-tagging. The idea is that the author or editor hands off the job of tagging to a robot derived from an artificial intelligence program. Rather like a search engine robot, this analyses the content and automatically assigns the most appropriate taxonomy concepts. It's easy to see why organisations elect to try this approach; it's automatic, it's quick and it takes the burden away from people.

However, auto-tagging is not a solution to the problem of classification, any more than search engines solve problems of information discovery. It's a siren song. I have worked with quite a few auto-tagging systems over the years and in my experience none of them is an effective substitute for thoughtful classification by a human subject matter expert.

Luckily, there is a half-way house: assistive tagging. Like auto-tagging, this uses a computer program to analyse your content and to suggest candidate tagging concepts from a taxonomy. Unlike auto-tagging, a human being, usually a subject matter expert, gets the final say about whether the suggestions make sense, and can choose to use or not use them as appropriate.
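The division of labour can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any real extractor works: the function names, the tiny taxonomy and the example.org identifiers are hypothetical, the text is adapted from the fictional banking example, and naive label matching stands in for proper content analysis.

```python
def suggest_concepts(text, taxonomy):
    """Suggest concepts whose label appears in the text (naive matching)."""
    lowered = text.lower()
    return [uri for uri, label in taxonomy.items() if label.lower() in lowered]


def assistive_tag(text, taxonomy, review):
    """Offer suggestions to a reviewer and keep only those they accept."""
    suggestions = suggest_concepts(text, taxonomy)
    return [uri for uri in suggestions if review(uri, taxonomy[uri])]


# Hypothetical taxonomy: concept identifier -> preferred label.
taxonomy = {
    "https://example.org/banking/services": "Services",
    "https://example.org/banking/current-account": "Current account",
    "https://example.org/banking/technology": "Technology",
}
text = (
    "Whether you are opening your first bank account or have managed a "
    "current account for years, it helps to know the different types of "
    "banking services available. As technology advances, banks adapt."
)


def expert_review(uri, label):
    """Stand-in for the subject matter expert.

    Accepts everything except Technology, judging that the text only
    mentions it in passing.
    """
    return label != "Technology"


accepted = assistive_tag(text, taxonomy, expert_review)
print(accepted)
```

In practice the review step would be a person choosing from a list of suggestions in an editing screen. Swap the review function for one that always returns True and you have auto-tagging, which is exactly the difference between the two approaches.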

I believe assistive tagging represents a pragmatic approach to effective classification of content in the real world, and the following articles introduce some practical tools to help you use it.

Tagging in a content management system is only part of the picture, though. That content usually has to be surfaced somewhere else in the business to be of use. The most common place is a website, and it's helpful if the tagging can survive somehow into the published website. I'll be covering this in a separate article. As a bit of a spoiler though, the key is to use PoolParty's GraphSearch feature. This allows you to build faceted browsing and search features based on semantic, structured classification of content as opposed to just full-text indexing.