Building a content graph, part one: Guiding principles

This is the first in a series of six articles on building a content graph. In this part I will introduce the concept of a content graph and set the scene for the following articles. At the end of this article I will set out some golden rules for designing and building content graphs.

This should be a gentle introduction; I need to cover some ideas that may be new to some people, but hopefully I will do it in plain English. I am making no assumptions that you know what a content graph is, but I hope that you have sufficient curiosity about the subject to stick with it for now.

In this first part I will introduce the different graph design components that will feature significantly in later articles:

  1. Guiding principles for content graphs (this article)
  2. Content design for content graphs
  3. Taxonomy design for content graphs
  4. Middleware design for linking information
  5. Information model design for content graphs
  6. Using a graph database to tie everything together

 

Backgrounder; content architecture in content management systems

Organisations of all types and sizes use content management systems to collect together information within a website or an intranet or a documentation system. Typically, a content management system contains a relational database that has tables, columns and rows. It will have some means of information storage; for most content management systems, this is usually textual information, and it is often stored in the database. There is usually some kind of management layer that sits on top of the database, and feeds the data out of it into a user interface. The user interface is the main route by which an end user consumes the data. A well-designed content management system will have an architecture designed with a degree of micro structure, describing the way an individual item of content is assembled, and a macro structure, describing how content items can in turn be assembled into a larger work.

Think of a book; it may have components such as front matter, tables of content, preface, index, parts, chapters, and sections. I’m making an assumption here, that the section is the smallest component in this content type; it could easily be further sub-divided if necessary (for example, an educational book may have a number of lesson plans as components within a larger section structure), but eventually you will arrive at the indivisible item of content. A good rule of thumb for an ideal content architecture is to consider the level at which content might be re-used, and take that as the lowest level of the macro content architecture. Notice that I didn’t include the page as a component. For most content, a page is not an architectural component. The content presented within the lowest level of the content architecture may span several pages, but pages are more about the flow of presentation of content rather than the structure of the content.

That is the macro structure. The micro structure comes from how a single content item is made up. If the lowest level of macro architecture is a section in a book, it may have a heading, one or more lower-level headings, possibly an author, footnotes and of course the main body of content. Those structural elements go to make up the micro structure in the content architecture.

Content management systems and architectures built in this way have been extremely successful in improving the publication of content to the web and providing a degree of management capability to organisations wanting to keep control of their content. It’s not surprising that so many organisations take this approach, and it’s also not surprising that so many competing content management system products have come onto the market.

Content management systems vary widely in the features they offer, but the attractions are common:

  • They allow safe and secure storage for text, images and other content.
  • They allow some separation of content from presentation, meaning that the content can be displayed in a variety of formats to different devices.
  • They can be made searchable by indexing the content.
  • There may be an opportunity to design content architecture to allow structure in the content.

It's not all good news, though

In my work with organisations, I have noticed that the same information problems crop up time after time:

  • It’s hard to find things.
  • It’s hard to track the history of information (who, when, why, where?).
  • It’s difficult to re-use (or re-purpose) information.
  • Search engines don’t deliver well with unstructured content.
  • Content architecture, where it exists, is monolithic, not granular.
  • Classification is haphazard and based on keyword tagging.
  • It’s difficult to link information between systems.

It would be nice to think that using a content management system would address these problems, and to a degree this may be true. However, if some basic architectural design features are absent then simply introducing a content management system and adding content to it does not address the core issues. Let’s take a look at some of these issues to understand why.

Finding information

Most content management systems rely on a structure based on a hierarchical menu system with content items attached to (stored in) specific locations in the navigation. In such cases, provided that the content belongs logically in just one location, it can actually be quite easy to find things. However, that relies on there being a well-exposed, well-understood and shared model, so that people visiting the content management system will know where information should be stored. This is rarely the case in practice; the menu system tends to be designed as a fixed feature of the content management system application, or built by IT experts. I suspect that little consideration is given to what the end user wants to get out of the navigation. More often, information may logically belong in more than one location, and a visitor might expect to see the information in any of a number of places. For many content management systems this is not possible; a content item is tied to a single location (sometimes as a result of editorial decision rather than an understanding of what the end user is looking for). This is great in a well-designed system, but marooned and lost content is likely to be the result otherwise.

Search

Problems with finding content via search tools are distinct from problems with finding content through navigation or structural design.

Conventional search tools are built on full-text indexing. This involves a piece of software called a search engine which is able to load the plain text content from a collection of content items and build a database of all of the words and phrases contained in the content.  Search tools then allow the user to type in a word or phrase, and the search engine looks through the database for any occurrences. It will then show a set of links pointing to content items that contain the relevant words. Generally, a content item that has more occurrences of the word under search will receive a higher weighting (or relevance ranking) and will appear higher up in the results list.
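The indexing and ranking described above can be sketched in a few lines of Python. This is only a toy illustration, not how Solr or Elasticsearch are implemented; the three documents, the `tokenize`, `build_index` and `search` names, and the "count the occurrences" ranking are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical content items: id -> plain text.
documents = {
    "doc1": "Graphene is a form of carbon. Graphene conducts electricity.",
    "doc2": "Diamond and graphene are both forms of carbon.",
    "doc3": "Ladybirds eat aphids.",
}


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each word to a count of its occurrences in each document."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word][doc_id] += 1
    return index


def search(index, word):
    """Return document ids, most occurrences of the word first."""
    hits = index.get(word.lower(), Counter())
    return [doc_id for doc_id, _ in hits.most_common()]


index = build_index(documents)
print(search(index, "graphene"))  # ['doc1', 'doc2']
print(search(index, "ladybird"))  # []
```

The second search returns nothing because the index only contains the literal word "ladybirds"; the engine matches what was typed, not what was meant.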

Search tools that work like this have been extremely successful, to the extent that for many people searching for content means full-text searching (the word Google has become synonymous with searching!). Search engines such as Solr and Elasticsearch are very common in organisation websites and intranets.

However, searching using full-text indexing has some significant drawbacks.

  • Search engines don’t know what you’re looking for, only what you’ve typed. A string of text in a search box – maybe just one word – is not enough to indicate meaning or context. All the search engine can do is return that weighted list of occurrences of your search word where it finds the word across the body of content. More importantly, for the most part search engines are indexing unstructured content; usually web pages, where the markup (the html elements that wrap the content) is not structural and does not convey context or meaning.
  • A consequence of the above is that search engines don’t do very well with context. This is unsurprising given that they often have so little to go on. A search for “ladybird” in Google will retrieve results about the insect (also known as a ladybug), a type of web browser, an educational book publisher, a recent movie and even, if you are a Potterhead, the Patronus of Symposia Rawle. The search engine has no way of knowing which of these (if any) reflects the context in which I was searching. Presumably there is an algorithmic decision driving the choice of results; which leads me to wonder, given the millions of possible hits, who or what decided the collection of hits that I saw?
  • A search engine usually delivers too many results. A typical Google search may return millions of results (though I have noticed that at the time of writing, September 2024, Google searches no longer seem to proudly announce the number of hits at the top of page 1). The chances are that a user will not venture beyond the first couple of pages of results – perhaps only seeing the first 25 hits out of the vast number of results. Again, there must be a question about why these results were the chosen ones.

Classification

Classification or tagging of content tends to be an afterthought in content management. Understanding what a piece of content is about often relies on simple keyword tagging, a blunt instrument at best. It is rare to have a structured taxonomy available, but that’s not the main issue. Equally important is the fact that it is usually no-one’s job to ensure that content is properly classified. Content authors may see no obvious pay-back from diligent careful tagging of their content, so it tends either to be an afterthought or not to happen at all.

Content architecture and re-use

One of the biggest missed opportunities with content management systems is that of content re-use. Taking the example of educational publishing again, a collection of content representing a book may be broken down into smaller units such as chapters and sections. But at the lowest level of storage there is a lot of intellectual investment in the information stored for, say, a section dealing with a specific topic. That is the distilled knowledge of a subject matter expert. It is highly likely that this information will be applicable to other publications in similar subject areas, but only if:

  • The content is designed in such a way as to be usable elsewhere.
  • The content can be found (see above) and classified, and is thus identifiable as something that can be re-used.

Unfortunately, for a variety of reasons, content in organisations tends to be created to meet a current need; strategic approaches to content architecture for re-use are rare.

Information loss through interoperability and migration

What happens when an organisation moves to a new content management system? Decisions on products are often driven by (tactical) organisational needs that tend not to favour (strategic) content architecture considerations. Content management systems are on the whole built on proprietary platforms, so it is not easy to simply pick up content and plug it into a new product. In principle, interoperability standards like CMIS could provide a way to move content, but inevitably the requirements of proprietary designs override conformance to an abstract standard. In decades of working in this area, I have never encountered a seamless migration of content between content management systems.

If the migration of content itself is difficult, the migration of classification is nearly impossible. The structures used to store keyword lists, and the actual tagging facts recording the linking of content to tags, are rarely amenable to migration.

More often than not (and I know this from long experience) organisations make a pragmatic choice when moving to a new content management system, and end up simply cutting and pasting content from the old system to the new. In the process classification often goes by the board.

All is not lost, however

As you will see, a large component of the architectural approaches that I am proposing here is intended to address the issue of information loss during migration. I firmly believe that done correctly, information migration can work without such information loss.

The proprietary nature of content management systems is a fact of life and it is highly unlikely that there will be a move towards a common content management system standard. However, as I will set out across this set of articles, proprietary storage is fine provided that there are clear interfaces for information interchange. In fact, a common thread running through the articles is separation of concerns; by which I mean keep each kind of information (and only that kind of information) in the most appropriate system, and provide well-understood and reliable interfaces for other systems to retrieve the information.

Introducing the graph

A graph is a way of organising information that is fundamentally different from traditional systems based on relational databases. Where a database stores information in tables that are in turn arranged in rows and columns, a graph treats information items as individual objects.

Here is a very simple example of information for a collection of books in a relational database.

Id     Title                      AuthorFirstName  AuthorLastName
00012  Applications of graphene   Adam             Grover
00034  Diamond-like molecules     Paul             Sepp
00048  Fullerenes                 Jane             Buntine
00051  Experiments with graphite  Victoria         Retsov

Figure 1 Book information in a relational table

This represents a table called Books. The table has columns representing id value, title, and first and last names for an author. Each row represents a record for an individual book.
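For readers who like to see things concretely, the Books table can be sketched with Python's built-in sqlite3 module. The in-memory database and the choice to store the id as text (to keep its leading zeros) are assumptions for this example, not a statement about any particular content management system.

```python
import sqlite3

# An in-memory database, used here purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Books (
           Id TEXT PRIMARY KEY,
           Title TEXT,
           AuthorFirstName TEXT,
           AuthorLastName TEXT
       )"""
)
rows = [
    ("00012", "Applications of graphene", "Adam", "Grover"),
    ("00034", "Diamond-like molecules", "Paul", "Sepp"),
    ("00048", "Fullerenes", "Jane", "Buntine"),
    ("00051", "Experiments with graphite", "Victoria", "Retsov"),
]
conn.executemany("INSERT INTO Books VALUES (?, ?, ?, ?)", rows)

# Each row is one record for one book; the author names are plain columns.
titles = [
    title
    for (title,) in conn.execute(
        "SELECT Title FROM Books WHERE AuthorLastName = ?", ("Grover",)
    )
]
print(titles)  # ['Applications of graphene']
```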

Now let’s think about how to represent this information in the form of a very simple graph. A graph is made up of information objects, each of which has a type. In our very simple case, each object has a type of Book. Assigning a type to an object is a way of defining how you expect it to behave. The objects in a graph will also have associated metadata or properties; an identifier (of which, much more later) and associated data containing the title, first name and last name of the author. Schematically this could be represented as in this diagram.

Figure 2 A very simple graph for a collection of Book objects

The ovals represent the object in each case, and you will see that it has a type of Book (in the semantic world, types or classes of information are usually capitalised). The rectangular shapes represent metadata associated with the object, and for now these are listed with their static string values. The id needs a special mention; in a graph each object needs to have a unique identifier of some sort. For now, we can’t say much about this id – it could be a number, or a string. Later on, we’ll cover exactly what kind of identifier we need to use.

This is a very simple starting point for a graph. The diagram also doesn’t say anything about how these objects are stored. Things are going to become much more sophisticated soon.

I’m going to start with an overview of the rules of graphs.

A graph contains objects. The objects are characterised by their conformance to a design called a model. In an object model each object has metadata, which on the whole is static text information (like the Firstname and Lastname properties in the diagram above). It may also have links (or relations) to other objects; this is the (very important) next stage in the development of our graph. Generally speaking, when designing a graph, it is important to look at metadata carefully to assess whether in fact it might be treated as an object in its own right. To illustrate this, let’s think a little more about the authors of these books. Rather than storing these as pairs of static text metadata, wouldn’t it make sense to build an Author type into our graph? This might look like the diagram below.

Figure 3 Author object with simple metadata

Now there is an Author object which has Firstname and Lastname properties, and its own id. This might not look like progress, but it is for a couple of reasons.

  • We now have a thing that represents a real-life person who happens to be an author, that can have its own metadata and links to other objects.
  • We have more flexibility in linking this author object to other objects. If Adam Grover has authored other books, we can express this fact unambiguously. We’ll get to that shortly.

We can add other metadata to the Author object. In Figure 4 below, the Author object now has a salutation, an email address and an organisational affiliation. You can elaborate this object further with metadata properties such as date of birth, job title and so on. It would also be possible to improve the model by creating an object representing the organisational affiliation. I’m not going to do that here, but doing so would be fairly straightforward. For me, the rule of thumb is whether there is additional information value available with the metadata that would make it worthwhile to split it out as a separate object.

Figure 4 Author object with richer metadata

Now let’s look again at the Book graph, replacing the author static metadata with a link to the relevant Author object. This is how it might look.

Figure 5 Elaborated Book and Author graph

This improved graph deserves some analysis. We now have linked objects; each Book has a link to the corresponding Author object. Each object has an id plus the appropriate metadata. A Book no longer stores the author’s first and last names directly, but we can still reach them by following the link to the Author.

Notice too that each object now has only the properties that it needs to have. A Book object holds its own metadata and a link to the corresponding Author object; the author’s names live on the Author object alone.
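A minimal sketch of this linked arrangement, using Python dataclasses, might look like the following. The class and field names are my own choices for illustration, and the ids are the simple placeholder values used elsewhere in this article; a real graph would not be stored in Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    id: str
    first_name: str
    last_name: str


@dataclass
class Book:
    id: str
    title: str
    # Links to Author objects by id, rather than copies of their names.
    author_ids: list = field(default_factory=list)


authors = {"0123": Author("0123", "Adam", "Grover")}
books = {"0012": Book("0012", "Applications of graphene", ["0123"])}


def author_names(book_id):
    """Follow the links from a Book to its Authors and return their names."""
    book = books[book_id]
    return [
        f"{authors[a].first_name} {authors[a].last_name}"
        for a in book.author_ids
    ]


print(author_names("0012"))  # ['Adam Grover']
```

The Book carries no name data at all; the names are reached indirectly through the link.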

We could, by the way, achieve something similar in the relational world if we had a Book table, an Author table and a linking table to connect them. The purpose of these articles is not to devalue relational approaches, but to show the benefits that can accrue from an alternative approach for many types of information.
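As a rough sketch of that relational equivalent, here are the three tables in SQLite via Python's sqlite3 module. The table and column names are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Book (Id TEXT PRIMARY KEY, Title TEXT);
    CREATE TABLE Author (Id TEXT PRIMARY KEY, FirstName TEXT, LastName TEXT);
    -- The linking table records which Author belongs to which Book.
    CREATE TABLE BookAuthor (
        BookId TEXT REFERENCES Book(Id),
        AuthorId TEXT REFERENCES Author(Id),
        PRIMARY KEY (BookId, AuthorId)
    );
    INSERT INTO Book VALUES ('0012', 'Applications of graphene');
    INSERT INTO Author VALUES ('0123', 'Adam', 'Grover');
    INSERT INTO BookAuthor VALUES ('0012', '0123');
    """
)

query = """
    SELECT Book.Title, Author.FirstName, Author.LastName
    FROM Book
    JOIN BookAuthor ON BookAuthor.BookId = Book.Id
    JOIN Author ON Author.Id = BookAuthor.AuthorId
"""
result = conn.execute(query).fetchall()
print(result)  # [('Applications of graphene', 'Adam', 'Grover')]
```

The link exists, but the linking table says nothing about what kind of link it is; that meaning lives only in the table name and in the heads of the people who designed it.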

The graph that we’ve created so far is a long way from complete. We have a model that contains two different types of object (three if you’ve been following along and took the bait to create an organisation type). We can see that Book(s) have Author(s). But there is an additional layer of meaning that is currently missing. We can add semantics to this model by describing the nature of the links.

  • A Book has one or more Authors.
  • A Book has a title (which is of type string).
  • An Author may have written one or more Books.
  • An Author has one or more first names.
  • An Author has one last name (I can’t be certain, but I don’t know of any cases where an author has more than one last name).
  • An Author has one or more email addresses.
  • An Author has one or more organisational affiliations (a string property for now, though it could be a link to another information type).
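Rules of this kind can be checked mechanically. The sketch below expresses the Author rules as minimum and maximum counts and tests a set of property values against them. The rule table, the `violations` function and the email address are hypothetical, invented for this example; real graph tooling has its own ways of stating such constraints.

```python
# Cardinality rules: property -> (minimum, maximum); None means no upper limit.
AUTHOR_RULES = {
    "firstName": (1, None),
    "lastName": (1, 1),
    "email": (1, None),
    "affiliation": (1, None),
}


def violations(properties, rules):
    """Return a list of human-readable problems; empty if all rules hold."""
    problems = []
    for name, (minimum, maximum) in rules.items():
        count = len(properties.get(name, []))
        if count < minimum:
            problems.append(f"{name}: expected at least {minimum}, found {count}")
        elif maximum is not None and count > maximum:
            problems.append(f"{name}: expected at most {maximum}, found {count}")
    return problems


author = {
    "firstName": ["Adam"],
    "lastName": ["Grover"],
    "email": ["adam.grover@example.org"],  # hypothetical address
    "affiliation": ["Bogus University"],
}
print(violations(author, AUTHOR_RULES))  # []

# Two last names, and no email or affiliation: three rules are broken.
incomplete = {"firstName": ["Adam"], "lastName": ["Grover", "Smith"]}
for problem in violations(incomplete, AUTHOR_RULES):
    print(problem)
```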

Taking one Book-Author pair, we can show how these semantics look. In Figure 6, there is a single Book and its Author. The colour coding indicates the type of the information; yellow indicates integer data, green shows string data, while the orange and grey gradients indicate objects. The lines linking objects to objects and objects to data have arrows to indicate the direction of behaviour. There are two links directly joining the Book and Author objects; a hasAuthor relation from the Book to the Author, and an inverse isAuthorOf relation pointing from the Author to the Book.

Figure 6 A Book and an Author with semantic relations

At this stage of the model development, we have a very interesting new feature, one which is fundamental to building and using graphs. We can write the relations between objects and metadata in a new form.

For example, the Author object has a firstName property of Adam; we can write this as

Author (with id 0123) has firstName Adam

We can describe other properties in a similar way:

Author (with id 0123) has affiliation Bogus University
Book (with id 0012) hasAuthor Author (with id 0123)
Author (with id 0123) isAuthorOf Book (with id 0012)
Book (with id 0012) has title Applications of graphene

This new form is called a semantic triple. It has three parts; a subject (an object in the graph), a predicate (which is a little like a verb in written English) and an object (either another object in the graph or a metadata value). In the list above I have colour-coded these three parts to differentiate them.

The word semantic is used to denote the fact that each of the subject, predicate and object carries meaning derived from its conformance to the rules of a model. When we use the predicate hasAuthor, we are not just saying that this Book and this Author are linked; we’re saying how they’re linked. The relation might include rules such as “A Book must have at least one Author”, and “An Author may have written many Books”. A very important feature of triples is that the object of one triple may also be the subject of another. It’s not difficult to see how this gives rise to networks of linked information.

Triples are fundamental because they are the building blocks for graphs. Every graph is made up of many triples, each describing a key semantic relation between things and other things, or things and data.
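To make the shape of a triple concrete, here is a small Python sketch that writes the statements listed earlier as (subject, predicate, object) tuples. The `Triple` type and the short "Book/0012" style of identifier are assumptions for illustration only; proper identifiers are discussed later in the article.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


triples = [
    Triple("Author/0123", "firstName", "Adam"),
    Triple("Author/0123", "affiliation", "Bogus University"),
    Triple("Book/0012", "hasAuthor", "Author/0123"),
    Triple("Author/0123", "isAuthorOf", "Book/0012"),
    Triple("Book/0012", "title", "Applications of graphene"),
]

# The object of one triple may be the subject of another; those are the
# triples that join things to things rather than things to data.
subjects = {t.subject for t in triples}
links = [t for t in triples if t.object in subjects]
for t in links:
    print(t.subject, t.predicate, t.object)
```

Only the hasAuthor and isAuthorOf triples are printed, because only their objects appear elsewhere as subjects.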

Triples are not just used to build a graph; they are also the key to exploring, searching and navigating a graph. Once you have a collection of triples in a suitable repository such as a graph database, you are able to navigate through the information. We know that the author of Applications of graphene is Adam Grover because of the hasAuthor relation linking the two objects that we have in the graph. But we can also ask questions like:

  • What other books has Adam Grover written?
  • Who else was an author of Applications of graphene?
  • What other information do we have about Adam Grover?
  • If the Book also has a hasPublisher link to another object identifying the publisher of the book, what other books are published by that publisher?
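Questions like these amount to pattern matching over triples. The sketch below shows the idea in plain Python, with `None` acting as a wildcard. It is not a graph database query language; the `match` function, the second book and the second author are all hypothetical additions made so that the questions have something to find.

```python
triples = {
    ("Book/0012", "title", "Applications of graphene"),
    ("Book/0012", "hasAuthor", "Author/0123"),
    ("Book/0012", "hasAuthor", "Author/0456"),  # hypothetical co-author
    ("Book/0099", "title", "A hypothetical second book"),
    ("Book/0099", "hasAuthor", "Author/0123"),
    ("Author/0123", "firstName", "Adam"),
    ("Author/0123", "lastName", "Grover"),
}


def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None matches anything."""
    return sorted(
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# What other books has Adam Grover written?
other_books = [
    s for s, _, _ in match(p="hasAuthor", o="Author/0123") if s != "Book/0012"
]
# Who else was an author of Applications of graphene?
co_authors = [
    o for _, _, o in match(s="Book/0012", p="hasAuthor") if o != "Author/0123"
]
# What other information do we have about Adam Grover?
about_adam = match(s="Author/0123")

print(other_books)  # ['Book/0099']
print(co_authors)   # ['Author/0456']
print(about_adam)
```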

You can probably see how we can accumulate, and then explore, a large amount of useful information linking together a variety of business objects; Books, Authors, Organisations, Projects and Publishers (obviously I’ve only described a couple of these in any detail). This flexibility is one of the biggest benefits of the graph approach to information management.

One key aspect remains unexplained; the way in which objects are identified. In the example above I’ve used very simple id values that are probably integers. When creating new objects in a graph it is necessary to mint a new unique id for every single object. By convention, graphs use Uniform Resource Identifiers (or URIs) for this purpose. These are important because they provide both location and identification information, and should be globally unique. A URI looks quite similar to a URL on a website and will appear something like this:

https://[domain]/[informationspace]/[identifier]

The generic URI structure breaks down like this:

  • Domain. This refers to the knowledge domain in which the graph will be stored. For example, content.tellurasemantics.com and vocabulary.tellurasemantics.com represent knowledge domains describing content items and vocabulary (taxonomy) concepts.
  • Information space. This is used to identify a particular application within the knowledge domain. This may be something like Publication or ContentManagement.
  • Identifier. This is the unique identifier for the object itself. This only needs to be unique within the knowledge domain and information space, but many practitioners adopt the Universally Unique Identifier (UUID). This is conventionally written as 32 hexadecimal digits split by hyphens into groups of 8, 4, 4, 4, and 12 characters; for example c3f6090a-1c0c-458b-adb5-f0c0f6d5d384 (so the whole id will be 36 characters long).
    A well-crafted UUID will, for all practical purposes, be globally unique (not just within the knowledge domain or information space). This depends on the method used to generate it; the commonly used version 4 UUIDs are built from random numbers, while versions 3 and 5 are derived from a message digest of a namespace and a name.

Putting these together, a Book object might have this URI:

https://content.tellurasemantics.com/Content-graph/Content-item/c3f6090a-1c0c-458b-adb5-f0c0f6d5d384
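Minting a URI of this shape is straightforward with Python's standard uuid module. The `mint_uri` function is a hypothetical helper written for this example; note that the example URI has two path segments (Content-graph and Content-item) between the domain and the identifier, so I pass both as the information space.

```python
import uuid


def mint_uri(domain, information_space, identifier=None):
    """Build a URI, generating a random (version 4) UUID if none is given."""
    identifier = identifier or str(uuid.uuid4())
    return f"https://{domain}/{information_space}/{identifier}"


# Reproduce the example URI from a known identifier.
known = mint_uri(
    "content.tellurasemantics.com",
    "Content-graph/Content-item",
    "c3f6090a-1c0c-458b-adb5-f0c0f6d5d384",
)
print(known)

# Mint a brand-new URI; the identifier differs on every run.
fresh = mint_uri("content.tellurasemantics.com", "Content-graph/Content-item")
new_id = fresh.rsplit("/", 1)[1]
print(len(new_id))  # 36
```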

Content and tagging

Earlier I mentioned that a common information problem encountered in organisations concerns tagging or classification. In most cases, tagging of content is very simple, involving linking an item of content to one or more keywords. These links are usually managed by a relational database and are simply a way of expressing a connection between the content and the keyword.

This kind of simple keyword-based tagging has a few problems.

  • Keyword lists are not usually controlled vocabularies. Because of this it is common to see a proliferation of similar keywords appearing in a keyword list; plurals along with singular forms, -ing and -ed forms, misspellings and so on. This is a problem because when a content author comes to tag a piece of content it may not be clear which of the keywords she should use. This in turn can lead to a situation in which content items that should be tagged using the same keywords are instead tagged differently, making information discovery imprecise.
  • Keyword lists are not portable. As described above, when a content migration happens, it is hard enough to move the content itself with fidelity; migrating the tags as well is a long way down the priority list.
  • Keywords are not rich information; they carry no semantics. A keyword is just a word, and it can be difficult to discern what the word means. Some more sophisticated keyword systems in content management systems allow description fields, which helps, but these are the exception.
  • Keywords have no synonyms or contextual information.

One way of improving the value of content tagging is to use a well-designed taxonomy. A taxonomy in this context is a collection of objects that can be used to express the aboutness of a content object. A single concept in such a taxonomy would be much more than a keyword:

  • It will have a human-readable text label (essentially, the only thing it has in common with a keyword). In taxonomy terms I’m going to use the phrase preferred label (or prefLabel) for this.
  • It may have other metadata, like synonyms (alternative labels or altLabels), definitions, examples, scope notes or other explanatory information.
  • It may have relations to other taxonomy concepts, related either in a hierarchy to parent (broader) or child (narrower) concepts, or to similar (related) concepts elsewhere in the taxonomy.
  • Crucially, it will have a globally unique identifier (probably a Uniform Resource Identifier or URI) that unambiguously locates and identifies the concept. The URI is designed to be machine-readable, and is independent of any human-readable text labels.

It turns out that a concept in a taxonomy of this kind is an information object, just like the other objects that I’ve described above, and the taxonomy is a collection of concepts that conform to an information model. In other words, we can use a taxonomy concept as a component of a graph along with content objects and author objects and so on. And this combination of content objects, taxonomy concepts and other information objects in a graph is what I am calling a content graph.
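To show how much more than a keyword a concept can be, here is a sketch of a concept as a Python dataclass. The field names loosely follow the labels and relations listed above; the base URI, the alternative label, the definition and the hierarchy between the two concepts are all assumptions invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    uri: str                     # machine-readable identifier
    pref_label: str              # the human-readable preferred label
    alt_labels: list = field(default_factory=list)   # synonyms
    definition: str = ""
    broader: list = field(default_factory=list)      # URIs of parent concepts
    narrower: list = field(default_factory=list)     # URIs of child concepts
    related: list = field(default_factory=list)      # URIs of similar concepts


BASE = "https://vocabulary.example.org/subjects/"  # hypothetical base URI

carbon = Concept(
    uri=BASE + "carbon-macromolecules",
    pref_label="Carbon macromolecules",
    narrower=[BASE + "graphene"],
)
graphene = Concept(
    uri=BASE + "graphene",
    pref_label="Graphene",
    alt_labels=["Graphene sheet"],
    definition="A single layer of carbon atoms.",
    broader=[carbon.uri],
)
concepts = {c.uri: c for c in (carbon, graphene)}


def find_by_label(label):
    """Return URIs of concepts whose preferred or alternative label matches."""
    wanted = label.lower()
    return [
        c.uri
        for c in concepts.values()
        if wanted in [c.pref_label.lower(), *(a.lower() for a in c.alt_labels)]
    ]


print(find_by_label("graphene sheet"))
```

A search on a synonym lands on the same URI as a search on the preferred label, which is exactly what a bare keyword cannot offer.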

Quick recap

So far, I’ve described some basic features of a graph. To recap, a graph will contain information objects that conform to one or more specified models, it may have a variety of metadata of different specified formats, and it may have one or more relations to other objects, the semantics of which also conform to the information model.

If the graph contains content objects and taxonomy concepts (and possibly other information objects), all conforming to their information models, then we can think of this graph as a content graph.

Introducing (finally) the content graph

We can now develop our picture of a content graph.

A content graph is a map of content objects and other related objects. Think of each content object as a structured collection of data including, at a minimum, a way of identifying it unambiguously with respect to all of the other content objects and a data payload.

So far this is a fairly strict description of a content graph; the only objects in it are Book objects and Author objects.

Let’s add in some tagging. Our information model needs to develop a little, to allow for the incorporation of taxonomy concept objects. As I mentioned a little earlier, taxonomy concepts conform to an information model, have unique IDs, metadata and relations to other concepts. Since we are now in the object world, I’ll call these Concepts here. As I will cover later, I have chosen to have Concepts conform to the Simple Knowledge Organisation System (usually just abbreviated to SKOS). SKOS is a good choice for describing taxonomy information, because it is based on the Resource Description Framework (RDF), and thus has all of the graph-centred features we need:

  • Unique IDs.
  • Metadata in the form of preferred labels, alternative labels, definitions and so on.
  • Relations to other concepts including broader, narrower and related links.
  • Extensibility to import objects from other models (we will increasingly use the word ontology here for information models).

The model can include Concepts easily enough, just as it included Authors when we discussed those. The key thing that we will need to include is the relations that will link Books and Concepts. These are not included in the SKOS ontology, or in the Book model, so we need to add them.

Which raises the question; what relations? How exactly should Books and Concepts be linked? The decisions about this part of the model are at the heart of graph design; objects are linked together using semantic relations. That is, thinking back to the idea that an object in a graph can represent a real-life object, what is the nature of the link between two objects? We looked at a simple case for Book and Author objects, and used hasAuthor and isAuthorOf as inverse relations between them.

In the case of taxonomy Concepts, we’re looking for something that expresses the aboutness of the content. The simplest way to semantically link them to a Book is to have a relation framed around the purpose of the taxonomy. Let’s assume that it’s a taxonomy of subjects, such as Science, Mathematics, Information Technology and so on. Each Concept in the taxonomy is thus a subject. We could therefore have a simple relation such as hasSubject (with an inverse, if required, of isSubjectOf). We could, even more simply, have a symmetrical relation of matches. This however doesn’t convey much in the way of semantics, so it’s probably better to express the relations as meaningful things. We’ll stick with hasSubject / isSubjectOf for now. Let’s see how it might look when linked to a Book object.

 

Figure 7 Book and Concept objects

Notice that the objects now have Uniform Resource Identifiers (URIs). The URI for the Book has a base of content.tellurasemantics.com, uses a model called Content-graph and a class or type name of Content-graph-item. As it is instance data (that is, a specific example of a Book), it has a UUID value (c655ceba-5568-46ee-a201-bdf2614f8341) too. Each of the classes in the model will represent a type of content object, so Persons, Projects (if we get to them) and so on will have their own names, object and data properties. I will use the object name Person from now on, because the hasAuthor relation should really refer to the more generic object name of Person rather than Author. This makes sense because a Person is more than simply an author of a book and may have other relations to other types of object.

It's worth looking at the Concept too. It may seem odd, at first glance, that it has a URI based on vocabulary.tellurasemantics.com rather than content.tellurasemantics.com.

The reason is that the taxonomy of Concepts is built using a different model from the Content Graph Model; it is a collection of Concept objects that may be applicable to a variety of different information models. This is actually a very powerful aspect of semantic information models; there is nothing to prevent one information model linking to another. You can imagine how this enhances the value of a graph of information objects; having objects from more than one information model linked to the same taxonomy Concept indicates wider relations between those different information model objects.

A Book may have more than one subject, so the object may have more than one link to different Concepts. A single Concept may be used as a tag for more than one Book. Since an object in a semantic triple may also be a subject in another triple, it is clear how the network grows.
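
The many-to-many linking described above can be sketched with plain Python tuples standing in for triples. The book labels are hypothetical, the subject names are borrowed from the example taxonomy, and the helper add_inverses is my own; a real graph store would hold URIs rather than labels.

```python
# Each triple is a (subject, predicate, object) tuple.
triples = {
    ("Book A", "hasSubject", "Science"),
    ("Book A", "hasSubject", "Mathematics"),  # one Book, several Concepts
    ("Book B", "hasSubject", "Mathematics"),  # one Concept, several Books
}

# Inverse relation pairs used in this article.
INVERSES = {"hasSubject": "isSubjectOf", "hasAuthor": "isAuthorOf"}


def add_inverses(graph):
    """Return the graph plus the inverse of every triple that has one."""
    inverse_triples = {
        (obj, INVERSES[pred], subj)
        for subj, pred, obj in graph
        if pred in INVERSES
    }
    return graph | inverse_triples


full_graph = add_inverses(triples)

# "Mathematics" sits in the object position of the hasSubject triples and
# in the subject position of the isSubjectOf triples.
books_about_maths = sorted(
    obj
    for subj, pred, obj in full_graph
    if subj == "Mathematics" and pred == "isSubjectOf"
)
print(books_about_maths)
```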

I’m going to finish up this initial article by showing a small part of a content graph that illustrates the growing linked information network. I have simplified the display slightly for clarity; I am using icons to denote the different classes in the graph, and I have left out some of the metadata. The metadata is still implied; it is just not spelled out here.

Figure 8 Simplified content graph; books, authors and concepts

It is interesting how much valuable information can be found even in a simple graph like this. The graph shows three content objects, which I’m simplifying to just their titles: Applications of graphene, Fullerenes and Diamond-like molecules (these, like all of the other items I’m describing, are fictional). We also have three people: Adam Grover, Paul Sepp and Jane Buntine. Finally, we’ve made use of four concepts from our taxonomy: Graphene, Diamond, Fullerene and Carbon macromolecules. Let’s look at some simple things we know from this graph.

  • Applications of graphene has two authors; Adam Grover and Paul Sepp. We might infer that in the section of the information model dealing with Person objects, if we have a symmetric relation called knows, then Adam Grover and Paul Sepp should have this relation. We will build the Person class in more detail in a later article.
  • Paul Sepp is also the author of Diamond-like molecules. We might infer, since scientists tend to publish in their specialisms, that Applications of graphene and Diamond-like molecules may be related books.
  • Applications of graphene is about Graphene, and also about Carbon macromolecules.
  • Applications of graphene and Fullerenes are both tagged with Carbon macromolecules. We might infer that they are in related knowledge domains. This may seem obvious to a human reader, but it is important to be clear and explicit about object relations in a graph where we can.
  • There are some rules governing the graph; each content object must have at least one author, a content object may link to zero or more concepts, but person objects do not have links to concepts (at least, not yet). Later we will cover rules that specify the format of metadata, such as string, integer, date and boolean.
  • We can ask questions such as “What other books have been authored by Adam Grover?”
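
The observations in the list above can be reproduced with a few lines of Python. This is a sketch only: it contains just the triples stated in the bullet points rather than the whole of Figure 8, and the function names are my own.

```python
from itertools import combinations

triples = [
    ("Applications of graphene", "hasAuthor", "Adam Grover"),
    ("Applications of graphene", "hasAuthor", "Paul Sepp"),
    ("Diamond-like molecules", "hasAuthor", "Paul Sepp"),
    ("Applications of graphene", "hasSubject", "Graphene"),
    ("Applications of graphene", "hasSubject", "Carbon macromolecules"),
    ("Fullerenes", "hasSubject", "Carbon macromolecules"),
]


def subjects_sharing_object(graph, predicate):
    """Map each object to the subjects linked to it by the predicate,
    keeping only objects shared by two or more subjects."""
    by_object = {}
    for subj, pred, obj in graph:
        if pred == predicate:
            by_object.setdefault(obj, set()).add(subj)
    return {o: sorted(s) for o, s in by_object.items() if len(s) > 1}


def co_authors(graph):
    """Infer candidate 'knows' pairs: people who share authorship of a book."""
    by_book = {}
    for subj, pred, obj in graph:
        if pred == "hasAuthor":
            by_book.setdefault(subj, set()).add(obj)
    pairs = set()
    for authors in by_book.values():
        pairs.update(combinations(sorted(authors), 2))
    return pairs


related_by_subject = subjects_sharing_object(triples, "hasSubject")
related_by_author = subjects_sharing_object(triples, "hasAuthor")
print(related_by_subject)
print(related_by_author)
print(co_authors(triples))
```

The same small function finds books related through a shared Concept and books related through a shared author; only the predicate changes.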

Let’s briefly unpack that last bullet point. Isn’t it obvious that there is only one book with this author? It is obvious in this limited section of the graph, but a graph may have thousands or millions of objects and billions of links. The core design of a graph helps us out here. Although the graph above is shown as a network or map of objects, under the covers it is actually made up of a set of individual statements called triples (I introduced triples earlier in this article). Recall that a triple connects two objects together using a semantic relation called a predicate. The triple in this case can be written like this:

Applications of graphene hasAuthor Adam Grover

In this triple, the subject is Applications of graphene, the predicate is hasAuthor and the object is Adam Grover.

(Actually, the triple links together the URIs of the content object and author objects, and the predicate is also a URI, but let’s stick with human-readable forms for now).

If this triple is managed in a graph database, then we can interrogate the graph using a query that translates into plain English as “Show every content object that hasAuthor Adam Grover” (it’s not grammatically good English, I know). This query will look at the hasAuthor predicate across the entire graph and return the triple subject wherever this author appears in the object position. Two of the three parts of the triple (the predicate and the object) are known, and we want to get hold of the matching subject, which is the content object.
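
In a graph database this question would be written in a query language such as SPARQL. To show the idea without tying it to any particular product, here is a minimal Python sketch in which None plays the part of the unknown. The match function and the tiny data set are illustrative assumptions, not part of any real graph store.

```python
triples = [
    ("Applications of graphene", "hasAuthor", "Adam Grover"),
    ("Applications of graphene", "hasAuthor", "Paul Sepp"),
    ("Diamond-like molecules", "hasAuthor", "Paul Sepp"),
]


def match(graph, subject=None, predicate=None, obj=None):
    """Return every triple that fits the pattern; None matches anything."""
    return [
        (s, p, o)
        for s, p, o in graph
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# "Show every content object that hasAuthor Adam Grover":
# the predicate and object are known, the subject is what we want back.
books_by_adam = [
    s for s, _, _ in match(triples, predicate="hasAuthor", obj="Adam Grover")
]
print(books_by_adam)

# The same function answers the question from the other direction:
# the subject and predicate are known, the object is what we want back.
authors_of_graphene_book = [
    o
    for _, _, o in match(
        triples, subject="Applications of graphene", predicate="hasAuthor"
    )
]
print(authors_of_graphene_book)
```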

This kind of interrogation also works with metadata. Suppose we want to know which people have an affiliation to Bogus University. We can do a similar query to the one above, but this time it is asking for data rather than another object:

“Show me all people with affiliation = ‘Bogus University’”

The subject is what we are after here, since we have the predicate (affiliation) and the object (Bogus University). This illustrates that a triple doesn’t only connect objects to other objects; it may also, as in this case, connect an object to a piece of data. There is one nuance of triples that's worth bearing in mind here. I mentioned earlier that a thing in the object position of one triple may also be in the subject position of another triple. However, this is only true when that thing is itself an object; if the thing in the object position of one triple is a piece of data, it cannot be in the subject position of another triple. Essentially, once a network of triples ends up at a piece of data, that's the end of the road.
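
One way to picture the difference between an object and a piece of data is to give data values their own type. In this Python sketch the Literal class, the affiliation values and the helper functions are all hypothetical and invented for illustration; RDF tooling draws the same line between resources and literals in its own way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A piece of data (a string here) rather than a graph object."""

    value: str


triples = [
    ("Applications of graphene", "hasAuthor", "Paul Sepp"),
    # Hypothetical affiliations, invented for this sketch.
    ("Paul Sepp", "affiliation", Literal("Bogus University")),
    ("Jane Buntine", "affiliation", Literal("Bogus University")),
    ("Adam Grover", "affiliation", Literal("Another Institute")),
]


def people_with_affiliation(graph, name):
    """Predicate and data value are known; return the matching subjects."""
    return sorted(
        s for s, p, o in graph if p == "affiliation" and o == Literal(name)
    )


def can_be_subject(node):
    """Only objects can start a further triple; pieces of data cannot."""
    return not isinstance(node, Literal)


bogus_people = people_with_affiliation(triples, "Bogus University")
print(bogus_people)

# The first triple ends at an object (a Person), the rest end at data.
walkable = [can_be_subject(o) for _, _, o in triples]
print(walkable)
```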

In summary, one of the most useful features of graph information is that everything reduces to triples, and you can aggregate triple data through any of the three components: subject, predicate and object. This is a very powerful way to explore and find information.
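
As a small illustration of aggregating by component, the Python sketch below groups a handful of triples by subject, predicate or object. The group_by helper is my own naming and the data is limited to triples already stated in this article.

```python
from collections import defaultdict

triples = [
    ("Applications of graphene", "hasAuthor", "Adam Grover"),
    ("Applications of graphene", "hasAuthor", "Paul Sepp"),
    ("Diamond-like molecules", "hasAuthor", "Paul Sepp"),
    ("Applications of graphene", "hasSubject", "Graphene"),
    ("Applications of graphene", "hasSubject", "Carbon macromolecules"),
    ("Fullerenes", "hasSubject", "Carbon macromolecules"),
]

SUBJECT, PREDICATE, OBJECT = 0, 1, 2


def group_by(graph, position):
    """Group triples by the value found at the given position."""
    groups = defaultdict(list)
    for triple in graph:
        groups[triple[position]].append(triple)
    return dict(groups)


# How many statements use each predicate?
predicate_counts = {
    pred: len(items) for pred, items in group_by(triples, PREDICATE).items()
}
print(predicate_counts)

# How many statements point at each object?
object_counts = {
    obj: len(items) for obj, items in group_by(triples, OBJECT).items()
}
print(object_counts["Paul Sepp"], object_counts["Carbon macromolecules"])
```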

Principles of content graph design

Finally in this article, I am introducing a set of guiding principles that I believe are key to successful content graph design. I’ve already mentioned some of these in passing, and I will return to them several times over the coming articles.

  • Principle 1. Build a semantic information model. This model should define the objects that need to be tracked, the relations between different objects and object metadata properties for each object. Use the Resource Description Framework (RDF) as the underlying structure of the model, and ensure that all objects have Uniform Resource Identifiers (URIs).
  • Principle 2. Build a content architecture that describes content objects from macro- to micro- scale and ensures separation of concerns. Each component in a system should contain only the information it is designed to contain.
  • Principle 3. Use Application Programming Interfaces (APIs) to ensure that the components of the content graph are able to communicate with each other. The biggest problems in complex multi-component systems occur at the interfaces between components.
  • Principle 4. Build a classification scheme – a taxonomy – designed by human experts and built on the Simple Knowledge Organization System (SKOS). Use this scheme to classify content manually or by assistive tagging confirmed by human insight. Do not rely on auto-tagging or AI (text analysis and concept extraction are fine, as long as humans get the final say).
  • Principle 5. Store the model and instance information for the content graph in a graph database. This will ensure that you can take advantage of all of the features of RDF in exploring the content graph.

End of Part 1

This has been an introduction to the topic of content graphs. I’ve presented the principles of content graph design. In the next article in this series, I’m going to look in detail at the content component. Later parts will deal with the taxonomy component, the graph store and the middleware that ties them all together.

Acknowledgements

I would like to thank Cornell University for making the arXiv library (https://arxiv.org) available as a free distribution service and an open-access archive for nearly 2.4 million scholarly articles.

I am grateful to Semantic Web Company for providing me with a cloud instance of PoolParty and excellent support. I am also grateful to Ontotext for creating a freeware version of GraphDB which has been very useful in building the content graph.
