Desktop content graph part 1. A content model for file linking and classification

The main topic for this article is how files in the file system need to behave when incorporated into a content model. The main purpose of the content model (or ontology) is to describe all of the class structure, object properties and data properties for the things that the model contains. This will be a very simple content model with only a few classes, relations and other metadata to have to handle.

Given this content model it should be fairly clear what will be needed to build the same model in my Desktop content graph management system.

This is article 1 in the Desktop content graph series. You might want to read article 0 (Setting the scene) too.

Part 0 Setting the scene
Part 1 The model for the Desktop content graph (this article)
Part 2 Tags and taxonomy concepts
Part 3 Building the initial Desktop content graph system

The starting point is not ideal

One of the biggest difficulties in this project is that I have to build the content graph outside the operating system and any content-handling application. I can’t (and wouldn’t want to) subvert the way that MacOS and (say) Microsoft Word work or the interfaces that they expose. Mind you, wouldn’t it be nice if the OS managed file system objects as objects, so that there was better access to object and data properties? Wouldn’t it be nice if Word had an object model and an open RESTful API that allowed external services to interact with documents in a structured fashion?

We are where we are, and as a consequence building a content graph requires building a semantic layer over the file system objects, and implementing the graph design in that layer. In the process, it's also important to ensure that I keep the connections from the abstract model layer down to the real files. It will be an interesting journey!

Modelling the real world

Where do you start in building a content model? For me, the starting point is to think about the real world things whose behaviour needs to be modelled. Then describe them in plain English, considering ideas such as:

Is this thing a more specific example of another thing (in information terms, a sub-class)?
Is it a more general example of another thing (a superclass)?
Is it similar, or related, to another thing?
If so, describe the relationship as meaningfully as possible (these are object properties in the ontology)
How would the thing be described in terms of its features; name, synonyms, relevant dates and times, and so on (these are data properties in ontology terms, or more generally, metadata)?

A model described in this way is occasionally called a digital twin; that is, an information structure that represents (as far as possible) the behaviour of the real world thing.

In this case we are talking about things in the file system; files and folders. And actually I’m only going to deal with file objects; a folder (sometimes called a directory) is really just a virtual container, an organisation system for files, so let’s stick with the truly definable objects.

Aside: why aren't file systems a bit more object-oriented?

I'm always a little surprised that folders don't work well as file containers. As a simple example, it's actually quite difficult to have a file that belongs to multiple folders; if you did have that feature, it would make contextual navigation in the file system much more useful, because you could find your way to the right content by whatever routes are relevant.

In the kind of design that I'm considering here, it could be quite valuable to have a family of file system objects, like FileSystem, Folder and File. Each class would then have its own relationships to other classes, and its own metadata. A folder (an object of the Folder class) could have contains relationships to one or more files (objects of the File class), each could have taxonomic relationships to concepts in an information taxonomy, and metadata like preferred labels (filenames), synonyms and other data properties that could enrich their value.

There have been attempts at object-oriented operating systems; notably NeXTStep and Taligent/Pink, but all fell by the wayside. Elements of the NeXTStep operating system have found their way into the modern MacOS, but not, alas, object-oriented features. Missed opportunity if you ask me.

Modelling files

Given the questions above, how should I characterise a file object?

Is a file a more general or more specific example of another thing? No, for these purposes a file is the only object we need to deal with. Even where a software program allows for structured objects (e.g. Word allows composite documents with a master and child documents) these have to be handled within the software. The operating system and the file system don’t keep track of those relationships, and files on the whole stand alone. When you move a file from one location in the file system to another in the same file system, most of the properties of the file (file name, size, creation and modification dates, inode; basically, the ones reported when you use the stat command) don’t change. One exception is the nativePath property; since that describes the (virtual) folder structure leading from the file up to the top of the file system, this property will change when the file moves, even within the file system.

Is it similar to another thing? Absolutely; that’s the core of what I’m doing here. We all know that there are files that are similar to other files. The current Mac OS however does not provide ways to objectively define relationships between files. Since we don’t have those tools, we have to think of how to model the relationships that we want to build, then build those tools and do so outside of the files themselves or any current software tools that manage files.

This where I'm going to implement things in stages. From a semantics point of view, it's not a big step to say that something is related to something else (I would articulate that relationship as isRelatedTo). It would be much better if we could define a real relationship that says not only this thing is related to that thing, but how they are related. But my design is not (yet; we'll get there) developed enough for that, so I'm starting with isRelatedTo.

You will see later on that there are (at least) two ways forward. One way will be to elaborate the model with richer relationships. Some hopefully self-explanatory examples might be:

Thing A summarises Thing B (inverse of hasSummary)
Thing A isNewerVersionOf Thing B (inverse of hasNewerVersion)
Thing A supports Thing B (inverse of isSupportedBy)
Thing A refutes Thing B (inverse of isRefutedBy)

The second way to develop a graph is to build a taxonomy and use it to tag or classify content objects. This is actually a much more flexible and open-ended approach than building sophisticated object-to-object relationships. Instead, you develop an structured taxonomy that has its own information model, and then link content objects to objects in that taxonomy.

For example, suppose I had a taxonomy concept with a label Song lyrics. (as I will describe later, a concept is a strict definition of part of a taxonomy which goes far beyond a simple keyword, so a Finder tag would be a text property of a taxonomy concept). If I identify a few content objects in my file system that contain song lyrics, I can classify those objects with that concept:

Thing A isClassifiedWith Song lyrics

(This is a gross simplification, but we'll get to that. One step at a time).

If I also classify a number of other content objects with this concept, I will begin to develop a graph like this:

The important thing to note for our content graph is that by building these relationships, we are in effect linking the objects themselves together. They have something in common; they are all classified with the same concept; Song lyrics. There is thus an implicit relationship between the things. A single object might have a number of linked concepts, and those concepts might be used to classify a number of objects, so we have the start of a network - a content graph. Note that I've left out the inverse classifies relationships from the diagram, but they are explicit in the content model (see below). The inverse relationship is important, because eventually it will allow the user to explore the graph in multiple dimensions and ask questions like "what content objects are about song lyrics?"

How would I describe the features of the file? Files already have some of these metadata provided by the operating system (filename, file size, created and modified dates, etc.), and they are quite accessible to build into the graph. Later on, I may decide to implement other data properties for content objects (for example, string properties for storing filename synonyms, or descriptive notes about the object), but for now the operating system metadata is enough.

Uniform Resource Identifiers (URIs)

A core feature of objects in an RDF graph-based ontology is that they have Uniform Resource Identifiers (URIs), which are unique, immutable identifiers and locators. And that is a problem when working with file objects, because there is no concept of a URI or even a GUID (Globally Unique identifier) in the main operating systems.

Since they are foundational for RDF graph models, in my content graph model I am going to need URIs for my file objects. URIs are similar to Uniform Resource Locators (URLs), the structured links that are used to connect web pages together. URLs only need to locate another resource somewhere on the web, hence their name. URIs need to unambiguously identify resources as well as to locate them. The location part of a URI structure is the first part. comprising the scheme (sometimes called the protocol) which for our purposes will be https, the hostname (for this model I'm using schema.tellurasemantics.com), and a path (ContentGraph). Completing the URI is an unique identifier (unique for the object, but taking the form of a Uniform Unique Identifier or UUID) which will look like 078dba6d-1b55-0188-680e-33846e6db122, so that the whole URI will be:

https://schema.tellurasemantics.com/ContentGraph/078dba6d-1b55-0188-680e-33846e6db122

Let’s just review where we are with the model.

The initial objects that the model knows about are files. There is no hierarchy of files, since I’m not dealing with folders.
Files may be related to other files. To begin with I’m just going to have a generic isRelatedTo relationship. This will be a symmetrical relationship; if file A has this relationship to file B, then file B must have the same relationship back to file A. The model may develop sophisticated relationships as the requirements develop, and the corresponding semantics will also develop.
There are some clear data properties that can fit into the model. These are mostly normal file properties provided by the operating system
I have introduced taxonomy concepts into the model, and later on I will be going into much more detail on these. Concepts are absolutely essential components, since they form the semantic glue that will hold objects together and make the scaffolding for the content graph.

Building a model in Protégé

I use the free software tool Protégé to build graph models. The UI is a little quirky but once you've got over the learning curve it is a productive tool. Since this is not an article about Protégé I’m not going to describe the whole process, but the screenshot below is a typical view for the desktop content model in Protégé. This shows the object properties view, and the highlighted property is isRelatedTo. The domain and range for this property are both ContentObject (this is the top level class that defines an item of content). The property is shown as Symmetric (so it applies in both directions; if A isRelatedTo B then b isRelatedTo A).

The data properties look like this. In the image below the highlighted property is fileName, which is a string property of a file.

Without going any further into the Protégé aspects, the model that I developed looks like this:

This needs some explanation. The image above shows classes (yellow ovals), object properties (labelled links) and data properties (green rectangles).

Classes

Thing is the top level class in the model. All classes eventually can trace their parentage up to Thing.
ContentGraph is a sub-class of Thing, and the top class in the content graph hierarchy. All of the classes I’m working with descend from ContentGraph.
ContentObject is a sub-class of ContentGraph. Any thing that manages content will descend from ContentObject.
File is a sub-class of ContentObject. I’m distinguishing between the two because File deals specifically with file system file objects, while ContentObject could manage other content objects that are not files.
Person is a sub-class of ContentGraph. Any thing that is to do with a person will descend from Person.
Author is a sub-class of Person, because the graph could contain persons who are not authors. However, an author (even in the days of AI-generated “content”) is a person.
SKOSVocabulary is a sub-class of ContentGraph. It is the top level class of SKOS-related classes in this model.
SKOSConcept is a sub-class of SKOSVocabulary. It is the main class used in classification of ContentObjects.
SKOSConceptScheme is a sub-class of SKOSVocabulary. In RDF terms, a SKOS ConceptScheme is a virtual container for a SKOS vocabulary. Note that SKOSConcept and SKOSConceptScheme are disjoint classes. This might seem odd; you might think that a SKOSConcept should be a sub-class of SKOSConceptScheme. But in the RDF definition of the SKOS ontology they are separate and disjoint, so that’s how I am treating them here. I am not using SKOSConceptScheme in this current piece of work; it is there for completeness of design.

Object properties

isRelatedTo has a domain and range of ContentObject. It is a symmetric property, so if ContentObject A is related to ContentObject B then B also has the same relation back to A.
hasAuthor has a domain of ContentObject and a range of Author. It is an inverse property together with isAuthorOf, which has the inverse domain and range.
classifies has a domain of SKOSConcept and a range of ContentObject. That is, a SKOS concept is able to classify a content object. It is an inverse property together with isClassifiedWith, which has a domain of ContentObject and a range of SKOSConcept.
The object properties for skos:broader, skos:narrower and skos:related are not part of the model design, because the taxonomy in use is flat (no hierarchy) and has no need for thesaurus capabilities.
topConceptOf and hasTopConcept is an inverse relation linking SKOSConcept with SKOSConceptScheme. It is used in taxonomy development to identify the special case of a concept that sits immediately below (in virtual terms; no class hierarchy is intended) the concept scheme. I am not using it in the current iteration of this work. That may change in future.

Data properties

fileName is a xsd:string property of the File class. It is used to hold the file name of the file.
fileNativePath is a xsd:string property of the File class. It holds the fully qualified path in the file system to the file.
fileOSOwner is a xsd:string property of File, used to contain the operating system owner name.
fileOSGroup is a xsd:string property of File, and contains the operating system group name.
fileInode is a xsd:string property of the File class. It contains the inode value. It is used in the current project to provide additional entropy in URI generation.
fileCreationDateTime is a xsd:dateTimeStamp property of the File class.
fileModificationDateTime is a xsd:dateTimeStamp property of the File class.
authorFirstName, authorLastName and authorTitle are xsd:string properties of the Author class.
SKOSPrefLabel, SKOSAltLabel and SKOSDefinition are xsd:string properties of the SKOSConcept class. When used for classifying File objects, the Finder tag will be converted into a SKOSConcept object and the text of the tag will become the SKOSPrefLabel.
tripleSubjectURI holds the URI of the file that was defined as the subject for the triple. It is of type xsd:anyURI.
tripleObjectURI holds the URI of the file defined as the object for the triple. It is of type xsd:anyURI.
triplePredicateURI holds the URI of the file defined as the predicate for the triple. It is of type xsd:anyURI.

This basic model could be (and may be in future) developed further, but for initial design and build purposes this is enough. By the way, I won't be using Protégé any further in this article series; it's a great tool for building a model, but I can simply use the exported model file to inform the rest of the development. If you are interested in getting hold of the model in file form, get in touch; I'll be happy to share it.

The next article (coming very soon) will go into designing and building a taxonomy for the graph. Then I will describe how I went about designing and building the initial Desktop content graph application.

I hope you'll come back to read it. Thanks for reading so far.