Unpacking SHACL
An abundance of acronyms
Semantic technologies seem to suffer more than most from galloping acronymitis: the tendency to come up with the most tortured, obscure and indecipherable abbreviations possible. Add to this the tooth-grindingly irritating habit of using recursive acronyms (for example, SPARQL is short for SPARQL Protocol and RDF Query Language) and it all gets a bit difficult for the average human.
The worst offender that I've encountered in the Semantic world is SHACL. Not only does it say nothing to the reader about its purpose, it also keeps its mystery once you unpack the acronym to its full form, Shapes Constraint Language. It doesn't get much clearer when it is described in Wikipedia:
Shapes Constraint Language is a World Wide Web Consortium standard language for describing Resource Description Framework graphs. SHACL has been designed to enhance the semantic and technical interoperability layers of ontologies expressed as RDF graphs. SHACL models are defined in terms of constraints on the content, structure and meaning of a graph. SHACL is a highly expressive language.
Note: I'm a big fan of Wikipedia, and of W3C standards for that matter. But I'm a reasonably intelligent reader and after a couple of reads through the paragraph above I still don't really know what SHACL is or what it's for.
In this short article I am going to try to use plain English to illuminate this topic.
TL;DR
Sorry, there goes another irritating acronym. This one is short for "Too Long; Didn't Read" and in a way it's a lesson for people like me to keep it short and simple. Anyway...
SHACL is a structured language that allows you to write rules and use them to test RDF graphs.
That's it in a nutshell. If you want, you can stop here and go and look at a site like SHACL playground (https://shacl.org/playground/). This is a very nice illustration of how to build rules and use them to validate a graph. If you're feeling in need of comprehensive (if mind-numbing to people like me) detail, head to the W3C pages: https://www.w3.org/TR/shacl/ .
Or you are welcome to stay here while I go through progressively more detailed information on what it is, why you would want to use it and how to use it.
Where SHACL is used
SHACL is used to validate the data in an RDF knowledge graph against an information model. The validation process involves setting up a collection of rules based on the properties of the graph. For example (and I'll be working through this step by step later on) you might have a knowledge graph based on family relationships (parent, child, brother, sister and so on) and a set of individuals as part of that graph. You may decide, as part of the graph model, that a person (the topmost class in the model under Thing) has to have at least one first name. You can build a SHACL rule to test the person objects and check whether each has a value assigned to the firstName property. When you run the validation on that graph using that rule you will get a report showing any individuals that fail the rule. See the step by step run-through below.
Structure of SHACL
A SHACL file is an RDF file. Here is a simple SHACL file (it's the default that is displayed in Protégé's SHACL tab).
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://www.example.org/#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass ex:Person ;    # Applies to all persons
    sh:property [                 # _:b0
        sh:path ex:ssn ;          # constrains the values of ex:ssn
        sh:maxCount 1 ;
    ] ;
    sh:property [                 # _:b1
        sh:path ex:ssn ;          # constrains the values of ex:ssn
        sh:datatype xsd:string ;
        sh:pattern "^\\d{3}-\\d{2}-\\d{4}$" ;
        sh:severity sh:Warning ;
    ] ;
    sh:closed true ;
    sh:ignoredProperties ( rdf:type owl:topDataProperty owl:topObjectProperty ) ;
    .
As usual with RDF the data reduces to triples. The top of the file contains a number of prefix declarations; these are not triples themselves, but abbreviations that help readability in the rest of the file. The rest of the file is one block of statements, with the subject ex:PersonShape. Within this block are some triples that set the rules for PersonShape. The first couple of triples define PersonShape as being of class NodeShape (the use of "a" is Turtle's standard shorthand for rdf:type) and state that the rules here are aimed at every Person.
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass ex:Person ;
Note that the semicolon ";" does not end the statement: it means that the next predicate and object share the same subject, and this continues until the statement is terminated with a full stop ".". So each of these triples has ex:PersonShape as its subject.
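To make the shorthand concrete, here is a minimal sketch in plain Python (not a Turtle parser) of how one subject followed by a list of predicate-object pairs expands into separate triples. The variable names are my own and purely illustrative.

```python
# The subject is written once; each ";" in Turtle introduces another
# predicate-object pair that belongs to that same subject.
subject = "ex:PersonShape"
predicate_objects = [
    ("rdf:type", "sh:NodeShape"),     # written as "a sh:NodeShape" in Turtle
    ("sh:targetClass", "ex:Person"),
]

# Expanding the shorthand gives one full triple per pair.
triples = [(subject, predicate, obj) for predicate, obj in predicate_objects]

for triple in triples:
    print(" ".join(triple), ".")
```

Each printed line is a complete triple with the subject repeated, which is exactly what the semicolons save you from typing.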
Next the file defines some properties to test.
sh:property [
    sh:path ex:ssn ;
    sh:maxCount 1 ;
] ;
This focuses on the property ssn (social security number), and specifies that a person can have at most one value for it. In plain English, a person can have only one social security number.
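As a rough illustration of what a validator does with sh:maxCount, here is a minimal sketch in plain Python. It is not a SHACL engine: the triples are simple tuples, and the individuals and ssn values are invented for the example.

```python
from collections import defaultdict

# Hypothetical data, invented for illustration: (subject, predicate, value).
TRIPLES = [
    ("ex:Alice", "ex:ssn", "987-65-4321"),
    ("ex:Bob", "ex:ssn", "123-45-6789"),
    ("ex:Bob", "ex:ssn", "124-35-6789"),  # a second value breaks sh:maxCount 1
]


def over_max_count(triples, path, max_count):
    """Return the subjects that have more than max_count values for path."""
    values = defaultdict(set)
    for subject, predicate, value in triples:
        if predicate == path:
            values[subject].add(value)
    return sorted(s for s, vals in values.items() if len(vals) > max_count)


print(over_max_count(TRIPLES, "ex:ssn", 1))
```

Only ex:Bob is reported, because he is the only subject with two distinct ssn values.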
The next rule focuses on other specifics of this property.
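sh:property [
    sh:path ex:ssn ;
    sh:datatype xsd:string ;
    sh:pattern "^\\d{3}-\\d{2}-\\d{4}$" ;
    sh:severity sh:Warning ;
] ;
This rule specifies that each ssn value must be a literal with the datatype xsd:string, and that it has to conform to the pattern defined by the regular expression above. (The backslashes are doubled because the backslash is also the escape character inside a Turtle string.) The severity statement sets the level at which a failure of this rule is reported: here it is a warning, rather than a violation, which is the default.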
Why you would want to use SHACL
The simple answer is that it provides an excellent way to validate the data in a graph. It may be that the individual data has been manually entered with transcription errors, or it may be that the original RDF structure has been designed with insufficient constraints. For example, in the case of the ssn property, the data stored in that property is just an xsd:string. If you are in the US a social security number has to conform to a pattern of 3 digits, then a dash, then 2 digits, then another dash and finally 4 digits. But in the absence of a validation process a user or application could populate the property with any string: "ABC123" for example, or "Mister Mxyzptlk". Building a SHACL rule to validate the graph would catch this kind of problem.
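The pattern check is easy to try out on its own. The sketch below uses Python's standard re module with the same regular expression as the shape; it only illustrates the pattern itself and is not how a SHACL validator is implemented. The valid number is made up.

```python
import re

# The same regular expression as in the shape. A Python raw string needs only
# a single backslash where the Turtle string needs two.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def matches_ssn_pattern(value):
    """Return True if value has the form 3 digits, dash, 2 digits, dash, 4 digits."""
    return SSN_PATTERN.search(value) is not None


for candidate in ["123-45-6789", "ABC123", "Mister Mxyzptlk"]:
    print(candidate, "->", matches_ssn_pattern(candidate))
```

The first value passes and the other two fail, which is what the SHACL rule would flag (as a warning, given the severity set in the shape).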
How to use SHACL - a worked example
I created a simple Family graph containing a Person class with a variety of subclasses (shown here in Protégé):
[Image: the Person class and its subclasses in Protégé]
Within the graph each Person has a number of object properties such as hasHusband, hasAncestor and so on, and a number of data properties:
[Image: the object and data properties of the Family graph in Protégé]
Suppose I decide that a Person must have at least one first name. I need to create a SHACL rule that tests for this. Here is the rule in full.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://www.example.org/#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix fam: <http://www.semanticweb.org/ianpiper/ontologies/2024/12/Family/> .
fam:NameShape
    a sh:NodeShape ;
    sh:targetClass fam:Person ;    # applies to all Person objects
    sh:property [
        sh:path fam:firstName ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path fam:firstName ;
        sh:datatype xsd:string ;
        sh:severity sh:Warning ;
    ] ;
    sh:closed false ;
    sh:ignoredProperties ( rdf:type owl:topDataProperty owl:topObjectProperty ) ;
    .
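The rule defines the class that it applies to (fam:Person), then specifies that the fam:firstName property must have a minimum count of 1, and that each value must be an xsd:string. A missing first name is reported as a violation (the default severity), while a value of the wrong datatype is reported only as a warning.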
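To show the logic behind this rule, here is a minimal sketch of the minimum count check in plain Python. It is a simplification under stated assumptions, not a SHACL engine: the graph is a list of tuples, fam:Mary is an invented individual, and every individual is typed directly as fam:Person (a real validator would also pick up instances of the subclasses of Person).

```python
# Hypothetical extract of the Family graph as (subject, predicate, object).
# Ian and James are the individuals without first names; fam:Mary is invented.
TRIPLES = [
    ("fam:Ian", "rdf:type", "fam:Person"),
    ("fam:James", "rdf:type", "fam:Person"),
    ("fam:Mary", "rdf:type", "fam:Person"),
    ("fam:Mary", "fam:firstName", "Mary"),
]


def check_min_count(triples, target_class, path, min_count):
    """Return one result per instance of target_class with too few values for path."""
    focus_nodes = sorted(
        s for s, p, o in triples if p == "rdf:type" and o == target_class
    )
    results = []
    for node in focus_nodes:
        count = sum(1 for s, p, o in triples if s == node and p == path)
        if count < min_count:
            results.append(
                {"focusNode": node, "resultPath": path, "severity": "Violation"}
            )
    return results


for result in check_min_count(TRIPLES, "fam:Person", "fam:firstName", 1):
    print(result)
```

The keys in each result are borrowed from the names used in SHACL validation reports (focus node, result path, severity) so that the output is easy to compare with a real report.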
To test out my graph, I created a number of individuals, and made sure that two of them (Ian and James) did not have firstName properties.
[Image: the individuals in the Family graph in Protégé]
Now, to test the graph, I moved to the SHACL Editor tab, and used the Open button to import my SHACL rule set (which I had saved as a .txt file on disk - sadly, Protégé doesn't store these rule sets internally). Once loaded, I chose the Validate button to run the SHACL rules. Note: when using SHACL validation in Protégé you must be running the reasoner (I use Pellet for this since it seems in my hands to be the most useful) and the reasoner must be synchronised (on a Mac you just get accustomed to hitting ⌘R at regular intervals).
[Image: the SHACL Editor tab in Protégé, with the report pane showing 2 violations]
In the image above the report pane in the window shows 2 violations. It is quite difficult to read the violations without increasing the width of the columns, which takes a bit of twiddling. However, this is what the results show.
[Image: the validation results in detail]
The results give clear messages about the affected individuals and the violations, and thus point me towards the solution. I know that Ian and James lack firstName properties, so it's easy now for me to go and fix these issues. Here I am adding the firstName property to Ian.
[Image: adding the firstName property to Ian in Protégé]
Once I had corrected the issues I synchronised the reasoner and re-ran the validation. As you can see, there are now no SHACL violations.
[Image: the validation report in Protégé showing no SHACL violations]
SHACL in GraphDB
GraphDB includes support for SHACL validation, but it works in quite a different way from Protégé. Rather than running on demand, validation happens when data is written to the repository, which for this example means when importing the data. Also, you have to configure SHACL support when creating the repository. As far as I know it's not possible to retrospectively add SHACL support. It's all a bit involved, so let's go through it step by step.
For this example I created a new repository called FamilyWithSHACL. I checked the box labelled "Enable SHACL validation" and then checked the options.
[Image: the "Enable SHACL validation" options when creating the repository in GraphDB]
I simply took the defaults, but noted the Named graph entry as I'll need that shortly.
The remaining steps are:
- Enable the new repository
- Import the SHACL file into a named graph (the one we just noted above)
- Import the RDF file that needs to be validated into the default graph
- Check for messages
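I did the two import steps through the GraphDB Workbench, but they could in principle be scripted. The sketch below builds (without sending) the HTTP requests using only the Python standard library. It rests on assumptions I have not tested here: that GraphDB is running at http://localhost:7200, and that the repository accepts Turtle posted to the RDF4J-style /repositories/<name>/statements endpoint with an optional context parameter naming the graph. The function name and the Turtle strings are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request

# The named graph noted when the repository was created.
SHAPES_GRAPH = "http://rdf4j.org/schema/rdf4j#SHACLShapeGraph"


def build_import_request(base_url, repository, turtle_text, graph=None):
    """Build (but do not send) a POST request that adds Turtle data to a repository."""
    url = f"{base_url}/repositories/{repository}/statements"
    if graph is not None:
        # Assumption: the context is passed as an IRI in angle brackets, URL-encoded.
        url += "?context=" + quote(f"<{graph}>", safe="")
    return Request(
        url,
        data=turtle_text.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )


# Step 1: the SHACL file goes into the named shapes graph.
shapes_request = build_import_request(
    "http://localhost:7200", "FamilyWithSHACL", "# SHACL shapes here", SHAPES_GRAPH
)
# Step 2: the data to be validated goes into the default graph.
data_request = build_import_request(
    "http://localhost:7200", "FamilyWithSHACL", "# Family data here"
)

print(shapes_request.full_url)
print(data_request.full_url)
```

Passing each request to urllib.request.urlopen would send it. If validation fails on the second request, I would expect the server to reject it with an error response containing the validation report, in the same way that the Workbench import fails.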
Importing the SHACL file was straightforward; I used the same file as that used in Protégé. I just had to ensure that I specified the named graph option, and gave this a value of http://rdf4j.org/schema/rdf4j#SHACLShapeGraph . The data imported without errors.
Next I exported the graph data (the version with missing firstName properties for James and Ian) from Protégé, and selected it for import into GraphDB. I specified the default graph this time.
[Image: the failed import of the graph data in GraphDB]
Note that the import failed. Clicking on the red message brought up a dialog with more information:
[Image: the GraphDB dialog showing more information about the failed import]
Here is the full validation message:
org.eclipse.rdf4j.sail.shacl.GraphDBShaclSailValidationException: Failed SHACL validation
@prefix dash: <http://datashapes.org/dash#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rsx: <http://rdf4j.org/shacl-extensions#> .
@prefix rdf4j: <http://rdf4j.org/schema/rdf4j#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
_:72fe1892810146efb96123c2b9b95fcf118327 a sh:ValidationReport;
  sh:conforms false;
  rdf4j:truncated false;
  sh:result _:72fe1892810146efb96123c2b9b95fcf118328, _:72fe1892810146efb96123c2b9b95fcf118329 .

_:72fe1892810146efb96123c2b9b95fcf118328 a sh:ValidationResult;
  sh:focusNode <http://www.semanticweb.org/ianpiper/ontologies/2024/12/Family/Ian>;
  rsx:shapesGraph rdf4j:SHACLShapeGraph;
  sh:resultPath <http://www.semanticweb.org/ianpiper/ontologies/2024/12/Family/firstName>;
  sh:sourceConstraintComponent sh:MinCountConstraintComponent;
  sh:resultSeverity sh:Violation;
  sh:sourceShape _:61cdac1a-0819-4e87-98fb-2a9ec2b377d2-1 .

_:61cdac1a-0819-4e87-98fb-2a9ec2b377d2-1 a sh:PropertyShape;
  sh:path <http://www.semanticweb.org/ianpiper/ontologies/2024/12/Family/firstName>;
  sh:minCount 1 .

_:72fe1892810146efb96123c2b9b95fcf118329 a sh:ValidationResult;
  sh:focusNode <http://www.semanticweb.org/ianpiper/ontologies/2024/12/Family/James>;
  rsx:shapesGraph rdf4j:SHACLShapeGraph;
  sh:resultPath <http://www.semanticweb.org/ianpiper/ontologies/2024/12/Family/firstName>;
  sh:sourceConstraintComponent sh:MinCountConstraintComponent;
  sh:resultSeverity sh:Violation;
  sh:sourceShape _:61cdac1a-0819-4e87-98fb-2a9ec2b377d2-1 .
The violations are the two sh:ValidationResult entries: one with Ian as the focus node and one with James, both failing the minimum count rule on firstName. In practice this means that the RDF data wasn't imported, and won't be until it passes validation. On one hand this makes sense, since it reduces the risk of having to manage bad data inside GraphDB. On the other hand it means that I needed to use another tool to hunt down and correct the error.
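The report is rather verbose, so it can help to pull out just the affected individuals. Here is a minimal sketch that does this with a regular expression over the report text. It is a quick-and-dirty approach for reading a message like the one above; a proper RDF library would be the more robust way to process a validation report. The REPORT string is an abbreviated excerpt of the message above.

```python
import re

# Abbreviated excerpt of the validation message shown above.
REPORT = """
_:72fe1892810146efb96123c2b9b95fcf118328 a sh:ValidationResult;
  sh:focusNode <http://www.semanticweb.org/ianpiper/ontologies/2024/12/Family/Ian>;
  sh:resultSeverity sh:Violation .
_:72fe1892810146efb96123c2b9b95fcf118329 a sh:ValidationResult;
  sh:focusNode <http://www.semanticweb.org/ianpiper/ontologies/2024/12/Family/James>;
  sh:resultSeverity sh:Violation .
"""


def failed_focus_nodes(report_text):
    """Return the IRI of every sh:focusNode mentioned in the report text."""
    return re.findall(r"sh:focusNode\s+<([^>]+)>", report_text)


for iri in failed_focus_nodes(REPORT):
    # Print just the local name, that is, the part after the last slash.
    print(iri.rsplit("/", 1)[-1])
```

This prints the local names of the two individuals that need fixing before the import will succeed.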
End
That's it for this short article. It is probably clear that there is much more that can be done to build sophisticated validation tools in SHACL, but it all builds on the basic principles described here. I hope you have found this useful; thanks for reading.