How do we merge data about resources that may not be consistently identified?
It will often be the case that different data publishers have used different identifiers for the same resource. In some cases there may be direct Equivalence Links between resources. In others their equivalence might be inferred based on other data, e.g. common properties.
How can we merge statements made about these distinct resources into a single description?
Apply the technique of "smushing" to manipulate an RDF graph containing the descriptions of each of the resources. Broadly a smushing algorithm will consist of the following steps:
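1. Decide on the URI of the target resource that will hold the final description.
2. Identify all equivalent resources, e.g. by finding Equivalence Links such as owl:sameAs statements, or by property values that indicate that two resources are similar (e.g. Inverse Functional Properties, see below).
3. For each statement that has an equivalent resource as its subject, assert a new statement with the same predicate and object, but with the target resource as the subject.
4. For each statement that has an equivalent resource as its object, assert a new statement with the same subject and predicate, but with the target resource as the object.

The end result will be a modified RDF graph with all properties of the equivalent resources "copied" to the target resource. In addition, any references to the equivalent resources will also be made to the target resource.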
By applying this to all resources in a graph, the available data can be normalized into a consistent set of descriptions based on a known set of resources. An application may then generate a Bounded Description of any resource and guarantee that it will include all available data.
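The copying steps can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes the graph is held as a set of (subject, predicate, object) tuples of plain strings, that the target and its equivalents have already been identified, and that literals never collide with URI strings. The function name `smush` and the `ex:` identifiers are hypothetical.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def smush(triples, target, equivalents):
    """Return a new set of triples in which every statement about an
    equivalent resource has also been asserted about the target."""
    original = set(triples)
    result = set(original)  # the original statements are preserved
    for s, p, o in original:
        new_s = target if s in equivalents else s
        new_o = target if o in equivalents else o
        if (new_s, new_o) == (s, o):
            continue  # statement does not mention an equivalent resource
        if p == OWL_SAME_AS and new_s == new_o:
            continue  # avoid asserting "target owl:sameAs target"
        result.add((new_s, p, new_o))
    return result


# Hypothetical data: ex:b is declared equivalent to ex:a.
graph = {
    ("ex:a", "ex:name", "Alice"),
    ("ex:a", OWL_SAME_AS, "ex:b"),
    ("ex:b", "ex:worksFor", "ex:org"),
    ("ex:org", "ex:employs", "ex:b"),
}
smushed = smush(graph, target="ex:a", equivalents={"ex:b"})
```

After the call, `smushed` holds the four original statements plus two new ones: `ex:a ex:worksFor ex:org` and `ex:org ex:employs ex:a`.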
Assume we start with the following graph, which contains two equivalent resources, as defined by an owl:sameAs link.
<http://example.com/product/6>
  rdfs:label "Camera";
  owl:sameAs <http://example.org/cameras/10>.

<http://example.org/cameras/10>
  ex:manufacturer <http://example.org/company/5>.

<http://example.org/company/5>
  ex:manufactured <http://example.org/cameras/10>.
Assuming we want to collate all data around resources from example.com, we can apply smushing to create the following graph:
<http://example.com/product/6>
  rdfs:label "Camera";
  owl:sameAs <http://example.org/cameras/10>;
  ex:manufacturer <http://example.org/company/5>.

<http://example.org/cameras/10>
  ex:manufacturer <http://example.org/company/5>.

<http://example.org/company/5>
  ex:manufactured <http://example.org/cameras/10>, <http://example.com/product/6>.
We can also tidy up the graph to remove statements about the equivalent resources, leaving:
<http://example.com/product/6>
  rdfs:label "Camera";
  owl:sameAs <http://example.org/cameras/10>;
  ex:manufacturer <http://example.org/company/5>.

<http://example.org/company/5>
  ex:manufactured <http://example.com/product/6>.
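The tidy-up step can be sketched as a simple filter over the smushed graph. The sketch below assumes the same tuple-of-strings representation of triples, abbreviates predicates as prefixed names for readability, and keeps every owl:sameAs statement so that the record of the equivalence survives; the function name `tidy` is hypothetical, and a real application might choose a different retention rule.

```python
SAME_AS = "owl:sameAs"  # prefixed name used here for brevity

PRODUCT = "http://example.com/product/6"
CAMERA = "http://example.org/cameras/10"
COMPANY = "http://example.org/company/5"

# The smushed graph from the example, before tidying.
smushed = {
    (PRODUCT, "rdfs:label", "Camera"),
    (PRODUCT, SAME_AS, CAMERA),
    (PRODUCT, "ex:manufacturer", COMPANY),
    (CAMERA, "ex:manufacturer", COMPANY),
    (COMPANY, "ex:manufactured", CAMERA),
    (COMPANY, "ex:manufactured", PRODUCT),
}


def tidy(triples, equivalents):
    """Drop statements that mention an equivalent resource, except for
    the owl:sameAs links that record the equivalence itself."""
    return {
        (s, p, o)
        for s, p, o in triples
        if p == SAME_AS or (s not in equivalents and o not in equivalents)
    }


tidied = tidy(smushed, equivalents={CAMERA})
```

Here `tidied` contains the four statements of the final graph above: the label, the owl:sameAs link, and the two manufacturer statements expressed against the example.com URI.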
Smushing is essentially a process of inference: by using available data we create new statements. Any OWL reasoner will carry out this kind of data merging automatically, based on the available data and schema/ontology, without the need for custom code. Applications that use a triple store that applies inferencing by default will not need to use this approach. However, applications that don't need a full inferencing engine, or that need only lightweight merging of data, can achieve the same goal with a custom smushing algorithm.
There are several variations on the algorithm described above. For example, applications might vary in how they nominate the target resource, although typically this will be based on a preferred URI. Algorithms can also be divided into those that preserve the original statements, so that the equivalent resources remain in the source RDF graph, and those that remove them to leave only a normalized description. Applications could also use Named Graphs to store the "smushed" view of the data separately, preserving the original data in another graph or triple store.
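As an illustration of one way to nominate the target, the sketch below groups resources into equivalence sets by following owl:sameAs links (treated as symmetric and transitive, using a small union-find structure) and then picks the member whose URI starts with a preferred prefix. The function names, the fallback rule (lexicographically smallest URI) and the sample URIs are assumptions made for the example.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def equivalence_classes(triples):
    """Group resources connected by owl:sameAs links into sets."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)  # union the two sets

    classes = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())


def choose_target(resources, preferred_prefix):
    """Prefer a URI under the given prefix; otherwise fall back to the
    lexicographically smallest URI so the choice is deterministic."""
    preferred = sorted(r for r in resources if r.startswith(preferred_prefix))
    return preferred[0] if preferred else min(resources)


triples = [
    ("http://example.org/cameras/10", OWL_SAME_AS, "http://example.net/item/3"),
    ("http://example.com/product/6", OWL_SAME_AS, "http://example.org/cameras/10"),
]
classes = equivalence_classes(triples)
target = choose_target(classes[0], "http://example.com/")
```

With this data all three URIs fall into a single equivalence set, and the example.com URI is chosen as the target because it matches the preferred prefix.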
As noted above, there are also several ways to identify equivalent resources. Equivalence Links are an obvious approach. Other cues can also be used, including Inverse Functional Properties. An inverse functional property is simply a property whose value uniquely identifies a resource, such as a Literal Key.
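Detecting equivalence through inverse functional properties can be sketched by grouping subjects that share the same value for such a property. The example assumes triples held as tuples of strings and uses foaf:mbox, which the FOAF vocabulary declares as inverse functional; the function name and the sample resources are hypothetical.

```python
from collections import defaultdict

FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"


def equivalents_by_ifp(triples, inverse_functional_properties):
    """Return sets of subjects that share a value for any of the given
    inverse functional properties."""
    by_key = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional_properties:
            by_key[(p, o)].add(s)
    # Only groups with more than one subject indicate an equivalence.
    return [subjects for subjects in by_key.values() if len(subjects) > 1]


triples = [
    ("http://example.com/people/1", FOAF_MBOX, "mailto:alice@example.com"),
    ("http://example.org/staff/a", FOAF_MBOX, "mailto:alice@example.com"),
    ("http://example.org/staff/b", FOAF_MBOX, "mailto:bob@example.com"),
]
groups = equivalents_by_ifp(triples, {FOAF_MBOX})
```

Here `groups` contains a single set holding the two resources that share Alice's mailbox. Each group can then be smushed onto a chosen target; a fuller implementation would also merge groups that overlap through different property values.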
An application is also free to apply its own rules about what constitutes "equivalence". For example, an application may decide to merge resources with similar property values, even if those properties are not declared as Inverse Functional Properties. This allows for local customization of smushing rules, but runs the risk of generating false positives. One way to apply these custom rules is to use local Schema Annotations to declare specific properties as being equivalent. This has the benefit of working with both custom code and OWL reasoners.
Smushing is often used to normalize an RDF graph resulting from a Follow Your Nose approach to data discovery.