This post explains some concepts about semantic technologies that I learnt while taking one of the courses offered by the PoolParty Academy.
What are Semantic Technologies?
The W3C defines the Semantic Web as a group of technologies that enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS (I’ll delve into these acronyms later).
By implementing semantic technologies, you enable machines to “understand” data. The goal is to extract and analyse knowledge from heterogeneous sources, and to be able to categorise and process data and discover the links between them.
Relevance
The amount of data handled by companies has increased exponentially in the last few years. However, most of the information is stored in silos, and cannot be processed to provide insights that allow people to, for example, make informed decisions to improve business performance.
To classify information and make it “readable” by machines, you need metadata. As an example, think of the product recommendations that pop up in online stores like Amazon. By applying metadata to all your products and making them digestible by machines, you can show customers similar articles, which raises the probability that they buy more products and, in turn, can increase your company’s turnover.
Metadata can also be used in a variety of tasks and roles, e.g. in technical documentation, where it can help provide better search results and link suggestions, among others. Read my dedicated post on this topic. To get better results, however, it is necessary to move away from defining subjective tags and to manage documents by considering the entities they include. Machine-supported metadata recommendations allow you to define more accurate tags. How? By analysing the content and the relations between its terms with the help of a knowledge graph, which provides you with a list of metadata that is more suitable for classifying the content. Semantic layers maintain the metadata of different data sources separately.
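To make the idea of machine-supported tag recommendations more concrete, here is a minimal sketch in Python. It is not how PoolParty or any other product works internally. The concepts, labels and the `suggest_tags` function are all made up for illustration, and I assume that a simple word match against preferred and alternative labels is good enough for a demo.

```python
import re

# Hypothetical mini knowledge graph: every concept has a preferred label,
# alternative labels (synonyms) and, optionally, a broader concept.
CONCEPTS = {
    "sports_shoes": {
        "pref": "Sports shoes",
        "alt": ["trainers", "sneakers", "running shoes"],
        "broader": "sports_gear",
    },
    "sports_gear": {
        "pref": "Sports gear",
        "alt": ["sports equipment"],
        "broader": None,
    },
    "user_manual": {
        "pref": "User manual",
        "alt": ["user guide", "handbook"],
        "broader": None,
    },
}


def suggest_tags(text):
    """Suggest preferred labels for concepts mentioned in the text.

    A concept counts as mentioned if any of its labels appears as a whole
    word or phrase. The broader concept of each hit is suggested as well.
    """
    lowered = text.lower()
    tags = []
    for concept_id, concept in CONCEPTS.items():
        labels = [concept["pref"]] + concept["alt"]
        mentioned = any(
            re.search(r"\b" + re.escape(label.lower()) + r"\b", lowered)
            for label in labels
        )
        if not mentioned:
            continue
        # Add the concept itself, then its broader concept (if any).
        for candidate in (concept_id, concept["broader"]):
            if candidate and CONCEPTS[candidate]["pref"] not in tags:
                tags.append(CONCEPTS[candidate]["pref"])
    return tags


print(suggest_tags("This user guide explains how to clean your sneakers."))
```

The author of the text wrote “sneakers” and “user guide”, but the suggested tags are the controlled labels “Sports shoes”, “Sports gear” and “User manual”. This is the difference between subjective tags and tags derived from entities.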
About Knowledge Graphs
Knowledge graphs are a type of ontology that depicts knowledge in terms of entities and their relationships. They also include information about aggregations, misspellings, synonyms, labels in different languages, etc.
You can use knowledge graphs to model the data you have and define how the concepts relate to each other. They can grow and be adapted over time to reflect the changes in your business, for example, when some concepts are no longer relevant or when a concept evolves or is renamed. Knowledge graphs are more dynamic and adaptable to change than relational databases. Maintaining knowledge graphs does not require any advanced technical skills.
Google has developed one of the most popular knowledge graphs. The company uses it to enhance its search engine’s results with information gathered from a variety of sources. The information is presented to users in an infobox next to the search results. Google refers to these infoboxes as “knowledge panels”. Not all searches on Google return knowledge panels, only searches for concepts or people that the company has defined as “relevant entities”.
Another good example of a company that makes use of knowledge graphs is LinkedIn. LinkedIn’s defined entities include, among others, skills, schools, jobs, companies and knowledge. The company uses these data to provide contact suggestions and to improve query results and content recommendations.
Knowledge graphs have three main elements:
- Entities (e.g. shoes)
- Attributes (e.g. 40)
- Relations (e.g. has size)
Entities are linked to each other via relations. Each entity can have attributes, which can be specific to an entity or can be shared by several entities.
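The three elements can be sketched as a list of simple triples. This is only an illustration under my own assumptions: the `Triple` type, the sample data and the two helper functions are invented for this post, and real knowledge graphs are usually stored as RDF in a graph database rather than in a Python list.

```python
from collections import namedtuple

# One statement in the graph: an entity, a relation, and either an
# attribute value (e.g. 40) or another entity (e.g. "footwear").
Triple = namedtuple("Triple", ["subject", "relation", "value"])

# Made-up sample data based on the shoes example.
graph = [
    Triple("shoes", "has size", 40),
    Triple("shoes", "has colour", "red"),
    Triple("shoes", "is a type of", "footwear"),
    Triple("boots", "is a type of", "footwear"),
]


def describe(entity):
    """Collect everything the graph states about one entity."""
    return {t.relation: t.value for t in graph if t.subject == entity}


def entities_with(relation, value):
    """Find all entities that share a given relation and value."""
    return [t.subject for t in graph if t.relation == relation and t.value == value]


print(describe("shoes"))
print(entities_with("is a type of", "footwear"))
```

The second call shows why relations matter: “shoes” and “boots” are linked to each other through the shared “footwear” entity, even though nobody connected them directly.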
To define the approach to use when building a knowledge model, you must first identify your goal, that is, what you want to achieve with the data. Complex knowledge models with too many attributes are not always necessary: they are frequently underutilised and result in a waste of valuable resources.
There are four main approaches to knowledge modelling when it comes to semantic expressiveness:
- Folksonomies: allow users to freely add tags without restrictions. As a result, the tags are inconsistent and not very helpful for classifying information.
- Controlled vocabularies: users can only add predefined tags from a list. Pros: the vocabulary used to create tags is controlled. They can be used to create glossaries. Cons: The terms are not related to each other.
- Taxonomies: include a hierarchical relation between the entities, which are classified in sub-levels, from broader concepts to more specific ones. Pros: A basic relation model is defined, and findability and recommendations are more accurate. Faceted search is supported. Navigation menus can be dynamically managed. If you want to use taxonomies to classify your content, the SKOS (Simple Knowledge Organization System) standard is recommended. SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web.
- Ontologies: allow you to define relations in a more specific way. Unlike taxonomies, where relations can only range from more general to more specific, ontologies support all kinds of relations between attributes and elements. E.g. sports shoes are a type of sports gear, have a colour and a size, are to be used on a specific surface, etc. By using ontologies, you can provide highly personalised content recommendations and build cool things like questionnaires that adapt dynamically to your audience’s answers. I can recommend Protégé, an ontology editor developed by Stanford: https://protege.stanford.edu, as I have used it to develop knowledge graphs for classifying technical content. It is easy to learn and works like a charm. Note: the tool is free and you can use it online (you can also install a local version). The OWL standard is the one you should follow to define ontologies. The W3C Web Ontology Language (OWL) is a Semantic Web language designed to represent rich and complex knowledge about things, groups of things, and relations between things. OWL is a computational logic-based language such that knowledge expressed in OWL can be exploited by computer programs, e.g., to verify the consistency of that knowledge or to make implicit knowledge explicit. OWL documents, known as ontologies, can be published in the World Wide Web and may refer to or be referred from other OWL ontologies. OWL is part of the W3C’s Semantic Web technology stack, which includes RDF (standard model for data exchange on the web), RDFS, SPARQL, etc.
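The step from a flat tag list to a taxonomy can be illustrated with a few lines of Python. This is a rough sketch, not SKOS itself. The `BROADER` mapping only mimics the idea behind a “broader” relation, and the concepts, document IDs and function names are made up for the example.

```python
# Hypothetical taxonomy: each concept points to its broader concept.
BROADER = {
    "running shoes": "sports shoes",
    "sports shoes": "sports gear",
    "tennis rackets": "sports gear",
    "sports gear": None,
}

# Hypothetical documents, each tagged with exactly one concept.
DOCUMENTS = {
    "doc-1": "running shoes",
    "doc-2": "tennis rackets",
    "doc-3": "sports shoes",
}


def is_within(concept, ancestor):
    """Return True if concept equals ancestor or sits somewhere below it."""
    current = concept
    while current is not None:
        if current == ancestor:
            return True
        current = BROADER.get(current)
    return False


def find_documents(concept):
    """Find documents tagged with the concept or any narrower concept."""
    return sorted(doc for doc, tag in DOCUMENTS.items() if is_within(tag, concept))


print(find_documents("sports shoes"))
print(find_documents("sports gear"))
```

With a folksonomy or a controlled vocabulary, a search for “sports gear” would return nothing here, because no document carries that exact tag. The hierarchy is what lets the search also return documents tagged with narrower concepts.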
Metadata Types
There are four main types of metadata:
- Descriptive metadata: describe the physical appearance of the data. In the shoes example, that would be size, colour, brand, etc.
- Structural metadata: relate to the format of the digital object. Some tags could be URI, picture and text.
- Administrative metadata: when the object was created or modified.
- Rights management metadata: who created the object, who has edit rights on it, etc.
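As a small illustration, the four types can be kept as separate groups in one record, which makes it easy to check whether a record is complete. The field names, the values and the `missing_metadata_types` helper below are hypothetical and do not follow any particular standard.

```python
# Hypothetical metadata record for one product page, grouped by the
# four metadata types listed above. All values are made up.
record = {
    "descriptive": {"size": 40, "colour": "red", "brand": "Example Brand"},
    "structural": {"format": "text", "has_picture": True},
    "administrative": {"created": "2021-03-01", "modified": "2021-04-15"},
    "rights": {"creator": "Jane Doe", "editors": ["Jane Doe", "John Smith"]},
}

REQUIRED_TYPES = ("descriptive", "structural", "administrative", "rights")


def missing_metadata_types(item):
    """Return the metadata types that are absent or empty in a record."""
    return [name for name in REQUIRED_TYPES if not item.get(name)]


print(missing_metadata_types(record))
print(missing_metadata_types({"descriptive": {"size": 40}}))
```

A check like this is one simple way to keep metadata consistent across many objects before you worry about adopting a formal standard.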
You can follow some standards to ensure metadata consistency. The Dublin Core Metadata Initiative is one of the most popular and is widely used in different industries.
When it comes to using metadata for search engine optimisation, you can take a look at schema.org. Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond. Schema.org vocabulary can be used with many different encodings, including RDFa, Microdata and JSON-LD. These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model.
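To give an idea of what this looks like in practice, the sketch below builds a JSON-LD description of a product with Python’s standard `json` module. `Product`, `Brand`, `name`, `brand`, `color` and `size` are terms from the schema.org vocabulary (note the American spelling of `color`). The product values and the `to_json_ld_script` helper are invented for this example.

```python
import json

# Product description using schema.org terms. The values are made up.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example running shoe",
    "brand": {"@type": "Brand", "name": "Example Brand"},
    "color": "red",
    "size": "40",
}


def to_json_ld_script(data):
    """Wrap the data in the script tag commonly used to embed JSON-LD in a page."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(to_json_ld_script(product))
```

The printed snippet can be placed in the HTML of a product page. The same information could also be expressed with RDFa or Microdata. I chose JSON-LD here only because it keeps the metadata separate from the visible markup.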