Documentation

Data Model

Statistical data, originally represented in SDMX (an ISO standard for exchanging and sharing statistical data and metadata among organizations), is natively multi-dimensional and therefore cannot be faithfully represented in a flat table.

To represent statistical data expressively in an RDF graph without losing information, an expert group at W3C created the Data Cube vocabulary, which captures the core information model for a multi-dimensional cube of data in a way that is compatible with SDMX and several other statistical data formats.

A data cube captures the multi-dimensional nature of a statistical dataset.

A data set is made of observations organized along one or more dimensions, where each observation includes one or more measures and the values of those measures are interpreted according to some attributes.

Let’s look at a made-up example dataset to clarify what observations, dimensions, measures and attributes are (note that we’re representing a three-dimensional cube, not a table):

                       2004-2006        2005-2007        2006-2009
Region                 Male   Female    Male   Female    Male   Female
Trentino Alto Adige    75.5   79.1      75.5   79.4      74.9   79.4
Piemonte               76.7   80.7      77.1   80.9      77.0   81.5
Toscana                78.7   83.3      78.6   83.7      77.7   83.4
Lazio                  76.6   81.3      76.5   81.5      76.6   81.7

There are 3 dimensions here (there could be many more, but we wouldn’t be able to represent them easily in a single table):

  1. the time period (e.g. 2004-2006)
  2. the region (e.g. Toscana)
  3. the sex (e.g. male)

There’s one measure per observation in the whole dataset (again, there could be more): life expectancy.

The observations are the actual numbers (78.7, 83.3 and so on...).

Finally, there is one attribute: Years, that represents the unit of measure of the observations.
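One way to picture this structure in code is as a mapping from dimension values to measured values. The sketch below is illustrative only: the Python names (`Key`, `observations`, `UNIT`, `slice_cube`) are assumptions made for this example and are not part of Linked I.Stat; the figures are a subset of the example table.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """One value for each of the three dimensions."""
    period: str
    region: str
    sex: str


# Attribute: the unit in which every observation is expressed.
UNIT = "Years"

# Observations: the measure (life expectancy) for each combination
# of dimension values.
observations = {
    Key("2004-2006", "Toscana", "Male"): 78.7,
    Key("2004-2006", "Toscana", "Female"): 83.3,
    Key("2005-2007", "Toscana", "Male"): 78.6,
    Key("2005-2007", "Toscana", "Female"): 83.7,
    Key("2004-2006", "Lazio", "Male"): 76.6,
    Key("2004-2006", "Lazio", "Female"): 81.3,
}


def slice_cube(cube, **fixed):
    """Return the observations whose dimensions match every fixed value."""
    return {
        key: value
        for key, value in cube.items()
        if all(getattr(key, dim) == wanted for dim, wanted in fixed.items())
    }


toscana_male = slice_cube(observations, region="Toscana", sex="Male")
for key, value in sorted(toscana_male.items()):
    print(f"{key.period}: {value} {UNIT}")
```

Fixing two of the three dimensions leaves a one-dimensional slice (here, a time series), which is the kind of operation a flat table makes awkward.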

The set of dimensions, attributes and measures is explicitly represented as a first-class information object, the Data Structure Definition (DSD), which describes the cube. This in turn makes a Data Cube dataset self-describing, enabling us to build tools that can automatically generate APIs and visualizations for a dataset.

A suggested read is the W3C Recommendation describing the Data Cube Vocabulary, which is a clear introduction to the structure of the data.

I.Stat Entities

Entities for the I.Stat database are defined as follows:

Observation
is a statistical observation.
Descriptors
are concepts (dimensions, attributes and measures) that are used to identify and describe statistical observations.
Code
is a unique, short and language-independent name that is used to represent a particular value of a dimension.
Code List
is a collection of codes.
Concept
is a value of some dimension which is defined in a particular controlled vocabulary with the purpose of re-using it across different datasets.
Concept Scheme
is a collection of concepts.
Dataset
is a set of statistical observations grouped according to common values of some dimensions.
Key Family or Data Structure Definition
is a set of descriptor concepts to identify and describe observations of a particular dataset.
Category Scheme
is a hierarchy of categories.
Category
is a subject matter domain.

RDF Graphs

Data is stored in several RDF graphs:

<http://linkedstat.spaziodati.eu/categories>
contains all the categories downloaded from ISTAT.
<http://linkedstat.spaziodati.eu/concepts>
contains all the concept schemes and concepts downloaded from ISTAT.
<http://linkedstat.spaziodati.eu/data>
contains observations.
<http://linkedstat.spaziodati.eu/dsd>
contains structures.
<http://linkedstat.spaziodati.eu/codes>
contains all the code lists and codes downloaded from ISTAT.
<http://linkedstat.spaziodati.eu/codes/<version>>
contains code lists and codes of <version>.
There are two ways to organize codes in RDF graphs:
  • all code lists of all the versions are stored in one graph
  • a separate graph is created to store code lists of a single version
Depending on the query you are writing, one of them might yield better performance, but the data they contain is identical.
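As a minimal sketch of how the choice between the two layouts could be made in code, the helper below returns the graph URI for codes and builds a query that lists the first triples of that graph. The function names are hypothetical; only the graph URIs come from the list above.

```python
BASE = "http://linkedstat.spaziodati.eu"


def codes_graph(version=None):
    """Return the URI of the graph that holds code lists and codes.

    Without a version this is the graph with all versions; with a
    version it is the graph dedicated to that version.
    """
    if version is None:
        return f"{BASE}/codes"
    return f"{BASE}/codes/{version}"


def explore_query(graph_uri, limit=100):
    """Build a SPARQL query listing the first `limit` triples of a graph."""
    return (
        "SELECT ?s ?p ?o\n"
        f"FROM <{graph_uri}>\n"
        "WHERE { ?s ?p ?o }\n"
        f"LIMIT {int(limit)}"
    )


print(explore_query(codes_graph("1.0")))
```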

Querying schemes and metadata

Below you will find examples of SPARQL queries that retrieve the schemes and metadata of Linked I.Stat.

To write more descriptive queries, you can use the prefixes listed in the following table, which also gives a brief description of each namespace.

Prefix                  Namespace URI                                     Description
linked-istat            http://linkedstat.spaziodati.eu/                  Linked I.Stat
linked-istat-structure  http://linkedstat.spaziodati.eu/structure/        Data Structure Definitions of Linked I.Stat
linked-istat-dataset    http://linkedstat.spaziodati.eu/dataset/          Datasets of Linked I.Stat
sdmx                    http://purl.org/linked-data/sdmx#                 SDMX-RDF vocabulary
sdmx-concept            http://purl.org/linked-data/sdmx/2009/concept#    SKOS Concepts for each COG defined concept
sdmx-code               http://purl.org/linked-data/sdmx/2009/code#       SKOS Concepts and ConceptSchemes for each COG defined code list
sdmx-dimension          http://purl.org/linked-data/sdmx/2009/dimension#  Component properties corresponding to each COG concept that can be used as a dimension
sdmx-attribute          http://purl.org/linked-data/sdmx/2009/attribute#  Component properties corresponding to each COG concept that can be used as an attribute
sdmx-measure            http://purl.org/linked-data/sdmx/2009/measure#    Component properties corresponding to each COG concept that can be used as a measure
qb                      http://purl.org/linked-data/cube#                 The RDF Data Cube vocabulary
rdf                     http://www.w3.org/1999/02/22-rdf-syntax-ns#       The RDF Schema for the RDF vocabulary defined in the RDF namespace
rdfs                    http://www.w3.org/2000/01/rdf-schema#             The RDF Schema vocabulary (RDFS)
owl                     http://www.w3.org/2002/07/owl#                    The Web Ontology Language
skos                    http://www.w3.org/2004/02/skos/core#              Simple Knowledge Organization System
xkos                    http://purl.org/linked-data/xkos#                 Extension of SKOS for describing statistical classifications
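The following sketch shows one way to turn a few of these prefixes into a SPARQL `PREFIX` header and to expand a prefixed name into a full URI. The dictionary holds a subset of the table, with prefix names written without spaces as SPARQL requires; the helper names `prefix_header` and `expand` are assumptions made for this example.

```python
# A subset of the prefixes from the table.
PREFIXES = {
    "qb": "http://purl.org/linked-data/cube#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xkos": "http://purl.org/linked-data/xkos#",
    "sdmx-dimension": "http://purl.org/linked-data/sdmx/2009/dimension#",
}


def prefix_header(names):
    """Return the PREFIX declarations for the given prefix names."""
    return "\n".join(f"PREFIX {name}: <{PREFIXES[name]}>" for name in names)


def expand(prefixed_name):
    """Expand a prefixed name such as 'qb:DataSet' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


print(prefix_header(["qb", "skos"]))
print(expand("qb:DataSet"))
```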

Datasets

A dataset is a set of statistical observations grouped according to common values of some dimensions.

How many datasets?
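A query of this kind could look like the sketch below, which builds either a count of the datasets or a list of their URIs. It assumes that datasets are typed as `qb:DataSet`, as the Data Cube vocabulary prescribes; the function name `dataset_query` is hypothetical and this is not necessarily the query used by Linked I.Stat.

```python
QB = "http://purl.org/linked-data/cube#"


def dataset_query(count_only=False, limit=None):
    """Build a SPARQL query that counts datasets or lists their URIs.

    Assumption: every dataset is typed as qb:DataSet.
    """
    if count_only:
        projection = "(COUNT(DISTINCT ?dataset) AS ?total)"
    else:
        projection = "DISTINCT ?dataset"
    lines = [
        f"PREFIX qb: <{QB}>",
        f"SELECT {projection}",
        "WHERE { ?dataset a qb:DataSet . }",
    ]
    # A LIMIT only makes sense when listing URIs, not when counting.
    if limit is not None and not count_only:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


print(dataset_query(count_only=True))
print(dataset_query(limit=10))
```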


Retrieve datasets’ URIs


What is the title of a given dataset?

Note: the title of a dataset is a blank node if there was no title in the original dataset (see how to retrieve the original dataset).

What is the license of a given dataset?

Categories

Retrieve all the category schemes

Note: for the time being there is only one category scheme defined.

What categories are defined in a given category scheme?


What categories and subcategories are defined in a given category scheme?

Note that the category scheme ISTAT DW organises categories in a two-level hierarchy. The hierarchy is retrieved using the property xkos:hasPart. The lower-level categories contain datasets.
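A top-down traversal of the two levels might be written as in the sketch below. It assumes that the category scheme is linked to its top-level categories by `xkos:hasPart` as well; the scheme URI in the call is a placeholder, not a real Linked I.Stat URI, and the function name is hypothetical.

```python
def category_hierarchy_query(scheme_uri):
    """Build a query returning each category of a scheme with its subcategories.

    Assumption: scheme -> category and category -> subcategory are both
    expressed with xkos:hasPart.
    """
    return "\n".join([
        "PREFIX xkos: <http://purl.org/linked-data/xkos#>",
        "SELECT ?category ?subcategory",
        "FROM <http://linkedstat.spaziodati.eu/categories>",
        "WHERE {",
        f"  <{scheme_uri}> xkos:hasPart ?category .",
        "  ?category xkos:hasPart ?subcategory .",
        "}",
    ])


# Placeholder URI, used only to show the call.
print(category_hierarchy_query("http://example.org/category-scheme"))
```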

What datasets are defined in a given category?


What datasets are defined in a category that contains Reddito in its name?


Retrieve the hierarchy of categories (super-category/category) a given dataset belongs to

Note the usage of the property xkos:isPartOf, the inverse of xkos:hasPart. The property xkos:hasPart was used in the example above, in which we explored categories top-down starting from the category scheme. In this last example we explore categories bottom-up starting from the dataset.

Concept Schemes

A concept is a value of some dimension which is defined in a particular controlled vocabulary with the purpose of re-using it across different datasets. A concept scheme is a collection of concepts.

What concept schemes exist?


Retrieve concept schemes of a given version


Retrieve concepts of a given concept scheme


Retrieve concepts that contain "totale" in their names
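A name search of this kind can be sketched with a case-insensitive `CONTAINS` filter. The example assumes that concepts are typed as `skos:Concept` and carry their names in `skos:prefLabel`; both are assumptions about the data, and the helper names are hypothetical.

```python
def sparql_string(text):
    """Quote text as a SPARQL string literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def concepts_by_name_query(fragment):
    """Build a query for concepts whose label contains the given fragment.

    Assumption: concepts are skos:Concept instances labelled with
    skos:prefLabel. The match ignores case.
    """
    return "\n".join([
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>",
        "SELECT DISTINCT ?concept ?label",
        "FROM <http://linkedstat.spaziodati.eu/concepts>",
        "WHERE {",
        "  ?concept a skos:Concept ;",
        "           skos:prefLabel ?label .",
        f"  FILTER (CONTAINS(LCASE(STR(?label)), {sparql_string(fragment.lower())}))",
        "}",
    ])


print(concepts_by_name_query("totale"))
```

Escaping the search term before placing it in the query keeps a fragment that contains quotes from breaking the query text.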

Code Lists

A code list is a collection of codes: unique, short and language-independent names, each of which is used to represent a particular value of a dimension.

What code lists are defined?


Retrieve code lists of a given version

Note: there is another possibility to query for code lists of a given version: use the corresponding RDF graph.
See the following query that retrieves code lists of version 1.0.

Retrieve codes of a given code list


Retrieve all codes that have a given notation


Retrieve the class of a given code

Each code is typed with the corresponding class. For example, the code http://linkedstat.spaziodati.eu/code/1.1/CL_AGGREG_PERSONE/SPETT is defined as persons aged 6 and over who declare that they attended some form of entertainment at least once in the last year. The code is typed as an instance of the class http://linkedstat.spaziodati.eu/class/1.1/CL_AGGREG_PERSONE, which defines the Type of person or meeting.
Classes are needed to define the ranges of properties. Thus, the class http://linkedstat.spaziodati.eu/class/1.1/CL_AGGREG_PERSONE defines the range of the property http://linkedstat.spaziodati.eu/property/AGGR, which connects observations with instances of the class.

What properties take values from a given class?
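Since classes serve as the ranges of properties, a query for this question can match on `rdfs:range`. The sketch below assumes the range is stated directly with `rdfs:range`; the function name is hypothetical, and the class URI is the one from the example above.

```python
def properties_with_range_query(class_uri):
    """Build a query for the properties whose rdfs:range is the given class."""
    return "\n".join([
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT DISTINCT ?property",
        f"WHERE {{ ?property rdfs:range <{class_uri}> . }}",
    ])


print(properties_with_range_query(
    "http://linkedstat.spaziodati.eu/class/1.1/CL_AGGREG_PERSONE"
))
```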

Data Structure Definitions

A Data Structure Definition is a set of descriptor concepts used to identify and describe the observations of a particular dataset.

What data structure definitions exist?


Retrieve the data structure definition of a given dataset

Exploring Descriptor Concepts of Observations

Retrieve the URI of the data structure definition of a given observation


What component properties are used to describe a certain observation?

This query returns all the component properties including dimensions, attributes and measures.


What dimension properties are used to describe a certain observation?

This query returns only the dimensions of the given observation. Similarly, attribute and measure properties can be retrieved using the classes qb:AttributeProperty and qb:MeasureProperty, respectively.
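The three variants differ only in the class used to filter the properties, so they can be generated from one template. This is a sketch under the assumption that each property is explicitly typed with one of the three Data Cube classes; the observation URI in the call is a placeholder and the function name is hypothetical.

```python
# Data Cube classes for the three kinds of component property.
COMPONENT_CLASSES = {
    "dimension": "qb:DimensionProperty",
    "attribute": "qb:AttributeProperty",
    "measure": "qb:MeasureProperty",
}


def observation_properties_query(observation_uri, kind="dimension"):
    """Build a query for the properties of one kind attached to an observation."""
    component_class = COMPONENT_CLASSES[kind]
    return "\n".join([
        "PREFIX qb: <http://purl.org/linked-data/cube#>",
        "SELECT DISTINCT ?property ?value",
        "WHERE {",
        f"  <{observation_uri}> ?property ?value .",
        f"  ?property a {component_class} .",
        "}",
    ])


# Placeholder observation URI, used only to show the call.
print(observation_properties_query("http://example.org/observation/1", "measure"))
```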

Interpreting Properties

Component properties in Data Cube are used to represent descriptor concepts of SDMX2. The meaning of component properties is important to interpret observations. For example, consider the following query:

It retrieves the value of the property REF AREA, which is given by the code http://linkedstat.spaziodati.eu/code/1.2/CL_REFAREA/IT (Italia). But what does the property itself mean?

The meaning of a property is given by a dedicated concept, which can be retrieved through the property qb:concept.

But the result of the query contains not just one concept, but a list of them. Why did it happen? In Data Cube it is possible to re-use concepts across different datasets. Let’s consider two concepts retrieved by the previous query:

How can you tell which concept defines our property? Look at the name of the structure of the dataset or observation in question: it contains the same set of letters as the name of the right concept.

Provenance Information

Provenance information is modeled by two vocabularies: DC Terms and PROV-O. PROV-O uses three main types of entities to describe the provenance information: prov:Entity, prov:Agent and prov:Activity. In short, entities in PROV-O are physical, digital, conceptual, or other kinds of thing (e.g. document). Activities are how entities come into existence and how their attributes change to become new entities. An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place.

The following Linked I.Stat entities were typed as prov:Entity:
  • Datasets
  • Code Lists
  • Concept Schemes
  • Data Structure Definitions
They contain the following provenance information:
  • prov:wasAttributedTo (as well as dcterms:creator) ascribes an entity to an agent
  • prov:generatedAtTime (as well as dcterms:issued) defines the time when the production of an entity was completed
  • prov:wasDerivedFrom references the entity which the current entity was derived from (i.e., the original XML document that contains observations)
  • prov:wasGeneratedBy references the activity that generated the entity
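The four properties can be read for a single entity with one query, as in the sketch below. Each pattern is optional so that a missing statement does not suppress the others. The entity URI in the call is a placeholder and the function name is hypothetical.

```python
def provenance_query(entity_uri):
    """Build a query reading the PROV-O statements attached to an entity."""
    properties = {
        "source": "prov:wasDerivedFrom",
        "agent": "prov:wasAttributedTo",
        "generated": "prov:generatedAtTime",
        "activity": "prov:wasGeneratedBy",
    }
    lines = [
        "PREFIX prov: <http://www.w3.org/ns/prov#>",
        "SELECT ?source ?agent ?generated ?activity",
        "WHERE {",
    ]
    for variable, prop in properties.items():
        lines.append(f"  OPTIONAL {{ <{entity_uri}> {prop} ?{variable} }}")
    lines.append("}")
    return "\n".join(lines)


# Placeholder dataset URI, used only to show the call.
print(provenance_query("http://example.org/dataset/1"))
```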

What document was a given dataset derived from?

Note: the URI of the original dataset contains the pointer to the original XML document that was used as input to the transformation script.

Who is the creator of the dataset?

Querying the data

Below you will find examples of how to query statistical observations.

Retrieve Observations of a given Dataset

Retrieve Observations when One Dimension is Fixed

Let’s consider an example:
Retrieve statistics relevant for Trento and municipalities of Trento.

The fixed dimension is the geographical one, which is defined through the property REF AREA. Istat uses eurocode values from NUTS 2003-2006 for Italy as values of the geographical dimension:

  • the code of Trento is ITD2
  • the codes of the municipalities of Trento are defined within the hierarchy of ITD2

The example question can be interpreted as:
Retrieve statistical observations in which REF AREA is equal to ITD2 or its constituents.

The following three-step procedure achieves this:

  1. retrieve all the observations and their values relevant for Trento and its municipalities
  2. retrieve and interpret datasets of the observations
  3. retrieve and interpret properties of the observations

The following query implements step 1, retrieve all the observations and their values:

  • In the first part of the query, all the codes of ITD2 are retrieved from <http://linkedstat.spaziodati.eu/codes>. Note the usage of the Virtuoso transitive OPTION, which allows a property to be treated as transitive. In our case, the property xkos:hasPart is transitive and we are interested in all the codes up to depth 2. Changing the parameter t_max to t_max(1) returns only the codes of depth 1 (i.e. codes that define only Trento).
  • In the second part of the query, the observations are retrieved from the graph <http://linkedstat.spaziodati.eu/data> together with their values and the datasets they belong to.
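The first part of such a query could be sketched as follows. It is modelled on the transitivity options documented for Virtuoso (`TRANSITIVE`, `t_distinct`, `t_in`, `t_out`, `t_max`) and is not the exact query used by Linked I.Stat. The ITD2 URI in the call is an assumption patterned on the URI of the code IT, and the function name is hypothetical. Whether the starting code itself is returned depends on `t_min`, which this sketch leaves at the server default.

```python
def codes_below_query(root_code_uri, max_depth=2):
    """Build a Virtuoso-specific query for the codes reachable from a root code.

    xkos:hasPart is followed transitively, up to max_depth steps.
    """
    return "\n".join([
        "PREFIX xkos: <http://purl.org/linked-data/xkos#>",
        "SELECT DISTINCT ?code",
        "FROM <http://linkedstat.spaziodati.eu/codes>",
        "WHERE {",
        "  { SELECT ?parent ?code WHERE { ?parent xkos:hasPart ?code } }",
        "  OPTION (TRANSITIVE, t_distinct, t_in(?parent), t_out(?code), "
        f"t_max({int(max_depth)})) .",
        f"  FILTER (?parent = <{root_code_uri}>)",
        "}",
    ])


# Assumed URI for the code ITD2, patterned on the URI of the code IT.
ITD2 = "http://linkedstat.spaziodati.eu/code/1.2/CL_REFAREA/ITD2"
print(codes_below_query(ITD2, max_depth=2))
```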

The following query implements step 2, retrieve and interpret datasets of the observations:

  • The first part of the query is equivalent to the first part of the previous query. In theory, it is possible to combine these two queries into one. In practice, the execution of such a combined query did not complete because of performance issues.
  • In the second part of the query we retrieve the titles of the datasets which help to interpret the observations.
  • In the third part of the query we retrieve the category information which also helps to interpret the observations.

The following query implements step 3, interpret properties of the observations:

Generalization of the example

The procedure that was used to retrieve and interpret observations of Trento and its municipalities can be generalized to retrieve observations with one or more dimensions fixed:

  1. Retrieve all the observations and their values that have certain values of certain properties
  2. Retrieve and interpret datasets of the observations
  3. Retrieve and interpret properties of the observations
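Step 1 of the generalized procedure amounts to adding one triple pattern per fixed dimension. The sketch below builds such a query from a mapping of property URIs to code URIs. It assumes that observations are linked to their dataset with `qb:dataSet`, as in the Data Cube vocabulary; the function name is hypothetical, and the property and code in the call are the AGGR property and SPETT code mentioned earlier.

```python
def fixed_dimensions_query(fixed, limit=100):
    """Build a query for observations with the given dimension values.

    `fixed` maps property URIs to the code URIs they must take.
    """
    patterns = [
        f"  ?observation <{prop}> <{code}> ."
        for prop, code in sorted(fixed.items())
    ]
    return "\n".join([
        "PREFIX qb: <http://purl.org/linked-data/cube#>",
        "SELECT ?observation ?dataset",
        "FROM <http://linkedstat.spaziodati.eu/data>",
        "WHERE {",
        "  ?observation qb:dataSet ?dataset .",
        *patterns,
        "}",
        f"LIMIT {int(limit)}",
    ])


print(fixed_dimensions_query({
    "http://linkedstat.spaziodati.eu/property/AGGR":
        "http://linkedstat.spaziodati.eu/code/1.1/CL_AGGREG_PERSONE/SPETT",
}))
```

Adding further entries to the mapping fixes further dimensions; steps 2 and 3 then interpret the datasets and properties of the observations returned.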