Harvesting

[Figure: Simplified harvest process]

The harvest solution consists of 3 parts:

  • Harvest - The resources are downloaded from several sources, split into individual resource graphs, and each graph is assigned an ID (see the sketch after this list)
  • Reason - Each resource graph is enriched with relevant data
  • Parse - Each resource graph is converted to JSON
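Below is a minimal sketch of the Harvest step in Kotlin with Apache Jena. It assumes DCAT datasets as the resource type and a random UUID as the generated ID; the actual harvesters support several resource types and also need to keep IDs stable between harvests.

```kotlin
import org.apache.jena.rdf.model.Model
import org.apache.jena.rdf.model.ModelFactory
import org.apache.jena.rdf.model.Resource
import org.apache.jena.riot.RDFDataMgr
import org.apache.jena.vocabulary.DCAT
import org.apache.jena.vocabulary.RDF
import java.util.UUID

/** Split a downloaded source graph into one graph per dcat:Dataset and give each an ID. */
fun splitIntoResourceGraphs(source: Model): Map<String, Model> =
    source.listResourcesWithProperty(RDF.type, DCAT.Dataset).asSequence()
        .associate { dataset -> UUID.randomUUID().toString() to extractGraph(dataset) }

/** Copy every statement reachable from the resource into a new model. */
private fun extractGraph(resource: Resource, target: Model = ModelFactory.createDefaultModel()): Model {
    resource.listProperties().forEach { stmt ->
        if (!target.contains(stmt)) {
            target.add(stmt)
            val obj = stmt.`object`
            if (obj.isResource) extractGraph(obj.asResource(), target)
        }
    }
    return target
}

fun main() {
    // Hypothetical source URL; real sources are registered in harvest admin.
    val source = RDFDataMgr.loadModel("https://example.org/catalog.ttl")
    splitIntoResourceGraphs(source).forEach { (id, graph) ->
        println("resource $id has ${graph.size()} triples")
    }
}
```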

The finished JSON representation of each resource is then picked up by both the backend for the search page, fdk-search-service, and the backend for the details page, fdk-resource-service. Changes to the resources are available on data.norge once these two services have been updated.

The harvest process can be initiated by two services:

  • Harvest scheduler - Initiates harvests of all sources based on predefined schedules.
  • Harvest admin - The service where sources are registered for harvesting; each source has a button that initiates a harvest of that specific source (see the sketch below).
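The sketch below shows how such a trigger could be published to RabbitMQ using the Java client from Kotlin. The exchange name, routing key and message body are illustrative assumptions, not the exact values used by harvest admin or the harvest scheduler.

```kotlin
import com.rabbitmq.client.ConnectionFactory

fun main() {
    val factory = ConnectionFactory().apply { host = "localhost" }
    factory.newConnection().use { connection ->
        connection.createChannel().use { channel ->
            // Hypothetical topic exchange and routing key for dataset harvests.
            channel.exchangeDeclare("harvests", "topic", true)
            val body = """{"publisherId": "123456789", "forceUpdate": false}"""
            channel.basicPublish("harvests", "dataset.publisher.HarvestTrigger", null, body.toByteArray())
            println("Published harvest trigger")
        }
    }
}
```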

The communication between the relevant services is handled by a combination of RabbitMQ and Apache Kafka. The harvests are triggered by messages published in RabbitMQ, and the harvesters publish a harvest report for each source in RabbitMQ when they are done. These reports contain information about each resource that has changed and each resource that has been removed from the source since the last harvest. The reports are picked up by different versions of fdk-kafka-event-publisher, which produce a Kafka event for each changed resource. Reasoning consumes events about changed resources and produces new events containing the enriched graphs. Parsing consumes events about reasoned resources and produces new events containing a JSON version of the resource. The events about parsed resources are consumed by fdk-search-service and fdk-resource-service, and the harvest process is finished.
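The sketch below illustrates one link in this chain: a reasoning-like service that consumes events about harvested resources and produces events with enriched graphs, using the plain Kafka client from Kotlin. The topic names, consumer group and string payloads are assumptions for illustration; the real services use typed events (such as DATASET_HARVESTED) and their own serialization.

```kotlin
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import java.time.Duration
import java.util.Properties

/** Placeholder for the actual reasoning step that enriches a resource graph. */
fun reason(harvestedGraph: String): String = harvestedGraph

fun main() {
    val consumerProps = Properties().apply {
        put("bootstrap.servers", "localhost:9092")
        put("group.id", "reasoning-sketch")
        put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    }
    val producerProps = Properties().apply {
        put("bootstrap.servers", "localhost:9092")
        put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    }
    val consumer = KafkaConsumer<String, String>(consumerProps).apply { subscribe(listOf("dataset-harvested")) }
    val producer = KafkaProducer<String, String>(producerProps)

    // Consume harvested-resource events and emit reasoned-resource events.
    while (true) {
        for (record in consumer.poll(Duration.ofSeconds(1))) {
            val enriched = reason(record.value())
            producer.send(ProducerRecord("dataset-reasoned", record.key(), enriched))
        }
    }
}
```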

Parts of FDK that are not strictly part of the harvest process, but also depend on the Kafka events produced by it:

  • fdk-sparql-service - Listens for reasoned and removed events to keep the graphs available for SPARQL queries up to date (see the sketch after this list)
  • Metadata quality - Listens for DATASET_HARVESTED events to produce an assessment of the harvested datasets
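As an illustration of the first dependency, the sketch below shows how reasoned and removed events could be applied to a SPARQL store by keeping one named graph per resource. The Fuseki endpoint, the graph URI pattern and the class name are assumptions, not the actual implementation of fdk-sparql-service.

```kotlin
import org.apache.jena.rdf.model.Model
import org.apache.jena.rdfconnection.RDFConnectionFactory

class SparqlUpdater(endpoint: String = "http://localhost:3030/fdk") {
    private val connection = RDFConnectionFactory.connect(endpoint)

    /** Replace the named graph for a resource with its newly reasoned graph. */
    fun handleReasoned(fdkId: String, reasonedGraph: Model) =
        connection.put("https://example.org/graphs/$fdkId", reasonedGraph)

    /** Drop the named graph when the resource has been removed from its source. */
    fun handleRemoved(fdkId: String) =
        connection.delete("https://example.org/graphs/$fdkId")
}
```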

Detailed schema of the harvest process

[Figure: Detailed harvest process]