The harvest process is triggered by RabbitMQ messages with the routing key *.#.HarvestTrigger, where * is the relevant resource type, i.e. dataset, and # is an unused part of the key. Earlier versions used this part of the key to supply the id of a publisher; the id has since been moved to the message body, see “publisherId” in the next section.
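To illustrate which routing keys would trigger a harvest, here is a minimal sketch of topic matching. It assumes the key above is used as a binding pattern on a topic exchange with standard AMQP semantics, where * matches exactly one dot-separated word and # matches zero or more words. The function name `topic_matches` and the sample keys are hypothetical and not taken from the service code.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if routing_key matches an AMQP-style topic pattern.

    Assumed semantics: '*' matches exactly one word, '#' matches zero or
    more words, and words are separated by dots.
    """
    def match(p, k):
        if not p:
            # Pattern exhausted: match only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb any number of words, including none.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


# Hypothetical routing keys, for illustration only.
print(topic_matches("*.#.HarvestTrigger", "dataset.HarvestTrigger"))
print(topic_matches("*.#.HarvestTrigger", "dataset.publisher.HarvestTrigger"))
print(topic_matches("*.#.HarvestTrigger", "dataset.harvested"))
```

Under these assumptions the first two keys match and the third does not, so the middle part of the key can be left out entirely.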
The body of the trigger message has 3 relevant parameters:
A triggered harvest fetches the relevant sources from fdk-harvest-admin, then downloads the content of each source and tries to read it as an RDF graph via a Jena Model. If a source is successfully parsed as a Jena Model, it is compared to the last harvest of the same source. The harvest process continues for that source if it is not isomorphic to the last harvest or if forceUpdate is true.
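The decision to continue can be sketched as follows. This is a simplified illustration, not the service's implementation: graphs are represented as collections of (subject, predicate, object) string tuples, and plain set equality stands in for the isomorphism check done on Jena Models. Set equality is only equivalent to graph isomorphism when neither graph contains blank nodes. The function name `should_continue_harvest` is hypothetical.

```python
def should_continue_harvest(new_triples, last_triples, force_update=False):
    """Decide whether a source should be processed further.

    Simplification: set equality replaces a real isomorphism check and is
    only valid for graphs without blank nodes.
    """
    unchanged = set(new_triples) == set(last_triples)
    return force_update or not unchanged


# Hypothetical example data.
last = [("http://example.org/ds/1", "http://purl.org/dc/terms/title", "Old title")]
new = [("http://example.org/ds/1", "http://purl.org/dc/terms/title", "New title")]

print(should_continue_harvest(new, last))                      # changed source
print(should_continue_harvest(last, last))                     # unchanged source
print(should_continue_harvest(last, last, force_update=True))  # forced update
```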
All blank nodes in the resource graphs will be skolemized, which means that a URI is generated for each blank node.
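As a rough illustration of skolemization, the sketch below replaces blank node labels with generated URIs. It is not the harvester's own code: triples are plain string tuples, blank nodes are recognised by the `_:` prefix, and the base URI `https://example.org` and the `/.well-known/genid/` path (the convention suggested by RDF 1.1) are assumptions. The function name `skolemize` is hypothetical.

```python
import uuid


def skolemize(triples, base_uri="https://example.org"):
    """Replace blank node labels ('_:x') with generated URIs.

    The same blank node label always maps to the same URI within one call,
    so links between triples are preserved.
    """
    mapping = {}

    def convert(term):
        if not term.startswith("_:"):
            return term
        if term not in mapping:
            # Assumed URI scheme; the real service may use a different one.
            mapping[term] = f"{base_uri}/.well-known/genid/{uuid.uuid4()}"
        return mapping[term]

    # Predicates are never blank nodes in RDF, so only subject and object
    # are converted.
    return [(convert(s), p, convert(o)) for s, p, o in triples]


# Hypothetical example: a dataset with a distribution given as a blank node.
triples = [
    ("http://example.org/ds/1", "http://www.w3.org/ns/dcat#distribution", "_:b0"),
    ("_:b0", "http://purl.org/dc/terms/title", "A distribution"),
]
for triple in skolemize(triples):
    print(triple)
```

Because the mapping is kept for the whole graph, both occurrences of `_:b0` receive the same generated URI.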
When all sources from the trigger have been processed, a new RabbitMQ message will be published with the routing key *.harvested. The message body will be a list of harvest reports, one report for each source from fdk-harvest-admin.
Each report will contain these fields:
Since the subsequent parts of the harvest process depend on Kafka events, the services deployed from fdk-kafka-event-publisher listen for harvest reports in RabbitMQ and produce Kafka events for removed and changed resources.