This application is developed and maintained by David P. Shorthouse using specimen data periodically downloaded from the Global Biodiversity Information Facility (GBIF) and authentication provided by ORCID. It was launched in August 2018 as a submission to the annual Ebbe Nielsen Challenge. Wikidata identifiers have since been integrated to capture the names, birth dates, and death dates of deceased biologists. This helps maximize downstream data integration and engagement, and serves as a means to discover errors or inconsistencies in natural history specimen data.
Names of collectors and determiners are parsed and cleaned using the test-driven dwc_agent Ruby gem, which is freely available for integration in other projects. Similarity between people's names is scored using a graph theory method outlined by R.D.M. Page and incorporated as a method in the dwc_agent gem. These scores are used to help expand the search for candidate specimens, which are presented in order from most to least probable. If you declared alternate names in your ORCID account, such as a maiden name, or if aliases are mentioned in Wikidata profiles, these are also used to search for candidate specimen records. Processing this large number of specimen records is an intensive though repeatable process that uses MIT-licensed, open source code.
Integration with GBIF
Approximately 206M specimen records are downloaded from GBIF as Darwin Core Archive files. Records with the basisOfRecord PRESERVED_SPECIMEN, FOSSIL_SPECIMEN, or LIVING_SPECIMEN are selected and then processed for entries in the recordedBy (collector) or identifiedBy (determiner) fields.
- all occurrence data fully refreshed every 2 weeks
- daily poll for cited data download packages (less than 100MB zipped), extracted and linked to attributed specimen records
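The record selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the production code. It assumes a tab-delimited occurrence file with Darwin Core column headers, and the sample rows, the `gbifID` values, and the function name `select_specimen_rows` are made up for the example.

```python
import csv
import io

# basisOfRecord values named in the text above.
SPECIMEN_TYPES = {"PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN", "LIVING_SPECIMEN"}


def select_specimen_rows(rows):
    """Yield specimen rows that name a collector or a determiner.

    Hypothetical helper: `rows` is any iterable of dicts keyed by
    Darwin Core term names.
    """
    for row in rows:
        if row.get("basisOfRecord") not in SPECIMEN_TYPES:
            continue
        collector = (row.get("recordedBy") or "").strip()
        determiner = (row.get("identifiedBy") or "").strip()
        if collector or determiner:
            yield row


# Made-up sample standing in for an occurrence file (assumed tab-delimited).
SAMPLE = (
    "gbifID\tbasisOfRecord\trecordedBy\tidentifiedBy\n"
    "1\tPRESERVED_SPECIMEN\tA. Collector\t\n"
    "2\tHUMAN_OBSERVATION\tB. Observer\t\n"
    "3\tFOSSIL_SPECIMEN\t\t\n"
    "4\tLIVING_SPECIMEN\t\tC. Determiner\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
selected_ids = [row["gbifID"] for row in select_specimen_rows(reader)]
print(selected_ids)
```

Row 2 is dropped because it is an observation rather than a specimen, and row 3 is dropped because it names neither a collector nor a determiner.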
Integration with Wikidata
Synchrony with Wikidata is maintained in several ways. Except for the last item in the list below, all methods are automated and run as scheduled cron jobs using a Ruby gem. They require that a person's page on Wikidata has a date of death as well as a value for at least one of the following properties: IPNI, Harvard Index of Botanists, Entomologists of the World, ZooBank Author ID, BHL Creator ID, Stuttgart Database of Scientific Illustrators ID.
- daily poll for new pages (SPARQL)
- daily refresh for entries that were modified within the previous 24 hours (SPARQL, using 2020-01-01 as example date)
- weekly query for merge events (SPARQL, using 2020-01-01 as example date)
- a Trainer can refresh on demand
Example SPARQL queries above are limited to Entomologists of the World (P5370), but all the watched properties are used in production.
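As an illustration of the daily refresh, the sketch below builds a SPARQL query for people modified since a given date. It is not the query used in production. The function name `modified_since_query` is hypothetical, and the use of `wdt:P31 wd:Q5` (instance of human), `wdt:P570` (date of death), and `schema:dateModified` is an assumption about how the stated requirements might be expressed on the Wikidata Query Service, where these prefixes are predefined.

```python
from datetime import date


def modified_since_query(since: date, property_id: str = "P5370") -> str:
    """Build a SPARQL query for deceased people modified on or after `since`.

    Hypothetical sketch: P5370 (Entomologists of the World) is the example
    property from the text; the other triple patterns are assumptions.
    """
    timestamp = f"{since.isoformat()}T00:00:00Z"
    return f"""
SELECT DISTINCT ?item WHERE {{
  ?item wdt:P31 wd:Q5 ;
        wdt:P570 ?death ;
        wdt:{property_id} ?id ;
        schema:dateModified ?modified .
  FILTER(?modified >= "{timestamp}"^^xsd:dateTime)
}}
""".strip()


query = modified_since_query(date(2020, 1, 1))
print(query)
```

Passing a different `property_id` would cover the other watched properties one at a time.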
Integration with ORCID
Synchrony with ORCID is maintained in several ways.
- OAuth2 pass-through authentication
- daily poll for new accounts by querying the API for any of the keywords: taxonomy, taxonomist, mycology, zoology, entomology, botany, systematics, phylogenetics, biodiversity
- cache full name, aliases, keywords, employment, and education along with start and end dates
- incorporate employment and education data using ORCID-supplied Ringgold or GRID identifiers for organizations
- resolution of Ringgold or GRID identifiers against Wikidata Q numbers
- periodic full refresh of ORCID profiles
- a user or a Trainer can refresh on demand
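The daily keyword poll could be expressed as a single search request. The sketch below only builds the request URL and makes no network call. It assumes the ORCID public API search endpoint at `https://pub.orcid.org/v3.0/search/` and its `keyword` search field. The function name `keyword_search_url` is hypothetical, and the production polling code may differ.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Keywords listed in the text above.
KEYWORDS = [
    "taxonomy", "taxonomist", "mycology", "zoology", "entomology",
    "botany", "systematics", "phylogenetics", "biodiversity",
]

# Assumed endpoint of the ORCID public search API.
SEARCH_BASE = "https://pub.orcid.org/v3.0/search/"


def keyword_search_url(keywords, base=SEARCH_BASE):
    """Return a search URL matching profiles that carry any of the keywords."""
    query = " OR ".join(f'keyword:"{word}"' for word in keywords)
    return f"{base}?{urlencode({'q': query})}"


url = keyword_search_url(KEYWORDS)
print(url)

# Decode the query string again to confirm it round-trips unchanged.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded)
```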
Integration with Zenodo
From the settings panel in your account, you may connect with Zenodo in two clicks using your ORCID credentials. Once you make this set-it-and-forget-it connection, Bloodhound pushes your specimen data into this industry-recognized, stable, long-term archive and mints a new DataCite DOI. Your Zenodo token is cached in Bloodhound. Each week, if you have made new claims, a new version of your specimen data is pushed to the archive on your behalf. You will also receive a DataCite DOI badge on your Bloodhound profile page and a formatted citation for your professional resume. The versioned data packages stored in Zenodo each consist of a CSV file and a JSON-LD document, preparing the way for future Linked Data integrations. If you accept DataCite as a trusted organization in your ORCID account, you will receive a new formatted work entry there for your specimen dataset.