The Global Biodiversity Information Facility (GBIF) has prioritized a focus on people. The very first planned item in its Implementation Plan for 2017-2021 and Annual Work Programme states:
"1.a.i: Develop mechanisms to support and reflect the skills, expertise and experience of individual and organizational contributions to the GBIF network, including revision of identity management system and integration of ORCID identifiers".
A bottleneck that prevents the full execution of GBIF's plan is a legacy of intractable, text-based content shared from natural history museums that ambiguously records the people or organizations implicated in specimen data, none of which includes pre-determined links to ORCID identifiers. Typical content shared by natural history museums under the Darwin Core terms recordedBy (collectors) and identifiedBy (determiners or identifiers) is unstructured and variable. It may include variously ordered or unordered people names, disregard cultural naming preferences, additionally express full or abbreviated names of organizations, or carry other annotations that collectively make extraction of people names from these fields extremely difficult.
The full solution requires multiple approaches. A progressive, modern approach would be to associate ORCID identifiers with specimens as their labels are digitized. Another, retrospective approach is to engage natural historians by giving them the freedom to claim their specimens and, in so doing, illustrate their breadth of expertise and efforts. If sufficient numbers of people claim their specimens, we may, as a by-product, develop the necessary authority files to help museum staff employ a more modern, integrated approach that could use look-ups of disambiguated people names along with their professional identifiers.
This application is developed and maintained by David P. Shorthouse using specimen data periodically downloaded from the Global Biodiversity Information Facility (GBIF) and authentication provided by ORCID. It was launched in August 2018 as a submission to the annual Ebbe Nielsen Challenge. Since then, Wikidata identifiers have been integrated to capture the names and the dates of birth and death of deceased biologists to help maximize downstream data integration.
The approximately 180M specimen records included in this project have content in their recordedBy (collector) or identifiedBy Darwin Core fields. Names of collectors and identifiers are parsed and cleaned using the test-driven dwc_agent Ruby gem, available for free integration in other projects. Similarity of people names is scored using a graph theory method outlined by R.D.M. Page and incorporated as a method in the dwc_agent gem. These scores are used to help expand the search for candidate specimens, which are presented in order from most to least probable. If you declared alternate names in your ORCID account, such as a maiden name, or if aliases are mentioned in Wikidata profiles, these are also used to search for candidate specimen records. Processing 180M specimen records is a scalable, repeatable process that requires approximately 2-3 hours on a laptop with 16GB of RAM using MIT-licensed, open source code. Citations are found by periodically downloading cited data packages (less than 100MB zipped) that GBIF serves on behalf of the research community.
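To make the parse, score, and rank steps more concrete, here is a minimal sketch in Python. It is not the dwc_agent gem (which is written in Ruby) and it does not implement the graph theory method attributed to R.D.M. Page; the separator list, the function names (split_agents, normalize, similarity, rank_candidates), the scoring rule, and the sample names are all assumptions made up for illustration.

```python
import re

# Separators that often divide several people in one recordedBy value.
# Assumption for illustration only; real data needs far more rules.
SEPARATORS = re.compile(r"\s*(?:;|\||&|\band\b)\s*", flags=re.IGNORECASE)


def split_agents(raw):
    """Split a raw recordedBy/identifiedBy string into candidate name strings."""
    return [part.strip() for part in SEPARATORS.split(raw) if part.strip()]


def normalize(name):
    """Rewrite 'Family, Given' as 'Given Family' and collapse whitespace."""
    if "," in name:
        family, given = (piece.strip() for piece in name.split(",", 1))
        name = f"{given} {family}"
    return " ".join(name.split())


def similarity(a, b):
    """Crude score in [0, 1] (a stand-in, not the gem's scoring method).

    Returns 0.0 when family names differ or given names conflict.
    Identical given-name tokens earn 1.0; an initial that matches the
    first letter of the other token earns 0.5. The total is divided by
    the longer list of given names.
    """
    tokens_a = a.replace(".", " ").lower().split()
    tokens_b = b.replace(".", " ").lower().split()
    if not tokens_a or not tokens_b or tokens_a[-1] != tokens_b[-1]:
        return 0.0
    given_a, given_b = tokens_a[:-1], tokens_b[:-1]
    longest = max(len(given_a), len(given_b))
    if longest == 0:
        return 1.0
    credit = 0.0
    for x, y in zip(given_a, given_b):
        if x == y:
            credit += 1.0
        elif x[0] == y[0] and (len(x) == 1 or len(y) == 1):
            credit += 0.5
        else:
            return 0.0  # conflicting given names, e.g. "john" vs "jane"
    return credit / longest


def rank_candidates(target, names):
    """Return (score, name) pairs with score > 0, most probable first."""
    scored = [(similarity(target, name), name) for name in names]
    kept = [pair for pair in scored if pair[0] > 0]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical recordedBy value, not taken from any real specimen record.
    raw = "Smith, Jane A.; J. Smith & Jones, A."
    names = [normalize(part) for part in split_agents(raw)]
    for score, name in rank_candidates("Jane A. Smith", names):
        print(f"{score:.2f}  {name}")
```

Dropping zero-score names and sorting the rest in descending order mirrors the idea of presenting candidates from most to least probable. A production parser would also need to handle organization names, titles, "et al.", and the culturally specific name orders mentioned above.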