Herbis is the Erudite Recorded Botanical Information Synthesizer



        This project offers a proof of concept and an initial implementation of 'one-button' specimen imaging and data capture. Clicking the shutter on a digital camera initiates a sequence that culminates in the loading of label data and a specimen image into a structured collection database. Our ultimate goal is to reduce the total cost of digital collection data capture by significantly reducing both the human labor required and the total project duration. Significant gains can be achieved by developing appropriate protocols and methodologies, then packaging them as web services. Much of this can be accomplished by applying existing technology to data acquisition bottlenecks.

        The technology required for digital image capture of specimens has become affordable, if not yet commonplace. Labor costs, rather than the cost of equipment, are likely the main impediment to making digital image capture ubiquitous. Digital imaging projects often do nothing more with images than make them available for web display. However, an image can also serve as the basis for label data capture. Along with specimen images, label data, particularly georeferenced label data, is a valuable public product for collections.


        Components of the technology we intend to implement are derived from computer vision and automated document processing domains and have been commercialized into off-the-shelf technologies. Our aim is to use open source or commercial solutions (and to develop solutions where necessary) that accelerate the herbarium specimen data capture process. Each of these solutions (other than camera operation) will be embedded into web services, providing benefits such as cross-platform interoperability and scalability.

        Specific challenges in developing one-button herbarium specimen data capture include 1) rapid image capture, 2) web services development, 3) image-to-text conversion of label data, 4) text markup into data elements to simplify database loads, and 5) georeferencing. Our goal is to address these challenges as modular services that are mutually aware and configurable.
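
        The idea of modular, configurable services can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the stage functions, the `stub_ocr_text` field, the tiny gazetteer, and the `run_pipeline` helper are all hypothetical stand-ins for real OCR, markup, and georeferencing services.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Stage = Callable[[Record], Record]


def image_to_text(record: Record) -> Record:
    """Stand-in for an OCR service; the text is supplied by the caller here."""
    return {**record, "label_text": str(record["stub_ocr_text"])}


def markup_text(record: Record) -> Record:
    """Stand-in for text markup: split 'key: value' lines into data elements."""
    fields: Dict[str, str] = {}
    for line in record["label_text"].splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return {**record, "fields": fields}


def georeference(record: Record) -> Record:
    """Stand-in for georeferencing: look up the locality in a made-up gazetteer."""
    gazetteer = {"example county": (0.0, 0.0)}  # placeholder data, not real coordinates
    locality = record["fields"].get("locality", "").lower()
    return {**record, "coordinates": gazetteer.get(locality)}


# Each stage is registered by name so a client can choose which ones to run.
STAGES: Dict[str, Stage] = {
    "ocr": image_to_text,
    "markup": markup_text,
    "georef": georeference,
}


def run_pipeline(record: Record, stage_names: List[str]) -> Record:
    """Apply the named stages in order, passing the record from one to the next."""
    for name in stage_names:
        record = STAGES[name](record)
    return record
```

        A client that only wants transcription and markup would pass `["ocr", "markup"]`; adding `"georef"` to the list appends the georeferencing step without changing the other stages.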


        Services will be designed and coded under an open source policy with future applicability in mind, so that distributed, mirrored services can provide redundancy, flexibility, and control for institutions wishing to operate some or all of their own services. Image capture stations could conceivably be set up at any number of institutions. On a larger scale, distributed, mirrored services for optical character recognition (OCR), natural handwriting recognition (NHR), and natural language processing (NLP) would be available for image capture operators to choose among. Configuration options will allow service providers and clients to define the specifics of their own data pipeline.
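
        As a rough sketch of how mirrored services might provide redundancy, a client could try each configured mirror in order and fall back to the next one when a mirror is unreachable. The mirror names, the `call_first_available` helper, and the use of `ConnectionError` to signal an unavailable mirror are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Service = Callable[[str], str]


def call_first_available(
    mirrors: Sequence[Tuple[str, Service]], payload: str
) -> Tuple[str, str]:
    """Try each (name, service) pair in order; return the first successful result.

    A mirror signals that it is unavailable by raising ConnectionError.
    """
    errors: List[str] = []
    for name, service in mirrors:
        try:
            return name, service(payload)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no mirror available: " + "; ".join(errors))


def offline_mirror(payload: str) -> str:
    """Hypothetical mirror that is always down."""
    raise ConnectionError("unreachable")


def local_mirror(payload: str) -> str:
    """Hypothetical mirror that echoes the payload in upper case."""
    return payload.upper()
```

        Here the order of the `mirrors` list plays the role of a configuration option: an institution running its own service could list it first and keep remote mirrors as fallbacks.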



Funded by the National Science Foundation Award #: DBI-0345341