Selection and Digitisation of Exchequer Port Books

The National Archives’ catalogue listing for the Exchequer Port Books is rather bewildering. The 20,000 or so books range over 233 years, and the date coverage of each book is not easily assessed computationally because the listing uses church calendar terms such as Easter and Michaelmas. For this reason, in 2014 I relisted all of the books with normalised dates in a new catalogue (see Portfolio). Using this catalogue it is much easier to see where there are gaps in coverage for distinct port members (red dashed lines between dates), and also which books are currently unavailable for consultation because they need conservation treatment (indicated by padlock icons).
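As a rough illustration of what normalising such dates involves, the sketch below converts a church calendar term and a year into an Old Style (Julian calendar) date. It is not the script used to build the catalogue: the function names, the set of terms, and the use of Meeus's Julian Easter algorithm are assumptions made for the example. It also assumes the Julian calendar throughout (England used it until 1752) and ignores the contemporary convention of starting the year on 25 March.

```python
# Illustrative only: names and the list of terms are assumptions,
# not the code behind the catalogue.

# Fixed feasts as (month, day), Old Style.
FIXED_TERMS = {
    "Lady Day": (3, 25),
    "Midsummer": (6, 24),
    "Michaelmas": (9, 29),
    "Christmas": (12, 25),
}


def julian_easter(year):
    """Return (month, day) of Easter Sunday in the Julian calendar."""
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month, day = divmod(d + e + 114, 31)
    return month, day + 1


def term_to_date(term, year):
    """Convert a term name and a year to (year, month, day), Old Style."""
    if term == "Easter":
        month, day = julian_easter(year)
    else:
        month, day = FIXED_TERMS[term]
    return (year, month, day)


print(term_to_date("Michaelmas", 1565))
print(term_to_date("Easter", 1565))
```

Because the results are plain tuples they sort chronologically, so the start and end of each book's coverage can be compared directly once both terms have been converted.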

Dunwich Port Books, 1565-1572.

There are no years for which anything like national coverage survives, and many of the years (beginning at Michaelmas) are divided into 6-month periods between books with varying survival and cleanliness. Nevertheless, I have devised a computational method whereby, for any chosen year, book lists are generated in which missing or unavailable books are substituted by the closest books within a defined date range, balancing winter and summer coverage. There is also an option to select the next n closest books, whether available or requiring conservation. Choosing sample years with the maximum geographical coverage will maximise the possibility of reconstructing data for missing ports using ‘other-end’ voyage data; it will also make it possible to assess the consistency of cargo reporting between each end of a voyage, and to provide evidence of journey times.
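The substitution step can be sketched roughly as follows. This is a simplified illustration rather than the method actually used: the `Book` record, its field names, the `select_books` function, and the placeholder references are all hypothetical. It treats "balancing winter and summer coverage" simply as choosing one book per port for each half-year, and it omits the option to list the next n closest books.

```python
from collections import namedtuple

# Hypothetical record: 'year' is the Exchequer year (beginning at Michaelmas),
# 'half' is "winter" or "summer", and 'available' is False when the book
# needs conservation before it can be consulted.
Book = namedtuple("Book", "ref port year half available")


def select_books(books, target_year, max_offset=5, include_unavailable=False):
    """For each (port, half-year), pick the book closest to target_year."""
    chosen = {}
    for book in books:
        offset = abs(book.year - target_year)
        if offset > max_offset:
            continue  # outside the permitted proxy range
        if not book.available and not include_unavailable:
            continue
        key = (book.port, book.half)
        # Smaller offset wins; on a tie, an available book is preferred.
        rank = (offset, not book.available)
        if key not in chosen or rank < chosen[key][0]:
            chosen[key] = (rank, book)
    return {key: book for key, (rank, book) in chosen.items()}


# Placeholder references, not real catalogue entries.
books = [
    Book("A/1", "Dunwich", 1580, "winter", True),
    Book("A/2", "Dunwich", 1578, "summer", True),
    Book("A/3", "Dunwich", 1580, "summer", False),
    Book("B/1", "Hull", 1590, "winter", True),
]

for key, book in sorted(select_books(books, 1580).items()):
    print(key, book.ref, book.year)
```

With the default settings the summer half for Dunwich falls back to the 1578 book, because the 1580 book is unavailable. Passing `include_unavailable=True` selects the 1580 book instead, which is the kind of comparison needed to decide which books are worth cleaning.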

Conservation of the mould-infested Port Books is an ongoing process, but with current availability it appears that useful datasets could be derived for the years 1580, 1630, and 1682 (using proxies up to 5 years either side) if 135 of the 254 selected books were to be cleaned. These selections offer the best balance between geographical coverage of the core year and minimal date spread for full national coverage, and might generate data for about 50,000 voyages. Additional proxy sets would increase the volume of data, but would also increase the costs of cleaning and the spread of date coverage.

Example of a Port Book page spread [TNA E 190/835/1].

The most efficient method for photographing the books employs a camera with a pedal-operated shutter control mounted at a lateral angle of 10 degrees, 1 metre above a book lit from both sides and supported on one side by a 20-degree foam wedge. The centre and edges of the book are then roughly equidistant from the lens, ensuring that the whole book is within the depth of field even when lighting is compromised. The operator has both hands free to ensure optimal handling of the book without the need to adjust weights between images. Thirty books can be photographed at The National Archives in a single day.

The images will be uploaded to a server running Goobi, ‘an open-source software application for digitisation projects’, where they will be processed and catalogued ready for transcription.

Work is currently underway to automate and streamline the processing of images, in order to save perhaps thousands of hours over the course of the project. One stage of this process (still under development) is illustrated below:

Programmatic identification and extraction of individual pages from page spreads.
Example of a programmatically-extracted page image.

Using the OpenCV library and scripts developed in Python, the largest subject in an image is identified and rectified. Because the corners of pages are in many cases damaged, the most reliable method of finding pages requires the identification of the book spine (if present) and each of the edges in turn. Corners are then estimated as the average intersection points of candidate spine and edge lines. The workflow will allow these postulated corners to be adjusted before committal.
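The corner-estimation step can be illustrated with a short sketch. It assumes that candidate lines are expressed as (rho, theta) pairs, the form returned by OpenCV's `cv2.HoughLines`. The source does not say how the project detects its candidate lines, so that choice, and the function names `intersect` and `estimate_corner`, are assumptions made for the example. Only the standard library is used so that the geometry is visible.

```python
import math
from itertools import product


def intersect(line_a, line_b):
    """Intersection of two (rho, theta) lines, or None if near-parallel."""
    rho1, theta1 = line_a
    rho2, theta2 = line_b
    # Each line satisfies x*cos(theta) + y*sin(theta) = rho.
    # Solve the 2x2 system with Cramer's rule.
    det = math.cos(theta1) * math.sin(theta2) - math.sin(theta1) * math.cos(theta2)
    if abs(det) < 1e-9:
        return None
    x = (rho1 * math.sin(theta2) - rho2 * math.sin(theta1)) / det
    y = (rho2 * math.cos(theta1) - rho1 * math.cos(theta2)) / det
    return x, y


def estimate_corner(spine_lines, edge_lines):
    """Average the intersections of every spine/edge candidate pair."""
    points = []
    for spine, edge in product(spine_lines, edge_lines):
        point = intersect(spine, edge)
        if point is not None:
            points.append(point)
    if not points:
        return None
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


# Two near-vertical spine candidates and two near-horizontal edge candidates.
spine_candidates = [(100.0, 0.0), (102.0, 0.0)]
edge_candidates = [(50.0, math.pi / 2), (54.0, math.pi / 2)]
print(estimate_corner(spine_candidates, edge_candidates))
```

Averaging over several candidate pairs means a corner can still be placed where the paper itself is torn or missing, since it relies on the straight runs of the spine and page edges rather than on the corner's own appearance.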

Extracted and rectified page images are to be optimised for handwriting recognition, and then prepared for online publication.