AI Is Decoding the Vatican Secret Archives, One Pen Stroke at a Time

Tiziana Fabi, AFP/Getty Images

The Vatican Secret Archives comprise 600 collections of texts spanning 12 centuries, most of which are nearly impossible to access. The Atlantic reports that a team of scientists is hoping to change that with help from some high school students and artificial intelligence software.

In Codice Ratio is a new research project dedicated to analyzing the vast majority of Vatican manuscripts that have never been digitized. When other libraries wish to make a digital archive of their inventory, they often use optical-character-recognition (OCR) software. Such programs can be trained to recognize the letters in a certain alphabet, pick them out of hard-copy manuscripts, and convert them to searchable text. This technology posed a challenge for the Vatican, however: Many of the older texts in its collections are written by hand in a cursive-like script. With no spaces between the characters, standard OCR can't tell where one letter ends and the next begins.

To get around this, the research team at In Codice Ratio tweaked OCR software so that it could recognize pen strokes instead of letters. The software identifies the pen strokes that make up letters by looking for spots in the text where the ink narrows, rather than for full gaps between characters. The strokes aren't very useful on their own, but the software can combine the pieces to form possible letters.
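To make the idea concrete, here is a minimal Python sketch of splitting a line of connected script at the points where the ink narrows. It is an illustration only, not the In Codice Ratio team's code: the function names, the tiny binary "image," and the `max_ink` threshold are all assumptions made for this example.

```python
def column_ink(image):
    """Count ink pixels (1s) in each column of a binary image given as a list of rows."""
    return [sum(column) for column in zip(*image)]


def find_cut_points(image, max_ink=1):
    """Return the columns where the ink narrows to a thin connecting stroke.

    A column qualifies when it holds at most `max_ink` ink pixels (a
    hypothetical threshold) and is a local minimum of the ink profile.
    """
    profile = column_ink(image)
    cuts = []
    for x in range(1, len(profile) - 1):
        left, here, right = profile[x - 1], profile[x], profile[x + 1]
        is_local_min = here <= left and here <= right and (here < left or here < right)
        if here <= max_ink and is_local_min:
            cuts.append(x)
    return cuts


def stroke_spans(width, cuts):
    """Turn cut columns into (start, end) column ranges, one per stroke piece."""
    bounds = [0] + list(cuts) + [width]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy example: three thick blobs joined by one-pixel-high connecting strokes.
image = [
    [1, 1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 0, 1],
]
cuts = find_cut_points(image)
spans = stroke_spans(len(image[0]), cuts)
```

Here the ink never fully disappears between the blobs, so a gap-based splitter would see one unbroken shape, while the narrowing-based one finds two cut points and three pieces that could then be recombined into candidate letters.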

To help the software perform even better, researchers recruited students from 24 Italian high schools to check its work. As the researchers explain in their paper, the students were shown a list of acceptable versions of a real letter, such as the letter A, and were then given a list of characters the software had guessed might be the real letter. By selecting the characters that matched the acceptable versions, they were able to slowly teach the software the medieval Latin alphabet.

All this information, plus a database of 1.5 million Latin words that had already been digitized, eventually allowed the software to use artificial intelligence to identify real letters on its own. The final results aren't perfect (a good portion of the words transcribed so far contain typos), but Vatican archivists are a lot better off than they were before: The software can identify individual handwritten letters with 96 percent accuracy, and misspelled words can still provide important context to readers. The goal is to eventually use the software to digitize every document in the Vatican Secret Archives.
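The article doesn't say exactly how the word database is used, but one plausible role for a word list is to break ties between competing letter guesses. The Python sketch below shows that general technique, lexicon-constrained scoring, under stated assumptions: the per-position guesses, their probabilities, and the four-word lexicon are invented for illustration and are not taken from the project.

```python
import math


def score_word(word, candidates):
    """Sum the log-probabilities of spelling `word` from per-position letter guesses.

    `candidates` is a list with one dict per letter position, mapping each
    guessed letter to a (hypothetical) probability from the recognizer.
    """
    if len(word) != len(candidates):
        return float("-inf")
    total = 0.0
    for letter, guesses in zip(word, candidates):
        p = guesses.get(letter, 0.0)
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def best_word(candidates, lexicon):
    """Return the lexicon word that best fits the guesses, or None if none fits."""
    best, best_score = None, float("-inf")
    for word in lexicon:
        s = score_word(word, candidates)
        if s > best_score:
            best, best_score = word, s
    return best


# Illustrative data: taking the top guess at each position spells "paz",
# which is not in the word list, so the lexicon steers the result elsewhere.
candidates = [
    {"p": 0.6, "b": 0.4},
    {"a": 0.9, "o": 0.1},
    {"z": 0.6, "x": 0.4},
]
lexicon = ["pax", "bar", "rex", "pater"]
greedy = "".join(max(g, key=g.get) for g in candidates)
chosen = best_word(candidates, lexicon)
```

A real system would need to handle words missing from the list and uncertain word boundaries, which is one reason transcriptions can still contain typos.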

[h/t The Atlantic]