Uncovering Data from Ancient Family Records
In the digital age, the amount of new text authored on Twitter alone in a year surpasses a staggering 5.5 trillion characters, a figure almost equivalent to all the books in the pre-digital Library of Congress. Yet, when it comes to historical documents, the task of making them searchable and accessible online presents a unique set of challenges.
FamilySearch International, a non-profit organisation dedicated to helping people discover their family's history, is at the forefront of this endeavour. With a vast collection of over 12 billion images of historical records freely available on their website, FamilySearch is leveraging cutting-edge technology to bridge the gap between the past and the present.
One of the key technologies they are utilising is Machine Learning (ML), specifically Optical Character Recognition (OCR) and Natural Language Processing (NLP). However, applying these technologies to digitise historical documents and make them searchable poses numerous challenges.
Data Quality and Noisy Data
Digitised historical documents often contain scanning artefacts, OCR mistakes, missing information, or duplicated entries. These errors make the data inconsistent and incomplete, which complicates preprocessing and reduces the accuracy of the AI models used for entity recognition and relationship extraction, both of which are crucial for genealogical research.
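As a rough illustration of how duplicated entries and small OCR slips might be caught during preprocessing, the sketch below compares transcribed records with fuzzy string matching from Python's standard library. The sample records, the deduplicate helper, and the 0.9 threshold are assumptions made for this example, not a description of FamilySearch's pipeline.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two records."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def deduplicate(records: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a record only if it is not a near-duplicate of one already kept.

    The threshold is an illustrative assumption; a real project would tune it
    on labelled data.
    """
    kept: list[str] = []
    for record in records:
        if all(similarity(record, k) < threshold for k in kept):
            kept.append(record)
    return kept


# Hypothetical transcriptions of the same register page.
records = [
    "Johann Schmidt, born 1843, Hamburg",
    "Johann Schmidt, born 1843, Hamburg",   # exact duplicate
    "Johann Schrnidt, born 1843, Hamburg",  # OCR read "m" as "rn"
    "Maria Keller, born 1851, Bremen",
]
unique = deduplicate(records)
for record in unique:
    print(record)
```

Pairwise comparison like this is quadratic in the number of records, so it only suits small batches; larger collections would usually group candidate records first (for example by surname or year) before comparing them.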
Diverse and Complex Document Formats
Historical archives include a mix of paper documents, electronic records, and multimedia, often with no standardised formats or encoding. Layout complexities such as multi-column pages, marginalia, decorative borders, or irregular text blocks in documents like folkloristic texts make automated layout analysis difficult, leading to OCR errors and requiring manual intervention.
Degraded and Unusual Typography
The physical condition of documents (e.g., faded ink, stains, bleed-through), the use of archaic typefaces (such as 19th-century Fraktur), and highly variable handwriting all hinder the performance of standard OCR engines. Handwritten documents require specialized handwriting recognition, which remains challenging.
Language and Vocabulary Issues
Historical documents often contain dialectal, archaic, or domain-specific language that modern NLP models and OCR engines, primarily trained on current language data, struggle to understand or transcribe accurately. This semantic gap leads to noisy outputs and complicates entity recognition and relationship extraction essential for building genealogical databases.
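One common way to narrow this gap is to normalise archaic spellings before the text reaches a modern NLP model. The sketch below is a minimal example of that idea; the VARIANTS table and the sample sentence are invented for illustration, and a real project would need a much larger, language-specific lexicon.

```python
import re

# Map the historical long s to the modern letter.
LONG_S = str.maketrans({"ſ": "s"})

# Hypothetical lookup table of archaic spellings and their modern forms.
VARIANTS = {
    "sonne": "son",
    "wyfe": "wife",
    "baptiz'd": "baptized",
}


def modernise(text: str) -> str:
    """Replace the long s and any known archaic spellings in the text."""
    text = text.translate(LONG_S)

    def replace_word(match: re.Match) -> str:
        word = match.group(0)
        # Unknown words are returned unchanged.
        return VARIANTS.get(word.lower(), word)

    # Words may contain apostrophes, as in "baptiz'd".
    return re.sub(r"[A-Za-z']+", replace_word, text)


print(modernise("Thomas, ſonne of John and Ann his wyfe, baptiz'd 1702"))
```

The lookup ignores case when matching but always emits the lower-case modern form, so a capitalised variant at the start of a sentence would lose its capital. That is acceptable for search indexing but not for a faithful transcription, where the original spelling should be stored alongside the normalised one.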
Bias and Incomplete Historical Records
AI models trained on biased and unevenly collected historical records risk amplifying historical inequities or missing critical perspectives. Some historical information may also be inherently beyond what AI can compute, which leaves genealogical databases with an incomplete or skewed picture.
Need for Manual Intervention and Specialized Models
Despite advances, off-the-shelf OCR and NLP tools frequently require human-aided segmentation and correction to handle the peculiarities of historical documents adequately. Fully automated recognition of all historical texts remains an open challenge in digital heritage preservation.
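A hybrid workflow of this kind is often organised around a confidence score: transcriptions the model is sure about are accepted automatically, and the rest are queued for a human indexer. The sketch below shows that routing step only. The Transcription class, the confidence field, and the 0.85 threshold are assumptions for illustration, since recognition engines report confidence in different ways.

```python
from dataclasses import dataclass


@dataclass
class Transcription:
    image_id: str
    text: str
    confidence: float  # assumed engine score between 0.0 and 1.0


def route(items: list[Transcription], threshold: float = 0.85):
    """Split transcriptions into auto-accepted ones and ones needing review."""
    accepted: list[Transcription] = []
    needs_review: list[Transcription] = []
    for item in items:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review


# Hypothetical recognition results for three record images.
batch = [
    Transcription("img-001", "Anna Larsen, b. 1867", 0.97),
    Transcription("img-002", "J?hn W?lson, b. 18?2", 0.41),
    Transcription("img-003", "Pierre Dubois, b. 1870", 0.88),
]
accepted, needs_review = route(batch)
print([t.image_id for t in accepted])
print([t.image_id for t in needs_review])
```

Lowering the threshold reduces the manual workload at the cost of letting more errors into the index, so the value is a policy decision as much as a technical one.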
Challenges in Model Training and Adaptation
The variability and complexity of archival data demand sophisticated preprocessing, model training, and fine-tuning methods. The lack of uniform data standards and the structural heterogeneity of the records complicate the development of automated AI algorithms and degrade the performance of those algorithms on genealogical data extraction tasks.
Despite these challenges, FamilySearch is making significant strides. They have assembled a capable internal team of research scientists and engineers dedicated to Computer Assisted Indexing (CAI) of historical records. They have also partnered with researchers at Brigham Young University to shape their machine learning recognition approach, resulting in a state-of-the-art handwritten text recognition (HTR) system for historical handwritten cursive documents written in paragraph or prose style.
FamilySearch's primary approach to dealing with less common languages and scripts involves transfer learning and harvesting training data from unconventional sources. They also manually update their standards database with new information to handle local variations in historical documents.
In summary, while ML technologies provide powerful tools to accelerate the digitization and searchable indexing of historical archives, challenges in data quality, document complexity, language variance, bias, and incomplete automation persist. Continued improvements in domain-specific OCR/NLP models, better data standards, and hybrid human-AI workflows remain essential to overcome them.