From Print to Digital: The Technical Challenge of Converting 19th-Century Cookbooks into Searchable Content

Building a content platform is one thing. Building one that can handle everything from Victorian-era cookbooks to modern digital files is an entirely different technical challenge. Matthew Cockerill, founder of CKBK, learned this firsthand when he set out to digitize the world's best cookbooks into a unified, searchable database.
On the Levels Podcast, Matthew shared the complex technical journey of converting thousands of cookbooks - spanning over a century of publishing formats - into structured data that works seamlessly across web and mobile platforms. His approach offers valuable insights for any startup dealing with legacy content digitization.
The Scope of the Challenge
CKBK doesn't just work with new, digitally-native cookbooks. The platform includes content that spans the entire history of cookbook publishing, each requiring different technical approaches.
"The starting point for that can be anything from a 19th-century Victorian cookbook, which exists only in print form, where we get an Internet Archive PDF scan, and figuring out how best to structure that. Or it can be a more modern book, but still where the publisher's lost all the files, or the author never had the files, and so again, we scan from print."
The technical team faces everything from hand-scanned PDFs of century-old books to modern EPUBs with embedded links and high-resolution photos. Each format presents unique challenges, but they all need to end up in the same structured database.
Building a Universal Data Schema
The key breakthrough was developing a standardized data schema that could accommodate any cookbook format. Matthew drew inspiration from semantic web principles, where recipes were actually one of the original test cases for structured content representation.
"Our starting point there was to think about what we want to have as our core data set for the flexibility going forward into the future. Funnily enough, the whole world of the semantic web, recipes were the ultimate test case example of how you would represent structured content."
This schema captures not just recipes, but the complete structure of cookbooks: introductory text, technique explanations, ingredient glossaries, and cross-references between sections. By standardizing everything into this format, CKBK can offer features that would be impossible with the original source materials.
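To make the idea of a whole-book schema concrete, here is a minimal sketch in Python. It is not CKBK's actual schema, which the conversation does not describe in detail; every class and field name below is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ingredient:
    name: str
    quantity: Optional[float] = None  # None for entries such as "salt, to taste"
    unit: Optional[str] = None


@dataclass
class Recipe:
    title: str
    ingredients: list[Ingredient] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)


@dataclass
class Section:
    # Hypothetical kinds: "introduction", "technique", "glossary", "recipes"
    kind: str
    title: str
    body: str = ""
    recipes: list[Recipe] = field(default_factory=list)
    see_also: list[str] = field(default_factory=list)  # titles of cross-referenced sections


@dataclass
class Cookbook:
    title: str
    year: Optional[int]
    sections: list[Section] = field(default_factory=list)


# The same structure holds a scanned Victorian book or a modern EPUB.
book = Cookbook(
    title="Example Victorian Cookbook",
    year=1861,
    sections=[
        Section(kind="technique", title="On Pastry", body="Keep everything cold."),
        Section(
            kind="recipes",
            title="Pies",
            see_also=["On Pastry"],
            recipes=[Recipe("Apple Pie", [Ingredient("apples", 6.0)], ["Peel the apples."])],
        ),
    ],
)
```

Because non-recipe sections and cross-references are first-class objects here, a feature such as linking a recipe back to its technique chapter only needs to read `see_also`, regardless of where the book came from.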
The Digitization Pipeline
The technical process varies dramatically depending on the source material, but every cookbook goes through the same standardization pipeline:
"Whatever we start from, we ensure that all those links, all that classification is brought into a consistent format. And that's what allows us to add so much digital integration to the service we provide."
For older books, this means manually recreating the structure that was implicit in the original print layout. Page references like "see page 92" get converted into digital links. Ingredient lists get parsed into structured data that can be automatically converted between measurement systems.
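As an illustration of the page-reference step, the sketch below rewrites "see page N" into a link. It assumes a hypothetical lookup table from print page numbers to section identifiers, which in practice would have to come from the scan itself; references that cannot be resolved are left untouched so a human can review them.

```python
import re

# Matches references such as "see page 92" (case-insensitive).
PAGE_REF = re.compile(r"\bsee page (\d+)\b", re.IGNORECASE)


def link_page_refs(text: str, page_to_id: dict[int, str]) -> str:
    """Replace print page references with links to section ids.

    page_to_id is a hypothetical mapping from print page number to the
    id of the section that starts on that page.
    """

    def replace(match: re.Match) -> str:
        target = page_to_id.get(int(match.group(1)))
        if target is None:
            return match.group(0)  # unresolved: keep the original wording
        return f'<a href="#{target}">{match.group(0)}</a>'

    return PAGE_REF.sub(replace, text)


linked = link_page_refs(
    "Line the tin with pastry (see page 92).",
    {92: "shortcrust-pastry"},
)
```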
For modern books that come as EPUBs, much of this structure already exists, but it still needs to be validated and enhanced to meet CKBK's standards.
Measurement Conversion as a Technical Challenge
One seemingly simple feature - converting between imperial and metric measurements - illustrates the complexity involved. This isn't just a matter of mathematical conversion; it requires understanding cooking context.
"That's what allows us to be able to convert from imperial and metric and US cups measurements, whether we're starting from this kind of historical cookbook or whether we're starting from a brand new American cookbook or whether we're starting from a scan of a book we got from the 1960s, all of them can go into that same format."
A "cup" means different things in different countries and contexts. Converting "a pinch of salt" or "butter the size of an egg" requires culinary knowledge, not just algorithms. CKBK's system has to understand these nuances while maintaining the authenticity of the original recipes.
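A small sketch shows why the measurement system has to be recorded alongside the quantity. The cup volumes are standard reference values (US customary cup, 250 ml metric cup, and half an imperial pint); the function names and the choice to return `None` for vague measures are assumptions for illustration, not CKBK's implementation.

```python
from typing import Optional

# Millilitres per cup in three common systems.
CUP_ML = {
    "us": 236.588,       # US customary cup
    "metric": 250.0,     # metric cup
    "imperial": 284.131, # half an imperial pint
}


def cups_to_ml(quantity: float, system: str) -> int:
    """Convert cups to whole millilitres for the given system."""
    return round(quantity * CUP_ML[system])


def convert(quantity: Optional[float], unit: str, system: str) -> Optional[int]:
    """Return millilitres, or None when the measure needs human judgment."""
    if unit == "cup" and quantity is not None:
        return cups_to_ml(quantity, system)
    # "a pinch", "butter the size of an egg": flag for review instead of guessing.
    return None
```

The same "1 cup" differs by close to 50 ml between a US book and an older British one, which is enough to matter in baking, so the source system cannot be dropped during digitization.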
AI and Human Annotation
The sheer volume of content makes purely manual processing impossible, but the quality requirements make fully automated processing unreliable. CKBK uses a hybrid approach combining human expertise with AI assistance.
"Some of it is also a mixture of human and, increasingly, AI annotation to capture the structure so that we can have bookshelves which represent, okay, these are the books we've got on Southeast Asian cooking. These are the books we've got on confectionery or sourdough baking."
AI helps with initial structure recognition and classification, but human reviewers ensure quality and handle edge cases that automated systems can't manage. This creates a scalable process that maintains the high standards necessary for professional-grade content.
The Training Data Advantage
Years of manual digitization have created an unexpected asset: a comprehensive training dataset for improving automation.
"What we actually have right now is a fantastic training set where we've taken this disparate material and converted it into the standardized form. And so there is a huge opportunity, I think, for us to be able to scale up the volume."
This dataset enables CKBK to train AI systems specifically for cookbook digitization - a highly specialized task that general-purpose AI tools can't handle well. The more books they process manually, the better their automated systems become.
Quality Control and Validation
With thousands of books in the pipeline, quality control becomes a major technical challenge. CKBK has built systems to catch common digitization errors before they reach users.
"Because one of the things that AI can do pretty well now is go to a well-defined format and ensure that everything is valid to that schema when it's delivered. Now it won't necessarily be perfect in terms of how it's doing that, but humans aren't perfect either."
The validation system checks for missing ingredients in method steps, impossible cooking times, and measurement inconsistencies. This automated quality control allows a small team to process large volumes of content while maintaining professional standards.
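Checks of this kind can be sketched as plain rule functions over a structured recipe. The dictionary layout, the substring matching, and the 24-hour ceiling below are all assumptions for illustration; a production validator would need smarter ingredient matching (plurals, synonyms) and thresholds tuned to real recipes.

```python
def validate_recipe(recipe: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # Rule 1: every listed ingredient should be mentioned somewhere in the method.
    method = " ".join(recipe["steps"]).lower()
    for name in recipe["ingredients"]:
        if name.lower() not in method:
            problems.append(f"ingredient never used in method: {name}")

    # Rule 2: cooking time must be positive and under an assumed 24-hour ceiling.
    minutes = recipe.get("cook_minutes")
    if minutes is not None and not (0 < minutes <= 24 * 60):
        problems.append(f"implausible cooking time: {minutes} minutes")

    return problems


issues = validate_recipe(
    {
        "ingredients": ["flour", "butter", "saffron"],
        "steps": ["Rub the butter into the flour.", "Bake until golden."],
        "cook_minutes": 0,
    }
)
```

Returning a list of problems rather than raising on the first one suits a review queue, where a human wants to see everything that is wrong with a recipe at once.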
Cross-Platform Compatibility
The structured data approach pays dividends when deploying across multiple platforms. The same content works seamlessly on web, iOS, and Android because it's stored in a platform-agnostic format.
"We knew that having an app would be really important and there were different approaches to doing that. We don't have a big development team, so we had to think about what approach we're gonna use to being on iOS and Android."
By separating content structure from presentation, CKBK can optimize the user experience for each platform without maintaining separate content databases. Recipe timing features work the same way whether you're using the website on a laptop or the mobile app in your kitchen.
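The separation of content from presentation can be sketched with one stored recipe and two renderers. The recipe layout and function names are illustrative assumptions; the point is only that neither renderer changes the stored content.

```python
from html import escape

# One platform-agnostic record (hypothetical layout).
recipe = {
    "title": "Scones",
    "steps": ["Rub butter into flour.", "Bake until golden."],
}


def render_text(recipe: dict) -> str:
    """Plain-text view, e.g. for a narrow mobile screen."""
    lines = [recipe["title"].upper()]
    lines += [f"{i}. {step}" for i, step in enumerate(recipe["steps"], start=1)]
    return "\n".join(lines)


def render_html(recipe: dict) -> str:
    """HTML view for the web, escaping content so markup stays valid."""
    items = "".join(f"<li>{escape(step)}</li>" for step in recipe["steps"])
    return f"<h1>{escape(recipe['title'])}</h1><ol>{items}</ol>"
```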
Integration Possibilities
The structured approach enables integrations that would be impossible with unstructured content. CKBK can send cooking instructions directly to smart ovens because every recipe's temperature and timing data is in a standardized format.
"One of the examples of standardized structure which we do for all the recipes is if a recipe uses ovens, temperatures and timings, all those oven temperatures and timings are captured in a structured way."
This technical foundation positions CKBK to integrate with future kitchen technologies as they emerge, without requiring massive content restructuring projects.
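What structured oven data might look like is sketched below. The `OvenStep` type and the command payload are hypothetical; the conversation does not describe CKBK's data model or any particular oven's API, so a real integration would follow the appliance vendor's own format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OvenStep:
    celsius: int
    minutes: int

    @property
    def fahrenheit(self) -> int:
        # Standard conversion: F = C * 9/5 + 32, rounded to a whole degree.
        return round(self.celsius * 9 / 5 + 32)


def to_oven_command(step: OvenStep) -> dict:
    """Build a hypothetical command payload for a connected oven."""
    return {
        "action": "bake",
        "temperature_c": step.celsius,
        "duration_s": step.minutes * 60,
    }


step = OvenStep(celsius=180, minutes=25)
command = to_oven_command(step)
```

Storing the temperature as a number with a known unit, rather than as free text such as "a moderate oven", is what makes both the Fahrenheit display and the device command possible from the same record.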
Lessons for Content Startups
Matthew's technical approach offers several lessons for startups dealing with content digitization:
- Invest in schema design upfront: The time spent creating a comprehensive data schema pays dividends across every other technical challenge.
- Plan for hybrid automation: Pure AI and pure human approaches both have limitations; successful systems combine both strategically.
- Build training data as you go: Manual processing creates datasets that improve automation over time.
- Separate content from presentation: Platform-agnostic content storage enables multi-platform deployment with smaller technical teams.
Key Takeaways
- Schema standardization enables advanced features: Converting diverse formats into structured data unlocks capabilities impossible with original source materials
- Hybrid human-AI workflows scale quality: Combining automated processing with human oversight maintains standards while handling volume
- Historical processing creates competitive advantages: Years of manual digitization create training datasets that improve automation
- Platform-agnostic storage reduces technical debt: Separating content structure from presentation enables efficient multi-platform deployment
- Quality systems must be built-in: Automated validation catches errors before they impact user experience
CKBK's technical approach demonstrates that content digitization is far more complex than scanning and uploading files. Success requires systematic thinking about data structure, quality control, and long-term scalability.
Listen to the full conversation with Matthew Cockerill on the Levels Podcast to dive deeper into the technical challenges of building content platforms.
