ePub Books Project Part 2: A Little History

My first suspicions that eBooks were going to be the future came way back in the day when all those CDs were coming out and taking over from our beloved vinyl LPs, but the real light-bulb moment was when I first discovered Project Gutenberg sometime at the end of the 1990s.

If memory serves me correctly, I even considered trying to set up an eBook site back then. I believe what stopped me was that there were just not yet any decent reading devices available – reading from a computer monitor was, and still is, the most uncomfortable experience ever.

And so the years rolled by…

Then in 2004/05 I heard about the Sony Librie and immediately knew that the eBook’s time was coming… and soon!

It took me a couple of years to get the project off the ground, but toward the end of 2006 I was seriously working out how to become a part of the eBook revolution. At this time I also started as the Project Gutenberg Newsletter editor, which allowed me to get among, and learn from, those who’d been there right from the start.

I spent many months trying to figure out which eBook format would be best suited as my Master Format, and after much research and some brain picking I had my shortlist: Jon Noring’s BookX, TEI and the OPS (ePub) format.

At the time I actually rejected the ePub format as I felt trying to manage all those individual files would be just too much trouble, and for my own tastes I still do. I liked the concept of BookX but felt my XML and DTD skills, which were non-existent at the time, would make it difficult.

TEI as a Master Format

I chose TEI not just because there is plenty of documentation, but also because Project Gutenberg was showing indications (albeit reluctant) that this could be an accepted format in their archives on a mass scale. Even if PG won’t accept it as a Master Format, you know those PG volunteers are going to keep on producing TEI eBooks for the archives.

Another thing that really attracted me to TEI was that it utilises ODD (One Document Does it all), a single source file from which both the schema and its documentation are generated.

I’ve continued to keep an eye out for alternative formats, even considering DTBook at one point, but by sticking with TEI I can eventually make my files available for inclusion in the PG archives, so I stayed put.

In April 2007 I started teaching myself Perl, and so work began on my pg2tei.pl script. The Gutenberg.org webmaster (Marcello Perathoner) kindly allowed me to use his gut2tei.pl script as a starting point, and although the basic structure and a handful of routines remain the same, I’ve rewritten much of the original and added numerous new routines.

From this I created a sample base of around 70 PG eBooks in the TEI P4 format, converting over to TEI P5 when this was released in November 2007. I continue to make improvements and fix bugs wherever possible.

Alas, the script is not fully automated, although it does catch most things. The biggest manual tasks are:

  • <teiHeader> – Mostly automated, but it does need to be double-checked. The odd error creeps in, so correcting entries and filling in missing information is a must – luckily this normally takes just a minute or two, although the more awkward documents can take longer.
  • Images – The <figure> tags are inserted automatically, but as none of the original TXT documents include filenames, I have to manually work out which image goes with which <figure> tag. This can be quite time-consuming.
  • Quote tags – This is probably the biggest consumer of time in the whole process. Although 99.99% of the <q> tags are correct, fixing that tiny percentage can add many minutes to a conversion. Several times I’ve considered omitting the <q> tag mark-up, either leaving the original “double” and ‘single’ quotes in place or just replacing them with a quote entity. However, I still feel the versatility the tags offer makes them well worth the work.
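
To give a feel for why the quote tags need a manual pass, here is a minimal sketch in Python. It is not code from pg2tei.pl (which is written in Perl); the function name quotes_to_q and the sample paragraphs are made up for illustration, and it assumes the source text uses straight double quotes and is processed one paragraph at a time. It wraps each balanced pair in <q> tags and flags any paragraph left with an unpaired quote, such as speech that carries on into the next paragraph.

```python
import re

# Hypothetical helper for illustration only; not taken from pg2tei.pl.
# Assumes straight double quotes and one paragraph per call.
PAIRED = re.compile(r'"([^"]*)"')


def quotes_to_q(paragraph):
    """Return (converted_text, needs_review) for a single paragraph."""
    converted = PAIRED.sub(r"<q>\1</q>", paragraph)
    # Any double quote still present had no partner within this paragraph.
    needs_review = '"' in converted
    return converted, needs_review


sample = [
    '"Come in," she said. "The kettle is on."',
    '"I will explain everything tomorrow.',  # speech continues, no closing quote
]

for para in sample:
    text, flag = quotes_to_q(para)
    print(("REVIEW: " if flag else "OK:     ") + text)
```

The first sample converts cleanly, while the second is only flagged, because the sketch cannot tell where the speech ends. Single quotes are left alone altogether, since they are easily confused with apostrophes.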

The final output of this whole process is what I call my Super-Lite TEI: a set of around 22 TEI tags (excluding the <teiHeader>, <front> and <back> sections) and no more than a dozen attributes.

In the final article of the ePub Books Project, I’ll talk about the plan to convert to the ePub format and the future of the ePubBooks.com website.
