Generating ePub Books From HTML

Converting a single HTML file to an ePub is straightforward, with many free tools available for this purpose. But what do you do if your goal is to convert multiple HTML files, using only a portion of each file, into an eBook with a proper table of contents, cover image, and so on?

This was exactly the crossroads I found myself at when attempting to create an ePub version of my book. Each chapter of the book was represented by a unique web page, and I needed an automated way of quickly downloading all of those and combining them into an eBook. To make things more interesting, only a portion of each page was necessary – who wants to see a web page’s header, footer, and navigation bar in an ePub? Additionally, images needed to be downloaded and embedded into the ePub, and GitHub Gist code snippets needed to be downloaded and represented without the use of GitHub’s JavaScript tags.

All of these requirements are necessary for creating a professional ePub, yet surprisingly no existing tool could meet all of them without considerable manual effort. Like any good software developer faced with no tool for the job and manual work as the only alternative, I took the laziest path and created a new tool to get the job done.

Introducing html2epub

That new tool is called html2epub and is a command line app which can:

Generate a professional looking ePub from a series of web pages
Strip out unnecessary HTML
Convert HTML into XHTML so as to be compliant with the ePub spec
Embed images
Embed Gist code snippets
Rewrite chapter-to-chapter links for proper ePub navigation
Support Table of Contents navigation
Support forms-based authentication

I have tried to keep this utility as simple to use as possible, despite its many features. Let’s look at how to get started.

Getting Started

On macOS, installing html2epub is greatly simplified by Homebrew. Simply run:

brew install jwhitehorn/brew/html2epub

This will download and install html2epub and its dependencies, and register the command in your PATH. With that completed, you can generate an ePub as easily as:

html2epub --url https://www.datasyncbook.com \
    --toc ./example/toc.xhtml \
    --cover ./example/cover.png \
    --contents ./example/contents.json \
    --title "Data Synchronization" \
    --subtitle "Patterns, Tools, & Techniques" \
    --author "Jason Whitehorn"

That is, assuming you have a few configuration files already – which we’ll discuss in a moment. The rationale for this is that I’d rather have a tool that is easy to invoke over and over again, even if that means creating a couple of config files at the outset. Specifically, as I’m writing Data Synchronization: Patterns, Tools, & Techniques, I needed an easy way of re-generating an ePub distribution as new content is published.
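Because the same invocation gets repeated every time new content is published, it can help to keep the arguments in one place. The following is a minimal sketch in Python, not part of html2epub itself: it only assembles the flags shown in the command above, and the helper name `build_html2epub_command` is hypothetical.

```python
import shlex


def build_html2epub_command(url, toc, cover, contents, title, subtitle, author):
    """Assemble the html2epub argument list using only the flags shown above."""
    return [
        "html2epub",
        "--url", url,
        "--toc", toc,
        "--cover", cover,
        "--contents", contents,
        "--title", title,
        "--subtitle", subtitle,
        "--author", author,
    ]


command = build_html2epub_command(
    url="https://www.datasyncbook.com",
    toc="./example/toc.xhtml",
    cover="./example/cover.png",
    contents="./example/contents.json",
    title="Data Synchronization",
    subtitle="Patterns, Tools, & Techniques",
    author="Jason Whitehorn",
)

# Print a copy-pasteable shell line with correct quoting.
# To actually run the tool, pass the list to subprocess.run(command, check=True).
print(shlex.join(command))
```

Passing the arguments as a list avoids shell-quoting problems with values such as the subtitle, which contains an ampersand.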

There are three files you’ll need – let’s discuss those in order of simplicity.

A Cover Image

No professional eBook is complete without a cover image. The --cover parameter specifies the path to that image. This file will be embedded into the eBook, so a web-safe format such as PNG or JPEG is recommended.

Table of Contents

This is one area where I can see this tool evolving, but for now, the table of contents, as specified by the --toc argument, is an XHTML file straight out of the ePub spec – yuck!

Luckily the table of contents is not something that is overly complicated in most books, nor should it vary much. Here is a sample of what one would look like:

This represents a pretty average table of contents. Each href references the file name of the chapter as it will be inside your ePub. The name itself is arbitrary, but it must be consistent with the naming in the next configuration file.
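For anyone who would rather not write that XHTML by hand, here is a rough sketch of generating it from a list of chapters. It assumes the file follows the EPUB 3 navigation document layout (a nav element with epub:type="toc" wrapping an ordered list of links); whether html2epub requires exactly this shape is an assumption, and the chapter titles, file names, and the `build_toc` helper are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def build_toc(chapters):
    """Render (title, file_name) pairs as an EPUB 3 style navigation document."""
    items = "\n".join(
        # quoteattr adds the surrounding quotes; escape handles &, < and > in titles
        f"        <li><a href={quoteattr(file_name)}>{escape(title)}</a></li>"
        for title, file_name in chapters
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        'xmlns:epub="http://www.idpf.org/2007/ops">\n'
        "  <head><title>Table of Contents</title></head>\n"
        "  <body>\n"
        '    <nav epub:type="toc">\n'
        "      <ol>\n"
        f"{items}\n"
        "      </ol>\n"
        "    </nav>\n"
        "  </body>\n"
        "</html>\n"
    )


# Hypothetical chapters: each file name must match the one used in the contents file.
chapters = [
    ("Introduction", "chapter1.xhtml"),
    ("Patterns & Pitfalls", "chapter2.xhtml"),
]
toc_xhtml = build_toc(chapters)
print(toc_xhtml)
```

Escaping the titles matters because XHTML, unlike forgiving HTML, is rejected outright when it contains a bare ampersand.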

Contents.json

This represents the core of what html2epub needs to successfully generate an ePub. Let’s take a look at a sample file before discussing further:

The contents here must be in the same order as the table of contents, and the file here should also match the href there. The url property specifies the page to include, while the login property contains optional login information if it is needed to reach certain content. Here is what a chapter with login information might look like:

In this scenario, the URL now points to a login page with a redirect to the destination page, as html2epub will need to log in before proceeding.
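The ordering and naming rules described in this section are easy to break by hand as chapters are added. Below is a minimal sketch of a check that could be run before generating the ePub. It assumes contents.json is a JSON array of objects with file and url keys; the url property is described above, but the file key name and the array layout are assumptions for illustration, not confirmed html2epub behavior. The example URLs and the helper names are hypothetical.

```python
import json
import re


def toc_hrefs(toc_xhtml):
    """Return href values in document order (a simple regex is enough for a sketch)."""
    return re.findall(r'href="([^"]+)"', toc_xhtml)


def check_contents(toc_xhtml, contents):
    """Return a list of mismatches between the table of contents and the contents entries."""
    hrefs = toc_hrefs(toc_xhtml)
    files = [entry["file"] for entry in contents]  # "file" key name is an assumption
    problems = []
    if len(hrefs) != len(files):
        problems.append(f"TOC has {len(hrefs)} entries, contents has {len(files)}")
    for position, (href, file_name) in enumerate(zip(hrefs, files), start=1):
        if href != file_name:
            problems.append(
                f"entry {position}: TOC says {href!r}, contents says {file_name!r}"
            )
    return problems


# Trimmed-down stand-ins for toc.xhtml and contents.json.
toc_xhtml = (
    '<ol><li><a href="chapter1.xhtml">One</a></li>'
    '<li><a href="chapter2.xhtml">Two</a></li></ol>'
)
contents = json.loads("""
[
  {"file": "chapter1.xhtml", "url": "https://www.example.com/chapter-1"},
  {"file": "chapter2.xhtml", "url": "https://www.example.com/chapter-2"}
]
""")

problems = check_contents(toc_xhtml, contents)
print(problems or "contents match the table of contents")
```

The check compares entries position by position, so a chapter that is present in both files but listed in a different order is reported as a mismatch.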

Wrap Up

There you have it, a simple but effective tool for generating ePubs from websites. There are several aspects of this tool that are less than ideal, such as login handling and the table of contents, but overall it has already saved me time – and I had to build the darn thing 🙂

The tool itself is free and open source – check it out on GitHub. I will undoubtedly continue to enhance it as my usage continues, but for the short term my goal was to make ePubs for my book, not write ePub generation tools, so I cannot promise any specific feature roadmap.

As always, thoughts, feedback, and criticisms are welcome.

Jason Whitehorn
