Converting a PDF book into EPUB
I read books on an iPad using iBooks. It handles PDFs very well, but I don't feel comfortable without a few features that are only available when reading EPUB books:
- Highlights
- Infinite scroll
- Night mode
There are 2 straightforward ways to obtain an EPUB book:
- Find/borrow/buy it
- Convert from PDF to EPUB
I chose the most fun option: converting. To be honest, I only chose it after a long and unsuccessful search for an EPUB edition.
I used one of the most popular converters, Calibre. With its default settings it has one problem: the converted books are not readable, because each line of the book is transformed into a separate paragraph. As I found out later, this is only the default behaviour and can be changed in the options (enabling Heuristic Processing does a nice job).
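Heuristic Processing can also be switched on from the command line, since Calibre ships an ebook-convert tool that accepts an --enable-heuristics flag. The sketch below only builds and runs that command; the function names and file names are placeholders, and it assumes Calibre's command-line tools are installed and on the PATH.

```python
import shutil
import subprocess


def build_convert_command(pdf_path, epub_path):
    """Return the ebook-convert command line with heuristic processing enabled."""
    return ["ebook-convert", pdf_path, epub_path, "--enable-heuristics"]


def convert(pdf_path, epub_path):
    """Run the conversion, failing early if Calibre's CLI is not available."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH")
    subprocess.run(build_convert_command(pdf_path, epub_path), check=True)


# Example call (hypothetical file names):
# convert("book.pdf", "book.epub")
```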
This is a list of problems that I encountered and solved:
- I didn't know how the EPUB format is structured
- The usual way of extracting an archive doesn't work for EPUB files on OS X
- Each line was transformed into a paragraph
- Calibre generated 5 HTML files, split at seemingly random points. Normally there should be one file per chapter (my book had 15)
- Words hyphenated at the end of a line kept their hyphen, which ended up at the end of a paragraph
- Each PDF page had a running title (chapter and book), which was transformed into paragraphs/headings
- Each PDF page had a number, which was transformed into text
- Each PDF page had some hidden trash information, which was transformed into text paragraphs
- Typographic ligatures were transformed into letters separated by a space
- The table of contents was not generated
- Vector images with text were transformed into paragraphs of text
- Lists (ordered and unordered) were treated as paragraphs
- Custom list styles were lost
I didn't know how the EPUB format is structured
This time Wikipedia gave a really nice description of the EPUB format. The article "Building ePub with PHP and Markdown" also has a nice description of the file contents.
In fact, an EPUB is an HTML website in a zip archive with a few additional structural files.
The usual way of extracting an archive doesn't work for EPUB files on OS X
This time I just used the ePub zip/unzip tool and didn't rack my brain over the CLI.
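For anyone who prefers a script to a GUI tool, unpacking and repacking can be sketched with Python's standard zipfile module. The one detail that matters when repacking is that the mimetype file has to be the first entry in the archive and has to be stored uncompressed. The function names and paths here are illustrative, not part of any tool mentioned above.

```python
import os
import zipfile


def unpack_epub(epub_path, out_dir):
    """Extract every file of the EPUB (a plain zip archive) into out_dir."""
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(out_dir)


def pack_epub(src_dir, epub_path):
    """Zip src_dir back into an EPUB, writing 'mimetype' first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as zf:
        zf.write(os.path.join(src_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, src_dir)
                if arcname == "mimetype":
                    continue  # already written as the first entry
                zf.write(full_path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

A typical round trip would be unpack_epub("book.epub", "book_src"), edit the HTML files, then pack_epub("book_src", "book_fixed.epub").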
Each line was transformed into a paragraph
I solved this problem using a RegEx which joined paragraphs that did not end with specific symbols such as a full stop or an exclamation mark. It is much easier, though, to enable Heuristic Processing in Calibre, which joins paragraphs based on the distance between lines.
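A minimal sketch of such a join, assuming every line was wrapped in its own <p> element and that a paragraph really ends only after sentence-ending punctuation. The set of terminating characters is a guess and would need tuning for a particular book.

```python
import re

# A paragraph break whose preceding character does not end a sentence.
BREAK_INSIDE_SENTENCE = re.compile(r'(?<![.!?:"”])</p>\s*<p[^>]*>')


def join_broken_paragraphs(html):
    """Replace paragraph breaks that fall in the middle of a sentence with a space."""
    return BREAK_INSIDE_SENTENCE.sub(" ", html)


sample = "<p>This line was cut</p>\n<p>in the middle.</p>\n<p>Next paragraph.</p>"
print(join_broken_paragraphs(sample))
```

Headings and short lines without final punctuation would be merged too, so the result still needs a manual check.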
Calibre generated 5 HTML files, split at seemingly random points
I merged these files into one to make them easier to work with. At the end I split this file into 15 files (one file per chapter).
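The final split can be scripted once every chapter starts with a recognisable marker. The sketch below assumes each chapter begins with an <h1> heading, which is an assumption about the cleaned-up markup and not something Calibre guarantees. It also needs Python 3.7 or newer, where re.split accepts zero-width matches.

```python
import re


def split_chapters(body_html):
    """Split the merged body so that each piece starts with its <h1> heading."""
    parts = re.split(r'(?=<h1\b)', body_html)
    return [part.strip() for part in parts if part.strip()]


merged = "<h1>One</h1><p>a</p><h1>Two</h1><p>b</p>"
for number, chapter in enumerate(split_chapters(merged), start=1):
    print(number, chapter)
```

Each piece still has to be wrapped in a complete HTML document and listed in the EPUB's package file before readers will pick it up.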
Words hyphenated at the end of a line kept their hyphen, which ended up at the end of a paragraph
This problem was also solved with a RegEx, but Heuristic Processing will do it for you.
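A sketch of what such a RegEx could look like, assuming the hyphen sits directly before </p> and the rest of the word opens the next <p>. It only joins lowercase letters to reduce false positives, but a genuine compound such as "well-known" that happened to break at the line end would still lose its hyphen.

```python
import re

# A lowercase letter, a hyphen, a paragraph break, then another lowercase letter.
SPLIT_WORD = re.compile(r'([a-z])-</p>\s*<p[^>]*>([a-z])')


def join_hyphenated_words(html):
    """Glue the two halves of a word that was hyphenated across a paragraph break."""
    return SPLIT_WORD.sub(r'\1\2', html)


print(join_hyphenated_words("<p>a con-</p>\n<p>verted book</p>"))
```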
Each PDF page had a running title (chapter and book), which was transformed into paragraphs/headings
I just removed all of them using a RegEx.
Later I manually added all 15 chapter titles at the beginning of their chapters.
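One way to sketch the removal, assuming the running titles come out as <p> or <h1> to <h6> elements whose whole text is exactly the book or chapter title. The titles passed in below are made up for the example.

```python
import re


def strip_running_headers(html, titles):
    """Remove paragraphs/headings that contain nothing but one of the given titles."""
    for title in titles:
        pattern = re.compile(
            r'<(p|h[1-6])\b[^>]*>\s*' + re.escape(title) + r'\s*</\1>\s*'
        )
        html = pattern.sub("", html)
    return html


page = "<h2>My Book</h2>\n<p>Text.</p>\n<p>Chapter One</p>\n<p>More.</p>"
print(strip_running_headers(page, ["My Book", "Chapter One"]))
```

Because this also deletes the real chapter headings, the titles have to be put back at the start of each chapter afterwards.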
Each PDF page had a number and some hidden trash information which were transformed into text
212 9781591888884_BookTitle_TX_p1-230.indd 212 9781591888884_BookTItle_TX_p1-230.indd 212 2/21/12 12:14 AM 2/21/12 12:14 AM
I just removed all of them using a RegEx.
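A sketch of a clean-up for lines like the sample above, assuming each piece of debris ends up in its own <p> element. It drops paragraphs that contain only digits and paragraphs that mention an .indd file name; both patterns are derived from that one sample and may need adjusting.

```python
import re

# A paragraph that contains nothing but a page number.
PAGE_NUMBER = re.compile(r'<p[^>]*>\s*\d+\s*</p>\s*')
# A paragraph that mentions an .indd file name, as in the sample line.
SLUG_LINE = re.compile(r'<p[^>]*>[^<]*\.indd\b[^<]*</p>\s*')


def strip_page_debris(html):
    """Remove page-number paragraphs and .indd slug lines."""
    html = SLUG_LINE.sub("", html)
    return PAGE_NUMBER.sub("", html)


page = (
    "<p>Real text.</p>\n"
    "<p>212</p>\n"
    "<p>212 9781591888884_BookTitle_TX_p1-230.indd 212 2/21/12 12:14 AM</p>\n"
    "<p>More text.</p>"
)
print(strip_page_debris(page))
```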
Typographic ligatures were transformed into letters separated by a space
I just searched for the most common ligatures and removed the spaces between their letters.
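A blind replace of "f i" with "fi" would also glue together phrases like "of interest", so the sketch below only joins two fragments when the result is a known word. It assumes the stray space falls after the first "f" of the ligature (for example "f ind" or "of fice"), and the vocabulary set is a hypothetical stand-in for a real word list.

```python
import re

# A fragment ending in "f", one space, then a fragment starting with f, i or l.
SPLIT_LIGATURE = re.compile(r'\b(\w*f) ([fil]\w*)\b')


def fix_split_ligatures(text, vocabulary):
    """Join fragments around a broken ligature when the joined word is known."""
    def repair(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in vocabulary else match.group(0)
    return SPLIT_LIGATURE.sub(repair, text)


words = {"find", "office"}
print(fix_split_ligatures("We f ind the of fice of interest.", words))
```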
The table of contents was not generated
I created it by hand.
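With 15 chapters, writing the entries by hand is manageable, but they can also be generated. The sketch below prints navPoint entries in the shape used by an EPUB 2 toc.ncx file; it assumes the book uses an NCX table of contents, and the chapter titles and file names are placeholders.

```python
from xml.sax.saxutils import escape


def nav_points(chapters):
    """Build NCX navPoint entries from (title, href) pairs in reading order."""
    entries = []
    for order, (title, href) in enumerate(chapters, start=1):
        safe_href = escape(href, {'"': "&quot;"})
        entries.append(
            f'<navPoint id="chapter{order}" playOrder="{order}">\n'
            f'  <navLabel><text>{escape(title)}</text></navLabel>\n'
            f'  <content src="{safe_href}"/>\n'
            f'</navPoint>'
        )
    return "\n".join(entries)


print(nav_points([("Intro & Setup", "ch01.html"), ("Next Steps", "ch02.html")]))
```

The generated entries belong inside the navMap element of toc.ncx.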
Vector images with text were transformed into paragraphs of text
I took large screenshots of these images and embedded them into the text.
Lists (ordered and unordered) were treated as paragraphs
I just found them all and replaced them by hand.
Custom list styles were lost
The same as for lists: I found a pattern and added CSS classes through Find and Replace using a RegEx. Then I just added the necessary CSS to the main stylesheet.
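As an illustration of that Find and Replace step, the sketch below assumes the custom items can be recognised by a marker character at the start of an <li> element. The marker, the class name and the CSS rule are all invented for the example.

```python
import re


def tag_custom_items(html, marker, css_class):
    """Give list items that start with the marker a CSS class and drop the marker."""
    pattern = re.compile(r'<li>\s*' + re.escape(marker) + r'\s*')
    return pattern.sub(f'<li class="{css_class}">', html)


# Hypothetical rule to add to the main stylesheet for the new class.
CUSTOM_LIST_CSS = "li.starred { list-style-type: square; }"

items = "<ul><li>* Done</li><li>Plain</li></ul>"
print(tag_custom_items(items, "*", "starred"))
```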