Tuesday, May 18, 2010

Convert .html to .pdf in gnu/linux

There are various options for converting .html files to .pdf in a gnu/linux operating system. Your choice of methods will depend on the complexity of the file you wish to convert, and your familiarity with the tools a gnu/linux system provides.

What you'll need:

  • Gnu/linux operating system

  • Html file

  • Web browser


Optional:

  • Openoffice.org office suite

  • wget

  • html2ps

  • ps2pdf


Simply "Print to file"

One very simple option for creating a .pdf file from an .html file is to open the file in your browser and choose "Print". When the print dialog appears, choose "Print to File" and select "PDF" as the output format. This will write the html file out in pdf format.

Here is a pdf of this article generated in this fashion: converthtml2pdfgnulinux.pdf

OpenOffice.org

"Print to File" works well for basic html files with simple text and some images. If the html file in question has more complex formatting, this option may not always produce the best results. Luckily, other options exist.

Save the html file to your computer (if you haven't already done so), and open it with OpenOffice.org's html editor (ooweb). Then go to the "File" menu and choose "Export". OpenOffice.org will offer the usual options for saving a file, such as where to save it and what to name it, and, presto-magico, will produce a .pdf file from your .html file.

Command Line

Of course, no linux how-to article would be complete without instructions on how to accomplish your task using only the magical Bash command line interface. For those so inclined, the following is a complete process for acquiring an .html file and converting it to a .pdf file. To proceed with this method, the following software must be installed on your computer: wget, html2ps, and ps2pdf (which ships with Ghostscript). These programs are either already part of most gnu/linux distributions by default, or can easily be installed with your favorite package manager (apt, yum, pacman, portage, etc.).
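
Before starting, it can help to confirm that the three programs are actually installed. The following is a minimal sketch; missing_tools is a hypothetical helper name made up for this example, and it relies only on the shell's built-in "command -v" lookup.

```shell
#!/bin/bash

# Print the name of every listed command that cannot be found on the PATH.
# missing_tools is a hypothetical helper name, used here only for illustration.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can locate the command
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: list whatever still needs installing (prints nothing if all are found).
# missing_tools wget html2ps ps2pdf
```

Anything the function prints can then be installed with your package manager before you continue.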

First, let's save the file to your computer:
wget http://www.somesite.com/yourfile.html

Next, let's convert the .html file to a postscript or .ps file:
html2ps yourfile.html > yourfile.ps

Then, we'll convert the postscript file, finally, to a .pdf file:
ps2pdf yourfile.ps

Voila!
You should now have "yourfile.pdf".
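
If you have no use for the intermediate .ps file, the last two steps can be joined with a pipe. This is a sketch rather than the only way to do it: html_to_pdf is a hypothetical function name, and it assumes html2ps writes PostScript to standard output (as in the step above) and that ps2pdf is given "-" to read from standard input.

```shell
#!/bin/bash

# Convert an .html file to .pdf without leaving a .ps file behind.
# html_to_pdf is a hypothetical helper name, not part of either tool.
html_to_pdf() {
    local html=$1
    local pdf="${2:-${html%.html}.pdf}"   # default: yourfile.html -> yourfile.pdf

    # Stop early if either converter is missing.
    command -v html2ps >/dev/null 2>&1 || { echo "html2ps not found" >&2; return 1; }
    command -v ps2pdf  >/dev/null 2>&1 || { echo "ps2pdf not found"  >&2; return 1; }

    # "-" tells ps2pdf to read the PostScript from standard input.
    html2ps "$html" | ps2pdf - "$pdf"
}

# Usage: html_to_pdf yourfile.html
```

A second argument, if given, is used as the name of the .pdf file instead of the default.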

This could, of course, all be scripted.

#!/bin/bash

# convert webpages to pdf files
# get url
echo "Enter the url of the page to be converted:"
read -r page
# download page, and stop if the download fails
wget "$page" || exit 1

# name of the downloaded file, e.g. yourfile.html
file=$(basename "$page")
# convert to postscript
html2ps "$file" > "$file.ps"
# convert to pdf, giving the output name directly (yourfile.html becomes yourfile.pdf)
ps2pdf "$file.ps" "${file%.html}.pdf"
# clean up extraneous files
rm -f "$file" "$file.ps"

echo "done"

exit
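
The name of the finished .pdf can be worked out from the url with the shell's own parameter expansion, which also makes it easy to cope with pages ending in .htm. A small sketch follows; pdf_name_for_url is a hypothetical helper name, and it assumes the url ends in a file name rather than a slash or a query string.

```shell
#!/bin/bash

# Derive the output .pdf name from a page url.
# pdf_name_for_url is a hypothetical helper, shown only for illustration.
pdf_name_for_url() {
    local file
    file=$(basename "$1")    # http://www.somesite.com/yourfile.html -> yourfile.html
    file=${file%.html}       # drop a trailing .html, if there is one
    file=${file%.htm}        # likewise for .htm
    printf '%s.pdf\n' "$file"
}

# Usage: pdf_name_for_url http://www.somesite.com/yourfile.html
```

The result can be passed to ps2pdf as its second argument to set the output file name.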


Here is a pdf of this article, generated via this command line method: convertweb2pdflinux.pdf
Notice that it differs from the pdf above, created with "Print to file".
One difference, which may be either welcome or unwanted depending on your goals, is that text in this file can be selected and copied, which is not true of the first file.

XHTML2PDF

In many cases, you may wish to create a pdf file from a complex .html or .xhtml file that includes .css (cascading style sheets) or other elements that the above methods will not render faithfully, so the resulting file does not look the way the page does in a browser.

For those cases, there is a program called xhtml2pdf. This program is less likely to be part of a gnu/linux distribution by default, or to be available from its repositories. As such, you may have to download and install it by hand. Thankfully, the site for this program is easily found at http://www.xhtml2pdf.com/, and, of course, the program is free, open source software.
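
Once installed, a conversion might look like the sketch below. It assumes the installation puts a command named xhtml2pdf on your PATH and that the command accepts a source file followed by a destination file; check the program's own help output if your version differs. convert_with_xhtml2pdf is a hypothetical wrapper name.

```shell
#!/bin/bash

# Convert a page with xhtml2pdf, assuming its command-line tool is installed
# and on the PATH under the name "xhtml2pdf".
# convert_with_xhtml2pdf is a hypothetical wrapper name.
convert_with_xhtml2pdf() {
    local src=$1
    local dest="${2:-${src%.*}.pdf}"   # default: page.xhtml -> page.pdf

    # Stop with a message if the program is not installed.
    command -v xhtml2pdf >/dev/null 2>&1 || {
        echo "xhtml2pdf not found; install it first" >&2
        return 1
    }

    # Assumed invocation: source file first, destination file second.
    xhtml2pdf "$src" "$dest"
}

# Usage: convert_with_xhtml2pdf yourfile.html
```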

And, of course, here is a pdf of this article generated with xhtml2pdf: xhtml2pdfconversion.pdf

There's more!

Yet other methods exist for generating .pdf files from .html files, of course, and an attempt to compile an exhaustive list, with instructions for each, would be beyond the scope of this article.
