Pino

I (infrequently) blog about data science, business intelligence, big data, web technologies and free software.

I’ve been struggling to generate PDF files from simple HTML code that contained non-ASCII characters. The tools I was using were wkhtmltopdf (version 0.12.2, which doesn’t need an X server) and python-pdfkit. Unfortunately, I spent quite some time chasing the wrong solutions. These are the things I looked at:

  • Python. A usual suspect, but this time it was not a case of “Unicode madness”.
  • An open issue regarding a problem with certain characters in python-pdfkit.
  • Using different fonts. Along the way I found that there’s another project also called pdfkit (for Node) that has nothing to do with wkhtmltopdf or with pdfkit for Python/Ruby.

After testing different websites (some working and some failing) I realised that I just needed to include a meta tag specifying that UTF-8 was the correct charset. It’s not surprising that some websites (and my own HTML code) did not include this attribute because this is what w3schools says about it:

If an HTML5 web page uses a different character set than UTF-8, it should be specified in the <meta> tag.

Sometimes easy solutions are difficult to find! Below you can find a working example.

import pdfkit

html_content = """
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>&euro;</p>
<p>áéíóúñö</p>
</body>
</html>
"""


pdfkit.from_string(html_content, 'out.pdf')
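
When the HTML comes from somewhere else (a template, a scraped page), it may be handy to add the charset declaration automatically before handing the string to pdfkit. The following is only a rough sketch: the helper name ensure_utf8_meta and its regular expressions are my own illustration, not part of pdfkit or wkhtmltopdf, and it assumes the input is simple enough to be patched with a regex rather than a real HTML parser.

```python
import re

# Matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_CHARSET_RE = re.compile(r"<meta\b[^>]*\bcharset\s*=", re.IGNORECASE)
# Matches the opening <head> tag (but not <header>).
_HEAD_RE = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def ensure_utf8_meta(html):
    """Hypothetical helper: return html with a UTF-8 charset declaration.

    The string is returned unchanged if any charset is already declared.
    """
    if _CHARSET_RE.search(html):
        return html
    meta = '<meta charset="utf-8">'
    head = _HEAD_RE.search(html)
    if head:
        # Insert the declaration right after the opening <head> tag.
        return html[:head.end()] + "\n" + meta + html[head.end():]
    # No <head> found: assume a bare fragment such as "<p>...</p>"
    # and prepend a minimal head section.
    return "<head>" + meta + "</head>\n" + html
```

With this in place, the call from the example above would become pdfkit.from_string(ensure_utf8_meta(html_content), 'out.pdf').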