Fun with FreeType and libotf

Just a short post to share a wonderful C library I recently came across: libotf, which provides a really nice API for working with OpenType fonts. Of course, LuaTeX already has excellent support for OpenType fonts through the use of code from FontForge and the excellent fontspec package by Will Robertson and Khaled Hosny. So, for sure, with LuaTeX you don’t need to leverage the services libotf provides, but it does offer an additional route to explore OpenType fonts and access OpenType font features in a direct way, which can be extremely instructive. The only downside is that the libotf API is not documented in great detail: you have to rely on comments within one of the header files (otf.h) and reading the source code of the examples… plus a bit of trial and error.

I use Microsoft’s Visual Studio for my C programming hobby (except for compiling LuaTeX), which can make for some “interesting challenges” when using C libraries that originate from the Linux world, particularly where there are complex dependencies on many other libraries (“dependency hell”). Thankfully, libotf has only one dependency, FreeType, which itself builds really cleanly and easily using Visual Studio. libotf is also fairly straightforward to compile as a Windows library (.lib).

A tip for compiling on Windows: after building libotf I found that the API calls kept failing and tracked the problem down to one line of the libotf source code (note that I am using libotf version 0.9.12). In otfopen.c there is one line that you’ll need to change on Windows.

Line 2974 of otfopen.c calls fopen without binary mode, so for Windows change

fp = fopen (otf_name, "r");

to

fp = fopen (otf_name, "rb");

and that seems to have fixed all the problems. If only all ports were that easy!

To use libotf/FreeType as a DLL plug-in with LuaTeX you will, of course, need to use the Lua C API to create a Lua “binding”, something I’m not going to cover here.

A nice UTF-8 decoder

If you want to explore passing UTF-8 string data between LuaTeX and your C code/library, you may want to convert the UTF-8 data back into Unicode code points (reversing the UTF-8 encoding process discussed in this post). To do that you’ll need a UTF-8 decoder: here is a nice implementation of a UTF-8 decoder in C. Examples, source code and explanations are available from The Flexible and Economical UTF-8 Decoder. Just note that, irrespective of the decoder you use, make sure you read up on and are aware of UTF-8 security exploits.

Unicode, Glyph IDs and OpenType: a brief introduction

As you read about OpenType fonts and Unicode you come across terms such as “Glyph IDs”, Unicode characters/code points and suchlike. And this can be a bit puzzling: what’s the relationship between them? In this post I’ll try to give a brief introduction, with the usual caveat that I’m skipping vast amounts of detail in the interests of simplicity.

Just as a reminder, one extremely important concept to understand/appreciate is the difference between characters and glyphs. I’ve discussed this in a previous post but will summarise here (quoting from the Unicode standard):

  • Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation.
  • Glyphs represent the shapes that characters can have when they are rendered or displayed. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters.

I’ll try to expand on this a little. Among the many things that the Unicode standard provides is a universal encoding of the world’s character set: in essence, allocating a unique number to the characters covered by the standard. Unicode does not concern itself with the visual representation of those characters; that is the job of fonts: they provide glyphs.

Today, OpenType font technology is the dominant font standard and is supported by modern TeX engines such as LuaTeX and XeTeX. However, as you explore OpenType in more detail you start to see references to terms such as “Glyph ID” or “glyph index” and may wonder how, or if, these relate to the Unicode character encoding (code points). The two key points to understand are:

  • OpenType is concerned with glyphs.
  • Unicode is concerned with characters.

For present purposes we can take the very simplistic view that an OpenType font is a container for a large collection of glyphs in the form of the lines and curves required to draw (render) them. Of course, OpenType fonts can provide a lot more than just the glyphs themselves. OpenType fonts can provide extensive support for high-quality typesetting via “features” and “lookups”, which provide information that a typesetting or rendering engine can use to do its job (think of them as a set of “rules” for the typesetting/rendering engine to apply).

However, for now just think of an OpenType font as containing a set of glyphs, where each glyph has a name and a numeric identifier called its Glyph ID. The Glyph ID is simply a number allocated to each glyph (e.g., by the font’s creator) ranging from 0 to N-1, where N is the number of glyphs contained in the particular font. The point is that the Glyph ID has nothing to do with the Unicode encoding or code points: it’s just an internal bookkeeping number used within the font.

So, we have two sets of numbers: a universal standard for the world of Unicode characters (code points) and another arbitrary set of numbers (specific to each font) for the internal world of OpenType glyphs: the Glyph ID. So the question arises: how and where are these two universes joined together? The answer is that the magic glue is contained within the OpenType font itself: the so-called cmap table or, to give its full name, the Character To Glyph Index Mapping Table.

As the specification says:

“This table defines the mapping of character codes to the glyph index values used in the font.”

Even a brief perusal of the OpenType specification will make it clear that it’s a complex beast and certainly not a topic for detailed discussion here. However, the cmap table is the “secret sauce” within an OpenType font which glues together the Unicode world of characters with the OpenType world of glyphs.

Note: OpenType fonts can contain multiple cmap tables for different encodings and may also contain a significant number of glyphs which are not covered by the cmap table. OpenType fonts may contain many different glyphs (representations) for a particular character, and these visual variations fall outside the remit of the Unicode standard. For example, small caps, oldstyle numbers, swash characters and so forth differ only in visual design; they do not carry additional semantic meaning.

One excellent Windows utility for inspecting cmap tables is the free SIL ViewGlyph — Font Viewing Program. The following screenshot displays the cmap table from arabtype.ttf shipped with Windows.

Open a font and choose Options --> View cmap.

The screenshot clearly shows Unicode character code points in the first column, with the second column displaying the Glyph ID mapped via the cmap table.

The following screenshot from FontLab Studio displays some glyphs in arabtype.ttf listed in order of Glyph ID (or “index” as FontLab Studio refers to it).

Whilst FontLab Studio is a very nice piece of software it is quite expensive. A free alternative to FontLab Studio is the excellent FontForge.

Digging deeper

Another superb resource for exploring the low-level details of OpenType fonts is the Adobe Font Development Kit for OpenType which is a free download for Windows and Macintosh. One of the utilities it provides is an excellent command line tool called TTX which will generate an XML text file representation of an entire OpenType font file (or just those parts you are interested in).

One extremely useful TTX command line option is -s which will dump the “components” of an OpenType font to individual XML files. For example, the exquisite OpenType Arabic font shipped with Windows, Arabic Typesetting, by Mamoun Sakkal, Paul C. Nelson (sorry could not find a link!) and John Hudson can be exported to XML via

ttx -s arabtype.ttf

which will produce more than 20 XML files containing data from numerous tables within the font.


Dumping "arabtype.ttf" to "arabtype.ttx"...
Dumping 'GlyphOrder' table...
Dumping 'head' table...
Dumping 'hhea' table...
Dumping 'maxp' table...
Dumping 'OS/2' table...
Dumping 'hmtx' table...
Dumping 'LTSH' table...
Dumping 'VDMX' table...
Dumping 'hdmx' table...
Dumping 'cmap' table...
Dumping 'fpgm' table...
Dumping 'prep' table...
Dumping 'cvt ' table...
Dumping 'loca' table...
Dumping 'glyf' table...
Dumping 'name' table...
Dumping 'post' table...
Dumping 'gasp' table...
Dumping 'GDEF' table...
Dumping 'GPOS' table...
Dumping 'GSUB' table...
Dumping 'DSIG' table...

The ones of interest here are the GlyphOrder table and the cmap table. The GlyphOrder table will show you the complete list of glyphs, including their names, ordered by Glyph ID, and the cmap table shows you the character-to-glyph mappings (using the glyph names).

Unicode for the impatient (Part 3: UTF-8 bits, bytes and C code)

I promised to finish the series on Unicode and UTF-8, so here is the final instalment, better late than never. Before reading this article I suggest that you read Part 1 and Part 2, which cover some important background. As usual, I’m trying to avoid simply repeating the huge wealth of information already published on this topic, but (hopefully) this article will provide a few additional details which may assist with understanding. Additionally, I’m missing out a lot of detail and not taking a “rigorous” approach in my explanations, so I’d be grateful to know whether readers find it useful.

Reminder on code points: The Unicode encoding scheme assigns each character with a unique integer in the range 0 to 1,114,111; each integer is called a code point.

The “TF” in UTF-8 stands for Transformation Format so, in essence, you can think of UTF-8 as a “recipe” for converting (transforming) a Unicode code point value into a sequence of 1 to 4 byte-sized chunks. Converting the smallest code points (00 to 7F) to UTF-8 generates 1 byte whilst the higher code point values (10000 to 10FFFF) generate 4 bytes.

For example, the Arabic letter ش (“sheen”) is allocated the Unicode code point value 0634 (hex) and its representation in UTF-8 is the two-byte sequence D8 B4 (hex). In the remainder of this article I will use examples from the Unicode encoding for Arabic, which is split into 4 blocks within the Basic Multilingual Plane.

Aside: refresher on hexadecimal: In technical literature discussing computer storage of numbers you will likely come across binary, octal and hexadecimal number systems. Consider the decimal number 251, which can be written as 251 = 2 x 10^2 + 5 x 10^1 + 1 x 10^0 = 200 + 50 + 1. Here we are breaking 251 down into powers of 10: 10^2, 10^1 and 10^0. We call 10 the base. However, we can use other bases, including 2 (binary), 8 (octal) and 16 (hexadecimal). Note: x^0 = 1 for any value of x not equal to 0.

Starting with binary (base 2) we can write 251 as

2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
 1    1    1    1    1    0    1    1

If we use 8 as the base (called octal), 251 can be written as

8^2  8^1  8^0
 3    7    3

= 3 x 8^2 + 7 x 8^1 + 3 x 8^0
= 3 x 64 + 7 x 8 + 3 x 1

If we use 16 as the base (called hexadecimal), 251 can be written as

16^1  16^0
 15    11

Ah, but writing 251 as “1511” in hex (= 15 x 16^1 + 11 x 16^0) is very confusing and problematic. Consequently, for numbers between 10 and 15 we choose to represent them in hex as follows

  • A=10
  • B=11
  • C=12
  • D=13
  • E=14
  • F=15

Consequently, 251 written in hex is represented as F x 16^1 + B x 16^0, so that 251 = FB in hex. Each byte can be represented by a pair of hex digits.

So where do we start?

To convert code points into UTF-8 byte sequences, the code points are divided into the following ranges, and the UTF-8 conversion pattern shown in the following table maps each code point value into a series of bytes.

Code point range   Code point binary sequences   UTF-8 bytes
00 to 7F           0xxxxxxx                      0xxxxxxx
0080 to 07FF       00000yyy yyxxxxxx             110yyyyy 10xxxxxx
0800 to FFFF       zzzzyyyy yyxxxxxx             1110zzzz 10yyyyyy 10xxxxxx
010000 to 10FFFF   000wwwzz zzzzyyyy yyxxxxxx    11110www 10zzzzzz 10yyyyyy 10xxxxxx

Source: Wikipedia

Just a small point but you’ll note that the code points in the table have a number of leading zeros, for example 0080. Recalling that a byte is a pair of hex digits, the leading zeros help to indicate the number of bytes being used to represent the numbers. For example, 0080 is two bytes (00 and 80) and you’ll see that in the second column where the code point is written out in its binary representation.

A note on storage of integers: An extremely important topic, but not one I’m going to address in detail, is the storage of different integer types on various computer platforms: issues include the lengths of integer storage units and endianness. The interested reader can start with these articles on Wikipedia:

  1. Integer (computer science)
  2. Short integer
  3. Endianness

For simplicity, I will assume that the code point range 0080 to 07FF is stored in a 16-bit storage unit called an unsigned short integer.

The code point range 010000 to 10FFFF contains code points that need a maximum of 21 bits of storage (100001111111111111111 for 10FFFF) but in practice they would usually be stored in a 32-bit unsigned integer.

Let’s walk through the process for the Arabic letter ش (“sheen”) which is allocated the Unicode code point of 0634 (in hex). Looking at our table, 0634 is in the range 0080 to 07FF so we need to transform 0634 into 2 UTF-8 bytes.

Tip for Windows users: The calculator utility shipped with Windows will generate bit patterns for you from decimal, hex and octal numbers.

Looking back at the table, we note that the UTF-8 bytes are constructed from ranges of bits contained in our code points. For example, referring to the code point range 0080 to 07FF, the first UTF-8 byte 110yyyyy contains the bit range yyyyy from our code point. Recalling our (simplifying) assumption that we are storing numbers 0080 to 07FF in 16-bit integers, the first step is to write 0634 (hex) as a pattern of bits, which is the 16-bit pattern 0000011000110100.

Our task is to “extract” the bit patterns yyyyy and xxxxxx so we place the appropriate bit pattern from the table next to our code point value:

0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0
0 0 0 0 0 y y y y y x x x x x x

By doing this we can quickly see that

yyyyy = 11000

xxxxxx= 110100

The UTF-8 conversion “template” for this code point value yields two separate bytes according to the pattern

110yyyyy 10xxxxxx

Hence we can write the UTF-8 bytes as 11011000 10110100 which, in hex notation, is D8 B4.

So, to transform the code point value 0634 into UTF-8 we have to generate 2 bytes by isolating the individual bit patterns of our code point value and using those bit patterns to construct two individual UTF-8 bytes. And the same general principle applies whether we need to create 2, 3 or 4 UTF-8 bytes for a particular code point: just follow the appropriate conversion pattern in the table. Of course, the conversion is trivial for 00 to 7F and is just the value of the code point itself.

How do we do this programmatically?

In C this is achieved by “bit masking” and “bit shifting”, which are fast, low-level operations. One simple algorithm could be:

  1. Apply a bit mask to the code point to isolate the bits of interest.
  2. If required, apply a right shift operator (>>) to “shuffle” the bit pattern to the right.
  3. Add the appropriate quantity to give the UTF-8 value.
  4. Store the result in a byte.

Bit masking

Bit masking uses the binary AND operator (&) which has the following properties:

1 & 1 = 1
1 & 0 = 0
0 & 1 = 0
0 & 0 = 0

We can use this property of the & operator to isolate individual bit patterns in a number by using a suitable bit mask which zeros out all but the bits we want to keep. From our table, code point values in the range 0080 to 07FF have a general 16-bit pattern represented as

00000yyyyyxxxxxx

We want to extract the two series of bit patterns: yyyyy and xxxxxx from our code point value so that we can use them to create two separate UTF-8 bytes:

UTF-8 byte 1 = 110yyyyy
UTF-8 byte 2 = 10xxxxxx

Isolating yyyyy

To isolate yyyyy we can use the following bit mask with the & operator

0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0

This masking value is 0000011111000000 = 0x07C0 (hex number in C notation).

0 0 0 0 0 y y y y y x x x x x x Generic bit pattern
& & & & & & & & & & & & & & & & Binary AND operator
0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 Bit mask
0 0 0 0 0 y y y y y 0 0 0 0 0 0 Result of operation

Note that the result of the masking operation for yyyyy leaves this bit pattern “stranded” in the middle of the number. So, we need to “shuffle” yyyyy along to the right by 6 places. To do this in C we use the right shift operator >>.

Isolating xxxxxx

To isolate xxxxxx we can use the following bit mask with the & operator:

0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1

The masking value is 0000000000111111 = 0x003F (hex number in C notation).

0 0 0 0 0 y y y y y x x x x x x Generic bit pattern
& & & & & & & & & & & & & & & & Binary AND operator
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 Bit mask
0 0 0 0 0 0 0 0 0 0 x x x x x x Result of operation

The result of bit masking for xxxxxx leaves it at the right so we do not need to shuffle via the right shift operator >>.

Noting that
110yyyyy = 11000000 + 000yyyyy = 0xC0 + 000yyyyy

and that
10xxxxxx = 10000000 + 00xxxxxx = 0x80 + 00xxxxxx

we can summarize the process of transforming a code point between 0080 and 07FF into 2 bytes of UTF-8 data with a short snippet of C code.

unsigned char arabic_utf_byte1;
unsigned char arabic_utf_byte2;
unsigned short p; // our code point between 0080 and 07FF

arabic_utf_byte1= (unsigned char)(((p & 0x07c0) >> 6) + 0xC0);
arabic_utf_byte2= (unsigned char)((p & 0x003F) + 0x80);

Which takes a lot less space than the explanation!

Other Arabic code point ranges

We have laboriously worked through the UTF-8 conversion process for code points which span the range 0080 to 07FF, a range which includes the “core” Arabic character code point range of 0600 to 06FF and the Arabic Supplement code point range of 0750 to 077F.

There are two further ranges we need to explore:

  • Arabic presentation forms A: FB50 to FDFF
  • Arabic presentation forms B: FE70 to FEFF

Looking back to our table, these two Arabic presentation form ranges fall within 0800 to FFFF, so we need to generate 3 bytes to encode them into UTF-8. The principles follow the reasoning above, so I will not repeat that here but simply offer some sample C code. Note that there is no error checking whatsoever in this code; it is simply meant to be an illustrative example and certainly needs to be improved for any form of production use.

You can download the C source and a file “arabic.txt” which contains a sample of output from the code below. I hope it is useful.

#include <stdio.h>

void presentationforms(unsigned short min, unsigned short max, FILE* arabic);
void coreandsupplement(unsigned short min, unsigned short max, FILE* arabic);

int main(void) {

	FILE *arabic = fopen("arabic.txt", "wb");

	coreandsupplement(0x600, 0x6FF, arabic);
	coreandsupplement(0x750, 0x77F, arabic);
	presentationforms(0xFB50, 0xFDFF, arabic);
	presentationforms(0xFE70, 0xFEFF, arabic);

	fclose(arabic);

	return 0;
}

void coreandsupplement(unsigned short min, unsigned short max, FILE* arabic)
{

	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;
	unsigned short p;

	for(p = min; p <= max; p++)
	{
		arabic_utf_byte1=  (unsigned char)(((p & 0x07c0) >> 6) + 0xC0);
		arabic_utf_byte2= (unsigned char)((p & 0x003F) + 0x80);
		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic); 
	}
	
	return;

}


void presentationforms(unsigned short min, unsigned short max, FILE* arabic)
{
	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;
	unsigned char arabic_utf_byte3;
	unsigned short p;

	for(p = min; p <= max; p++)
	{
		arabic_utf_byte1 = (unsigned char)(((p & 0xF000) >> 12) + 0xE0);
		arabic_utf_byte2 = (unsigned char)(((p & 0x0FC0) >> 6) + 0x80);
		arabic_utf_byte3 = (unsigned char)((p & 0x003F)+ 0x80);

		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic); 
		fwrite(&arabic_utf_byte3,1,1,arabic); 
	}

	return;

}

Hopefully useful example of \directlua{} expansion

The following example may help to understand a little more about \directlua{} expansion.

\documentclass[11pt,twoside]{article}
\begin{document}

\def\xx{Hello}
\def\yy{(}
\def\zz{)}
\newcommand{\hellofromTeX}[1]{

\directlua{

function Hello(str)
tex.print(str)
end

\xx\yy#1\zz
}}

 

\hellofromTeX{"Hello World"}
\end{document}

Here’s how it works.
Code within \directlua{} is expanded according to TeX rules and then sent to the Lua interpreter. In the above, LuaTeX “sees” \xx\yy#1\zz and expands it as follows:


\xx --> Hello
\yy --> (
#1 --> "Hello World"
\zz --> )

After this expansion, the code is fed to the Lua interpreter, which sees Hello("Hello World") and executes


function Hello(str)
tex.print(str)
end

The function tex.print("Hello World") is called and typesets some text.

Uploading graphics from Word 2007 directly to my blog

Carrying on the experiments. How well does Word render native objects to blog posts? Here are some Word charts and graphics. The images you see have been rendered by Word and then uploaded. I’ve not edited or altered them in any way: what you see is exactly as it was uploaded. Fairly reasonable results, I’d say. Of course, SVG would be nicer, but maybe one day :-)