Just a random test
Just trying out conversion of fully-vowelled typeset Arabic into SVG: using three different fonts. Seems to work.
Just trying out conversion of fully-vowelled typeset Arabic into SVG: using three different fonts. Seems to work.
Microsoft’s official specification for the OpenType font file format is a somewhat dry and, of course, a very technical document. Reading through it is not a task for the faint-hearted! I’m interested to understand some parts of it so I recently purchased a copy of DTL OTMaster which has proved to be absolutely invaluable. At the time of writing DTL OTMaster costs about 250 euros but the time it can save you makes it worth every penny. This post is not intended as an “advert” for the software, just a quick demo of a really great tool that you may not have heard of; so here are some screenshots of what it will show you. In the screenshots below, OTMaster is displaying the open source OpenType (TrueType) font Scheherazade .
Here are some screenshots showing the internals of Scheherazade. Programmers will note that you are provided with information on the data types of various entries – the same data types referenced in Microsoft’s specification. Very useful indeed! It’s worth noting that OTMaster has many other features in addition to displaying the technical data – including some features present in Microsoft’s VOLT – and in some areas they are better implemented than in VOLT, particularly the ability to preview multiple glyphs with mark-to-base positioning.
On the left is the internal font structure: at the top is the “root” entry where you can see the glyphs in the font.
Summary of key data contained at the start of the font.
The following screenshot shows the font cmap table(s) – the font’s mechanism to map from character codes (e.g., Unicode) to the internal, and font-specific, glyph identifiers (indices).
Displaying a wealth of information on the low-level data for glyphs.
Just a short post to share another example from my on-going work on HarfBuzz/LuaTeX. A rather pointless example – without using any code to correctly place mark glyphs (e.g., vowels) – showing randomly coloured Arabic glyphs. Thanks to the power of HarfBuzz and the superb Lua C API (especially C closures and “for loop” iterators) the code to process the Arabic text is about 25 lines of Lua script.
Source of text for typesetting example: BBC Arabic. I don’t know what the text says but Google Translate indicated it was neither controversial or offensive – I hope that is the case!
Just to add an example with mark glyph positioning and random colours. Vowel positioning added about 10 lines of Lua script :-).
This post could easily turn into the length of a small book if I covered all the background material that may be required for a full understanding. I simply cannot justify the time it would take to explore everything in full detail; so I apologize for the brevity if there’s insufficient detail for many readers. In addition, I’ve been rather loose in my definition of “vowels” and should be more precise to distinguish between damma/kasra/fathah and other markers such as shedda, sukoon and so forth.
One side-effect of using TeX is being distracted by the typesetting quality of materials you are reading. And this happened to me whilst trying to teach myself some Arabic. I bought many books and began to notice that the quality of Arabic typesetting was extremely variable, even from the most respected publishers. In fact, some of it was atrocious, especially the placement of vowels/markers (damma, kasra, fatha, sukoon, shadda and so forth). It was not simply a question of being “picky”, or mere aesthetics, but it actually impacted on reading the material. Often, lines of fully-vowelled Arabic text were so poorly typeset that it was hard to know which vowel belonged to which base glyph. As a small example, here’s a scan of the word “yawmu” (day) taken from a book that shall remain nameless:
Even to the casual observer it is clear that the marks above the glyphs are very distant from the base glyphs they are supposed to be marking. So, I asked myself “Why”, little did I know that it would result in me being distracted away from studying Arabic to exploring typesetting it instead. To begin to explain the problem, we can replicate the above scan with a little bit of hand-rolled PostScript code. Don’t worry about how I found the appropriate glyph names for use with the PostScript glyphshow
operator. The following code initially typesets the word “yawmu” using the default glyph positions and then typesets the same glyphs by applying manual re-positioning/adjustments – moving the vowels/markers closer to the base glyphs and faking a bit of kerning too.
/ATbig /Arial findfont 30 scalefont def /AThuge /ArialMT findfont 75 scalefont def 50 250 moveto ATbig setfont (Glyphs in their default positions: ) show AThuge setfont /uni064F glyphshow %damma /uni0645.fina glyphshow %meem /uni0652 glyphshow /uni0648 glyphshow /uni064E glyphshow /uni064A.medi glyphshow 50 150 moveto ATbig setfont (Glyph positions manually adjusted: ) show AThuge setfont gsave -2 -10 rmoveto /uni064F glyphshow %damma grestore /uni0645.fina glyphshow %meem gsave 2 -10 rmoveto /uni0652 glyphshow grestore -15 0 rmoveto /uni0648 glyphshow gsave 2 -8 rmoveto /uni064E glyphshow grestore /uni064A.medi glyphshow showpage
Here’s the resulting PDF:
So, in essence, “poor quality” typesetting of fully-vowelled Arabic can arise from typesetting processes/software that do not make any adjustments to the positions of vowels/markers with respect to the base glyph they are supposed to mark. Naturally, it would be crazy if you had to manually work out the positioning adjustments for each vowel/marker according to the glyph it is marking. Of course you don’t need to do that – if you use high quality OpenType fonts all the necessary positioning data is contained in the font itself. However, the font designer still has to work very hard to put that positioning data into OpenType font to ensure that the myriad of combinations work well – not forgetting that Arabic letters have up to 4 shapes depending on their position in the word (initial, medial, final or isolated) and have a myriad of complex ligatures which also need similar positioning data. Spare a thought for the designers who labour for hours ensuring the positioning data works.
A small but important point to note is that the Arabic vowels (and some other markers) should be designed to have zero width: when you render or place a vowel it does not affect the current horizontal point or position on the page. Clearly, this is very important because Arabic is a joined/cursive script – non-zero vowel widths would seriously interfere with joining the base Arabic glyphs. The zero-width can be demonstrated very simply by amending the above PostScript to display just the vowels/markers: here you can see they all overlap because they do not move the current point after being displayed – because they have zero width.
/AThuge /ArialMT findfont 500 scalefont def 50 50 moveto AThuge setfont 0 0 1 setrgbcolor /uni064F glyphshow %damma 0 1 0 setrgbcolor /uni0652 glyphshow 1 0 0 setrgbcolor /uni064E glyphshow showpage
To support high-quality Arabic typesetting, OpenType fonts contain the necessary positioning data to adjust the positions of vowels/markers to move them closer to, or away from, the base glyph over which they appear. So, how is this done? Again, for brevity I’m omitting a huge amount of detail but in essence the process is quite easy to understand. When you think about these positioning issues you need to think about pairs of glyphs: the base glyph – i.e., the Arabic letter in one of its forms, together with the vowel glyph or, to be more general, glyphs which are classified as marks: glyphs that appear above or below base glyphs. For each mark glyph/base glyph pair the mark glyph and base glyph are each given a so-called anchor point, which is simply an (x,y) coordinate pair (in font design space coordinates). Positioning the mark glyph with respect to the base glyph means that typesetting software obtains the anchor points (from the font file) and uses them to make positioning adjustments so that anchor points of the mark and base coincide. Here’s a simplified diagram showing anchors for a damma (mark) and the medial form of kaaf.
The following diagram simulates having displayed a medial form kaaf then the damma (marker) but without the damma’s position being adjusted via the anchor point data. If you look closely, you can see that the two crosses representing the individual anchor points do not yet coincide.
Well, as you’d expect it requires specialist software and a great deal of time to manually experiment and work out the best (x,y) pairs for marks/bases. Thankfully, for TrueType fonts Microsoft has generously provided an excellent free piece of software called VOLT: Visual OpenType Layout Tool. VOLT allows you to implement very sophisticated OpenType features, not only “mark to base positioning” which is what we are talking about here. If you are interested to explore this technology, you can start with SIL’s Scheherazade Regular (OpenType) developer package which contains a VOLT project file you can load and explore. See the VOLT screenshot below.
Attempting a VOLT tutorial is far outside the scope of this post. However, here’s a screenshot showing the creation of anchor points – in the lower-right corner you can see coordinate data (in font design coordinates) which are the anchor points: an (x,y) pair for the mark and base glyph.
Well, here is where it get pretty fiddly because you have a number of coordinate systems in play plus you are dealing with right-to-left text positioning – and it all depends on the software you are using. Perhaps the easiest option (well, the easiest at 3am as I finish this article!) is to think of the damma’s position undergoing simple repositioning as indicated by this vector diagram:
In the above diagram, the vectors r1 and r2 represent the positions of the anchor points, with vector rt indicating the translation you need to apply to the damma in order for the anchors to coincide. Now, it is of course complicated by the fact that the anchor point coordinates are defined using the design space of the fonts, so you obviously need to scale the anchor point values according to the point size of your font: simply (pointsize/2048) for TrueType fonts. You obviously need to account for the coordinate system into which you are rendering the glyphs. So, if you have placed the medial kaaf at some position (a,b) on your page so you need to work out the translation vector rt to place the damma in the correct location.
Good night, I’m going to get some sleep. I’ll fix the typos later 🙂
Just to note that you can think of the mark’s anchor point as translating the origin of the mark glyph:
A fairly basic example to explain a bit about libotf: just to “get started”. To run this, I built libotf (and FreeType) as static libraries and linked against them.
#include <windows.h> #include <math.h> #include <malloc.h> #include <memory.h> #include <stdio.h> #include <stdlib.h> // I'm including FreeType #includes directly not via #include FT_FREETYPE_H #include <ft2build.h> #include <freetype.h> #include <t1tables.h> #include <ftoutln.h> #include <ftbbox.h> //#include FT_FREETYPE_H #include <otf.h> //#include <pcre.h> //#include <time.h> typedef unsigned char uint8_t; typedef unsigned int uint32_t; int main(int argc, char** argv) { FT_Library font_library; FT_Face fontface; FT_GlyphSlot cur_glyph; FT_Glyph_Metrics glyph_metrics; OTF_GlyphString gstring; char * fontpath; size_t numcodepoints; OTF *otf; int i; // "arabictext" is a "wide character" string. It contains a sequence of Unicode codepoints // for each character in our string. BUT NOTE: these codepoints will be the values of the // UNSHAPED isolated Arabic characters. What you are looking at on screen here is the result of // applying the operating system/browser shaping engine to shape the displayed version. // It is really important to understand that !! wchar_t * arabictext = L"Øَرَكَات"; // I'm using the Scheherazade font from SIL (as amended by me) fontpath="e:\\Volt\\ScheherazadeRegOT-1.005-developer\\sources\\ScheherarazadeGDversion3.ttf"; // wcslen returns the string length in "wide character" units // i.e., this gives you the number of Unicode codepoints (i.e., characters). // Obviously, if "arabictext" was encoded in UTF-8 (e.g., we read it from a file) // we'd need to counts the number of codepoints by converting the UTF-8 // back into Unicode character integers (codepoints) numcodepoints= wcslen(arabictext); // gstring is the object we pass to the OTF library. // First we need to tell it how long our gstring is. // Initially, gstring.used = gstring.size until the libotf library starts to // manipulate the gstring (glyph sequences) and perform various OpenType // features/lookups (e.g., GSUB subsitutions) which usually results in // changes to the number of glyphs present in the string. // OK, here's where we set up the gstring for use with the OTF library gstring.used=numcodepoints; gstring.size=numcodepoints; // Now we need to create our actual glyph objects // 1 for each codepoint in our text wchar_t * arabictext gstring.glyphs= malloc (sizeof (OTF_Glyph) * numcodepoints); memset (gstring.glyphs, '\0', sizeof (OTF_Glyph) * numcodepoints); // Now we are ready to use the OTF library. I should make it VERY clear // that here we are NOT, I repeat NOT doing any shaping of the Arabic // text. libotf does not transform the string of isolated Arabic glyphs form into their // initial, medial or final shapes. That must happen BEFORE you pass the // gstring to libotf. The following is just a trivial demo showing the basics. // Firstly, we need to assign the Unicode codepoint (character value) // to each of the glyphs in our gstring object --- setting gstring.glyphs[i].c for glyph i. // (as contained in arabictext[i]) for (i=0; i < numcodepoints; i++) { gstring.glyphs[i].c = arabictext[i]; } // Get our instance of the libotf library // You should check the return value: Warning, I'm being VERY lazy here!!! otf = OTF_open(fontpath); // Now we'll call the really interesting functions. // Firstly, we'll call OTF_drive_cmap2 (otf, gstring, 3, 1) // to assign GLYPH IDENTIFIERS to our gstring. What's happening is that libotf is // using the CMAP table in the font to say "Hey, I've got the Unicode code point X // can you tell me the GLYPH IDENTIFIER that maps to in the font? OTF_drive_cmap2 (otf, &gstring, 3, 1); // OK, so what's the result of this? Let's see: for (i=0; i < numcodepoints; i++) { printf("Unicode character %ld maps to GLYPH IDENTIFIER %ld \n", gstring.glyphs[i].c, gstring.glyphs[i].glyph_id); } //The output is: /* Unicode character 1581 maps to GLYPH IDENTIFIER 340 Unicode character 1614 maps to GLYPH IDENTIFIER 907 Unicode character 1585 maps to GLYPH IDENTIFIER 290 Unicode character 1614 maps to GLYPH IDENTIFIER 907 Unicode character 1603 maps to GLYPH IDENTIFIER 395 Unicode character 1614 maps to GLYPH IDENTIFIER 907 Unicode character 1575 maps to GLYPH IDENTIFIER 257 Unicode character 1578 maps to GLYPH IDENTIFIER 322 */ // Next, we'll call OTF_drive_gdef (otf, gstring) whose job it is // to tell us what TYPE of glyph (called the Glyph Class) are we dealing with. This is the OpenType // GDEF table which can be used to allocate an identifier (Glyph Class) to each glyph // in the font. // See http://partners.adobe.com/public/developer/opentype/index_table_formats5.html // Glyph Class 1 = Base glyph (single character, spacing glyph) // Glyph Class 2 = Ligature glyph (multiple character, spacing glyph) // Glyph Class 3 = Mark glyph (non-spacing combining glyph) // Glyph Class 4 = Component glyph (part of single character, spacing glyph) OTF_drive_gdef (otf, &gstring); // Let's see what we got from that: for (i=0; i < numcodepoints; i++) { printf("Unicode character %ld maps to GLYPH IDENTIFIER %ld which is Glyph Class %ld\n", gstring.glyphs[i].c, gstring.glyphs[i].glyph_id, gstring.glyphs[i].GlyphClass); } /* Unicode character 1581 maps to GLYPH IDENTIFIER 340 which is Glyph Class 1 Unicode character 1614 maps to GLYPH IDENTIFIER 907 which is Glyph Class 3 Unicode character 1585 maps to GLYPH IDENTIFIER 290 which is Glyph Class 1 Unicode character 1614 maps to GLYPH IDENTIFIER 907 which is Glyph Class 3 Unicode character 1603 maps to GLYPH IDENTIFIER 395 which is Glyph Class 1 Unicode character 1614 maps to GLYPH IDENTIFIER 907 which is Glyph Class 3 Unicode character 1575 maps to GLYPH IDENTIFIER 257 which is Glyph Class 1 Unicode character 1578 maps to GLYPH IDENTIFIER 322 which is Glyph Class 1 */ // OK, that's the end. Time to get out of here. // Let's be tidy! free(gstring.glyphs); OTF_close (otf); return 0; }
I’ve been reading about SIL International’s Graphite engine and it looks really interesting. I downloaded the code and ran the CMake-based build process through the CMake graphical interface. It didn’t work. Eventually, I found some instructions to build it from the command line, so here’s the way I did it.
cmake.exe
is in your Windows PATH
.Graphite
).-G
parameter (below) to your build environment (cmake --help
tells you the ones it supports).cmake -G "Visual Studio 9 2008" -DCMAKE_BUILD_TYPE=Release -DGRAPHITE2_COMPARE_RENDERER:BOOL=OFF
If all goes well you should see something like the following, together with a generated Visual Studio Solution file graphite2.sln
.
-- Build: Release -- Segment Cache support: enabled -- File Face support: enabled -- Tracing support: enabled CMake Warning at CMakeLists.txt:54 (message): vm machine type direct can only be built using GCC -- Using vm machine type: call -- Configuring done -- Generating done -- Build files have been written to: E:/SILgraide/Graphite
Your Visual Studio Solution should look something like this: