Looking inside TeX: C helps me to see

 

Introduction: From WEB to C, a bit of history/background

For some time I’d wanted to build TeX (the original Knuth version) from the WEB source code, but the relatively complex process to generate C from WEB meant it was one of those “tasks” I kept putting off. Well, back in early 2013 I finally decided to have a go and, eventually, I managed to create a Windows port/build of the Web2C executable and associated tools. Using those tools I was finally able to generate TeX.C from TeX.WEB and compile a working TeX executable. As part of that exercise I decided remove the Kpathsea path-searching library from my build of TeX, replacing it with a simple recursive directory search – based, at the moment, on compile-time options (which I plan to make fully configurable – probably with a Lua-based config file).

Why am I doing this… ?

I ask myself this on many occasions… Having “ported” LuaTeX to a native Windows build, I already have a TeX-based system to explore via Visual Studio (and LuaTeX is written in clean C, no need of Web2C). I guess it’s mainly curiosity but there is also the fact I can “tweak + explore” some parts of Knuthian TeX and rapidly and easily re-compile it – the C code base of Knuthian TeX is tiny fraction of LuaTeX and is thus far, far quicker to compile. I also don’t want to risk doing something dumb and somehow wrecking my port/build of LuaTeX.

Poking around inside TeX.C

Although I have quite a collection of books on TeX, I’ve always found it really, really hard to understand how TeX – the language and program – actually works. So, for me, I find it much more instructive to watch how some bits of TeX actually work by stepping through the C code as TeX is executing – single-stepping via the Visual Studio interface. However, before attempting to do that I spent some time using regular expressions to “tidy up” the machine generated C code produced by Web2C – the raw C code (produced by Web2C) is almost impossible to read/follow. At present, the “tidied C code” is still far from “easily legible code”, but it’s gradually improving, especially as I copy/paste explanatory text from TeX.WEB into TeX.C. Many parts of TeX (algorithms) are truly fiendishly complex (line-breaking, hyphenation, math typesetting, etc…) so I doubt I’ll spend too much time probing those inner depths. Whilst being in awe at the sophistication and complexity of the algorithms inside TeX, I do confess that, at times, the C code is, in places, somewhat spaghetti-like. For example, there is a significant number of global variables and some individual globals are used for more than 1 purpose. Additionally, there is extensive use of “goto” statements, causing the code to jump all over the place.

Some confusion starts to ease

Despite the difficulty in following the execution of TeX.C, it is nevertheless fascinating to watch TeX actually run: Parsing the input file, acting on catcode values, creating tokens, defining macros, building boxes, running the page-builder and shipping out pages. Although I’m only just starting to explore TeX via C code, it has, for me, started to lift some of the confusion surrounding the TeX language – even if I have barely scratched the surface of this truly extraordinary program.

A new series of posts…?

My plan is to write a series of short, but fairly frequent, posts based on some aspects of TeX’s internals: To relate/use those internals to explain, with examples, some parts of the TeX language semantics. At least, in areas that I found tricky to understand and ones that, I hope, might be instructive/useful for others.

RegexBuddy and RegexMagic: Truly superb regular expression tools

Regular expressions are part of many programmer’s toolkit but they can be quite fiddly to get right. At the moment, I’m trying to “sanitize” the C code generated for TeX (via Web2C) by post-processing the TeX.c file to make the C source code far more readable. To do that I’m using the original definitions in TeX.WEB to generate C #define statements that I can use in TeX.c. For example, in TeX.WEB you see the following “WEB macros” related to entries in TeX’s “equivalence table”:

@d eq_level_field(#)==#.hh.b1
@d eq_type_field(#)==#.hh.b0
@d equiv_field(#)==#.hh.rh
@d eq_level(#)==eq_level_field(eqtb[#]) {level of definition}
@d eq_type(#)==eq_type_field(eqtb[#]) {command code for equivalent}
@d equiv(#)==equiv_field(eqtb[#]) {equivalent value}

When WEB expressions using the above macros are processed by TANGLE and Web2C the resulting C code contains many statements that look like the following:

eqtb [curval ].hh.b1 = 1 ; 
eqtb [curval ].hh.b0 = c ; 
eqtb [curval ].hh .v.RH = o ; 

Not very readable but, of course, it is machine-generated C code so what would you expect. Through regular expressions I’m (slowly/carefully) replacing many raw C statements using #defines, such as the following:

#define equivalence_level(a) eqtb[a].hh.b1
#define command_code_equivalence(a) eqtb[a].hh.b0
#define set_value_of_equivalent(a) eqtb[a].hh.v.RH

As part of this work, I use two very useful tools for building and testing regular expressions: RegexBuddy and RegexMagic (the tools are compared/explained here). They help you build, test/develop regular expressions and support the syntax and options of many regular expression engines. Once you have a working regex, RegexBuddy and RegexMagic will generate code that allows you to use the regex in a language of your choice (many languages are supported), including C code to use the regex with PCRE – which is my favourite regex library. Again, this is not an advert for these tools, just some notes from someone who has found them to be extremely useful – and have saved me considerable amounts of time in building, testing/using powerful regular expressions with PCRE.

Screenshot: RegexBuddy

Processing INITEX’s primitive(...) function code with RegexBuddy to extract data for preparing C #defines.

TeX’s “badness” function in C

TeX uses the concept of “badness” as a measure of how much the glue in a box has to stretch or shrink. In the following C function, t is the difference between the total of the natural sizes (N) of the components in the box and the desired size of the box (d). So, t = N-d. If the total amount of glue available for stretching or shrinking is s, then the badness, according to the TeXbook, is $100(t/s)^3$ – note that t/s is also known as the glue-set-ratio (often denoted r). In reality, TeX uses an approximation to this calculation, as shown below – the C code is from the C output by Web2C.

typedef int scaled  ;
typedef int halfword  ;

halfword badness ( scaled t , scaled s ) 
{
  halfword Result; 
  integer r  ;
  if ( t == 0 ) 
  Result = 0 ;
  else if ( s <= 0 ) 
  Result = 10000 ;
  else {      
    if ( t <= 7230584L ) 
    r = ( t * 297 ) / s ;
    else if ( s >= 1663497L ) 
    r = t / ( s / 297 ) ;
    else r = t ;
    if ( r > 1290 ) 
    Result = 10000 ;
    else Result = ( r * r * r + 131072L ) / 262144L ;
  } 
  return Result ;
}

What is TeX’s memoryword structure in C?

Just a brief post, partly to record this for my own use. If you read the source code of TeX you will see references to a data structure called a memoryword. It is very carefully defined in the source file texmfmem.h, using various #ifdef blocks to account for endian-type and the “flavour” of TeX you are compiling. So, here is the memoryword, stripped to the very basics for my Windows-only build of TeX. On my machine, sizeof(memoryword) = 8 bytes – glueratio is defined as the type double (8 bytes) – TeX does use the type double for glue calculations. From section 109 of the TeX source code:

When TEX “packages” a list into a box, it needs to calculate the proportionality ratio by which the glue inside the box should stretch or shrink. This calculation does not affect TEX’s decision making, so the precise details of rounding, etc., in the glue calculation are not of critical importance for the consistency of results on different computers.

#define glueratio double
typedef unsigned char quarterword  ;
typedef int integer;
typedef integer halfword  ;

typedef union
{
   struct  {
     int LH, RH;
  } v;

   struct {
     short B1, B0;
  } u;

} twohalves;


typedef struct
{
   struct {
     quarterword B3, B2, B1, B0;
  } u;

} fourquarters;

typedef union {

   glueratio gr;
   twohalves hh;
  
   struct {
     halfword junk;
     integer CINT;
   } u;

  struct {
     halfword junk;
     fourquarters QQQQ;
  } v;

} memoryword;

typedef union {
  
  struct {
    integer CINT;
  } u;

  struct
  {
    fourquarters QQQQ;
  } v;

} fmemoryword;

Minimal FreeType program to dump PostScript font names (with file globbing)

Introduction

I needed to create an updated font map for some work with dvipng/dvips and had to update psfonts.map to contain the mapping between tfm/pfb files and the corresponding PostScript name for each font. To do that I wrote a tiny C program (a simple throw-away utility using FreeType) to extract the PostScript font name from the .pfb files. To save time I used “file globbing” so that the utility’s command line could use wildcards – e,g.,[path]\*.pfb to list all Type 1 fonts in [path]. To use file globbing with Windows you need to link your code with an object file called setargv.obj which takes care of the messy details and expands the wildcards on the command line. I use the now-ancient Visual Studio 2008 IDE (good enough for me!) and needed to add setargv.obj as an additional project dependency under “Additional Dependencies” in the project settings for the linker. With that in place, the following ultra-simple program (no error checking!!) prints the font’s PostScript name and the full path name of the corresponding font file.

#include <stdio.h>
#include <ft2build.h>
#include FT_FREETYPE_H
#include FT_GLYPH_H
#include FT_OUTLINE_H


int main(int argc, char ** argv)
{

	FT_Library libfreetype;
	FT_Face     ftface;
	int i;
    
	FT_Init_FreeType( &libfreetype );

    for (i=1; i<argc; i++){

		FT_New_Face( libfreetype, argv[i], 0, &ftface );
		printf("%s %s\n", FT_Get_Postscript_Name(ftface), argv[i]);
		FT_Done_Face(ftface);
    }

	FT_Done_FreeType(libfreetype);
    return 0;
	
}