Looking inside TeX: strings and pool files

Introduction

In this post we’ll cover TeX’s handling of strings and explain .pool files. Using Web2C to build (Knuthian) TeX from Knuth’s TeX.WEB source code involves many steps, as explained elsewhere on this site. One of the initial steps when building TeX is combining Knuth’s master source file (TeX.WEB) with a “change file” (TeX.CH) to produce a modified WEB source file (let’s call it TeXk.WEB) which can be processed via Web2C. The TeX.CH change file applies many modifications to the master TeX.WEB source code – e.g., preparing it for conversion to C code and adding support for the kpathsea file-searching library. After the change file has been applied, the next step is to process our modified TeX.WEB (i.e., TeXk.WEB) via the TANGLE program. If TANGLE successfully parses our TeXk.WEB source code it will output two files (download links are provided for the inquisitive):

  • TeXk.p: the source code of TeX (in Pascal).
  • TeXk.pool: a file containing the string constants defined in TeXk.WEB

Here’s a small fragment of TeXk.pool as produced during my Web2C process:

....
11expandafter
04font
09fontdimen
06halign
05hrule
12ignorespaces
10mathaccent
08mathchar
10mathchoice
08multiply
07noalign
10noboundary
08noexpand
04omit
07penalty
08prevgraf
07radical
04read
05relax
06setbox
03the
06valign
07vcenter
05vrule
09save size
15grouping levels
08curlevel
09retaining
09restoring
05SAVE(
28Incompatible magnification (
02);
36 the previous value will be retained
58I can handle only one magnification ratio per job. So I've
59reverted to the magnification you used earlier on this run.
46Illegal magnification has been changed to 1000
52The magnification ratio must be between 1 and 32768.
...
*413816964

TeXk.pool consists of many lines of the format [string length][string text][end_of_line], where the string length is written as two decimal digits, followed by a final line containing *CHECKSUM, where CHECKSUM in the above example is 413816964. Once upon a time, .pool files had to be preserved as external files for use when building .fmt files via INITEX, but in 2008 this was changed and the .pool file is now compiled into the TeX binaries – I’ll explain this below. For example, the following note is contained in more recent texmf.cnf files:
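To make the line format concrete, here is a minimal sketch of a parser for a single .pool line. The function parse_pool_line and the enum are hypothetical names of my own (they are not part of TeX, TANGLE or Web2C), and the sketch assumes the two-digit length prefix seen in the fragment above and that the trailing newline has already been removed.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Result of classifying one line of a .pool file. */
enum pool_line_kind { POOL_STRING, POOL_CHECKSUM, POOL_MALFORMED };

/*
 * Hypothetical helper (not part of TeX or Web2C): parse one line of a
 * .pool file, with the trailing newline already stripped.
 *   "NNtext"    -> POOL_STRING;   *text points at the text, *len is NN
 *   "*CHECKSUM" -> POOL_CHECKSUM; *checksum holds the number
 */
enum pool_line_kind parse_pool_line(const char *line, const char **text,
                                    int *len, unsigned long *checksum)
{
    if (line[0] == '*') {
        char *end;
        if (!isdigit((unsigned char)line[1]))
            return POOL_MALFORMED;
        *checksum = strtoul(line + 1, &end, 10);
        return (*end == '\0') ? POOL_CHECKSUM : POOL_MALFORMED;
    }
    /* An ordinary line starts with exactly two decimal digits. */
    if (!isdigit((unsigned char)line[0]) || !isdigit((unsigned char)line[1]))
        return POOL_MALFORMED;
    *len = (line[0] - '0') * 10 + (line[1] - '0');
    /* The declared length must match the text that follows. */
    if (strlen(line + 2) != (size_t)*len)
        return POOL_MALFORMED;
    *text = line + 2;
    return POOL_STRING;
}
```

Feeding it "05relax" should give POOL_STRING with a length of 5 and text "relax", while "*413816964" should be reported as POOL_CHECKSUM.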

As of 2008, pool files don't exist any more (the strings are compiled into the binaries), but just in case something expects to find these:
TEXPOOL = .;$TEXMF/web2c
MFPOOL = ${TEXPOOL}
MPPOOL = ${TEXPOOL}

As you can see from the above fragment, the TeXk.pool file contains string constants for TeX’s primitive commands plus all the strings contained in help/error messages that TeX outputs to the terminal and/or log file.

TeX’s internal handling of strings

In addition to the string constants defined in TeXk.pool, TeX will, of course, encounter new strings – for example, when you define new macro names; consequently, TeX needs a way to store the string constants in TeXk.pool and the strings it encounters during its run-time processing of your TeX files. It should not be a surprise that TeX’s internal handling of strings is achieved through methods designed to ensure portability.

From TeX.WEB: The TEX system does nearly all of its own memory allocation, so that it can readily be transported into environments that do not have automatic facilities for strings, garbage collection, etc., and so that it can be in control of what error messages the user receives... Control sequence names and diagnostic messages are variable-length strings of eight-bit characters. Since PASCAL does not have a well-developed string mechanism, TeX does all of its string processing by homegrown methods.

How does TeX use/store strings?

In vanilla C, a simple 8-bit string is an array of characters terminated by the null character ('\0'). TeX does not store its strings as individually named string variables; instead, it allocates a single large array and uses integer offsets into that array to identify strings (and calculate lengths). Here’s how it works.

From TeX.WEB: The array |str_pool| contains all of the (eight-bit) ASCII codes in all of the strings, and the array |str_start| contains indices of the starting points of each string. Strings are referred to by integer numbers, so that string number |s| comprises the characters |str_pool[j]| for |str_start[s]<=j<str_start[s+1]|. Additional integer variables |pool_ptr| and |str_ptr| indicate the number of entries used so far in |str_pool| and |str_start|, respectively; locations |str_pool[pool_ptr]| and |str_start[str_ptr]| are ready for the next string to be allocated.

It is worth noting that when TANGLE produces Pascal code (from the WEB source) it strips out all underscores from variables defined in the WEB code. For example, the |str_pool| variable mentioned above is called strpool in the final C code produced from the Pascal.

After processing via Web2C, the WEB variables |str_pool|, |str_start|, |pool_ptr| and |str_ptr| are global variables declared as follows (near the start of TeX.C):


packedASCIIcode * strpool ;
poolpointer * strstart ;
poolpointer poolptr ;
strnumber strptr ;

The types packedASCIIcode and poolpointer are simply typedefs:


typedef unsigned char packedASCIIcode ;
typedef int integer;
typedef integer poolpointer ;

Stripping away all typedefs introduced by Web2C gives:


unsigned char* strpool ;
int* strstart ;
int poolptr ;
int strptr ;

To see what’s going on, i.e., how TeX identifies a string, let’s first look at the global variable strpool (practically all key variables are declared with global scope in TeX.C…!). During initialization (in INITEX mode, and when TeX is reading/unpacking a .fmt file to initialize a particular format (plain.fmt, latex.fmt etc)) the strpool and strstart variables are initialized as follows:

strpool = xmallocarray (packedASCIIcode , poolsize) ;
strstart = xmallocarray (poolpointer , maxstrings) ;

where xmallocarray is a #define:


/* Allocate an array of a given type. Add 1 to size to account for the fact that Pascal arrays are used from [1..size], unlike C arrays which use [0..size). */
#define xmallocarray(type,size) ((type*)xmalloc((size+1)*sizeof(type)))

and xmalloc(...) is a small utility function wrapped around the standard C function malloc(...).
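For readers who have not met this idiom, here is a minimal sketch of what an xmalloc-style wrapper typically looks like. This is my own illustration, not the kpathsea source: the names xmalloc_sketch and XMALLOC_ARRAY_SKETCH are hypothetical, and the real function may differ in its error message and other details.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch of an xmalloc-style wrapper: like malloc, but it never returns NULL.
   If the allocation fails, it reports the problem and terminates the program. */
void *xmalloc_sketch(size_t size)
{
    void *p = malloc(size ? size : 1);   /* avoid the malloc(0) corner case */
    if (p == NULL) {
        fprintf(stderr, "fatal: memory exhausted (request for %lu bytes).\n",
                (unsigned long)size);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Same shape as the Web2C macro quoted above, built on the sketch:
   one extra element is allocated so that index [size] is usable. */
#define XMALLOC_ARRAY_SKETCH(type, size) \
    ((type *)xmalloc_sketch(((size) + 1) * sizeof(type)))
```

The benefit of this design is that callers never have to test the returned pointer: XMALLOC_ARRAY_SKETCH(unsigned char, 10) either yields 11 usable bytes (indices 0 to 10) or the program stops with a message.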

A Pascal legacy: In many places within TeX.C you have to account for the fact that Pascal arrays start at index 1 but C arrays start at index 0. This is a consequence of the fact that Knuthian TeX is written in Pascal, not C.

The allocation of memory for strpool uses an integer variable called poolsize: the value of poolsize is calculated at run-time from the value of other variables – including some variables whose value can be defined by settings in texmf.cnf. So, in essence:

strpool = (unsigned char *) malloc(sizeof(unsigned char)*(poolsize +1));

– which looks very much like one huge C string. And, of course, it is. strpool stores all of TeX’s strings BUT within strpool all strings are contiguous (stored end-to-end) without any delimiter characters between them (such as the null character ('\0'), a space, etc.). Clearly, there needs to be a mechanism to define where each individual string starts and stops: i.e., to partition strpool into individual strings. That is the task of the integer array variable called strstart. Perhaps an example will make this clearer.

We can declare a variable fakestrpool as follows:

unsigned char fakestrpool[]="ThisismyfakeTeXstrpool";

Here, we have concatenated the 6 strings "This", "is", "my", "fake","TeX" and "strpool" into one long string. These 6 strings start at the following offsets in fakestrpool:


string 0 ("This"): offset 0
string 1 ("is"): offset 4
string 2 ("my"): offset 6
string 3 ("fake"): offset 8
string 4 ("TeX"): offset 12
string 5 ("strpool"): offset 15

So, if we define an array of integers, strstart, to record these offsets:

int strstart[6] ; // for 6 strings numbered 0 to 5


strstart[0]=0;
strstart[1]=4;
strstart[2]=6;
strstart[3]=8;
strstart[4]=12;
strstart[5]=15;

Then for some string identified by a number k (where 0 <= k <= 5), strstart[k] gives the offset into fakestrpool where the kth string starts. And this is exactly how TeX identifies strings: it identifies them using some integer value, k, say, where strstart[k] tells you where that string starts (in strpool) and allows the length of string number k, length(k), to be easily calculated using

length(k) = strstart[k + 1] - strstart[k]

For example, let us use this method to calculate the length of the string with number 4 (k=4) ("TeX" in our test array fakestrpool).


length(4) = strstart[5] - strstart[4]
length(4) = 15 - 12 = 3

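Of course there is one minor complication – calculating the length of string 5, because our six-element array has no strstart[6] to subtract from. TeX solves this with the other two variables (poolptr and strptr): strstart[strptr] always holds the current value of poolptr, i.e., the offset just past the end of the last string, so the same formula works for the final string too.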
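The fake pool example can be turned into a small, compilable sketch. To handle the last string I have assumed one extra entry at the end of the offsets array (22, the total number of characters), which plays the role that poolptr plays in TeX. The names fakestrstart, fake_length and fake_copy are my own; the array is called fakestrstart rather than strstart simply to avoid confusion with TeX’s real variable.

```c
#include <string.h>

/* The six strings, stored end-to-end with no delimiters between them. */
static const unsigned char fakestrpool[] = "ThisismyfakeTeXstrpool";

/* Start offsets of strings 0..5, plus one extra entry (22) marking the
   end of the last string, in the way poolptr does in TeX. */
static const int fakestrstart[7] = { 0, 4, 6, 8, 12, 15, 22 };

/* Length of string number k, for 0 <= k <= 5. */
int fake_length(int k)
{
    return fakestrstart[k + 1] - fakestrstart[k];
}

/* Copy string number k into buf as an ordinary NUL-terminated C string.
   Returns the length, or -1 if buf is too small. */
int fake_copy(int k, char *buf, size_t bufsize)
{
    int len = fake_length(k);
    if ((size_t)len + 1 > bufsize)
        return -1;
    memcpy(buf, fakestrpool + fakestrstart[k], (size_t)len);
    buf[len] = '\0';
    return len;
}
```

With this in place, fake_length(4) gives 3 and fake_length(5) gives 7, and fake_copy(4, buf, sizeof buf) leaves "TeX" in buf.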

Back to .pool files

We started this discussion by noting that running the TANGLE program on TeXk.WEB produces two output files:

  • TeXk.p: the source code of TeX (in Pascal).
  • TeXk.pool: a file containing the string constants defined in TeXk.WEB

The next stage in the discussion covers the mechanism for processing .pool files that was introduced circa 2008. Before then, you needed to keep .pool files available (as part of the TeX distribution) as separate files for use whenever you ran INITEX to generate a new .fmt file. As noted, the contents of the .pool files are string constants generated by TANGLE from string constants defined in the main WEB source code of TeX. Given that those strings don’t change (they are constants), it makes more sense to build them into the TeX executable file rather than having to access them each time a new .fmt file is created by INITEX. Part of the Web2C process now involves using a small utility program called makecpool.exe (on Windows) – makecpool.C was written by Taco Hoekwater. The input to makecpool.exe is the TeXk.pool file and the output is another C file (called texpool.C or similar) which defines a function called loadpoolstrings(...):

int loadpoolstrings (int spare_size)

Downloads

If you just want to see the inputs/outputs you can download the files I produced during my private build of Knuthian TeX:

  • TeXk.pool: The .pool file input for makecpool.exe
  • texpool.C: The C file output by makecpool.exe, defining the function loadpoolstrings(...).

Once you have generated texpool.C you no longer need the original TeXk.pool file because the contents of TeXk.pool are now stored within texpool.C as an array of strings:

static const char *poolfilearr[] = {
  "buffer size",
  "pool size",
  "number of strings",
  "" "?" "?" "?",
  "m2d5c2l5x2v5i",
  "End of file on the terminal!",
  "! ",
  "(That makes 100 errors; please try again.)",
  "" "? ",
  "Type <return> to proceed, S to scroll future error messages,",
  "R to run without stopping, Q to run quietly,",
  "I to insert something, ",
...
...
...
NULL };
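Entries such as "" "?" "?" "?" may look odd, but they rely on a standard C feature: adjacent string literals are joined by the compiler into a single string. Writing the question marks as separate literals is, I assume, a way of keeping runs of ?? out of the generated source so they cannot be mistaken for C trigraphs. The following small sketch (sample_pool and count_sample_strings are hypothetical names, not taken from texpool.C) shows the effect.

```c
#include <stddef.h>

/* A NULL-terminated array in the same style as poolfilearr. */
static const char *sample_pool[] = {
    "pool size",
    "" "?" "?" "?",   /* the compiler joins these into one 3-character string */
    "" "? ",          /* and these into a 2-character string */
    NULL
};

/* Count the entries by walking the array up to the NULL terminator. */
int count_sample_strings(void)
{
    int n = 0;
    while (sample_pool[n] != NULL)
        n++;
    return n;
}
```

Here sample_pool[1] is an ordinary string of three question marks, and count_sample_strings() returns 3.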

Of course, when you build TeX you will need to compile TeXk.C and texpool.C so that the function loadpoolstrings(...) is made available. The function loadpoolstrings(...) is called from TeX.C when TeX is in INITEX mode (i.e., the --ini option is set on the command line). Specifically, the loadpoolstrings(...) function is called by the function getstringsstarted(...) just after it has initialized the first 256 strings in TeX’s main string container: the strpool array discussed above.

Modifying loadpoolstrings (…) to see what it does

The function loadpoolstrings(…) depends on a few of TeX’s internal global variables and on the function makestring() (we’ll discuss that shortly). Notably, we need to declare the following as extern in texpool.C:


extern int makestring ( void ) ;
extern unsigned char * strpool;
extern int poolptr;

Here is my slightly modified version of loadpoolstrings(...) which outputs a file called "datadump.txt" to list the strings and corresponding string numbers generated by makestring():

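#include <stdio.h>
#include <string.h>

int loadpoolstrings (int spare_size) {
  const char *s;
  int g = 0;          /* string number returned by makestring() */
  int i = 0, j = 0;   /* i: characters copied so far; j: index into poolfilearr */
  FILE *dumpvals = fopen("datadump.txt", "wb");   /* may be NULL: dump is optional */

  while ((s = poolfilearr[j++])) {
    int l = strlen (s);
    if (dumpvals) fprintf(dumpvals, "//string \"%s\" = number ", s);
    i += l;
    if (i >= spare_size) {   /* not enough room left in strpool */
      if (dumpvals) fclose(dumpvals);
      return 0;
    }
    while (l-- > 0) strpool[poolptr++] = *s++;
    g = makestring();
    if (dumpvals) fprintf(dumpvals, "%d\n", g);   /* g is an int, so %d, not %ld */
  }
  if (dumpvals) fclose(dumpvals);
  return g;
}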

datadump.txt

Those who might be interested to see the contents of datadump.txt can download it here. In any case, here’s a listing of the first few lines in datadump.txt:

//string "buffer size" = number 256
//string "pool size" = number 257
//string "number of strings" = number 258
//string "???" = number 259
//string "m2d5c2l5x2v5i" = number 260
//string "End of file on the terminal!" = number 261
//string "! " = number 262
...
...
//string "Using character substitution: " = number 1329

As you can see, the string number of the first string is 256 (i.e., the first string originally contained in TeXk.pool). Assuming that the string numbers start at 0 (they do), TeX has already initialized strings 0..255 before loading the strings from the TeXk.pool file. I hate to do this to you, dear reader, but can you guess what those 256 strings (0..255) might be?

The function makestring()

Here is TeX’s makestring() function, which returns a string number after checking for overflow – i.e., after checking that TeX has enough space to store another string.

strnumber makestring (void) 
{
  register strnumber Result; makestring_regmem
  if (strptr == maxstrings)
    overflow (258 , maxstrings - initstrptr) ;
  incr (strptr) ;
  strstart[strptr] = poolptr ;
  Result = strptr - 1 ;
  return Result ;
}
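To see how makestring() and the copying loop in loadpoolstrings(...) work together, here is a toy model that you can compile on its own. Everything in it is hypothetical (the toy_ names, the tiny array sizes, and returning -1 instead of calling overflow); it is only meant to mirror the bookkeeping: characters are appended at the pool pointer, and making a string records where the next string will start.

```c
#include <stddef.h>
#include <string.h>

#define TOY_POOL_SIZE   64
#define TOY_MAX_STRINGS 8

static unsigned char toy_pool[TOY_POOL_SIZE];      /* plays the role of strpool  */
static int toy_start[TOY_MAX_STRINGS + 1];         /* plays the role of strstart */
static int toy_pool_ptr = 0;                       /* next free byte in toy_pool */
static int toy_str_ptr = 0;                        /* number of the next string  */

/* Close off the characters appended since the previous call and return
   the new string's number, or -1 if the string table is full. */
int toy_make_string(void)
{
    if (toy_str_ptr == TOY_MAX_STRINGS)
        return -1;
    toy_str_ptr++;
    toy_start[toy_str_ptr] = toy_pool_ptr;   /* start of the NEXT string */
    return toy_str_ptr - 1;
}

/* Append the characters of s to the pool and turn them into a string.
   Returns the string number, or -1 if there is no room. */
int toy_add(const char *s)
{
    size_t n = strlen(s);
    if ((size_t)toy_pool_ptr + n > TOY_POOL_SIZE || toy_str_ptr == TOY_MAX_STRINGS)
        return -1;
    while (*s)
        toy_pool[toy_pool_ptr++] = (unsigned char)*s++;
    return toy_make_string();
}

/* Length of string number k, using the same formula as in the text. */
int toy_length(int k)
{
    return toy_start[k + 1] - toy_start[k];
}
```

Calling toy_add("relax") and then toy_add("the") on a fresh model returns 0 and 1; afterwards toy_length(0) is 5, toy_length(1) is 3 and toy_pool_ptr is 8. Note that toy_start[toy_str_ptr] always equals toy_pool_ptr, which is why the length formula also works for the most recently added string.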

Time to stop

Dear reader, writing this post has absorbed the greater part of my Sunday (14 September 2014) so you’ll forgive me if I call it a day and leave it here – I’ll fix any typos tomorrow :-). I hope it is of use, or interest, to someone “out there”, somewhere.

RegexBuddy and RegexMagic: Truly superb regular expression tools

Regular expressions are part of many programmers’ toolkits but they can be quite fiddly to get right. At the moment, I’m trying to “sanitize” the C code generated for TeX (via Web2C) by post-processing the TeX.c file to make the C source code far more readable. To do that I’m using the original definitions in TeX.WEB to generate C #define statements that I can use in TeX.c. For example, in TeX.WEB you see the following “WEB macros” related to entries in TeX’s “equivalence table”:

@d eq_level_field(#)==#.hh.b1
@d eq_type_field(#)==#.hh.b0
@d equiv_field(#)==#.hh.rh
@d eq_level(#)==eq_level_field(eqtb[#]) {level of definition}
@d eq_type(#)==eq_type_field(eqtb[#]) {command code for equivalent}
@d equiv(#)==equiv_field(eqtb[#]) {equivalent value}

When WEB expressions using the above macros are processed by TANGLE and Web2C the resulting C code contains many statements that look like the following:

eqtb [curval ].hh.b1 = 1 ; 
eqtb [curval ].hh.b0 = c ; 
eqtb [curval ].hh .v.RH = o ; 

Not very readable but, of course, it is machine-generated C code so what would you expect. Through regular expressions I’m (slowly/carefully) replacing many raw C statements using #defines, such as the following:

#define equivalence_level(a) eqtb[a].hh.b1
#define command_code_equivalence(a) eqtb[a].hh.b0
#define set_value_of_equivalent(a) eqtb[a].hh.v.RH
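Because these macros expand to plain member accesses, they can appear on the left-hand side of an assignment, which is what makes the substitution work. Here is a small self-contained sketch; the toy_memoryword type is a hypothetical stand-in with just enough fields to compile (the real Web2C memory word is more involved), and toy_define is my own illustrative function.

```c
/* Hypothetical stand-in for the Web2C memory word: only the fields
   needed by the three macros are present. */
typedef struct {
    struct {
        unsigned short b0, b1;
        union { int RH; } v;
    } hh;
} toy_memoryword;

static toy_memoryword eqtb[16];

#define equivalence_level(a) eqtb[a].hh.b1
#define command_code_equivalence(a) eqtb[a].hh.b0
#define set_value_of_equivalent(a) eqtb[a].hh.v.RH

/* The three raw statements from the generated code, rewritten with
   the macros; each macro expands to an lvalue, so assignment works. */
void toy_define(int curval, int c, int o)
{
    equivalence_level(curval) = 1;
    command_code_equivalence(curval) = (unsigned short)c;
    set_value_of_equivalent(curval) = o;
}
```

After toy_define(3, 7, 42), the entry eqtb[3] holds b1 == 1, b0 == 7 and v.RH == 42, exactly as the three raw assignments would have left it.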

As part of this work, I use two very useful tools for building and testing regular expressions: RegexBuddy and RegexMagic (the tools are compared/explained here). They help you build, test and develop regular expressions and support the syntax and options of many regular expression engines. Once you have a working regex, RegexBuddy and RegexMagic will generate code that allows you to use the regex in a language of your choice (many languages are supported), including C code to use the regex with PCRE – which is my favourite regex library. Again, this is not an advert for these tools, just some notes from someone who has found them to be extremely useful – they have saved me considerable amounts of time in building, testing and using powerful regular expressions with PCRE.

Screenshot: RegexBuddy

Processing INITEX’s primitive(...) function code with RegexBuddy to extract data for preparing C #defines.

Minimal FreeType program to dump PostScript font names (with file globbing)

Introduction

I needed to create an updated font map for some work with dvipng/dvips and had to update psfonts.map to contain the mapping between tfm/pfb files and the corresponding PostScript name for each font. To do that I wrote a tiny C program (a simple throw-away utility using FreeType) to extract the PostScript font name from the .pfb files. To save time I used “file globbing” so that the utility’s command line could use wildcards – e.g., [path]\*.pfb to list all Type 1 fonts in [path]. To use file globbing with Windows you need to link your code with an object file called setargv.obj which takes care of the messy details and expands the wildcards on the command line. I use the now-ancient Visual Studio 2008 IDE (good enough for me!) and needed to add setargv.obj as an additional project dependency under “Additional Dependencies” in the project settings for the linker. With that in place, the following ultra-simple program (no error checking!!) prints the font’s PostScript name and the full path name of the corresponding font file.

#include <stdio.h>
#include <ft2build.h>
#include FT_FREETYPE_H
#include FT_GLYPH_H
#include FT_OUTLINE_H


int main(int argc, char ** argv)
{
    FT_Library libfreetype;
    FT_Face    ftface;
    int i;

    FT_Init_FreeType( &libfreetype );

    /* argv[1..argc-1] are the font files (wildcards already expanded) */
    for (i=1; i<argc; i++){
        FT_New_Face( libfreetype, argv[i], 0, &ftface );
        printf("%s %s\n", FT_Get_Postscript_Name(ftface), argv[i]);
        FT_Done_Face(ftface);
    }

    FT_Done_FreeType(libfreetype);
    return 0;
}