Compiling LCDF Typetools under Windows using MinGW

Introduction

This is just a short post to share a workaround to a problem I ran into when building Eddie Kohler’s superb LCDF Typetools under Windows using MinGW. After running ./configure to create the makefiles, compilation failed with many error messages referring to undefined reference to `ntohl@4':


../typetools/libefont/otf.cc:863: undefined reference to `ntohl@4'
../typetools/libefont/otf.cc:861: undefined reference to `ntohs@4'
../typetools/libefont/otf.cc:861: undefined reference to `ntohs@4'

One solution

The cause of the error is a failure to link to the library libwsock32.a (contained in the c:\MinGW\lib\ directory on my PC). The following workaround solves the problem, but I’m sure there are better ways of doing it. Several tools within the Typetools collection depend on libwsock32.a to compile successfully. These are:

  • otfinfo
  • otftotfm
  • cfftot1

To build these programs you need to make a small edit to the generated makefiles.

  1. Create a directory called (say) libs within the Typetools directory tree.
  2. Copy libwsock32.a into that directory.
  3. For each application listed above that depends on libwsock32.a, open the makefile in the appropriate application directory and look for a line starting with xxxx_LDADD, where xxxx is otfinfo, otftotfm or cfftot1
  4. Edit that line to include libwsock32.a
  5. Example: cfftot1_LDADD = ../libefont/libefont.a ../libs/libwsock32.a ../liblcdf/liblcdf.a

You should now be able to run make and achieve a successful compilation. It worked for me, I hope it works for you.

Technique to make Web2C/tangle put comments in the C code generated for TeX

Introduction

Over the last couple of evenings I’ve been looking at the C code for TeX generated by the tangle and Web2C conversion process. By default, the Web2C conversion process generates C source code which is almost completely devoid of comments and symbol strings are converted to numbers (etc), making the C source nearly impossible to read. However, by making a small change to the flex-generated source code (web2c-lexer.c) together with some careful use of regular expressions on the WEB sources (and/or some manual editing) you can get a lot of comments put into the C source. Note: I’ve not yet explored whether it is possible to use the changefile method to achieve the same (or similar) results. Here’s an outline of my experimental technique.

Outline of the technique

Naturally, via the literate programming methodology, the WEB source files for TeX contain a full description of the TeX program. The Pascal code in the WEB source files is full of comments or short descriptions (enclosed in braces {....}) which, if preserved in the Web2C-generated C code, would make it much more readable. However, those Pascal comments are stripped out by tangle (but are used by weave); consequently, the Pascal generated by tangle, and fed into Web2C.exe, does not contain any useful comments. I say “useful” because the Pascal does contain some line-number comments but these are not really that helpful and they are removed by the comment-handling code in web2c-lexer.c.

Comments in WEB files

So, what are we looking to do? In essence, we need a way to convince tangle to output comments into the Pascal code it generates and find a way to ensure that those comments are processed and passed into the C code by Web2C.

Web2C and Pascal comments

Caveat: I am not a Pascal programmer and have no desire to become one! However, all you need to know is that within the Pascal generated by tangle the comments are simply enclosed in braces, like this: {This is a comment in Pascal}. These comments are filtered out by web2c-lexer.c. Another caveat: be extremely careful when making any changes whatsoever to the lexer C code (or the .l sources) – you can break things very badly (hmmm, wonder how I found that out…). I cannot stress enough the importance of being very, very careful in making any changes to web2c-lexer.l/web2c-parser.y or web2c-lexer.c/web2c-parser.c: these lexical-analyser and parser-generator sources are critical to the C-generation process. OK, I think I’ve made the point. The following description probably deserves nomination for an “Ugly Hack Award” and, no doubt, a flex/bison expert (which I’m definitely not) could design an elegant solution to incorporate comment-handling in its proper context within the parsing process. OK, enough self-flagellation, let’s move on.

If you look in Web2C-lexer.l the code which handles comments is simply:

"{" { while (webinput() != '}'); }

After running flex on Web2C-lexer.l this becomes (in Web2C-lexer.c)


case 2:
YY_RULE_SETUP
#line 53 "web2c-lexer.l"
{ while (webinput() != '}'); }
YY_BREAK

Basically, the lexical analyser is stripping out things like {This is a comment from Pascal}. To get comments into the C code generated for TeX you’ll need to modify the lexer code to stop it skipping comments and instead process them to generate C comments such as /* This is a comment from Pascal */. There are a few points to consider here: firstly, you’ll need to experiment to see exactly where the comments end up in your C code. Due to the “Ugly Hack” approach, we’re not paying any real attention to the “context” of where we are in the parsing process when outputting our comments; again, a proper flex/bison implementation would be required to do that. For example, by the time your comment is seen by the lexer a newline (\n) may already have been output, so your comments may end up on a new line – easily fixed by some manual tidy-up of the C code (or by running regular-expression tools on the Web2C-generated C source code). So, just to note that you’ll need to do some trial and error to see what happens.
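To make the idea concrete, here is a standalone sketch of the transformation a modified lexer action would perform: collect the comment body instead of discarding it, then emit it wrapped in C comment delimiters. The function name and buffer handling are mine, purely for illustration – this is not the actual web2c code.

```c
#include <stdio.h>
#include <string.h>

/* Sketch: turn the body of a Pascal comment {...} into a C comment.
   In web2c-lexer.l the "{" rule currently discards this text; a
   modified rule could collect it and emit it like this instead.
   Illustrative only -- not the real web2c source. */
static void emit_pascal_comment(const char *body, char *out, size_t outsz)
{
    /* a stray "*" followed by "/" inside the body would terminate the
       C comment early, so neutralize any such pair */
    char clean[1024];
    size_t n = strlen(body);
    if (n >= sizeof clean)
        n = sizeof clean - 1;
    memcpy(clean, body, n);
    clean[n] = '\0';
    for (size_t i = 1; i < n; i++)
        if (clean[i - 1] == '*' && clean[i] == '/')
            clean[i] = ' ';
    snprintf(out, outsz, "/*%s*/", clean);
}
```

In the lexer itself, the "{" rule would gather characters via webinput() into a buffer and pass them through something like this before writing to the output stream.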

Getting comments into WEB and surviving tangle

As noted, tangle strips out comments in the WEB sources and they don’t even reach the Pascal code it outputs. So, can we coerce tangle to preserve comments in WEB sources and put them in the Pascal for Web2C.exe to process/output? A quick reading of The WEB User Manual implies that there are two ways to get text output to the Pascal source code produced by tangle:

  • use “control-text” such as @=your comment text here@> which causes the text to be output verbatim into the Pascal code, or
  • use “meta-comments”: such as @{your comment text here @} which, in the Pascal, results in a standard comment such as {your comment text here}.

Robby the Robot says Danger!

Sorry for the reference to Robby the Robot, indulge me….. Seriously, though, if you make edits to the WEB source to put in “control-text” or “meta-comments” you can very easily foul-up tangle’s parser and break tangle’s conversion process pretty badly. As yet, I’m not able to give precise rules on where it is safe to add “control-text” or “meta-comments” (I’m still experimenting) so I suggest you read The WEB User Manual to understand a little more about WEB syntax before attempting it.

Mind the pool file: Be careful inserting/using text with double quotes "..." because it can trigger tangle’s parser to output that text in the tex.pool file which you don’t want to do. I used single quotes '...' and that seems to be safe(er). I can’t recall exactly what I did that caused this to happen, but just be sure to check that the .pool file does not become polluted with any of the text you insert into the WEB sources.

Getting to the point

So far we’ve seen that to get comments into the C source code we need to:

  1. modify the behaviour of web2c-lexer.c and tell it (selectively) not to skip Pascal’s comment constructs {...} (see use of '...', below).
  2. coerce tangle to preserve comments and output them into the Pascal source so that Web2C.exe sees them and the code in web2c-lexer.c can process them.

An example

Within the TeX WEB source code is a function which initializes TeX’s “primitives”. Here’s a small extract of the raw WEB source code


@ The symbolic names for glue parameters are put into \TeX's hash table
by using the routine called |primitive|, defined below. Let us enter them
now, so that we don't have to list all those parameter names anywhere else.

@=
primitive("lineskip",assign_glue,glue_base+line_skip_code);@/
@!@:line_skip_}{\.{\\lineskip} primitive@>
primitive("baselineskip",assign_glue,glue_base+baseline_skip_code);@/
...

When this is translated to C the result looks something like this:


...
primitive ( 381 , 75 , 24527 ) ;
primitive ( 382 , 75 , 24528 ) ;
...

Not a string or comment in sight. tangle has also converted everything into integers: "lineskip" becomes 381 … single-stepping through this C code with a debugger is not my idea of fun. So, what to do?

If you look at the form of code like

primitive("lineskip",assign_glue,glue_base+line_skip_code);

it is very amenable to processing with regular expressions. What you can do, for example, is pre-process the WEB source with your favourite regex tool to add “meta-comments” that will reach the Pascal and (with your modified lexer) make it into the C code. For example (should all be on one line):

primitive("lineskip",assign_glue,glue_base+line_skip_code); @{'lineskip,assign_glue,glue_base+line_skip_code'@};@/

Here we added the “meta-comment”

@{'lineskip,assign_glue,glue_base+line_skip_code'@}

just after the original Pascal code. Note that I have used single quotes '...' to delimit the text simply because I want to be able to detect the comments I introduced when the modified lexer is scanning my comments. To cut a long story short, through this technique you end up with C code that looks like this:


primitive ( 381 , 75 , 24527 ) ; /*lineskip,assign_glue,glue_base+line_skip_code*/
primitive ( 382 , 75 , 24528 ) ; /*baselineskip,assign_glue,glue_base+baseline_skip_code*/

Maybe not beautiful, but at least you now know what (some) of those tangle-generated numbers represent.
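If you prefer not to reach for a regex tool, the same pre-processing pass is easy to sketch in C. The function below (a hypothetical helper of my own, not part of any web2c tool) takes one line of WEB source and, when it finds a primitive("...",...) call, appends the corresponding @{'...'@} meta-comment:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: emulate the regex pre-processing step in plain C.  For a
   WEB line of the form  primitive("name",args...);  append the
   meta-comment  @{'name,args...'@}  so that tangle carries the text
   through into its Pascal output.  Returns 1 if the line was
   transformed, 0 if it was not a primitive(...) line. */
static int append_meta_comment(const char *line, char *out, size_t outsz)
{
    const char *open = strstr(line, "primitive(\"");
    const char *close = open ? strchr(open, ')') : NULL;
    if (!open || !close)
        return 0;                       /* nothing to do on this line */

    /* the arguments sit between primitive( and ), with the name quoted */
    const char *args = open + strlen("primitive(");
    int len = (int)(close - args);
    char body[256];
    int j = 0;
    for (int i = 0; i < len && j < (int)sizeof body - 1; i++)
        if (args[i] != '"')             /* drop the string quotes */
            body[j++] = args[i];
    body[j] = '\0';

    snprintf(out, outsz, "%s @{'%s'@}", line, body);
    return 1;
}
```

A real pass would, of course, run over every line of tex.web and also emit the trailing ;@/ shown in the example above.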

In conclusion

This technique is not “pretty” but, used with care, you can get tangle to output a lot of useful comments, either through regular expressions and pre-processing of the WEB code, or by hand-editing the WEB to write summaries of the descriptions of the source code. I must stress that you can’t put “meta-comments” just anywhere in the WEB source because you risk breaking tangle’s parsing process: you’ll need to experiment and proceed carefully with (say) small/minor manual edits to make sure tangle or Web2C don’t “choke” on your changes.

Porting and building Web2C.exe for Windows

Introduction

This post is, once again, an aide-mémoire to record a work-in-progress: porting the tools that convert Knuth’s original Pascal-based WEB source to C – to create a native build of Web2C.exe, fixwrites.exe and other tools using Microsoft’s Visual Studio (and not using pipes). My apologies if this post is a little unstructured but the whole task is somewhat convoluted, which may be reflected in my writing style for this post! However, I’d like to record it whilst it is fresh in my memory.

Why would anyone want to do this when there are ready-made, reliable, TeX distributions freely available? Good question. Well, for me, it’s nothing more than pure curiosity – and the fact that most British TV programs are now such mind-numbing drivel that I might as well do something productive in the evenings!

Join TUG: Just as an aside, I’m a member of the TeX User Group, TUG, so if you too would like to support TeX why not consider joining?

Another reason for writing this post is that I could not find much documentation on how to build Web2C.exe from source code – apart from these notes by Timothy Murphy, detailing the process for a Macintosh-based port. Even though they were written in 1992 they were extremely helpful in filling in some of the details, so a belated thank you to Timothy Murphy – much of this post draws inspiration from that document. Piecing together the Web2C build process has been somewhat of a “programming jigsaw” – there are still gaps in my understanding but, I think, I can see the big picture even if it’s still a little hazy in some areas.

The Big Picture

The source files for TeX, and other TeX-related programs and utilities, are written using Professor Donald Knuth’s literate programming methodology. In essence, the program code (in Pascal) and documentation of the source code (in TeX) are contained within a single file, with extension .web. For example, Professor Knuth’s source code of the latest version of TeX is contained in a file called tex.web. Similarly, within the TeXLive repository (see a previous post) or on CTAN, you can find the WEB source code for the latest versions of other programs; for example:

  • bibtex.web: the source code/documentation of BiBTeX, for formatting and producing reference lists, as widely used within academic journal papers.
  • mf.web: the source code/documentation of MetaFont.
  • patgen.web: the source code/documentation of patgen which “… takes a list of hyphenated words and generates a set of patterns that can be used by the TeX 82 hyphenation algorithm.”
  • tangle.web: the source code/documentation of tangle, which converts a WEB file to Pascal (i.e., extracts the source code in Pascal, not in C – that’s why Web2C exists).
  • weave.web: the source code/documentation of weave, which converts a WEB file to TeX (i.e., extracts the documentation of the program’s Pascal source code).

and other programs/utilities such as dvicopy.web, pltotf.web, tftopl.web and so forth.

What’s in a name: tangle, web and weave? I’ve not researched to find out, but I cannot help thinking that Professor Knuth drew inspiration from Sir Walter Scott when naming these programs. Scott’s poem Marmion contains the line(s) “O, what a tangled web we weave when we practise to deceive”. Maybe these programs are as literary as they are literate?

TeXLive as the source of the files for building Web2C.exe

The files I reference throughout this post can be downloaded via SVN from the TeXLive repository. If you want to browse the TeXLive repository, using the TortoiseSVN program on Windows, this post may be of help. The following screenshots show the TeXLive folders you’ll need to access for acquiring the various files I mention in this post.

  • svn://tug.org/texlive/trunk/Build/source/texk/web2c: this folder contains, for example, tangleboot.pin (see below) and all the *.web files listed above, plus many other essential files.

  • svn://tug.org/texlive/trunk/Build/source/texk/web2c/web2c: this folder contains the source files needed to build the actual Web2C.exe program. Note carefully it does not contain a file called Web2C.c, more on that below.

TeXLive has an advanced build-process for compiling/building all the tools and software it contains and I, for one, am in awe of the skills and expertise of its maintainers. In describing my explorations of building Web2C.exe as a Windows-based executable you need to realize that I am taking the source code files of Web2C.exe out of their “natural build environment”. What do I mean by this? Building the Web2C executable program is usually part of the much bigger TeXLive build/compilation process so you should be prepared for a little extra complexity to create Web2C.exe as a “standalone” Windows program. Note that “standalone” is in quotes because converting WEB-generated Pascal into C code requires other tools in addition to Web2C.exe: it is not fully accomplished by Web2C.exe alone.

A note about Kpathsea

The Kpathsea (path-searching) C library is an integral part of most TeX-related software and the Web2C C source files #include a number of Kpathsea headers. However, for my own purposes/experiments I’ve decided to decouple my build of the Web2C.exe executable from the need to include Kpathsea’s headers – the resulting C files generated by Web2C.exe will, of course, still depend on Kpathsea. If you grab the Web2C source files (see below) then “out of the box” you’ll need to check out the Kpathsea library from:

svn://tug.org/texlive/trunk/Build/source/texk/kpathsea

I’ve simply not got the time to document everything I had to do to decouple Kpathsea when building Web2C.exe. It mainly involved commenting out various #include lines that pulled in Kpathsea headers and placing a few #define statements into my local version of web2c.h – plus creating some typedefs and adding a few macros. If you’re an experienced C programmer it is unlikely to present difficulties. As mentioned, this post describes a work-in-progress to satisfy my own curiosity and is meant to share a few of the things I’ve learnt, should they be useful to anyone as a starting point for their own work.
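By way of illustration, the stubs I describe were along the following lines. Note that these particular typedefs, macros and the xfopen stand-in are my guesses at the sort of thing required; the real Kpathsea library provides much richer versions, and the exact symbols you need to stub will depend on your source snapshot:

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-ins for a few kpathsea facilities that the
   web2c sources expect.  Which symbols actually need stubbing
   depends on your version of the sources. */
typedef char *string;
typedef int boolean;
#define true 1
#define false 0

/* minimal replacement for kpathsea's xfopen: fopen or die */
static FILE *xfopen(const char *name, const char *mode)
{
    FILE *f = fopen(name, mode);
    if (!f) {
        fprintf(stderr, "fatal: cannot open %s\n", name);
        exit(EXIT_FAILURE);
    }
    return f;
}
```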

Web2C: so what is it?

Let me be clear that when I refer to Web2C I am referring to the executable program which undertakes the first (main) step in converting Pascal code into C. So let’s now start to take a look at the details, but first a summary of “Where are we?”

Where are we?

The starting point for generating C code is to extract the Pascal code from WEB source files and that is accomplished using the tangle program. However, where do we get a working tangle program from to start with – do we have a chicken and egg problem? tangle is itself distributed in WEB source code (tangle.web), so if I need tangle to extract tangle’s source code from tangle.web, how do I create a working tangle program? Well, of course, this is solved by the distribution of tangle’s Pascal code in a file called tangleboot.pin within the Web2C directory of the TeXLive repository (see above). In essence, tangleboot.pin lets you “bootstrap” the whole Web2C process by creating a working tangle.exe which you can use to generate the Pascal from WEB source files. Hence the name tangleboot.pin.

So, how do I go from tangleboot.pin to a working tangle.exe? You need to build Web2C.exe and some associated utility programs (e.g., fixwrites.exe).

Where are the Web2C.exe source files?

As mentioned above, the TeXLive folder containing the source files needed to build Web2C.exe is

  • svn://tug.org/texlive/trunk/Build/source/texk/web2c/web2c

The C source files you need to compile/build Web2C.exe are:

  • kps.c
  • main.c
  • web2c-lexer.c
  • web2c-parser.c

Some notes on these files

These C files #include a number of header files from the TeXLive distribution, notably from the Kpathsea library, so you should definitely look through them to determine any additional files you need.

The files web2c-parser.c and web2c-lexer.c are worthy of some explanation because they are the core files which drive the Pascal –> C conversion process. However, these two C source files are not hand-coded but are generated from two further source files with similar names. If you look among the source files you will also notice these two additional files:

  • web2c-lexer.l
  • web2c-parser.y

What are these files with similar names? As you may infer from their names, these files are a lexical analyser and a parser generator and require additional tools to process them:

  • web2c-lexer.l --> web2c-lexer.c using a tool called flex.
  • web2c-parser.y --> web2c-parser.c + web2c-parser.h using a tool called bison.

Are bison/flex available for Windows?

Fortunately they are and, at the time of writing (February 2013), you can download Windows ports of bison 2.7 and flex 2.5.37 from http://sourceforge.net/projects/winflexbison/. The executables are called win_bison.exe and win_flex.exe respectively. The win_flex.exe port of flex adds an extra command-line switch (--wincompat) so that the C code it generates uses the standard Windows header io.h instead of unistd.h (which is used on Linux). You can also download older versions of bison and flex for Windows from the GnuWin32 project.

I have not yet tried to use the code generated by win_flex.exe and win_bison.exe but to the best of my (current) knowledge the command-line options you need are:

  • win_bison -y -d web2c-parser.y to generate the parser (you’ll get different file names on output: y.tab.c and y.tab.h)
  • win_flex --wincompat web2c-lexer.l to generate the lexical analyser (you’ll get a different file name on output: lex.yy.c)

You need more than just Web2C.exe

Assuming that you successfully build Web2C.exe, it is still not the end of the story. Although Web2C.exe does the bulk of the work in converting the Pascal to C, some initial pre-processing of the Pascal source file is needed before you can run it through Web2C.exe, and some further post-processing of the C code output by Web2C.exe is also needed. The details of how these pre- and post-processing steps actually work are contained within an important BASH shell script called convert (it has no extension) – convert is located within the TeXLive folder containing the Web2C source files. I readily confess that I know very little about Linux shell scripting so if you are well-versed in shell scripts no doubt you can easily understand what is going on in the convert file. However, here are pointers to get you started.

Pre-processing: adding the *.defines files to the Pascal file

Before you can actually run Web2C.exe on the Pascal file generated from WEB sources you need to concatenate the Pascal source file with some files having the extension “.defines”: you add these files to the start of the Pascal file before running Web2C.exe. There are several .defines files contained in the Web2C source directory, including:

  • common.defines
  • mfmp.defines
  • texmf.defines

The convert script checks which program (TeX, MetaFont, BiBTeX etc), and which options, are being built and concatenates the appropriate *.defines file(s) to the start of the corresponding Pascal file. At this time, I don’t quite fully understand how/why these files are needed, but for the full details you need to read convert. By way of an example, when processing tangleboot.pin I added the file common.defines to the beginning of tangleboot.pin.
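Functionally, this pre-processing step is nothing more exotic than file concatenation (the convert script effectively does cat common.defines tangleboot.p). A minimal C helper to do the same, with example file names of my own choosing, might look like:

```c
#include <stdio.h>

/* Sketch of the pre-processing step: append the contents of one
   file (e.g. common.defines, then the tangled Pascal) to an open
   output stream.  Returns 0 on success, -1 if the source file
   cannot be opened. */
static int append_file(FILE *dst, const char *path)
{
    FILE *src = fopen(path, "rb");
    if (!src)
        return -1;
    int c;
    while ((c = getc(src)) != EOF)   /* byte-for-byte copy */
        putc(c, dst);
    fclose(src);
    return 0;
}
```

You would call it once for each *.defines file and then for the Pascal file, writing everything to the stream that gets fed to Web2C.exe.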

Post-processing: fixwrites.exe

Web2C.exe’s output is not quite pure C source code – it may still contain some fragments of Pascal which need a specialist post-processing step to fully convert them to C: enter fixwrites.exe, which post-processes Web2C.exe’s C output to “…convert Pascal write/writeln’s into fprintf’s or putc’s” (see fixwrites.c).

Notes on web2c-parser.c, web2c-lexer.c and main.c

Upon reading the convert script, and when I first ran Web2C.exe, it became readily apparent that the whole Pascal –> C tool chain (driven by convert) communicates using pipes with stdout/stderr. The output of one program is “piped” into the input to another, rather than writing the data out to a physical disc file and then reading it back in. My personal preference, certainly whilst learning, is to output data to a file so that I can capture what’s going on.

main.c and yyin

Without going into too much detail, I needed to make a number of changes in main.c so that the lexical analyzer web2c-lexer.c was set to read its data from a disc file rather than through pipes/stdin. The FILE* variable you need to set/define is called yyin. For example, within main.c there is a function called initialize () which can be used to set yyin:

void initialize (void)
{
  register int i;
  for (i = 0; i < hash_prime; hash_list[i++] = -1)
    ;
  yyin = xfopen ("your_path_to\\tangleboot.p", "r");
  ...
}

In addition, within main.c there's a small function called normal () which does the following:

void normal (void)
{
  out = stdout;
}

The normal () function is called from within web2c-parser.c to set the output file (FILE *out) to stdout. At present, I'm not sure precisely why this is done, but I guess it is part of the piping between programs as driven by the convert process. For example, code within convert uses sed (the stream editor).

Other output redirections happen in web2c-parser.c and you can search for these by looking for out = 0. Tracking down and locating these output redirections certainly helped me to better understand the flow of the programs.
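For completeness, here is the kind of one-line change I mean: a variant of normal () that opens a disc file instead of using stdout. The output file name below is an example of my own, not anything mandated by web2c:

```c
#include <stdio.h>
#include <stdlib.h>

FILE *out;   /* web2c's output stream, normally set to stdout */

/* Sketch: a variant of normal() that writes the generated C to a
   disc file rather than stdout, matching the yyin change above.
   "web2c-output.c" is an example name only. */
void normal (void)
{
    out = fopen("web2c-output.c", "w");
    if (!out) {
        perror("web2c-output.c");
        exit(EXIT_FAILURE);
    }
}
```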

In conclusion

This post is a little disjointed in places and light on detail in a number of areas, reflecting my own (currently) incomplete understanding of the relatively complex processes involved in converting WEB/Pascal to C. Nevertheless, I hope that it is of some use to someone, at some point. As my understanding develops I'll try to fill in the gaps with future posts.

Building and using CTIE, CWEAVE and CTANGLE on Windows

Introduction

Before continuing, I should say that this post is a sort of aide-mémoire for myself but I hope it is useful to others as well. Anyone who has looked into building TeX from the WEB source code soon finds that the process is somewhat “less than straightforward”. Life can get a bit more complicated if, like me, you prefer to use Microsoft’s Visual Studio rather than MSYS/MinGW – which gives you a cut down “Linux-like” build environment. I use MSYS/MinGW for building LuaTeX and it works really well, but I confess to being seduced by the nice IDE of Microsoft’s Visual Studio. Having used Visual Studio to build a couple of C-based TeX distributions (Y&Y TeX, now open sourced), together with CXTeX, part of MetaTeX (by Taco Hoekwater), I have decided to “bite the bullet” and create a Visual Studio build for LuaTeX. I’m sure this will take quite some time but, you know, sometimes you get one of those itches you just have to scratch! And I’ve been meaning to attempt this for a long time, purely as an exercise.

CWEB

A lot of LuaTeX source code (apart from the libraries it uses) is written in CWEB, a dialect of literate programming by Silvio Levy and Donald Knuth. The original WEB sources of TeX use Pascal as the programming language, but for CWEB it is C – which, thankfully, saves you the painful process of converting Pascal to C via Web2C. Anyone who has looked at the C code generated by Web2C will, I’m sure, be dismayed because it’s almost impossible to read. This is not so surprising given that it’s machine generated. The joy of LuaTeX’s C code, derived from CWEB, is that it is much, much more readable than the C code derived via WEB –> Web2C –> C. Not surprising, of course, because CWEB C code is not generated mechanically.

So, how do you process CWEB code?

Enter CWEAVE and CTANGLE. What are these, you may well ask. Their task is to process a file written using CWEB (typically, with an extension “.w“) and output the C source code (using CTANGLE) or the program’s documentation in TeX (using CWEAVE). From http://sunburn.stanford.edu/~knuth/cweb.html:

  • CTANGLE: converts a source file foo.w to a compilable program file foo.c;
  • CWEAVE: converts a source file foo.w to a prettily-printable and cross-indexed document file foo.tex.

The LuaTeX build process (with MSYS/MinGW) also generates the executable ctangle.exe, so my first thought was “Great, I’ll just use that to generate C source from the CWEB *.w files in the LuaTeX distribution”. Running ctangle (from the LuaTeX build) using the command line (under the Windows cmd shell, or the MSYS BASH shell):

ctangle --help

you get the following output:

$ ctangle --help
Usage: ctangle [OPTIONS] WEBFILE[.w] [{CHANGEFILE[.ch]|-} [OUTFILE[.c]]]
Tangle WEBFILE with CHANGEFILE into a C/C++ program.
Default CHANGEFILE is /dev/null;
C output goes to the basename of WEBFILE extended with `.c'
unless otherwise specified by OUTFILE; in this case, '-' specifies
a null CHANGEFILE.

But first, change files and CTIE

From the output of ctangle --help you can see that its command line includes reference to CHANGEFILE.ch. So, what is that? Suppose that you have some program foo.w written in CWEB and you want to make some platform-specific modifications to foo.w. Rather than amending foo.w itself (e.g., to keep it platform-independent) you “merge” foo.w with a change file which, for example, may contain Windows-specific code. You would put your Windows CWEB code into, say, win32.ch and merge this code with foo.w. So how do you do this merge? There are two main ways:

  1. you can combine foo.w and win32.ch using CWEAVE or CTANGLE, or
  2. you can use another program called CTIE

What is CTIE?

The idea behind CTIE is that it lets you merge a master CWEB file with multiple change files, whereas CWEAVE or CTANGLE support only one change file on their command line. The source code of CTIE is also part of the LuaTeX distribution, in the directory ..\source\texk\web2c\ctiedir\. In there you’ll find a file called ctie.c which compiles easily using Visual Studio to give you ctie.exe. If you want to read more about CTIE I have processed the documentation which you can download as a PDF.

Default CHANGEFILE is /dev/null

On reading the Usage information output by ctangle --help you should note that the Usage instructions state: Default CHANGEFILE is /dev/null. The explanation of /dev/null on http://en.wikipedia.org/wiki/Dev-null states that:

“In Unix-like operating systems, /dev/null or the null device is a special file that discards all data written to it.”

The Usage instructions are a little cryptic but what they are saying is that if you want to run CWEAVE or CTANGLE without using a change file, you would run it like this:

ctangle foo.w -

where the hyphen (-) in effect says “don’t use a changefile”. Let’s take an example CWEB file from the LuaTeX distribution, align.w, and try to generate the C source code using the version of ctangle built during the LuaTeX compilation process using MSYS/MinGW. Here we don’t want to apply a change file so we’ll use the hyphen option in place of a change file (the C output file will default to align.c):

ctangle align.w -

The resulting output is:

This is CTANGLE, Version 3.64 (TeX Live 2011)
! Cannot open change file NUL. (l. 0)

(That was a fatal error, my friend.)

That’s a bit annoying, but the fix is very simple and there are a couple of ways to do it.

How to fix this?

To build CTANGLE you need two files: ctangle.c and common.c, both of which are located in the source directory of LuaTeX (..\source\texk\web2c\cwebdir\). The “offending” code which causes the fatal error is located in common.c (and, of course, common.w).

In common.c (or common.w) you’ll find the line:

if (found_change<=0) strcpy(change_file_name,"/dev/null");

and that needs changing for Windows. Fortunately, in the LuaTeX distribution (..\source\texk\web2c\cwebdir\) there is a change file, comm-w32.ch, taking care of this (written by Fabrice Popineau, in February 2002). In comm-w32.ch you'll find the above line replaced with:

if (found_change<=0) strcpy(change_file_name,"NUL");
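An alternative to patching the literal by hand, sketched below, is to select the null-device name with a compile-time conditional. The NULL_DEVICE macro and the helper function are my own names, not code from common.w:

```c
#include <string.h>

/* Sketch: pick the platform's null device at compile time instead of
   hard-coding one literal.  comm-w32.ch effectively hard-codes "NUL";
   this conditional covers both platforms from a single source. */
#ifdef _WIN32
#define NULL_DEVICE "NUL"
#else
#define NULL_DEVICE "/dev/null"
#endif

/* mirrors the logic in common.w: fall back to the null device when
   no change file was given on the command line */
static const char *default_change_file(int found_change)
{
    return (found_change <= 0) ? NULL_DEVICE : NULL;
}
```

On Windows this yields "NUL", elsewhere "/dev/null", so the same source builds on both platforms.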

Of course, the proper way to fix this is to apply a change file (such as comm-w32.ch) to the CWEB source of common.w and re-generate common.c with the above fix. You can fix common.c in at least two ways:

  1. manually edit common.c to replace "/dev/null" with "NUL" in the line above, or
  2. use the LuaTeX-build created version of ctangle but with the comm-w32.ch change file – it was the absence of a change file that caused the error we are trying to fix.

Note: If you are experimenting with these CWEB tools I strongly suggest you make a copy of all your *.w files into a working directory in case you make an error and accidentally overwrite any files.

Copy ctangle.exe, common.w and comm-w32.ch to a working directory away from your main source code, CD into that directory (make it the current directory), and run the following command line (it works under DOS and the MSYS BASH shell). The "./" simply tells ctangle to look in the current directory.

$ ctangle ./common.w ./comm-w32.ch ./mycommon.c

    If successful, the output should be:

    ctangle ./common.w ./comm-w32.ch ./mycommon.c
    This is CTANGLE, Version 3.64 (TeX Live 2011)
    *1*5*7*27*56*67*77*81
    Writing the output file (./mycommon.c):.....500.....1000
    Done.
    (No errors were found.)

    giving you a new version of common.c (which I called mycommon.c) with the fix applied by comm-w32.ch. If you look at the last lines of the mycommon.c file you just generated you should see something like this:

    #line 78 "./comm-w32.ch"
    if(found_change<=0)strcpy(change_file_name,"NUL");
    #line 1283 "./common.w"

    You can see that line 78 of comm-w32.ch has been applied. Now, with the fixed file (mycommon.c), you can proceed to build CTANGLE using Visual Studio, generating an executable that defaults to the "NUL" device when no change file is supplied. We'll see that in a moment.

    Let's try a different approach: using CTIE to merge common.w and comm-w32.ch into, say, mycommon.w. From mycommon.w we'll use our newly compiled CTANGLE to output mycommon.c. The following CTIE command line does the trick:

    ctie -m mycommon.w common.w comm-w32.ch

    The -m option is documented here. If successful, you should see something like this:

    This is CTIE, Version 1.1
    Copyright 2002,2003 Julian Gilbey. All rights reserved. There is no warranty.
    Run with the --version option for other important information.
    (common.w)
    (comm-w32.ch)
    ....500....1000....
    (No errors were found.)

    Alternatively, you can of course simply edit the file common.c directly to make the change. Once you fix common.c, both CWEAVE and CTANGLE compile nicely with Visual Studio and work perfectly when you use the "-" option to indicate no change file. With a working CTANGLE you can generate the C source of newcommon.w like this:

    G:\CWEB\cwebtools\Debug>ctangle newcommon.w -
    This is CTANGLE (Version 3.64)
    *1*5*7*27*56*67*77*81
    Writing the output file (newcommon.c):.....500.....1000
    Done.
    (No errors were found.)

    With CTANGLE in place you can now run it on the CWEB *.w sources in LuaTeX to generate the C code. Clearly, for Visual Studio one way to proceed is to incorporate CWEB sources into your project and have a "Custom Build Step" for .w files, processing them with CTANGLE.

    Happy TeXing, or should I say ctieing, ctangling and cweaving!

    Adding a UTF-8-capable regular expression library to LuaTeX

    Introduction

    In this post I’m going to sketch out adding the free PCRE C library to LuaTeX through a DLL and outline how you can get PCRE to call LuaTeX! The following is just an outline of an experiment, not a tutorial on PCRE, and I’ve not tried this in a production environment. So, do please undertake all necessary testing and due diligence in your own code!

    PCRE: Perl Compatible Regular Expressions

    PCRE is a mature C library which provides a very powerful regular expression engine. It is also capable of working with UTF-8 encoded strings, which is, of course, very useful because LuaTeX uses UTF-8 input. I’m not going to cover the entire PCRE build process in this post because, frankly, it’ll take too long. But in outline…

    Building PCRE as a static library (.lib)

    1. I used CMake to create a Visual Studio 2008 project via the PCRE-supplied CMakeLists.txt file. Using the CMake tool you can set the appropriate compile-time flags for UTF-8 support: PCRE_SUPPORT_UTF and PCRE_SUPPORT_UNICODE_PROPERTIES. The latter is very useful for searching UTF-8 strings based on their Unicode character properties. Full details are in the PCRE documentation.
    2. After you finish configuring the PCRE build, and have selected your build environment, press Generate and CMake will output a complete Visual Studio project that you can open and start working on. Wonderful!
    3. Getting PCRE to build as a static library was straightforward, but I did have a few hassles getting the library to link correctly against the DLL I was building. It took me a bit of time to figure out which additional PCRE preprocessor directives I needed to set in the DLL C code to ensure everything was #define‘d properly.

    Building a DLL for LuaTeX

    I wrote a very brief overview of building DLLs for LuaTeX in this post so I won’t repeat the details here. Instead, I’ll give a summary indicating how you can get PCRE to call LuaTeX. One word of advice, PCRE comes with a lot of documentation and you’ll need to read through it very carefully! Asking PCRE to call LuaTeX sounds strange but indeed you can do it because PCRE provides the ability to register a callback function it will call each time it matches a string. Perl has a similar ability to execute Perl code on matching a string. From the PCRE documentation:

    “PCRE provides a feature called ‘callout’, which is a means of temporarily passing control to the caller of PCRE in the middle of pattern matching. The caller of PCRE provides an external function by putting its entry point in the global variable pcre_callout.”
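
    Stripped of PCRE’s specifics, a callout is simply the familiar C pattern of a library-owned global function pointer. Here is a self-contained sketch of that pattern using stand-in types of my own (not PCRE’s actual API), showing how a “library” invokes whatever callout the caller has registered:

    ```c
    #include <stdio.h>

    /* Stand-ins for PCRE's types: a block describing the current match,
       and a library-owned function pointer analogous to pcre_callout. */
    typedef struct {
        const char *matched_text;
        void *callout_data;
    } callout_block;

    int (*demo_callout)(callout_block *) = NULL;

    int callout_hits = 0;  /* how many times our callout has fired */

    /* "Library" side: invoke the callout, if one is registered, each
       time a match is found. A zero return means matching proceeds. */
    static int report_match(const char *text, void *data) {
        if (demo_callout) {
            callout_block cb = { text, data };
            return demo_callout(&cb);
        }
        return 0;
    }

    /* "Caller" side: our callout simply counts its invocations. */
    static int counting_callout(callout_block *cb) {
        (void) cb;  /* a real callout would inspect the match here */
        callout_hits++;
        return 0;   /* 0 = let matching proceed as normal */
    }

    int main(void) {
        demo_callout = counting_callout;  /* like: pcre_callout = mycallout; */
        report_match("Yo!", NULL);
        report_match("I was called!", NULL);
        printf("callout fired %d times\n", callout_hits);
        return 0;
    }
    ```

    The real PCRE mechanism works the same way: you assign your function to the global pcre_callout and the matching engine calls it at the (?C) points in your pattern.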

    Calling LuaTeX

    OK, so how do we do that? There are two parts to this story: create a Lua function you want to call from C and create the C function which calls the Lua function.

    1. From within LuaTeX, use \directlua{...} to create a simple Lua function printy that we are going to call from PCRE. This Lua function takes a string and sends it to LuaTeX via tex.print(). In these examples I sent LuaTeX a simple text string "Yo! I was called!", which LuaTeX then typeset. Of course, you could also send LuaTeX the string that was matched by PCRE!
             \directlua{
                    function printy (str)
                           tex.print(str)
                    end
             }
      
    2. The next part is to create the C code to call a Lua function. This C function is the callout that PCRE will call when it matches a string.
             int mycallout(pcre_callout_block *cb){
                    lua_State *L = (lua_State *) cb->callout_data;
                    if (L){
                           lua_getglobal(L, "printy");
                           if(!lua_isfunction(L,-1)) {
                                  lua_pop(L,1);
                                  return 0;
                           }
      
                           lua_pushstring(L, "Yo! I was called!");   /* push 1st argument */
                           /* Now make the call to printy with 1 argument and 0 results */
                           if (lua_pcall(L, 1, 0, 0) != 0) {
                                  /* report your error */
                                  return 0;
                           }
                    }
                    return 0;
             }
      

      A few points here are worth noting.

      • From the PCRE documentation:

        “The external callout function returns an integer to PCRE. If the value is zero, matching proceeds as normal. If the value is greater than zero, matching fails at the current point, but the testing of other matching possibilities goes ahead, just as if a lookahead assertion had failed. If the value is less than zero, the match is abandoned, and the matching function returns the negative value.”

      • The lua_State variable, *L, is passed in via a mechanism I’ll outline below.
      • The lua_getglobal call does the main work of pushing the value of the global variable printy onto Lua’s stack. In effect, this is the function we defined in LuaTeX, which we then call through lua_pcall(...). Further details are in the Lua documentation.
      • The above code does near-zero error checking, it is purely to demonstrate the ideas!

    Other PCRE bits and pieces

    There are a few other points to consider, namely how do you setup the callout and how do you pass lua_State *L to the callout? I’m not going to explain in great detail how all these parts hang together in a full application, simply point out some key pieces.

    1. You have to set the PCRE global variable pcre_callout, a function pointer, to your callout function. Simply: pcre_callout = mycallout; Yes, it does work.
    2. Before you can start searching, you need to “compile” your regular expression pattern. In the code below, re represents our compiled regular expression pattern; note that you must use the PCRE_UTF8 option if you are searching UTF-8 encoded text.
                    re = pcre_compile(pattern,
                                      PCRE_UTF8|PCRE_UCP,
                                      &err_msg,
                                      &err,
                                      NULL);
      
    3. Note, to use PCRE callouts you need to use the appropriate syntax in your regular expression; from the PCRE documentation, “Within a regular expression, (?C) indicates the points at which the external function is to be called.” Once you have compiled your search pattern, and done your error checking, you need to run the search engine using the compiled pattern and your target string (s) in the code below.
    4. The next step is to create a pointer to a pcre_extra struct. This struct has a field called callout_data, a pointer into which you can store whatever you want passed into the mycallout function: here, I’m setting it to the lua_State variable, L. PCRE copies that pointer into the callout_data field of the pcre_callout_block handed to your callout, so each time PCRE matches a string and calls the callout function, the lua_State variable L will be available for our use! Clearly, you’ll need to do this from within the appropriate function you call from LuaTeX. Once this is done you are ready to begin your searching using pcre_exec(...).

                    pcre_extra *p;
                    p = (pcre_extra *) malloc(sizeof(pcre_extra));
                    memset(p, 0, sizeof(pcre_extra));
                    p->callout_data = L;
                    p->flags = PCRE_EXTRA_CALLOUT_DATA;
                    res = pcre_exec(re,
                                    p,
                                    s,
                                    len,
                                    0,
                                    0,
                                    offsets,
                                    OVECMAX);
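
    The only subtle part of step 4 is smuggling a context pointer through a void * field and recovering it inside the callback. Stripped of PCRE and Lua, the pattern looks like this (the struct and state types here are stand-ins of my own, not the real pcre_extra or lua_State):

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for pcre_extra: a void * slot that carries arbitrary
       caller context into the callout. */
    typedef struct {
        void *callout_data;
    } extra_t;

    /* Stand-in for lua_State: whatever state the callout must reach. */
    typedef struct {
        int strings_typeset;
    } fake_lua_state;

    /* The callout recovers its context by casting the void * back. */
    static int my_callout(extra_t *cb) {
        fake_lua_state *L = (fake_lua_state *) cb->callout_data;
        if (L == NULL)
            return 0;
        L->strings_typeset++;  /* where the real code would lua_pcall() printy */
        return 0;
    }

    fake_lua_state global_state = { 0 };

    int main(void) {
        extra_t *p = malloc(sizeof(extra_t));
        p->callout_data = &global_state;  /* like: p->callout_data = L; */

        my_callout(p);  /* the "library" would do this on every match */
        my_callout(p);

        printf("typeset %d strings\n", global_state.strings_typeset);
        free(p);
        return 0;
    }
    ```

    Because the field is a plain void *, the compiler cannot check the cast inside the callback, so it is entirely up to you to store and recover the same type on both sides.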
      

    Summary

    PCRE is a marvellous and powerful C library – with copious documentation that you’ll need to read very carefully! The ability to provide LuaTeX with a UTF-8-enabled regex engine could open the way to some useful applications, particularly when combined with LuaTeX’s own callback mechanism. In particular, the process_input_buffer callback allows you to change the contents of the line input buffer just before LuaTeX actually starts looking at it. The mind boggles at the possibilities!

    Browsing LuaTeX source with NetBeans

    Introduction

    It’s been a long time since I posted anything on this blog, mainly because my job has been keeping me very busy. As time permits I’ve been reading parts of the LuaTeX source code in an attempt to better understand how it all works: cross-referencing the source code to explanations in the LuaTeX Reference. A couple of days ago I stumbled on the NetBeans IDE – a free Integrated Development Environment. I was interested to see that NetBeans has a Subversion Checkout Wizard (i.e., built-in SVN capabilities), so you can checkout a copy of the LuaTeX code repository and import it directly into NetBeans as a new project. So, I downloaded NetBeans (with C/C++ support) and checked out a copy of the LuaTeX code base, directly from within NetBeans. After completing the download, NetBeans automatically imported the LuaTeX code to create a new project. Very nice!

    I have not yet tried to build LuaTeX using NetBeans (because I need to understand more about the build process), but I have found that it provides excellent tools to search and browse the source code, allowing you to very quickly explore and probe some of the deeper mysteries of TeX.

    Tip: tell NetBeans about .w files

    Much of the LuaTeX code base is written in CWEB (integrated C source code and documentation); consequently, many of the source files have a .w extension. You’ll need to configure NetBeans to tell it about .w files: see Tools –> Options –> Miscellaneous.

    Here’s a screenshot showing a search for the build_page() function, part of TeX’s page-building machinery, showing you where and when TeX exercises the page builder.