A simple example to get you started
Based on code generated by the superb RegexBuddy software (the price is great value!), here’s a simple example of using the PCRE regular expression library to search a UTF-8 text buffer for strings of Arabic text. The actual regular expression is very simple: ([\x{600}-\x{6FF}]+) (the backslashes are doubled in the code below because the pattern is written as a C string literal). It just looks for sequences of Unicode codepoints from 600 (hex) to 6FF (hex), which is the range of the Unicode Arabic block. The function is not particularly efficient, but it works; for example, it should calculate the buffer length once rather than on every call to pcre_exec.
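To make concrete what that character class matches, here is a minimal sketch that does the same job by hand, without PCRE: it decodes the UTF-8 bytes and counts maximal runs of codepoints between 600 (hex) and 6FF (hex). The function names decode_utf8 and count_arabic_runs are made up for this illustration, and the decoder assumes well-formed UTF-8 (it does no validation).

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s, store the codepoint in *cp and
   return the number of bytes consumed.
   Assumption: the input is well-formed UTF-8; nothing is validated here. */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {                       /* 1 byte: plain ASCII */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {             /* 2 bytes: U+0080..U+07FF */
        *cp = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3FUL);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {             /* 3 bytes: U+0800..U+FFFF */
        *cp = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
            | (s[2] & 0x3FUL);
        return 3;
    }
    /* 4 bytes: U+10000 and above */
    *cp = ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
        | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3FUL);
    return 4;
}

/* Count maximal runs of codepoints in the range 0x600..0x6FF, i.e. the
   pieces of text that ([\x{600}-\x{6FF}]+) would match one after another. */
int count_arabic_runs(const unsigned char *text)
{
    int runs = 0;
    int in_run = 0;

    while (*text != '\0') {
        unsigned long cp;
        text += decode_utf8(text, &cp);
        if (cp >= 0x600 && cp <= 0x6FF) {
            if (!in_run) {                   /* first codepoint of a new run */
                runs++;
                in_run = 1;
            }
        } else {
            in_run = 0;                      /* anything else ends the run */
        }
    }
    return runs;
}
```

Note that a space is outside the range, so two Arabic words separated by a space count as two runs, just as they would be two separate matches for the regular expression.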
I used code like this in an Arabic text pre-processor I wrote for working with XeTeX: saving Arabic strings to a file (from XeTeX), processing the text and reading it back in via \input{...}. Special effects not directly possible in XeTeX can be achieved by a pre-processing step. Yep, it involves lots of \write18{...} calls. For sure, LuaTeX offers many other possibilities, but XeTeX’s font handling (and use of HarfBuzz) is very convenient indeed!
#include <string.h>
#include <pcre.h>

// Supplied by you: whatever you want to do with each Arabic text string
void process_string(const char *str, int wordcount);

// Called with a NUL-terminated buffer containing UTF-8 encoded text
void runpcre(unsigned char *buffer)
{
    int wordcount;
    pcre *myregexp;
    const char *error;
    int erroroffset;
    int offsetcount;
    int offsets[(1+1)*3]; // (max_capturing_groups+1)*3
    const char *res;      // pcre_get_substring expects a const char **
    const char *subject = (const char *)buffer; // PCRE takes const char *

    wordcount = 0;
    myregexp = pcre_compile("([\\x{600}-\\x{6FF}]+)", PCRE_UTF8|PCRE_UCP, &error, &erroroffset, NULL);
    if (myregexp != NULL) {
        offsetcount = pcre_exec(myregexp, NULL, subject, strlen(subject), 0, 0, offsets, (1+1)*3);
        while (offsetcount > 0) {
            // match offset = offsets[0];
            // match length = offsets[1] - offsets[0];
            if (pcre_get_substring(subject, offsets, offsetcount, 0, &res) >= 0) {
                wordcount++;
                // Do something with the match we just stored into res
                process_string(res, wordcount);
                // Release the copy allocated by pcre_get_substring
                pcre_free_substring(res);
            }
            // Resume the search just after the end of the previous match
            offsetcount = pcre_exec(myregexp, NULL, subject, strlen(subject), offsets[1], 0, offsets, (1+1)*3);
        }
        pcre_free(myregexp); // release the compiled pattern
    } else {
        // DOH! Syntax error in the regular expression at erroroffset
    }
}
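Since runpcre relies on strlen, it needs a NUL-terminated buffer. In the pre-processing workflow described above the text arrives in a file, so something has to load it first. The following is only a sketch of one way to do that using the standard C library; the helper name read_text_file is an assumption for illustration and is not part of PCRE or of the original pre-processor.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a freshly malloc'd, NUL-terminated buffer.
   Returns NULL if the file cannot be opened, sized or allocated.
   The caller must free() the result.
   (Hypothetical helper, named here for illustration only.) */
unsigned char *read_text_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    unsigned char *buf = NULL;
    long size;

    if (fp == NULL)
        return NULL;

    /* Find the file size by seeking to the end, then rewind. */
    if (fseek(fp, 0, SEEK_END) == 0
        && (size = ftell(fp)) >= 0
        && fseek(fp, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)size + 1);      /* +1 for the terminating NUL */
        if (buf != NULL) {
            size_t got = fread(buf, 1, (size_t)size, fp);
            buf[got] = '\0';                 /* terminate what was read */
        }
    }
    fclose(fp);
    return buf;
}
```

A caller would then do something along the lines of: load the file with read_text_file, pass the result to runpcre if it is not NULL, and free() the buffer afterwards. A file containing an embedded NUL byte would be cut short at that byte, because runpcre measures the text with strlen.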