JOE: Hacker's Guide

Character classes

The data structures for representing character classes such as [a-z0-9] or [\w\d] are in cclass.c and cclass.h. These are used by the syntax highligher, the regular expression matcher, the keyboard mapping table and for implementing JOE's ctype routines (such as joe_iswupper).

Structures are provided for sets (is a character in a particular set?), mappings to integers (which integer is associated with this character?) and mappings to pointers (which pointer is associated with this character?).

Two different sets of algorithms are provided: one set based on intervals and another based on radix trees.

Interval arrays

An interval is a simple range of character codes in the following structure:

struct interval {
	int first, last;
};

If A-Z is be represented, then first is 65 and last is 90. If just the number zero is to be represented, then first and last are both 48.

An array of these represents a set. If the array is sorted, you can use a binary search to determine if a character belongs to a set. Two functions supporting this are provided:

/* Sort an struct interval array */
void interval_sort(struct interval *array, ptrdiff_t size);

/* Test if character ch is in a sorted interval array using binary search.
   Returns -1 if not found, or index to matching struct interval */
ptrdiff_t interval_test(struct interval *array, ptrdiff_t size, int ch);

These functions are no longer used in the editor since they have been superseded by radix trees. However, facts about Unicode use arrays of intervals, see unicat.c. Unicat.c is generated by util/uniproc.c from the raw data provided on the unicode.org website. Arrays of intervals provide the source data for the radix trees. The struct Cclass includes an interval array as the source data before it is optimized into a struct Rset.

Interval lists

The next data structure in cclass.c is a linked list of intervals:

struct interval_list {
        struct interval_list *next; /* Next item in list */
        struct interval interval; /* Range of characters */
        void *map;
};

This provides a way to create a mapping between characters and pointers (void *map). For example, a key sequence mapping table (a KMAP) can be implemented this way. Map could point to either a MACRO bound to a key, or to a KMAP for a sub-map (for example a top-level map would have ^K bound to a submap which has H bound to the help command).

The following functions are provided to handle interval lists:

/* Construct a struct interval_list */
struct interval_list *mkinterval(
	struct interval_list *next,
	int first, int last,
	void *map);

/* Free an interval list (frees all items in the list) */
void rminterval(struct interval_list *item);

/* Add a single interval to an interval list */
struct interval_list *interval_add(
	struct interval_list *interval_list,
	int first, int last,
	void *map);

/* Add an array of intervals (a character class) to an interval list */
struct interval_list *interval_set(
	struct interval_list *list,
	struct interval *array, int size,
	void *map);

/* Look up single character in an interval list, return what it's mapped to */
void *interval_lookup(
	struct interval_list *list,
	void *dflt,
	int ch);

/* Print interval list to the log buffer for debugging */
void interval_show(struct interval_list *list);

Interval_add is a complex funtion which adds a mapping to an interval list. You pass it the address of the first item in the list (or NULL) and it returns an edited list. Interval_add will override any mappings which already exist in the given range. The resulting list is always sorted and the items in it are non-overlapping. It might add, delete or modify items to the list.

Interval_set is just like interval_add, but takes an array of intervals (as previously discussed) instead of a single interval. For example, to create a mapping from all Unicode numbers to a particular MACRO, you could pass the Nd_table from unicat.c to it.

Interval_lookup returns the pointer mapped to the character, or if none is mapped, it returns dflt.

Radix trees

These sets of radix tree structures and associated methods are provided:

Rset: for sets
Rtree: for maps from characters to pointers
Rmap: for maps from characters to integers

The best way to understand a radix tree is to study one of the lookup functions (slightly simplified):

void *rtree_lookup_unopt(struct Rtree *r, int ch)
{
        int a, b, c, d;
        int idx;

        a = (ch >> TOPSHIFT);
        b = (SECONDMASK & (ch >> SECONDSHIFT));
        c = (THIRDMASK & (ch >> THIRDSHIFT));
        d = (LEAFMASK & (ch >> LEAFSHIFT));

        idx = r->top.entry[a];
        if (idx != -1) {
                idx = r->second.table.b[idx].entry[b];
                if (idx != -1) {
                        idx = r->third.table.c[idx].entry[c];
                        if (idx != -1)
                                return r->leaf.table.d[idx].entry[d];
                }
        }

        return NULL;
}

The idea is to break the 21 bit character code into four parts: a (7 bits), b (5 bits), c (5 bits) and d (4 bits) and use these to traverse a four-level tree. The top-level table is embedded in the struct Rtree itself. Tables from the other levels (the two mid levels and the leaf level) are allocated out of struct Levels. The top and mid-level tables are arrays of 16-bit shorts.

The number of levels and the size of each level was chosen as a compromise between speed and size. We certainly want radix tree lookups to be faster than interval array binary searches (otherwise why bother?), so we can't have too many levels. On the other hand, radix trees should not use too much memory. So for example a single level of 2^21 entries is out.

De-duplication

The leaf tables often contain the same data, so we de-duplicate them such that many of the second to last level table entries index to the same leaf table. This is such an effective optimization that the resulting radix tree ends up being around the same size as an equivalent interval array (an interval array with an array of the same number of entries containing associated data (ints or pointers)).

The second to last level also has duplicate tables, but not many so it's not de-duplicated.

De-deplication happens progressively during the build, and as a second pass optimization step. Why do it progressively if there is a whole second pass? There is an important reason for this: shorts are used for the second to last level table indexes. If you don't deduplicate progressively, you can have more than 64K leaf tables before optimization and overflow the short. There is a second reason: lower memory usage tends to equate to faster performance. For this reason it's preferable to keep the memory usage low during the build.

The progressive deduplication algorithm is simple: if the most recently completed leaf table matches the previously completed leaf table, we use the previous one in its place.

There is one big complication: there is no strong concept of a completed leaf table. A leaf table is assumed to be complete when the last entry has been written to. But there is nothing preventing previous data from being overwritten, or for tables to be made in reverse order (this can happen in a regex character class like [zyxw...]). It means that a de-duplicated table might have to re-duplicated. There is a reference count on each leaf table which is incremented every time that table is used during de-duplication. If the reference count is not equal to one during a table write, it is first copied to a new table (it's a copy on write scheme).

The second-pass deduplication works like this: a hash table of all leaf tables is built. If a duplicate table is detected during the hash table build, it is deduplicated (not added to hash table).

Anglocentrism

There is one further optimization: we provide faster access to the first 512 character codes (so that ASCII is fast). To do this, the first of the second to last level tables is embedded in the struct Rtree. If the code is below 512, then the upper 5 bits index this table to find the leaf table, and the lower 4 bits index to leaf table. Thus, there is one local access and one indirect access for codes below 512.

Rset

struct Rset radix trees are used for simple sets. In this case there are no leaf tables. Instead, the second to last level table contains bit maps. If a bit is a one, then the corresponding character is in the set. Each bit map is 16 bits: the shorts that make up the second to last level table are used for these bitmaps instead of as leaf-table indexes.

Here are the provided methods:

/* Initialize a struct Rset */
void rset_init(struct Rset *r);

/* Clear a struct Rset: anything malloced for it is freed, but the
 * struct Rset itself is not freed (it could be embedded in another
 * structure). */
void rset_clr(struct Rset *r);

/* Test if a character is in a set: this one can be called before
 * before rset_opt is called. */
int rset_lookup_unopt(struct Rset *r, int ch);

/* Add a range of characters to a set: ch to che */
void rset_add(struct Rset *r, int ch, int che);

/* Add an array of ranges to a set */
void rset_set(struct Rset *r, struct interval *array, ptrdiff_t size);

/* De-duplicate leaf tables and set up embedded table for quicker
 * access for characters below 512 */
void rset_opt(struct Rset *r);

/* Test if a character is in a set: only use this after rset_opt has
 * been called */
int rset_lookup(struct Rset *r, int ch);

/* Print Rset in human readable form to startup log for debugging */
void rset_show(struct Rset *r);

Rset_add originally accepted just a single character to add, not a range. This turned out to be too inefficient: the editor was using a lot of CPU time during start-up just to build the standard character class tables. Therefore, it has been modified to loop over the range for a fast inner loop and to lift function call overhead out of the original loop in rset_set.

Rtree

struct Rtree provides a mapping from characters to pointers. For example, this is used in the regular expression code to traverse states.

Here are the provided methods:

/* Initialize a struct Rset */
void rtree_init(struct Rtree *r);

/* Clear a struct Rtree: anything malloced for it is freed, but the
 * struct Rtree itself is not freed (it could be embedded in another
 * structure). */
void rtree_clr(struct Rtree *r);

/* Test if a character is in a tree: return pointer bound to it
 * or NULL.  This can be called before rtree_opt. */
void *rtree_lookup_unopt(struct Rtree *r, int ch);

/* Add a binding to from a range of characters to a pointer */
void rtree_add(struct Rtree *r, int ch, int che, void *map);

/* Add a binding to from each range of characters in the given array to a pointer */
void rtree_set(struct Rtree *r, struct interval *array, ptrdiff_t len, void *map);

/* Add a binding to from each range of characters in the given
 * interval_list list to its pointer */
void rtree_build(struct Rtree *r, struct interval_list *l);

/* De-duplicate leaf tables and set up embedded table for quicker
 * access for characters below 512 */
void rtree_opt(struct Rtree *r);

/* Test if a character is in a tree: return pointer bound to it
 * or NULL.  Only call this after rtree_opt has been called. */
void *rtree_lookup(struct Rtree *r, int ch);

/* Print Rtree in human readable form to startup log for debugging */
void rtree_show(struct Rtree *r);

They are the same as the rset methods, except that a pointer is provided. Also, a method is provided to convert an interval_list into an Rtree.

Rmap

struct Rmap provides a mapping from characters to integers. For example, this is used for Unicode simple case conversion. The integer provides an offset to be added to the character to convert it from one case to another. It's important to use an offset, not a direct replacement character for this so that leaf nodes have the same values and can be de-duplicated.

Here are the provided methods:

void rmap_init(struct Rtree *r);

void rmap_clr(struct Rtree *r);

int rmap_lookup_unopt(
	struct Rtree *r,
	int ch,
	int dflt
);

void rmap_add(
	struct Rtree *r,
	int ch, int che,
	int map,
	int dflt
);

void rmap_set(
	struct Rtree *r,
	struct interval *array,
	ptrdiff_t len,
	int map,
	int dflt
);

void rmap_opt(struct Rtree *r);

int rmap_lookup(struct Rtree *r, int ch, int dflt);

void rmap_show(struct Rtree *r);

They are the same methods as for struct Rtree, except that you provide 'dflt', the value to return if there is no binding for the character.

struct Cclass

Finally we have struct Cclass, which is made up of an interval array and a struct Rset and which provides set operations for character classes. The interval array is used for the set operations, then the array is optimized into a struct Rset for quick character lookups.

The Unicode database (made from the data in unicat.c and methods in unicode.c) uses this data structure. For example, you can get a character class containing all letters with:

struct Cclass *letters = unicode("L");

The set operations allow you to construct complex classes, for example the character class for \i, the initial character of identifiers is contructed like this:

    cclass_init(cclass_alnum_);
    cclass_union(cclass_alnum_, unicode("L"));
    cclass_union(cclass_alnum_, unicode("Pc"));
/* cclass_union(cclass_alpha_, unicode("Sc")); */
    cclass_union(cclass_alpha_, unicode("Nl"));
    cclass_union(cclass_alnum_, unicode("Mn"));
    cclass_union(cclass_alnum_, unicode("Mc"));
    cclass_union(cclass_alnum_, unicode("Nd"));
    cclass_add(cclass_alnum_, 0x200c, 0x200d);
    cclass_opt(cclass_alnum_);

The following methods are provided:

/* Initialize a character class */
void cclass_init(struct Cclass *cclass);

/* Free memory used by a Cclass (does not free cclass itself) */
void cclass_clr(struct Cclass *cclass);

/* Add a range to a character class */
void cclass_add(struct Cclass *cclass, int first, int last);

/* Remove a range from a character class */
void cclass_sub(struct Cclass *cclass, int first, int last);

/* Merge one character class into another */
void cclass_union(struct Cclass *cclass, struct Cclass *n);

/* Merge an interval array into a character class */
void cclass_merge(struct Cclass *cclass, struct interval *intervals, int len);

/* Subtract one character class from another */
void cclass_diff(struct Cclass *cclass, struct Cclass *n);

/* Compute unicode inverse of a character class, for [^a-z] */
void cclass_inv(struct Cclass *cclass);

/* Lookup a character in a character class using binary search.
   Return true if character is in the class. */
int cclass_lookup_unopt(struct Cclass *m, int ch);

/* Optimize a character class for fast lookup with cclass_lookup.
 * Generates a radix tree version of the character class. */
void cclass_opt(struct Cclass *cclass);

/* Return true if character is in the class */
int cclass_lookup(struct Cclass *cclass, int ch);

/* Print character class to startup log */
void cclass_show(struct Cclass *cclass);

/* Remap a character class from Unicode to a specific locale */
struct Cclass *cclass_remap(struct Cclass *m, struct charmap *map);

Cclass_remap is used to convert a character class from Unicode to some other character set (represented by struct charmap *). For example, to get letters in the KOI8-R 8-bit character set, use:

struct Cclass *r = cclass_remap(unicode("L"), find_charmap("KOI8-R"));

Cclass_remap remembers the conversion to save work, so the input struct Cclass is expected to remain around forever. Basically this is to be used only for built-in character classes.

Regular expressions

As of verison 4.1, JOE uses an enhanced version of Thompson's NFA matching algorithm.

The code is in regex.c and regex.h.

The regular expression matcher is a subroutine of the the larger text search algorithm in JOE. For example, text search will use Boyer-Moore to find a leading prefix of the regular expression before running the regular expression matcher. A leading prefix is leading text present in all of the different possible text which can match the regular expression. For example, in 'hello(foo|bar)', the leading text is 'hello', but it's empty in 'hello|there'. One of the jobs of the regular expression code is to identify this leading text.

The API for the regular expression library is simple:

/* Compile an expression s/len into a struct regcomp
 * Check regcomp->err for compile errors.
 */
struct regcomp *joe_regcomp(
	struct charmap *charmap, /* The character set of the expression */
	const char *s, ptrdiff_t len, /* The regular expression */
	int icase, /* Set to ignore case */
	int stdfmt, /* Set for standard syntax expressions, vs. JOE syntax */
	int debug /* Set to have debug information printed to startup log */
);

/* Free a compiled regular expression */
void joe_regfree(struct regcomp *r);

struct regmatch {
        off_t rm_so, rm_eo; /* Starting and ending byte offset of a match, or -1 if no match */
}

/* Returns 0 if buffer at p matches compiled expression r
 * 'matches' is filled in with submatch addresses
 * 'nmatch' has size of matches array
*/
int joe_regexec(
	struct regcomp *r, /* Compiled regular expression */
	P *p, /* Pointer to edit buffer */
	int nmatch, /* Size of matches array */
	struct regmatch *matches, /* Locations of sub-matches */
	int fold /* Set to ignore case */
);

Joe_regcomp converts the regular expression into an AST with an operator precedence parser and then generates a byte-code program representation of the regular expression's NFA. Joe_regexec simulates the byte-code program with an arbitrary number of threads machine. It feeds characters from the buffer into each active thread until one of the threads encounters the END instruction, there are no more threads, or the end of the buffer is reached. A side effect is the buffer pointer P is advanced to the first character not part of the matching text.

If the 'v' flag is included in JOE's search and replace options prompt, the AST and program are sent to JOE's startup log so that they can be viewed (with ESC x showlog). For example, here is the log when you try to search for 'hello$foo\|bar$\*there':

Parse tree:
,
  ,
    ,
      ,
        ,
          ,
            ,
              ,
                ,
                  ,
                    'h'
                    'e'
                  'l'
                'l'
              'o'
            *
              {
                |
                  ,
                    ,
                      'f'
                      'o'
                    'o'
                  ,
                    ,
                      'b'
                      'a'
                    'r'
          't'
        'h'
      'e'
    'r'
  'e'
Leading prefix 'hello'
NFA-program:
PC      INSN
--      ----
0:      'h'
4:      'e'
8:      'l'
12:     'l'
16:     'o'
20:     fork 92
28:     bra 0
36:     fork 64
44:     'f'
48:     'o'
52:     'o'
56:     jump 76
64:     'b'
68:     'a'
72:     'r'
76:     ket 0
84:     fork 28
92:     't'
96:     'h'
100:    'e'
104:    'r'
108:    'e'
112:    end
Total size = 116

The "Parse tree" (AST) is printed in an indented format: A line begins with an operator, then all succeeding lines with deeper indentation are its operands. The ',' operator means succession: match the first operand, then match the second operand. The '{' operator means record the submatch of its argument. The '*' operator means 0 or more matches of its argument. The '|' operator means match one argument or the other.

The "Leading prefix" is the part of the regular expression which can use Boyer-Moore search.

The "NFA-program" shows the dis-assembled program. A letter like 'h' is an instruction to match the letter itself. 'fork' is an instruction which creates a new thread beginning at the specified program counter value. 'jump' means the thread should goto the specified PC value. 'bra' and 'ket' indicate that the start and ending location of submatch should be recorded.

Here is the log from the regular expression '-\i\c-' to show you how character classes appear (\i is the initial character of an identifier, and \c is the following character of an identifier):

Parse tree:
,
  ,
    ,
      '-'
      [
        [41..5a] [5f..5f] [61..7a]
    [
      [30..39] [41..5a] [5f..5f] [61..7a]
  '-'
Leading prefix '-'
NFA-program:
PC      INSN
--      ----
0:      '-'
4:      class [41..5a] [5f..5f] [61..7a]
12:     class [30..39] [41..5a] [5f..5f] [61..7a]
20:     '-'
24:     end
Total size = 28

Note that this is showing the ASCII character encoding so that the character classes are not too big. With the UTF-8 character encoding, the character classes will include the full Unicode definitions of \i and \c.

Functions

/* Parse a character or escape sequence from ptr/len.
 * Returns with character or -256 and cat (if not NULL) filled in with a character class
 */
int escape(int utf8, const char **ptr, ptrdiff_t *len, struct Cclass **cat);

Escape reads one character from a string. If the character is an escape sequence, it's parsed into either a single character or a character class. Escape is used in other parts of the editor: for example, it's used to parse character lists of syntax highlighting files.

/* Conventional syntax regex */
static int do_parse_conventional(struct regcomp *g, int prec, int fold);

do_parse_conventional is the standard syntax regular expression parser. It records the AST in regcomp->nodes.

/* JOE syntax regex */
static int do_parse(struct regcomp *g, int prec, int fold);

do_parse is the JOE syntax regular expression parser. The JOE syntax is just like the regular syntax except that regular expression special characters must be escaped.

/* Determine leading prefix of search string */
/* We can use boyer-moore on the prefix if:
	Character set is byte coded.
	Character set is UTF-8 coded, but no folding is requested.
	Character set is UTF-8 coded, folding is requested, but character is below 128 */

static int extract(struct regcomp *g, int no, int fold);

Extract finds the leading prefix for Boyer-Moore from the AST.

/* Convert parse-tree into code with infinite loops */
static void codegen(struct regcomp *g, int no, int *end);

Codegen converts the AST into byte-code. Functions from frag.c / frag.h write instructions to the memory image and handle alignment. They also provide a mechanism for linking the program (to handle forward jumps). Basically a linked-list of all the jump intructions which target the same destination is created right in the memory image, using the jump instruction operands as the link pointers. When the jump destination is finally known, the link-list is traversed and the link pointers are replaced with the destination address.

Matcher

joe_regexec has the arbitrary number of threads simulator. It has two banks of threads, A and B, and ping-pongs between them for each input character. It feeds the next input character to all threads in bank A. It executes as many instructions as possible for each thread, and stops when the thread finally accepts the character or rejects it. If the thread accepts the character, it will be active for the next character, so it is moved to bank B. Otherwise, the thread will die when we transition to bank B. If a thread in bank A forks, the newly created thread is appended to bank A so that the current character is fed to it before switching. Once all threads have been fed the character, we switch to bank B and feed in the next character.

This scheme is simple and fast, but there are complications. First has to do with sub-match addressing. Without modification, the matcher will find all possible solutions to expressions like 'x(a*)(a*)y'. This means that at the end of the match, there will be a thread for each of these solutions. If the input is xaaaay, it will find x()(aaaa)y, x(a)(aaa)y, x(aa)(aa)y, x(aaa)(a)y, and x(aaaa)()y.

Research topic: due to this property, I think the Thompson NFA matcher can be extended to support back-references.

But we don't want all solutions, we want just one solution, the one where the left-most submatch is longest: x(aaaa)()y. To select the preferred solution, joe_regexec has a helper function:

static int add_thread(
	struct thread *pool, /* Thread pool */
	int l, int le, /* Current start and end of pool */
	unsigned char *pc, /* Current thread's pc */
	Regmatch_t *pos, /* Submatch addresses */
	int bra_no, /* Size of submatch addresses */
	int *stack, int sp /* Current thread's stack */
);

This adds the thread to the end of the new pool B (which is at pool[le]). Before it adds, it checks if there are other existing threads (between l and le) at the same program counter. If there are, then we pick the one with the "best" submatch, the one with the longest to the left.

The second complication is JOE's extension \! (which used to be \c in versions before 4.1). \! is like '.': it matches one character. However, if that one character is the opening bracket of an expression, it matches the entire expression. \! will match all of "(1+(2+3))". To support this, each thread needs a stack to record its progress in parsing the balanced expression.

Unicode database

unicode.c, unicode.h and unicat.c provide JOE's database of Unicode facts. Unicat.c is generated by util/uniproc.c from the raw data provided on the unicode.org website.

There is a table in unicat.c which has the name of, pointer to and size of each other table.

struct unicat unicat[] = {
	{ "Co", 3, Co_table, 0 },
	{ "Cs", 1, Cs_table, 0 },
	{ "Zp", 1, Zp_table, 0 },
	...
	{ "Domino Tiles", 1, uniblocks + 241, 0 },
	{ "Mahjong Tiles", 1, uniblocks + 240, 0 },
	{ "Arabic Mathematical Alphabetic Symbols", 1, uniblocks + 239, 0 },
	...

The function unicode looks up one of these tables and returns a struct Cclass version of it:

struct Cclass *unicode(const char *name);

If a single letter name is passed to the above function, a Cclass containing all two-letter named tables with the same matching first letter is returned. So for example, unicode("L") returns a Cclass of all letters, including "Lu", "Ll", "Lt", etc.

Joe_iswinit initializes Cclasses for JOE's isw ctype functions. The following functions are provided:

int joe_iswupper(struct charmap *,int c);
int joe_iswlower(struct charmap *,int c);

int joe_iswalpha(struct charmap *,int c);
int joe_iswalpha_(struct charmap *,int c); /* used for \i */

int joe_iswalnum(struct charmap *,int c);
int joe_iswalnum_(struct charmap *,int c); /* used for \c */

int joe_iswdigit(struct charmap *,int c);
int joe_iswspace(struct charmap *,int c);
int joe_iswctrl(struct charmap *,int c);
int joe_iswpunct(struct charmap *,int c);
int joe_iswgraph(struct charmap *,int c);
int joe_iswprint(struct charmap *,int c);
int joe_iswxdigit(struct charmap *,int c);
int joe_iswblank(struct charmap *,int c);

Another table in unicat.c is the width_table. This indicates which characters are double-wide on terminal emulators. It is used to implement the joe_wcwidth function:

int joe_wcwidth(int wide,int c);

'wide' should be true if the character set of c is Unicode. Both the terminal and the file must be Unicode (UTF-8 or UTF-16) or double-wide characters to be printed.

Another set of tables in unicat.c includes fold_table and fold_repl for case folding for string comparisons. Fold_table is an array of ranges. If a character is in one of these ranges, then the corresponding entry in fold_repl is a string (of 1 to 3 characters) which should replace it for case folding. The first character of the string is an offset which should be added to the original character to get the first character of the replacement.

rtree_fold is an struct Rtree representation of the fold table. It returns either a single character replacement offset, or an index into fold_repl if the character is to be replaced by a string longer than one character.

Lowerize can be called to convert a UTF-32 string into its case folded version:

int *lowerize(
	int *d, ptrdiff_t len, /* Destination string and length */
	const int *s /* Source string */
);

Unicat.c also provides simple case conversion tables: toupper_table, tolower_table and totitle_table along with the corresponding replacement tables: toupper_cvt, tolower_cvt and totitle_cvt.

These are used to implement joe_towupper and joe_towlower:

int joe_towupper(struct charmap *,int c); /* Convert to uppercase */
int joe_towlower(struct charmap *,int c); /* Convert to lowercase */

Digval is provided to return the decimal value of any unicode digit:

int digval(int ch)

This function uses Nd_table from unicat.c. Nd_table has the property that each range represents a single set of decimal digits. Normally adjacent ranges would be combined together, but in this case they are separated to support digval.

Finally unicode.c provides the following function:

int unictrl(int c);

Which returns true if JOE should treat the character as a control character which should not be sent directly to the terminal. Instead the character value is displayed in hexadecimal: <FFFFF>. Basically this function returns true if the character is not printable- if it's not in cclass_print.

Character sets

charmap.c and charmap.h implement JOE's representation of character sets.

Each character set is represented by a "struct charmap *". A charmap provides information about a specific character set. It provides these methods:

/* True if character is punctuation */
int joe_ispunct(struct charmap *map, int ch);

/* True if character is printable */
int joe_isprint(struct charmap *map, int ch);

/* True if character is a space */
int joe_isspace(struct charmap *map, int ch);

/* True if character is first of an identifier */
int joe_isalpha_(struct charmap *map, int ch);

/* True if character is rest of an identifier */
int joe_isalnum_(struct charmap *map, int ch);

/* True if character is blank: space or tab */
int joe_isblank(struct charmap *map, int ch);

/* True if character is blank tab CR LF FF or NUL */
int joe_isspace_eos(struct charmap *map, int ch);

/* Convert to lowercase */
int joe_tolower(struct charmap *map, int ch);

/* Convert to uppercase */
int joe_tolower(struct charmap *map, int ch);

/* Convert from character set to Unicode */
int to_uni(struct charmap *map, int ch);

/* Convert from Unicode to character set */
int from_uni(struct charmap *map, int ch);

There are two types of character sets: Unicode and 8-bit. If the character set is Unicode, then charmap->type will be true.

Joe_locale intializes some global character maps:

extern struct charmap *locale_map;      /* Character map of terminal */
extern struct charmap *utf8_map;        /* UTF-8 character map */
extern struct charmap *utf16_map;       /* UTF-16 character map */
extern struct charmap *utf16r_map;      /* UTF-16 reversed  character map */
extern struct charmap *ascii_map;       /* Plain ASCII map */

Find_charmap can be used to find a named character set:

/* Find (load if necessary) a character set */
struct charmap *find_charmap(const char *name);

If the character set is not already built into JOE, find_charmap tries to load it from a file. It looks in $HOME/.joe/charmaps and /usr/share/joe/charmaps.

The following character sets are built into JOE:

ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4,
ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13,
ISO-8859-14, ISO-8859-15, ISO-8859-16, ISO-8859-1,
KOI-8, KOI8-R, KOI8-T, KOI8-U, CP1251, CP1256,
CP1255, ARMSCII-8, TIS-620, GEORGIAN-PS

Guess_map tries to determine the character map based on the contents of a buffer:

struct charmap *guess_map(const char *buf, ptrdiff_t len);

It selects the locale_map unless it detects valid UTF-8 sequences in the buffers.

Finally, my_iconv is provided to convert from one character set to another:

void my_iconv(
	char *dest, ptrdiff_t destsiz, struct charmap *dest_map,
	const char *src,struct charmap *src_map
);

This is used in a few places in JOE, but most of JOE uses to_uni and from_uni to convert between character sets a character at a time.

Coroutines in JOE

JOE 3.6 now uses co-routines to help reduce the amount of event-driven code which has to be written. All previous versions of JOE were entirely in the event driven coding style, which made user dialog code a chore to write. Now, with the help of co-routines, dialog code can be written as old fashioned blocking code which is much cleaner, even though JOE can have multiple un-finished dialogs on the screen.

The coroutine library has only three (now four) functions:

int co_call(int (*func)(va_list args),...);	Call a function as a co-routine.
int co_yield(Coroutine *t, int val);	Suspend current co-routine and return to caller with given return value.
int co_resume(Coroutine *t, int val);	Suspend current co-routine and resume specified one with given return value.
int co_suspend(Coroutine *u, int val);	Suspend current co-routine (remembering its suspended invoking coroutines) and resume the top-level. The current coroutine is stored in u. When u is resumed and returns, the suspended chain is resumed (the resumer is continued when the chain tries to continue the top level.

co_call() invokes 'func' as a co-routine- it will run on its own stack. 'func' is called immediately upon the call to co_call. All of the arguments after func are passed as a va_list to the function. co_call calls va_start and va_end, 'func' should call va_arg once for each passed argument (see stdarg.h). The co-routine should not yield before it has finished reading all of the arguments, otherwise va_end will be called early. When 'func' returns, the co-routine is destroyed and co_call() returns with the return value of 'func'. If the co-routine yields by calling co_yield, co_call returns with the return value specified in the call to co_yield. Note that co_call only ever returns once for a particular call.

co_yield() suspends the current co-routine and returns with the specified return value to the caller (either to co_call or co_resume). co_yield() writes information about the co-routine to the 'Coroutine' structure whose address is passed to co_yield. The Coroutine structure's address is used as a handle for the co-routine and can be passed to co_resume().

co_resume() suspends the current co-routine (henceforth called the original co-routine) and resumes the specified co-routine. When the specified co-routine returns (when the function given in the co_call that created the co-routine returns), the original co-routine is continued and co_resume() returns with the function's return value. If the specified co-routine yields, the original co-routine is continued and co_resume() returns with the return value specified in the call to co_yield.

co_suspend() is like co_resume, except that the top-level task is resumed instead of the calling co-routine. This has the effect of suspending the entire chain of co-routines which invoked the caller of co_suspend. Co_suspend stores the suspended co_routine in another co_routine. That co_routine will resume this one when it finally returns.

The co-routine library is structured this way (which is different from the usual way where you get a co-routine handle immediately upon creation, for example see Knuth) to encourage proper stack un-winding. Basically, a co-routine can not be forcibly destroyed. Instead, each created co-routine is only destroyed when the initial function returns.

The co-routine library is based on ucontext.h and the getcontext, swapcontext, makecontext and setcontext system calls. If the system does not have these calls, it uses setjmp/longjmp tricks in a highly portable way: the stacks for each co-routine are allocated off the top of the main stack. The main stack grows after each allocation. The stacks are never freed, but they are reused. This works for any machine with a single normal stack. It will not work on IA64 (two stacks) or Cray (the stack is really a linked list).

I was originally going to use simple cooperative multi-threading, with only two basic calls: sleep_on() and wake_up() (inspired by old versions of the Linux kernel) and a top-level scheduler. Sleep_on would suspend the current thread, and create a new thread which runs just the scheduler. Wake_up would mark a thread as ready to run, and on the next return to the scheduler, it would be run.

This turns out to not be powerful enough to implement JOE's keyboard macro language, at least without much rewriting and inelegance. It could be done, but the keyboard macro player would have to feed events into the top-level scheduler or would have to be integrated with it, whereas in older versions of JOE (even without co-routines) the macro player just calls the edit functions (more or less) directly. The problem is that sleep_on (called by a function asking for user input) would suspend the task all the way up to the scheduler including the macro player, thus preventing the macro player from simulating user input.

On the other hand with real co-routines, the macro player will call each edit function as a co-routine (which I like to think of as a task with a very clear return point). If the edit function has to yield, it returns immediately to the macro player for the next macro step. If the yield is due to a request for input, the macro player supplies it by calling the recorded edit functions. If the macro wants to allow the user to provide input (using the "query" function, also known as an interactive macro), it can itself yield (since it is also running in a coroutine) back to the top-level edit loop (in actuality, the macro player calls a function which calls co_suspend() to do this). The macro player is continued when the user finishes supplying the input (hits the return key at a prompt). Several interactive macros can be running at once with this scheme.

Semi-Automatic Variables

JOE 3.6 uses "semi-automatic" variables for strings. There is a global "object-stack", accessed through these functions:

void *obj_malloc(int size);	Allocate a block of memory.
void obj_realloc(void ptr, int size);	Change allocation size.
void obj_free(void *ptr);	Free a specified block and all newer blocks.
void obj_perm(void *ptr);	Mark a block as permanent, which in effect moves the block from the stack to the heap.
int obj_len(void *ptr);	Access hidden "length" part of block (string length).
int obj_size(void *ptr);	Access hidden "size" part of block (allocation size).

These functions remember the order of allocation. When an allocated block is freed, it and all newer blocks are freed as well. For example, in:

unsigned char *a = obj_malloc(10);
unsigned char *b = obj_malloc(10);
unsigned char *c = obj_malloc(10);
unsigned char *d = obj_malloc(10);

obj_free(b);

b, c, and d are all freed, but a remains allocated. The top-level edit loop is structured to call obj_free after every editor command:

unsigned char *gc = obj_malloc(1);
execmd(...); /* Execute edit commands */
obj_free(gc); /* Collect garbage */

Thus, you rarely have to call obj_free- the only time you might want to do it is if you are calling obj_malloc in a loop. This could cause a large amount of memory to be allocated, so obj_free() should be called at the end of each loop iteration.

obj_perm() marks an object as a heap object instead of a stack object. When a heap object is freed, only itself is freed- the stack is not modified. Also when the stack is freed, heap objects are not touched.

Variable Strings

JOE uses dynamic strings built on top of the global object stack functions. These strings automatically resize themselves to fit their contents. They also know their own length, which allows you to have NULs in the middle of a string. They also always have a terminating NUL to make them compatible with C-strings. In fact they have (almost) the same type as C-strings: unsigned char *. The length is hidden in the obj_malloc data structure. These functions are provided:

unsigned char *vsmk(int prealloc);	Allocate a zero length string, but with enough space to grow to the specified length.
int vslen(unsigned char *s);	Return string length or 0 if s is NULL.
unsigned char vsensure(unsigned char s, int len);	Make sure there is enough space for a string of the specified length.
unsigned char vstrunc(unsigned char s, int len);	Set length of string: the string is truncated or extended with spaces to the specified length.
unsigned char vsncpy(unsigned char s,int offset,unsigned char *block, int block_size);	Copy a memory block to a string. The string is extended with spaces if offset > string length.
unsigned char vsdupz(unsigned char s);	Duplicate a z-string to a variable string.
unsigned char vsdup(unsigned char s);	Duplicate a variable string.
unsigned char vscpyz(unsigned char s,unsigned char *z);	Copy a z-string to a variable string.
unsigned char vscpy(unsigned char s,unsigned char *block,int size);	Copy a memory block to a variable string.
unsigned char vscatz(unsigned char s,unsigned char *z);	Append a z-string to a variable string.
unsigned char vscat(unsigned char s,unsigned char *block,int size);	Append a memory block to a variable string.
unsigned char vsadd(unsigned char s,unsigned char c);	Append a single character to a string.
unsigned char vsfmt(unsigned char s,int offset,const char *format,...);	Sprintf() to a variable string.
unsigned char vsgets(unsigned char sp,FILE f);	fgets() to a variable string (returns NULL on EOF, does not put '\n' in the string).

Note that it is OK to pass NULL in place of 's' in any of the above functions. In this case, a new string is allocated.

Most of these function return the address of the variable string, in case it was modified. One exception is vsgets()- it takes an address of a variable holding the string. It returns the string address, but it also writes it to the variable directly.

Many of the above functions take a memory block, which is specified by an address and a length. These macros can be used to help construct these arguments:

#define sc(s) (s), (sizeof(s)-1)	For string constants.
#define sz(s) (s), zlen(s)	For C-strings (zlen is the same as strlen).
#define sv(s) (s), vslen(s)	For variable strings.

So for example, vscatz is defined as:

unsigned char *vscatz(unsigned char *s, unsigned char *z);
{
	return vsncpy(sv(s), sz(z));
}

Variable arrays of variable strings

JOE also semi-automatic variable arrays of variable strings. These functions are provided:

unsigned char *vamk(int prealloc);	Allocate a zero length array, but with enough space to grow to the specified length.
void varm(unsigned char **a);	Free an array.
int valen(unsigned char *a);	Return array length or 0 if a is NULL.
unsigned char vaensure(unsigned char s, int len);	Make sure there is enough space for an array of the specified length.
unsigned char vatrunc(unsigned char s, int len);	Set length of array: the array is truncated or extended with NULLs to the specified length.
unsigned char vaadd(unsigned char a,unsigned char s);	Append a single string to an array.
unsigned char vasort(unsigned char *a,int len);	Sort an array.
unsigned char vawords(unsigned char a,unsigned char s,int len, unsigned char *sep, int seplen);	Convert a word list in s (words separated by characters given in sep/seplen) into an array of strings.
void vaperm(unsigned char **a);	Mark an array as preferment (heap instead of stack).

When an array is marked as permanent, any strings in the array also marked as permanent, and any string later added to the array are also marked as permanent.

Edit Buffers

API:

Look at the comments in b.h for more information.

B *bfind(unsigned char *name); Load disk file into memory buffer 'B'.

bsave(P *p,unsigned char *name,int size); Write size bytes from buffer beginning at p to disk file

brm(b); Free data structure

Once you have a B you can access the characters in it via P pointers (which are like C++ STL iterators).

B *b = bfind("foo"); Load file into memory

P *p = pdup(b->bof); Get pointer to beginning of file (duplicate b->bof which is a P).

prm(p); Free p when we're done with it.

int c=brch(p); Get character at p.

int c=pgetc(p); Get character at p and advance it.

int c=prgetc(p); Retract p, then return character at it.

- These return -1 (NO_MORE_DATA) if you try to read end of file or before beginning of file.

- A pointer can point to any character of the file and right after the end of the file.

- For UTF-8 files, character can be between 0 and 0x7FFFFFFF

Publicly readable members of P:

p->byte The byte offset into the buffer

p->line The line number

p->xcol If P is the cursor, this is the column number where the cursor will be displayed on the screen (which does not have to match the column of the character at P).

Some predicates:

pisbof(p); True if pointer is at beginning of buffer

piseof(p); True if pointer is at end of buffer

pisbol(p); True if pointer is at beginning of line

piseol(p); True if pointer is at end of line

pisbow(p); True if pointer is at beginning of a word

piseow(p); True if pointer is at end of a word

More information about character at p:

piscol(p); Get column number of character at p.

Some other ways of moving a P through a B:

pnextl(p); Go to beginning of next line

pprevl(p); Go to end of previous line

pfwrd(p,int n); Move forward n bytes

pbkwd(p,int n); Move backward n bytes

p_goto_bof(p);

p_goto_eof(p);

p_goto_bol(p);

p_goto_eol(p);

pset(p,q); Move p to same position as q.

pline(p,n); Goto to beginning of a specific line.

pgoto(p,n); Goto a specific byte position.

pfind(P,unsigned char *s,int len); Fast Boyer-Moore search forward.

prfind(P,unsigned char *s,int len); Fast Boyer-Moore search backward.

These are very fast- they look at low level data structure and don't go through pgetc(). Boyer-Moore allows you to skip over many characters without reading them, so you can get something like O(n/len).

Some facts:

Local operations are fast: pgetc(), prgetc().

Copy is fast: pset().

pline() and pgoto() are slower, but look for the closest existing P to start from.

The column number is stored in P, but it is only updated if it is easy to do so. If it's hard (like you crossed a line boundary backward) it's marked as invalid. piscol() then has to recalculate it.

Modifying a buffer:

binsc(p,int c); Insert character at p.

bdel(P *from,P *to); Delete character between two Ps.

Note that when you insert or delete, all of the Ps after the insertion/ deletion point are adjusted so that they continue to point to the same character before the insert or delete.

Insert and Delete create undo records.

Insert and Delete set dirty flags on lines which are currently being displayed on the screen, so that when you return to the edit loop, these lines automatically get redrawn.

Internal:

An edit buffer is made up of a doubly-linked list of fixed sized (4 KB) gap buffers. A gap buffer has two parts: a ~16 byte header, which is always in memory, and the actual buffer, which can be paged out to a swap file (a vfile- see vfile.h). A gap buffer consists of three regions: text before the gap, the gap and text after the gap (which always goes all the way to the end of buffer). (hole and ehole in the header indicate the gap position). The size of the gap may be 0 (which is the case when a file is first loaded). Gap buffers are fast for small inserts and deletes when the cursor is at the gap (for delete you just adjust a pointer, for insert you copy the data into gap). When you reposition the cursor, you have to move the gap before any inserts or deletes occur. If you had only one window and a single large gap buffer for the file, you could always keep the gap at the cursor- the right-arrow key copies one character across the gap.

Of course for edits which span gap buffers or which are larger than a gap buffer, you get a big mess of gap buffer splitting and merging plus doubly-linked list splicing.

Still, this buffer method is quite fast: you never have to do large memory moves since the gap buffers are limited in size. To help search for line numbers, the number of newlines '\n's contained in each gap buffer is stored in the header. Reads are fast as long as you have a P at the place you want to read from, which is almost always the case.

It should be possible to quickly load files by mapping them directly into memory (using mmap()) and treating each 4KB page as a gap buffer with 0 size gap. When page is modified, use copy-on-write to move the page into the swap file (change pointer in header). This is not done now. Instead the file is copied when loaded.

## Windowing System

There is a tiny object-oriented windowing system built into JOE. This is the class hierarchy:

SCRN A optimizing terminal screen driver (very similar to 'curses'). has a pointer to a CAP, which has the terminal capabilities read from termcap or terminfo.

writes output to screen with calls to the macro ttputc(). (tty.c is the actual interface to the tty device).

cpos() - set cursor position

outatr() - draw a character at a screen position with attributes

eraeol() - erase from some position to the end of the line

SCREEN Contains list of windows on the screen (W *topwin).

Points to window with focus (W *curwin).

Contains pointer to a 'SCRN', the tty driver for the particular terminal type.

W A window on a screen.

Has position and size of window.

Has: void *object- pointer to a structure which inherits window (W should really be a base class for these objects- since C doesn't have this concept, a pointer to the derived class is used instead- the derived class has a pointer to the base class: it's called 'parent').

Currently this is one of:

BW * a text buffer window (screen update code is here.)

QW * query window (single character yes/no questions)

MENU * file selection menu

BW * is inherited by (in the same way that a BW inherits a W):

PW * a single line prompt window (file name prompts)

TW * a text buffer window (main editing window).

WATOM *watom- Gives type of this window. The WATOM structure has pointers to virtual member functions.

KBD *kbd- The keyboard handler for this window. When window has focus, keyboard events are sent to this object. When key sequences are recognized, macros bound to them are invoked.

Some window are operators on others. For example ^K E, load a file into a window prompt operates on a text window. If you hit tab, a file selection menu which operates on the prompt window appears below this. When a window is the target of operator windows is killed, the operators are killed also.

Currently all windows are currently the width of the screen (should be fixed in the future). The windows are in a big circular list (think of a big loop of paper). The screen is small window onto this list. So unlike emacs, you can have windows which exist, but which are not on the screen.

^K N and ^K P move the cursor to next or previous window. If the next window is off the screen it is moved onto the screen, along with any operator windows are target it.

## Macros

- add something here.

## Screen update

- add something here.

## Syntax Highlighter

There are two parts to the syntax highlighter: the parser (in syntax.c) and the line attribute database (in lattr.c). The parser is simple enough: it is a finite state machine interpreter. You supply a line of text and the starting state, and it returns the starting state of the next line and an array containing the color to be used for each character. Each state has a color. The line is scanned left to right, character by character: the state and the character index a table, which gives an action, which includes the next state.

The action specifies the next state, and also indicates whether some previous N characters (including the current) should be "recolored" with the color of the next state, or whether the color of the current state should be used for the character. The action also indicates whether to move on to the next character, or to stay put and re-evaluate the current character with the new state. This leads to simpler hand-coding of the state machine, but can create an infinite loop. The action also indicates if the character should be added to a string buffer. The string buffer can be used (when specified by an action) as a key for a hash table look-up (typically filled with key-words). If there is a match, the hash-table entry supplies the action instead of the state table.

This parser is fast and simple and powerful enough to lexically analyze more than 40 languages. However a few enhancements were made to improve both the class of languages which can be highlighted and to improve the ease of writing the state machine. To support "here" documents in perl and sh, as well as to simplify matching delimiters (like open with close parenthesis), a string (from the string buffer) can be stored along with the state. This can be matched against the string buffer. So first the terminating word from the here document is stored in the string, and then the state machine looks forward until it finds this string to determine the end of the here document.

The state is only a single integer, not an entire stack. The parser can therefore not handle recursive grammars. Unfortunately, some languages have recursive elements in their lexical structure. One way to deal with this is to support only a very limited recursion depth by tediously repeating a set of states in the hand written state machine N times, where N is your maximum recursion depth. To ease writing this type of thing, and also to allow state machine to be reused in cases where one language is embedded within another (PHP in HTML, for example), a macro system was added to the state machine loader. Recursion is allowed in the macro calls, but is limited to a depth of 5.

As of JOE 3.6, the macro instantiation was replaced with a real stack, so now there is no recursion depth limit.

Line attribute cache

The second part of the syntax highlighter is the line attribute cache. In the original implementation, the state of the first line of each window is maintained and the parser is invoked during the screen update algorithm to highlight all of the lines on the screen. This works well when the window remains in place or when it scrolls forward through the file. However, if the window scrolls backward you may have to reparse the entire file or from the nearest previous window if there happens to be one. This can be slow for large files, and certainly wastes a lot of CPU time when scrolling backward through a file.

One attempt to solve this was to instead of starting at the beginning of the file, to instead assume that the line 200 lines (a configurable option) back was at the initial state, parse forward from there and hope for the best. Sometimes the state machine would find itself back in the proper state, but not always- to the point where most syntaxes had specified to just always reparse from the beginning of the file.

Finally, the line attribute cache was added. It stores the parser state at the beginning of each line, from the beginning of the file to last line actually displayed on the screen. This solves the scrolling problem, but introduces a new one: how to keep the line attribute cache up to date. This turns out to not be so difficult. First of all, the states are stored in a gap-buffer, which makes it easy to perform insertions and deletions. Whenever lines are inserted or deleted in the file, the line attribute cache is expanded or contracted to keep in sync. Now, when any change is made to the file, we reparse from just before the start of the change until the state of the beginning of a line again matches the corresponding entry in the cache. If a match is found, the rest of the cache is assumed to be correct. Some changes will require a large amount of reparsing, but many will not. In any case, the the highlighting on the screen is always correct.

Files

b.c	Text buffer management
blocks.c	Library: fast memcpy() functions (necessary on really old versions of UNIX).
bw.c	A class: text buffer window (screen update code is here)
charmap.c	UNICODE to 8-bit conversion functions
cmd.c	Table of user edit functions
coroutine.c	Coroutine library
dir.c	Directory reading functions (for old UNIXs).
hash.c	Library: automatically expanding hash table functions.
help.c	Implement the on-line help window
i18n.c	Unicode character type information database
kbd.c	Keymap data structure (keysequence to macro bindings).
lattr.c	Line attribute cache
macro.c	Keyboard and joerc file macros
main.c	Has main() and top level edit loop
menu.c	A class: menu windows
obj.c	Semi-automatic objects, plus dynamic string and array library
path.c	Library: file name and path manipulation functions
poshist.c	Cursor position history
pw.c	A class: prompt windows
queue.c	Library: doubly linked lists
qw.c	A class: single-key query windows
rc.c	Joerc file loader
regex.c	Regular expressions
scrn.c	Terminal update functions (like curses)
selinux.c	Secure linux (for RedHat) functions
syntax.c	Syntax highlighter
tab.c	Tab completion for file selection prompt
termcap.c	Load terminal capabilities from /etc/termcap file or terminfo database
tw.c	A class: main text editing window
ublock.c	User edit functions: block moves
uedit.c	User edit functions: basic edit functions
uerror.c	User edit functions: parse compiler error messages and goto next error, previous error
ufile.c	User edit functions: load and save file
uformat.c	User edit functions: paragraph formatting, centering
uisrch.c	User edit functions: incremental search
umath.c	User edit functions: calculator
undo.c	Undo system
usearch.c	User edit functions: search & replace
ushell.c	User edit functions: subshell
utag.c	User edit functions: tags file search
utf8.c	UTF-8 to unicode coding functions
utils.c	Misc. utilities
vfile.c	Library: virtual memory functions (allows you to edit files larger than memory)
w.c	A class: base class for all windows