Multimethods

Overview

Introduction

This article is about the uses of multimethods and how they could be added to C++.

I'll be assuming that the reader is familiar with some C++ features that are relevant to the discussion: inheritance, virtual functions, runtime type information (RTTI) and overload resolution. Also, familiarity with the terms static type and dynamic type is useful.

Multimethods resources

The Cmm system is available from http://www.op59.net/cmm/readme.html.

Multimethods are not a new idea; for example, the Dylan language (see http://www.functionalobjects.com/resources/index.phtml) supports them natively, as does CLOS (Common Lisp).

Bjarne Stroustrup writes about multimethods in The Design and Evolution of C++ (Section 13.8, pages 297-301).

If one restricts oneself to Standard C++, it is possible to approximate multimethods at the expense of verbosity in source code. See Item 31 in Scott Meyers' More Effective C++, and chapter 11 of Andrei Alexandrescu's Modern C++ Design, where template techniques are used extensively. Bill Weston has a slightly different template technique that supports non-exact matches; see http://homepage.ntlworld.com/w.weston/.

The Frost project (http://frost.flewid.de) adds support for multimethods to the i386-version of the gcc compiler.

Acknowledgements

Thanks to Bill Weston for many interesting discussions about multimethods and for comments on this paper.

Multimethods as a C++ extension

Multimethods syntax

Stroustrup's Design and Evolution of C++ suggests a syntax for multimethods for C++, which I will be using here.

A function prototype is a multimethod function if one or more of its parameters are qualified with the keyword virtual. Implementations of a multimethod function have conventional parameters without the virtual qualifier. Implementation function names also have a trailing underscore, which is not a part of Stroustrup's suggested syntax.

The parameters that correspond to the virtual parameters in the virtual function prototype must be derived from the original types, and have the same modifiers (such as const).

Virtual function declarations must occur before implementation function definitions/declarations. Otherwise the implementation functions will be treated as conventional functions.

An example should make this clearer:

We'll use the classic multimethods example, an Overlap function that takes references to a Shape base class. Detecting whether two shapes overlap requires different code for each combination of shapes:

struct  Shape            {...};
struct  Square   : Shape {...};
struct  Triangle : Shape {...};

bool    Overlap( virtual Shape& a, virtual Shape& b);
    
bool    Overlap_( Square& a, Triangle& b) {...}
bool    Overlap_( Triangle& a, Square& b) {...}
bool    Overlap_( Shape& a, Square& b)    {...}
bool    Overlap_( Square& a, Shape& b)    {...}

The Overlap( virtual Shape& a, virtual Shape& b) prototype is replaced by a dispatch function with prototype Overlap( Shape& a, Shape& b). This dispatch function uses C++ RTTI to choose one of the available Overlap_() functions, based on the dynamic types of its parameters.

The trailing underscore in implementation function names is ugly, but it allows one to define an implementation that takes the base parameters, which would otherwise clash with the generated dispatch function.

Multimethods dispatch algorithm

It is possible that there isn't any matching implementation function for a particular combination of dynamic types. In this case, the generated dispatch function will throw an exception.

It will also throw an exception if there is no clear best implementation for a particular combination of dynamic types. An implementation is considered best only if both the following two conditions apply:

  1. All of the best implementation's parameter types match the dynamic types.
  2. Each of the best implementation's parameter types is at least as derived as the corresponding parameter type in any other matching implementation.

This is the same as the lookup rule used by C++ at compile time, apart from C++'s use of conversion operators and special-casing of built-in types.

Note that we cannot have duplicate implementations, so the second condition implies that for each other matching implementation X, the best implementation must have at least one parameter type that is more derived than the corresponding parameter type in X.

So:

Shape&  s = *new Square;
Shape&  t = *new Triangle;
Overlap( t, t); // Throws - no matching implementation
Overlap( s, t); // Calls Overlap_( Square&, Triangle&)
Overlap( t, s); // Calls Overlap_( Triangle&, Square&)
Overlap( s, s); // Throws - these implementations
                // both match, but neither is a
                // better match than the other:
                //   Overlap_( Shape& a, Square& b)
                //   Overlap_( Square& a, Shape& b)
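
The two conditions can be sketched in Standard C++ (C++11 assumed). This is only an illustration of the selection rule, not Cmm's actual dispatch code: the names Impl, Chain, Match, AtLeastAsDerived, MakeOverlapTable and Best are all hypothetical, and each parameter type is described by a hand-written list of class names (base first) rather than by Cmm's mangled strings.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

struct Shape    { virtual ~Shape() {} };
struct Square   : Shape {};
struct Triangle : Shape {};

// Inheritance chain of a parameter type, base first, e.g. {"Shape", "Square"}.
typedef std::vector<std::string> Chain;

// One registered implementation (hypothetical structure).
struct Impl
{
    std::string name;               // label used only for this demonstration
    bool (*match)(Shape&, Shape&);  // condition 1: do the dynamic types fit?
    Chain first, second;            // parameter types, used for condition 2
};

// True if a and b can be cast to A and B respectively.
template <class A, class B>
bool Match(Shape& a, Shape& b)
{
    return dynamic_cast<A*>(&a) != 0 && dynamic_cast<B*>(&b) != 0;
}

// True if the type described by d is the same as, or derived from,
// the most-derived class named in b.
bool AtLeastAsDerived(const Chain& d, const Chain& b)
{
    assert(!b.empty());
    return std::find(d.begin(), d.end(), b.back()) != d.end();
}

// The four Overlap_ implementations from the example above.
std::vector<Impl> MakeOverlapTable()
{
    const Chain shape{"Shape"};
    const Chain square{"Shape", "Square"};
    const Chain triangle{"Shape", "Triangle"};
    return {
        {"Square,Triangle", &Match<Square, Triangle>, square,   triangle},
        {"Triangle,Square", &Match<Triangle, Square>, triangle, square},
        {"Shape,Square",    &Match<Shape, Square>,    shape,    square},
        {"Square,Shape",    &Match<Square, Shape>,    square,   shape},
    };
}

// Returns the single best implementation, or throws if none or several qualify.
const Impl& Best(const std::vector<Impl>& impls, Shape& a, Shape& b)
{
    std::vector<const Impl*> matching;
    for (const Impl& impl : impls)
        if (impl.match(a, b)) matching.push_back(&impl);
    if (matching.empty())
        throw std::runtime_error("no matching implementation");

    for (const Impl* candidate : matching)
    {
        bool best = true;
        for (const Impl* other : matching)
            best = best
                && AtLeastAsDerived(candidate->first,  other->first)
                && AtLeastAsDerived(candidate->second, other->second);
        if (best) return *candidate;
    }
    throw std::runtime_error("ambiguous: no single best implementation");
}
```

Calling Best() with a Square and a Triangle selects the "Square,Triangle" entry, while two Squares or two Triangles make it throw, mirroring the four calls listed above.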

Why multimethods?

Most articles about C++ and multimethods only mention the well known overlap example described above - detecting whether two objects derived from a base Shape class overlap. This sort of issue could arise in, for example, the game Space Invaders. If this were the only example where multimethods were required, the case for adding them to C++ would be weak indeed. However, I think that there are some genuinely important programming tasks where language support for multimethods would simplify things considerably.

Non-member functions are preferred over member functions

Scott Meyers has written an interesting article How Non-Member Functions Improve Encapsulation describing why non-friend global functions are to be preferred over member functions. The paper is short and readable, so I won't attempt to summarise it here.

The STL's design uses non-member function templates to provide generic algorithms which act on many different sorts of container types.

The STL also encourages the use of traits classes, which are ways of associating extra information with existing classes without modifying these classes. This is interesting because it shows that there is a need to add to existing classes without modifying them. Multimethods similarly enable new virtual functions to be added to existing classes.

Even Standard C++'s library seems confused about whether or not to use member functions. For example the member function std::istream::getline() takes a C-style buffer only, but if you want to read a line from a stream into a std::string, you have to use the global function std::getline( std::istream&, std::string&, char delim='\n').

In C++, the only time you are forced to use member functions is when you need virtual dispatch. In all other cases, you could simply use a (possibly friend) global function.

Multimethods provide global functions that behave like virtual functions, and so they would allow a consistent non-member-function syntax for all functions.

Object orientation, interfaces and multimethods

There seem to be almost as many different definitions of Object Orientation as there are programmers. However one thing that does seem fairly consistent is that it involves well defined interfaces.

Which is fine. We'd all like to have well defined interfaces for everything that we write.

Unfortunately, things don't always happen like that. What usually happens is that interfaces change. People aren't omniscient. People make design mistakes. Requirements change. At this point, rigid interfaces change from being a help to being a problem. A lot of the time, deriving a new class from the existing class and adding the newly required method to it isn't possible: if the original base class already has other derived classes, we would have to somehow insert the new class into the existing inheritance hierarchy, which forces a recompile of all existing code. The accepted way of handling this is to make all non-leaf classes abstract. I'm sure that there is a theoretical way of describing this, but to me it just makes things messy. It basically doubles the number of classes one has to write. Why on earth do we have to go through this sort of boilerplate writing?

Consider a Unix file descriptor. There are some functions that operate on file descriptors, open(), read(), write(), lseek(), close(). These have nice well-defined interfaces - reading data from the file, writing to the file etc. etc. But what about ioctl()? This is a catch-all for all the functions that different file descriptors could support, now or in the future. I think it's pretty telling that the people who designed Unix weren't able to come up with a fixed interface for something as fundamental as a file descriptor.

So in practice, we need extensible interfaces sometimes. I'm not advocating abandoning the approach of trying to design interfaces correctly, merely suggesting that we accommodate the inevitable modifications to interfaces in a better way.

Multimethods provide exactly this because, by their nature, they have to be outside of individual classes: they typically operate on more than one class. A multimethod with a single virtual parameter is basically a virtual method that you can add to an existing class without changing the class or forcing a recompile of the users of the class.

This is how I think things work if one has multimethods available. You represent interfaces as classes. You can derive other classes from existing base classes if you require a hierarchy of interfaces, adding in member data if appropriate. Then you define multimethods that take virtual references to the appropriate classes.

So instead of a conventional inheritance tree with abstract interface classes and concrete implementation classes:

    base void foo()=0
       ^
       |
       |
    derived void bar()= 0   <----- concrete_derived::foo();
       ^                           concrete_derived::bar();
       |
       |
    more_derived void bar()= 0 <-- concrete_more_derived::bar();

You have:

    base
      ^
      |
   derived
      ^
      |
   more_derived

with multimethod implementations inserted at whatever level of the hierarchy is appropriate:

void foo( virtual base&);
void bar( virtual derived&);

void foo_( derived&);
void bar_( derived&);
void bar_( more_derived&);

If you want base to have an extra function pqr(), you can simply write the new function as a multimethod: void pqr( virtual base&);. It's that simple.

It is likely that this sort of freedom to add new functions can be misused, but that isn't really a good argument against adding multimethods to C++. The whole approach of C++ is to support different types of programming so that the programmer can choose what is most appropriate.

Language-specific error messages

It is often recommended that one reduce coupling between the code that generates an error object and the code that displays the error to the user. If an error occurs, one would like to create an error object that encodes all the available information about the problem (including dynamic information such as filenames), and pass it to whatever error-handling system is used (e.g. throw it, or put the data in a global variable and return an error code). Somewhere else, someone will want to display some information about the error to the user.

The code that created the original error object doesn't want to be concerned with whether the error data is going to be logged to a file, output as text in a particular language or sent to stderr using fprintf. In many cases, the code that creates the error is in a generic library that simply cannot know how the error information will ultimately be processed.

So there will be a separate function that takes the information in the error object, and displays it in a specific way. Note that this will often be more complicated than simply using printf( "Couldn't open file `%s' because error code %i", filename, e) because some languages would require the order of the error code and the filename to be reversed. In general, we need to have different spoken-language-specific functions that format the information in different ways.

I think the natural way of doing this would be to have different classes for each language, and use multimethods to select a display function appropriate to a particular language and a particular error type.

To be concrete, what I'm advocating is the following:

First, a hierarchy of errors:

struct FileError : std::exception { std::string filename; };
struct NoSuchFile : FileError {};
struct PermissionError : FileError {};
...

And a similar hierarchy of spoken languages:

struct Language {};
struct English : Language {};
struct Chinese : Language {};
...

We then define a multimethod that reports a particular error in a particular language:

void ReportError( virtual std::exception& e, virtual Language& l);

Particular implementations of this multimethod give the required functionality:

void ReportError_( NoSuchFile& e, English& l)
{
    l.stream << "No such file `" << e.filename << "'";
}
void ReportError_( PermissionError& e, English& l)
{
    l.stream << "Permission error for file `" 
             << e.filename << "'";
}

Note how we are using separate functions to give us the different formatting of error messages, instead of printf-style format strings. This gives us complete flexibility; for example for English, we could write code that adds trailing s's if numbers are larger than 1.

Finally, we can easily ensure that an English message is generated if the ideal language-specific functions aren't available:

void ReportError_( std::exception& e, Language& l)
{
    // Fallback for any language without a more specific implementation.
    English english;
    english.stream << "No support for specified language; using English:\n";
    ReportError( e, english);
}

The multimethod approach has the advantage that the application code can be written in a spoken-language-neutral way, and more than one language can be supported at runtime. In fact as long as the multimethod dispatch mechanism can cope with the addition of new types at runtime, then one would be able to add new languages to a running application.

Without multimethods, it is not easy to solve this problem. One way is to hard code a particular language into an executable; I wonder whether this approach is the cause of the Windows SDK coming in the form of hundreds of language-specific CDs... There are various database-style message systems around, but these are generally OS-specific, which makes writing portable code difficult.

GUI event dispatching

Event dispatching is so common that we don't often think of it as being interesting. However, I think that dispatching events in an extensible way requires multimethods. The event-dispatching problem is: given an event (mouse click, key press, redraw request etc) and a particular window in a GUI, call the appropriate code.

There are two parameters here - the event and the window. Typically, the window will be created by an application, and we'd expect it to contain information specific to the application - a text editor window might contain a pointer to the text that it is showing, while a dialogue box might contain a set of button handles. Similarly, it seems natural to design the event types as different classes, such as MouseClick, RedrawRequest, KeyPress etc, each with their own data - MouseClick would have the coordinates of where the mouse was clicked, RedrawRequest would have information about which parts of a window require redrawing plus a drawing context.

The simplest way for an application writer to provide the handler functions for their window-type would be to write a function for each event type. It is the event dispatcher's task to choose which of these functions to call. The dispatcher will be aware of all sorts of different windows, each with their own set of handler functions, so its problem is to map from the types of two parameters that are known only at runtime, to a particular function. This is multimethods.

Things can be simplified if there is a small fixed number of different event types. In this case, one can perform half of the dispatch with a fairly small and fast switch statement, and then perform the rest by calling a virtual function belonging to the window. But if one is interested in having an extensible system of events, then this breaks down; it's interesting that common GUI development environments such as Visual C++, Qt and GTK all use non-standard language facilities to implement event dispatching. In the case of Qt, this takes the form of the MOC (Meta Object Compiler) which preprocesses C++ source code in order to insert extra text; the others use macros to implement message maps.

If multimethods were available, one could write GUI code like:

System-wide gui.h:

// Polymorphic base classes
struct Window {...};
struct Event  {...};

// Event-handling multimethod.
void HandleEvent( virtual Window&, virtual Event&);

// Event types
struct MouseClick    : Event { int x, y; };
struct RedrawRequest : Event { int x1, y1, x2, y2; };
...

In a system GUI library source file, there would be some default implementations:

void HandleEvent_( Window&, Event&) {...}
/* Generic handler for all events and windows.
Probably does nothing. */

An application would do:

#include "gui.h"

struct MyWindow : Window {...}; // extra data for our window

void HandleEvent_( MyWindow& w, MouseClick& m)    { ... }
void HandleEvent_( MyWindow& w, RedrawRequest& r) { ... }

Processing parse trees

Parse trees typically contain nodes that have varying dynamic types. Generating code from a parse tree can involve a second dynamic type - the type of output format. For example a compiler's parse tree could be used to generate object files for more than one target processor.

This issue has arisen in Cmm itself. In order to enable Cmm's parser to be used independently as a library in custom tools that process C++ source code, it can be built as a Cmm programme. Outputting a node in the parse tree is done by calling a multimethod, which dispatches on two things: the type of the parser-node, and the type of the output object.

This allows custom source translators to be easily written. For example, Cmm's parser represents a for()-loop as an object of type ForLoop, so one could customise the output of for() loops by writing the following:

void Output_( ForLoop& forloop, OutStream& out)
{
    out.out << "{";
    forloop.Output( out); // default output routine.
    out.out << "}";
}

This example simply wraps all for()-loops inside an extra pair of curly braces, while leaving the output of all other nodes unchanged, which can be useful when using Visual C++'s compiler because of its scoping behaviour.

Implementation issues and Cmm

I've written a C++ source processor called Cmm which implements a multimethods language extension for C++. Cmm has been designed to generate multimethod code that supports dynamic loading and unloading of code, which means that all information about multimethods and their implementations is stored in dynamic data structures.

Cmm was written as a source processor because this works with different compilers on different systems, and I didn't really want to get involved in the internals of gcc. Also, a parsing library for C++ is a useful tool in itself. Unfortunately, there is a reason for the scarcity of free C++ parsers: parsing C++ is little short of a nightmare. In the end, I wrote a recursive-descent parser by hand, with lots of hacks to enable it to work without using name lookup.

It's actually impossible to correctly parse even C without name lookup (e.g. parsing a * b; requires knowledge of whether a is a type or not.), but luckily Cmm can get away with assuming certain things without it causing any trouble - most of the code it reads is simply output verbatim anyway.

Cmm takes individual compilation units containing multimethod code that have already been run through the C++ preprocessor (e.g. #include's have been expanded), and generates legal C++ compilation units, which can then be compiled and linked together conventionally.

The generated C++ code calls some support functions that are provided as a single source file called dispatch.cpp. This contains functions that manage data structures that store all known virtual functions and their implementations, the actual runtime dispatch functions, functions to support dispatch caching and also support for the exception types thrown for ambiguous/unmatched dispatches.

Generated code for multimethod dispatch

We will use the simple Overlap multimethod example described earlier. The virtual function is:

bool Overlap( virtual Shape&, virtual Shape&);

We will consider an implementation of Overlap that is specialised for a first parameter Square and second parameter Triangle. This will look like:

// user-written implementation function
bool Overlap_( Square& a, Triangle& b) {...}

In order to perform multimethod dispatch, one has to first decide which implementations match the dynamic types, and then try to find one of the implementations which can be considered a better match than all of the others.

The first step is done by calling auxiliary functions that Cmm creates for each implementation, which take the base parameters and return true only if they can each be cast to the implementation's parameters. Because these functions take the virtual function's base parameters, we cannot use conventional overloading to distinguish them, and so Cmm makes the function names unique using a mangling scheme which, for simplicity, will be denoted by _XYZ in the following:

// Cmm-generated match function for the function
// bool Overlap_( Square& a, Triangle& b);
bool Overlap_cmm_match_XYZ( Shape& a, Shape& b)
{
    if ( !dynamic_cast< Square*  >( &a)) return false;
    if ( !dynamic_cast< Triangle*>( &b)) return false;
    return true;
} 

This separate function is generated in the same compilation unit as the implementation, which enables the dynamic_cast to work with derived types defined in anonymous namespaces. [Actually, the Overlap_cmm_match_XYZ function takes an array of two void*'s rather than a separate parameter for each virtual type, each of which is static_cast-ed to Shape* before the dynamic_cast is attempted. This is to enable generic dispatch code to be used for different virtual functions.]
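
The array-of-void* form described in the bracketed note might look roughly like the following. This is a sketch under assumptions, not Cmm's generated code: the typedef match_fn, the function name Overlap_match_Square_Triangle and the helper Matches are hypothetical names chosen for illustration.

```cpp
#include <cassert>

struct Shape    { virtual ~Shape() {} };
struct Square   : Shape {};
struct Triangle : Shape {};

// One signature shared by every match function, so that generic
// dispatch code can store and call them uniformly (hypothetical name).
typedef bool (*match_fn)(void* const* params);

// Match function for an implementation taking (Square&, Triangle&).
// Precondition: each element of params was converted from a Shape*.
bool Overlap_match_Square_Triangle(void* const* params)
{
    // Recover the base pointers first; only then is dynamic_cast possible.
    Shape* a = static_cast<Shape*>(params[0]);
    Shape* b = static_cast<Shape*>(params[1]);
    return dynamic_cast<Square*>(a) != 0
        && dynamic_cast<Triangle*>(b) != 0;
}

// Packs the base-class pointers into an array and calls a match function.
bool Matches(match_fn fn, Shape& a, Shape& b)
{
    void* params[] = { &a, &b };
    return fn(params);
}
```

The pointers stored in the array must be Shape* values, because converting a void* back to a type other than the one it came from is not guaranteed to give a usable pointer.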

The second step requires that the inheritance tree for each dynamic type is known. The dispatch code can then compare the types taken by each matching implementation, and select the implementation for which each virtual parameter is no less derived than any other matching implementation's virtual parameter. As discussed earlier, this corresponds to the way conventional overloading works at compile time.

The information about the inheritance trees is encoded in C-style strings using a mangling scheme similar to that used by C++ compilers when generating symbol names. This allows static initialisation to be used to register implementations at runtime.

[Previous versions of Cmm registered implementations at build time by requiring a separate link-style invocation, but this made builds very complicated and slow, and precluded use with dynamic loading of code. The only advantages of the old scheme were that dispatch time may have been slightly faster, and all implementations were usable by static initialisation code.]

Finally, the generic dispatch code calls the actual implementation via a wrapper function that takes the base parameters, casts them directly to the derived types, and calls the implementation. Again, this function name is mangled:

// Cmm-generated call function for the function
// bool Overlap_( Square& a, Triangle& b);
bool Overlap_cmm_call_XYZ( Shape& a, Shape& b)
{
    return Overlap_(
        *static_cast< Square*  >( &a),
        *static_cast< Triangle*>( &b));
}

The function's precondition is that the derived types are correct and so the static_casts are legal. Using this wrapper function enables the dispatch code to work in terms of generic function pointers even if implementations use derived classes in anonymous namespaces.

[The function should use dynamic_cast rather than static_cast when Derived inherits virtually from Base, but this hasn't been implemented yet.]
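
To illustrate the virtual-inheritance case: Standard C++ does not allow static_cast from a virtual base class to a derived class, so a call wrapper for such a hierarchy would have to use dynamic_cast. The names below (Base, Derived, CallWrapperCast) are placeholders for this sketch only.

```cpp
#include <cassert>

struct Base { virtual ~Base() {} };

struct Derived : virtual Base
{
    int value;
    Derived() : value(7) {}
};

// static_cast< Derived*>( &b) would not compile here, because Base is a
// virtual base of Derived; dynamic_cast does the conversion at runtime.
// Throws std::bad_cast if b does not actually refer to a Derived.
Derived& CallWrapperCast(Base& b)
{
    return dynamic_cast<Derived&>(b);
}
```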

Registering multimethods using static initialisation

For each implementation, Cmm generates a global variable whose initialisation registers the implementation with the dispatch function:

static cmm_implementation_holder
    Overlap_XYZ(
        "7Overlap2_1_5Shape1_5Shape",                    // virtual fn.
        "8Overlap_2_2_5Shape_6Square2_5Shape_8Triangle", // implementation
        Overlap_cmm_match_XYZ,
        Overlap_cmm_call_XYZ);

cmm_implementation_holder is a class defined in dispatch.h/dispatch.cpp, whose constructor de-mangles the first two textual parameters to extract complete information about the inheritance tree for each virtual parameter taken by the virtual function and the implementation function. Together with the Overlap_cmm_match functions, this is sufficient to enable multimethod dispatch to be performed.

In this example, the first mangled string means: "A function called Overlap with 2 virtual parameters, the first a class containing one item in its inheritance tree, Shape, and the second also containing the same single class in its inheritance tree, Shape". The second mangled string means: "A function called Overlap_ with 2 virtual parameters, the first one being a class with 2 items in its inheritance tree, Shape followed by Square, while the second parameter's type also contains 2 items in its inheritance tree, the first one being Shape, and the second Triangle".
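
As an illustration, here is a small decoder for strings of this shape. The grammar it accepts is inferred only from the two example strings above (a length-prefixed function name, a parameter count followed by an underscore, then for each parameter a class count followed by underscore-prefixed, length-prefixed class names); it is an assumption, and Cmm's real mangling scheme may well differ. Signature, ReadNumber, ReadName, Expect and Demangle are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Signature
{
    std::string name;                               // function name
    std::vector< std::vector<std::string> > params; // inheritance chains, base first
};

// Reads a decimal number starting at pos and advances pos past it.
static std::size_t ReadNumber(const std::string& s, std::size_t& pos)
{
    if (pos >= s.size() || !std::isdigit(static_cast<unsigned char>(s[pos])))
        throw std::invalid_argument("expected a number");
    std::size_t n = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
        n = n * 10 + static_cast<std::size_t>(s[pos++] - '0');
    return n;
}

// Reads a length-prefixed name such as "5Shape".
static std::string ReadName(const std::string& s, std::size_t& pos)
{
    const std::size_t len = ReadNumber(s, pos);
    if (pos + len > s.size()) throw std::invalid_argument("truncated name");
    const std::string name = s.substr(pos, len);
    pos += len;
    return name;
}

// Consumes the single character c or throws.
static void Expect(const std::string& s, std::size_t& pos, char c)
{
    if (pos >= s.size() || s[pos] != c)
        throw std::invalid_argument("unexpected character");
    ++pos;
}

// Decodes a whole mangled string under the assumed grammar.
Signature Demangle(const std::string& s)
{
    std::size_t pos = 0;
    Signature sig;
    sig.name = ReadName(s, pos);
    const std::size_t nparams = ReadNumber(s, pos);
    Expect(s, pos, '_');
    for (std::size_t i = 0; i < nparams; ++i)
    {
        const std::size_t depth = ReadNumber(s, pos);
        std::vector<std::string> chain;
        for (std::size_t j = 0; j < depth; ++j)
        {
            Expect(s, pos, '_');
            chain.push_back(ReadName(s, pos));
        }
        sig.params.push_back(chain);
    }
    if (pos != s.size()) throw std::invalid_argument("trailing characters");
    return sig;
}
```

Because every name is length-prefixed, the trailing underscore in Overlap_ is read as part of the name and is not confused with the separators.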

This use of static initialisers to register implementations allows dynamically loaded code to automatically register new implementations with the dispatch functions. Furthermore, the destructor of the cmm_implementation_holder class unregisters the implementations, so one can load/unload code at will.
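
The register-on-construction, unregister-on-destruction idea can be sketched as follows (C++11 assumed). This is not the real cmm_implementation_holder: the class ImplementationHolder, the Registry() function and the string key are hypothetical stand-ins that show only the lifetime behaviour.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef void (*generic_fn)();

// Registry of known implementations. A function-local static is used so
// that it is constructed before the first holder that touches it, and
// therefore destroyed after that holder.
std::map<std::string, generic_fn>& Registry()
{
    static std::map<std::string, generic_fn> registry;
    return registry;
}

// Registers an implementation for as long as the holder object lives.
class ImplementationHolder
{
public:
    ImplementationHolder(const std::string& key, generic_fn fn)
        : key_(key)
    {
        Registry()[key_] = fn;
    }
    ~ImplementationHolder()
    {
        Registry().erase(key_);   // runs on program exit or library unload
    }
    ImplementationHolder(const ImplementationHolder&) = delete;
    ImplementationHolder& operator=(const ImplementationHolder&) = delete;

private:
    std::string key_;
};

// Placeholder implementation used only to have something to register.
void ExampleImpl() {}
```

A holder defined at namespace scope in a shared library is constructed when the library is loaded and destroyed when it is unloaded, which is what gives the automatic registration described above.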

The handling of implementation functions in dynamically loaded code has been tested on OpenBSD 3.2 and Slackware Linux 8.1, using the dlopen() and dlclose() functions.

Dispatch caches

Figuring out which of a set of implementation functions to call for a particular set of dynamic types is very slow, so some sort of caching scheme is required. Caching is performed by the code in the dispatch.cpp library. Currently this uses a std::map to map from a std::vector of std::type_info's to a function pointer. This gives a dispatch speed of O(log N), where N is the number of different combinations of dynamic types that have been encountered (some of which may be mapped to the same function pointer). On OpenBSD 3.2 with gcc 2.95, the dispatch time for two virtual parameters is around 10-100 times slower than a conventional virtual function call.
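
A cache of this kind could be sketched as below. The sketch assumes C++11 and uses std::type_index as the key element, which is a substitution on my part since std::type_info objects cannot be copied into a container; SlowLookup is a hypothetical stand-in for the full matching algorithm, and the counter exists only to show when the slow path runs.

```cpp
#include <cassert>
#include <map>
#include <typeindex>
#include <typeinfo>
#include <utility>
#include <vector>

struct Shape    { virtual ~Shape() {} };
struct Square   : Shape {};
struct Triangle : Shape {};

typedef bool (*OverlapFn)(Shape&, Shape&);
typedef std::vector<std::type_index> Key;   // one entry per virtual parameter

bool SquareTriangle(Shape&, Shape&) { return true; }

int slow_lookups = 0;   // counts how often the slow path is taken

// Stand-in for the slow search over all registered implementations.
OverlapFn SlowLookup(Shape& a, Shape& b)
{
    ++slow_lookups;
    if (dynamic_cast<Square*>(&a) && dynamic_cast<Triangle*>(&b))
        return &SquareTriangle;
    return 0;
}

// Looks in the cache first; falls back to SlowLookup and remembers the answer.
OverlapFn CachedLookup(Shape& a, Shape& b)
{
    static std::map<Key, OverlapFn> cache;
    Key key;
    key.push_back(std::type_index(typeid(a)));
    key.push_back(std::type_index(typeid(b)));

    std::map<Key, OverlapFn>::iterator it = cache.find(key);
    if (it == cache.end())
        it = cache.insert(std::make_pair(key, SlowLookup(a, b))).first;
    return it->second;
}
```

Note that this sketch also caches failed lookups as null pointers; a real cache would need to be cleared whenever implementations are registered or unregistered.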

It would probably be better to have special cache support for multimethods with one or two virtual parameters, using a std::map with key types std::type_info[1] and std::type_info[2]. No doubt templates could be used to do this with maximum obfuscation.

Dispatch functions

The actual virtual dispatch function is very simple, because it uses code in dispatch.cpp to do all the real work. This means that it is practical to generate a separate copy in all compilation units as an inline function, looking like:

inline bool Overlap( Shape& a, Shape& b)
{
    static cmm_virtualfn&   virtualfn =
        cmm_get_virtualfn( "7Overlap2_1_5Shape1_5Shape");
    typedef bool(*cmm_fntype)( Shape&, Shape&);
    
    const void*           params[] = { &a, &b};
    const std::type_info* types[]  = { &typeid( a), &typeid( b)};
    
    cmm_fntype cmm_fn = reinterpret_cast< cmm_fntype>(
            cmm_lookup( virtualfn, params, types));
    
    return cmm_fn( a, b);
}

The cmm_lookup function uses types as an index into the internal std::map dispatch cache. If this fails, the actual parameters params are used in the slow lookup algorithm described earlier. It returns a generic function pointer, which has to be cast into the correct type using reinterpret_cast.

Raw pointer to implementation function

Cmm provides an extra dispatch function that doesn't actually call the implementation. Instead, it returns a pointer to the best implementation function. This enables client code that calls a multimethod on the same parameters many times in a loop, to cache the function pointer and so avoid any dispatch overhead.

This extra dispatch function has the same name as the virtual function, with a suffix _cmm_get_impl. Using the Overlap example, if you have one collection of shapes that you know are all squares, and you want to search for an overlap with a particular shape, you would usually do:

void Fn( std::vector< Square*> squares, Shape& s)
{
    std::vector< Square*>::iterator it;
    for( it=squares.begin(); it!=squares.end(); ++it)
    {
        if ( Overlap( **it, s)) {...}
    }
}

With the generated Overlap_cmm_get_impl function, you can avoid the multimethod dispatch overhead in each iteration:

void Fn( std::vector< Square*> squares, Shape& s)
{
    std::vector< Square*>::iterator it;
    if ( squares.empty()) return;
    
    bool (*fn)( Shape&, Shape&) =
        Overlap_cmm_get_impl( *squares[0], s);
    
    for( it=squares.begin(); it!=squares.end(); ++it)
    {
        if ( fn( **it, s)) {...}
    }
}

Constant-time dispatch speed

It's possible to get constant-time dispatch speed if all types are assigned a unique small integer, by looking in a multi-dimensional array using the small integers as indices. Cmm has a command-line switch that makes the generated code use this technique.

This scheme is clearly potentially wasteful of memory. If a programme has a thousand classes, then each array will have up to one thousand elements. But only the array rows that are actually needed will be used.

In pseudo code, the dispatch for a function with two virtual parameters looks like:

void foo( virtual Base& a, virtual Base& b)
    int a_smallint = a.get_small_integer();
    int b_smallint = b.get_small_integer();
    fn = cache[a_smallint][b_smallint];
    return fn( a, b);

In this case, cache is essentially of type fn_ptr[][].

Cmm's dispatch.cpp contains an implementation of this dispatch method that allocates the array lazily so that memory is only allocated for those rows that are actually used.
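
A lazily allocated table along these lines might look like the following. This is a sketch of the general technique rather than the code in dispatch.cpp; the class LazyTable, its member names and the placeholder function ExampleImpl are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef void (*generic_fn)();

// Two-dimensional cache indexed by the small integers of two dynamic types.
// Rows start out empty and grow only when an entry is stored in them.
class LazyTable
{
public:
    // Returns the cached function pointer, or null if nothing is stored.
    generic_fn Get(std::size_t a, std::size_t b) const
    {
        if (a >= rows_.size() || b >= rows_[a].size()) return 0;
        return rows_[a][b];
    }

    // Stores fn at [a][b], extending the table only as far as needed.
    void Set(std::size_t a, std::size_t b, generic_fn fn)
    {
        if (a >= rows_.size()) rows_.resize(a + 1);            // new rows are empty
        if (b >= rows_[a].size()) rows_[a].resize(b + 1);      // new slots are null
        rows_[a][b] = fn;
    }

    // Number of rows that actually hold any slots.
    std::size_t AllocatedRows() const
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < rows_.size(); ++i)
            if (!rows_[i].empty()) ++n;
        return n;
    }

private:
    std::vector< std::vector<generic_fn> > rows_;
};

// Placeholder implementation used only to have something to store.
void ExampleImpl() {}
```

A lookup is two bounds checks and two index operations, so its cost does not depend on how many type combinations have been seen; a null result means the slow lookup still has to run.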

Getting a unique small integer for each dynamic type is slightly tricky. In an ideal world, the compiler and linker would conspire to make space available in the vtable, which would enable very fast lookup. Cmm can't do this though, so instead it inserts an inline virtual function body into all polymorphic classes, containing a static int to enable fast access to the unique integers:

class Base
{
    // Next function inserted by Cmm:
    virtual int cmm_get_small_integer() const
    {
        static int id=0;
        if ( !id) id = cmm_create_small_integer( typeid( *this));
        return id;
    }
};

The function cmm_create_small_integer() is in the Cmm library dispatch.cpp along with all of the other support functions. It maintains an internal map of std::type_info's to ints so that it returns the same integer if called more than once for the same type. This is required to make things work when the C++ environment doesn't implement the static int id correctly; for example, under OpenBSD 3.2, each compilation unit that contains the inline function cmm_get_small_integer() will have its own copy.
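
The map-backed allocation of small integers can be sketched like this (C++11 assumed). It is an illustration of the idea, not the code in dispatch.cpp: the function name CreateSmallIntegerSketch is hypothetical, and std::type_index is used as the map key as a convenience.

```cpp
#include <cassert>
#include <map>
#include <typeindex>
#include <typeinfo>

// Returns a small integer that is unique to the given type. Calling it
// again for the same type returns the same value, even if the caller's
// own static cache was duplicated across compilation units.
int CreateSmallIntegerSketch(const std::type_info& type)
{
    static std::map<std::type_index, int> ids;
    const std::type_index key(type);

    std::map<std::type_index, int>::iterator it = ids.find(key);
    if (it != ids.end()) return it->second;

    // Numbering starts at 1 so that 0 can mean "not yet assigned",
    // matching the if ( !id) test in the inserted virtual function.
    const int id = static_cast<int>(ids.size()) + 1;
    ids[key] = id;
    return id;
}
```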

The constant time dispatch system is not robust. It adds a virtual function to all polymorphic classes, but this only works if all code is passed through Cmm. Other code, such as code in libraries that are linked into the final executable, may break at runtime because of assuming a different vtable layout. To avoid breaking code in system libraries, Cmm doesn't insert the function into polymorphic classes defined in the std namespace, but of course this means that you cannot do constant-time multimethod dispatch on base classes that are defined in std, such as std::exception.

Cmm extras

Cmm parses C++ code into a type-safe internal representation, using a class hierarchy to represent the C++. For example, at the lowest level, there is a different class for each keyword and lexical element (Keyword_struct, Keyword_double, Keyword_EQ etc.). Then there are classes such as Declaration and ForLoop, which represent higher-level C++ concepts. A messy detail with C++ that I've come across (apart from the details of the parsing) is that all declarations potentially declare more than one item, for example int x, y;. This means that declarations have to be represented as a base type (the left-hand side of the declaration) and a list of items.

Having a C++ parser available makes a number of things possible that are usually not practical with C++. For example, Cmm can read an alternative declaration syntax based on one suggested by Stroustrup in The Design and Evolution of C++, and can read Python-style code with indentation but no braces.

Of more interest is support for serialisation and reflection. I'm not sure what the real definition of reflection is; I've written some reflection support into Cmm that allows serialisation of classes, and in the process this gives access to the types and names of all members of a class. There is currently no support for a program to examine its own functions.

Reflection support

The reflection support uses overloading to recursively call a function name, passing each member of a class in turn. It works by the user writing a function declaration prefixed by @cmm_memberrecursivefn. This gets replaced by a function definition as the following example demonstrates:

struct my_class
{
    int         x;
    std::string text;
};
@cmm_memberrecursivefn void foo( my_class& instance, std::ostream& out);

Cmm will convert this into:

struct my_class
{
    int         x;
    std::string text;
};
void foo( my_class& instance, std::ostream& out)
{
    foo( instance.x, out);
    foo( instance.text, out);
}

The idea here is that you write different overloads of the foo() function taking basic types, and the functions that Cmm generates in response to @cmm_memberrecursivefn directives make the appropriate calls. Base classes are handled like members in the way one would expect so that all members of the class are eventually passed to the foo() function (although virtual base classes aren't treated properly at the moment).
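
To make the pattern concrete, here is a small hand-written version in Standard C++ of what the user writes and what Cmm would generate. The struct outer, the output format and dump_example() are invented for illustration; only the shape of the generated functions follows the expansion shown above.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

struct my_class
{
    int         x;
    std::string text;
};

// User-written overloads for the basic types; these end the recursion.
inline void foo(int& value, std::ostream& out)         { out << "int:" << value << ";"; }
inline void foo(std::string& value, std::ostream& out) { out << "string:" << value << ";"; }

// What Cmm would generate from the @cmm_memberrecursivefn directive,
// written out by hand here.
inline void foo(my_class& instance, std::ostream& out)
{
    foo(instance.x, out);
    foo(instance.text, out);
}

// An invented class that contains my_class; it follows the same pattern,
// so nested members are reached through ordinary overload resolution.
struct outer
{
    my_class inner;
    int      count;
};

inline void foo(outer& instance, std::ostream& out)
{
    foo(instance.inner, out);
    foo(instance.count, out);
}

// Usage: every basic-type member is visited in declaration order.
inline std::string dump_example()
{
    outer o;
    o.inner.x = 42;
    o.inner.text = "hello";
    o.count = 7;
    std::ostringstream out;
    foo(o, out);
    assert(out.str() == "int:42;string:hello;int:7;");
    return out.str();
}
```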

More detailed reflection behaviour is possible with the @cmm_memberreflectfn directive. This is used in a similar way to @cmm_memberrecursivefn, except that the generated function adds two extra parameters, a char* pointing to the name of the member and a reference to the std::type_info for the member. Thus:

@cmm_memberreflectfn void bar( my_class& instance, std::ostream& out);

- will expand to:

void bar( my_class& instance, std::ostream& out)
{
    bar_cmm_reflect( instance.x, typeid( int), "x", out);
    bar_cmm_reflect( instance.text, typeid( std::string), "text", out);
}

This gives full access to all available information about the members of the class. One can write a simple function template for bar_cmm_reflect that will make calls back to the bar() function:

template< class T>
    void bar_cmm_reflect(
        T& data,
        const std::type_info& static_type,
        const char* name,
        std::ostream& out)
    {
        out << name << "=" << static_type.name() << "={";
        bar( data, out); // call back into bar().
        out << "}";      // close the brace opened above.
    }

With overloads of bar() for basic types and @cmm_memberreflectfn bar(...) directives for a set of classes, this scheme gives full access to all member-variable information for the classes.
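
The whole round trip can be tried out in Standard C++ by writing the generated function by hand. In the sketch below the function name show, the struct point and the output format are invented; show_cmm_reflect follows the naming pattern of the generated calls shown earlier. The type name is deliberately not printed, because the string returned by std::type_info::name() is implementation-defined.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <typeinfo>

// Invented example class.
struct point
{
    int x;
    int y;
};

// User-written overload for a basic type.
inline void show(int& value, std::ostream& out) { out << value; }

// User-written callback template: receives each member together with its
// static type and name, then calls back into the show() overload set.
template<class T>
void show_cmm_reflect(T& data, const std::type_info& static_type,
                      const char* name, std::ostream& out)
{
    (void) static_type;     // available to the callback, unused in this sketch
    out << name << "={";
    show(data, out);
    out << "}";
}

// Hand-written equivalent of what @cmm_memberreflectfn would generate.
inline void show(point& instance, std::ostream& out)
{
    show_cmm_reflect(instance.x, typeid(int), "x", out);
    show_cmm_reflect(instance.y, typeid(int), "y", out);
}

// Usage: each member is reported with its name.
inline std::string reflect_example()
{
    point p;
    p.x = 3;
    p.y = 4;
    std::ostringstream out;
    show(p, out);
    assert(out.str() == "x={3}y={4}");
    return out.str();
}
```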

Instead of the overloaded function calls used in the above examples, one could provide iterator-style access to member variables, but actually using items for which one only has an iterator (which must necessarily have a plain static type) seems difficult.

Multimethods and networked objects

I'm slightly suspicious of the idea of objects that can be used on different systems. If an object is migrated from one machine to another, in general one cannot migrate the methods along with the data because the second machine may have a different operating system. Even if it has an identical operating system, any differences in the configuration will mean that the code may not work in exactly the same way. This goes against the whole approach of systems like COM and CORBA, where object interfaces are fundamental.

So if an object is migrated, what will actually happen is that the data is migrated, and methods are attached to it at the other end, which may or may not behave in exactly the same way. So it may be better to forget about the concept of objects, and face the reality that we are dealing with data and code.

This corresponds very closely to the multimethod style, where virtual functions are quite separate from the classes that they act upon. Serialisation becomes only concerned with the migration of data, and the target system's multimethod support is then used to add whatever functionality is possible.

With Cmm's @cmm_memberreflectfn and @cmm_memberrecursivefn support, serialisation of data becomes trivial. Unserialisation is slightly more complicated though; one can read data into an existing object using @cmm_memberreflectfn and @cmm_memberrecursivefn, but usually one will want to construct a brand new object.

Perhaps this could be done by Cmm being told to add special constructors to particular classes, which take the serialised data and initialise each member variable:

struct foo
{
    @cmm_member_constructor foo( std::istream& in);
    int x;
    std::string text;
    MyClass data;
};

This would expand to:

struct foo
{
    foo( std::istream& in)
    :
    x( in),
    text( in),
    data( in)
    {}

    int x;
    std::string text;
    MyClass data;
};

The difficulty here is that while MyClass can be given a constructor that takes a std::istream parameter, fundamental types don't have such constructors.
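
One conceivable way around this, sketched below, would be for the generated constructor to initialise each member through a helper function template rather than through a std::istream constructor. This is only an idea, not something Cmm does: from_stream, record and read_example are invented names, and the helper assumes that every member type is default-constructible and supports operator>>.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: builds a value of any default-constructible type
// that can be read with operator>>.
template<class T>
T from_stream(std::istream& in)
{
    T value = T();
    in >> value;
    return value;
}

struct record
{
    // What a generated constructor might look like if it called the helper
    // for each member. Members are initialised in declaration order, so the
    // stream is read in that order too.
    explicit record(std::istream& in)
    :
    x(from_stream<int>(in)),
    text(from_stream<std::string>(in))
    {}

    int         x;
    std::string text;
};

// Usage: construct a brand new object from whitespace-separated text.
inline record read_example()
{
    std::istringstream in("42 hello");
    record r(in);
    assert(r.x == 42);
    assert(r.text == "hello");
    return r;
}
```

A class such as MyClass could either provide its own operator>> or be given a specialised overload of the helper that forwards to its std::istream constructor.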