This file contains some information about the design and implementation of Cmm. The directory containing this page (./) contains the Cmm source code.
See ../users.html for a description of the Cmm C++ language extension and Cmm usage information.
Cmm is written in (hopefully portable) C++. It was designed in C++ rather than UML or similar (which I'm not going to apologise for BTW) and is very much work in progress.
The main component is a simplified C++ parser. This parser doesn't do a
complete parse - for example by default it ignores the {...}
in
all function and class definitions (the -detailedparse
switch
can be used to change this). All types and functions defined by the parser
are in the namespace parser
.
The other main components are support for cmm -c, cmm -l
and cmm -s. These are contained in the structs
SecondPassInfo
and ThirdPassInfo
respectively, both
in the namespace Cmm
- see cmm.h.
The generated parse tree consists of statically-typed pointers to various
structs that represent different C++ constructs. These structs all inherit
from the base type parser::node_t
, and the inheritance tree for
the different node types reflects the C++ grammar. See parser.h for details of the parse tree representation.
The source code that is parsed is kept in memory as a zero-terminated C-style string, and each node that makes up the parse tree contains pointers into this string (not a copy of parts of this string).
Lexical elements like identifiers, keywords and operators are represented
as individual node types. For example, the C++ keyword int
is
represented as a Keyword_int
, and the C++ operator
-=
is represented as a keyword_MINUSEQ
; both of
these derive from the struct keyword
which in turn derives from
node_t
. C++ RTTI is used to determine the type of a node in the
parsing code. For example, the lexer's lexer::GetNext()
method
returns a node_t*
, and the parser then uses C++ RTTI to
determine the actual type of the node. This approach is also used in
higher-level parsing - for example after getting a top-level node, the parser
use RTTI to determine whether the node represents a function. See lexer.h for information on the lexer interface, and parser.cpp for the parsing code.
The parser is hand-written C++. I have used lex and yacc in the past, but have tired of their indescriminant use of macros and global variables. Also, there isn't a readily available yacc grammar for C++. There are some interesting comments about parsing C++ in section 3.3.2 (page 68) of Stroustrup's The Design and Evolution of C++.
The parser doesn't do any name lookup. Unfortunately C and C++ are
impossible to parse correctly without name lookup (e.g. A * B;
is
a declaration if A
is a type, otherwise it is an expression), so
the parser doesn't always get things right. In particular, it looks for a
declaration before looking for an expression. So foo( x);
will
always be treated as a declaration even inside function bodies, where it is
more likely to be a function call.
The parser leaks memory - the parse tree nodes contain plain pointers to subtrees and have no destructors. This simplifies the code - there are no reference counted pointers etc - and allows the parse tree to be manipulated easily. For example, when generating a new function with the same parameter types as an existing function, one can simply make the new function point to the existing function's parameter node.
If leaking memory had to be avoided, it would be possible to keep a list
of all allocated nodes by making the base node's constructor
parser::node_t::node_t()
update a static list of allocated
nodes. An alternative would be to use referenced-counted pointers everywhere.
Cmm is an experimental command line programme that doesn't run for very long
so, for the moment at least, the memory leaks don't really matter.
The generation of functions also leaks memory - heap-allocated
std::stringstream
's are used to contain pieces of source code
that make up the functions. I've had a few problems with VC++ and STLport's
std::stringstream
class which has complicated this slightly.
The parser is rather slow, mainly because of the lexer - it does a simple
comparison with each lexical element using the function
StartsWith()
in parser.cpp, until it
finds a match. Of course there are much better ways of extracting lexical
elements, but I haven't had time to improve this aspect of the programme
yet.
cmm -c
The input Cmm file is parsed, and a Cmm::SecondPassInfo
is
constructed from the parse tree (see cmm.h). The
constructor for this class walks through the parse tree looking for all
virtual functions, and also for all functions that could be implementations
of known virtual functions. This information is used to generate the Cmmi
file and output C++ source making various changes to convert from Cmm to
legal C++.
cmm -l
A Cmm::ThirdPassInfo
is constructed (see cmm.h) and then passed each of the input Cmmi files in turn.
The virtual and implementation function information is extracted and
organised so that the implementations for each virtual function are known
(independant of which Cmm file they originated in). This allows dispatch
functions to be generated and appended to one of the implementation files.
Cmmi files are designed to be written and read by the same Cmm programme during a build so the format doesn't need to be fixed forever. The current format is:
Cmmi: virtualfn-spec*
virtualfn-spec: virtualfunction "<c++filename>" <virtual fn prototype> <implementation>*
implementation: implementation "<c++filename>" <numparams> paraminfo ... <matchfn> <callfn>
paraminfo: <derivationdepth> <classname> <classname> <classname> ...
It's probably much easier to look at an example to understand the format - see ../examples/overlap/main.cmmi. The code that writes/reads Cmmi files is in cmm.cpp.
Example dispatch code can be seen in ../examples/overlap/main.cppextra. It might be easier to read this rather than try to understand the textual explanation below.
The dispatch function has to decide which implementation function to call,
given the particular set of dynamic types of it's parameters. Finding this is
a slow process, so results are cached in a static std::map
that
maps from a std::vector
of std::type_info*
's to a
function pointer.
The std::map
and std::vector
container types can
be changed with cmm's -map and -array parameters. This is
useful because the dispatch speed depends primarily on these containers.
There are two support functions that cmm -c generates for each implementation it finds. cmm -l also declares these functions just before the dispatch function, so that the dispatch function can call them.
The first support function takes just the virtual parameters, and does a
dynamic_cast<>
on each one to test whether it is
compatible. This function returns true
if the implementation
matches or false
if it doesn't.
The second support function is the implementation that the dispatch
function calls. It has exactly the same signature as the virtual function,
i.e. it takes references to base types, rather than the implementation's
derived types. This allows the dispatch function to use it without knowing
anything about the derived types - they could be undefined in the dispatch
function's compilation unit. This function uses static_cast
to
get the derived types - it assumes that the dispatch function is correct and
is actually giving it the appropriate derived dynamic types - and then calls
the real implementation.
Code generated by cmm -c uses a faulty algorithm. Cmm -s uses code in dispatch.cpp, which is correct (though slower).
This mode works by using a library dispatch.cpp that accepts registration of multimethod implementations at runtime, and also provides a generic dispatch function that works for all multimethods. For each implementation, Cmm adds a global variable whose initialisation causes the implementation to be registered with the library.
Because multimethod dispatch is done using the library, the actual
multimethod-specific code is very simple, and so each compilation unit that
contains a multimethod implementation will also get an implementation of the
dispatch function. This function is defined as inline
to allow
it to occur in multiple compilation units. Whether the final executable ends
up with multiple copies of this function depends on how good the linker is at
removing duplicate code.
The generic dispatch works by dealing with arrays of void*
which point to the virtual parameters. The implementation-specific match
functions first static_cast
these to the appropriate base class
before using dynamic_cast
to decide whether they have the
correct dynamic type. The generic dispatch function is also given an array of
std::type_info*
which are used to to address a cache, in the
same way as with cmm -c/-l.
This all means that there is more overhead involved in a dynamic dispatch. Instead of Cmm generating a complete dispatch function for each multimethod, it generates a functions which puts the addresses and typeids of the virtual parameters in two arrays, and calls the generic dispatch function, which returns a pointer to the appropriate implementation. The generic dispatch function does much the same things as cmm -l's dispatch function, so the overhead is the creation of the arrays and the extra function call.
Please note that the code in cmm.cpp is pretty horrendous, and needs a severe cleanup/rewrite.
As well as fixing the limitations mentined in ../users.html, the following are things that I'd like to try:
Cmm -s allows runtime modification of the available implementations, and things work automatically when DLLs are loaded/unloaded..
The idea is to have the caller decide that a function call should be
resolved dynamically, with no restriction on having base classes for each
parameter. This would be done when the call to a function prefixes one or
more arguments with virtual
. This makes the dynamic dispatch
case similar to conventional compile-time overloading, where no base types
are required when lookup is performed.
As of cmm-0.27, the experimental -caller-dispatch
flag (see
test107 and examples/caller-dispatch.cmm)
supports this usage.
A virtual function with a single virtual parameters is equivalent to a conventional C++ virtual method. Cmm has access to all source code when a project is built, so it could add virtual member functions to classes and make dispatch functions call these methods directly. For example a Cmm file:
struct Base
{ ...
};
void Foo( virtual Base& x, double y);
struct Derived : Base
{ ...
};
void Foo( Derived& x, double y)
{ // implementation for class Derived.
}
- would be translated into:
struct Base
{ ...
virtual void cmm_Foo( double y) = 0;
};
inline void Foo( Base& x, double y)
{
return x.cmm_Foo( y);
}
struct Derived1 : Base
{ ...
virtual void cmm_Foo( double y)
{
Base& x = *this;
// implementation for class Derived
}
};
And no separate dispatch function would be required.
There is one difference here, in that a compile time error will be given
if the original Cmm source doesn't contain an implementation of Foo(
Derived&, double)
for some class Derived
. Cmm as it
stands would give only a runtime error. However, one could make Cmm output a
Base::cmm_Foo( double)
that throws an exception.