Developing Toy

Here you’ll find some of the implementation details.

Bytecode

The output of Toy’s compiler, and the input of the interpreter, is known as “bytecode”. Here, I’ve attempted to fully document the layout of the canonical bytecode’s structure, but since this was written after most of this was implemented, there may be small discrepencies present.

There are four main sections of the bytecode:

Header
Literal Cache
Function Definitions
Program Definition

Bytecode Header Format

Note: The bytecode header format must not change.

This section is used to define what version of Toy is currently running, as well as to prevent any version/fork clashes.

The header consists of four values:

TOY_VERSION_MAJOR
TOY_VERSION_MINOR
TOY_VERSION_PATCH
TOY_VERSION_BUILD

The first three are single unsigned bytes, embedded at the beginning of the bytecode in sequence. These represent the major, minor and patch versions of the language. The fourth value is a null-terminated c-string of unspecified data, which is intended but not required to specify the time that the langauge’s compiler was itself compiled. The build string can hold arbitrary data, such as the current maintainer’s name, current fork of the language, or other versioning info.

There are some strict rules when interpreting these values (mimicking, but not conforming to semver.org):

Under no circumstance, should you ever run bytecode whose major version is different - there are definitely broken APIs involved.
Under no circumstance, should you ever run bytecode whose minor version is above the interpreter’s minor version - the bytecode could potentially use unimplemented features.
You may, at your own risk, attempt to run bytecode whose patch version is different.
You may, at your own risk, attempt to run bytecode whose build version is different.

All interpreter implementations retain the right to reject any bytecode whose header data does not conform to the above specification.

The latest version information can be found in toy_common.h

Literal Cache

In Toy, a “Literal” is a value of some kind, be it an integer, or a dictionary, or even a variable name. Rather than embedding the same literal (potentially) many times within the bytecode, the “Literal Cache” was devised to act as an immutable, indexable repository of any literals needed. When bytecode is first loaded into the interpreter, the first thing that happens (after the header is parsed) is the reconstruction of the literal cache. The internal function readInterpreterSections() is responsible for this step.

The first unsigned short to be read from this section is literalCount, which defines the number of literals which are to be read. Once all literals have been read out of this section, the opcode TOY_OP_SECTION_END is expected to be consumed. Some preprocessor macros can also enable or disable debug printing functionality within the repl.

The list of valid literal types are:

TOY_LITERAL_NULL

This literal is simply inserted into the literal cache when encountered.

TOY_LITERAL_BOOLEAN

This literal specifies that the next byte is it’s value, either true or false.

TOY_LITERAL_INTEGER

This literal specifies that the next 4 bytes are it’s value, interpreted as a 32-bit integer.

TOY_LITERAL_FLOAT

This literal specifies that the next 4 bytes are it’s value, interpreted as a 32-bit floating point integer.

TOY_LITERAL_STRING

This literal specifies that the next collection of null terminated bytes are it’s value, interpreted as a null-terminated string.

TOY_LITERAL_ARRAY_INTERMEDIATE

TOY_LITERAL_ARRAY_INTERMEDIATE specifies that the literal to be read is a flattened LiteralArray. A “flattened” compound literal does not actually store it’s contents, only references to it’s contents’ positions within the literal cache.

To read this array, you must first read an unsigned short which specifies the size, then read that many additional unsigned shorts, which are indices. Finally, the original LiteralArray can be reconstructed using those indices, in order.

As the final step, the newly reconstructed LiteralArray is added to the literal cache.

TOY_LITERAL_DICTIONARY_INTERMEDIATE

TOY_LITERAL_DICTIONARY_INTERMEDIATE specifies that the literal to be read is a flattened LiteralDictionary. A “flattened” compound literal does not actually store it’s contents, only references to it’s contents’ positions within the literal cache.

To read this dictionary, you must first read an unsigned short which specifies the size (both keys and values), then read that many additional unsigned shorts, which are indices of keys and values. Finally, the original LiteralDictionary can be reconstructed using those key and value indices.

As the final step, the newly reconstructed LiteralDictionary is added to the literal cache.

TOY_LITERAL_FUNCTION

When a TOY_LITERAL_FUNCTION is encountered, the next unsigned short to be read (the function index) should be converted into an integer literal, before having it’s type manually changed to TOY_LITERAL_FUNCTION_INTERMEDIATE for storage within the literal cache.

Functions will be processed properly in a later step - so this literal is added to the cache as a placeholder until that point.

TOY_LITERAL_IDENTIFIER

This literal specifies that the next collection of null terminated bytes are it’s value, interpreted as a null-terminated string.

TOY_LITERAL_TYPE

This literal specifies that the next byte is the type of a literal, and the following byte is a boolean specifying const-ness.

(This literal type may be integrated with TOY_LITERAL_TYPE_INTERMEDIATE at some point.)

TOY_LITERAL_TYPE_INTERMEDIATE

This literal specifies that the next byte is the type of a literal, and the following byte is a boolean specifying const-ness.

Then if the type is TOY_LITERAL_ARRAY, the following unsigned short is an index within the cache, representing the type of the contents.

Otherwise, if the type is TOY_LITERAL_DICTIONARY, the following two unsigned shorts are indices within the cache, representing the types of the keys and values.

TOY_LITERAL_INDEX_BLANK

This literal is simply inserted into the literal cache when encountered.

Function Definitions

The second stage of readInterpreterSections() is used to read the third section of the given bytecode - the function definitions.

The first unsigned short is the number of functions present within this section. The second unsigned short is the length of this entire section (this one is not necessarily needed, and may be removed at some point).

For each TOY_LITERAL_FUNCTION_INTERMEDIATE within the cache, you must read an unsigned short as the size. Then, the following size block of bytecode is to be copied, wholesale, into the specified cached literal, before setting that literal’s type to TOY_LITERAL_FUNCTION. While the function is not operational yet, it will be further processed when needed.

Once all function literals have been read out of this section, the opcode TOY_OP_SECTION_END is expected to be consumed.

Program Definition

TODO

Opcodes

TODO: finish these

Toy_Opcode	Value	Description
TOY_OP_EOF	0	End of File signal. Used internally as an OK signal.
TOY_OP_PASS	1	Does nothing - a no-op.
TOY_OP_ASSERT	2	Pops `rhs`, `lhs`* from the stack. `rhs` must be a string, identifiers are not accepted. If `lhs` is `null` or `false`, prints `rhs` to the assert output, and cancels the script.
TOY_OP_PRINT	3	Pops one* literal from the stack, and prints it to the print output.
TOY_OP_LITERAL	4	Read one byte as an index; retrieve the given index’s literal from the cache, and push it to the stack.
TOY_OP_LITERAL_LONG	5	Same as `TOY_OP_LITERAL`, except you read two bytes at a time.
TOY_OP_LITERAL_RAW	6	Pops one literal from the stack. If it’s an identifier, explicitly replace it with it’s value. Push the literal to the stack.
TOY_OP_NEGATE	7	Pops one* literal from the stack. If it’s an integer or float**, push the negative value to the stack.
TOY_OP_ADDITION	8	Pops `rhs`, `lhs` from the stack. If both are strings, attempts to concatenate them, and push the result to the stack. Otherwise, will attempt to add `lhs` to `rhs` (coercing integers to floats if needed), and push the result to the stack.
TOY_OP_SUBTRACTION	9	Pops `rhs`, `lhs` from the stack. Attempts** to subtract `rhs` from `lhs` (coercing integers to floats if needed), and push the result to the stack.
TOY_OP_MULTIPLICATION	10	Pops `rhs`, `lhs` from the stack. Attempts** to multiply `lhs` by `rhs` (coercing integers to floats if needed), and push the result to the stack.
TOY_OP_DIVISION	11	Pops `rhs`, `lhs` from the stack. Attempts** to divide `lhs` by `rhs` (coercing integers to floats if needed), and push the result to the stack.
TOY_OP_MODULO	12	Pops `rhs`, `lhs` from the stack. Attempts** to modulo `lhs` by `rhs`, and push the result to the stack.
TOY_OP_GROUPING_BEGIN	13	Begin operating on a group; analagous to the `(` operator.
TOY_OP_GROUPING_END	14	Finish operating on a group; analagous to the `)` operator.
TOY_OP_SCOPE_BEGIN	15	Push an inner-scope onto the stack; analogous to the `{` operator.
TOY_OP_SCOPE_END	16	Push an inner-scope off of the stack; analogous to the `}` operator.
TOY_OP_TYPE_DECL	17	Not used.
TOY_OP_TYPE_DECL_LONG	18	Not used.
TOY_OP_VAR_DECL	19	Read one byte, retrieving that literal from the cache as `identifier`, and then read the next byte, retrieving that literal from the cache as `type`. Attempt** to declare a variable with `identifier` as its name and `type` as its type, then pop the top of the stack* as it’s value (cooerce integer values to floats if `type` specifies a float).
TOY_OP_VAR_DECL_LONG	20	Same as `TOY_OP_VAR_DECL`, except you read two bytes at a time.
TOY_OP_FN_DECL	21	Read one byte, retrieving that literal from the cache as `identifier`, and then read the next byte, retrieving that literal from the cache as `function`. Attempt** to declare a variable with `identifier` as its name and `fn const` as it’s type, then set `function` as its value. The function’s scope has the current scope as its ancestor.
TOY_OP_FN_DECL_LONG	22	Same as `TOY_OP_FN_DECL`, except you read two bytes at a time.
TOY_OP_VAR_ASSIGN	23	Pops `rhs`, `lhs` from the stack. `lhs` must be an identifier. Attempts* to store the value of `rhs`, with `lhs` as it’s identifier.
TOY_OP_VAR_ADDITION_ASSIGN	24	Equivilent of `TOY_OP_ADDITION` followed by `TOY_OP_VAR_ASSIGN`.
TOY_OP_VAR_SUBTRACTION_ASSIGN	25	Equivilent of `TOY_OP_SUBTRACTION` followed by `TOY_OP_VAR_ASSIGN`.
TOY_OP_VAR_MULTIPLICATION_ASSIGN	26	Equivilent of `TOY_OP_MULTIPLICATION` followed by `TOY_OP_VAR_ASSIGN`.
TOY_OP_VAR_DIVISION_ASSIGN	27	Equivilent of `TOY_OP_DIVISION` followed by `TOY_OP_VAR_ASSIGN`.
TOY_OP_VAR_MODULO_ASSIGN	28	Equivilent of `TOY_OP_MODULO` followed by `TOY_OP_VAR_ASSIGN`.
TOY_OP_TYPE_CAST	29	Pops `val`, `type` from the stack. Attempts** to change the type of `val` to `type`, and push the result to the stack. Only works on `bool`, `int`, `float` and `string` types, and only if the value is in the correct format.
TOY_OP_TYPE_OF	30	Pops `rhs` from the stack. If `rhs` is an identifier, it determines the the type of the variable specified, otherwise it determines the type of `rhs` directly. The result is pushed to the stack.
TOY_OP_IMPORT	31	Pops `alias` and `identifier` from the stack. Attempts** to invoke a hook specified by `identifier`, with the given `alias`.
TOY_OP_EXPORT_removed	32	Removed.
TOY_OP_INDEX	33	Pops `third`, `second`, `first`* and `compound`* from the stack. Pushes the value specified by `compound[first : second : third]` onto the stack.
TOY_OP_INDEX_ASSIGN	34	Pops `assign`, `third`, `second`, `first` and `compound`* from the stack. Sets the value specified by `compound[first : second : third]` to be `assign`.
TOY_OP_INDEX_ASSIGN_INTERMEDIATE	35	Due to the unfortunately bonkers nature of indexing, this is needed as part of `TOY_OP_INDEX_ASSIGN`. It acts the same as `TOY_OP_INDEX`, but leaves an additional copy of some variables on the stack for `TOY_OP_INDEX_ASSIGN` to process.
TOY_OP_DOT	36
TOY_OP_COMPARE_EQUAL	37
TOY_OP_COMPARE_NOT_EQUAL	38
TOY_OP_COMPARE_LESS	39
TOY_OP_COMPARE_LESS_EQUAL	40
TOY_OP_COMPARE_GREATER	41
TOY_OP_COMPARE_GREATER_EQUAL	42
TOY_OP_INVERT	43
TOY_OP_AND	44
TOY_OP_OR	45
TOY_OP_JUMP	46
TOY_OP_IF_FALSE_JUMP	47
TOY_OP_FN_CALL	48
TOY_OP_FN_RETURN	49
TOY_OP_POP_STACK	50
TOY_OP_TERNARY	51
TOY_OP_FN_END	52
TOY_OP_SECTION_END	255
TOY_OP_PREFIX	256	Used internally.
TOY_OP_POSTFIX	257	Used internally.

*If this literal is an identifier, it is instead replaced with the correct given value from the current scope.
**On failure, the script will print an error message to the error output and exit.

Function Internal Structure

TODO: loose first argument, args & returns counters in the program space

Parser Structure and Operations

TODO

Compiler Structure and Operations

TODO

Interpreter Structure and Operations

The Toy interpreter is, at it’s core, just a big loop that reads bytes from memory and acts on them. Here, I’ll break down exactly how it works, from a top-down perspective.

Running the Interpreter

There are four main functions for running the interpreter:

Toy_initInterpreter
Toy_runInterpreter
Toy_resetInterpreter
Toy_freeInterpreter

First, init zeroes out the interpreter, sets up the printing functions, and delegates to reset, which in turn sets up the program’s scope (and injects the default global functions). The initialization function is split into two this way so that reset can be used independantly on a “dirty” interpreter to ready it for another script (or another run of the same script). reset is usually not needed and may be removed in future.

free simply frees the interpreter after execution.

Interestingly, run doesn’t jump straight into exection. Instead, it first does it’s own bit of setup, before reading out the bytecode’s header. If the header indicates an incompatible version, then the interpreter will refuse to run, to prevent mistakes from ruining the program.

run will also delegate to a function called readInterpreterSections(), which reads and reconstructs the “literalCache” - a collection of all values within the program (variable identifiers, variable values, function bytecode, etc.)

Next, run will pass to a function called execInterpreter(), which contains the program’s loop.

Finally, run will automatically free the bytecode and associated literalCache (this may change at some point).

Executing the Interpreter

Opcodes within the bytecode are 1 byte in length, and specify a single action to take. Each possible action is definied within the interpreter in a function that begins with exec, and are called from within a big looping switch statement. If any of these exec functions encounters an error, they can simply return false to break the loop.

The interpeter is stack-based; most, if not all of the actions are preformed on literals within a specially designated array called stack. for example:

	case TOY_OP_PRINT:
		if (!execPrint(interpreter)) {
			return;
		}
	break;

When a the opcode TOY_OP_PRINT is encountered, the top literal within the stack is popped off, and printed (more info on literals below).

static bool execPrint(Toy_Interpreter* interpreter) {
	//get the top literal
	Toy_Literal lit = Toy_popLiteralArray(&interpreter->stack);

	//if the top literal is an identifier, get it's value
	Toy_Literal idn = lit;
	if (TOY_IS_IDENTIFIER(lit) && Toy_parseIdentifierToValue(interpreter, &lit)) {
		Toy_freeLiteral(idn);
	}

	//print as a string to the current print method
	Toy_printLiteralCustom(lit, interpreter->printOutput);

	//free the literal
	Toy_freeLiteral(lit);

	//continue the loop
	return true;
}

Identity Crisis

As in most programming languages, variables can be represented by names specified by the programmer; in Toy, these are called “identifiers”. These identifiers can be passed around in place of their actual values, but can’t be used directly. To retrieve a value, you must first “parse” it, like so:

Toy_Literal idn = literal; //cache the literal, just in case it's an identifier
if (TOY_IS_IDENTIFIER(literal) && Toy_parseIdentifierToValue(interpreter, &literal)) { //if it is an identifier, parse it...
	Toy_freeLiteral(idn); //always remember to free the original identifier, otherwise you'll have a memory leak!
}

You will often see this pattern throughout the codebase.

Other Utility Functions

Other functions are available at the top of the interpreter source file:

printing utilities
injection utilities
parsing utilities
bytecode utilities
function utilities (these ones is at the very bottom of the source file)

Literals

TODO

Arrays & Dictionaries

TODO