View on GitHub

Toy

A toy programming language.

Developing Toy

Here you’ll find some of the implementation details.

Bytecode

The output of Toy’s compiler, and the input of the interpreter, is known as “bytecode”. Here, I’ve attempted to fully document the layout of the canonical bytecode’s structure, but since this was written after most of this was implemented, there may be small discrepencies present.

There are four main sections of the bytecode:

Bytecode Header Format

Note: The bytecode header format must not change.

This section is used to define what version of Toy is currently running, as well as to prevent any version/fork clashes.

The header consists of four values:

The first three are single unsigned bytes, embedded at the beginning of the bytecode in sequence. These represent the major, minor and patch versions of the language. The fourth value is a null-terminated c-string of unspecified data, which is intended but not required to specify the time that the langauge’s compiler was itself compiled. The build string can hold arbitrary data, such as the current maintainer’s name, current fork of the language, or other versioning info.

There are some strict rules when interpreting these values (mimicking, but not conforming to semver.org):

All interpreter implementations retain the right to reject any bytecode whose header data does not conform to the above specification.

The latest version information can be found in toy_common.h

Literal Cache

In Toy, a “Literal” is a value of some kind, be it an integer, or a dictionary, or even a variable name. Rather than embedding the same literal (potentially) many times within the bytecode, the “Literal Cache” was devised to act as an immutable, indexable repository of any literals needed. When bytecode is first loaded into the interpreter, the first thing that happens (after the header is parsed) is the reconstruction of the literal cache. The internal function readInterpreterSections() is responsible for this step.

The first unsigned short to be read from this section is literalCount, which defines the number of literals which are to be read. Once all literals have been read out of this section, the opcode TOY_OP_SECTION_END is expected to be consumed. Some preprocessor macros can also enable or disable debug printing functionality within the repl.

The list of valid literal types are:

TOY_LITERAL_NULL

This literal is simply inserted into the literal cache when encountered.

TOY_LITERAL_BOOLEAN

This literal specifies that the next byte is it’s value, either true or false.

TOY_LITERAL_INTEGER

This literal specifies that the next 4 bytes are it’s value, interpreted as a 32-bit integer.

TOY_LITERAL_FLOAT

This literal specifies that the next 4 bytes are it’s value, interpreted as a 32-bit floating point integer.

TOY_LITERAL_STRING

This literal specifies that the next collection of null terminated bytes are it’s value, interpreted as a null-terminated string.

TOY_LITERAL_ARRAY_INTERMEDIATE

TOY_LITERAL_ARRAY_INTERMEDIATE specifies that the literal to be read is a flattened LiteralArray. A “flattened” compound literal does not actually store it’s contents, only references to it’s contents’ positions within the literal cache.

To read this array, you must first read an unsigned short which specifies the size, then read that many additional unsigned shorts, which are indices. Finally, the original LiteralArray can be reconstructed using those indices, in order.

As the final step, the newly reconstructed LiteralArray is added to the literal cache.

TOY_LITERAL_DICTIONARY_INTERMEDIATE

TOY_LITERAL_DICTIONARY_INTERMEDIATE specifies that the literal to be read is a flattened LiteralDictionary. A “flattened” compound literal does not actually store it’s contents, only references to it’s contents’ positions within the literal cache.

To read this dictionary, you must first read an unsigned short which specifies the size (both keys and values), then read that many additional unsigned shorts, which are indices of keys and values. Finally, the original LiteralDictionary can be reconstructed using those key and value indices.

As the final step, the newly reconstructed LiteralDictionary is added to the literal cache.

TOY_LITERAL_FUNCTION

When a TOY_LITERAL_FUNCTION is encountered, the next unsigned short to be read (the function index) should be converted into an integer literal, before having it’s type manually changed to TOY_LITERAL_FUNCTION_INTERMEDIATE for storage within the literal cache.

Functions will be processed properly in a later step - so this literal is added to the cache as a placeholder until that point.

TOY_LITERAL_IDENTIFIER

This literal specifies that the next collection of null terminated bytes are it’s value, interpreted as a null-terminated string.

TOY_LITERAL_TYPE

This literal specifies that the next byte is the type of a literal, and the following byte is a boolean specifying const-ness.

(This literal type may be integrated with TOY_LITERAL_TYPE_INTERMEDIATE at some point.)

TOY_LITERAL_TYPE_INTERMEDIATE

This literal specifies that the next byte is the type of a literal, and the following byte is a boolean specifying const-ness.

Then if the type is TOY_LITERAL_ARRAY, the following unsigned short is an index within the cache, representing the type of the contents.

Otherwise, if the type is TOY_LITERAL_DICTIONARY, the following two unsigned shorts are indices within the cache, representing the types of the keys and values.

TOY_LITERAL_INDEX_BLANK

This literal is simply inserted into the literal cache when encountered.

Function Definitions

The second stage of readInterpreterSections() is used to read the third section of the given bytecode - the function definitions.

The first unsigned short is the number of functions present within this section. The second unsigned short is the length of this entire section (this one is not necessarily needed, and may be removed at some point).

For each TOY_LITERAL_FUNCTION_INTERMEDIATE within the cache, you must read an unsigned short as the size. Then, the following size block of bytecode is to be copied, wholesale, into the specified cached literal, before setting that literal’s type to TOY_LITERAL_FUNCTION. While the function is not operational yet, it will be further processed when needed.

Once all function literals have been read out of this section, the opcode TOY_OP_SECTION_END is expected to be consumed.

Program Definition

TODO

Opcodes

TODO: finish these

Toy_Opcode Value Description
TOY_OP_EOF 0 End of File signal. Used internally as an OK signal.
TOY_OP_PASS 1 Does nothing - a no-op.
TOY_OP_ASSERT 2 Pops rhs, lhs* from the stack. rhs must be a string, identifiers are not accepted. If lhs is null or false, prints rhs to the assert output, and cancels the script.
TOY_OP_PRINT 3 Pops one* literal from the stack, and prints it to the print output.
TOY_OP_LITERAL 4 Read one byte as an index; retrieve the given index’s literal from the cache, and push it to the stack.
TOY_OP_LITERAL_LONG 5 Same as TOY_OP_LITERAL, except you read two bytes at a time.
TOY_OP_LITERAL_RAW 6 Pops one literal from the stack. If it’s an identifier, explicitly replace it with it’s value. Push the literal to the stack.
TOY_OP_NEGATE 7 Pops one* literal from the stack. If it’s an integer or float**, push the negative value to the stack.
TOY_OP_ADDITION 8 Pops rhs*, lhs* from the stack. If both are strings, attempts to concatenate them, and push the result to the stack**. Otherwise, will attempt** to add lhs to rhs (coercing integers to floats if needed), and push the result to the stack.
TOY_OP_SUBTRACTION 9 Pops rhs*, lhs* from the stack. Attempts** to subtract rhs from lhs (coercing integers to floats if needed), and push the result to the stack.
TOY_OP_MULTIPLICATION 10 Pops rhs*, lhs* from the stack. Attempts** to multiply lhs by rhs (coercing integers to floats if needed), and push the result to the stack.
TOY_OP_DIVISION 11 Pops rhs*, lhs* from the stack. Attempts** to divide lhs by rhs (coercing integers to floats if needed), and push the result to the stack.
TOY_OP_MODULO 12 Pops rhs*, lhs* from the stack. Attempts** to modulo lhs by rhs, and push the result to the stack.
TOY_OP_GROUPING_BEGIN 13 Begin operating on a group; analagous to the ( operator.
TOY_OP_GROUPING_END 14 Finish operating on a group; analagous to the ) operator.
TOY_OP_SCOPE_BEGIN 15 Push an inner-scope onto the stack; analogous to the { operator.
TOY_OP_SCOPE_END 16 Push an inner-scope off of the stack; analogous to the } operator.
TOY_OP_TYPE_DECL 17 Not used.
TOY_OP_TYPE_DECL_LONG 18 Not used.
TOY_OP_VAR_DECL 19 Read one byte, retrieving that literal from the cache as identifier, and then read the next byte, retrieving that literal from the cache as type. Attempt** to declare a variable with identifier as its name and type as its type, then pop the top of the stack* as it’s value (cooerce integer values to floats if type specifies a float).
TOY_OP_VAR_DECL_LONG 20 Same as TOY_OP_VAR_DECL, except you read two bytes at a time.
TOY_OP_FN_DECL 21 Read one byte, retrieving that literal from the cache as identifier, and then read the next byte, retrieving that literal from the cache as function. Attempt** to declare a variable with identifier as its name and fn const as it’s type, then set function as its value. The function’s scope has the current scope as its ancestor.
TOY_OP_FN_DECL_LONG 22 Same as TOY_OP_FN_DECL, except you read two bytes at a time.
TOY_OP_VAR_ASSIGN 23 Pops rhs*, lhs from the stack. lhs must be an identifier. Attempts** to store the value of rhs, with lhs as it’s identifier.
TOY_OP_VAR_ADDITION_ASSIGN 24 Equivilent of TOY_OP_ADDITION followed by TOY_OP_VAR_ASSIGN.
TOY_OP_VAR_SUBTRACTION_ASSIGN 25 Equivilent of TOY_OP_SUBTRACTION followed by TOY_OP_VAR_ASSIGN.
TOY_OP_VAR_MULTIPLICATION_ASSIGN 26 Equivilent of TOY_OP_MULTIPLICATION followed by TOY_OP_VAR_ASSIGN.
TOY_OP_VAR_DIVISION_ASSIGN 27 Equivilent of TOY_OP_DIVISION followed by TOY_OP_VAR_ASSIGN.
TOY_OP_VAR_MODULO_ASSIGN 28 Equivilent of TOY_OP_MODULO followed by TOY_OP_VAR_ASSIGN.
TOY_OP_TYPE_CAST 29 Pops val*, type* from the stack. Attempts** to change the type of val to type, and push the result to the stack. Only works on bool, int, float and string types, and only if the value is in the correct format.
TOY_OP_TYPE_OF 30 Pops rhs from the stack. If rhs is an identifier, it determines the the type of the variable specified, otherwise it determines the type of rhs directly. The result is pushed to the stack.
TOY_OP_IMPORT 31 Pops alias and identifier from the stack. Attempts** to invoke a hook specified by identifier, with the given alias.
TOY_OP_EXPORT_removed 32 Removed.
TOY_OP_INDEX 33 Pops third*, second*, first* and compound* from the stack. Pushes the value specified by compound[first : second : third] onto the stack.
TOY_OP_INDEX_ASSIGN 34 Pops assign*, third*, second*, first* and compound* from the stack. Sets the value specified by compound[first : second : third] to be assign.
TOY_OP_INDEX_ASSIGN_INTERMEDIATE 35 Due to the unfortunately bonkers nature of indexing, this is needed as part of TOY_OP_INDEX_ASSIGN. It acts the same as TOY_OP_INDEX, but leaves an additional copy of some variables on the stack for TOY_OP_INDEX_ASSIGN to process.
TOY_OP_DOT 36  
TOY_OP_COMPARE_EQUAL 37  
TOY_OP_COMPARE_NOT_EQUAL 38  
TOY_OP_COMPARE_LESS 39  
TOY_OP_COMPARE_LESS_EQUAL 40  
TOY_OP_COMPARE_GREATER 41  
TOY_OP_COMPARE_GREATER_EQUAL 42  
TOY_OP_INVERT 43  
TOY_OP_AND 44  
TOY_OP_OR 45  
TOY_OP_JUMP 46  
TOY_OP_IF_FALSE_JUMP 47  
TOY_OP_FN_CALL 48  
TOY_OP_FN_RETURN 49  
TOY_OP_POP_STACK 50  
TOY_OP_TERNARY 51  
TOY_OP_FN_END 52  
TOY_OP_SECTION_END 255  
TOY_OP_PREFIX 256 Used internally.
TOY_OP_POSTFIX 257 Used internally.

*If this literal is an identifier, it is instead replaced with the correct given value from the current scope.
**On failure, the script will print an error message to the error output and exit.

Function Internal Structure

TODO: loose first argument, args & returns counters in the program space

Parser Structure and Operations

TODO

Compiler Structure and Operations

TODO

Interpreter Structure and Operations

The Toy interpreter is, at it’s core, just a big loop that reads bytes from memory and acts on them. Here, I’ll break down exactly how it works, from a top-down perspective.

Running the Interpreter

There are four main functions for running the interpreter:

First, init zeroes out the interpreter, sets up the printing functions, and delegates to reset, which in turn sets up the program’s scope (and injects the default global functions). The initialization function is split into two this way so that reset can be used independantly on a “dirty” interpreter to ready it for another script (or another run of the same script). reset is usually not needed and may be removed in future.

free simply frees the interpreter after execution.

Interestingly, run doesn’t jump straight into exection. Instead, it first does it’s own bit of setup, before reading out the bytecode’s header. If the header indicates an incompatible version, then the interpreter will refuse to run, to prevent mistakes from ruining the program.

run will also delegate to a function called readInterpreterSections(), which reads and reconstructs the “literalCache” - a collection of all values within the program (variable identifiers, variable values, function bytecode, etc.)

Next, run will pass to a function called execInterpreter(), which contains the program’s loop.

Finally, run will automatically free the bytecode and associated literalCache (this may change at some point).

Executing the Interpreter

Opcodes within the bytecode are 1 byte in length, and specify a single action to take. Each possible action is definied within the interpreter in a function that begins with exec, and are called from within a big looping switch statement. If any of these exec functions encounters an error, they can simply return false to break the loop.

The interpeter is stack-based; most, if not all of the actions are preformed on literals within a specially designated array called stack. for example:

	case TOY_OP_PRINT:
		if (!execPrint(interpreter)) {
			return;
		}
	break;

When a the opcode TOY_OP_PRINT is encountered, the top literal within the stack is popped off, and printed (more info on literals below).

static bool execPrint(Toy_Interpreter* interpreter) {
	//get the top literal
	Toy_Literal lit = Toy_popLiteralArray(&interpreter->stack);

	//if the top literal is an identifier, get it's value
	Toy_Literal idn = lit;
	if (TOY_IS_IDENTIFIER(lit) && Toy_parseIdentifierToValue(interpreter, &lit)) {
		Toy_freeLiteral(idn);
	}

	//print as a string to the current print method
	Toy_printLiteralCustom(lit, interpreter->printOutput);

	//free the literal
	Toy_freeLiteral(lit);

	//continue the loop
	return true;
}

Identity Crisis

As in most programming languages, variables can be represented by names specified by the programmer; in Toy, these are called “identifiers”. These identifiers can be passed around in place of their actual values, but can’t be used directly. To retrieve a value, you must first “parse” it, like so:

Toy_Literal idn = literal; //cache the literal, just in case it's an identifier
if (TOY_IS_IDENTIFIER(literal) && Toy_parseIdentifierToValue(interpreter, &literal)) { //if it is an identifier, parse it...
	Toy_freeLiteral(idn); //always remember to free the original identifier, otherwise you'll have a memory leak!
}

You will often see this pattern throughout the codebase.

Other Utility Functions

Other functions are available at the top of the interpreter source file:

Literals

TODO

Arrays & Dictionaries

TODO