Prototyping parallel functional intermediate languages

by

Andrew David Ben-Dyke

A thesis submitted to the Faculty of Science
of The University of Birmingham
for the degree of
DOCTOR OF PHILOSOPHY

School of Computer Science
Faculty of Science
The University of Birmingham
United Kingdom

October 1999
This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation.

Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder.
Abstract

Non-strict higher-order functional programming languages are elegant, concise, mathematically sound and contain few environment-specific features, making them obvious candidates for harnessing high-performance architectures. The validity of this approach has been established by a number of experimental compilers. However, while there have been a number of important theoretical developments in the field of parallel functional programming, implementations have been slow to materialise. The myriad design choices and demands of specific architectures lead to protracted development times. Furthermore, the resulting systems tend to be monolithic entities, and are difficult to extend and test, ultimately discouraging experimentation. The traditional solution to this problem is the use of a rapid prototyping framework.

However, as each existing system tends to prefer one specific platform and a particular way of expressing parallelism (including implicit specification), it is difficult to envisage a general-purpose framework. Fortunately, most of these systems have at least one point of commonality: the use of an intermediate form. Typically, these abstract representations explicitly identify all parallel components but without the background noise of syntactic and (potentially arbitrary) implementation details. To this end, this thesis outlines a framework for rapidly prototyping such intermediate languages. Based on the traditional three-phase compiler model, the design process is driven by the development of various semantic descriptions of the language. Executable versions of the specifications help to both debug and informally validate these models. A number of case studies, covering the spectrum of modern implementations, demonstrate the utility of the framework.
Acknowledgements

Firstly, I thank my supervisor, Dr Tom Axford, for his assistance throughout this research project. I also acknowledge the members of my thesis group, Dr Lydia Kronsjö, and Dr Marta Kwiatkowska, for their helpful and timely advice. Furthermore, I acknowledge my examiners, Dr Antoni Diller of the University of Birmingham, and Dr Mike Joy of the University of Warwick, for their advice, in accordance with which this thesis has been modified. I am also grateful to my fellow research students for providing a relaxed and friendly working environment – in addition, Dr Howard Goodman must take the credit for the improvement in my grammar, punctuation, and general writing style. Finally, I would like to thank all of my family and friends for their patience and understanding, though special recognition must go to Claire for her dedicated support of my academic pursuits.
## Contents

1 Introduction ................................................................. 1
   1.1 Motivation ................................................................. 1
   1.2 Overview ................................................................. 2
      1.2.1 Background ......................................................... 2
      1.2.2 Prototyping parallel functional intermediate languages.................. 2
      1.2.3 The sequential STG' language ....................................... 2
      1.2.4 Expressing parallelism – static models .................................... 2
      1.2.5 Managing parallelism – operational models ................................. 3
      1.2.6 Simulating the target architecture ............................................ 3
      1.2.7 Compilation rules ..................................................... 3
      1.2.8 Prototyping parallel functional intermediate languages................. 3
      1.2.9 Summary, evaluation, and further work ....................................... 3

2 Background ................................................................. 4
   2.1 Introduction ................................................................. 4
   2.2 Parallel processing ....................................................... 4
      2.2.1 Architectural taxonomies ............................................... 4
      2.2.2 Amdahl's law and the corollary of modest potential ....................... 6
   2.3 Architecture independence through functional programming ................. 6
      2.3.1 The software crisis and parallel languages .................................... 6
      2.3.2 Can programming be liberated from the von Neumann style? ............... 7
      2.3.3 Graph reduction ....................................................... 9
      2.3.4 Parallel functional programming: an introduction ......................... 10
   2.4 User-level annotations and expressions ........................................ 11
      2.4.1 Implicit specification ............................................... 11
      2.4.2 Bulk data types ..................................................... 12
      2.4.3 Skeletal parallelism ............................................... 13
      2.4.4 Low-level annotations ............................................... 15
   2.5 Prototyping parallel functional languages ...................................... 18
      2.5.1 The problem with performance comparisons .................................... 19
      2.5.2 Is a standard benchmark suite the solution? ................................... 19
      2.5.3 Existing approaches to developing functional implementations .......... 22
   2.6 Summary ................................................................... 23
5 Expressing parallelism – static models
5.1 Introduction ........................................... 62
5.2 Introducing parallelism into the STG' language ........ 62
  5.2.1 New production rules ...................... 64
  5.2.2 New primitive functions .............. 66
  5.2.3 New primitive types ...................... 66
  5.2.4 Altering existing expressions ........ 68
  5.2.5 Hybrid definitions ...................... 71
5.3 Language restrictions revisited .................. 71
  5.3.1 Syntactic, algorithmic, and informal restrictions .... 71
  5.3.2 Type-inference rules ...................... 72
  5.3.3 Free variables .............................. 74
5.4 Denotational semantics and parallel languages .... 74
  5.4.1 Order of evaluation ...................... 75
  5.4.2 Degree of evaluation ...................... 75
  5.4.3 Speculative evaluation and non-termination .... 76
  5.4.4 Non-determinism ......................... 76
  5.4.5 Run-time errors and exception handling .... 78
  5.4.6 A selection of bottoms .................. 78
5.5 Summary ............................................ 79

6 Managing parallelism – operational models
6.1 Introduction ........................................... 80
6.2 Parallelism and the STG machine ............... 80
  6.2.1 One abstract machine or many? .......... 80
  6.2.2 Abstractions of time ...................... 83
  6.2.3 Inter-processor synchronisation ......... 86
  6.2.4 Shared memory .............................. 87
  6.2.5 Message-passing architectures .......... 89
6.3 Operational semantics and the STG machine .... 91
  6.3.1 The evaluation mechanism ............... 92
  6.3.2 Communication and synchronisation .... 96
  6.3.3 Resource management ..................... 101
  6.3.4 Partitioning and naming ................. 109
6.4 Modifying the STG machine ...................... 112
  6.4.1 New production rules ..................... 112
  6.4.2 New primitive types ..................... 113
  6.4.3 Supporting the new state-transition rules .... 114
6.5 Animation and testing ................................ 118
  6.5.1 The processor framework ................. 119
  6.5.2 An example animation: the ping-pong system .. 120
  6.5.3 Verification and testing .................. 121
  6.5.4 Interactive animation .................... 123
  6.5.5 Batch-mode animation ................... 124
6.6 Summary ............................................ 127
I  Example RISC programs  263
  I.1  Prelude operations ................................................................. 263
  I.1.1  Integers ................................................................. 263
  I.1.2  Booleans ................................................................. 268
  I.1.3  Lists ................................................................. 270
  I.2  Generating Fibonacci numbers ............................................. 277
  I.3  Generating prime numbers – the sieve of Eratosthenes ............. 282
  I.4  Updating algebraic constructors ............................................. 293
  I.5  Updating partial applications ................................................. 294

References  297

## List of figures

<table>
<thead>
<tr>
<th>Figure</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>2.1</td>
<td>Four examples of parallel architectures</td>
<td>5</td>
</tr>
<tr>
<td>2.2</td>
<td>Operational template of $P^3L$'s dedicated-farm skeleton</td>
<td>15</td>
</tr>
<tr>
<td>3.1</td>
<td>An overview of the prototyping framework</td>
<td>26</td>
</tr>
<tr>
<td>4.1</td>
<td>Abstract syntax of the STG' language</td>
<td>32</td>
</tr>
<tr>
<td>4.2</td>
<td>Abstract syntax of types</td>
<td>39</td>
</tr>
<tr>
<td>4.3</td>
<td>The simplified PROGRAM type rule</td>
<td>40</td>
</tr>
<tr>
<td>4.4</td>
<td>Example type signatures of primitive functions</td>
<td>40</td>
</tr>
<tr>
<td>4.5</td>
<td>The CONDECL type rule</td>
<td>40</td>
</tr>
<tr>
<td>4.6</td>
<td>The BINDS type rule</td>
<td>41</td>
</tr>
<tr>
<td>4.7</td>
<td>The CONS-EXP type rule</td>
<td>41</td>
</tr>
<tr>
<td>4.8</td>
<td>Denotational semantics of STG' programs and bindings</td>
<td>45</td>
</tr>
<tr>
<td>4.9</td>
<td>Denotational semantics of STG' expressions, defaults and atoms</td>
<td>46</td>
</tr>
<tr>
<td>4.10</td>
<td>Denotational semantics of STG' case alternatives</td>
<td>47</td>
</tr>
<tr>
<td>4.11</td>
<td>The relationship between the STG-machine rules and the code component</td>
<td>52</td>
</tr>
<tr>
<td>4.12</td>
<td>The STG-machine rule for evaluating letstrict expressions</td>
<td>53</td>
</tr>
<tr>
<td>4.13</td>
<td>The STG-machine rule for returning to a letstrict continuation</td>
<td>53</td>
</tr>
<tr>
<td>4.14</td>
<td>The STG-machine rule for returning to a let# continuation</td>
<td>53</td>
</tr>
<tr>
<td>4.15</td>
<td>An example STG' program</td>
<td>54</td>
</tr>
<tr>
<td>5.1</td>
<td>An extended unification algorithm for Hill's PODs</td>
<td>67</td>
</tr>
<tr>
<td>5.2</td>
<td>Hill's extended syntax for a data-parallel STG language</td>
<td>68</td>
</tr>
<tr>
<td>5.3</td>
<td>The PID# type for restricting access to non-deterministic topology functions</td>
<td>68</td>
</tr>
<tr>
<td>5.4</td>
<td>Improving pipeline parallelism using a new boxed type</td>
<td>69</td>
</tr>
<tr>
<td>5.5</td>
<td>A valuation function for strict function application</td>
<td>71</td>
</tr>
<tr>
<td>5.6</td>
<td>The ALG-PALTS and ALG-PALT type rules for PODs</td>
<td>74</td>
</tr>
<tr>
<td>6.1</td>
<td>A simple processor framework for the parallel STG machine</td>
<td>81</td>
</tr>
<tr>
<td>6.2</td>
<td>Transition rules for a simple ping-pong system</td>
<td>82</td>
</tr>
<tr>
<td>6.3</td>
<td>The state-transition diagram for the ping-pong system</td>
<td>82</td>
</tr>
<tr>
<td>6.4</td>
<td>Explicitly modelling time in the processor framework</td>
<td>83</td>
</tr>
<tr>
<td>6.5</td>
<td>Transition rules for a time-aware ping-pong system</td>
<td>84</td>
</tr>
<tr>
<td>6.6</td>
<td>Time costs for the ping-pong system</td>
<td>85</td>
</tr>
<tr>
<td>6.7</td>
<td>State transitions for the time-aware ping-pong system</td>
<td>85</td>
</tr>
<tr>
<td>6.8</td>
<td>Incorporating processor states into the processor framework</td>
<td>86</td>
</tr>
<tr>
<td>6.9</td>
<td>Adding time outs to the ping-pong system</td>
<td>88</td>
</tr>
<tr>
<td>6.10</td>
<td>Specifying the order of evaluation of the let# construct</td>
<td>93</td>
</tr>
<tr>
<td>6.11</td>
<td>Modifying the STG machine to allow the forcing of arbitrary boxed expressions</td>
<td>94</td>
</tr>
<tr>
<td>6.12</td>
<td>Argument passing using heap-allocated application frames</td>
<td>95</td>
</tr>
<tr>
<td>9.29</td>
<td>Layout of the GUM info tables for a standard closure</td>
<td>180</td>
</tr>
<tr>
<td>9.30</td>
<td>The RISC implementation of the pack method for re-entrant closures</td>
<td>180</td>
</tr>
<tr>
<td>9.31</td>
<td>The parallel STG' fib2 -O benchmark</td>
<td>182</td>
</tr>
<tr>
<td>9.32</td>
<td>Relative speedups for the conservative fib -O benchmark</td>
<td>183</td>
</tr>
<tr>
<td>9.33</td>
<td>Relative speedups for the conservative fib2 -O benchmark</td>
<td>183</td>
</tr>
<tr>
<td>9.34</td>
<td>The impact of message latency on the fib2 -O benchmark</td>
<td>184</td>
</tr>
<tr>
<td>9.35</td>
<td>Total messages sent during the fib2 15</td>
<td>184</td>
</tr>
<tr>
<td>9.36</td>
<td>Communication costs and GUM load-balancing messages</td>
<td>185</td>
</tr>
<tr>
<td>9.37</td>
<td>Speedups for the queens and queens2 -O benchmarks</td>
<td>186</td>
</tr>
<tr>
<td>9.38</td>
<td>The leton expression: abstract syntax, free variables, and type inference</td>
<td>189</td>
</tr>
<tr>
<td>9.39</td>
<td>The self expression: abstract syntax, free variables, and type inference</td>
<td>189</td>
</tr>
<tr>
<td>9.40</td>
<td>The execution tree for fib 5</td>
<td>191</td>
</tr>
<tr>
<td>9.41</td>
<td>The execution tree for queens 2</td>
<td>192</td>
</tr>
<tr>
<td>9.42</td>
<td>Recursive execution-tree domain equations</td>
<td>192</td>
</tr>
<tr>
<td>9.43</td>
<td>Execution-tree semantics of para-functional STG' programs and bindings</td>
<td>193</td>
</tr>
<tr>
<td>9.44</td>
<td>Execution-tree semantics of para-functional STG' expressions</td>
<td>194</td>
</tr>
<tr>
<td>9.45</td>
<td>Execution-tree semantics of para-functional STG' case alternatives</td>
<td>195</td>
</tr>
<tr>
<td>9.46</td>
<td>Initial operational rules for the para-functional STG' language</td>
<td>195</td>
</tr>
<tr>
<td>9.47</td>
<td>Operational rules for supporting virtual topologies</td>
<td>197</td>
</tr>
<tr>
<td>9.48</td>
<td>The para-functional STG' sfib3 -O benchmark</td>
<td>198</td>
</tr>
<tr>
<td>9.49</td>
<td>Speedup curves for the para-functional fib benchmarks</td>
<td>199</td>
</tr>
<tr>
<td>9.50</td>
<td>A para-functional STG' replacement for the dc skeleton</td>
<td>201</td>
</tr>
<tr>
<td>9.51</td>
<td>Denotational semantics of the farm skeleton</td>
<td>201</td>
</tr>
</tbody>
</table>
## List of tables

<table>
<thead>
<tr>
<th>Table</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>2.1</td>
<td>Some examples of collection-oriented operations</td>
<td>12</td>
</tr>
<tr>
<td>2.2</td>
<td>Paralation Lisp’s data-parallel constructs</td>
<td>13</td>
</tr>
<tr>
<td>2.3</td>
<td>Concurrent Clean’s topology functions</td>
<td>17</td>
</tr>
<tr>
<td>2.4</td>
<td>Comparing implementations of parallel functional languages – language issues</td>
<td>20</td>
</tr>
<tr>
<td>2.5</td>
<td>Comparing implementations of parallel functional languages – benchmarks</td>
<td>21</td>
</tr>
<tr>
<td>4.1</td>
<td>The operational reading of STG' language expressions</td>
<td>32</td>
</tr>
<tr>
<td>4.2</td>
<td>Summary of the environments used during type inference</td>
<td>39</td>
</tr>
<tr>
<td>4.3</td>
<td>The meta-language of the denotational semantics</td>
<td>45</td>
</tr>
<tr>
<td>4.4</td>
<td>Example state components</td>
<td>50</td>
</tr>
<tr>
<td>4.5</td>
<td>The state components of the STG machine</td>
<td>52</td>
</tr>
<tr>
<td>4.6</td>
<td>The $fib$ benchmark results</td>
<td>60</td>
</tr>
<tr>
<td>4.7</td>
<td>The $primes$ benchmark results</td>
<td>60</td>
</tr>
<tr>
<td>4.8</td>
<td>The $queens$ benchmark results</td>
<td>61</td>
</tr>
<tr>
<td>4.9</td>
<td>The $hamming$ benchmark results</td>
<td>61</td>
</tr>
<tr>
<td>5.1</td>
<td>MacLennan’s language design principles</td>
<td>63</td>
</tr>
<tr>
<td>5.2</td>
<td>Extending production-rule groups</td>
<td>65</td>
</tr>
<tr>
<td>5.3</td>
<td>A selection of type rules for parallel constructs</td>
<td>73</td>
</tr>
<tr>
<td>6.1</td>
<td>The relationship between the abstract syntax and the STG-machine rules</td>
<td>113</td>
</tr>
<tr>
<td>7.1</td>
<td>State components of the RISC uniprocessor</td>
<td>131</td>
</tr>
<tr>
<td>7.2</td>
<td>A selection of RISC instructions</td>
<td>132</td>
</tr>
<tr>
<td>7.3</td>
<td>The hybrid architecture’s message-passing interface</td>
<td>134</td>
</tr>
<tr>
<td>8.1</td>
<td>The state components of the compiler framework</td>
<td>138</td>
</tr>
<tr>
<td>8.2</td>
<td>The code component of the compilation state-transition system</td>
<td>139</td>
</tr>
<tr>
<td>8.3</td>
<td>RISC-instruction counts for the unoptimised benchmarks</td>
<td>143</td>
</tr>
<tr>
<td>8.4</td>
<td>RISC-instruction counts for the optimised benchmarks</td>
<td>143</td>
</tr>
<tr>
<td>8.5</td>
<td>Comparing STG machine reductions and RISC instructions</td>
<td>144</td>
</tr>
<tr>
<td>9.1</td>
<td>State components of a thread-management system</td>
<td>148</td>
</tr>
<tr>
<td>9.2</td>
<td>Overview of the STG rules for Mattson’s speculative evaluation engine</td>
<td>149</td>
</tr>
<tr>
<td>9.3</td>
<td>The register map for compiling speculative expressions</td>
<td>153</td>
</tr>
<tr>
<td>9.4</td>
<td>State components of a message-passing system</td>
<td>162</td>
</tr>
<tr>
<td>9.5</td>
<td>Overview of the GUM STG rules</td>
<td>163</td>
</tr>
<tr>
<td>9.6</td>
<td>Messages used by GUM</td>
<td>165</td>
</tr>
<tr>
<td>9.7</td>
<td>State components of GUM’s work pool</td>
<td>167</td>
</tr>
<tr>
<td>9.8</td>
<td>Representing remote references with GUM</td>
<td>173</td>
</tr>
<tr>
<td>9.9</td>
<td>The register map for compiling GUM expressions</td>
<td>178</td>
</tr>
<tr>
<td>9.10</td>
<td>State components for supporting virtual topologies</td>
<td>196</td>
</tr>
<tr>
<td>C.1</td>
<td>The real subset of the nofib benchmark suite</td>
<td>227</td>
</tr>
</tbody>
</table>
Glossary

This glossary is not intended to be exhaustive, and only includes entries for the most important and frequently occurring items. These descriptions are based on a number of sources, including the references cited in the text as well as more general resources, such as the Free On-line Dictionary of Computing [Howe, 1993].

abstract interpretation the execution of an abstract version of a program to deduce information about the program.

abstract machine a stylised processor design for executing an abstract machine code (which is usually the intermediate language of a compilation system) i.e. “a formal interpreter for the language which runs on a hypothetical machine” [Hennessy, 1990, page 114].

aggressive take a property of an abstract machine, whereby all of a function’s arguments (after all pending updates have been performed) have to be present if evaluation is to proceed. As noted by Beemster [1994], the STG machine [Peyton Jones, 1992, rules 17 and 17a, section 5.6] is an example of this type of system.

algebraic data type a sum-of-product type using constructors to differentiate between each possible product [Bird and Wadler, 1988, pages 204–219]. Recursive and mutually recursive data types are permitted.

animation the process of making specifications executable for the purpose of experimentation and informal validation.

API (application-program interface) describes the formal interface through which user code can access a library’s functionality. Typical details will include the argument-passing convention, and each method’s input and output parameters. Additional information may include a list of possible side-effects and/or error returns.

boxed value any value which is indirectly referenced via an address pointer [Peyton Jones and Launchbury, 1991].

closure an operational structure used to represent a lambda expression, including an environment of its free-variable bindings [Peyton Jones, 1987, section 21.5, page 378].

constant applicative form (CAF) a top-level definition that may need to be updated during the lifetime of the evaluation. A typical STG' CAF would be nine = u → + 4 5 [Peyton Jones, 1987, section 13.2, page 224].

constructor a tag used to uniquely identify a product type of an algebraic data type [Peyton Jones, 1987, section 4.1, page 52].

continuation an instruction sequence, or function, that may be invoked as the final step of the current computation, and which represents “what to do next”. In a physical implementation, a continuation is usually represented by a return address [Peyton Jones, 1987, sections 5.4 and 9.4].

continuation-passing style is a program notation that makes every aspect of control flow and data flow explicit [Appel, 1992, page 2]. All user-defined functions take a continuation as an argument, and apply it to their result in order to effect a return to the main computation.

denotational semantics a set-based syntax-driven valuation function which maps a program directly to its meaning, or denotation [Stoy, 1977; Schmidt, 1986].

domain a set of values over which an ordering relation is defined, or, more specifically, the Scott domain [Burn, 1991, definition 2.2.21].

DMMP (distributed memory, message passing) – a traditional message-passing multiprocessor [Johnson, 1988].

evaluation transformer an identity function which has the operational side effect of forcing the evaluation of an expression beyond head normal form [Burn, 1991, chapter 5].

exception “an error, unusual condition, or external signal, that may set a status bit and may or may not cause an interrupt, depending upon whether or not the corresponding interrupt is enabled” [May, Silha, Simpson and Warren, 1994, section 1.3.1, page 368] and, typically, invokes a specialised handler to deal with the error.

free variable a variable referred to in an expression, but not bound by a local definition [Peyton Jones, 1987, section 2.2, page 14].

functional (programming) language any declarative, side-effect free language whose programs are sets of recursive function definitions.

garbage collection is “the automatic reclamation of computer storage” [Wilson, 1992, page 1]. This is achieved by disposing of any heap-allocated object which can no longer be reached by the running program (see also root set).

Glasgow Haskell compiler one of the three main Haskell compilers, based on the STG language and STG-machine technology [Peyton Jones, Hall, Hammond, Partain and Wadler, 1993].

graph reduction a technique for evaluating non-strict functional programming languages which uses sharing to minimise the duplication of work [Wadsworth, 1971, chapter 4].

GMSV (global memory, shared variables) – a traditional shared memory multiprocessor [Johnson, 1988].

Haskell a non-strict, purely functional language whose features include support for higher-order functions, type classes and static, polymorphic typing, user-defined data types, functional I/O, and pattern matching [Hudak, Peyton Jones, Wadler and others, 1992].

higher-order functions “functions are treated as first-class values in a language – allowing them to be stored in data structures, passed as arguments and returned as results” [Hudak, 1989, section 2.1, pages 382–383].

Hindley–Milner type-inference algorithm the classic approach to polymorphic type checking in a functional programming system [Milner, 1978].

intermediate language any language that is used as a temporary representation during the compilation of a source language to a target language.

interpreter “a piece of software that directly executes a source program” [Watson, 1989].

metalanguage “a language used to define another language” [Watson, 1989, section 1.4.2, page 14].

MIMD (multiple instruction, multiple data) – most commercial multiprocessors and collections of workstations fall into this architectural category [Flynn, 1972].

non-determinism a property of a computation which may (arbitrarily) return different results [Stoy, 1977, page 201].

non-strictness a property of an evaluation strategy such that an expression is only evaluated when its value is actually needed (normal-order reduction [Peyton Jones, 1987, section 2.3, page 25]).

powerdomain each element of a powerdomain is a set of elements of the domain from which it was formed. Powerdomains can be used in a denotational semantics to model non-determinism [Stoy, 1977, page 201].

primitive function built-in routines similar to the lambda-calculus δ-rules, and the only way to perform computations on unboxed values.

prototyping “is the process of constructing software for the purpose of obtaining information about the adequacy and appropriateness of the designers’ conception of a software product” [Balzer, Gabriel, Belz, Dewar, Fisher and others, 1988, page 8].

referential transparency the ability to replace any sub-expression by another possessing the same value without changing the final value of the mathematical expression [Bird and Wadler, 1988, page 2]. This is often summarised as: “equals can be replaced by equals” [Hudak, 1989, page 362].

RISC (Reduced Instruction Set Computer) machine the salient features of this class of processor include [Kane and Heinrich, 1992, chapter 1, pages 1–22]: one instruction completed per cycle; simple addressing modes and instruction formats; sufficient on-chip memory (registers and cache) to overcome the processor/memory bottleneck; and a reliance on optimising compilers to obtain the best possible performance.

root set a list of the heap addresses which are live in the local state [Wilson, 1992, section 1.2]. Using this as the main input, it must be possible for the garbage collector to identify all of the live closures of the entire system.

SIMD (single instruction, multiple data) – vector/array processors, often referred to as data-parallel machines [Flynn, 1972].

sparking the creation of a new thread to reduce an expression [Clack and Peyton Jones, 1986, section 2.1].

speculative evaluation an approach to increasing available parallelism by sparking threads to reduce non-essential expressions [Mattson Jr., 1993a, chapter 3, page 39].

STG (Shared Term Graph) language is the abstract machine code of the STG machine, and can be viewed as “a very austere purely-functional language” [Peyton Jones, 1992, section 4].

STG' language a variant of the STG language which serves as the foundation for the prototyping system. The sequential semantics of the language is presented in chapter 4.

STG machine (Spineless Tagless G-machine) is an abstract machine designed to support non-strict higher-order functional languages [Peyton Jones and Salkild, 1989; Peyton Jones, 1992].

syntactic sugar any syntactic construct added solely for the purpose of improving programmability. The removal of these expressions is known as de-sugaring.

syntax driven a property of a language processor which takes its structure directly from the abstract syntax.

thread an independent process which computes the value of one expression and then terminates [Peyton Jones, 1989, evaluate-and-die model, page 178] (see also sparking).

thunk a closure which represents an expression not in head-normal form [Peyton Jones, 1992, section 3.1].

ticky-ticky profiling a feature of GHC, whereby the run-time system records the number of updates, the number of constructors entered etc. The system is so named because “that’s the sound a Sun4 makes when it is running up all those counters (slowly)” [AQUA Team, 1993, section 9, page 36].

time-out “A period of time after which an error condition is raised if some event has not occurred. A common example is sending a message. If the receiver does not acknowledge the message within some preset time-out period, a transmission error is assumed to have occurred.” [Howe, 1993]

type inference the process of deducing a program’s type attributes from its syntax, as typified by the Hindley–Milner system [Milner, 1978].

unboxed value any value which can be represented using a machine literal, including, for example, 32-bit integers and 64-bit floating point numbers [Peyton Jones and Launchbury, 1991].
Chapter 1

Introduction

1.1 Motivation

Non-strict higher-order functional programming languages are elegant, concise, mathematically sound and contain few environment-specific features. Furthermore, current implementations of functional programming languages generate sequential code of a comparable efficiency to that of their imperative rivals. This combination suggests the possibility of architecture-independent parallel programming, and the validity of this approach has been established by a number of experimental compilers [Hill, 1994; Chakravarty, 1994; Hammond, Mattson Jr. and Peyton Jones, 1994; Hudak, 1991]. However, a would-be designer of a parallel functional system is faced with three major obstacles:

1. due to the large number of dimensions involved and to the lack of a common benchmarking system, it is extremely difficult to determine which components are central to the performance of the system.

2. having selected and integrated the components, the cost of developing an efficient implementation for just one platform is considerable.

3. once the base implementation is complete, experimentation with any but the most trivial of subsystems may require significant effort.

What is needed is a system to rapidly develop and test ideas before committing to a full-scale implementation. However, as each existing implementation tends to prefer one specific platform and a particular way of expressing parallelism (including implicit specification), it is difficult to envisage a general-purpose framework.

Fortunately, most of these systems have at least one point of commonality: the use of an intermediate form [Peyton Jones, 1987]. Typically, these abstract representations explicitly identify all parallel components but without the background noise of syntactic and (potentially arbitrary) implementation details. To this end, this thesis outlines a framework for rapidly prototyping such intermediate languages, split into three stages:

**language specification** the language is specified in terms of its syntax, type rules [Milner, 1978] and denotational semantics [Stoy, 1977]. This provides the reference model against which to test the output of the subsequent stages.

**parser construction** a number of parallel languages have been outlined [Hill, 1994; Hudak, 1991; Kelly, 1989; Burton, 1984] and so it is important to verify that the proposed abstract form can act as a suitable target (for as large a subset of these as possible).

**compilation rules** to provide a degree of architecture independence to the source languages, the code generator must produce efficient output for a variety of diverse architectures. This stage is driven by the development of an operational semantics for the intermediate language.

The design process is driven by the development of semantic models of the stages, and these are used primarily to validate and motivate the parse and compilation rules. To improve confidence in the models themselves, executable versions of the specifications, written in the functional programming language Haskell, are constructed. While some parts of the framework could be automated, it is worth stating that we have made no attempt to develop an automatic system of the sort typified by CERES [Tofte, 1990].

1.2 Overview

1.2.1 Background

By critically examining the relevant literature, chapter two motivates the central work of this thesis: the design of a prototyping framework for parallel intermediate languages.

1.2.2 Prototyping parallel functional intermediate languages

Chapter three describes an approach to the design of an explicitly parallel intermediate language for use during the compilation of non-strict higher-order functional programming languages. The framework is based upon the development of both a denotational and operational model for the intermediate language, which are then used to produce specifications for the parser and code generator. Haskell animations of these components aid with both debugging and informal validation. (For a more detailed and example-driven description of the prototyping system see [Ben-Dyke and Axford, 1995], which is reproduced in appendix A.)

1.2.3 The sequential STG' language

Chapter four describes the STG' language, a variant of the Shared Term Graph (STG) language, both in terms of its abstract and concrete syntax, and denotational semantics. A Hindley–Milner style type-inference algorithm is also presented, which serves to restrict the language and produces information useful to a compilation system.

1.2.4 Expressing parallelism – static models

In chapter five a number of guidelines are presented for adding support for parallelism into the sequential STG' language, as described in chapter 4. Typically, this involves extending the abstract syntax, adding language restrictions, and developing a denotational model of the parallel components. The examples used to motivate each of the steps are, where possible, based on the constructs presented in section 2.4. While the issues of language design are not directly addressed, MacLennan’s principles [1987, page 547] serve as a useful guide, and are thus reproduced in table 5.1.
1.2.5 Managing parallelism – operational models

Chapter six discusses the development of an operational description to augment the denotational semantics of the parallel STG' language (see chapter 5). The STG machine provides the basic recipe, into which the parallel ingredients, including threads, messages, and shared memory, are added. To facilitate testing and debugging, the animation of the model, which is essentially a state-transition system, is also considered. The final description is then used by chapter 8 to provide the foundation upon which the compilation system is built.

1.2.6 Simulating the target architecture

Chapter seven describes the simulator used to test and debug the output of the STG' compiler (see chapter 8). A RISC-like instruction set, based on the DEC Alpha processor family, serves as the interface between the two systems. The simulator is interpretive and is specified using the state-transition notation presented in chapter 6. While overall performance is relatively poor, the extensible nature of the state-transition model is more important for this particular application.

1.2.7 Compilation rules

Chapter eight describes how the state-transition model can be used to model a compilation system. Particular emphasis is placed on encoding important optimisations, including register allocation, closure layout, and dead-code elimination. The validity of this approach is demonstrated by developing a compilation system for a subset of the sequential STG' language.

1.2.8 Prototyping parallel functional intermediate languages

In chapter nine the use of the prototyping framework is illustrated by four case studies. Each of the studies is based upon an existing well-known system, and, between them, they include examples of the main programming abstractions used in modern parallel functional programming and cover both message-passing and shared-memory architectures. The first study is based upon shared-memory Haskell, and considers the introduction of parallel threads into the STG' language. This provides a simple overview of the methodology, and serves as a foundation upon which the other case studies build. The second moves on to consider GUM Haskell [Trinder et al., 1996]. While the static semantics are very similar to those of the first case study, the operational model is far more complex, and demonstrates how message passing can be modelled by a state-transition system. The third investigates the data placement primitives of para-functional Haskell – this proves interesting both in terms of the denotational and operational models. Skeletal parallelism is the subject of the final case study, dealing with farms, pipes and divide-and-conquer skeletons.

1.2.9 Summary, evaluation, and further work

Chapter ten concludes the thesis by re-stating the main contributions of the work, and attempting to evaluate the prototyping framework. Finally, there is a discussion of the limitations of the proposed approach, and possible areas for future research are examined.
Chapter 2

Background

2.1 Introduction

By critically examining the relevant literature, this chapter motivates the central work of this thesis: the design of a prototyping framework for parallel intermediate languages.

Section 2.2 introduces the field of parallel processing, while section 2.3 deals with the specifics of parallel functional programming. This leads on to a review, in section 2.4, of the idioms used by explicitly-parallel functional programming languages. Section 2.5 reviews the existing approaches to prototyping, and the chapter is then summarised in section 2.6.

2.2 Parallel processing

2.2.1 Architectural taxonomies

The term parallel can be used to describe a wide range of architectures, with the only common denominator being the use of more than one processing element. For the purpose of this thesis, such systems are assumed to comprise many similar processors, connected by a reliable communication mechanism, co-operating to solve a single task or problem. Many different styles of parallel computers have been developed, and figure 2.1 shows a number of common configurations (the blocks represent processors, memory, communication networks, or host processors). Indeed, there is sufficient variety [Duncan, 1990] that a number of different taxonomies have been developed. Flynn [1972] categorised machines based upon the number of instruction and data streams:

**SISD** (single instruction, single data) – the classic von Neumann architecture, encompassing most modern uniprocessors. Examples include the DEC Alpha AXP architecture [Sites, 1992], the SPARC family [Sun Microsystems, 1988], and the PowerPC [May et al., 1994].

**SIMD** (single instruction, multiple data) – vector/array processors (often referred to as data-parallel machines). Examples include the AMT DAP, Thinking Machines' CM-200, and the MasPar MP-1 (all of which have been surveyed by MacDonald [1992]).

**MISD** (multiple instruction, single data) – no practical examples of this class exist.
Figure 2.1: Four examples of parallel architectures: (a) vector processor (SIMD); (b) classic shared memory (GMSV); (c) loosely-coupled message passing (DMMP); (d) constant-valence message passing (DMMP, but with the memory components not shown)

**MIMD** (multiple instruction, multiple data) – most commercial multiprocessors and collections of workstations fall into this category. Examples include both shared-memory machines, such as the KSR and Sequent Symmetry, and message-passing systems, including Thinking Machines’ CM-5, the NCUBE range, and systems based on the Inmos Transputer. (Oren and Ramanathan [1993] include an overview of each of these machines in their survey paper.)

Johnson [1988, figure 1, page 45] noted that the last category, MIMD, was too coarse, and divided it into the following classes:

<table>
<thead>
<tr>
<th></th>
<th>shared variables</th>
<th>message passing</th>
</tr>
</thead>
<tbody>
<tr>
<td>global memory</td>
<td>GMSV</td>
<td>GMMP</td>
</tr>
<tr>
<td>distributed memory</td>
<td>DMSV</td>
<td>DMMP</td>
</tr>
</tbody>
</table>
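Johnson's four classes are simply the product of two independent choices, which can be sketched as a pair of Haskell enumerations. The type and function names below are illustrative assumptions and do not come from Johnson's paper.

```haskell
-- Illustrative sketch: Johnson's MIMD classes as the product of two choices.
-- All type and function names here are assumptions made for this example.
data Memory = GlobalMemory | DistributedMemory
  deriving (Eq, Show, Enum, Bounded)

data Communication = SharedVariables | MessagePassing
  deriving (Eq, Show, Enum, Bounded)

-- | Four-letter class name, e.g. "DMMP" for distributed memory, message passing.
className :: Memory -> Communication -> String
className m c = memTag m ++ commTag c
  where
    memTag GlobalMemory      = "GM"
    memTag DistributedMemory = "DM"
    commTag SharedVariables  = "SV"
    commTag MessagePassing   = "MP"

-- | All four classes, enumerated row by row.
allClasses :: [String]
allClasses = [className m c | m <- [minBound ..], c <- [minBound ..]]
```

Loading this in an interpreter and evaluating `allClasses` lists the four table entries in row order.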

One failing common to both of these taxonomies, however, is that they convey no information with regard to a machine’s “size”. The Erlangen classification system, developed by Händler [1982], uses the triple \( (K, D, W) \) as a representation, where \( K \) is the number of processors, \( D \) is the number of ALUs (Arithmetic Logic Units), and \( W \) is the word length of each ALU. If pipelining is used, the notation is extended to \( (K \times K', D \times D', W \times W') \), where the multipliers are the pipeline depths (macro-, instruction- and arithmetic-pipelining respectively). The system also allows representations to be combined using the following operators:

\( + \) indicates the existence of more than one structure that operates independently in parallel.

\( * \) indicates the existence of sequentially ordered structures where all data is processed through all structures.

\( \triangledown \) indicates that a certain system may have multiple configurations.

Skillicorn [1988] extended this idea to include descriptions of the interconnection topology (including both processor-to-processor and processor-to-memory networks). Indeed,
Bönniger, Esser and Krekel [1993] also traded conciseness for accuracy by increasing the
number of items to 350 (split across 14 groups). Schlesinger and Kuehn [1993] have taken
this concept to its natural limit by developing an architecture-description language. Another
approach is adopted by Culler et al. [1993], whose LogP model uses the performance
characteristics of the communication mechanism as the primary attributes:

\( L \) – an upper bound on the latency involved with communicating a word-length message from source to destination.

\( o \) – the overhead attributed to the transmission or reception of each message (during which time a processor cannot engage in other activities).

\( g \) – the minimum gap allowed between consecutive message transmissions or receptions. The reciprocal gives the per-processor bandwidth.

\( P \) – the number of processors.
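To show how the four parameters combine into simple cost estimates, the following Haskell sketch records them in a data type and derives the cost of a single small message and of a request/reply exchange. The record and function names are assumptions made for this illustration, and contention and message size are ignored.

```haskell
-- A sketch of the LogP parameters as a record; field names are assumptions.
data LogP = LogP
  { latency    :: Double  -- L: upper bound on network latency for a small message
  , overhead   :: Double  -- o: processor time spent sending or receiving a message
  , gap        :: Double  -- g: minimum interval between consecutive messages
  , processors :: Int     -- P: number of processors
  }

-- | One small message: send overhead, network latency, receive overhead.
pointToPoint :: LogP -> Double
pointToPoint m = overhead m + latency m + overhead m

-- | A remote read modelled as a request followed by a reply.
remoteRead :: LogP -> Double
remoteRead m = 2 * pointToPoint m

-- | Per-processor bandwidth in messages per time unit (reciprocal of g).
bandwidth :: LogP -> Double
bandwidth m = 1 / gap m
```

For a hypothetical machine with \( L = 6 \), \( o = 2 \) and \( g = 4 \) time units, a single message costs 10 units and a remote read costs 20.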

2.2.2 Amdahl’s law and the corollary of modest potential

While intuition would suggest that \( n \) co-operating processors should be \( n \)-times faster than a single processor, Amdahl [1967] argued the case against parallel processing as a means of achieving large scale computations. He showed that the maximum theoretical speedup is merely the reciprocal of the proportion of time spent performing serial computation (this limiting factor is known as the serial fraction). Indeed, most problems are unlikely to experience even a 100-fold improvement, and this insight resulted in a degree of scepticism regarding the viability of massive parallelism. However, Gustafson [1988] addressed these concerns by pointing out that there are, in fact, two distinct approaches to parallel processing: *fixed size, reduced time*, used whenever user acceptance is important or there are real-time constraints; and, secondly, *bounded time, increased size*, where the increased power is used to improve either the accuracy or problem size of the computation.

Gustafson then showed that Amdahl’s law only applied to the case where the serial fraction is independent of the number of processors, i.e. the fixed-size, reduced-time approach. By considering the alternative method, a new law of *scaled speedup* was defined such that the speedup is approximately equal to the number of processors used (thereby confirming the original intuition).
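The two laws can be compared numerically with a few lines of Haskell. This is a minimal sketch using the usual textbook forms of the formulas; the function names are assumptions, and note that the serial fraction is measured on the sequential run for Amdahl's law but on the parallel run for Gustafson's.

```haskell
-- | Amdahl's fixed-size speedup for serial fraction s on n processors.
amdahl :: Double -> Double -> Double
amdahl s n = 1 / (s + (1 - s) / n)

-- | Limit of 'amdahl' as n grows: the reciprocal of the serial fraction.
amdahlLimit :: Double -> Double
amdahlLimit s = 1 / s

-- | Gustafson's scaled speedup; s is the serial fraction of the parallel run.
gustafson :: Double -> Double -> Double
gustafson s n = n - s * (n - 1)
```

With a serial fraction of 0.01 and 1024 processors, `amdahl` yields a speedup of roughly 91 (bounded above by 100), whereas `gustafson` yields roughly 1014.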

As a final cautionary note, Snyder [1986, page 291], taking the bounded-time approach for an \( O(n^4) \) algorithm as an example, showed that it would require 100 million processors to increase the problem size by two orders of magnitude – this led to the corollary of modest potential:

"Because its benefit is so modest, the whole force of parallelism must be transferred to the problem, not converted to "heat" in implementational overhead."
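The figure of 100 million follows from simple arithmetic under an idealised assumption of perfect speedup and no overhead (the symbols \( p \), \( p' \), \( n \) and \( n' \) are introduced here for illustration only). If the running time grows as \( n^4 \) and the time bound is fixed, then scaling the problem size from \( n \) to \( n' = 100n \) requires the processor count to grow from \( p \) to \( p' \), where:

\[
\frac{p'}{p} = \left( \frac{n'}{n} \right)^{4} = \left( 10^{2} \right)^{4} = 10^{8}
\]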

2.3 Architecture independence through functional programming

2.3.1 The software crisis and parallel languages

While parallel processing seems to offer high performance at a low cost, the recent failure
of the Thinking Machines Corporation [Markoff, 1994] would suggest that it is not a
commercially-viable option. Skillicorn [1990, page 38] states that the major problem is:

“There is currently no way to develop software for parallel computers and expect it to have a long lifetime.”

The situation is akin to the original software crisis of the 1960s, and, as then, either the hardware or software components (or both) need to be improved. A number of promising examples of the former approach exist, including the MIT Alewife [Agarwal et al., 1991], the Stanford DASH [Lenoski et al., 1992], and the Tera Computer Company’s Multi-Threaded Architecture [Smith, 1990]. However, this thesis takes the latter path, with McColl [1995, page 42] providing the necessary motivation:

“This software-first approach has a great deal of merit given that hardware is changing rapidly and that the cost and time required to produce software makes architecture-independence in software a major goal.”

This opinion is widely held, as illustrated by the long list of architecture-independent programming languages: Dino, High Performance Fortran, Lucid, Orca, Proteus, PCN, Sisal, Split C, SR, etc. (Cheng [1993] has surveyed over 30 different parallel languages, including those listed here, in addition to a wide selection of communication libraries, and performance, debugging, and visualisation tools.) Although each of these languages has achieved a degree of success, Cook, Pancake and Walpole [1994, section 6] have recently stated that:

“Parallel programming is difficult, under-supported, and unlikely to achieve impressive speedups on most applications.”

Peyton Jones [1989, section 2.3, page 176] argues that this is a direct result of the underlying programming model, and re-applies Backus’s fat and weak criticism (see section 2.3.2) to parallel languages, stating that:

“A parallel imperative program specifies in detail many resource-allocation decisions which the parallel functional program does not mention at all.”

Note that this thesis does not claim that parallel functional programming is the only solution, merely that it is a promising approach to the problem of developing architecture-independent software.

2.3.2 Can programming be liberated from the von Neumann style?

In 1977, the ACM Turing award was presented to Backus [1978], the developer of Fortran and BNF. During the lecture, he echoed the work of Landin [1966] by criticising conventional programming languages:

“Inherent defects at the most basic level cause them to be both fat and weak: their primitive word-at-a-time style of programming inherited from their common ancestor – the von Neumann computer, their close coupling of semantics to state transitions, their division of programming into a world of expressions and a world of statements, their inability to effectively use powerful combining forms for building new programs from existing ones, and their lack of useful properties for reasoning about programs.”

1 The BSP (Bulk-Synchronous Parallel) model, first proposed by Valiant [1990], is an attempt to decouple these two approaches.

He then proceeded to advocate the use of functional programming to circumvent the imperative *intellectual bottleneck*, and, despite the recent advances in imperative programming [Stroustrup, 1991], many of these arguments still hold [Hughes, 1989; Hudak and Jones, 1994].

The key properties of functional languages are described below (for a more complete overview of functional programming, [Hudak, 1989] and [Bird and Wadler, 1988] are highly recommended):

**declarative** “functional programming is often described as expressing what is being computed rather than how” [Hudak, 1989, page 361]. Essentially, the programmer is free to concentrate on the underlying algorithm without having to deal with unimportant details, such as sequencing and memory management.

**formal semantics** the lambda calculus provides the theoretical foundations of functional languages [Barendregt, 1981; Hankin, 1994]. Furthermore, specifying the denotational semantics [Stoy, 1977; Schmidt, 1986] of such languages is straightforward – indeed, the meta-language used is syntactically very similar to the lambda calculus.

**referential transparency** due to the absence of side-effects, any sub-expression can be replaced by others possessing the same value without changing the final value of the mathematical expression [Bird and Wadler, 1988, page 2]. This is often summarised as: “equals can be replaced by equals” [Hudak, 1989, page 362].

In addition, most modern functional programming languages will also provide the following features:

**algebraic data types** a sum-of-product type using *constructors* to differentiate between each possible product [Bird and Wadler, 1988, pages 204–219]. Recursive and mutually recursive data types are permitted.

**higher-order functions** “functions are treated as first-class values in a [functional] language – allowing them to be stored in data structures, passed as arguments and returned as results” [Hudak, 1989, section 2.1, pages 382–383]. Through the use of such functions, it is possible to develop powerful combining forms, such that “it is as though the programming language can be extended with new control structures whenever desired” [Hughes, 1989, section 3, pages 99–101].

**syntactic sugar** any syntactic construct added solely for the purpose of improving programmability. Examples include pattern matching, guarded expressions, and list comprehensions [Hudak et al., 1992].

**type inference** the process of deducing a program’s type attributes from its syntax, as typified by the Hindley–Milner system [Milner, 1978].
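The features listed above can be seen together in a short Haskell fragment. It is an illustrative sketch only: the `Tree` type and every function name are assumptions chosen for the example rather than definitions taken from the cited texts.

```haskell
-- Algebraic data type: a recursive sum-of-products with two constructors.
data Tree a = Leaf | Node (Tree a) a (Tree a)

-- Higher-order function defined by pattern matching; no type signature is
-- given, so Hindley-Milner inference assigns
--   foldTree :: (b -> a -> b -> b) -> b -> Tree a -> b
foldTree _ z Leaf         = z
foldTree f z (Node l x r) = f (foldTree f z l) x (foldTree f z r)

-- New "combining forms" obtained by instantiating the fold.
sumTree :: Num a => Tree a -> a
sumTree = foldTree (\l x r -> l + x + r) 0

depth :: Tree a -> Int
depth = foldTree (\l _ r -> 1 + max l r) 0

-- Syntactic sugar: a list comprehension with a guard.
evens :: Tree Int -> [Int]
evens t = [x | x <- toList t, even x]
  where toList = foldTree (\l x r -> l ++ [x] ++ r) []

example :: Tree Int
example = Node (Node Leaf 1 Leaf) 2 (Node Leaf 3 (Node Leaf 4 Leaf))
```

Because the fragment has no side effects, any occurrence of `sumTree example` may be replaced by its value without changing the meaning of the enclosing expression, which is the "equals can be replaced by equals" property in miniature.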

This combination of features provides an expressiveness and freedom from trivial details (such as sequencing and memory management) that is, as yet, unrivalled by modern imperative languages, such as C++ [Stroustrup, 1991]. Furthermore, the two traditional failings of functional languages, namely their inefficiency and poor input/output facilities, have been addressed by recent advances in compiler design [Peyton Jones, 1992; Plasmeijer and van Eekelen, 1993a] and category theory [Wadler, 1992; Peyton Jones and Wadler, 1993] respectively.
Non-strict evaluation

Often referred to as call-by-need, normal order reduction [Peyton Jones, 1987, page 25], or lazy evaluation, non-strictness is a property of an evaluation strategy such that an expression is only evaluated when its value is actually needed. Hughes [1989, page 98] argues that, in combination with higher-order functions, lazy evaluation “can contribute greatly to modularity, the key to successful programming.” However, implementing non-strictness is complicated [Peyton Jones, 1987] and relies on the existence of analysis techniques for removing unnecessary laziness [Beemster, 1994; Peyton Jones and Partain, 1994; Burn, 1991]. Nevertheless, recent benchmark results prompted Hartel et al. [1996, section 6.3.2] to note the following:

“As the Glasgow Haskell compiler shows, if the compiler can exploit strictness at the right points, the presence of lazy evaluation need not be a hindrance to high performance. This implementation is actually faster than most of the strict implementations.”

Even so, the functional programming community is split over the issue of strict versus non-strict languages, with Standard ML [Harper, Milner and Tofte, 1990] and Haskell (see the following section) representing the two camps. To help settle the matter, it is worth considering what Hill [1994, chapter 1, page 1] has to say about laziness in the context of parallelism:

“We have found that non-strictness opens the door to particular techniques and parallel algorithms, in just the same way that it has opened the eyes of functional programmers over the last decade.”

Therefore, for the purpose of this thesis, it will be assumed that the increased convenience and expressiveness of non-strictness outweigh the (perceived) operational deficiencies.
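A small Haskell sketch illustrates what non-strictness permits in practice. The definitions are written for this example (they are not the versions referred to elsewhere in the thesis) and the names are assumptions.

```haskell
-- An infinite list of primes; only the demanded prefix is ever computed.
primes :: [Integer]
primes = sieve [2 ..]
  where sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p /= 0]

-- Non-strict in its second argument: that argument is never evaluated.
firstOf :: a -> b -> a
firstOf x _ = x

-- Terminates despite the infinite list, because 'take' demands five elements.
firstFive :: [Integer]
firstFive = take 5 primes

-- Does not raise an error, because 'undefined' is never demanded.
safe :: Int
safe = firstOf 1 (undefined :: Int)
```

Under a strict evaluation order neither `firstFive` nor `safe` would produce a result: the first would loop while building the list and the second would fail when evaluating its unused argument.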

Haskell

Haskell is a modern functional programming language whose features include “higher-order functions, non-strict semantics, static polymorphic typing, user-defined datatypes, pattern matching, list comprehensions, a module system, and a rich set of primitive datatypes” [Hudak, Peyton Jones, Wadler and others, 1992, page 1]. A number of free compilers, interpreters, tools, and libraries exist, and, in a recent benchmark experiment [Hartel, 1994], the Glasgow Haskell compiler [Peyton Jones et al., 1993] consistently outperformed the best compilers for other lazy languages, including Clean [Nöcker et al., 1991], and Fast [The FAST project team, 1993]. Even when compared against strict languages, GHC managed to finish second out of twenty-five [Hartel et al., 1996, section 6.2.1].

Appendix B contains a number of example Haskell definitions, including map, foldl, fib, primes, and queens.

2.3.3 Graph reduction

Graph reduction [Wadsworth, 1971, chapter 4] is, arguably, the standard approach to implementing functional languages [Peyton Jones, 1987, parts II and III]. It provides a framework for both controlling the order of reduction, e.g. normal order versus applicative order, and managing the sharing of common sub-expressions. Combinators [Turner, 1979] are typically used as the abstract machine language, with the instruction set being optimally generated for each program [Hughes, 1984].
Clean [Nöcker, Smetsers, Plasmeijer and van Eekelen, 1991] is one of the few modern languages not to use graph reduction, instead being based upon term graph rewriting [Sleep, Plasmeijer and van Eekelen, 1993]. The primary advantage of this approach is that a formal proof of soundness has been developed [Barendregt, van Eekelen, Kennaway and others, 1987]. Despite these theoretical differences, when comparing the ABC machine [Plasmeijer and van Eekelen, 1993a] (the abstract machine used by Clean), with, for example, the STG machine (a graph-reduction system), it is clear that the implementation strategies are very similar.

For a brief period, specialised hardware for running functional languages was considered, with examples including Cobweb [Hankin, Osmon and Shute, 1985], the AMPS (Applicative Multi-Processors) project [Keller, Lindstrom and Patil, 1979], ALICE (Applicative Language Idealised Computing Engine [Darlington and Reeve, 1981]), and FLAGSHIP [Watson and Watson, 1987]. However, none of these offerings could compete against commercially-developed traditional uniprocessors, and all have therefore become obsolete.

2.3.4 Parallel functional programming: an introduction

Burge [1975] was probably the first to recognise the potential advantages of parallel functional programming. In addition to the high level of abstraction offered by the paradigm, the absence of side effects offers the following benefits in a parallel context [Peyton Jones, 1989, section 2.2, pages 175–176]:

- functional languages are implicitly parallel, i.e. it is possible for a compiler to automatically detect which sub-expressions can be safely evaluated in parallel (see section 2.4.1).
- the programmer does not have to specify the low-level synchronisation of variables as sharing is automatically handled by the run-time system.
- ideally, the language retains its original denotational semantics (modulo resource allocation), therefore the same formal reasoning techniques can be applied. This also means that all programs are guaranteed to be deadlock free (unless, of course, the sequential program also fails to terminate).
- a program's result will be independent of implementation details such as scheduling, partitioning, and load balancing. This also implies that parallel programs can be debugged on a sequential architecture.
- being based on the lambda calculus, there is no architectural bias.

Concerning the first two points, critics would argue that a human programmer must specify the low-level behaviour if performance is to be a primary goal (recall the corollary of modest potential from section 2.2.2). Peyton Jones [1989, section 2.3, pages 176–177] counters by noting that there has always been resistance to abstraction:

"For example, in the beginning all programs were written in assembly language, and compilers were distrusted because they were unlikely to do as good a job of register allocation as a human programmer."

However, he does concede that parallel functional programming relies on the existence of low- and mid-level parallel imperative technology, and it is probably in these areas that most work needs to be done.
For a more detailed overview of parallel functional programming, both [Hammond, 1994] and [Peyton Jones, 1989] are recommended.

2.4 User-level annotations and expressions

When examining a particular approach to parallel functional programming it is important to differentiate between what is being expressed and the notation used. For instance, consider the following code fragments, both of which have similar operational behaviours:

\[
\begin{array}{l}
f \; (g_1 \; \mathit{arg}_{1,1} \ldots \mathit{arg}_{1,a_1}) \; \ldots \; (g_n \; \mathit{arg}_{n,1} \ldots \mathit{arg}_{n,a_n}) \\
\quad \text{where } f :: \text{speculation } \alpha_1 \rightarrow \cdots \rightarrow \text{speculation } \alpha_n \rightarrow \alpha
\end{array}
\]

Burton’s type annotations [Burton, 1987]

\[
\begin{array}{l}
\text{sandwich } f \; \mathit{job}_1 \ldots \mathit{job}_n \\
\quad \text{where } f = \ldots \\
\quad \phantom{\text{where }} \mathit{job}_1 = g_1 \; \mathit{arg}_{1,1} \ldots \mathit{arg}_{1,a_1} \\
\quad \phantom{\text{where }} \vdots \\
\quad \phantom{\text{where }} \mathit{job}_n = g_n \; \mathit{arg}_{n,1} \ldots \mathit{arg}_{n,a_n}
\end{array}
\]

Vree’s sandwich expression [Vree, 1989]

For the purpose of this section, the primary focus is on the intended behaviour rather than any syntactic differences, with the predominant abstractions being:

- **Implicit specification**: It is left to the compiler to identify and harness those portions of the program which will benefit from parallel evaluation.

- **Bulk data types**: Primitive operations for manipulating a group of objects as a whole provide the main sources of parallelism.

- **Skeletal parallelism**: A skeleton is an algorithmic template, with both a denotational and operational reading, into which problem-specific routines are slotted.

- **Low-level task control**: The programmer is given full control over the creation, placement, and scheduling of threads of computation.

Each of these paradigms is examined in turn in sections 2.4.1 to 2.4.4.

2.4.1 Implicit specification

In the absence of programmer-supplied hints, the compiler has to rely on abstract interpretation (or, when this fails, run-time profiling) to both detect potential parallelism, and to ascertain if the overhead is justified. The final product of the analysis phase is an explicitly parallel program, using one or more of the abstractions described in the following sections. The main tools used are *strictness* and *complexity* analysis, with the former identifying those expressions which are sure to be evaluated [Burn, 1991], and the latter generating cost models of the reduction of these expressions [Maheshwari, 1990]. Of the two, strictness analysis is probably the more evolved, as it has applications outside the world of parallel functional programming [Howe and Burn, 1994].

The advantages offered by this paradigm are the reduced programmer burden coupled with architecture independence. The viability of this approach has been demonstrated by a number of systems, including serial combinators [Goldberg, 1988b] and evaluation transformers [Burn, 1989]. However, at present, such systems only introduce unstructured *par* combinators, leading to sub-optimal performance.
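To make the notion of an unstructured *par* combinator concrete, the sketch below uses the `par` and `pseq` primitives exported by GHC's `GHC.Conc` module. This is an assumption made for illustration: it is not the notation of the systems cited above, and the function names are hypothetical. It shows the kind of annotation an analysis phase might insert once strictness analysis has established that both recursive calls are needed.

```haskell
import GHC.Conc (par, pseq)

-- | Naive Fibonacci with an unstructured 'par' annotation: the first
-- recursive call is sparked for possible parallel evaluation while the
-- current thread evaluates the second.
parFib :: Int -> Integer
parFib n
  | n < 2     = toInteger n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = parFib (n - 1)
    y = parFib (n - 2)

-- | Sequential reference definition with the same denotation.
seqFib :: Int -> Integer
seqFib n
  | n < 2     = toInteger n
  | otherwise = seqFib (n - 1) + seqFib (n - 2)
```

The annotation changes only the operational behaviour, so `parFib` and `seqFib` agree on every argument. Because a spark is created at every level of the recursion regardless of how little work it represents, the example also hints at why unstructured annotations tend to perform poorly without some form of cost information.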
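<table>
<thead>
<tr><th>operation</th><th>example</th></tr>
</thead>
<tbody>
<tr><td>merging</td><td>\( \{1, 2, 9\} \cup \{1, 3, 6\} = \{1, 2, 3, 6, 9\} \)</td></tr>
<tr><td></td><td>\( \langle 1, 2, 9 \rangle \oplus \langle 1, 3, 6 \rangle = \langle 1, 2, 9, 1, 3, 6 \rangle \)</td></tr>
<tr><td></td><td>\( \text{zip\_with} \; (+) \; \langle 1, 2, 9 \rangle \; \langle 1, 3, 6 \rangle = \langle 2, 5, 15 \rangle \)</td></tr>
<tr><td>selection</td><td>\( \{x \in \{1, 2, 9\} \mid x \leq 5 \bullet x\} = \{1, 2\} \)</td></tr>
<tr><td>permutation</td><td>\( \text{reverse} \; (c, a, t) = (t, a, c) \)</td></tr>
<tr><td>reduction</td><td>\( \text{fold\_left} \; (+) \; 0 \; \langle 1, 3, 6 \rangle = 10 \)</td></tr>
<tr><td>scanning</td><td>\( \text{scan\_left} \; (+) \; 0 \; \langle 1, 3, 6 \rangle = \langle 0, 1, 4, 10 \rangle \)</td></tr>
<tr><td>apply to all</td><td>\( \text{map} \; (1+) \; \langle 1, 3, 6 \rangle = \langle 2, 4, 7 \rangle \)</td></tr>
</tbody>
</table>

Table 2.1: Some examples of collection-oriented operations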

2.4.2 Bulk data types

The fundamental component of a data-parallel language is the monolithic apply-to-all function which simultaneously acts on a large collection of data. Sipelstein and Blelloch [1991, page 510] note that:

“A collection-oriented language is characterised by two features: the kinds of collections it supports and the operations permitted on those collections.”

Example collections include arrays, bags, sets, mappings, and trees [Maaßen, 1992], with the operations including merging, selection, permutation, reduction, scans, and the ubiquitous map – see table 2.1 for a number of Haskell-style examples. Some of the operations require an ordering to be imposed on the elements of the collection, and this is typically either array-based or uses a (possibly unique) key. The data placement, pattern of communication, and synchronisation are usually implicitly specified by whichever operation is used, although a number of languages allow the programmer full control over some of these areas [Bala, Ferrante and Carter, 1993]. In order to illustrate some of the issues touched upon here, the following sections look at two example languages, Paralation Lisp and data-parallel Haskell. A number of other equally important examples exist, including NESL [Blelloch et al., 1993] and Sisal [Skedzielewski, 1991; Oldehoeft and Cann, 1988].
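The collection-oriented operations of table 2.1 have direct sequential counterparts in standard Haskell, sketched below. Ordinary lists stand in for both sets and sequences, which is a simplification made for this example, and the top-level names are assumptions; a data-parallel implementation would distribute the same operations across processors.

```haskell
import Data.List (sort, union)

-- Lists stand in for both sets and sequences in this sketch.
mergeSets :: [Int]
mergeSets = sort ([1, 2, 9] `union` [1, 3, 6])   -- set union, duplicates removed

appendSeqs :: [Int]
appendSeqs = [1, 2, 9] ++ [1, 3, 6]              -- sequence concatenation

zipped :: [Int]
zipped = zipWith (+) [1, 2, 9] [1, 3, 6]         -- element-wise merge

selected :: [Int]
selected = [x | x <- [1, 2, 9], x <= 5]          -- selection

permuted :: String
permuted = reverse "cat"                         -- permutation

reduced :: Int
reduced = foldl (+) 0 [1, 3, 6]                  -- reduction

scanned :: [Int]
scanned = scanl (+) 0 [1, 3, 6]                  -- scan

mapped :: [Int]
mapped = map (1 +) [1, 3, 6]                     -- apply to all
```

Each definition evaluates to the right-hand side of the corresponding entry in the table.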

Paralation Lisp

A paralation [Sabot, 1988] is a collection of fields, with each field containing an index which identifies the site, or location, of the field (the index does not have to be unique). The index-to-site relationship is, by default, arbitrary, but the programmer can enforce a particular shape on the paralation, including rings and grids. There are four levels of locality defined by the model: the elements in a single field are guaranteed to be near (elementwise locality) – this applies even if one of the fields is itself a paralation (inherited locality); the next closest items are fields from the same paralation with similar indices (shape locality); then come any fields within the same paralation; and finally, fields from different paralations are the most distant. The degree of synchronisation required is minimal, as any function which is sensitive to the order of evaluation is defined to be invalid under the model [Sabot, 1988, pages 19–21].

<table>
<thead>
<tr><th>construct</th><th>behaviour</th></tr>
</thead>
<tbody>
<tr><td>elwise paralations function</td><td>concurrently applies function to each set of fields with matching indices taken from the paralations. In its simplest form, elwise is equivalent to the map operation</td></tr>
<tr><td>match paralation<sub>1</sub> paralation<sub>2</sub></td><td>creates a partial, multi-valued index mapping based on the comparisons of all the elements of the two paralations</td></tr>
<tr><td>move paralation mapping default combine</td><td>returns a paralation whose contents are generated by applying the mapping to the specified paralation. The default and combine operators handle the special cases of zero or many elements being mapped to one location</td></tr>
</tbody>
</table>

Table 2.2: Paralation Lisp’s data-parallel constructs

Ignoring the underlying computational model, Paralation Lisp only requires three basic operations (excluding creation and testing for equality), elwise, match, and move, as detailed in table 2.2. In the following examples, a field is represented as \( value_{index} \) rather than the more usual \((index, value)\) presentation:

| expression | result |
|---|---|
| elwise \([1,3,6]\) \((1+)\) | \([2,4,7]\) |
| elwise \([1,2,9]\) \([1,3,6]\) \((+)\) | \([2,5,15]\) |
| match \([4,7,3]\) \([0,4,5,2,4]\) | \([0 \mapsto 1, 0 \mapsto 3, 2 \mapsto 0]\) |
| move \([5,9,12]\) \([0 \mapsto 2, 1 \mapsto 0, 1 \mapsto 2]\) \(\delta \oplus\) | \([9,0,\delta, 12]\) |

Data-parallel Haskell

A POD [Hill, 1994, page 18] is a collection of index/value pairs, where the index element of each pair is unique within the POD. The main operation on the data is a restricted form of list comprehension [Hudak et al., 1992, page 16], with the map operation being defined as follows:

\[
\text{map pod} f \text{ pod} = [(index, f x) | (index, x) \leftarrow \text{pod}]
\]

Multi-\textsc{pod} comprehensions are similar, but, to make it clear where the index set should be taken from, all secondary \textsc{pods} are introduced via the \(\leftarrow\) operator:

\[
\text{add pod} \text{ pod}_1 \text{ pod}_2 = [(index, x + y) | (index, x) \leftarrow \text{pod}_1, (index, y) \leftarrow \text{pod}_2]
\]

\[
\text{fetch pod function pod} = [(index, y) | (index, x) \leftarrow \text{pod}, (function index, y) \leftarrow \text{pod}]
\]

\[
\text{send pod function pod} = [(function index, x) | (index, x) \leftarrow \text{pod}]
\]

The functions \text{fetch pod} and \text{send pod} demonstrate how the comprehensions can concisely specify arbitrary communication patterns, and as such are similar to a combined match and move operation under Paralation Lisp. In line with its pedigree, data-parallel Haskell does not provide any mechanism to specify a \textsc{pod}'s shape, and synchronisation is automatically handled by the run-time system.
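The comprehensions above can be approximated in standard Haskell by modelling a pod as an association list. This is only a sequential sketch under stated assumptions: the type `Pod` and the names `mapPod`, `addPod`, `fetchPod` and `sendPod` are illustrative, ordinary list generators with explicit index equality tests stand in for the pod generators, and nothing here reflects how data-parallel Haskell distributes or synchronises the elements.

```haskell
-- A pod modelled as a list of (index, value) pairs.
-- Assumption: indices are unique within a pod, as the text requires.
type Pod i a = [(i, a)]

-- Apply a function to every value, leaving the indices untouched.
mapPod :: (a -> b) -> Pod i a -> Pod i b
mapPod f pod = [(index, f x) | (index, x) <- pod]

-- Combine the values of two pods that share an index.
addPod :: (Eq i, Num a) => Pod i a -> Pod i a -> Pod i a
addPod pod1 pod2 =
  [(index, x + y) | (index, x) <- pod1, (index', y) <- pod2, index == index']

-- Each site reads the value held at the site named by (function index).
fetchPod :: Eq i => (i -> i) -> Pod i a -> Pod i a
fetchPod function pod =
  [(index, y) | (index, _) <- pod, (source, y) <- pod, source == function index]

-- Each site writes its value to the site named by (function index).
sendPod :: (i -> i) -> Pod i a -> Pod i a
sendPod function pod = [(function index, x) | (index, x) <- pod]
```

With a rotation such as `\i -> (i + 1) `mod` 3` on a three-element pod, `fetchPod` pulls each value from the next site while `sendPod` pushes each value to the next site, which illustrates the two directions of communication.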

2.4.3 Skeletal parallelism

The \textit{skeleton} paradigm was so named by Cole [1989], who described the use of a set of fixed algorithmic templates, into which problem-specific routines can be slotted. Typical
skeletons include divide-and-conquer (and its small scale equivalent, the processor farm), pipelines, loops and trees [Burkhart, Korn, Gutzwiller, Ohnacker and Waser, 1993], with each skeleton having several associated components [Darlington et al., 1993]: a declarative meaning, which is the programmer's primary reference when developing skeletal algorithms; one or more implementation templates, for efficiently performing the required computation; performance models to predict the run-time costs of a particular instantiation of a skeleton; and a set of transformation rules, which, in combination with the cost model, allow a program to be optimised for a particular architecture. These components are examined in turn in the remainder of this section.

The distinction between data parallelism and skeleton-oriented programming is unclear, as the former could be considered a sub-paradigm of the latter.

**Declarative meaning**

A skeleton is typically represented by its functional-language definition, with this model serving two main purposes: it acts as the primary reference for the programmer, and it allows the programs to be tested on a sequential architecture. Some example definitions are given below:

```haskell
pipe :: [a -> a] -> (a -> a)
pipe [f] = f
pipe (f:fs) = f . (pipe fs)

divide_and_conquer :: (a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a
              -> b
divide_and_conquer is_trivial solve split combine values
  | is_trivial values = solve values
  | otherwise = combine [divide_and_conquer is_trivial solve split combine
                        part | part <- split values]
```
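As an illustration of how problem-specific routines are slotted into a skeleton, the following sketch instantiates the divide-and-conquer definition as a merge sort. The skeleton is restated so that the block stands alone; `mergeSort` and `merge` are hypothetical names chosen for this example and do not come from the cited work.

```haskell
-- The skeleton, restated from the definition above.
divide_and_conquer :: (a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b
divide_and_conquer is_trivial solve split combine values
  | is_trivial values = solve values
  | otherwise = combine [divide_and_conquer is_trivial solve split combine part
                        | part <- split values]

-- Hypothetical instantiation: merge sort over lists.
mergeSort :: Ord a => [a] -> [a]
mergeSort = divide_and_conquer isTrivial id split combine
  where
    isTrivial xs   = length xs <= 1                -- nothing left to divide
    split xs       = let (l, r) = splitAt (length xs `div` 2) xs in [l, r]
    combine [l, r] = merge l r
    combine parts  = concat parts                  -- not reached: split yields two parts

-- Merge two sorted lists into one sorted list.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
  | x <= y    = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys
```

Because the skeleton's declarative meaning is ordinary Haskell, the instantiation can be tested sequentially before any parallel template is chosen.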

**Implementation templates**

The actual implementation of the model deals with all the low-level issues such as data placement, synchronisation and scheduling. In order to cater for different architectures, or even slight variations within a particular configuration, a number of implementation templates will be necessary to achieve architecture independence. The specification of a template is usually given in terms of the component processes and the interconnection network used. For example, figure 2.2 shows the model for the \( P^3L \) dedicated-farm skeleton [Pelagatti, 1993, pages 92, 97, and 139].

Figure 2.2: Operational template of $P^3L$'s dedicated-farm skeleton

**Performance models**

In order to make resource-allocation decisions it is essential to have an accurate performance model of both the parallel and sequential parts of the program. There are a number of factors that must be considered when developing the model, including: the available resources, such as the number of processors; the overheads associated with the implementation of the skeleton; the performance of the problem-specific code; the volume of data to be produced/consumed; and so on. Obviously some of these parameters will have to be estimated, either via abstract analysis or run-time profiling. As an example, Bratvold [1994, pages 105–106] uses the following equations to characterise the pipeline skeleton:

1. \[ L = \sum_{1 \leq i \leq s} (L_i + C_i) \]

2. \[ T_C = (\max_{1 \leq i \leq s} T_{Ci}) + L - L_{\max} \]

<table>
<thead>
<tr><th>Symbol</th><th>Meaning</th></tr>
</thead>
<tbody>
<tr><td>$L$</td><td>total latency of the pipeline</td></tr>
<tr><td>$s$</td><td>number of stages</td></tr>
<tr><td>$L_i$</td><td>latency of stage $i$</td></tr>
<tr><td>$C_i$</td><td>communication costs of sending a value from stage $i$ to stage $i + 1$</td></tr>
<tr><td>$T_C$</td><td>completion time</td></tr>
<tr><td>$T_{Ci}$</td><td>completion time for stage $i$</td></tr>
<tr><td>$L_{\max}$</td><td>largest latency</td></tr>
</tbody>
</table>
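The two equations can be evaluated directly once the per-stage figures are known. The sketch below is illustrative only: the `Stage` record and its field names are assumptions made for this example, and $L_{\max}$ is taken to be the largest of the per-stage latencies $L_i$.

```haskell
-- One pipeline stage, described by the per-stage quantities in the table.
data Stage = Stage
  { latency    :: Double  -- L_i
  , commCost   :: Double  -- C_i
  , completion :: Double  -- T_{Ci}
  }

-- Equation 1: L is the sum of (L_i + C_i) over all stages.
totalLatency :: [Stage] -> Double
totalLatency stages = sum [latency s + commCost s | s <- stages]

-- Equation 2: T_C = (max T_{Ci}) + L - L_max.
-- Assumption: the list of stages is non-empty, since maximum is partial.
completionTime :: [Stage] -> Double
completionTime stages =
  maximum (map completion stages) + totalLatency stages - maximum (map latency stages)
```

For three stages with $(L_i, C_i, T_{Ci})$ equal to $(2, 1, 10)$, $(3, 1, 12)$ and $(1, 0, 8)$, the model gives $L = 8$ and $T_C = 12 + 8 - 3 = 17$.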

Typical applications of cost models include discriminating between sequential and parallel implementations [Danelutto, Pelagatti and others, 1992, section 4.1], and deciding when to stop unfolding a divide-and-conquer problem [Darlington et al., 1993, section 4].

**Transformation rules**

Transformation rules declare two expressions or operational templates to be semantically equivalent, and, using a cost calculus as a guide, allow a transformation system to optimise a program with respect to a particular architecture [Pelagatti, 1993]. In addition, a precondition may have to be satisfied before the rule can be applied. The following rules are taken from Bratvold [1994, appendix B] and Darlington et al. [1993, section 5]:

\[
\begin{aligned}
    \text{map } f \ (\text{map } g \ l) & \iff \text{map } (f \cdot g) \ l \\
    \text{map } (\text{divide\_and\_conquer } t \ s \ d \ c) & \implies \text{pipe } (\text{rept } q \ (\text{map'} n \ c)) \cdot \text{map } s \cdot \text{pipe } (\text{rept } q \ (\text{foldr1 } (\oplus)) \cdot \text{map } d)
\end{aligned}
\]

Notice that the second transform is unidirectional, and uses an architecture-specific constant, $q$, which encodes the optimal depth of the pipeline. Transformations not only apply to program constructs, but to the implementation templates as well.
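The first rule above can be checked on a sequential implementation by writing both sides as ordinary functions. The names `unfused` and `fused` are hypothetical; the sketch only shows that the two sides agree on a sample input, and which side is cheaper on a given architecture is exactly what the cost calculus is meant to decide.

```haskell
-- Left-hand side of the rule: two traversals and an intermediate list.
unfused :: (b -> c) -> (a -> b) -> [a] -> [c]
unfused f g l = map f (map g l)

-- Right-hand side of the rule: a single traversal.
fused :: (b -> c) -> (a -> b) -> [a] -> [c]
fused f g l = map (f . g) l
```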

**2.4.4 Low-level annotations**

Unlike the other approaches to parallelism, low-level languages rely on the programmer to identify and control all aspects of the parallel program. This can lead to improved performance, but at the price of increased programmer effort and reduced portability.
Traditionally, annotations have been used to control the following operational properties of functional programs:

**thread identification** threads are the basic unit of work of a system, with each responsible for evaluating a single expression.

**degree of evaluation** by default, a thread will reduce an expression to head-normal form, which may not entail sufficient work to justify the overhead of thread creation.

**thread and data placement** the thread-to-processor mapping and data distribution encode load balancing and locality information.

**order of evaluation** scheduling annotations provide fine control over the sequence of computation, and can be used to augment, or override, the default strategy.

### Thread identification

Probably the best-known annotation in parallel functional programming is the *spark* construct [Clack and Peyton Jones, 1986] – most commonly manifesting as either the “!” character or the *par* combinator. This indicates that the decorated expression can be evaluated (usually to normal form) in parallel with the current computation:

```
Haskell

dsum low high
  | high == low = high
  | otherwise   = (dsum low middle) + {! worth_it}
                  (dsum (middle + 1) high)
  where middle   = (high + low) `div` 2
        worth_it = (high - low) > 50
```

where `{!exp_cond}` [Peyton Jones, 1989, page 181] is syntactic sugar for `let x = exp in if exp_cond then x{!} else x`. Another variation on the theme is Mattson Jr.’s speculative spark, `{# PCT potential #}` [1993a, page 66], where *potential* is an estimate of the probability that the expression will be needed.
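For comparison, a similar conditional spark can be written with the `par` and `pseq` combinators that GHC exports from `GHC.Conc`. This is a sketch under assumptions rather than a translation of the annotation's exact semantics: `par` only hints that its first argument may be evaluated in parallel, and the threshold of 50 is carried over from the example above.

```haskell
import GHC.Conc (par, pseq)

-- Sum the integers from low to high, sparking the right-hand half
-- only when the range is large enough to be worth a thread.
dsum :: Int -> Int -> Int
dsum low high
  | high == low = high
  | worthIt     = right `par` (left `pseq` (left + right))
  | otherwise   = left + right
  where
    middle  = (high + low) `div` 2
    left    = dsum low middle
    right   = dsum (middle + 1) high
    worthIt = (high - low) > 50
```

The result is the same whether or not any spark is actually run in parallel, so the function can be tested on a single processor.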

### Degree of evaluation

In order to justify the overhead of thread creation, it may be necessary to increase the work associated with an expression. Typically, this is done by overriding the default reduction strategy and evaluating to, for example, *irreducible normal form* [Kewley and Glynn, 1990, page 330]. In the presence of constructors, there is a wide spectrum of possible evaluation orderings [Burn, 1991, page 114], some of which may generate further threads of computation (the following example uses the STG' language from chapter 4, as standard Haskell does not support the necessary operations):

```
STG' code

force_tree  = □ \r [tree] -> force_tree' tree tree;
force_tree' = □ \r [tree original_tree]
              -> case tree of {Branch tree1 tree2 -> letpar tree1' = force_tree tree1 in
                                                      letpar tree2' = force_tree tree2 in
                                                      original_tree ;
                               Leaf a -> original_tree ;};
```

These transformers are either hand coded or automatically generated – Mirani and Hudak [1995, section 2] support both approaches by using Haskell’s class system to provide default definitions for any missing instances.
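A hand-coded transformer of this kind can be approximated in GHC Haskell, using the non-standard `par` combinator from `GHC.Conc` in place of `letpar`. The `Tree` type and the name `forceTree` are assumptions made for this sketch; as in the STG' version, the leaves themselves are not evaluated, only the spine of the tree.

```haskell
import GHC.Conc (par)

data Tree a = Leaf a | Branch (Tree a) (Tree a)
  deriving (Eq, Show)

-- Spark the forcing of each subtree, then return the original tree.
-- Each spark, if run, evaluates its subtree's root and sparks its children in turn.
forceTree :: Tree a -> Tree a
forceTree tree = case tree of
  Branch tree1 tree2 -> forceTree tree1 `par` (forceTree tree2 `par` tree)
  Leaf _             -> tree
```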
Table 2.3: Concurrent Clean’s topology functions

<table>
<thead>
<tr>
<th>Function</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>NeighbourP n proc_id</td>
<td>returns the ID of the nth neighbour of the specified processor</td>
</tr>
<tr>
<td>ChannelP var</td>
<td>returns the ID of the processor on which the argument variable is stored</td>
</tr>
<tr>
<td>ITOP integer</td>
<td>converts an integer into a processor ID, allowing user defined mapping functions to be written</td>
</tr>
<tr>
<td>CurrentP</td>
<td>returns the ID of the current processor</td>
</tr>
<tr>
<td>RandomP</td>
<td>generates a random processor ID</td>
</tr>
</tbody>
</table>

Thread and data placement

The simplest example of process mapping is provided by Concurrent Clean’s Self and Par annotations [Nöcker, Smetsers, Plasmeijer and van Eekelen, 1991] – threads spawned as a result of the former will be evaluated locally, and cannot be migrated to other processors, while those identified by the latter will probably be run on a remote processor. Within the same language, finer control is provided by the P AT $exp_{PROCID}$ directive, where the expression is of primitive type $PROCID$. The primitive operations for manipulating such expressions are shown in table 2.3. In order to make use of the ITOP function, the mapping between integers and processor identifiers must be defined for each target architecture. To provide support for the various annotations, Concurrent Clean extends the type system so that it includes process types [Plasmeijer and van Eekelen, 1993b].

Para-functional Haskell [Hudak, 1991, section 5.3.3, pages 171–175] provides similar facilities, but for placing data rather than tasks. Also, by using an operating system monad to structure access to them, the inherent non-determinism is restricted to a purely operational level [Mirani and Hudak, 1995, section 4].

Caliban [Kelly, 1989] uses the moreover clause to both identify the threads and specify the required topology. This takes as its argument a conjunction of assertions, where an assertion is either an arc statement, or a network-forming expression. The statement arc $a b$ indicates that process $a$ derives its input from the output of process $b$, and that it is safe to run both processes concurrently, e.g.

\[
\text{main} = (f \cdot g \cdot h) \ d
\]

where $f = \text{map} (\text{(+)} 2)$, $g = \text{map} (\text{(*)} 3)$, $h = \text{map} \text{sqrt}$, $d = \text{from} 1$

moreover $(\text{arc } @f @g) \land (\text{arc } @g @h) \land (\text{arc } @h d)$

By using higher-order functions to manipulate arc definitions, a network-forming expression can concisely define complex process structures. This is illustrated by the following definition of the pipeline function:

\[
\text{pipeline} :: [a \rightarrow a] \rightarrow a \rightarrow a
\]

\[
\text{pipeline} \ fs \ x = (\text{fold} \ (\cdot) \ id \ fs) \ x
\]

moreover $(\text{chain} \ (\text{arc} @f \@g)) \land (\text{arc } @g \@h) \land (\text{arc } @h \ d)$

\[
\text{chain} :: (\text{Bool} \rightarrow \text{Bool} \rightarrow \text{Bool}) \rightarrow [(a \rightarrow b)] \rightarrow \text{Bool}
\]

\[
\text{chain} \ relation \ [f] = \text{True}
\]

\[
\text{chain} \ relation \ (f : \ fs) = (\text{relation } f \ (\text{head} \ fs)) \land (\text{chain} \ relation \ fs)
\]
Order of evaluation

In place of the usual spark and evaluation-override annotations, para-functional Haskell provides schedule expressions [Mirani and Hudak, 1995, section 2]:

“Schedules define partial orders on events of which there are two kinds for every expression: (1) a demand for the expression’s evaluation, and (2) a wait for the completion of the expression’s evaluation. Many demands may be made for an expression’s evaluation, but only the first one will have any effect.”

A schedule, therefore, consists of either an event, or the concatenation or concurrence of two other schedules, denoted by $s_1 . s_2$ and $s_1 \parallel s_2$, respectively. The operational effect of these operators is similar to the traditional seq and par combinators. For example, consider the following function applications:

1) \[ f a b \text{ schedule } (\text{demand } a \parallel \text{demand } b \parallel \text{demand } f) \]
2) \[ f a b \text{ schedule } ((\text{demand } a . \text{wait } a) \parallel (\text{demand } b . \text{wait } b)) . \text{demand } f \]

The first expression reduces both the function application and the arguments in parallel, while the second one concurrently evaluates the two arguments, and proceeds with the application only after both threads have completed.
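Since schedules behave much like seq and par, the two examples can be loosely mimicked with the `par` and `pseq` combinators exported by GHC's `GHC.Conc` module. This is an approximation under assumptions, not para-functional Haskell: the names `applyEager` and `applyAfterArgs` are hypothetical, `par` is only a hint to the run-time system, and arguments are evaluated to weak head normal form rather than to completion.

```haskell
import GHC.Conc (par, pseq)

-- Roughly schedule (1): spark both arguments and start the application at once.
applyEager :: (a -> b -> c) -> a -> b -> c
applyEager f a b = a `par` (b `par` f a b)

-- Roughly schedule (2): evaluate the arguments concurrently (a as a spark,
-- b on the current thread), wait for both, and only then apply f.
applyAfterArgs :: (a -> b -> c) -> a -> b -> c
applyAfterArgs f a b = a `par` (b `pseq` (a `pseq` f a b))
```

Both functions return the same value as `f a b`; they differ only in when the arguments are demanded.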

2.5 Prototyping parallel functional languages

A would-be designer of a parallel functional system is faced with three major obstacles:

1. due to the large number of dimensions involved and to the lack of a common benchmarking system, it is extremely difficult to determine which components are central to the performance of the system.

2. having selected and integrated the components, the cost of developing an efficient implementation for just one platform is considerable.

3. once the base implementation is complete, experimentation with any but the most trivial of subsystems may require significant effort.

What is needed is a system to rapidly develop and test ideas before committing to a full-scale implementation. Indeed, Hiromoto [1994, section 5] argues for a:

“Incremental, cyclic and comparative approach in the evaluation of [parallel functional] languages, compilers and machine architectures.”

The traditional software-engineering solution to this problem is the development of a prototype [Balzer et al., 1988, page 8]:

“Prototyping is the process of constructing software for the purpose of obtaining information about the adequacy and appropriateness of the designers' conception for a software product...

...a prototype is distinguished from a production system by typically being more quickly developed, more readily adapted, less efficient and/or complete, and more easily instrumented and monitored.”

Chapter 3 outlines a possible prototyping framework, while the remainder of this section examines the relevant literature.
2.5.1 The problem with performance comparisons

Jain [1991] motivates the need for performance evaluations as follows:

“A performance evaluation is required when a computer designer wants to compare a number of alternative designs and find the best design... Even if there are no alternatives, performance evaluation of the current system helps in determining how well it is performing and if any improvements need to be made.”

However, when Langendoen [1993, table 3.2, page 57] attempted to compare the performance of a number of parallel implementations, the following problems were encountered:

“Unfortunately, few results are actually reported for each machine, and, worse, different algorithms have been used to solve the same problem. Only the notorious nfib program, a one-liner to compute the number of function calls per second, has been coded in similar style and measured on most parallel reduction machines.”
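For reference, nfib is usually written along the following lines. The definition is not given in the sources quoted here, so this formulation is an assumption based on common usage: the extra 1 in the recursive case makes the result equal to the number of calls made, so dividing it by the elapsed time yields a calls-per-second figure.

```haskell
-- The result counts the calls to nfib made while computing it.
nfib :: Int -> Int
nfib n
  | n <= 1    = 1
  | otherwise = nfib (n - 1) + nfib (n - 2) + 1
```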

Furthermore, Kaser et al. [1992, table 2, page 342] are probably the only others to attempt to directly compare different implementations (in this case, EQUALS, the $\langle\nu, G\rangle$-machine, and GAML). They encountered similar problems: only two of the systems ran on the same architecture, and one of these did not support garbage collection (potentially 30% of the total run time). In the end, they resorted to comparing relative speedups.

Are these experiences surprising? Tables 2.4 and 2.5 summarise a number of existing implementations, including the test programs used to generate the performance results. Even disregarding the differences in the architecture, evaluation strategy, and source of parallelism, it is fairly clear that only a minority of systems have run even similar tests (fewer still have also included the source code and provided actual timings, rather than relative speedups.) Even in a purely sequential context, Partain [1993, section 1.1] is highly critical of the current state of performance evaluation:

“The quantitative measurement of systems for lazy functional programming is a near-scandalous subject. Dancing behind a thin veil of disclaimers, researchers in the field can still be found quoting nfibs/sec (or something equally egregious), as if this refers to anything remotely interesting.”

2.5.2 Is a standard benchmark suite the solution?

Consider, for example, PARKBENCH (PARallel Kernels and BENCHmarks [Hockney et al., 1993]), which has the following objectives:

1. to establish a comprehensive set of parallel benchmarks that is generally accepted by both users and vendors of parallel systems.

2. to provide a focus for parallel benchmark activities and avoid unnecessary duplication of effort and proliferation of benchmarks.

3. to set standards for benchmarking methodology and result-reporting together with a control database/repository for both the benchmarks and the results.

4. to make the benchmarks and results freely available in the public domain.
<table>
<thead>
<tr>
<th>system</th>
<th>source</th>
<th>strict</th>
<th>reference</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alfalfa</td>
<td>implicit</td>
<td>yes</td>
<td>[Goldberg and Hudak, 1987]</td>
</tr>
<tr>
<td>Graphinators</td>
<td>implicit</td>
<td>no</td>
<td>[Hudak and Mohr, 1988]</td>
</tr>
<tr>
<td>BBN ML</td>
<td>implicit</td>
<td>no</td>
<td>[George, 1989]</td>
</tr>
<tr>
<td>HDG-machine</td>
<td>implicit</td>
<td>no</td>
<td>[Kingdon et al., 1991]</td>
</tr>
<tr>
<td>EQUALS</td>
<td>implicit</td>
<td>no</td>
<td>[Kaser et al., 1992]</td>
</tr>
<tr>
<td>SISAL</td>
<td>data-parallel</td>
<td>yes</td>
<td>[Oldehoeft and Cann, 1988]</td>
</tr>
<tr>
<td>NESL</td>
<td>data-parallel</td>
<td>yes</td>
<td>[Blelloch et al., 1993]</td>
</tr>
<tr>
<td>DP Haskell</td>
<td>data-parallel</td>
<td>no</td>
<td>[Hill, 1994]</td>
</tr>
<tr>
<td>Gamma</td>
<td>skeletal</td>
<td>yes</td>
<td>[Kuchen and Gladitz, 1992]</td>
</tr>
<tr>
<td>WYBERT</td>
<td>skeletal</td>
<td>no</td>
<td>[Langendoen, 1993]</td>
</tr>
<tr>
<td>Skeletal ML</td>
<td>skeletal</td>
<td>yes</td>
<td>[Bratvold, 1994]</td>
</tr>
<tr>
<td>$\langle\nu, G\rangle$-machine</td>
<td>low-level</td>
<td>no</td>
<td>[Augustsson and Johnsson, 1989]</td>
</tr>
<tr>
<td>GAML</td>
<td>low-level</td>
<td>no</td>
<td>[Maranget, 1991]</td>
</tr>
<tr>
<td>PAM</td>
<td>low-level</td>
<td>no</td>
<td>[Loogen et al., 1991]</td>
</tr>
<tr>
<td>BBN Haskell</td>
<td>low-level</td>
<td>no</td>
<td>[Mattson Jr., 1993a]</td>
</tr>
<tr>
<td>STAR:DUST</td>
<td>low-level</td>
<td>no</td>
<td>[Ostheimer, 1993]</td>
</tr>
<tr>
<td>pD</td>
<td>low-level</td>
<td>yes</td>
<td>[Schreiner, 1994]</td>
</tr>
<tr>
<td>GUM Haskell</td>
<td>low-level</td>
<td>no</td>
<td>[Trinder et al., 1996]</td>
</tr>
</tbody>
</table>

Table 2.4: Comparing implementations of parallel functional languages – language issues

Other C/Fortran-centric parallel suites include: SPLASH [Singh, Weber and Gupta, 1992], Genesis [Addison et al., 1993], and the NAS benchmarks [Bailey, 94].

With regards to functional programming, the nofib [Partain, 1993] suite (see appendix C) is probably the first attempt at providing a standard set of benchmark programs (a large subset of the collection had already been used by Hartel and Langendoen [1993].) The “pseudoknot” benchmark, based on a single float-intensive program, is also worthy of note, even if only for the unprecedented scale of collaboration achieved. However, ignoring the inherent problems of benchmarking [Bailey, 1991; Jain, 1991], there are a number of additional problems posed when moving to a parallel environment:

- there is no standard syntax for expressing parallelism. Indeed, even the general paradigm is still a matter for debate.

- each implementation embodies a large number of (potentially arbitrary) design decisions – isolating the effect of each would be very difficult, if not impossible.

- each architecture has different communication properties, and normalising the results would, again, be difficult.

- there is no consensus as to what metrics would be useful.

In summary, an implementation, with respect to a benchmark suite, appears to be a monolithic black box. This not only limits the useful experiments that can be devised, but also requires that a complete and optimised implementation exists. The problem is further compounded by the number of variables – consider, for instance, tables 2.4 and 2.5 – not least of which is the parallel architecture itself. It is therefore unlikely that any meaningful comparison or conclusions can be made, based on ad-hoc performance data (consider again table 2.5).

<table>
<thead>
<tr><th>system</th><th>arch.</th><th>P</th><th>benchmarks</th><th>code</th><th>times</th></tr>
</thead>
<tbody>
<tr><td>Graphinators</td><td>SIMD</td><td>8K</td><td>sum, matmult</td><td>yes</td><td>no</td></tr>
<tr><td>NESL</td><td>SIMD</td><td>16K</td><td>linefit, median, matmult</td><td>yes</td><td>yes</td></tr>
<tr><td>DP Haskell</td><td>SIMD</td><td>1K</td><td>map</td><td>yes</td><td>yes</td></tr>
<tr><td>SISAL</td><td>GMSV</td><td>10</td><td>sieve, simple, kernel</td><td>some</td><td>yes</td></tr>
<tr><td>$\langle\nu, G\rangle$-machine</td><td>GMSV</td><td>16</td><td>nfib 30, queens 10, euler</td><td>yes</td><td>no</td></tr>
<tr><td>BBN ML</td><td>GMSV</td><td>16</td><td>nfib 20, queens 8, sieve</td><td>yes</td><td>yes</td></tr>
<tr><td>GAML</td><td>GMSV</td><td>8</td><td>nfib 30, queens 10, euler</td><td>yes</td><td>yes</td></tr>
<tr><td>Gamma</td><td>GMSV</td><td>6</td><td>mergesort, minimum</td><td>no</td><td>yes</td></tr>
<tr><td>EQUALS</td><td>GMSV</td><td>6</td><td>nfib 30, queens 10, euler</td><td>no</td><td>yes</td></tr>
<tr><td>BBN Haskell</td><td>GMSV</td><td>122</td><td>euler, sieve, primes</td><td>yes</td><td>yes</td></tr>
<tr><td>WYBERT</td><td>GMSV</td><td>4</td><td>nfib, queens, det, wang</td><td>some</td><td>yes</td></tr>
<tr><td>pD</td><td>GMSV</td><td>20</td><td>linear, resultants</td><td>yes</td><td>yes</td></tr>
<tr><td>GUM Haskell</td><td>GMSV</td><td>6</td><td>fac, loadtest, bulktest</td><td>yes</td><td>no</td></tr>
<tr><td>Alfalfa</td><td>DMMP</td><td>36</td><td>queens 6</td><td>no</td><td>yes</td></tr>
<tr><td>HDG-machine</td><td>DMMP</td><td>4</td><td>nfib 20, queens 6</td><td>yes</td><td>yes</td></tr>
<tr><td>PAM</td><td>DMMP</td><td>12</td><td>nfib 24, matmult, towers</td><td>yes</td><td>yes</td></tr>
<tr><td>Skeletal ML</td><td>DMMP</td><td>34</td><td>ray, match, area</td><td>yes</td><td>yes</td></tr>
<tr><td>STAR:DUST</td><td>DMMP</td><td>24</td><td>nfib, qsort</td><td>some</td><td>yes</td></tr>
<tr><td>GUM Haskell</td><td>DMMP</td><td>8</td><td>fac, loadtest, bulktest</td><td>yes</td><td>no</td></tr>
</tbody>
</table>

Table 2.5: Comparing implementations of parallel functional languages – benchmarks. The code field indicates whether the source code of the benchmark programs was supplied, while the times column differentiates between systems that only supply speedup figures and those that provide the total elapsed times.

2.5.3 Existing approaches to developing functional implementations

The Haskell approach

One solution to the problems outlined previously is to standardise one or more components of the language, compiler, and architecture triple. For example, the Haskell language was designed to “reduce unnecessary diversity in functional programming languages” [Hudak et al., 1992, page iv]. Also, with regards to compiler implementation, GHC aims “to provide a modular foundation that other researchers can extend and develop” [Peyton Jones et al., 1993, section 2]. Both of these ideas have already been adopted by a section of the non-strict parallel functional community – pH [Nikhil et al., 1995, section 1, page 1], a Haskell derivative extended to include explicit parallelism, has as one of its goals:

“To share infrastructure (compilers, systems, application programs), and to facilitate interesting research topics, such as comparing lazy evaluation vs. lenient evaluation...”

This is certainly a move in the right direction. However, by necessity, these compilers are written primarily for speed and efficiency, possibly at the expense of clarity – based on personal experience, this is certainly true of GHC! Moreover, the system will be sufficiently complex that familiarisation and development will take a significant amount of time.

Simulating multiprocessor architectures for compiled graph reduction

Before building the GRIP multi-processor [Peyton Jones, Clack, Salkild and Hardie, 1987], Deschner [1990] developed a “highly flexible simulation system” to explore task partitioning, memory usage, scheduling, topology, and run times. The simulator took as input a precedence graph and a description of the hardware configuration. The former is automatically generated by tracing the sequential execution of the test program, while the latter comprises: the number of processors, the task-pool size, the partitioning and scheduling policies, and the costs associated with some basic operations.

Hammond and Peyton Jones [1992, section 5.2], when analysing the performance of the final hardware implementation, acknowledge the accuracy of the simulator:

“Somewhat surprisingly, in the absence of throttling, the choice of FIFO or LIFO [scheduling] strategy has at most a marginal impact on GRIP. We first realised this as a result of Deschner’s simulation experiments, and then verified it on GRIP...”

An executable specification of the HDG machine

The HDG machine [Kingdon, Lester and Burn, 1991] was both specified and tested as a Miranda script. The authors note that using a functional language enabled them to write the simulator more quickly, debugging was easier, and the resulting definitions bore a strong resemblance to the traditional state-transition model of abstract machines. Indeed, Burn [1989, section 6, page 391] concludes that:

“This has turned out to be such a powerful technique that we would highly recommend it to other machine designers.”
Ginger

Joy and Axford [1992] describe the Ginger language, "which sits somewhere between the low-level FLIC [Functional Language Intermediate Code] and high-level Miranda or Haskell." The interpreter is combinator based, supports both strict and non-strict evaluation, and provides explicit parallelism via placed sparks and data-parallel lists [Axford and Joy, 1991]. A simulator is used to "facilitate research and teaching into parallelism," and this is capable of modelling both shared- and distributed-memory machines. No details of the simulation parameters are provided.

Simulating shared-memory graph reduction

Bennett [1993] uses simulation to explore the behaviour of a parallel functional system on shared-memory machines – primarily focusing on the caching mechanism. The simulator enables both theoretical and existing configurations to be explored, something that would have been impossible with a design-build approach. The results led to the design of a scalable cache mechanism which takes advantage of the memory-reference characteristics of parallel functional programs.

A graphical winnowing system for Haskell

Hammond, Loidl and Partridge [1995a] use simulation to explore the impact of language and implementation on task granularity. The simulator is based on GHC, and models a distributed-memory machine. A visualisation tool helps to analyse the results, which have been used to uncover a previously unknown relationship between a program's run time and its heap-granularity profile (a histogram of the memory used by each thread of execution) [Hammond, Loidl and Partridge, 1995b]. In addition, and rather unusually, they also state that the simulation has confirmed the experimental results of a real system. However, they do point out the main problem with simulation:

"There is, of course, a danger that the design of the simulator may obscure real artifacts or introduce false one."

2.6 Summary

Non-strict higher-order functional programming languages are elegant, concise, mathematically sound and contain few environment-specific features. Given that their sequential compiler technology has recently begun to compare with that of their imperative counterparts, they are obvious candidates for harnessing high-performance parallel architectures. The validity of this approach has been established by a number of experimental compilers. However, a would-be designer of a parallel functional system is faced with three major obstacles:

1. due to the large number of dimensions involved and to the lack of a common benchmarking system, it is extremely difficult to determine which components are central to the performance of the system.

2. having selected and integrated the components, the cost of developing an efficient implementation for just one platform is considerable.

3. once the base implementation is complete, experimentation with any but the most trivial of subsystems may require significant effort.
What is needed is a system to rapidly develop and test ideas before committing to a full-scale implementation.
Chapter 3

A framework for prototyping parallel functional intermediate languages

3.1 Introduction

This chapter describes an approach to the design of an explicitly parallel intermediate language for use during the compilation of non-strict higher-order functional programming languages. The framework is based upon the development of both a denotational and an operational model for the intermediate language, which are then used to produce specifications for the parser and code generator. Haskell animations of these components aid with both debugging and informal validation. (For a more detailed and example-driven description of the prototyping system see [Ben-Dyke and Axford, 1995], which is reproduced in appendix A.)

3.2 A three-phase compilation system

As illustrated by figure 3.1, the prototyping framework is modelled on a traditional three-phase compilation system [Santos, 1995, figure 2.1, page 6]. The phases are as follows:

**the source language** typically a Haskell-like non-strict functional programming language supporting higher-order functions and abstract data types. Parallelism will either be explicitly specified, as is the case with para-functional Haskell and Caliban, or abstract-analysis techniques will be employed to automatically detect the potential parallelism (see section 2.4). The translation rules convert from the source language to the intermediate representation.

**the intermediate representation** the sequential STG language [Peyton Jones, 1992] is used for the purpose of this study. As well as converting to and from the intermediate language via the translation and code-generation rules, the optimisation rules convert between equivalent language terms, with the aim of improving efficiency.

**the target language** the Alpha AXP instruction set [DEC, 1992] is the primary interface between the compiler and the architecture simulator. Section 8.2 discusses the reasons for selecting this over the high-level language C [Kernighan and Ritchie, 1978]. The code-generation rules convert from the STG representation into the RISC format.

Figure 3.1: An overview of the prototyping framework

Both WYBERT [Langendoen, 1993, figure 5.1, page 96] and Clean [Plasmeijer and van Eekelen, 1993a, figure 8.1, page 253] have four phases, but such systems can be considered as comprising two distinct compilation processes.

3.3 Translation, optimisation, and code generation

The three rule sets shown in figure 3.1—translation, optimisation, and code generation—in combination with the run-time support, serve as the specification of the compilation system. These can therefore be considered as the final outputs of the prototyping system:

translation rules will typically serve one of two possible roles: de-sugaring, i.e. the removal of any syntactic construct added solely for the purpose of improving programmability; and explicitly identifying the parallelism inherent in the source program. While the lexing and parsing of high-level languages is well understood [Watson, 1989; Peyton Jones, 1987], automatic parallelisation [Jones and Hudak, 1993; Burn, 1991] is still a major research topic—therefore, the front end of the compiler will not be discussed in this thesis.

optimisation rules form an important part of most modern compilers, and this is especially true of functional systems [Gill, 1992; Beemster, 1994]. Due to the similarity between the Core [Santos, 1995, section 2.2] and STG languages, all of the optimisation rules, heuristics, and algorithms presented by Santos [1995] have STG-language equivalents. However, in a parallel context, another major rule group is often required—these deal with architecture-specific optimisations, as illustrated by the skeletal transformations described in section 2.4.3.

compilation rules generate the low-level machine code, and therefore have to deal with such issues as register allocation [Fraser and Hanson, 1992; Boquist, 1995], heap representations [Shao and Appel, 1994], control flow [Bernstein, 1985], and stack frames [Douence and Fradet, 1995]. As noted by Shao and Appel [1995], it is important to take advantage of the information provided by the intermediate language’s static semantics (see section 4.5):

“Our measurement shows that a combination of several type-based optimisations reduces heap allocation by 36%; and improves the already-efficient code generated by the old non-type-based compiler by about 19%...”

(See chapter 8.)

the run-time system covers such sequential technology as garbage collection, optimised library routines, and error handling. In addition, certain parallel components are also necessary, including termination-detection algorithms, the implementation of any skeletal sub-systems, and interfaces to the communication network. (See chapter 8.)

3.4 Structuring the design

While not forming part of the final output, the various semantic models of the three phases, once developed, are arguably the most important components of the framework. Henderson [1986, section 7, page 249] notes that:
“A formal language can be effective as a tool for communication of designs on a larger scale and to suggest the way in which software design should proceed using formal methods.”

**the sequential semantics** two descriptions are used, with the first being a denotational semantics [Stoy, 1977], which assigns values to programs. This provides the reference model against which the compilation rules can be tested and the translation rules validated (assuming that the source language also has a denotational semantics). Furthermore, as outlined in section 5.4, the development of the semantics forces the designer to concentrate on a number of important issues, including the order and degree of evaluation, non-determinism, and run-time errors.

The second description, a Hindley–Milner type-inference algorithm, restricts the set of valid language expressions. This simplifies the compilation rules (as well as enabling a number of advanced optimisations [Shao and Appel, 1995]), and does away with the need for run-time type checking.

**the operational model** based on a state-transition system, the operational model provides a concise high-level description of the intended behaviour of the compilation rules. This approach has been widely used to develop a number of abstract machines, including both Tim and the STG machine. Having constructed the model, and tested it against the simpler denotational description, it should be easier to develop the actual compilation rules.

**the architecture simulator** provides the framework for testing both the correctness and performance of the compilation rules. Note that any generated results are potentially inaccurate, and should therefore only be used to motivate design choices.

**the STG-language prelude** offers a source of examples and test cases. Appendix B.1 includes some typical prelude definitions.

In addition to driving the design process, the development of the (semi-formal) models means that it is, in theory, possible to prove the correctness of the rules discussed in section 3.3. However, while important, this subject is beyond the scope of this thesis.
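The relationship between the denotational reference model and the state-transition model can be sketched on a toy language. The following is only an illustration under stated assumptions, not the STG' semantics: the language has just literals and addition, and the names `Expr`, `denote`, `Instr`, `step`, `run`, and `agrees` are hypothetical. The denotational function assigns a value directly, the transition function rewrites a machine configuration one step at a time, and `agrees` tests the operational model against the simpler description.

```haskell
-- Toy expression language (an assumption for illustration only).
data Expr = Lit Int | Add Expr Expr
  deriving (Show, Eq)

-- Denotational description: assigns a value to each expression.
denote :: Expr -> Int
denote (Lit n)   = n
denote (Add a b) = denote a + denote b

-- Operational description: a state-transition system over
-- (control stack, value stack) configurations.
data Instr = Eval Expr | Plus
  deriving (Show, Eq)

type State = ([Instr], [Int])

-- One transition; Nothing means that no rule applies.
step :: State -> Maybe State
step (Eval (Lit n)   : is, vs)         = Just (is, n : vs)
step (Eval (Add a b) : is, vs)         = Just (Eval a : Eval b : Plus : is, vs)
step (Plus           : is, y : x : vs) = Just (is, x + y : vs)
step _                                 = Nothing

-- Apply transitions until none applies, then read off the result.
run :: Expr -> Maybe Int
run e = go ([Eval e], [])
  where
    go s = case step s of
      Just s' -> go s'
      Nothing -> case s of
        ([], [v]) -> Just v
        _         -> Nothing

-- The operational model agrees with the reference model on e.
agrees :: Expr -> Bool
agrees e = run e == Just (denote e)
```

Checking `agrees` over a collection of test expressions is a form of informal validation; it does not amount to a proof of correctness.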

### 3.5 Animating the compiler

As the semantic descriptions are used to validate the compiler’s front and back ends, it is important that there is a degree of confidence in the models themselves. One possible solution to this problem is suggested by Henderson [1986, section 7, page 249]:

“The executable prototype introduces a realistic element of validation of the design sufficiently early in the development process that there is some likelihood of eventual cost saving due to the early determination of design flaws.”

Therefore, using the functional programming language Haskell [Hudak, Peyton Jones, Wadler and others, 1992], executable versions of the specifications are developed. The prescriptive approaches to the animation process are described in sections 4.3.5, 4.5.3, 4.7.4, and 4.8.10 – dealing with abstract syntax, Hindley–Milner type-inference rules, denotational semantics, and state-transition rules respectively.
3.6 Summary

Most parallel implementations of functional programming languages have at least one point of commonality: the use of an intermediate form. Typically, these abstract representations explicitly identify all parallel components but without the background noise of syntactic and (potentially arbitrary) implementation details. We suggest that this is a good point at which to start to draw comparisons, and the problem now becomes one of isolating and testing the effect of a particular design feature. To this end, this chapter outlined a framework for rapidly prototyping such intermediate languages. Based on the traditional three-phase compiler model, the design process is driven by the development of semantic descriptions of the source, intermediate, and target language (and architecture). Executable versions of the specifications help to both debug and informally validate these models.
Chapter 4

The sequential STG' language

4.1 Introduction

This chapter describes the STG' language, a variant of the Shared Term Graph (STG) language, in terms of its abstract and concrete syntax, denotational semantics, and operational semantics. A Hindley–Milner style type-inference algorithm is also presented, which serves to restrict the language and produces information useful to a compilation system.

The chapter starts with a discussion of the utility of a sequential language in a parallel prototyping system in section 4.2, and the abstract and concrete syntaxes are covered in sections 4.3 and 4.4. In section 4.5 the type-inference algorithm is presented, and the problem of how to record the resulting type annotations is addressed in section 4.6. The denotational and operational semantics of the language are then dealt with in sections 4.7 and 4.8, before the chapter is summarised in section 4.9.

4.2 Why use a sequential language?

Gelernter and Carriero [1992, page 97] state that a complete programming model consists of two orthogonal components, a computation model and a coordination model:

"The computation model allows programmers to build a single computational activity: a single-threaded, step-at-a-time computation. The coordination model is the glue that binds separate activities into an ensemble."

It follows that when developing a coordination model it will be necessary to couple it with an existing (sequential) computational model. However, Gelernter and Carriero argue for the complete separation of these two components on the grounds of portability and heterogeneity. In principle, they are correct, but in practice little is lost by creating a mixed-model language, and it is likely that there will be a performance gain due to the close coupling of the two.

Having decided that a computation model will be needed, the selection of the STG language over the other suitable candidates – Tim [Fairbairn and Wray, 1981; Chittnis, Satpathy and Obaroi, 1995], a continuation-passing system [Appel, 1992], functional quads [Traub, 1991], or the ABC machine [Plasmeijer and van Eekelen, 1993a] – has to be justified:

1. the STG language is “a very austere purely-functional language” [Peyton Jones, 1992, section 4] making the conversion from a high-level functional language to the intermediate form particularly simple. In addition, STG language expressions are concise, yet easy to read.

2. the STG-machine [Peyton Jones and Salkild, 1989] provides the language with an operational reading, the efficiency of which has been demonstrated by the performance of the Glasgow Haskell compiler in a recent benchmark test [Hartel, 1994].

3. there exists a large body of literature relating to the STG language, covering a wide range of topics, from semantics [Peyton Jones and Launchbury, 1991] to parallel implementation [Hill, 1993].

4. the Glasgow Haskell compiler is capable of dumping the STG language equivalent of a Haskell program, via the -ddump-stg command-line switch, providing a ready supply of example code.

4.3 Abstract syntax

The abstract syntax\(^1\) of the STG' language is given in figure 4.1, with the exception of identifiers, which are discussed in section 4.3.1 (appendix B presents a number of example STG' programs). The significant differences between the STG' and STG languages are:

- **introduction of algebraic data-type declarations** as the operational semantics made no use of the Haskell-style sum-of-products declarations, their definition was omitted from the original STG report. But, in order to develop both type-inference and compilation rules, such information is vital.

- **removal of named defaults from case expressions** as noted by Peyton Jones [1992, section 5, rule 8], the presence of named defaults complicates the operational semantics, and similar difficulties arose when developing the type-inference and compilation rules.

- **introduction of unboxed and strict let expressions** the let# and letstrict expressions compensate for the removal of the named defaults from the literal and algebraic case expressions respectively.

The minor changes include the removal of a number of extraneous symbols, some renaming, and the provision of the *bind* production. These have the net effect of simplifying the development and presentation of the syntax-driven algorithms described within this thesis.

Even though the STG-machine is not considered until chapter 6, the operational reading of the language expressions is given in table 4.1. Note that evaluation necessitates the creation of one or more continuations to which the resulting constructor, literal or primitive expressions will return.

Algebraic data types are discussed in section 4.3.2, case expressions in section 4.3.3, and the new letstrict and let# expressions in section 4.3.4. As for the other production rules, these are as presented by Peyton Jones [1992], to whom the interested reader is referred. Finally, the problem of animating the abstract syntax is dealt with in section 4.3.5.

\(^1\)The terminology used within this section is primarily based on that used by Watt [1991]
Figure 4.1: Abstract syntax of the STG' language

<table>
<thead>
<tr>
<th>Construct</th>
<th>Operational reading</th>
</tr>
</thead>
<tbody>
<tr>
<td>function application</td>
<td>tail call</td>
</tr>
<tr>
<td>let(rec) expression</td>
<td>heap allocation</td>
</tr>
<tr>
<td>let# expression</td>
<td>evaluation and register assignment</td>
</tr>
<tr>
<td>letstrict expression</td>
<td>evaluation and heap allocation</td>
</tr>
<tr>
<td>case expression</td>
<td>evaluation</td>
</tr>
<tr>
<td>constructor application</td>
<td>return to algebraic continuation</td>
</tr>
<tr>
<td>primitive application</td>
<td>return to continuation$^a$</td>
</tr>
<tr>
<td>literal expression and literal-variable application</td>
<td>return to literal continuation</td>
</tr>
</tbody>
</table>

$^a$Primitive functions will either return a literal or constructor value, depending upon the primitive in question.

Table 4.1: The operational reading of STG' language expressions
4.3.1 Identifiers

While the exact format of the different types of identifiers is left unspecified, the naming scheme is in line with Haskell’s policy [Hudak et al., 1992, pages 6-9]: variables, *var*, and type variables, \( \alpha \), are represented by identifiers beginning with lower-case letters; constructors, *cons*, and type constructors, \( \chi \), are either identifiers which start with a capital letter, or are a sequence of non-alphanumeric characters (\_:; +, and := are examples of this form); primitives, *primitive*, are similar to variables, except that the identifier ends with a hash symbol; finally, literals, *literal*, are represented by the usual constants (integers, floating-point numbers, ASCII characters etc.). The main exception is the representation of \( n \)-ary tuples, which are encoded as `Tup0`, `Tup2`, `Tup3` etc.

4.3.2 Algebraic data-type declarations

A data-type declaration [Hudak et al., 1992, pages 27-28] defines a new sum-of-products type, consisting of one or more constructors. The following example defines booleans, lists and trees:

```
STG' code
_____________________________
data Bool = True | False ;
data List a = Nil | Cons a (List a) ;
data Tree a = Leaf a | Branch (Tree a) (Tree a) ;
```

Using these declarations it is possible to define enumerated, recursive and (polymorphic) composite types [Bird and Wadler, 1988, pages 204-219].

4.3.3 Named defaults and case expressions

In the original STG language, algebraic case expressions containing named defaults served two distinct roles. Firstly, they provided a way of avoiding allocation of the result of the scrutinised expression whenever the result matched any of the non-default alternatives. In the following example, \( r \) and \( r' \) compute identical values, but \( r \) will not create a closure for the value of \( (f \ a) \) if it matches the pattern \( S \ x \):

```
STG code
_____________________________
r = \[a] -> case f a of { S x -> g x ;
t -> h t } ;
r' = \[a] -> let { result = \[u] \[a] -> f a ; } in
      case result of { S x -> g x ;
                      _ -> h result } ; {- simple default -}
```

Secondly, they allow a value to be forced to head-normal form, presumably encoding the result of a strictness-analysis phase:

```
STG code
_____________________________
enumFromTo = \[n m] -> case enumFrom n of {
                        n_to_inf -> let { predicate = ... } in
                                      {- named default -}
                                      takeWhile predicate n_to_inf ;
```

One case expression may use the named default for both of these purposes. However, analysis of the STG representation of the nofib benchmark suite (see appendix C) has shown that named defaults are only ever used to encode strictness information. Therefore, in practice, there are just two distinct uses of the algebraic case expression: alternative selection based on the de-construction of the value of the scrutinised expression; and forced evaluation combined with heap allocation of the result. The second usage bears more resemblance to variable binding than to selection. Moreover, supporting both types of behaviour leads to complications in the operational semantics [Peyton Jones, 1992, section 5, rule 8], type-inference rules and compilation system. For example, it would be difficult to develop a concise type rule which rejected the following function definition:

```
seq = [] \r [x y] -> case x of {x' -> y x'};
```

For these reasons, named defaults were removed from algebraic case expressions, with the new letstrict expression taking over the role of strictness encoding (see the following section).

The use of named defaults in literal case expressions is slightly simpler, as unboxed values are never directly heap allocated. Based on analysis of the nofib benchmark suite, named defaults are always unaccompanied and serve to bind the result of a computation to a variable, as illustrated below:

```
... case minusInt# [x', y'] of
  { xy -> case plusInt# [xy, 1#] of
      { xy' -> ... expression using xy and xy' ... }       
     }                                            
  }                                              
```

To keep the case expression symmetrical, named defaults were also removed from the literal version (let# binds literal expressions to variables.)

### 4.3.4 Unboxed and strict let expressions

Having removed named defaults from both literal and algebraic case expressions, it became necessary to determine if any important functionality had been lost. The answer, in the case of literal defaults, was a definite yes - there was now no way to bind a temporary literal value to a variable (short of defining a new function whose arguments were the value plus the free variables of the remaining computation). To this end the let# expression was introduced, the use of which is illustrated below:

```
... let# xy = minusInt# [x', y'] in
  let# xy' = plusInt# [xy, 1#] in ... expression using xy and xy' ...
```

The right-hand side expression must evaluate to one of the primitive unboxed types.

With respect to algebraic defaults, the situation is less straightforward as it is still possible to achieve the same results through the use of an additional let expression (see the previous section for an example). Yet the named algebraic default accounted for approximately seven percent of the total dynamic bindings (i.e. any variable that is introduced by a let(rec) expression or named algebraic default) of the optimised nofib benchmark suite. The letstrict expression was therefore introduced, as shown here:
The right-hand expression must evaluate to an algebraic data type (see section 4.5.1).

### 4.3.5 Animating the abstract syntax

Most of the algorithms and transition rules used during prototyping are syntax driven, so the animation of the abstract syntax is arguably the most important aspect of the entire system. Fortunately, using Haskell's algebraic data types and type synonyms, the task is a simple one.

For each group of production rules a new data type is created, and each production rule within the group becomes a constructor of the new type. So, for example, the unboxed-type production rules \( \nu \rightarrow \text{Int} \mid \ldots \mid \text{Float} \) translate to:

```haskell
data UnboxedType = UnboxedInt | ... | UnboxedFloat
```

The choice of type and constructor names should reflect the group and individual production rules respectively, but some mangling may be necessary to arrive at a unique name (a restriction of the Haskell language).

In general, a constructor will have one argument type for each constituent non-terminal symbol, unless there are a variable number of the same symbol, in which case a `List` type is used. This is illustrated by the boxed-type rules (\( \pi \rightarrow \alpha \mid \tau_1 \rightarrow \tau_2 \mid \chi \pi_1 \ldots \pi_n \)):

```haskell
data BoxedType = BoxedVar TypeVariable |
    BoxedFun MonoType MonoType |
    BoxedCon Constructor [BoxedType]
```

In addition to the non-terminal symbols, extra arguments may be added to a constructor to facilitate parts of the prototyping system. A case in point is the `exp` group of rules, to each of which has been added an `ExpressionId` field, providing a unique key with which to look up expression-specific information (see section 4.6):

```haskell
data Expression = Let ExpressionId Bindings Expression | ... |
    Case ExpressionId Expression Alts (Maybe Default) | ... |
    Value ExpressionId Literal
```

The previous example also illustrates the use of the `Maybe` type to represent optional non-terminal symbols, such as the `default` symbol in a `case` expression. The data declaration for this type is: `data Maybe a = Just a | Nothing`.

Variables, constructors and all other identifiers are represented using the `String` synonym.
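The conventions above can be collected into a small compilable fragment. This is a sketch under stated assumptions rather than the thesis's actual definitions: `Binding`, `Alt`, `Default`, `App`, the two `Literal` constructors, and the choice of `Int` for `ExpressionId` are hypothetical simplifications, and only four expression forms are included. The function `expressionIds` shows the kind of syntax-driven traversal that such a representation supports.

```haskell
-- All identifiers are strings, following the synonym convention.
type Variable     = String
type Constructor  = String
type ExpressionId = Int      -- assumed representation of the unique key

-- Simplified stand-ins; the real production rules are richer.
data Literal = IntLit Int | CharLit Char
  deriving (Show, Eq)

data Binding = Binding Variable Expression
  deriving (Show, Eq)

data Alt = Alt Constructor [Variable] Expression
  deriving (Show, Eq)

newtype Default = Default Expression
  deriving (Show, Eq)

-- One constructor per production rule, each carrying an ExpressionId.
data Expression
  = Let   ExpressionId [Binding] Expression
  | Case  ExpressionId Expression [Alt] (Maybe Default)
  | App   ExpressionId Variable [Variable]
  | Value ExpressionId Literal
  deriving (Show, Eq)

-- A typical syntax-driven traversal: collect every expression key.
expressionIds :: Expression -> [ExpressionId]
expressionIds (Let i bs body) =
  i : concat [expressionIds rhs | Binding _ rhs <- bs] ++ expressionIds body
expressionIds (Case i scrut alts def) =
  i : expressionIds scrut
    ++ concat [expressionIds rhs | Alt _ _ rhs <- alts]
    ++ maybe [] (\(Default d) -> expressionIds d) def
expressionIds (App i _ _) = [i]
expressionIds (Value i _) = [i]
```

Because every expression form carries its key as the first field, a traversal of this kind can be used to build or consult a table of expression-specific information without changing the syntax tree itself.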

### 4.4 Concrete syntax

It may seem strange that an intermediate language would ever make use of a concrete syntax, but such a representation is useful in two important situations: when reporting an error; and during testing, where parsing a textual description of an STG' program is quicker, more convenient, and less prone to error than hand coding the Haskell representation. That said, the exact details of the concrete syntax used are not important, and only the input and output routines are considered here. To simplify the parser, keywords, such as `let` and `of`, are not allowed to be used as variable names; a production compiler may well lift this restriction.

With regard to the conversion of the abstract syntax to text, the simplest solution would be the use of derived instances of the `Text` type class [Hudak et al., 1992, pages 147–148] for each of the production-rule data types. Unfortunately, the resulting output is awkward and does not match the format used by the parser. Hand coding is the only alternative:

```
Haskell

lambdaformShow (LambdaForm free_vars uflag args exp)
    = "[" ++ free_vars' ++ "] " ++ uflag' ++ " [" ++ args' ++ "] -> " ++ exp'
    where
      free_vars' = variablesShow free_vars
      uflag'     = updateflagShow uflag
      args'      = variablesShow args
      exp'       = expressionShow exp
```
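For readers who wish to run the display routine, the following self-contained sketch supplies plausible helper definitions. Everything other than the name and overall shape of `lambdaformShow` is an assumption made for illustration: the `UpdateFlag` constructors and their spellings `\u` and `\r`, the space-separated variable lists, and the single-constructor `Expression` type are hypothetical and need not match the prototyping system.

```haskell
type Variable = String

-- Assumed flag set; the concrete spellings "\u" and "\r" are illustrative.
data UpdateFlag = Updatable | ReEntrant
  deriving (Show, Eq)

-- Minimal expression type so that the example is self-contained.
data Expression = Apply Variable [Variable]
  deriving (Show, Eq)

data LambdaForm = LambdaForm [Variable] UpdateFlag [Variable] Expression
  deriving (Show, Eq)

-- Variables are separated by single spaces (an assumption).
variablesShow :: [Variable] -> String
variablesShow = unwords

updateflagShow :: UpdateFlag -> String
updateflagShow Updatable = "\\u"
updateflagShow ReEntrant = "\\r"

expressionShow :: Expression -> String
expressionShow (Apply f args) = unwords (f : args)

-- Free variables, update flag, arguments, arrow, then the body.
lambdaformShow :: LambdaForm -> String
lambdaformShow (LambdaForm free_vars uflag args body) =
  "[" ++ variablesShow free_vars ++ "] " ++ updateflagShow uflag
      ++ " [" ++ variablesShow args ++ "] -> " ++ expressionShow body
```

Under these assumptions, `lambdaformShow (LambdaForm [] ReEntrant ["x", "y"] (Apply "y" ["x"]))` yields the text `[] \r [x y] -> y x`.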

Developing a robust parser is more taxing, but most of the complexity can be avoided by using Happy [Gill and Marlow, 1993]:

"Happy is a parser generator system for Haskell, similar to the tool ‘yacc’ for C. Like ‘yacc’, it takes a file containing an annotated BNF specification of a grammar and produces a Haskell module containing a parser for the grammar."

As an example, here is the Happy specification of the `lambda_form` production rule (of the concrete syntax):

```
Happy

LambdaForm :: { LambdaForm }
LambdaForm : [ Arguments ] UpdateFlag [ Arguments ] right_arrow Expression
                 { LambdaForm $2 $4 $6 $9 }
```

Notice the symmetry between this rule and the previous display routine.

4.5 Language restrictions, type inference and free variables

In this section, the problem of restricting STG' language programs to ensure the validity of the STG machine is addressed. Section 4.5.1 enumerates the required restraints and illustrates the need for a type-inference system. The advantages and limitations of static typing are then outlined in section 4.5.2, while section 4.5.3 outlines a Hindley–Milner style type system for the STG' language. Finally, an algorithm for generating the free-variable annotations of a binding (`var = lambda_form`) is presented in section 4.5.4.

4.5.1 The STG language and the STG machine

To simplify the design of the STG machine, and thereby improve its efficiency, Peyton Jones [1992, section 4] and Peyton Jones and Launchbury [1991, section 7.1] explicitly placed the following (informal) restrictions on STG language programs:

1. global (top-level) bindings, and `let` and `letrec` expressions cannot bind a variable of unboxed type.
2. all constructors and primitives are saturated (have the correct number of arguments).
3. polymorphic functions cannot manipulate unboxed values.

Also, Beemster [1994] has shown that the STG machine cannot force the evaluation of a partial application due to its aggressive take (the method by which a function's arguments are fetched, as embodied by rules 17 and 17a in section 5.6 of the STG report). Essentially, the following expression will not terminate under the STG machine:

```
  let \x y -> ...
  in case f x y of
  f' -> ...
```  

This leads to the following additional limitation:

4. **case** expressions can only scrutinise values whose type is either unboxed or algebraic (section 6.3.1 shows how this restriction may be removed).

Finally, three additional requirements are needed:

5. the top-level variable `main` is defined, and is bound to an expression of type `Dialogue`.

6. the operational decorations (i.e. update flags and free-variable information) are correct.

7. all of the patterns from a **case** expression's alternatives are unique, i.e. only one alternative will ever be applicable for any given result.
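Two of the restrictions listed above are purely syntactic and can be checked without type inference. The following sketch illustrates the second and seventh restrictions only; the names `Arities`, `saturated`, and `uniquePatterns` are hypothetical, and the arity table is assumed to have been collected from the data-type declarations beforehand.

```haskell
import Data.List (nub)

type Constructor = String

-- Assumed arity table: constructor name paired with its arity.
type Arities = [(Constructor, Int)]

-- Restriction 2: a constructor application is saturated when the number
-- of arguments equals the declared arity. Unknown constructors fail.
saturated :: Arities -> Constructor -> [String] -> Bool
saturated arities con args = lookup con arities == Just (length args)

-- Restriction 7: the patterns of a case expression's alternatives are
-- unique, approximated here by distinct head constructors.
uniquePatterns :: [Constructor] -> Bool
uniquePatterns cons = length (nub cons) == length cons
```

For instance, with the table `[("Nil", 0), ("Cons", 2)]`, an application of `Cons` to a single argument is rejected by `saturated`, and a case expression with two `Nil` alternatives is rejected by `uniquePatterns`.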

Looking at the first five rules, it is clear that detailed type information is required if the validity of a program is to be verified. There are two possible sources for this data: *type-inference*, the attributes are automatically derived using a system of type rules; and *type annotations*, the abstract syntax is extended to include type information, thereby delegating responsibility for type inference to the previous stage (conversion from the source language to the intermediate form). The latter is the approach adopted by the Glasgow Haskell compiler, although the type information is recorded in a database similar to that described in section 4.6.

Whichever approach is taken, some form of type inference will be required. For the sake of generality, it was decided to frame this problem in the context of the STG' language. The traditional approach to typing in a functional programming language is to use a Hindley–Milner style algorithm [Milner, 1978; Damas and Milner, 1982] – both ML and Haskell employ this technique and this is also the solution adopted for the STG' language.

With regard to the sixth requirement, section 4.5.4 presents an algorithm for checking or generating the free-variable information, while section 4.2 of the STG report discusses the problem of setting the update flag.

### 4.5.2 The advantages of static typing

A language is said to be *statically typed* [Schmidt, 1994, page 6] if a type inference algorithm exists which can calculate the type attributes of a program without evaluating the program. As the algorithm presented in section 4.5.3 is purely syntax driven, the STG' language is statically typed. The principal benefits of static typing are improved debugging, and an increase in the number of optimisations that can be performed by a compilation system [Shao and Appel, 1995; Hall, 1994; Gill and Peyton Jones, 1994]. It is the latter property that is of primary importance in the context of the prototyping system. The primary drawback of adopting a static system is that it becomes difficult, if not impossible, to use the intermediate form as a target for dynamically-typed source languages. The majority of modern functional programming languages are statically typed [Hudak et al., 1992; Harper et al., 1989], so this becomes an acceptable limitation. Indeed, Cardelli and Wegner [1985, page 474] suggest that

“In general, we should strive for strong typing and adopt static typing whenever
possible.”

4.5.3 Hindley–Milner type inference for the STG’ language

The type-inference system presented in this section is based on the work of Peyton Jones and Wadler [1992], which, in turn, is based on the Hindley–Milner algorithm [Milner, 1978; Damas and Milner, 1982]. Both ML and Haskell use variants of this algorithm. For both an overview of this technique and a discussion of alternative approaches, the following are highly recommended: [Reynolds, 1985; Schmidt, 1994; Cardelli and Wegner, 1985].

Note that no attempt is made to relate the type algorithm to any of the other semantic
descriptions, neither is it proved that the algorithm assigns the most general type to an
expression.

Limitations of the inference rules

As a side effect of using the algorithm outlined in this section, the following additional
language restrictions are imposed:

8. a variable bound by a letrec expression must have the same type for all occurrences
   in the right-hand sides of the bindings.

9. lambda-bound and pattern-defined variables must take the same type for all occurrences in the body of the function or algebraic alternative.

Such limitations are common, as typified by Haskell’s monomorphism restriction [Hudak
et al., 1992, pages 40–41]. A number of attempts to remove these restrictions [Kfoury,
Tiuryn and Urzyczyn, 1993; Henglein, 1993] have met with limited success, as the problem
is, in general, undecidable.

Terminology

The notation adopted here is based on that used by Peyton Jones and Wadler [1992] and
is only briefly introduced here, as a full account is given in appendix D.

The abstract syntax of types, part of which was included in figure 4.1, is shown in figure 4.2. Following the usual conventions, function types of the form \( \tau_1 \rightarrow (\tau_2 \rightarrow (\cdots \rightarrow \tau_n) \cdots) \) will be written as \( \tau_1 \rightarrow \tau_2 \rightarrow \cdots \rightarrow \tau_n \), and brackets will only be used if one of the argument types is itself a function type.

An environment is a finite mapping, usually from identifiers to types, either explicitly constructed, e.g. \( \{ var_1 \mapsto \tau_1, \ldots, var_n \mapsto \tau_n \} \), or created by combining two existing environments. Two forms of merge operations are used: \( env_1 \oplus env_2 \), which is only defined if the domains are distinct; and \( env_1 \overrightarrow{\oplus} env_2 \), where an identifier will take its value from the second environment if it is defined by both. An identifier’s value is retrieved by treating the mapping as a set of tuples and testing for membership, i.e. \((id, value) \in env\).
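The two merge operations and the membership test can be sketched in Haskell as follows. This is an illustration only: the association-list representation and the names `Env`, `mergeDisjoint`, `mergeOverride`, and `bound` are assumptions, and the partiality of the first merge is modelled with `Maybe`.

```haskell
-- An environment as an association list (assumed representation).
type Env k v = [(k, v)]

-- Disjoint merge: defined only when the two domains do not overlap.
mergeDisjoint :: Eq k => Env k v -> Env k v -> Maybe (Env k v)
mergeDisjoint e1 e2
  | any (\(k, _) -> k `elem` map fst e2) e1 = Nothing
  | otherwise                               = Just (e1 ++ e2)

-- Overriding merge: bindings in the second environment take precedence.
mergeOverride :: Eq k => Env k v -> Env k v -> Env k v
mergeOverride e1 e2 = e2 ++ [(k, v) | (k, v) <- e1, k `notElem` map fst e2]

-- Retrieval by membership, mirroring the test "(id, value) in env".
bound :: (Eq k, Eq v) => k -> v -> Env k v -> Bool
bound k v env = (k, v) `elem` env
```

A production implementation would more likely use a balanced-tree map; the list form is chosen here only to keep the sketch free of library dependencies.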

| Category | Production | Description |
|---|---|---|
| Polytype | $\sigma \rightarrow \forall \alpha_1 \ldots \alpha_n. \tau$ | type signature |
| | $\mid \tau$ | simple type |
| Monotype | $\tau \rightarrow \pi$ | boxed type |
| | $\mid \nu$ | unboxed type |
| Boxed type | $\pi \rightarrow \alpha$ | type variable |
| | $\mid \tau_1 \rightarrow \tau_2$ | function type |
| | $\mid \chi \pi_1 \ldots \pi_n$ | parameterised data type |
| Unboxed type | $\nu \rightarrow \text{Int\#}$ | integer |
| | $\mid \text{Float\#}$ | floating-point number |
| | $\mid \text{Char\#}$ | character |

Figure 4.2: Abstract syntax of types

<table>
<thead>
<tr>
<th>Environment</th>
<th>Notation</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>constructor environment</td>
<td>$CE$</td>
<td>$\text{cons} \mapsto (n, \sigma)$</td>
</tr>
<tr>
<td>primitive environment</td>
<td>$PE$</td>
<td>$\text{primitive} \mapsto (n, \sigma)$</td>
</tr>
<tr>
<td>general variable environment</td>
<td>$GVE$</td>
<td>$\text{var} \mapsto \sigma$</td>
</tr>
<tr>
<td>local variable environment</td>
<td>$LVE$</td>
<td>$\text{var} \mapsto \tau$</td>
</tr>
<tr>
<td>type-constructor environment</td>
<td>$TCE$</td>
<td>$\chi \mapsto (n_{\alpha}, n_{\text{cons}}, (\text{cons}_1, \ldots, \text{cons}_{n_{\text{cons}}}))$</td>
</tr>
<tr>
<td>total environment</td>
<td>$TE$</td>
<td>$(CE, PE, GVE, LVE)$</td>
</tr>
</tbody>
</table>

Table 4.2: Summary of the environments used during type inference

The environments used by the type rules are summarised in table 4.2, where: $n$ is the arity of either a function or a constructor; $n_{\alpha}$ is the number of type variables needed to saturate an algebraic type; $n_{\text{cons}}$ is the number of constructors; and $(\text{cons}_1, \ldots, \text{cons}_{n_{\text{cons}}})$ are the constructors themselves.
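The environment operations described earlier in this section can be sketched in Haskell using `Data.Map`. This is only a minimal illustration: the names `Env`, `mergeDisjoint`, `mergeOverride` and `member` are assumptions, and the thesis's own environment library may well differ.

```haskell
import qualified Data.Map as Map

-- An environment is a finite mapping from keys (usually identifiers) to values.
type Env k v = Map.Map k v

-- Merge that is only defined when the two domains are disjoint.
mergeDisjoint :: Ord k => Env k v -> Env k v -> Maybe (Env k v)
mergeDisjoint e1 e2
  | Map.null (Map.intersection e1 e2) = Just (Map.union e1 e2)
  | otherwise                         = Nothing

-- Merge in which the second environment takes precedence on shared keys.
mergeOverride :: Ord k => Env k v -> Env k v -> Env k v
mergeOverride e1 e2 = Map.union e2 e1 -- Map.union is left-biased

-- Lookup phrased as membership of an (identifier, value) pair.
member :: (Ord k, Eq v) => (k, v) -> Env k v -> Bool
member (k, v) env = Map.lookup k env == Just v
```

Returning `Maybe` from `mergeDisjoint` makes the partiality of the first merge operation explicit, rather than leaving it as a run-time error.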

**Algorithm overview**

As all of the rules are included in appendix D, only a brief overview of the algorithm is given here.

**initial rule** the PROGRAM rule serves as the starting point of the algorithm, a simplified version of which is shown in figure 4.3. Notice how the primitive environment, $PE$, is passed as an argument to the inference algorithm, with figure 4.4 providing some example entries. The algorithm proceeds as follows:

1. generate the constructor environment, $CE$.
2. initialise the total environment, $TE$.
3. infer the type of the top-level definitions, treating the entire program as one large letrec expression.
4. ensure that the fourth restriction presented in section 4.5.1 is met.

**constructor-environment generation** apart from checking that the definitions are well formed, the main purpose of the type-declaration rules is to produce the constructor environment. This is primarily achieved by the CONDECL rule, shown in figure 4.5, which:

1. verifies that each of the constructor's argument types is well formed.
2. using the function type as a convenient representation, generates a polytype description of the constructor.
3. generates an environment whose sole entry associates the constructor with the description from step 2.

**bindings** top-level definitions and let(rec) expressions are the only way to introduce variables with polymorphic type signatures, as illustrated by the BINDS rule shown in figure 4.6. In general, the type of the right-hand side is first inferred (step 1) and then generalised (step 2), and the resulting signature added to the general variable environment, $GVE$ (step 3).

**expressions** this group of rules forms the heart of the algorithm, with each rule deriving the monotype associated with one of the expression constructs. As an example, the CONS-EXP rule, shown in figure 4.7, proceeds as follows:

1. look up the constructor’s type signature and arity, \( n \), in the constructor environment, \( CE \).
2. create a fresh instance of the polytype.
3. match the inferred types of the arguments with the monotype from step 2. The number of arguments must also match the arity, \( n \), from step 1, thereby satisfying the second restriction presented in section 4.5.1.

**generalisation and specialisation** as the previous examples illustrate, the \( GEN \) and \( SPEC \) rules are used to convert monotypes to polytypes and vice versa. The *generic instance* of a monotype \( \tau \) is \( \forall \alpha_1 \ldots \alpha_n.\tau \), where each \( \alpha_i \) is a free type variable of \( \tau \) that does not occur free in the current environment, i.e. one that the environment leaves unconstrained. Similarly, an *instance* of a type signature is simply the right-hand side monotype with all occurrences of the quantified type variables replaced with fresh ones.
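The two conversions can be sketched as follows, using a deliberately small type language rather than the full syntax of figure 4.2. The sketch follows the usual Hindley–Milner convention of quantifying only those variables that do not occur free in the environment; all names, and the use of an integer counter as the name supply, are assumptions made for illustration.

```haskell
import Data.List (nub, (\\))

-- A toy monotype language (hypothetical, not the thesis's MonoType).
data Type = TVar String | TCon String [Type]
  deriving (Eq, Show)

-- A type signature: quantified variables plus the right-hand side monotype.
data Scheme = Forall [String] Type
  deriving (Eq, Show)

-- The type variables occurring in a monotype, without duplicates.
freeVars :: Type -> [String]
freeVars (TVar a)    = [a]
freeVars (TCon _ ts) = nub (concatMap freeVars ts)

-- GEN: quantify the variables of the type that the environment does not
-- mention; the environment is represented only by its free type variables.
generalise :: [String] -> Type -> Scheme
generalise envVars t = Forall (freeVars t \\ envVars) t

-- SPEC: replace each quantified variable with a fresh one drawn from a
-- counter, returning the updated counter alongside the instance.
specialise :: Scheme -> Int -> (Type, Int)
specialise (Forall vars t) supply = (rename t, supply + length vars)
  where
    fresh = zip vars [TVar ("t" ++ show n) | n <- [supply ..]]
    rename (TVar a)    = maybe (TVar a) id (lookup a fresh)
    rename (TCon c ts) = TCon c (map rename ts)
```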

**unification** Robinson’s unification algorithm [Robinson, 1965] is used to determine if two monotypes are compatible (see step 3 of the \( CONS-EXP \) type rule for an example of where unification is used). If unification succeeds, the algorithm returns the most general type that matches both of the arguments, as well as a set of substitutions. These substitutions represent the restrictions (on free type variables) that have had to be made in order to resolve the two types, and they must be applied to the current environment to ensure consistency. For this application, unification fails if a substitution of the form \( \alpha \mapsto \ldots \alpha \ldots \) would be required to unify the two types. This test is commonly referred to as the *occurs check*, and it prevents the introduction of infinite types.
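A minimal sketch of Robinson-style unification with the occurs check is given below. It is written against a toy type language and returns only the substitution (the unified type can be recovered by applying the result to either argument); the names `Type`, `Subst`, `apply`, `unify` and `fun` are assumptions and do not correspond to the thesis's `monotypesUnify`.

```haskell
import qualified Data.Map as Map

-- A toy monotype language (hypothetical, not the thesis's MonoType).
data Type = TVar String | TCon String [Type]
  deriving (Eq, Show)

type Subst = Map.Map String Type

-- Convenience constructor for function types.
fun :: Type -> Type -> Type
fun a b = TCon "->" [a, b]

-- Apply a substitution, following chains of variable bindings.
apply :: Subst -> Type -> Type
apply s t@(TVar a)  = maybe t (apply s) (Map.lookup a s)
apply s (TCon c ts) = TCon c (map (apply s) ts)

-- Does the variable occur anywhere inside the type?
occurs :: String -> Type -> Bool
occurs a (TVar b)    = a == b
occurs a (TCon _ ts) = any (occurs a) ts

-- Unify two types, extending the substitution built so far.
unify :: Type -> Type -> Subst -> Maybe Subst
unify t1 t2 s = go (apply s t1) (apply s t2)
  where
    go (TVar a) (TVar b) | a == b = Just s
    go (TVar a) t = bind a t
    go t (TVar a) = bind a t
    go (TCon c ts) (TCon d us)
      | c == d && length ts == length us = unifyAll (zip ts us) s
      | otherwise                        = Nothing
    bind a t
      | occurs a t = Nothing -- occurs check: would create an infinite type
      | otherwise  = Just (Map.insert a t s)

-- Unify a list of pairs from left to right, threading the substitution.
unifyAll :: [(Type, Type)] -> Subst -> Maybe Subst
unifyAll []              s = Just s
unifyAll ((t, u) : rest) s = unify t u s >>= unifyAll rest
```

For example, unifying `fun (TVar "a") (TCon "Int" [])` with `fun (TCon "Bool" []) (TVar "b")` binds `a` to `Bool` and `b` to `Int`, whereas unifying `TVar "a"` with `fun (TVar "a") (TVar "a")` fails the occurs check.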

**Animating the algorithm**

As the development of Hindley–Milner style algorithms using functional programming languages is well documented in the literature [Hancock, 1987; Peyton Jones and Lester,

Making use of the syntax-driven nature of the rules, the first step of the process is to construct type signatures for each of the rule groups. There are three general forms that the signatures can take, with examples of each being shown below:

<table>
<thead>
<tr>
<th align="left">Haskell</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">atomInferType :: Atom -&gt; TypeState -&gt; MonoType</td>
</tr>
<tr>
<td align="left">expressionInferType :: Expression -&gt; TypeState -&gt; (MonoType, TypeState)</td>
</tr>
<tr>
<td align="left">bindingsInferType :: Bindings -&gt; TypeState -&gt; (GeneralVariableEnv, TypeState)</td>
</tr>
</tbody>
</table>

where TypeState is the Haskell representation of the total environment, TE, with the addition of some miscellaneous extras, such as a unique name supply for specialising polymorphic types. Similarly, GeneralVariableEnv corresponds to the general variable environment, GVE, and MonoType to the abstract syntax of monotypes (see figure 4.2). The definition of these types is not discussed here, but the general technique for doing so is illustrated in section 4.8.10.

Each function is made up of a series of pattern matched definitions, with one branch for every constructor associated with the primary data type:

<table>
<thead>
<tr>
<th align="left">Haskell</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">expressionInferType (Let exp_id binds exp) type_state = ...</td>
</tr>
<tr>
<td align="left">expressionInferType (Value exp_id literal) type_state = ...</td>
</tr>
</tbody>
</table>

The body of the definition will depend upon the rule it implements, and, as an example, the Haskell implementation of the CONS-EXP rule (see figure 4.7) is given below:

<table>
<thead>
<tr>
<th align="left">Haskell</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">expressionInferType (Cons exp_id cons atoms) type_state</td>
</tr>
<tr>
<td align="left">| (envIsDefined cons constructor_env) &amp;&amp; (arity == length atoms)</td>
</tr>
<tr>
<td align="left">= (substApply subst result_type, substApplyToEnv subst type_state2)</td>
</tr>
<tr>
<td align="left">where</td>
</tr>
<tr>
<td align="left">constructor_env = typestateGetConsEnv type_state</td>
</tr>
<tr>
<td align="left">(arity, polytype) = envGet cons constructor_env</td>
</tr>
<tr>
<td align="left">(monotype, type_state1) = polytypeSpecialise polytype type_state</td>
</tr>
<tr>
<td align="left">(arg_types, result_type) = monotypeSplitFun monotype</td>
</tr>
<tr>
<td align="left">(atom_types, type_state2) = atomsInferType atoms type_state1</td>
</tr>
<tr>
<td align="left">OK subst = monotypesUnify arg_types atom_types</td>
</tr>
</tbody>
</table>

The order of the definitions closely follows that of the original rule: envGet is a library function and its use corresponds to the first step of the rule, i.e. (cons, (n, σ)) ∈ CE; while polytypeSpecialise and atomsInferType are themselves type rules, and complete the second and third steps respectively. Of the remaining expressions, only monotypesUnify is of significance, as it ensures that the types of the atoms match those specified by the constructor’s declaration.

The guard expression checks that the constructor is defined within the constructor environment, CE, and that it is also fully saturated (see restriction 2 in section 4.5.1). These are both implicit conditions of the CONS-EXP rule.

4.5.4 Free variables

From an operational perspective, free-variable information is essential whenever either a closure or a continuation has to be created – it identifies which variables are live and need
to be saved so that the computation can be re-started (either when the closure's value is demanded, or when the continuation is returned to). The STG' language's lambda-form annotations cover the first usage, but the second is unsupported. Moreover, incorrect annotations will lead to problems; an algorithm is therefore needed which can both generate the missing data and check the decorations.

By nature, free-variable algorithms are simple [Peyton Jones, 1987, page 14], and the only complication to developing an algorithm for the STG' language is the restriction imposed by Peyton Jones [1992, section 4.1.2]: free variables should not include any variables bound at the top level of the program. The solution is to pass the top-level variables as an argument to all of the algorithm's rules, as illustrated by the $\mathcal{FV}_{\text{program}}$ rule:

$$
\mathcal{FV}_{\text{program}} \left[ \begin{array}{l}
\text{var}_1 = \text{lambd}_1 \\
\vdots \\
\text{var}_n = \text{lambd}_n
\end{array} \right] = \begin{cases}
\{\} & \text{(definition)} \\
\bigcup_{i \leq n} \mathcal{FV}_{\text{lambd}} \left[ \text{lambd}_i \right] \{\text{var}_1, \ldots, \text{var}_n\} & \text{(derived)}
\end{cases}
$$

To ensure that the language's lexical scoping rules are followed, the set of global variables, $g$, has to be trimmed whenever a new variable is defined, as done by the rule handling algebraic alternatives:

$$
\mathcal{FV}_{\text{alt}}[\text{cons } \text{var}_1 \ldots \text{var}_n \to \text{exp}] \ g = \mathcal{FV}_{\text{exp}}[\text{exp}] \ g' \setminus \text{vars\_bound}
$$

where $g' = g \setminus \text{vars\_bound}$
and $\text{vars\_bound} = \{\text{var}_1, \ldots, \text{var}_n\}$

The animation process is straightforward, as shown by the Haskell implementation of the previous rule:

```haskell
import Data.List ((\\))

-- Assumes Variables is a list type, so that (\\) is list difference.
altFreeVars :: AlgebraicAlt -> Variables -> Variables
altFreeVars (AlgebraicAlt cons vars_bound exp) globals
  = expressionFreeVars exp globals' \\ vars_bound
  where globals' = globals \\ vars_bound
```

The complete set of rules is presented in appendix E.
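To show how the global set is threaded and trimmed across several rules, the following self-contained sketch applies the same idea to a toy expression language. The syntax is a hypothetical fragment, far smaller than the STG' language, and the helper `minus` stands in for set difference on duplicate-free lists.

```haskell
import Data.List (nub)

type Var = String

-- A toy expression language (hypothetical; not the STG' abstract syntax).
data Exp
  = Var Var
  | App Var [Var]
  | Let Var Exp Exp -- non-recursive: the variable scopes over the body only
  | Case Exp [Alt]

data Alt = Alt String [Var] Exp

-- Difference on lists used as sets.
minus :: [Var] -> [Var] -> [Var]
minus xs ys = filter (`notElem` ys) xs

-- Free variables of an expression, never reporting the given globals.
expFreeVars :: Exp -> [Var] -> [Var]
expFreeVars (Var v) globals      = [v] `minus` globals
expFreeVars (App f args) globals = nub (f : args) `minus` globals
expFreeVars (Let v rhs body) globals =
  nub (expFreeVars rhs globals ++ (expFreeVars body globals' `minus` [v]))
  where globals' = globals `minus` [v] -- a local v shadows a global v
expFreeVars (Case scrut alts) globals =
  nub (expFreeVars scrut globals ++ concatMap (`altFreeVars` globals) alts)

-- The rule for algebraic alternatives, as in the definition above.
altFreeVars :: Alt -> [Var] -> [Var]
altFreeVars (Alt _ bound body) globals =
  expFreeVars body globals' `minus` bound
  where globals' = globals `minus` bound
```

Trimming the globals before descending matters when a pattern variable shadows a top-level name: without it, an occurrence of the shadowed name in a nested construct would be wrongly classified as global.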

4.6 Annotations are not enough?

The previous section presented two algorithms, both of which generate information that may be of use to a compiler. This section addresses the problem of encoding this data, for which there are two obvious solutions:

**extend the abstract syntax** this is the approach used to record the free variables and update flag of a lambda-form. When considering the amount of generated data, it becomes clear that this method is not a general solution, as the language constructs would quickly be obscured by operational annotations. Furthermore, each algorithm would have to return a modified version of the original program, complicating all aspects of the system.

**use an attribute database** a database makes it possible to unobtrusively record the required information, such that the addition of new algorithms will not entail the modification of existing routines. The main problem, apart from the introduction of hidden state, is that some form of key is required to access the data.

Throughout this report, the existence of a program-specific database is assumed. Two main types of key are used: identifiers, such as variable and constructor names, and unique labels (typically integers) attached to language constructs – the *ExpressionId* field presented in section 4.3.5 is an example of the latter type of key.

When accessing the database, the algorithm which generated the required information should be clearly identified. So for example, the free variables of an expression would be referred to as $\mathcal{F}\mathcal{V}[\text{exp}]$, and the type of a variable as $\vdash \text{var} : \tau$. Notice the omission of both the rule name and the formal arguments in each reference. The actual mechanism used to store and retrieve the information will not be considered unless it impacts upon the topic under discussion.
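One possible shape for such a database is sketched below, purely as an illustration. The thesis does not commit to a storage mechanism, so the types `Key`, `Analysis`, `Attribute` and `Database`, and the functions `record` and `query`, are all assumptions. Entries are indexed by the generating algorithm together with either an identifier or a unique label.

```haskell
import qualified Data.Map as Map

-- Keys: identifiers, or unique labels attached to language constructs.
data Key = Ident String | Label Int
  deriving (Eq, Ord, Show)

-- The algorithm that generated an entry (hypothetical names).
data Analysis = FreeVariableAnalysis | TypeInference
  deriving (Eq, Ord, Show)

-- The information recorded by each algorithm, simplified to strings.
data Attribute
  = FreeVars [String]
  | TypeOf String
  deriving (Eq, Show)

type Database = Map.Map (Analysis, Key) Attribute

-- Record the result of an analysis without disturbing other entries.
record :: Analysis -> Key -> Attribute -> Database -> Database
record analysis key = Map.insert (analysis, key)

-- Retrieve the information that a given analysis stored under a key.
query :: Analysis -> Key -> Database -> Maybe Attribute
query analysis key = Map.lookup (analysis, key)
```

Because the analysis forms part of the index, a new algorithm can be added without modifying the entries, or the access routines, of the existing ones.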

4.7 Denotational semantics

The denotational semantics presented in this section is essentially the non-strict model described by Peyton Jones and Launchbury [1991, section 3.2], with only a few minor modifications. The reader interested in further information on the motivation and theoretical underpinnings of denotational semantics is referred to [Schmidt, 1986], while Stoy's seminal work [Stoy, 1977] and Tennent's short introduction [Tennent, 1976] are also both highly recommended.

4.7.1 Domain equations

The domains used by the valuation functions are defined using the following recursive equations:

\begin{align*}
I\# & = \text{The set of fixed-precision integers} \\
F\# & = \text{The set of fixed-precision floating-point numbers} \\
Id & = \text{The set of all identifiers} \\
Cons & = \text{Val}^* \\
Fun & = \text{Val} \rightarrow \text{Val} \\
Val & = I\# \cup \cdots \cup F\# \cup Id \cup (Cons + Fun)_\bot \\
Env & = Id \rightarrow \text{Val}
\end{align*}

4.7.2 The meta-language

The notation used here follows the standard conventions [Schmidt, 1986, pages 52–53] and is summarised in table 4.3. Note that even though the meta-language bears a strong resemblance to the lambda calculus, it should not be confused with it.

4.7.3 Valuation functions

Figure 4.8 shows the valuation functions for well-formed programs and bindings, figure 4.9 deals with expressions, default alternatives and atoms, and figure 4.10 handles case alternatives. To improve readability, the injection, projection and domain membership functions have been omitted.

<table>
<thead>
<tr>
<th>operation</th>
<th>result’s domain</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\lambda x.e$</td>
<td>$A \to B$</td>
<td>function construction, such that for all $a \in A$, $[a/x]e$ has a unique value in $B$</td>
</tr>
<tr>
<td>$(e_1 e_2)$</td>
<td>$B$</td>
<td>function application such that $e_1 \in A \to B$, and $e_2 \in A$</td>
</tr>
<tr>
<td>let $x = e_1$ in $e_2$</td>
<td>Val</td>
<td>local definition</td>
</tr>
<tr>
<td>case $e$ of</td>
<td>$A$</td>
<td>conditional selection, such that for $1 \leq i \leq n$, $e_i \in A$</td>
</tr>
<tr>
<td>$(e_1, \ldots, e_n)$</td>
<td>Val$^*$</td>
<td>short form of $\{1 \mapsto e_1, \ldots, n \mapsto e_n\}$</td>
</tr>
<tr>
<td>fix($\lambda x.e$)</td>
<td>$A$</td>
<td>the fixed-point operator, such that $\lambda x.e \in A \to A$</td>
</tr>
</tbody>
</table>

Table 4.3: The meta-language of the denotational semantics

<table>
<thead>
<tr>
<th>$\text{Program } [\text{program}]$</th>
<th>Val</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\text{Program } [\text{typedecls bindings}]$</td>
<td>$\mathcal{E} [\text{letrec bindings main}] \emptyset$</td>
</tr>
<tr>
<td>$\text{Binds } [\text{binds}]$</td>
<td>Env $\to$ Env</td>
</tr>
<tr>
<td>$\text{Binds } [\text{bind}_1 \ldots \text{bind}_n] \rho$</td>
<td>$\bigoplus_{i \leq n} \text{Bind } [\text{bind}_i] \rho$</td>
</tr>
<tr>
<td>$\text{Bind } [\text{bind}]$</td>
<td>Env $\to$ Env</td>
</tr>
<tr>
<td>$\text{Bind } [\text{var } = \lambda\text{form}] \rho$</td>
<td>$\{\text{var} \mapsto \mathcal{LF} [\lambda\text{form}] \rho\}$</td>
</tr>
<tr>
<td>$\mathcal{LF} [\lambda\text{form}]$</td>
<td>Env $\to$ Val</td>
</tr>
<tr>
<td>$\mathcal{LF} [\text{vars}_{\text{free}}\ \pi\ \text{var}_1 \ldots \text{var}_n \to \text{exp}] \rho$</td>
<td>$\lambda \epsilon_1 \ldots \epsilon_n. (\mathcal{E}[\text{exp}] (\rho \uplus \{\text{var}_1 \mapsto \epsilon_1, \ldots, \text{var}_n \mapsto \epsilon_n\}))$</td>
</tr>
</tbody>
</table>

Figure 4.8: Denotational semantics of STG’ programs and bindings

Figure 4.9: Denotational semantics of STG’ expressions, defaults and atoms

Figure 4.10: Denotational semantics of STG’ case alternatives

The conversion of literals to the corresponding domain values by the $\mathcal{L}\llbracket \rrbracket$ function is illustrated by the following example:

<table>
<thead>
<tr>
<th>$\mathcal{L} [\text{literal}]$</th>
<th>Val</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\mathcal{L} [1]$</td>
<td>$1$</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
</tbody>
</table>

Primitive functions are the equivalent of lambda-calculus δ-rules [Barendregt, 1981], and, as such, care should be taken with their definition. The following rule serves as an example of the $\mathcal{E}[\text{primitive atoms}]$ set of rules:

4.7.4 Programming denotational semantics

There have been two main approaches taken to animating denotational semantics: the first prescribes hand-coding the rules directly into a general-purpose programming language, such as ML [Jouvelot, 1986] or even Pascal [Allison, 1983]; while the second advocates the use of a meta-language, as typified by Navel [Michaelson, 1993] or Wand’s prototyping system [Wand, 1984]. In keeping with the rest of this thesis, the former approach is advocated here, with the work of Jouvelot [1986] serving as a useful template. So, for example, an element from the Val domain is defined as:

$$\text{Val} = \text{Env} \rightarrow \text{Val} \rightarrow \text{Val} \rightarrow \text{Val}$$
and the `plusInt#` primitive becomes:

```haskell
expressionDenotes (Primitive exp_id "plusInt#" (Atoms [atom1, atom2])) rho
  = let IHashElement e1 = atomDenotes atom1 rho
        IHashElement e2 = atomDenotes atom2 rho
    in IHashElement (e1 + e2)
```

Notice that Haskell's strong typing automatically detects missing injection and projection operations.
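For readers who wish to experiment, a cut-down but self-contained version of this style of definition is sketched below. The syntax and the value domain are hypothetical simplifications (only `IHashElement`, `expressionDenotes` and `atomDenotes` echo the names used above), and mapping ill-formed applications to an explicit `Bottom` element is an assumption made to keep the functions total.

```haskell
import qualified Data.Map as Map

-- Hypothetical cut-down syntax: atoms are variables or integer literals.
data Atom = AVar String | ALit Int

data Expression
  = Primitive String [Atom]
  | AtomExp Atom

-- A simplified value domain with explicit injections.
data Val
  = IHashElement Int
  | FunElement (Val -> Val)
  | Bottom

type Env = Map.Map String Val

-- The denotation of an atom: literals are injected, variables looked up.
atomDenotes :: Atom -> Env -> Val
atomDenotes (ALit n) _   = IHashElement n
atomDenotes (AVar v) rho = Map.findWithDefault Bottom v rho

-- The denotation of an expression; only plusInt# is given a meaning here.
expressionDenotes :: Expression -> Env -> Val
expressionDenotes (Primitive "plusInt#" [atom1, atom2]) rho =
  case (atomDenotes atom1 rho, atomDenotes atom2 rho) of
    (IHashElement e1, IHashElement e2) -> IHashElement (e1 + e2)
    _                                  -> Bottom
expressionDenotes (Primitive _ _) _   = Bottom
expressionDenotes (AtomExp atom)  rho = atomDenotes atom rho
```

Omitting the `IHashElement` injection on the result, or the projections in the case patterns, would be rejected by the type checker, which is the property noted above.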

This approach has also been used to animate a continuation-passing semantics for APOSTLE, an object-oriented language for parallel and distributed discrete-event simulation [Wonnacott and Bruce, 1996]. This work is described in [Booth, Bruce and Ben-Dyke, 1996, section 3 and 4.2, and appendices A and B].

### 4.8 Graph reduction and the sequential STG machine

The STG machine [Peyton Jones and Salkild, 1989; Peyton Jones, 1992] is just one of a large number of abstract machines for performing graph reduction [Wadsworth, 1971, chapter 4] (arguably the most efficient approach to evaluating non-strict functional languages). Its selection over the other candidates, however, can be justified by the model's comparative efficiency and widespread usage.

This section provides an overview of the sequential STG machine, starting with a more detailed examination of the merits of this system in section 4.8.1. Section 4.8.2 then presents the notation used throughout the remainder of this chapter, while sections 4.8.3 through 4.8.9 look at the state-transition model of the abstract machine.

#### 4.8.1 Why use the STG machine?

Considering the large number of viable alternatives, including the ABC machine [Plasmeijer and van Eekelen, 1993a], TIM [Fairbairn and Wray, 1981; Chitnis, Satpathy and Oberoi, 1995], or a G-machine derivative, such as the \(\langle \nu, G \rangle\)-machine [Johnsson, 1991] or the GAML system [Maranget, 1991], what are the reasons for selecting the STG machine?

1. the STG' language is based on the STG language, which serves as the abstract machine code for the STG machine (see chapter 4).

2. the efficiency of the STG machine has been demonstrated by GHC, the Glasgow Haskell compiler, which ranked number one in a recent benchmark study [Hartel, 1994]. The relationship between the specification and its implementation is explored in section 4.8.10.

3. a number of important optimisations can be realised as simple source-to-source transformations [Howe and Burn, 1994; Peyton Jones and Launchbury, 1991, section 5.1], thereby avoiding the need to provide special machine support.

4. the self-updating model of thunks [Peyton Jones, 1992, section 3.1.2] and the uniform representation of closures [Peyton Jones, 1992, section 3.1.3] (which gives rise to the tagless nature of the model) allow for the seamless integration of threads, remote references, and skeletons into the basic model. For example, Mattson Jr. [1993a, figures 4.2 and 4.3, pages 78 and 79] uses black and grey holes to provide automatic thread-level synchronisation.

5. the STG machine has served as the core technology for a number of parallel implementations. These cover a range of platforms, including message-passing [Hwang and Rushall, 1992], shared-memory [Mattson Jr., 1993a], vector [Hill, 1994], and hybrid [Chakravarty, 1994] architectures.


Note that some of the arguments presented here echo those from section 4.2. With regard to the exact differences between the various models of graph reduction, the taxonomy proposed by Douence and Fradet [1995] is highly recommended.

4.8.2 Terminology

Peyton Jones [1992, section 5] specifies the STG machine in terms of a state-transition system. While the presentation does bear a resemblance to Plotkin's structured operational semantics [Hennessy, 1990], the exact relationship is not clear. In fact, Hill [1994, chapter 6, page 94] questions the theoretical foundations of the work, saying:

"The operational semantics given here is a minor abstraction of the assembly code tinkering that was required to implement DP Haskell. However it does provide a clean definition of the implementation of the language"

Rather than attempting to develop a formal description of the STG machine, the original semantics is adopted here, complete with the aforementioned limitations. Furthermore, relating the operational model with the denotational semantics presented in section 4.7 is outside the scope of this thesis.

A state-transition system comprises: a definition of the state in terms of its components, an initial state, a set of state-transition rules, and a set of final states. Each of these items is discussed in the following sections (the presentation is biased towards the modelling of abstract machines for language interpreters). Section 4.8.10 describes the Haskell animation of such systems.

**State as a tuple**

The state is used to represent all elements of the system to be modelled, so, for example, a microprocessor's state would include the register file, main memory, and instruction pipeline (see section 7.3). As the component make-up is likely to remain constant with respect to time, it is sensible to represent the state as a tuple of values, \((\text{code}, \text{component}_1, \ldots, \text{component}_n)\), although it is often convenient to omit the brackets and commas, i.e. \(\text{code component}_1 \cdots \text{component}_n\). The ordering of the fields is not significant, but it is fixed. The *code* component is the primary driving force behind the evaluation process and serves a role similar to that of a microprocessor's instruction stream.

**Components**

The state-transition system is based upon the matching and manipulation of components. Typically, a component will either be: a standard mathematical entity, such as a set, sequence, tuple, or variable; an abstract type, including environments, stacks and heaps; or a sum-of-products type, akin to Haskell's algebraic data types. In fact, the notation used is similar to that of Haskell, as illustrated by table 4.4. For an overview of the semantics of pattern matching, [Peyton Jones and Wadler, 1987] is recommended. The code component is typically an algebraic type, with each constructor representing a different mode of operation.

**The initial state**

The initial state is used to bootstrap the abstract machine. This is the only point at which external values can be referenced as there is no support for input or output; for example, the program to be evaluated may be incorporated into the state without specifying its exact origin.

**State-transition rules**

The simplest form a state-transition rule can take is: \( \text{pattern}_{\text{source}} \implies \text{pattern}_{\text{target}} \). If a state matches a rule’s source pattern, then a transition occurs and a new state is constructed as prescribed by the rule’s target pattern. Rules can also include explicit guard conditions and auxiliary definitions:

\[
\begin{array}{ll}
 & \text{pattern}_{\text{code}},\ \text{pattern}_1, \ldots, \text{pattern}_n \\
 & \text{such that } \text{condition}_1 \cdots \text{condition}_m \\
\implies & (\text{code}', \text{component}_1', \ldots, \text{component}_n') \\
 & \text{where } \text{definition}_1 \cdots \text{definition}_n
\end{array}
\]

All of the implicit and explicit conditions (patterns and guards respectively) have to hold for the rule to match a given state.

Peyton Jones [1992, section 5, page 33] restricts the rule set by requiring that any given state matches, at most, one transition rule. If the definitions, conditions, and component specifications are purely functional in nature, then the resulting system is obviously deterministic. However, by relaxing this restriction, a number of important behaviours can be specified (see sections 9.3.2 and 6.2.2).

**The final state**

Starting with the initial state, the state transitions will continue until one of the following situations arises: the current state does not match any of the rules, suggesting either an error, or omission, in the rule set or initial state; or the state matches a final-state pattern, indicating successful completion of the evaluation. The specification of a valid final state is similar to that of a transition rule, but without the target state. It is possible that the system never terminates.
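The overall behaviour of such a system can be captured by a small generic driver, sketched below under some simplifying assumptions: each rule is modelled as a partial function returning `Nothing` when its patterns or guards do not match, and, where several rules match, the first one in the list is chosen. The names `Outcome`, `run` and `countdown` are hypothetical, and this is not the animation technique of section 4.8.10.

```haskell
-- The two ways in which a run can stop.
data Outcome s
  = Finished s -- the state matched a final-state pattern
  | Stuck s    -- no rule matched: an error or omission somewhere
  deriving (Eq, Show)

-- Repeatedly apply the first matching rule until the state is final or no
-- rule applies. As noted above, the loop need not terminate.
run :: (s -> Bool) -> [s -> Maybe s] -> s -> Outcome s
run isFinal rules = go
  where
    go state
      | isFinal state = Finished state
      | otherwise =
          case [next | rule <- rules, Just next <- [rule state]] of
            next : _ -> go next
            []       -> Stuck state

-- A toy rule over (counter, accumulator) states: add the counter to the
-- accumulator and decrement it, provided that the counter is positive.
countdown :: (Int, Int) -> Maybe (Int, Int)
countdown (n, acc)
  | n > 0     = Just (n - 1, acc + n)
  | otherwise = Nothing
```

With the final states taken to be those whose counter is zero, `run ((== 0) . fst) [countdown] (3, 0)` passes through `(2, 3)` and `(1, 5)` before finishing in `(0, 6)`.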

**4.8.3 The abstract state of the STG-machine**

The STG machine uses the following state to model graph reduction:

\[
(\text{code}, \text{argument stack}, \text{return stack}, \text{update stack}, \text{heap}, \text{global env}) \equiv (\text{code}, \text{as}, \text{rs}, \text{us}, h, \sigma)
\]

Following the presentation of Peyton Jones [1992, section 5], the individual components of the state are specified in table 4.5. The relationship between the code field and the state-transition rules is illustrated in figure 4.11. Chapter 8 deals with the implementation of both the state components and the transition rules.

Table 4.5: The state components of the STG machine

<table>
<thead>
<tr>
<th>component</th>
<th>specification</th>
<th>description</th>
<th>rules</th>
</tr>
</thead>
<tbody>
<tr>
<td>code</td>
<td>$\text{Eval exp } \rho$</td>
<td>evaluate $\text{exp}$ in environment $\rho$</td>
<td>all but 16–17A</td>
</tr>
<tr>
<td></td>
<td>$\text{Enter } a$</td>
<td>closure application</td>
<td>1, 2, 15, 17, 17A</td>
</tr>
<tr>
<td></td>
<td>$\text{Return}_{\tau}\ \text{result}$</td>
<td>return value of type $\tau$ to continuation</td>
<td>5–14 and 16</td>
</tr>
<tr>
<td>argument stack</td>
<td>stack of values</td>
<td>structure for passing parameters</td>
<td>1, 2, 15, 16–17A</td>
</tr>
<tr>
<td>return stack</td>
<td>stack of continuations</td>
<td>structure for storing control information</td>
<td>4–4B, 6–8', 12', 13</td>
</tr>
<tr>
<td>update stack</td>
<td>stack of update frames</td>
<td>update mechanism</td>
<td>15–17A</td>
</tr>
<tr>
<td>heap</td>
<td>heap of closures</td>
<td>boxed-value storage</td>
<td>2, 3, 8', 15–17A</td>
</tr>
<tr>
<td>global environment</td>
<td>$\sigma\ \text{var} = a$</td>
<td>closure address of the top-level bindings</td>
<td>5</td>
</tr>
<tr>
<td>value</td>
<td>$\text{Addr } a$</td>
<td>closure address</td>
<td>1, 3, 8', 15–17A</td>
</tr>
<tr>
<td></td>
<td>$\text{Int } k$</td>
<td>literal integer</td>
<td>9–14</td>
</tr>
<tr>
<td>continuation</td>
<td>$\text{Case}_{\text{alt}}\ \rho$</td>
<td>case expression</td>
<td>4, 6, 7, 11, 13</td>
</tr>
<tr>
<td></td>
<td>$\text{Forced}_{\tau}\ \text{var}\ \rho$</td>
<td>letstrict or let# expression</td>
<td>4A, 4B, 8', 12'</td>
</tr>
<tr>
<td>update frame</td>
<td>$(\text{as}, \text{rs}, a)$</td>
<td>update marker</td>
<td>15–17A</td>
</tr>
<tr>
<td>closure</td>
<td>$(\text{lambda-form}, \text{values})$</td>
<td>boxed representation of values</td>
<td>2, 3, 8', 15–17A</td>
</tr>
</tbody>
</table>

Figure 4.11: The relationship between the STG-machine rules and the code component

\[
\begin{array}{ll}
 & \text{Eval}\ (\text{letstrict}\ (\text{var} = \text{exp}_{\text{rhs}})\ \text{exp}_{\text{body}})\ \rho \quad \text{as} \quad \text{rs} \quad \text{us} \quad h \quad \sigma \\
\Rightarrow & \text{Eval}\ \text{exp}_{\text{rhs}}\ \rho \quad \text{as} \quad \text{return} : \text{rs} \quad \text{us} \quad h \quad \sigma \\
\text{where} & \text{return} = \text{Forced}_{\chi\,\pi_1 \ldots \pi_n}\ \text{var}\ \text{exp}_{\text{body}}\ \rho' \\
 & \text{exp}_{\text{rhs}} \text{ is of type } \chi\ \pi_1 \ldots \pi_n \\
 & \text{dom}(\rho') = \mathcal{FV}[\text{exp}_{\text{body}}]
\end{array}
\]

Figure 4.12: The STG-machine rule for evaluating letstrict expressions
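To give a flavour of how a rule of this kind might be animated, the following sketch encodes the letstrict rule as a partial step function over a much reduced state. The update stack, heap and global environment are omitted, as is the type subscript of the continuation, and every name used is an assumption; the actual implementation is the subject of section 4.8.10 and chapter 8.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type Var  = String
type Addr = Int

-- A hypothetical fragment of the expression syntax.
data Exp
  = LetStrict Var Exp Exp
  | Lit Int
  | V Var

data Value = AddrV Addr | IntV Int

type LocalEnv = Map.Map Var Value

-- The continuation pushed by the rule (type annotation omitted).
data Continuation = Forced Var Exp LocalEnv

data Code = Eval Exp LocalEnv | Enter Addr | Return Value

-- A reduced machine state: code, argument stack and return stack only.
data State = State
  { code        :: Code
  , argStack    :: [Value]
  , returnStack :: [Continuation]
  }

-- Free variables of the toy expressions.
freeVars :: Exp -> Set.Set Var
freeVars (V v)                  = Set.singleton v
freeVars (Lit _)                = Set.empty
freeVars (LetStrict v rhs body) =
  freeVars rhs `Set.union` Set.delete v (freeVars body)

-- The letstrict rule: evaluate the right-hand side, having saved a
-- continuation whose environment covers the free variables of the body.
step4A :: State -> Maybe State
step4A st@State { code = Eval (LetStrict var rhs body) rho } =
  Just st { code        = Eval rhs rho
          , returnStack = Forced var body rho' : returnStack st }
  where
    rho' = Map.restrictKeys rho (freeVars body)
step4A _ = Nothing
```

Returning `Nothing` for all other states reflects the fact that a rule applies only when its source pattern matches.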

\[
\begin{array}{ll}
 & \text{Return}_{\chi\,\pi_1 \ldots \pi_n}\ c\ \text{ws} \quad \text{as} \quad (\text{Forced}_{\chi\,\pi_1 \ldots \pi_n}\ \text{var}\ \text{exp}_{\text{body}}\ \rho) : \text{rs} \quad \text{us} \quad h \quad \sigma \\
\Rightarrow & \text{Eval}\ \text{exp}_{\text{body}}\ \rho' \quad \text{as} \quad \text{rs} \quad \text{us} \quad h' \quad \sigma \\
\text{where} & \rho' = \rho \oplus \{ \text{var} \mapsto a \} \\
 & h' = h[a \mapsto (\text{vs}\ r\ \{\} \rightarrow c\ \text{vs}, \text{ws})] \\
 & \text{vs is a sequence of arbitrary distinct variables} \\
 & \text{length}(\text{vs}) = \text{length}(\text{ws})
\end{array}
\]

Figure 4.13: The STG-machine rule for returning to a letstrict continuation

4.8.4 The STG' language and the STG machine

In order to adapt the STG machine to work with the STG' language, two new rules were introduced (rules 4A and 4B) and two existing rules were altered (rules 8 and 12). These rules are shown in figures 4.12 through 4.14, with the exception of rule 4B, which is a slight variation of rule 4A (dealing with the let# expression instead of the letstrict expression). The global environment has also been extended to allow access to the program's attribute database.

4.8.5 The initial state

The initial state [Peyton Jones, 1992, section 5.1] takes as its only parameters an STG' program and its attribute database (see section 4.6). The state is constructed so that the code is set to evaluate the variable main, all stacks are empty, the heap contains closures representing all of the program's top-level bindings, and the global environment contains the addresses of each of these closures. For example, the program shown in figure 4.15 (see section B.2 for the definition of fib.wrk) would result in the creation of the following initial state:

\[
\begin{array}{ll}
 & \text{Return}_{\text{Int#}}\ k \quad \text{as} \quad (\text{Forced}_{\text{Int#}}\ \text{var}\ \text{exp}_{\text{body}}\ \rho) : \text{rs} \quad \text{us} \quad h \quad \sigma \\
\Rightarrow & \text{Eval}\ \text{exp}_{\text{body}}\ \rho' \quad \text{as} \quad \text{rs} \quad \text{us} \quad h \quad \sigma \\
\text{where} & \rho' = \rho \oplus \{ \text{var} \mapsto k \}
\end{array}
\]

Figure 4.14: The STG-machine rule for returning to a let# continuation

4.8.6 Variable application, closures, and entry methods

A closure typically represents a variable of boxed type, $\pi$, and to access its value it is necessary to invoke the closure's entry method. To illustrate this, the first few transitions of the initial state presented in the previous section are as follows (ignoring the fact that main’s closure is updatable):

\[
\begin{array}{lll}
\text{(rule 1)} & \Rightarrow & \text{Eval}\ \text{main}\ \{\}_{\text{env}} \quad \langle \rangle \quad \langle \rangle \quad \langle \rangle \quad h_{\text{init}} \quad \sigma \\
\text{(rule 2)} & \Rightarrow & \text{Enter}\ a_2 \quad \langle \rangle \quad \langle \rangle \quad \langle \rangle \quad h_{\text{init}} \quad \sigma \\
\text{(rule 3)} & \Rightarrow & \text{Eval}\ (\text{let}\ z = u \rightarrow \text{fib.wrk}\ 20)\ \{\}_{\text{env}} \quad \langle \rangle \quad \langle \rangle \quad h_{\text{init}} \quad \sigma \\
 & & \text{where}\ h_1 = h_{\text{init}}[a_3 \mapsto (u \rightarrow \text{fib.wrk}\ 20, \langle \rangle)] \\
\text{(rule 1)} & \Rightarrow & \text{Enter}\ a_1 \quad \text{as}_1 \quad \langle \rangle \quad h_1 \quad \sigma \\
 & & \text{where}\ \text{as}_1 = (a_3, a_3) \\
\text{(rule 2)} & \Rightarrow & \text{Eval}\ (\text{case}\ x\ \text{alts}_1)\ \{x \mapsto a_3, y \mapsto a_3\}_{\text{env}} \quad \langle \rangle \quad \langle \rangle \quad h_1 \quad \sigma \\
 & & \text{where}\ \text{alts}_1 = \text{Int}\ x' \rightarrow \text{case} \ldots
\end{array}
\]

Notice how the second application of rule 1 results in const.Int.*’s arguments being pushed onto the stack, which are then removed and bound upon entry to the function’s closure (see the local environment of the last state).

Even though the operational description only defines one type of closure and one standard entry method, a complete implementation would support a richer mixture (see, for example, sections 6.4.3 and 6.2).

4.8.7 Returning values

With the exception of the variable main, the evaluation of an expression is always initiated by either a case, let#, or letstrict expression. Before the new evaluation begins, each of their associated rules pushes a continuation onto the return stack. This is then removed and invoked when the new expression’s head-normal form is reached, thereby returning control to the original expression. To illustrate this process, the example from the previous section is continued below:

\[
\begin{align*}
\text{Eval} \quad \text{(case } x \text{ alt}_1 \{ x \mapsto a_3, y \mapsto a_3 \}\text{)}_{env} & \quad \langle \rangle \quad \langle \rangle \quad h_1 \quad \sigma \\
\text{where} \quad alt_1 = \text{Int } x' \mapsto \text{case...} \\
\text{(rule 4)} & \quad \Rightarrow \quad \text{Eval } \{ x \mapsto a_3 \}_{env} & \quad \langle \rangle \quad rs_1 \quad h_1 \quad \sigma \\
\text{where} \quad rs_1 = (\text{Case}_x \text{Int } alt_1 \{ y \mapsto a_3 \}_{env}) \\
\text{(rule 1)} & \quad \Rightarrow \quad \text{Enter } a_3 & \quad \langle \rangle \quad rs_1 \quad h_1 \quad \sigma \\
\text{⇒*} & \quad \text{Return}_{\text{Int}} \text{ Int } 21891 & \quad \langle \rangle \quad rs_1 \quad h_2 \quad \sigma \\
\text{(rule 6)} & \quad \Rightarrow \quad \text{Eval } (\text{case } y \text{ alt}_2 \{ x' \mapsto 21891, y \mapsto a_3 \}_{env}) & \quad \langle \rangle \quad \langle \rangle \quad h_2 \quad \sigma \\
\text{where} \quad alt_2 = \text{Int } y' \mapsto \text{let#...} \\
\text{(rule 4)} & \quad \Rightarrow \quad \text{Eval } \{ y \mapsto a_3 \}_{env} & \quad \langle \rangle \quad rs_2 \quad h_2 \quad \sigma \\
\text{where} \quad rs_2 = (\text{Case}_y \text{Int } alt_2 \{ x' \mapsto 21891 \}_{env}) \\
\text{⇒*} & \quad \text{Return}_{\text{Int}} \text{ Int } 21891 & \quad \langle \rangle \quad rs_2 \quad h_2 \quad \sigma \\
\text{(rule 6)} & \quad \Rightarrow \quad \text{Eval } (\text{let# } xy = \text{timesInt# } x \ y) & \quad \langle \rangle \quad \langle \rangle \quad h_2 \quad \sigma \\
\text{Int } x \ t & \quad \text{where} \quad \rho_1 = \{ x' \mapsto 21891, y \mapsto 21891 \}_{env} \\
\text{(rule 4b)} & \quad \Rightarrow \quad \text{Eval } (\text{timesInt# } x \ y) \ \rho_1 & \quad \langle \rangle \quad rs_3 \quad h_2 \quad \sigma \\
\text{where} \quad rs_3 = (\text{Forced}_x \text{Int# } xy (\text{Int } xy) \{\})_{env} \\
\text{(rule 14)} & \quad \Rightarrow \quad \text{Return}_{\text{Int#}} \ 479215881 & \quad \langle \rangle \quad rs_3 \quad h_2 \quad \sigma \\
\text{(rule 12')} & \quad \Rightarrow \quad \text{Eval } (\text{Int } xy) \{ xy \mapsto 479215881 \}_{env} & \quad \langle \rangle \quad \langle \rangle \quad h_2 \quad \sigma \\
\text{(rule 5)} & \quad \Rightarrow \quad \text{Return}_{\text{Int}} \ 479215881 & \quad \langle \rangle \quad \langle \rangle \quad h_2 \quad \sigma
\end{align*}
\]

Notice how the local and update-frame environments are constantly trimmed to remove redundant entries (see section 6.3.3).

4.8.8 The update mechanism

The update mechanism maintains the laziness of the STG machine by ensuring that an expression will be reduced to head-normal form at most once. The update flag of the STG' language indicates which expressions need to be updated, and upon entry to an updatable closure both the argument and return stacks are reset. An update is then triggered whenever there are insufficient values on either stack to satisfy an access. Consider, for example, the evaluation of the variable \( x \) (the details of which were omitted from the previous section’s description):

\[
\begin{align*}
\text{(rule 1)} & \quad \Rightarrow \quad \text{Eval } \{ x \mapsto a_3 \}_{env} & \quad \langle \rangle \quad rs_1 \quad h_1 \quad \sigma \\
\text{(rule 15)} & \quad \Rightarrow \quad \text{Enter } a_3 & \quad \langle \rangle \quad rs_1 \quad h_1 \quad \sigma \\
\text{where} \quad us_1 = (\langle a_3, \rangle, rs_1)) \\
\text{⇒*} & \quad \text{Return}_{\text{Int}} \text{ Int } 21891 & \quad \langle \rangle \quad us_1 \quad h_1 \quad \sigma \\
\text{(rule 16)} & \quad \Rightarrow \quad \text{Return}_{\text{Int}} \text{ Int } 21891 & \quad \langle \rangle \quad rs_1 \quad h_2 \quad \sigma \\
\text{where} \quad h_2 = h_1[a_3 \mapsto (v x \mapsto \text{Int } v, \langle 21891 \rangle)]
\end{align*}
\]

All future entries of the \( a_3 \) closure will return the new value without having to repeat the costly evaluation. With regard to partial applications, an update is triggered when a function needs more arguments than are available on the stack (rule 17 or 17a).
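The effect of an update can be illustrated with a minimal sketch. The `Closure`, `Heap`, and `enter` definitions below are simplified, hypothetical stand-ins rather than the thesis's own types: a thunk is overwritten with its value on first entry, so a second entry performs no evaluation.

```haskell
import qualified Data.Map as Map

type Address = Int

-- A closure is either a suspended computation or a value already in
-- head-normal form (hypothetical, simplified representation).
data Closure
  = Thunk (() -> Int)
  | Value Int

type Heap = Map.Map Address Closure

-- Enter a closure. The result is the value, the (possibly updated) heap,
-- and the number of evaluations that had to be performed.
enter :: Address -> Heap -> (Int, Heap, Int)
enter addr heap =
  case Map.lookup addr heap of
    Just (Value v)    -> (v, heap, 0)
    Just (Thunk susp) ->
      let v = susp ()
      in (v, Map.insert addr (Value v) heap, 1)  -- overwrite with the result
    Nothing           -> error ("enter: dangling address " ++ show addr)
```

Entering the same address twice, threading the heap through, reports one evaluation for the first entry and none for the second.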

When dealing with closures that are known to reduce to constructor applications (using type information) clearing both the argument and return stacks upon entry is unnecessary, and a shorter update frame could be used. The compilation rules to support this idea have not yet been developed.
4.8.9 The final state

Evaluation is complete whenever the machine is in the Return mode and all three stacks, as, rs, and us, are empty. The last state of the example shown in section 4.8.7 would be the final state of that evaluation.

4.8.10 Animating state-transition systems

The primary motivation behind the animation process is that of debugging the state-transition system. This includes testing both the correctness of the model (see sections 4.7 and 5.4), and, if feasible, its efficiency (see section 6.2.2). The structure adopted here is based on that outlined by Diller [1994a], although Haskell, rather than Miranda [Holyer, 1991], is used as the target language. The steps are as follows:

1. create a type signature for each state component. If the component is not already supported, an abstract type, and associated operations, will have to be developed.

2. define an algebraic type to represent the abstract state(s). Despite the tuple representation used throughout this chapter, Haskell's algebraic types are better suited to the role. For states with a large number of components, access and update routines need to be developed to support the implementation of the transition rules.

3. specify the initial and final states.

4. develop a partial ordering for the rule set. To improve efficiency, Haskell's built-in pattern matching facilities are used during the implementation of the transition rules. The semantics dictate a sequential left-to-right depth-first evaluation of nested patterns [Hudak et al., 1992, figure 3, page 22]. Hence, the ordering of the rules has to be considered carefully.

5. encode each transition rule. The state-transition rule and its Haskell implementation are very similar, with the latter only requiring some additional plumbing to correctly order accesses and updates to the components.

The first and last pairs of steps are discussed in the subsections "The state, its components, and abstract data types" and "State-transition rules" respectively, and the subsection "The initial and final states" looks at the third step, the animation of the initial and final states. The sequential STG machine is used as the primary example throughout this material. Also, as it forms a key part of the prototyping system, the subsection "Benchmarking the STG machine" validates the STG animation against the Glasgow Haskell compiler.
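Once the five steps above are complete, the pieces can be tied together by a small driver. The following is a minimal sketch under stated assumptions: `animate` and the `Toy` machine are hypothetical names introduced for illustration, with the toy state standing in for the much larger STG state.

```haskell
-- Apply the transition function until the final-state predicate holds,
-- returning every state visited so that a trace can be inspected.
animate :: (state -> Bool) -> (state -> state) -> state -> [state]
animate isFinal step s
  | isFinal s = [s]
  | otherwise = s : animate isFinal step (step s)

-- Toy stand-in for an abstract machine: the state is a counter and a
-- running total, and each transition adds the counter to the total.
data Toy = Toy Int Int deriving (Eq, Show)

toyInitial :: Int -> Toy
toyInitial n = Toy n 0

toyFinal :: Toy -> Bool
toyFinal (Toy n _) = n <= 0

toyStep :: Toy -> Toy
toyStep (Toy n total) = Toy (n - 1) (total + n)
```

Returning the whole list of states, rather than only the last one, keeps step-based tracing straightforward; laziness means the list is only materialised as far as it is inspected.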

The state, its components, and abstract data types

By using Haskell's class and module system [Hudak et al., 1992, sections 4 and 5, pages 24-55] to develop abstract data types for the most common components (see table 4.4) the required type signatures can often be generated immediately:

<table>
<thead>
<tr>
<th>Haskell</th>
</tr>
</thead>
<tbody>
<tr>
<td>type ArgumentStack = Stack Value</td>
</tr>
<tr>
<td>type ReturnStack = Stack Continuation</td>
</tr>
<tr>
<td>type UpdateStack = Stack UpdateFrame</td>
</tr>
<tr>
<td>type MainHeap = Heap Address Closure</td>
</tr>
<tr>
<td>type GlobalEnv = Env Variable Address</td>
</tr>
</tbody>
</table>

The sum-of-products components are the only exception, and these can be directly converted into data declarations.

An algebraic type is also used to encode the abstract state:

```haskell
data STGState = STGState STGCode ArgumentStack ReturnStack UpdateStack
               MainHeap GlobalEnv Extras
```

As well as holding miscellaneous plumbing information, including a unique name supply and possibly a stream of random numbers, the `Extras` field is used to instrument the transition system.

By using a data type instead of a tuple, it is possible for the state and component types to be recursive. For example, the following definitions could be used to bring the operational model into line with the physical implementation of closures (see chapter 8):

```haskell
data Closure = Closure LambdaForm LocalEnv EntryMethod

type EntryMethod = Address -> Closure -> STGState -> STGState
```

One of the major problems with the animation method is that any change to one or more of the underlying types, particularly that of the abstract state, can require that all associated definitions be updated. Fortunately, Haskell's static type system will identify all of the inconsistencies. Furthermore, by using access and update functions to manipulate the state wherever possible, most of the changes can be localised:

```haskell
stgstateGetArgumentStack :: STGState -> ArgumentStack
stgstateSetMainHeap :: MainHeap -> STGState -> STGState
```
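How such functions localise change can be seen in a small, self-contained sketch. The `MachineState` type and the accessor names below are simplified assumptions, not the thesis's definitions: only the accessors mention the constructor's layout, so a rule such as `popArgument` is unaffected when a component is added.

```haskell
-- Simplified stand-ins for the real component types.
type ArgumentStack = [Int]
type MainHeap      = [(Int, String)]

-- Argument stack, heap, and a reduction count.
data MachineState = MachineState ArgumentStack MainHeap Int

-- Only these functions depend on the layout of the constructor.
getArgumentStack :: MachineState -> ArgumentStack
getArgumentStack (MachineState as _ _) = as

setArgumentStack :: ArgumentStack -> MachineState -> MachineState
setArgumentStack as (MachineState _ heap n) = MachineState as heap n

-- A rule written purely in terms of the accessors: pop one argument.
popArgument :: MachineState -> Maybe (Int, MachineState)
popArgument st =
  case getArgumentStack st of
    []         -> Nothing
    (x : rest) -> Just (x, setArgumentStack rest st)
```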

Another possible approach would be to pass each state component as an individual argument to each state-transition rule. However, this would require a continuation-passing system, making step-based tracing and/or debugging difficult. Furthermore, as noted previously, the ADT approach provides better encapsulation, thereby localising the changes that have to be made when the state is extended or changed.

The initial and final states

The initial state is realised as a Haskell function, the arguments of which equate to the external parameters of the abstract machine. The body is simply a collection of component instantiations:

```haskell
stgstateInitialise :: TypeEnvironment -> PrimitiveEnv -> Program -> STGState
stgstateInitialise type_env primitives program =
    STGState code as rs us heap ge ex
  where
    -- start by evaluating main, located via the global environment
    code         = Eval (envFindAddress "main" envEmpty ge)
    -- all three stacks start empty
    (as, rs, us) = (stackCreate "as", stackCreate "rs", stackCreate "us")
    -- ge appears on both sides of this binding
    (ge, heap)   = bindsAllocate (programGetBinds program) ge heapInitialise
    ex           = extrasInitialise type_env primitives
```
Notice how non-strictness has been used to “tie a knot” during the creation of the global environment, ge. Also, as the compiler will automatically resolve the various dependencies, the ordering of the declarations is unimportant.
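The knot-tying idiom can be shown in isolation with a minimal sketch. The `allocate` and `initialise` functions below are hypothetical simplifications of the allocation step: each closure stores the final global environment, even though that environment is only known once allocation has finished.

```haskell
import qualified Data.Map as Map

type Address   = Int
type GlobalEnv = Map.Map String Address

-- A closure records the name of its binding and the environment it captured.
data Closure = Closure String GlobalEnv

-- Allocate one closure per binding. The environment argument is stored in
-- each closure but never inspected here, which is what makes the knot safe.
allocate :: [String] -> GlobalEnv -> (GlobalEnv, Map.Map Address Closure)
allocate names env = (newEnv, heap)
  where
    addressed = zip names [0 ..]
    newEnv    = Map.fromList addressed
    heap      = Map.fromList [(a, Closure n env) | (n, a) <- addressed]

-- The environment returned by allocate is also passed to it as an argument.
initialise :: [String] -> (GlobalEnv, Map.Map Address Closure)
initialise names = (ge, heap)
  where
    (ge, heap) = allocate names ge
```

The definition terminates only because `allocate` never forces its environment argument while building the result; a version that inspected `env` during allocation would loop.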

The final-state predicate is constructed in exactly the same way as the transition rules, except, rather than returning a new state, the result is either True or False.
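As a minimal sketch of such a predicate, assuming simplified stand-ins for the code component and the three stacks (the `Code` and `State` types here are hypothetical), the condition from section 4.8.9 becomes a pair of pattern-matching clauses:

```haskell
-- Simplified stand-ins for the machine's components (hypothetical types).
data Code
  = Eval String
  | Enter Int
  | ReturnCon String [Int]
  | ReturnLit Int

-- Code, argument stack, return stack, update stack.
data State = State Code [Int] [String] [Int]

-- Final when the machine is in a return mode and all three stacks are empty.
isFinal :: State -> Bool
isFinal (State (ReturnCon _ _) [] [] []) = True
isFinal (State (ReturnLit _)   [] [] []) = True
isFinal _                                = False
```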

State-transition rules
The entire rule set could be encoded as a single Haskell function, using a guarded binding for each specific rule. However, as the semantics dictate a left-to-right depth-first evaluation of guards and patterns [Hudak et al., 1992, figure 3, page 22], the ordering of the bindings would implicitly define the rule hierarchy (which is typically flat as rules tend not to overlap – see section 4.8.2). This can cause complications when modifying the rules, so it is suggested that the ordering is clearly defined through the use of dispatch functions:

```haskell
-- Top-level dispatch: collect garbage if required, otherwise count the
-- reduction and hand over to the rule group selected by the code component.
stgstateTransform :: STGState -> STGState
stgstateTransform stgstate@(STGState code as rs us heap ge ex)
  | stgstateTriggerGC heap = stgstateInitiateGC stgstate
  | otherwise              = step code (stgstateIncReductions stgstate)
  where
    step (Eval expr local_env)  = codeEval expr local_env
    step (Enter address)        = codeEnter address
    step (ReturnCon con values) = codeReturnCon con values
    step (ReturnLit literal)    = codeReturnLit literal
    step _                      = codeUndefined
```

This technique has the advantage of allowing support functions to be developed alongside the appropriate rule. This would not be possible with the one-function approach as GHC does not allow diffuse bindings. It is also easier to instrument the system, as illustrated by the `stgstateIncReductions` ticky-ticky function.

Note that non-determinism, whether introduced by overlapping rules or explicitly specified in a single rule, can be simulated by, for example, extending the `Extras` field to include a stream of random numbers. The dispatch function then selects a rule based on the next value in the stream.
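A minimal sketch of this idea follows. Everything in it is an assumption made for illustration: the two overlapping rules, the `State` type, and the linear congruential generator (used only so that the example has no external dependencies).

```haskell
-- A simple linear congruential generator, standing in for a proper
-- random-number supply.
randomStream :: Int -> [Int]
randomStream seed = drop 1 (iterate next seed)
  where
    next x = (1103515245 * x + 12345) `mod` 2147483648

-- A value, plus the remaining random numbers.
data State = State Int [Int]

-- Two rules that both apply to every state, i.e. they overlap.
ruleIncrement, ruleDouble :: Int -> Int
ruleIncrement = (+ 1)
ruleDouble    = (* 2)

-- The dispatch function consumes one random number to choose a rule.
-- The low-order bits of this generator are weak, so they are discarded.
step :: State -> State
step (State v (r : rest))
  | even (r `div` 65536) = State (ruleIncrement v) rest
  | otherwise            = State (ruleDouble v) rest
step st = st  -- stream exhausted: no rule fires
```

Because the stream is part of the state, a run is repeatable for a given seed, which keeps traces reproducible while still exercising both rules.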

The coding of the rules is usually straightforward, and the following functions implement rule 9 (evalLiteral) and rule 3 (evalLet):

```haskell
-- Rule 9: a literal evaluates to itself, so switch directly to return mode.
evalLiteral :: Literal -> LocalEnv -> STGState -> STGState
evalLiteral literal local_env stgstate =
    stgstateSetCode (ReturnLit literal) (stateIncEventLit stgstate)

-- Rule 3: allocate the bindings in the heap, then evaluate the body in the
-- extended environment.
evalLet :: Bool -> Bindings -> Expression -> LocalEnv -> STGState -> STGState
evalLet recursive binds expression original_env state =
    stgstateSetCode (Eval expression local_env) $
    stateIncEventLet $
    stgstateSetMainHeap heap' state
  where
    (binds_env, heap') = bindsAllocate binds rhs_env old_heap
    local_env          = envMerge original_env binds_env
    -- recursive right-hand sides see the new bindings; others do not
    rhs_env | recursive = local_env
            | otherwise = original_env
    old_heap           = stgstateGetMainHeap state
```

Both of these definitions give rise to one of the main problems affecting the animation. Due to the non-strict semantics of the language, the evaluation of the `ticky-ticky` counts will be deferred as their values are not needed in the calculation of the new state. Each subsequent step will again defer evaluation, creating a series of linked closures whose length is proportional to the number of transitions made. To prevent this unwanted space leak, infrequently-accessed values have to be artificially forced via dummy case expressions.
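The difference between a deferred and a forced counter can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's code: the `Extras` type is a hypothetical two-field record, and `seq` is used in place of a dummy case expression to force the count.

```haskell
import Data.List (foldl')

-- Reduction count and let count.
data Extras = Extras Int Int

-- Lazy version: the new count is stored as an unevaluated "+ 1" closure,
-- so repeated use builds a chain as long as the number of transitions.
incReductionsLazy :: Extras -> Extras
incReductionsLazy (Extras reds lets) = Extras (reds + 1) lets

-- Forced version: the count is evaluated before the record is rebuilt.
incReductions :: Extras -> Extras
incReductions (Extras reds lets) =
  let reds' = reds + 1
  in reds' `seq` Extras reds' lets

reductions :: Extras -> Int
reductions (Extras reds _) = reds

-- Run n transitions that do nothing except count themselves.
countSteps :: Int -> Extras
countSteps n = foldl' (\ex _ -> incReductions ex) (Extras 0 0) [1 .. n]
```

Both versions compute the same totals; they differ only in when the additions are performed, and therefore in how much heap is retained between transitions.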

**Benchmarking the STG machine**

Following the guidelines laid down by Jain [1991, chapter 25, pages 413–436], this section discusses the verification and validation of the animation of the STG machine – both are essential to having confidence in the output of the animation. Obviously, it is first necessary to consider what the outputs are likely to be:

**final result** the terminal value of the code component is usually $\text{Return}_{\tau} \ \text{value}_{\tau}$ (ignoring errors), and $\text{value}_{\tau}$ is taken to be the final result of the computation.

**accumulated totals and statistics** this data is stored in the Extras field, and primarily records event counts in much the same way as GHC’s ticky-ticky profiling system. The state components can also act as data sources. For example, the MainHeap data type records the number and the size of the stored closures.

**traces** snapshots of the abstract state are dumped to a file, with the frequency and level of detail controlled by command-line options.

In addition to the usual model-verification techniques [Jain, 1991, section 25.1, pages 413–420], the denotational semantics (see section 4.7) serves as a reference against which the animation’s final result can be checked. Furthermore, the close correspondence between the specification and its implementation further simplifies the debugging process.

In order to validate a model it is necessary to have either expert intuition, real-system measurements, or theoretical results [Jain, 1991, section 25.2, pages 420–423]. For the sequential STG machine, the outputs can be compared against the ticky-ticky profiles generated by GHC [AQUA Team, 1993, section 9, page 36] (attempting to predict the run time would be difficult; see section 6.2.2).

Tables 4.6 through 4.9 present the percentage errors between the estimated and observed values for the fib, primes, queens, and hamming benchmark programs (see sections B.2 to B.5 for the STG'-language versions). The measured values include: closures and words, a record of the number of heap allocations (rules 3, 8', 16, 17, 17a) and the total memory used; entries, a count of the invocations of closure entry routines (rules 2 and 15); updates, the number of thunks updated (rules 17 and 17a); and returns, a tally of the non-unary constructor returns (rules 6–8').

The percentage errors range between -4.29% and 13.33%, although the errors tend to zero as the number of transitions increases (with the exception of the returns estimates for both the hamming and primes benchmarks). The discrepancies are largely due to the animation not modelling GHC’s input-output mechanism.
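For reference, a signed percentage error of this kind can be computed as sketched below. The sign convention (positive when the animation over-estimates) is an assumption, as the text does not state it, and `percentageError` is a hypothetical helper name.

```haskell
-- Signed percentage error of an estimate relative to an observed value.
-- Assumed convention: positive when the animation over-estimates.
percentageError :: Double -> Double -> Double
percentageError estimated observed = 100 * (estimated - observed) / observed
```

With purely illustrative counts (not measurements taken from the tables), an estimate of 17 against an observed 15 gives an error of roughly 13.33%, which shows how a difference of only two events dominates when the totals are small.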

### 4.9 Summary

The STG' language provides the computational model upon which a parallel intermediate language can be built, and this chapter has defined the language in terms of its:

**abstract syntax** this is the internal representation used to encode programs, and provides the main structure around which most of the language-processing algorithms are developed.
Table 4.6: The fib benchmark results

<table>
<thead>
<tr>
<th>fib</th>
<th>5</th>
<th>10</th>
<th>15</th>
<th>20</th>
</tr>
</thead>
<tbody>
<tr>
<td>reductions/k</td>
<td>1</td>
<td>9</td>
<td>105</td>
<td>1160</td>
</tr>
<tr>
<td>entries</td>
<td>1.31</td>
<td>0.10</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>returns</td>
<td>0.99</td>
<td>0.08</td>
<td>0.00</td>
<td>0.00</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>fib -O</th>
<th>5</th>
<th>10</th>
<th>15</th>
<th>20</th>
<th>25</th>
</tr>
</thead>
<tbody>
<tr>
<td>reductions/k</td>
<td>0</td>
<td>2</td>
<td>28</td>
<td>306</td>
<td>3399</td>
</tr>
<tr>
<td>entries</td>
<td>13.33</td>
<td>1.12</td>
<td>0.10</td>
<td>0.00</td>
<td>0.00</td>
</tr>
</tbody>
</table>

Table 4.7: The primes benchmark results

<table>
<thead>
<tr>
<th>primes</th>
<th>50</th>
<th>100</th>
<th>200</th>
<th>300</th>
<th>400</th>
</tr>
</thead>
<tbody>
<tr>
<td>reductions</td>
<td>96374</td>
<td>348835</td>
<td>1293335</td>
<td>2826486</td>
<td>4930210</td>
</tr>
<tr>
<td>closures</td>
<td>-0.03</td>
<td>-0.01</td>
<td>-0.00</td>
<td>-0.00</td>
<td>-0.15</td>
</tr>
<tr>
<td>words</td>
<td>-0.05</td>
<td>-0.01</td>
<td>-0.00</td>
<td>-0.00</td>
<td>-0.15</td>
</tr>
<tr>
<td>entries</td>
<td>-0.05</td>
<td>-0.01</td>
<td>-0.00</td>
<td>-0.00</td>
<td>-0.00</td>
</tr>
<tr>
<td>updates</td>
<td>-0.02</td>
<td>-0.00</td>
<td>-0.00</td>
<td>-0.00</td>
<td>-0.00</td>
</tr>
<tr>
<td>returns</td>
<td>-0.08</td>
<td>-0.02</td>
<td>-0.00</td>
<td>-0.00</td>
<td>-0.00</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>primes -O</th>
<th>50</th>
<th>100</th>
<th>200</th>
<th>300</th>
<th>400</th>
</tr>
</thead>
<tbody>
<tr>
<td>reductions</td>
<td>79032</td>
<td>286485</td>
<td>1063671</td>
<td>2325964</td>
<td>4058756</td>
</tr>
<tr>
<td>closures</td>
<td>0.05</td>
<td>0.01</td>
<td>0.00</td>
<td>0.00</td>
<td>-0.20</td>
</tr>
<tr>
<td>words</td>
<td>0.05</td>
<td>0.01</td>
<td>0.00</td>
<td>0.00</td>
<td>-0.20</td>
</tr>
<tr>
<td>entries</td>
<td>0.02</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>updates</td>
<td>0.10</td>
<td>0.02</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>returns</td>
<td>-0.01</td>
<td>-0.00</td>
<td>-0.00</td>
<td>-0.00</td>
<td>-0.00</td>
</tr>
</tbody>
</table>

**language restrictions** by imposing restrictions upon the set of valid programs it is possible to ensure the language has an efficient operational semantics. To enforce these rules, a static Hindley–Milner type-inference algorithm has been presented.

**denotational semantics** this set-based valuation function maps a program directly onto its meaning, and is uncluttered by operational issues.

**operational semantics** the STG machine, specified as a state-transition system, provides an operational model of the interpretation of the STG' language, and complements the denotational specification.
Table 4.8: The **queens** benchmark results

<table>
<thead>
<tr>
<th>queens</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>reductions</strong></td>
<td>8426</td>
<td>38630</td>
<td>188174</td>
<td>902002</td>
<td>4568372</td>
</tr>
<tr>
<td>closures</td>
<td>-0.65</td>
<td>-0.18</td>
<td>-0.04</td>
<td>-0.01</td>
<td>-0.01</td>
</tr>
<tr>
<td>words</td>
<td>-0.04</td>
<td>-0.21</td>
<td>-0.23</td>
<td>-0.18</td>
<td>-0.13</td>
</tr>
<tr>
<td>entries</td>
<td>-1.75</td>
<td>-0.55</td>
<td>-0.15</td>
<td>-0.04</td>
<td>0.10</td>
</tr>
<tr>
<td>updates</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>returns</td>
<td>-0.85</td>
<td>-0.08</td>
<td>-0.05</td>
<td>0.01</td>
<td>0.01</td>
</tr>
</tbody>
</table>

Table 4.9: The **hamming** benchmark results

<table>
<thead>
<tr>
<th>hamming</th>
<th>500</th>
<th>750</th>
<th>1000</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>cut off</strong></td>
<td>20</td>
<td>50</td>
<td>80</td>
</tr>
<tr>
<td><strong>primes</strong></td>
<td>20</td>
<td>50</td>
<td>80</td>
</tr>
<tr>
<td><strong>reductions/k</strong></td>
<td>195</td>
<td>702</td>
<td>1316</td>
</tr>
<tr>
<td>closures</td>
<td>0.01</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>words</td>
<td>0.01</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>entries</td>
<td>-0.86</td>
<td>-0.31</td>
<td>-0.17</td>
</tr>
<tr>
<td>updates</td>
<td>0.01</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>returns</td>
<td>1.11</td>
<td>0.39</td>
<td>0.22</td>
</tr>
</tbody>
</table>

Chapter 5

Expressing parallelism – static models

5.1 Introduction

In this chapter a number of guidelines are presented for adding support for parallelism into the sequential STG' language, as described in chapter 4. Typically, this involves extending the abstract syntax, adding language restrictions, and developing a denotational model of the parallel components. The examples used to motivate each of the steps are, where possible, based on the constructs presented in section 2.4. While the issues of language design are not directly addressed, MacLennan's principles [1987, page 547] serve as a useful guide, and are thus reproduced in table 5.1 (the small-caps keywords on the left of the table will be used to refer to these principles throughout the remainder of this chapter).

The basic techniques for introducing parallelism are outlined in section 5.2, while sections 5.3 and 5.4 consider the issues of language restrictions and denotational semantics in the context of parallelism. The chapter is summarised in section 5.5.

5.2 Introducing parallelism into the STG' language

In line with the rest of this thesis, the introduction of parallelism into the sequential language is syntax driven, admitting the following possibilities:

**new production rule** the addition of new rules allows the introduction of task-oriented expressions, ranging in power from simple spark expressions to comprehensive algorithmic skeletons.

**new primitive function** superficially, this has similar properties to the addition of a new production rule. However, this method tends to hide the parallelism from the top-level syntax and semantics, and is not recommended for any but the most routine of situations.

**new primitive type** extensions to the type system can be used to introduce bulk data types, or to improve the encapsulation of other parallel constructs. Both applications require the definition of new production rules (or primitive functions) with which to manipulate values of the new type.

**alteration of an existing expression** by either modifying the syntax, type rule or denotational semantics of one of the standard constructs, it is possible to radically
<table>
<thead>
<tr>
<th><strong>ABSTRACTION</strong></th>
<th>Avoid requiring something to be stated more than once; factor out the recurring pattern</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>AUTOMATION</strong></td>
<td>Automate mechanical, tedious, or error-prone activities</td>
</tr>
<tr>
<td><strong>DEFENCE IN DEPTH</strong></td>
<td>Have a series of defences so that if an error isn’t caught by one it will probably be caught by another</td>
</tr>
<tr>
<td><strong>INFORMATION HIDING</strong></td>
<td>The language should permit modules designed so that (1) the user has all of the information needed to use the module correctly, and nothing more; and (2) the implementor has all the information needed to implement the module correctly, and nothing more</td>
</tr>
<tr>
<td><strong>LABELLING</strong></td>
<td>Avoid arbitrary sequences more than a few items long; do not require the user to know the absolute position of an item in a list. Instead, associate a meaningful label with each item and allow the items to occur in any order</td>
</tr>
<tr>
<td><strong>LOCALISED COST</strong></td>
<td>Users should pay only for what they use; avoid distributed costs</td>
</tr>
<tr>
<td><strong>MANIFEST INTERFACE</strong></td>
<td>All interfaces should be apparent (manifest) in the syntax</td>
</tr>
<tr>
<td><strong>ORTHOGONALITY</strong></td>
<td>Independent functions should be controlled by independent mechanisms</td>
</tr>
<tr>
<td><strong>PORTABILITY</strong></td>
<td>Avoid features that are dependent on a particular machine or small class of machines</td>
</tr>
<tr>
<td><strong>PRESERVATION OF INFORMATION</strong></td>
<td>The language should allow the representation of information that the user might know and that the compiler might need</td>
</tr>
<tr>
<td><strong>REGULARITY</strong></td>
<td>Regular rules, without exceptions, are easier to learn, use, describe, and implement</td>
</tr>
<tr>
<td><strong>SECURITY</strong></td>
<td>No program that violates the definition of the language, or its own intended structure, should escape detection</td>
</tr>
<tr>
<td><strong>SIMPLICITY</strong></td>
<td>A language should be as small and simple as possible. It should contain the minimum number of concepts with simple rules for their combination</td>
</tr>
<tr>
<td><strong>STRUCTURE</strong></td>
<td>The static structure of a program should correspond in a simple way to the dynamic structure of the corresponding computations</td>
</tr>
<tr>
<td><strong>SYNTACTIC CONSISTENCY</strong></td>
<td>Similar things should look similar; different things different</td>
</tr>
<tr>
<td><strong>ZERO-ONE-INFINITY</strong></td>
<td>The only reasonable numbers are zero, one, and infinity</td>
</tr>
</tbody>
</table>

Table 5.1: MacLennan’s language design principles
change the language. As an example, the presented system can be made strict
simply by adjusting the semantics of function and constructor application.

**hybrid definition** it is suggested that any hybrid language be developed incrementally,
in that each of the separate items is prototyped in isolation.

Sections 5.2.1 to 5.2.5 examine each of these approaches in greater depth.

### 5.2.1 New production rules

Introducing parallelism into the language via a new production rule is an attractive proposition for a number of reasons: firstly, all existing algorithms and descriptions will still be valid, and require only the addition of special cases to bring them into line with the new syntax; similarly, the test programs will still be well formed and have the same behaviour; and, finally, the construct is directly visible, making it difficult to overlook or ignore at any stage of the design process.

In general, the addition of a new production rule will proceed as follows:

1. *extend the abstract and concrete syntax.* The primary decision to be made at this
   stage concerns which production-rule group will be extended. As the concrete syntax
   will only be used to encode test programs, aesthetic considerations can be set aside,
   thereby simplifying one of the more difficult aspects of this step.

2. *generate example programs.* Sample programs not only serve as a useful source of
test data for the various animations, but also provide an insight into potential pitfalls
that may be encountered in the later stages. A random-program generator, along
the lines of the hpg utility (see section C.1), may even be of some value.

3. *modify the type-inference and free-variable algorithms.* The main purpose of the
type system is to impose restrictions on the language so as to avoid complicating
the run-time system. These limitations arise from consideration of the kind of values
manipulated by the new constructs.

4. *update the denotational semantics.* Despite the limitations of set-based denotational
descriptions, the development of such models focuses attention on the issues of non-
determinism and the default order of evaluation.

The mechanics of the first point have already been outlined in sections 4.3.5 and 4.4, while
sections 5.3 and 5.4 cover the last two points respectively. The remainder of this section
therefore presents a number of examples, followed by a brief overview of the utility of the
major groups of production rules.
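The first and third steps can be illustrated with a minimal sketch over a toy abstract syntax. The `Expr` type below is a hypothetical fragment, not the STG' syntax itself, and `LetPar` anticipates the letpar rule discussed next; the point is that the existing free-variable algorithm needs only one additional case.

```haskell
import qualified Data.Set as Set

type Var = String

-- A toy expression group, extended with one new production.
data Expr
  = Atom Var
  | App Var [Var]
  | Let Var Expr Expr
  | LetPar Var Expr Expr  -- the new production rule
  deriving (Eq, Show)

-- Free variables; let is treated as non-recursive in this sketch.
freeVars :: Expr -> Set.Set Var
freeVars (Atom v)            = Set.singleton v
freeVars (App f args)        = Set.fromList (f : args)
freeVars (Let v rhs body)    = freeVars rhs `Set.union` Set.delete v (freeVars body)
freeVars (LetPar v rhs body) = freeVars (Let v rhs body)  -- the added special case
```

Because the new constructor has the same binding structure as `Let`, its case simply delegates; a construct with different scoping would need its own clause.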

**The par combinator**

The traditional production rule, \( \text{exp} \rightarrow \text{par} \ \text{var} \ \text{exp} \), dissociates the thread from the expression it will reduce, making local optimisations difficult. Moreover, the operational reading may well include memory allocation, thereby invoking the SYNTACTIC CONSISTENCY principle, such that the following rule is arguably superior:

\[
\text{exp} \rightarrow \text{letpar simplebind exp}
\]

Note that this construct is cumbersome for sparking non-local variables, i.e. a formal
argument or a pattern-matched variable. Surprisingly, this is an advantage as such usage
Table 5.2: Extending production-rule groups

<table>
<thead>
<tr>
<th>Group</th>
<th>Overview</th>
</tr>
</thead>
<tbody>
<tr>
<td>program</td>
<td>useful for adding static or one-off definitions, such as communication channels or initial data mappings</td>
</tr>
<tr>
<td>bindings</td>
<td>similar to the program group, except allowing the creation of dynamic topologies</td>
</tr>
<tr>
<td>binding</td>
<td>appropriate in cases where there will be just one definition involved, or where there is no relationship between each of the bindings</td>
</tr>
<tr>
<td>exp</td>
<td>as the examples presented in section 5.2.1 demonstrate, this group is capable of encoding most forms of task-based parallelism</td>
</tr>
</tbody>
</table>

should be considered carefully, reflecting the lack of control over the computational content. Plasmeijer and van Eekelen [1993b, section 25.3.5, pages 355–357] force this issue by extending the type system so that all possible sources of parallelism have to be clearly identified.

The dual of this operation, the seq combinator, is already represented by the letstrict expression. Furthermore, mutually-recursive threads can only be sparked after the corresponding letrec expression.

Skeletal operations

Moving on to consider skeletal operations, two similar options exist (farm and pipeline are described in section 2.4.3):

\[
\begin{align*}
\text{exp} & \rightarrow \text{skeleton} \\
\text{skeleton} & \rightarrow \text{farm} \ \text{varfun} \ \text{exp} \\
& \quad \ldots \\
\text{exp} & \rightarrow \text{farm} \ \text{varfun} \ \text{exp} \\
& \quad \ldots \\
\text{pipeline} \ \text{varfun}_1 \ldots \text{varfun}_n \ \text{exp} & (n \geq 1)
\end{align*}
\]

The leftmost system provides better encapsulation and, unless the number of skeletons is small, is the recommended solution. The transformations associated with each skeleton can be performed prior to, during, or after the usual STG-language optimisations (see section 3.3), and take exactly the same form:

\[
\begin{align*}
\text{pipeline} \ \text{varfun}_1 \ \text{varfun}_2 \ \text{exp} & \Rightarrow \text{let} \ \text{fun}' = \ldots \ \text{r} \\
& \quad \ldots \rightarrow \text{compose} \ \text{varfun}_1 \ \text{varfun}_2 \\
& \quad \text{farm} \ \text{fun}' \ \text{exp}
\end{align*}
\]

These rules can cause the specification of the topology to become diffuse, such that the grouping of related definitions may be necessary, i.e. \( \text{skeleton} \rightarrow \text{farm} \ \text{binds} \ \text{varfun} \ \text{exp} \).

Selecting a production-rule group

All of these examples have extended the exp production rules, but for each new addition, all of the groups outlined in table 5.2 should be considered. The groups not covered by this table are either inappropriate (the following section deals with extending the type declarations), have no obvious utility, or can be simulated by the represented groups. To illustrate this last point, consider the following rules: \( \text{atom} \rightarrow \text{var} \mid \text{par} \ \text{var} \mid \text{literal} \), where, operationally, any `par`-annotated variable should be sparked. Any expression using this syntax can be trivially converted into an equivalent one which uses the `letpar` construct and the standard `atom` rule group, as demonstrated below:

\[
\begin{align*}
&f \, \text{var}_1 \ldots (\text{par} \, \text{var}_i) \ldots \text{var}_n \\
\implies
&\text{letpar} \, (\text{var}_i, \text{par} = r \to \text{var}_i) \\
&f \, \text{var}_1 \ldots \text{var}_i, \text{par} \ldots \text{var}_n
\end{align*}
\]

The latter approach not only increases the spark’s prominence, but is more in keeping with the operational semantics [Peyton Jones, 1992, section 5, rules 1, 5, and 14]. The relationship between the `lambda_form` and `bind` groups is similar, but either is acceptable, depending on the context.

### 5.2.2 New primitive functions

The addition of a new primitive function is quick and simple, requiring no modifications to be made to the abstract or concrete syntax, and only entailing the following steps:

1. *add the type signature to the primitive environment, PE.*
2. *add a new valuation function to the denotational semantics.*

This simplicity has a price, in that such functions can only manipulate `atoms`. Furthermore, due to the low profile of the primitives, only uncomplicated computation should be introduced via this method – non-deterministic operations, or operations which affect the order of evaluation, are not appropriate!
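The two steps can be pictured with a small Haskell sketch. It is only an illustration under simplifying assumptions: the primitive environment is modelled as a finite map from names to type signatures, the valuation functions as a second map, and only unboxed integers are represented. The names `PrimEnv`, `PrimSem`, `addSignature`, `addValuation`, `applyPrim` and the example primitive `max#` are hypothetical and do not come from the framework itself.

```haskell
import qualified Data.Map as Map

-- Hypothetical, much-simplified types: only unboxed integers are modelled.
data Type = IntHash | Fun Type Type
  deriving (Eq, Show)

-- Atoms are the only things a primitive may manipulate.
newtype Atom = Lit Int
  deriving (Eq, Show)

type PrimEnv = Map.Map String Type
type PrimSem = Map.Map String ([Atom] -> Maybe Atom)

-- Step 1: add the type signature to the primitive environment.
addSignature :: String -> Type -> PrimEnv -> PrimEnv
addSignature = Map.insert

-- Step 2: add a valuation function for the primitive.
addValuation :: String -> ([Atom] -> Maybe Atom) -> PrimSem -> PrimSem
addValuation = Map.insert

-- An illustrative primitive, max#, of type Int# -> Int# -> Int#.
maxSig :: Type
maxSig = Fun IntHash (Fun IntHash IntHash)

maxVal :: [Atom] -> Maybe Atom
maxVal [Lit x, Lit y] = Just (Lit (max x y))
maxVal _              = Nothing   -- wrong number of arguments

-- Look up a primitive by name and apply it to its arguments.
applyPrim :: PrimSem -> String -> [Atom] -> Maybe Atom
applyPrim sem name args = Map.lookup name sem >>= \f -> f args

main :: IO ()
main = do
  let pe  = addSignature "max#" maxSig Map.empty
      sem = addValuation "max#" maxVal Map.empty
  print (Map.lookup "max#" pe)
  print (applyPrim sem "max#" [Lit 3, Lit 7])
```

Neither the abstract nor the concrete syntax is touched: the new primitive is visible only through the two environments.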

### 5.2.3 New primitive types

Extensions to the type system can be used to introduce bulk data types, improve the encapsulation of other constructs, or to relax some of the language restrictions detailed in section 4.5. Whatever the purpose of the new type, the addition proceeds as follows:

1. *Is the type boxed or unboxed?* Before incorporating the new type into the language, some time must be spent considering its machine representation. This helps to determine where to make the extensions in step 2, and to focus the selection of constructs in step 4.
2. *Extend the syntax of types.* Based on the deliberations of step 1, new rules are added to the type system first outlined in figure 4.2. Although concerned with the language syntax, most of the points raised in section 5.2.1 also apply here. Furthermore, if the new type is to be allowed to appear in data-type declarations, the additions to the syntax of types must be mirrored in the language syntax (see step 4).
3. *Modify the unification algorithm.* This controls how the new type interacts with type variables, i.e. whether values of this type can be manipulated by polymorphic functions. If a type is boxed there is little reason to disallow such interactions.
4. *Extend the language syntax and primitive functions.* Facilities to create and manipulate instances of the type are next introduced into the language. Four categories of operations should be considered: value creation; conversion from or to existing types; transformations within the same type; and conditionals, including comparison operations.
5. *Update the semantic descriptions.* For each new construct added by the previous stage, steps 2–4 of the method outlined in section 5.2.1 should be followed (or section 5.2.2 for primitive functions). It may be necessary to update the domain equations used by the denotational semantics.

The following examples serve as demonstrations of the method, and also highlight some
of the potential applications.

**Data-parallel Haskell**

Hill has implemented data-parallel Haskell (see section 2.4.2) on the AMT DAP (Distributed Array Processor [MacDonald, 1992]), a SIMD machine using a flexible 64 by 64 grid of 1-bit processors. Operationally, as DAP vectors can only contain unboxed primitive data types [Hill, 1994, table 5.1], a POD is prevented from storing functions or unevaluated expressions. This restriction also impacts upon the representation of algebraic PODs, which are thus stored as tables of simpler PODs – the relationship between an algebraic type, \( \chi \), and its flattened representation, \( \chi' \), is shown below:

\[
\text{data } \chi = \text{cons}_{i_1} \tau_{i_1} \ldots \tau_{i_{a_1}} \implies \text{data } \chi' = \text{flat}_{\chi} \chi_{\text{tag}} \tau_{i_1} \ldots \tau_{i_{a_1}} \ldots \tau_{i_{a_n}}
\]

\[
\vdots
\]

\[
\text{data } \chi_{\text{tag}} = \text{cons}'_{i_1} \ldots \text{cons}'_{i_{n-1}} \chi_{\text{not here}}
\]

The first entry encodes the constructor tag, with the remaining entries representing each
possible argument of every constructor associated with the data type. These restrictions
are reflected in the extensions to the syntax of types of the STG' language:

<table>
<thead>
<tr>
<th>Boxed type</th>
<th>POD vector</th>
<th>( \alpha )</th>
<th>pod</th>
<th>POD vector</th>
</tr>
</thead>
<tbody>
<tr>
<td>POD vector</td>
<td>pod</td>
<td>( \langle \nu \rangle )</td>
<td>primitive vector</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>( \langle \chi_{\text{tag}} \rangle ) pod _1 \ldots pod _n</td>
<td>flattened algebraic type (( n \geq 0 ))</td>
<td></td>
</tr>
</tbody>
</table>

The unification algorithm, \( U \), is extended as shown in figure 5.1, with the first rule
stating that PODs are first-class citizens with respect to polymorphism. Figure 5.2 shows
the production rules added by Hill [1994, chapter 5] to support the new type – there are no
conversion routines defined, only creation (\( \langle \cdots \rangle \)), transformation (MAP, INDICES, SEND,
and FETCH), and conditional (CASE) operations. Notice that named defaults, as described
in section 4.3.3, can be avoided by using the MAP\_1 construct.

Expression

\[
\begin{array}{rcll}
\text{exp} & \rightarrow & \text{MAP}_n \ (\text{lambda}\_\text{form} \mid \text{var}) \ \text{exp}_1 \ldots \text{exp}_n & (n \geq 1) \\
 & \mid & \text{INDICES} \ \text{var} & \\
 & \mid & \text{SEND} \ \text{var} \ \text{var} & \\
 & \mid & \text{FETCH} \ \text{var} \ \text{var} & \\
 & \mid & \text{CASE} \ \text{exp} \ \text{OF} \ \text{palts} \ \text{default} & \\
 & \mid & \langle \text{cons} \rangle \ \text{atoms} &
\end{array}
\]

Parallel alternatives

\[
\begin{array}{rcll}
\text{palts} & \rightarrow & \text{lpalt}_1 \ldots \text{lpalt}_n & (n \geq 1) \\
 & \mid & \text{vars} \ \text{apalt}_1 \ldots \text{apalt}_n & (n \geq 1) \\
\text{lpalt} & \rightarrow & \forall \ \text{literal} \rightarrow \text{exp} & \\
\text{apalt} & \rightarrow & \forall \ \text{cons} \rightarrow \text{exp} &
\end{array}
\]

Atom

\[
\begin{array}{rcl}
\text{atom} & \rightarrow & \langle \ldots \text{atom} \ldots \rangle
\end{array}
\]

Figure 5.2: Hill's extended syntax for a data-parallel STG language

Abstract syntax  
\[ \nu \rightarrow \text{PID#} \quad \text{processor identification} \]

Unification rule  
\[ \text{U PID# PID#} = (\text{PID#}, \emptyset) \]

New production rules  
\[ \text{exp} \rightarrow \text{randomPID#} | \text{currentPID#} | \ldots \]

New primitives  
\[ \text{neighborPID#} : \text{Int#} \rightarrow \text{PID#} \rightarrow \text{PID#} \]
\[ \text{itopPID#} : \text{Int#} \rightarrow \text{PID#} \]

Figure 5.3: The PID# type for restricting access to non-deterministic topology functions

Processor identification

The topology operations detailed in table 2.3 are a potential source of non-determinism, and it is desirable to restrict the employment of their results to purely operational matters. This is achieved naturally through the use of the new unboxed type shown in figure 5.3. Operationally, variables of this type will be represented as unboxed integers, i.e. equivalent to values of type Int#. However, by restricting the constructs and functions that produce and consume values of this new type, the desired encapsulation is achieved.
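The encapsulation argument can be mimicked in Haskell with a `newtype`. The sketch below is an analogy rather than the STG' definition: `processorCount`, the ring-shaped neighbour function and `placeOn` are assumptions made for the illustration, and `currentPID` is a fixed stand-in for a value that would really come from the run-time system.

```haskell
-- Hypothetical machine size, chosen only for this sketch.
processorCount :: Int
processorCount = 4

-- The representation is an ordinary integer, but no function below ever
-- converts a PID back into an Int.
newtype PID = PID Int

-- Stand-in for currentPID#; a real implementation would ask the run-time system.
currentPID :: PID
currentPID = PID 0

-- Stand-in for neighborPID#: move a given distance around a ring of processors.
neighborPID :: Int -> PID -> PID
neighborPID offset (PID p) = PID ((p + offset) `mod` processorCount)

-- A placement operation that consumes a PID but returns its argument unchanged,
-- so the identifier can influence where work runs but not what it computes.
placeOn :: PID -> a -> a
placeOn _ value = value

main :: IO ()
main = print (placeOn (neighborPID 1 currentPID) (6 * 7 :: Int))
```

If the module exported only `PID` (without its constructor) together with these functions, client code would have no way of turning a processor identifier back into an ordinary integer, which is the restriction the new unboxed type is intended to provide.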

Pipeline parallelism

Most existing implementations of the pipeline skeleton are limited by the static type system to using stages which all have the type \( \alpha \rightarrow \alpha \) [Bratvold, 1994, section 3.4.1]. While it is possible to use algebraic data types to circumvent this restriction, this is a cumbersome and inefficient solution. The boxed type outlined in figure 5.4 offers a more flexible solution.
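As a rough Haskell analogue of such a boxed pipeline type, the sketch below wraps a function in an abstract `Pipe` type with three operations named after the primitives of figure 5.4. The signatures are adapted to Haskell for the purpose of illustration; in particular, giving `emptyPipe` the identity type `Pipe a a` is an assumption of this sketch, and the sequential `map` in `applyPipe` says nothing about how a parallel implementation would schedule the stages.

```haskell
-- A pipeline from a to b; the representation is hidden behind the three
-- operations below.
newtype Pipe a b = Pipe (a -> b)

-- The empty pipeline passes values through unchanged.
emptyPipe :: Pipe a a
emptyPipe = Pipe id

-- Prepend a stage; the stage's result type must match the pipeline's input type.
addstagePipe :: (a -> b) -> Pipe b c -> Pipe a c
addstagePipe f (Pipe g) = Pipe (g . f)

-- Run every element of the input list through the pipeline.
applyPipe :: Pipe a b -> [a] -> [b]
applyPipe (Pipe f) = map f

main :: IO ()
main = do
  -- Three stages with different types: String -> Int -> Int -> String.
  let pipe = addstagePipe length (addstagePipe (* 2) (addstagePipe show emptyPipe))
  print (applyPipe pipe ["a", "bc", "def"])
```

Each stage may change the element type, and the type checker still rejects pipelines whose adjacent stages do not fit together.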

5.2.4 Altering existing expressions

This method is deceptively simple, as all of the necessary definitions and functions already exist, and only require modification. However, any change, whether it be to the syntax, language restrictions, or denotational semantics, may cause the test programs to either
Abstract syntax

<table>
<thead>
<tr>
<th>Symbol</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\pi$</td>
<td>pipeline specification</td>
</tr>
</tbody>
</table>

Unification rule

$U \alpha \pi_1 \rightarrow \pi_2 = \left( \frac{\pi_1 \rightarrow \pi_2, \{\alpha \rightarrow \pi_1 \rightarrow \pi_2\}}{\pi_1 \rightarrow \pi_2, S_1 \oplus S_2} \right)$

where

$(\alpha_1, S_1) = U \alpha_{11} \alpha_{21}$

$(\alpha_2, S_2) = U (S_1 \alpha_{12}) (S_1 \alpha_{22})$

New primitives

<table>
<thead>
<tr>
<th>Primitive</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>addstagePipe</td>
<td>$(\alpha_1 \rightarrow \alpha_2) \rightarrow \alpha_2 \rightarrow \alpha_3 \rightarrow \alpha_1 \rightarrow \alpha_3$</td>
</tr>
<tr>
<td>applyPipe</td>
<td>$\alpha_1 \rightarrow \alpha_2 \rightarrow List \alpha_1 \rightarrow List \alpha_2$</td>
</tr>
<tr>
<td>emptyPipe</td>
<td>$\alpha_1 \rightarrow \alpha_2$</td>
</tr>
</tbody>
</table>

Figure 5.4: Improving pipeline parallelism using a new boxed type

become invalid, fail to terminate, or yield different results. Re-checking the integrity of the collection can be time consuming, and the savings over the addition of a new rule or type may often be negligible. Moreover, most modifications will have to be made simultaneously, before testing can begin in earnest. Hence it is suggested that the alteration of existing expressions should be employed only when major changes are deemed necessary, and no other method is applicable.

The following three sections look at alterations to the abstract syntax, language restrictions, and denotational semantics respectively.

Abstract syntax

Alterations to the abstract syntax are limited to the following:

**removal of a production rule** the deletion of a rule effectively removes a capability from the language. Examples include the removal of: named defaults (see section 4.3.3); letrec expressions, such that recursion can only be defined at the top level; and the literal alternative from the atom group, forcing all primitive values to be defined using the let# expression. This last change would bring the language more into line with the continuation-passing style advocated by Appel [1992, figure 2.1].

**addition of a new non-terminal symbol to a production rule** the introduction of a new field can increase the expressiveness of an existing construct. For example, the lambda_form could be extended to include a location directive (see section 2.4.4). As a special case, simple expression-based extensions can be avoided altogether by using the attribute database (see section 4.6). This shortcut can be stretched to include bindings and lambda forms, but the access routines become complex.

**removal of a symbol from a production rule** this is the inverse of the previous operation, and is used to delete extraneous symbols. As an example, Hill [1994, figure 5.2] simplified the lambda_form rule to $\text{vars}_{\text{args}} \rightarrow \text{exp}$, claiming that the missing operational information could be inferred easily. This is certainly true of the free-variable data, but the removal of the update flag does complicate the encoding of, for example, strictness and complexity information.

Note that re-ordering and replacing fields can be treated as combinations of deletions and additions. With the exception of production-rule deletion, the method outlined in section 5.2.1 should be followed, with the existing definitions serving as an additional guide.

Language restrictions

Alterations to the type rules will either tighten or relax the constraints placed on STG' programs. The method is straightforward, involving only the addition or deletion of assertions, but the total effect can be considerable, as illustrated by the following examples.

The APPLY-EXP' rule, shown below, restricts the result of function application to algebraic values – a lifting algebraic type (data \( \chi_{\text{Lift}} \alpha = \text{Lift} \alpha \)) would have to be used to return functions or polymorphic variables:

\[
\text{APPLY-EXP}' \quad
\frac{
\begin{array}{c}
\text{TE} \vdash \text{var}_{\text{fun}} : \tau_1 \to \cdots \to \tau_n \to \chi \ \alpha_1 \ldots \alpha_v \\
\text{TE} \vdash \text{atom}_i : \tau_i \quad (0 \leq i \leq n)
\end{array}
}{
\text{TE} \vdash_{\text{exp}} \text{var}_{\text{fun}} \ \text{atom}_1 \ldots \text{atom}_n : \chi \ \alpha_1 \ldots \alpha_v
}
\]

Relaxing any of the restrictions described in section 4.5.1 would require significant changes to be made to the operational semantics, and should therefore be avoided. The technique used in section 5.2.3 to improve the pipeline skeleton is a less powerful, yet workable, alternative. However, as an example, the following type rule removes the second restriction, allowing constructors to appear unsaturated:

\[
\text{CONS-EXP}' \quad
\frac{
\begin{array}{c}
(\text{cons}, (n, \sigma)) \in \text{CE} \\
\text{TE} \vdash_{\text{spec}} \sigma : \tau_1 \to \cdots \to \tau_n \to \chi \ \pi_1 \ldots \pi_v \\
\text{TE} \vdash_{\text{atom}} \text{atom}_i : \tau_i \quad (0 \leq i \leq m \leq n)
\end{array}
}{
\text{TE} \vdash_{\text{exp}} \text{cons} \ \text{atom}_1 \ldots \text{atom}_m : \tau_{m+1} \to \cdots \to \tau_n \to \chi \ \pi_1 \ldots \pi_v
}
\]

Denotational semantics

As outlined in section 4.7, the denotational semantics consists of three sets of definitions – the meta language, the domain equations, and the valuation functions – and changes can be made to each of these. Whichever group is targeted, the primary motivation behind any modification is likely to be concerned with the default order of evaluation. As an example, function application can be forced to model applicative-order reduction using the valuation function shown in figure 5.5. The transformation from a non-strict language into a strict language can be completed by redefining constructor application and the variable-binding functions. For the same reasons as outlined in section 5.2.2, non-determinism should not be introduced by this route.
Figure 5.5: A valuation function for strict function application
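The difference between the two application strategies can be illustrated with a toy Haskell model in which bottom is an explicit value. This is not a transcription of figure 5.5: the `Val` domain and the functions `applyNonStrict` and `applyStrict` are hypothetical names introduced for the sketch, and only the behaviour with respect to an undefined argument is modelled.

```haskell
-- A tiny value domain with an explicit bottom element.
data Val = Bottom | IntV Int | FunV (Val -> Val)

-- Render a value for printing (functions cannot be shown directly).
describe :: Val -> String
describe Bottom   = "bottom"
describe (IntV n) = "IntV " ++ show n
describe (FunV _) = "<function>"

-- Normal-order application: the argument is passed on without inspection.
applyNonStrict :: Val -> Val -> Val
applyNonStrict (FunV f) arg = f arg
applyNonStrict _        _   = Bottom

-- Applicative-order application: an undefined argument makes the result undefined.
applyStrict :: Val -> Val -> Val
applyStrict _        Bottom = Bottom
applyStrict (FunV f) arg    = f arg
applyStrict _        _      = Bottom

-- A function that ignores its argument.
constOne :: Val
constOne = FunV (\_ -> IntV 1)

main :: IO ()
main = do
  putStrLn (describe (applyNonStrict constOne Bottom))
  putStrLn (describe (applyStrict constOne Bottom))
```

Applying the constant function to bottom yields a defined result under the non-strict reading and bottom under the strict one, which is the observable difference that an applicative-order valuation function introduces.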

5.2.5 Hybrid definitions

While it is possible to prototype a simple language using just one of the previous four methods, it is only by combining these strategies that more complex effects can be achieved. For example, the $\alpha_1 \rightarrow \alpha_2$ boxed type and associated primitives (section 5.2.2) can increase the expressiveness of the pipe skeleton (section 5.2.1). Similarly, the PID# type provides support for a data-placement operation. However, apart from suggesting that each of the separate items be developed in isolation, hybrid definitions are beyond the scope of this thesis.

5.3 Language restrictions revisited

This section is concerned with the restrictions that need to be placed on any new language features, starting with an overview of the types of restraints that can be applied in section 5.3.1. The remaining two sections then look at extending the type-inference and free-variable algorithms presented in section 4.5.

5.3.1 Syntactic, algorithmic, and informal restrictions

There are three complementary approaches to limiting the set of valid language terms (defence in depth), and these are summarised below:

**abstract syntax** by controlling the way in which language terms can be constructed, it is possible to prevent undesirable phrases from being expressed. A good example of this is the abstract syntax of the POD type, defined in section 5.2.3, which requires no additional constraints to ensure that a given term is valid.

**algorithmic restrictions** often, complex restrictions cannot be enforced by the abstract syntax alone, and additional mechanised checks have to be made. The sequential STG' language, for example, already makes use of free-variable and type-inference algorithms. In the context of parallel languages, new algorithmic techniques, such as shape analysis [Jay and Cockett, 1994], have great potential.

**informal restrictions** in situations where mechanisation is difficult, informal restrictions may be imposed on a language. Examples include requiring that: an operator is associative [Skillicorn, 1990, the reduce and directed-reduce operations, page 45]; or that the head function is never applied to an empty list [Hudak et al., 1992, PreludeList, page 106]. It is usually left to the programmer to ensure that these conditions are met – failure to do so may result in errors that are difficult for the run-time system to detect.

5.3.2 Type-inference rules

The type-inference algorithm presented in section 4.5 enforces the majority of restrictions required of well-formed sequential STG' language programs. By extending the underlying rule set, most of the mundane restrictions that need to be placed on parallel extensions can also be checked. The rules contained in table 5.3 illustrate this last point, and also serve as an overview of the basic principles. While most cases should be straightforward, care has to be taken when dealing with constructs that force evaluation, as typified by the par combinator. Furthermore, extensions to the total environment, TE, may have to be made to accommodate new primitive types. Both of these issues are discussed in the following sections.

As a final note, by extending the type-inference algorithm it becomes obvious which constructs manipulate collections of values. Rather than using the standard List or Tree data types, these operations may benefit from the support of a specialised primitive type (see section 5.2.3). As an example, consider the PIPE-SKELETON rule from table 5.3.

Constrained types and the polymorphic par combinator

As mentioned in section 4.5.1, the STG machine has an aggressive take mechanism, such that a function can only be reduced when all of its arguments are present. This has obvious implications for any evaluation-forcing operation, which should thus only be used to reduce variables or expressions with algebraic or literal types. Even with a non-aggressive take, the return mechanism assumes that the exact type of the final result is known prior to evaluation – arbitrarily reducing polymorphic variables and expressions could cause problems. It follows that the PIPE-SKELETON rule from table 5.3 is too relaxed, while the PAR-EXP and LETPAR-EXP rules are suitably constrained.

Extending the type environment

With the introduction of new types or new binding mechanisms, the total environment, TE, may need to be extended. For instance, case alternatives for algebraic PODs only include the constructor tag, necessitating a POD-tag environment, PTE, of type \( \chi_{\text{tag}} \rightarrow \text{pod}_1 \ldots \text{pod}_n \). The ALG-PALTS type rule, shown in figure 5.6, illustrates the use of this new entry. Note that it would be possible to simply use the constructor environment, CE, to store this information, and have the ALG-PALT rule access and return the required information.

With regard to animating these two rules, the ALG-PALTS rule needs to know the tag type before creating the local variable environment used by the ALG-PALT rule. As this information is inferred by the second rule, there is an obvious cyclic dependency. Two possible solutions exist: either Haskell's non-strict semantics can be used to resolve the conflict; or the first rule can peek at the left-hand side of the first alternative, and independently determine the type. The former strategy is the more elegant and concise, but will be sensitive to the strictness of all the routines which manipulate the environment.
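The first of these solutions relies on a circular definition that lazy evaluation is able to resolve. The following sketch shows the shape of such a definition in isolation; `inferTag` and `checkAlts` are hypothetical stand-ins for the two rules, and the string-based types are assumptions chosen to keep the example short.

```haskell
type TagType = String
type Env     = [(String, TagType)]

-- Stand-in for the second rule: it determines the tag type and is given the
-- local environment. It only counts the entries, so it never forces the tag
-- stored inside them.
inferTag :: Env -> TagType
inferTag env = "Tag" ++ show (length env)

-- Stand-in for the first rule: the environment mentions the tag type before
-- that type has been computed, and the tag type is computed from the environment.
checkAlts :: [String] -> (TagType, Env)
checkAlts vars = (tag, env)
  where
    env = [(v, tag) | v <- vars]
    tag = inferTag env

main :: IO ()
main = print (checkAlts ["x", "y"])
```

The definition terminates only because `inferTag` does not inspect the tag fields of the environment. A stricter routine, for example one that compared those fields before producing its result, would turn the same circular definition into a non-terminating one, which is the sensitivity noted above.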

<table>
<thead>
<tr>
<th>Principle</th>
<th>Rule</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. simple types</td>
<td>CURRENT-PID-EXP</td>
<td>\( \text{TE} \vdash_{\text{exp}} \text{currentPID#} : \text{PID#} \)</td>
</tr>
<tr>
<td>2. accessing the TE</td>
<td>INDICES-EXP</td>
<td>\( \text{TE} \vdash_{\text{exp}} \text{INDICES} \ \text{var} : \langle \text{Int#} \rangle \)</td>
</tr>
<tr>
<td>3. dependent types</td>
<td>PAR-EXP</td>
<td>\( \dfrac{\text{TE} \vdash_{\text{exp}} \text{exp} : \tau \qquad (\text{var}, \chi \ \pi_1 \ldots \pi_v) \in \text{TE}}{\text{TE} \vdash_{\text{exp}} \text{par} \ \text{var} \ \text{exp} : \tau} \)</td>
</tr>
<tr>
<td>4. constrained types: i. unboxed</td>
<td>POD-ATOM</td>
<td>\( \text{TE} \vdash_{\text{atom}} \langle \ldots \text{atom} \ldots \rangle : \langle \nu \rangle \)</td>
</tr>
<tr>
<td>4. constrained types: ii. boxed</td>
<td>PIPE-SKELETON</td>
<td>\( \dfrac{\text{TE} \vdash_{\text{exp}} \text{exp} : \text{List} \ \pi_1}{\text{TE} \vdash_{\text{skeleton}} \text{pipe} \ \text{var}_{\text{pipe}} \ \text{exp} : \text{List} \ \pi_2} \)</td>
</tr>
<tr>
<td>5. extending the TE</td>
<td>LETPAR-EXP</td>
<td>\( \dfrac{\text{TE} \vdash_{\text{exp}} \text{exp}_{\text{defn}} : \chi \ \pi_1 \ldots \pi_v \qquad \text{LVE} = \{ \text{var} \mapsto \chi \ \pi_1 \ldots \pi_v \} \qquad \text{TE} \oplus \text{LVE} \vdash_{\text{exp}} \text{exp} : \tau}{\text{TE} \vdash_{\text{exp}} \text{letpar} \ \text{var} = \text{exp}_{\text{defn}}, \ \text{exp} : \tau} \)</td>
</tr>
<tr>
<td>6. auxiliary functions</td>
<td colspan="2">see the CASE-EXP and PROGRAM rules in appendix D</td>
</tr>
</tbody>
</table>

Table 5.3: A selection of type rules for parallel constructs

\[
\text{ALG-PALTS} \quad
\frac{
\begin{array}{c}
(\chi_{\text{tag}}, \text{pod}_1 \ldots \text{pod}_n) \in \text{PTE} \\
\text{LVE} = \{\text{var}_1 \mapsto \text{pod}_1, \ldots, \text{var}_n \mapsto \text{pod}_n\} \\
\text{PTE} \otimes \text{LVE} \vdash_{\text{apalt}} \text{apalt}_i : \chi_{\text{tag}} \rightarrow \text{pod} \quad (1 \leq i \leq m)
\end{array}
}{
\text{PTE} \vdash_{\text{palts}} \text{var}_1 \ldots \text{var}_n \ \text{apalt}_1 \ldots \text{apalt}_m : \langle \chi_{\text{tag}} \rangle \ \text{pod}_1 \ldots \text{pod}_n \rightarrow \text{pod}
}
\]

\[
\text{ALG-PALT} \quad
\frac{
\text{PTE} \vdash_{\text{exp}} \text{exp} : \text{pod}
}{
\text{PTE} \vdash_{\text{apalt}} \forall \ \text{cons} \rightarrow \text{exp} : \chi_{\text{tag}} \rightarrow \text{pod}
}
\]

Figure 5.6: The ALG-PALTS and ALG-PALT type rules for PODs

5.3.3 Free variables

In general, developing the \( \mathcal{FV}[] \) rules for the parallel-language terms is straightforward – it is simply a matter of taking the union of the free variables of each of the non-terminal symbols that compose the construct:

\[
\mathcal{FV}_{\text{skeleton}}[\text{pipe} \ \text{var}_{\text{pipe}} \ \text{exp}] \ g = \mathcal{FV}_{\text{var}}[\text{var}_{\text{pipe}}] \ g \cup \mathcal{FV}_{\text{exp}}[\text{exp}] \ g
\]

The only complication involves binding operations, where care must be taken to filter out the local variables from the final answer:

\[
\mathcal{FV}_{\text{exp}}[\text{letpar} \ \text{var} = \text{exp}_{\text{defn}} \ \text{exp}] \ g = \mathcal{FV}_{\text{exp}}[\text{exp}_{\text{defn}}] \ g \cup (\mathcal{FV}_{\text{exp}}[\text{exp}] \ g \setminus \{\text{var}\})
\]
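An executable reading of these two rules is sketched below in Haskell. The abstract syntax is deliberately cut down to four constructs, the constructor names are hypothetical, and the additional parameter \( g \) of the rules above is omitted, so this should be read as an illustration of the union-and-filter pattern rather than as the framework's own algorithm.

```haskell
import qualified Data.Set as Set

type Var = String

-- A deliberately tiny abstract syntax (constructor names are hypothetical).
data Exp
  = VarE Var             -- a variable
  | App Var [Var]        -- function applied to variables
  | Pipe Var Exp         -- pipe var_pipe exp
  | LetPar Var Exp Exp   -- letpar var = exp_defn in exp
  deriving Show

freeVars :: Exp -> Set.Set Var
freeVars (VarE v)          = Set.singleton v
freeVars (App f args)      = Set.fromList (f : args)
-- Non-binding construct: the union of the parts.
freeVars (Pipe p e)        = Set.insert p (freeVars e)
-- Binding construct: the bound variable is removed from the body only.
freeVars (LetPar v defn e) = freeVars defn `Set.union` Set.delete v (freeVars e)

example :: Exp
example = LetPar "y" (App "f" ["x"]) (Pipe "stage" (App "g" ["y", "xs"]))

main :: IO ()
main = print (Set.toList (freeVars example))
```

In the example, `y` is bound by the `letpar` and so does not appear in the result, whereas an occurrence of `y` inside the defining expression would remain free because `letpar` is not recursive in this sketch.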

5.4 Denotational semantics and parallel languages

In the context of the prototyping framework, once developed, the denotational description of the entire language will serve as a guide during the development of the operational semantics, and as a reference model during the testing phase. In addition, the construction of the denotational semantics focuses the designer's attention on the following areas:

**order of evaluation** for the semantics outlined in section 4.7, the order of evaluation is primarily determined by the occurrences of \text{case}, \text{letstrict}, and \text{let#} expressions. In the presence of scheduling constructs (see section 2.4.4) more complex orderings can be specified.

**degree of evaluation** in order to increase the amount of work performed by, for example, a \text{pipe} expression, it may be necessary to reduce its argument further than the usual head normal form. This behaviour should be reflected in the valuation function.

**speculative evaluation and non-termination** an expression which reduces to bottom, \( \bot \), under the denotational semantics will probably fail to terminate in an actual implementation. The effect of non-termination on the run-time system can be grossly specified by the denotational model.

**non-determinism** despite the potential loss of referential transparency, the introduction of non-determinism can be useful, particularly when providing access to operational parameters, such as the system load or the processor identifier.

**run-time errors and exception handling** it is not uncommon for languages to support the use of informal restrictions by providing a primitive similar to Haskell's `error` operator [Hudak et al., 1992, pages 68 and 88]. Non-fatal errors can be supported through the use of user-level exception handlers.

Each of these issues, along with a brief summary of the different strategies, is explored in the following sections.

5.4.1 Order of evaluation

As the underlying mathematics supports no notion of 'order of evaluation', it seems unlikely that the denotational semantics can be applied to this problem. However, in the absence of side effects, the exact interleaving of computations [Hooman, 1991, sections 2.2 and 4.1] is not important, and only the dependencies need to be expressed [Bloss and Hudak, 1988]. Consider the following example:

\[
\mathcal{E}[\text{seq} \ \text{var} \ \text{exp}] \ \rho = \text{case } (\rho \ \text{var}) \text{ of } \begin{cases}
\bot \rightarrow \bot \\
\epsilon \rightarrow \mathcal{E}[\text{exp}] \ \rho
\end{cases}
\]

Remembering that bottom, \( \perp \), equates to non-termination, this valuation function captures the expected behaviour, i.e. \( \text{seq } x y \) will only return the value \( y \) if \( x \) represents a finite computation. Similarly, the \( \text{par} \) combinator can be modelled as follows (ignoring termination properties):

\[
\mathcal{E}[\text{par} \ \text{var} \ \text{exp}] \ \rho = \mathcal{E}[\text{exp}] \ \rho
\]

Unfortunately, using this strategy, no satisfactory definition can be arrived at for a passive \( \text{wait } x y \) expression [Goldberg, 1988a, section 3.2], i.e. one which cannot initiate the reduction of \( x \). If a \( \text{seq} \)-style valuation function is used, the possibility that \( x \) is never reduced is not expressed. However, in all of the examples presented by Goldberg, the \( \text{waits} \) are always matched by preceding \( \text{par} \) operations – if this restriction could be enforced by the language, the proposed description would be valid.

While obtaining an accurate model is desirable, it is not essential, as the operational semantics is a better medium for expressing these concerns. As long as the denotational model is under-constrained, i.e. the denotational reading is always as well defined as the operational result, the model can still be used for test purposes.

As a final note, it is expected that the number of constructs which change the default (non-strict) order of evaluation will be small. It was therefore decided not to use the continuation-passing style of denotational semantics [Raskovsky and Collier, 1980; Sethi, 1982; Schmidt, 1986, chapter 9], which would, arguably, make the descriptions harder to follow in most cases.

5.4.2 Degree of evaluation

While testing for bottom can be used to indicate that a value will be reduced to at least head normal form, constructs that manipulate algebraic data types may need to express more complex requirements [Burn, 1991, figure 5.1, page 114]. For example, consider the
following definition of the pipe skeleton:

\[
\text{Skeleton}[\text{pipe} \ \text{var}_{\text{pipe}} \ \text{exp}] \ \rho =
\begin{array}[t]{l}
\text{let function} = \rho \ \text{var}_{\text{pipe}}, \ \text{arguments} = \mathcal{E}[\text{exp}] \ \rho \\
\text{in } \xi_{\infty}(\text{List } \pi) \ (\text{map function arguments})
\end{array}
\]

\[
\xi_{\infty}(\text{List } \pi) : \text{Val} \rightarrow \text{Val}
\]

\[
\xi_{\infty}(\text{List } \pi) \ \epsilon = \begin{cases}
\bot \rightarrow \bot \\
\langle \text{Nil} \rangle \rightarrow \langle \text{Nil} \rangle \\
\langle \text{Cons}, x, xs \rangle \rightarrow \langle \text{Cons}, x, \xi_{\infty}(\text{List } \pi) \ xs \rangle
\end{cases}
\]

The application of the \(\xi_{\infty}(\text{List } \pi)\) function [Burn, 1991, section 1.2, page 7] to the pipe’s input, forces reduction to spine normal form [Kewley and Glynn, 1990, page 330]. For each data type there is potentially a large number of reduction strategies [Hammond, 1991, “the twenty-four names of Cons”, section 8.3], and it may be worth considering automatically deriving the evaluation transformers from the type declarations.
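For readers more familiar with Haskell than with evaluation transformers, reduction to spine normal form can be approximated by a function that walks the whole list without touching its elements. The name `forceSpine` is hypothetical, and the function below is a Haskell idiom that plays a similar role, not a rendering of the \( \xi_{\infty} \) definition.

```haskell
-- Traverse every cons cell of the list, then return the list itself.
-- The elements are never inspected, so undefined elements are harmless.
forceSpine :: [a] -> [a]
forceSpine xs = go xs `seq` xs
  where
    go []       = ()
    go (_ : ys) = go ys

main :: IO ()
main = do
  -- The middle element is undefined, but only the spine is demanded.
  let xs = forceSpine [1, undefined, 3 :: Int]
  print (length xs)
```

Applying `forceSpine` to an infinite list, or to a list whose tail is undefined, fails to produce a result, matching the intention that the whole structure of the list must exist before the skeleton's result is returned.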

5.4.3 Speculative evaluation and non-termination

When threads are used to only reduce essential expressions, non-termination of an individual thread is not a problem as the entire computation will, by definition, also fail to terminate. However, by permitting speculative evaluation [Mattson Jr., 1993a, chapter 3], it is possible that a non-essential thread may consume sufficient resources so as to affect the final result. Consider the following valuation functions:

\[
\mathcal{E}[\text{par} \ \text{var} \ \text{exp}] \ \rho = \text{case } (\rho \ \text{var}) \text{ of } \begin{cases}
\bot \rightarrow \bot \\
\epsilon \rightarrow \mathcal{E}[\text{exp}] \ \rho
\end{cases}
\]

\[
\mathcal{E}[\text{speculate} \ \text{var} \ \text{exp}] \ \rho = \mathcal{E}[\text{exp}] \ \rho
\]

Based on these definitions, the `par` combinator can only be used to reduce either essential values, or expressions which are known to terminate. The `speculate` combinator is less constrained, in that it can be used to evaluate any expression. The cost of this increased expressiveness is that the run-time system must use a fair scheduling algorithm, and be capable of garbage collecting unnecessary threads [Mattson Jr., 1993a, section 7.4.1].
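The contrast between the two combinators can be made concrete with a small executable model. The sketch assumes a toy value domain in which bottom is an explicit constructor, treats unbound variables as bottom, and uses hypothetical constructor names for the expression syntax; it follows the strict reading of `par` given in this section rather than the termination-insensitive one of section 5.4.1.

```haskell
import qualified Data.Map as Map

-- A toy value domain with an explicit bottom element.
data Val = Bottom | IntV Int
  deriving (Eq, Show)

type Env = Map.Map String Val

data Exp
  = Lit Int
  | Var String
  | Seq String Exp
  | Par String Exp
  | Speculate String Exp

-- Unbound variables are treated as bottom (an assumption of this sketch).
lookupVar :: Env -> String -> Val
lookupVar rho var = Map.findWithDefault Bottom var rho

-- Evaluate the body only if the named variable is defined.
strictIn :: String -> Exp -> Env -> Val
strictIn var e rho = case lookupVar rho var of
  Bottom -> Bottom
  _      -> eval e rho

eval :: Exp -> Env -> Val
eval (Lit n)         _   = IntV n
eval (Var var)       rho = lookupVar rho var
eval (Seq var e)     rho = strictIn var e rho
eval (Par var e)     rho = strictIn var e rho   -- a diverging spark is fatal
eval (Speculate _ e) rho = eval e rho           -- the sparked variable is ignored

main :: IO ()
main = do
  let rho = Map.fromList [("loop", Bottom), ("x", IntV 1)]
  print (eval (Par "loop" (Lit 5)) rho)
  print (eval (Speculate "loop" (Lit 5)) rho)
  print (eval (Seq "x" (Var "x")) rho)
```

Sparking the undefined variable `loop` with `Par` makes the whole expression undefined, whereas `Speculate` leaves the result of the body intact; in this model `Seq` and `Par` are denotationally indistinguishable, the difference between them being purely operational.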

5.4.4 Non-determinism

In general, the introduction of non-determinism results in the loss of referential transparency, and hence invalidates a wide range of compilation techniques [Santos, 1995]. There are only two situations where this loss could be justified: when providing access to run-time values, such as the current processor identifier; and allowing threads to interact non-trivially through the use of side effects. Both cases are considered in the following sections, although [Dennis et al., 1995] is recommended as a succinct review of the situation.

Accessing operational parameters

By providing access to certain run-time parameters, including current workloads and the local-processor identifier, it is possible for a program to adapt its behaviour in the hope
of improving efficiency [Hudak, 1991, section 5.4]. However, such values are inherently non-deterministic, and therefore complicate the development of a denotational semantics. The remainder of this section looks at a number of different ways of incorporating non-determinism, including the use of powerdomains.

The easiest way to handle non-determinism is to ignore it completely, as is done below:

$$\mathcal{E}[\text{currentPID\#}]\, \rho = 42\#$$

The only time that this approach is defensible is when the resulting values can only affect the operational behaviour of the program. This is often achieved by imposing type restrictions on the language (see sections 5.2.3 and 5.3.2). Mirani and Hudak [1995, section 3] take this idea one step further by wrapping all such values inside an operating-system monad [Peyton Jones and Wadler, 1993]. Both of these techniques can be used in conjunction with the other methods described in this section.

A more satisfactory solution is to accurately model the parameter in question. For example, Hudak [1986] has used this technique to develop a semantics for a simple language which includes the \texttt{exp1 on exp2} and \texttt{self} expressions (the latter is similar to the \texttt{currentPID#} construct). The formal arguments of all valuation functions are extended to include a processor identifier, which represents the location of the current computation:

$$\mathcal{E}[\text{exp on pid}]\, \rho\ \text{current\_pid} = \begin{cases} \bot & \text{if } \bot \in \{\text{location\_function}\ i\ j \mid \forall i, j \in \text{PID\#}\} \\ \mathcal{E}[\text{exp}]\, \rho\ \text{new\_pid} & \text{otherwise} \end{cases}$$

The \texttt{Program[]} rule provides the initial value of the \texttt{current_pid} parameter.
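The idea of threading the current location through every valuation function can be sketched as a tiny interpreter. This is only an illustration under stated assumptions: the expression type, the omission of the environment, and the choice to obtain the new processor identifier by evaluating the location expression at the current processor are all hypothetical, and are not taken from Hudak's semantics.

```haskell
-- Hypothetical expression language with explicit placement.
type PID = Int

data Exp
  = Lit Int
  | Self            -- the identifier of the current processor
  | On Exp Exp      -- "exp on pid": evaluate exp at the given processor
  | Add Exp Exp

-- Every valuation takes the current processor identifier as an extra
-- argument (the environment is omitted to keep the sketch short).
eval :: Exp -> PID -> Int
eval (Lit n)   _   = n
eval Self      pid = pid
eval (On e p)  pid = eval e (eval p pid)   -- new location for e
eval (Add a b) pid = eval a pid + eval b pid

-- The top-level rule supplies the initial location (assumed to be 0).
program :: Exp -> Int
program e = eval e 0
```

For example, `program (Add Self (On Self (Lit 3)))` evaluates to 3: the first `Self` is evaluated on processor 0 and the second on processor 3.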

Incorporating \texttt{randomPID#}-style expressions, or attempting to model data migration, is more problematic. Consider the implicit specification of the location using a function of type \texttt{PID#} $\rightarrow$ \texttt{PID#} $\rightarrow$ \texttt{PID#} (at run time, the current processor identifier and a random processor identifier are supplied as arguments). This removes the need for both the \texttt{currentPID#} and \texttt{randomPID#} constructs, and simplifies the denotational semantics:

$$\mathcal{E}[\text{exp1 on exp2}]\, \rho = \begin{cases} \bot, & \text{if } \bot \in \{\text{location\_function}\ i\ j \mid \forall i, j \in \text{PID\#}\} \\ \mathcal{E}[\text{exp1}]\, \rho, & \text{otherwise} \end{cases}$$

It would obviously be unrealistic to check that each location function meets the above requirement.

If none of the above techniques is applicable, it will be necessary to use a powerdomain [Schmidt, 1986, section 12.1, page 275], replacing all occurrences of the Val domain with $\mathbb{P}(\text{Val})$, where $\mathbb{P}(D)$ represents the powerdomain builder. In addition, each rule will have to be updated to handle multiple values, using either Haskell’s list comprehensions [Hudak et al., 1992, section 3.10, page 16] or a monad [Wadler, 1992, “Non-deterministic choice”, section 2.7], and the order of evaluation will have to be re-examined. As a small consolation, the animation of the resulting semantics can be straightforward.
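A minimal sketch of the list-comprehension approach is given below. It assumes a hypothetical three-constructor expression language and uses a Haskell list as a crude stand-in for $\mathbb{P}(\text{Val})$; a list records duplicates and ordering, so it is only an approximation of a genuine powerdomain.

```haskell
-- Hypothetical expression language with a non-deterministic primitive.
data Exp
  = Lit Int
  | CurrentPID        -- stands in for currentPID#
  | Add Exp Exp

-- Every expression denotes the list of all values it may take.
-- The first argument lists the processor identifiers that may be observed.
eval :: [Int] -> Exp -> [Int]
eval _    (Lit n)    = [n]
eval pids CurrentPID = pids
eval pids (Add a b)  = [x + y | x <- eval pids a, y <- eval pids b]
```

On a two-processor machine, `eval [0, 1] (Add CurrentPID CurrentPID)` gives `[0, 1, 1, 2]`: the two occurrences of the primitive are not forced to agree, which illustrates the loss of referential transparency noted at the start of this section.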

Side effects and thread interaction

The development of denotational semantics for sequential side-effecting languages is well understood, and only requires the correct threading of the environment [Schmidt, 1986]. However, the combination of side effects and parallelism [Barth, Nikhil and Arvind, 1991; Jones and Hudak, 1993, section 4.3] generally implies complex descriptions, based on a large number of assumptions and limitations; consider, for example, the semantics presented by Hooman [1991] for an Occam-style language. Unless something can be done to limit the possible interactions, it is recommended that this stage of the design process simply be omitted. Hudak [1987, section 2.1], for example, only allows destructive array updating if it can be proved (by the compiler) that this will not break the sequential semantics.

5.4.5 Run-time errors and exception handling

Generally, if an informal restriction is violated, the current thread should be terminated (and possibly the entire computation). Assuming that the failure can be detected, the following construct provides the necessary support:

\[
\mathcal{E}[\text{error exp}]\, \rho = \bot
\]

Unfortunately, using the Val domain definition from section 4.7.1, it is not possible to model low-level errors, including division by zero, by returning bottom. Furthermore, due to resource limitations, for example, it is possible that a valid computation will fail to terminate when run on a computer. This kind of error also cannot be easily modelled by the denotational description.

The provision of exception handling mechanisms, as used by Hammond [1991, section 2.4.1], is a more flexible approach to the same problem, in that it can be used to model non-fatal errors without resorting to the use of algebraic data types.

5.4.6 A selection of bottoms

In the previous sections, the bottom element, \( \bot \), has been used in a number of different roles:

- to model scheduling dependencies.
- to force evaluation beyond the usual head normal form.
- to represent non-termination, and thus limit the applicability of a spark or location construct.
- to indicate that a fatal error has occurred.

Obviously, the animation of the denotational semantics will only be able to handle the last type of bottom (a fatal error) in a non-trivial manner – all others will result in the non-termination of the implementation.

As a final note, when testing for bottom, extra care must be taken to avoid non-monotonicity [Schmidt, 1986, pages 112–113]. For example, the following valuation function would invalidate the entire semantics:

\[
\mathcal{E}[\text{is-bottom exp exp}_{\bot}\ \text{exp}_{\epsilon}]\, \rho = \text{case } (\mathcal{E}[\text{exp}]\, \rho) \text{ of } \bot \rightarrow \mathcal{E}[\text{exp}_{\bot}]\, \rho,\ \epsilon \rightarrow \mathcal{E}[\text{exp}_{\epsilon}]\, \rho
\]
5.5 Summary

In this chapter a number of different syntax-driven guidelines for introducing parallelism into the STG$^t$ language have been proposed, covering: the addition of new production rules, primitive functions, and primitive types; and the modification of existing rules, whether they be taken from the abstract syntax, the type-inference rules, or the denotational semantics. Furthermore, the application of language restrictions and denotational semantics to the development of parallel languages has also been discussed.
Chapter 6

Managing parallelism — operational models

6.1 Introduction

This chapter discusses the development of an operational description to augment the denotational semantics of the parallel STG' language (see chapter 5). The STG machine provides the basic recipe, into which the parallel ingredients, including threads, messages, and shared memory, are added. To facilitate testing and debugging, the animation of the model (which is essentially a state-transition system) is also considered. The final description is then used by chapter 8 to provide the foundation upon which the compilation system is built.

Sections 6.2 and 6.3 are concerned with the introduction of parallelism into an operational model — the former develops a general framework to work within, while the latter deals with the issues specific to a parallel STG machine. The implications of the STG' language manipulations described in the previous chapter are then considered in section 6.4. The animation and testing of the resulting state machines are discussed in section 6.5, before the chapter is summarised in section 6.6.

6.2 Parallelism and the STG machine

This section explores the use of state transition systems to model modern multi-processor systems. Section 6.2.1 discusses the gross representation of the processing and communication elements. This model is then refined to explicitly include the notions of time and inter-processor synchronisation in sections 6.2.2 and 6.2.3 respectively. Shared-memory and message-passing abstractions are then examined in greater detail in sections 6.2.4 and 6.2.5.

6.2.1 One abstract machine or many?

When modelling a parallel or concurrent system, the possible interactions between the component processors can be specified either explicitly or implicitly. As an example of the first approach, both Peyton Jones, Gordon and Finne [1996, section 6.2] and Ostheimer [1993, section 3.4, page 39] use the π-calculus [Milner, 1993] as the underlying formalism. The following congruence and structural rules control the “reactions” (the number of processors is unbounded):

\[
\begin{aligned}
\text{(PAR)} & \quad P \mid Q \rightarrow P' \mid Q, \text{ if } P \rightarrow P' \\
\text{(COMM)} & \quad P \mid Q \equiv Q \mid P \\
\text{(ASSOC)} & \quad P_1 \mid (P_2 \mid P_3) \equiv (P_1 \mid P_2) \mid P_3
\end{aligned}
\]

\[
\text{(INIT)} \quad \text{init} \implies (1, \text{init}_p P_1, \ldots, \text{init}_p P_n) (\text{init}_s S)
\]

\[
\text{(STEP)} \quad \text{step} (i, P_1, \ldots, P_i, \ldots, P_n) S \implies (i', P_1, \ldots, P_i', \ldots, P_n) (\text{step}_s S')
\]
where \(i' = 1 + (i \bmod n)\) and \((P_i', S') = \text{step}_p(P_i, S)\).

\[
\text{(FINAL)} \quad \text{final} (i, P_1, \ldots, P_n) S \implies \text{final}_p P_1 \land \cdots \land \text{final}_p P_n \land \text{final}_s S
\]

Figure 6.1: A simple processor framework for the parallel STG machine

On the other hand, the \(\nu\)-STG machine [Hwang and Rushall, 1992] and the abstract machine for pH [Aditya et al., 1995] are defined in terms of a single processor, with each reaction rule specifying only one half of a processor-processor interaction. In fact, the data-parallel STG machine [Hill, 1994, rules 18, chapter 6, page 123] specifies all forms of parallelism in terms of auxiliary functions and set comprehensions.

While the former approach is undeniably superior from a theoretical standpoint, the latter boasts a greater number of relevant examples and is, arguably, less complex, making it the method of choice. However, to both simplify the animation (see section 6.5) and to clearly identify the primary communication mechanisms, an explicit framework will be assumed throughout this section. The framework shown in figure 6.1 will be used as the starting point, and will be refined as the need arises.

This model comprises three distinct state-transition systems, the abstract states of which are: \(S\), the communication system; \(P\), a single processor; and \((i, P_1, \ldots, P_n) S\), an ensemble. Each has its own set of \text{init}, \text{final}, and \text{step} operators for creating the initial state, testing for a final state, and performing one state transition respectively. Note that the processor index, \(i\), records the identity of the processor to be stepped on the next transition. For most applications this round-robin scheduling is overly simplistic, and section 6.2.2 refines the model to allow the use of timing information to control the order of transitions.

Figure 6.2: Transition rules for a simple ping-pong system

Figure 6.3: The state-transition diagram for the ping-pong system

\[ \text{(INIT)} \quad \text{init} \implies (\text{init}_p P_1, \ldots, \text{init}_p P_n) (\text{init}_s S) \]

\[ \begin{aligned} \text{(STEP)} \quad \text{step} \quad & (P_1, \ldots, P_i, \ldots, P_n) \quad S \implies (P_1, \ldots, P'_i, \ldots, P_n) \quad (\text{step}_s S') \\
\text{where} \quad & (P'_i, S') = \text{step}_p (P_i, S) \\
\text{such that} \quad & \forall j \in \{1, \ldots, n\} \cdot \text{local\_time} \ P_i \leq \text{local\_time} \ P_j \end{aligned} \]

\[ \text{(FINAL)} \quad \text{final} \quad (P_1, \ldots, P_n) \quad S \implies \text{final}_p \ P_1 \wedge \cdots \wedge \text{final}_p \ P_n \wedge \text{final}_s S \]

Figure 6.4: Explicitly modelling time in the processor framework

To demonstrate the basic principles of the framework, figure 6.2 shows the \text{INIT}, \text{STEP}, and \text{FINAL} reduction rules describing a two-processor ping-pong system [Booth et al., 1997]. The first processor, \(P_0\), pings its neighbour, waits for a reply, and then re-starts the cycle. The second processor, \(P_1\), is its dual, and waits to be pinged before ponging \(P_0\). The state diagram of this system is shown in figure 6.3, where the diagonal lines denote communication between the two processors. As specified by the \text{FINAL}_p and \text{FINAL}_s rules, the system never terminates and simply repeats the cycle shown below:

<table>
<thead>
<tr>
<th>Step</th>
<th>System state</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0.</td>
<td>\((0, \text{init}_p P_0, \text{init}_p P_1)\ \text{init}_s S\)</td>
<td>\(\implies\)</td>
</tr>
<tr>
<td>1.</td>
<td>(0, Ping, WaitForPing) Nothing</td>
<td>\(\implies\)</td>
</tr>
<tr>
<td>2.</td>
<td>(1, WaitForPong, WaitForPing) HavePinged</td>
<td>\(\implies\)</td>
</tr>
<tr>
<td>3.</td>
<td>(0, WaitForPong, Pong) Nothing</td>
<td>\(\implies\)</td>
</tr>
<tr>
<td>4.</td>
<td>(1, WaitForPong, Pong) Nothing</td>
<td>\(\implies\)</td>
</tr>
<tr>
<td>5.</td>
<td>(0, WaitForPong, WaitForPing) HavePonged</td>
<td>\(\implies\)</td>
</tr>
<tr>
<td>6.</td>
<td>(1, Ping, WaitForPing) Nothing</td>
<td>\(\implies\)</td>
</tr>
<tr>
<td>7.</td>
<td>(0, Ping, WaitForPing) Nothing</td>
<td>\(\implies\)</td>
</tr>
<tr>
<td>\(\vdots\)</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Notice that nothing happens to \( P_0, P_1, \) and \( S \) between steps 3 and 4 – \( P_0 \) is waiting for a pong which has not yet been sent. A similar situation occurs between steps 6 and 7. While such wait reductions are acceptable for this small example, they can quickly obscure the other reductions as the number of processors increases. Section 6.2.3 addresses this problem by introducing the concept of busy, waiting, and stopped processors.
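The round-robin framework and the ping-pong cycle can be animated in a few lines of Haskell. The sketch below is an assumption-laden reconstruction: the processor and communication rules are inferred from the cycle listed above rather than copied from figure 6.2, the constructor `NoMsg` stands in for the state called Nothing, and the processor index is 0-based to match the listed cycle.

```haskell
-- Processor and communication-system states (names are illustrative).
data Proc  = Ping | Pong | WaitForPing | WaitForPong deriving (Eq, Show)
data Comms = NoMsg | NewPing | NewPong | HavePinged | HavePonged
  deriving (Eq, Show)

-- One processor transition: step_p (P, S) = (P', S').
stepP :: (Proc, Comms) -> (Proc, Comms)
stepP (Ping, _)                 = (WaitForPong, NewPing)
stepP (Pong, _)                 = (WaitForPing, NewPong)
stepP (WaitForPing, HavePinged) = (Pong, NoMsg)
stepP (WaitForPong, HavePonged) = (Ping, NoMsg)
stepP (p, s)                    = (p, s)          -- a wait reduction

-- One communication-system transition: step_s.
stepS :: Comms -> Comms
stepS NewPing = HavePinged
stepS NewPong = HavePonged
stepS s       = s

-- An ensemble: index of the next processor, the processors, and S.
type Ensemble = (Int, [Proc], Comms)

initE :: Ensemble
initE = (0, [Ping, WaitForPing], NoMsg)

-- The STEP rule: step processor i, then the communication system,
-- and advance the index in round-robin order.
step :: Ensemble -> Ensemble
step (i, ps, s) = ((i + 1) `mod` length ps, ps', stepS s')
  where
    (p', s') = stepP (ps !! i, s)
    ps'      = take i ps ++ [p'] ++ drop (i + 1) ps

-- The first n states of the (non-terminating) reduction sequence.
run :: Int -> [Ensemble]
run n = take n (iterate step initE)
```

Evaluating `run 7` lists the states numbered 1 to 7 above, the last of which is equal to the first; two of the six transitions in each cycle are wait reductions.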

6.2.2 Abstractions of time

The operational models developed so far make no reference to time, and hence cannot encode either the expected run time of a rule, or the temporal relationships between events [Sadri, 1987, section 2, page 122]. Other situations which require an explicit model of time include time-stamping messages, specifying a time-out period when waiting for an \( \text{Ack} \) message (section 9.3.2), or implementing a heart-beat algorithm [Andrews, 1991, section 4, pages 63–68]. To incorporate time into the processor framework presented in section 6.2.1, each \( P \) is extended to include a local clock, \( t_{local} \), and the \text{STEP} rule is changed as shown in figure 6.4.

This suggests that a time-aware state-transition system could be used to estimate the run time of a physical system. Hehner [1994, section 12.4, page 195] summarises this line of reasoning, as well as identifying the main problem:

“To obtain the real execution time, just insert time increments as appropriate. Of course, this requires intimate knowledge of the implementation, both hardware and software; there’s no way to avoid it.”

(INIT\(_P\))

\[
\begin{aligned}
\text{init}_P\ P_0 & \implies \text{Ping}_{t_{local}=0} \\
\text{init}_P\ P_1 & \implies \text{WaitForPing}_{t_{local}=0}
\end{aligned}
\]

(INIT\(_S\))

\[
\text{init}_S\ S \implies \text{Nothing}
\]

(STEP\(_P\))

\[
\begin{aligned}
\text{step}_P\ (\text{Ping}_t, S) & \implies (\text{WaitForPong}_{t+10}, \text{NewPing } t) \\
\text{step}_P\ (\text{Pong}_t, S) & \implies (\text{WaitForPing}_{t+10}, \text{NewPong } t) \\
\text{step}_P\ (\text{WaitForPong}_t, \text{HavePonged } t_{recv}) \text{ such that } t \geq t_{recv} & \implies (\text{Ping}_{t+10}, \text{Nothing}) \\
\text{step}_P\ (\text{WaitForPing}_t, \text{HavePinged } t_{recv}) \text{ such that } t \geq t_{recv} & \implies (\text{Pong}_{t+10}, \text{Nothing}) \\
\text{step}_P\ (P_t, S) & \implies (P_{t+1}, S)
\end{aligned}
\]

(STEP\(_S\))

\[
\begin{aligned}
\text{step}_S\ (\text{NewPing } t) & \implies \text{HavePinged } (t + 100) \\
\text{step}_S\ (\text{NewPong } t) & \implies \text{HavePonged } (t + 100) \\
\text{step}_S\ S & \implies S
\end{aligned}
\]

(FINAL\(_P\))

\[
\text{final}_P\ P \implies \text{false}
\]

(FINAL\(_S\))

\[
\text{final}_S\ S \implies \text{false}
\]

Figure 6.5: Transition rules for a time-aware ping-pong system

Furthermore, the physical implementations can themselves be unpredictable. Hammond, Burn and Howe [1994, figures 1 and 2] demonstrate that a small variation in the size of the dynamic heap can give rise to a 50% difference in uniprocessor performance (this was attributed to a cache conflict between the argument stack and instruction stream). With regard to parallel systems, Trinder et al. [1996, section 4.1], commenting on the average speedup observed by the GUM system, note that:

"There is a degree of chaos in the results, since a single change in the placement of a spark at runtime can affect the overall runtime."

In summary, to achieve any degree of accuracy, the level of detail required [Jain, 1991, section 5.2, pages 66–67] would render the rule set worthless as a design tool. It is therefore assumed that each rule takes either one, ten, one hundred, or one thousand time units to complete. This still allows a certain degree of performance debugging without overburdening the design. It also guarantees that any estimate is treated with caution.

To illustrate the use of the extended framework, figure 6.5 shows the new rules for the ping-pong system presented in the previous section. The processors' local clocks appear as subscripts to the original processor states, and the times associated with each operation are shown in figure 6.6. The state-transition diagram for the new system is similar to that of the original (see figure 6.3). The first cycle of reductions is shown in figure 6.7 (disregarding the majority of wait reductions). Notice that the wait reductions account for over ninety percent of the total number of reductions. Not only is this distracting when examining the reduction steps, but it can also cause serious performance problems when it comes to animating the system. This problem is addressed in the following section.
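The time-aware framework can be animated along the following lines. This is a sketch under stated assumptions: the rule set is a reading of figure 6.5 in which each processor waits for the message sent by its partner, ties between equal local clocks are broken in favour of the lower processor index, and names such as `Proc`, `NoMsg`, and `next` are illustrative.

```haskell
type Time = Int

data Mode = Ping | Pong | WaitForPing | WaitForPong deriving (Eq, Show)

-- A processor is its mode together with its local clock.
data Proc = Proc { mode :: Mode, localTime :: Time } deriving (Eq, Show)

data Comms
  = NoMsg
  | NewPing Time | NewPong Time          -- just sent at the given time
  | HavePinged Time | HavePonged Time    -- available from the given time
  deriving (Eq, Show)

stepP :: (Proc, Comms) -> (Proc, Comms)
stepP (Proc Ping t, _) = (Proc WaitForPong (t + 10), NewPing t)
stepP (Proc Pong t, _) = (Proc WaitForPing (t + 10), NewPong t)
stepP (Proc WaitForPong t, HavePonged r) | t >= r = (Proc Ping (t + 10), NoMsg)
stepP (Proc WaitForPing t, HavePinged r) | t >= r = (Proc Pong (t + 10), NoMsg)
stepP (Proc m t, s) = (Proc m (t + 1), s)          -- a wait reduction

stepS :: Comms -> Comms
stepS (NewPing t) = HavePinged (t + 100)
stepS (NewPong t) = HavePonged (t + 100)
stepS s           = s

-- Index of the processor with the smallest local clock
-- (the lower index wins a tie).
next :: [Proc] -> Int
next ps = snd (minimum (zip (map localTime ps) [0 :: Int ..]))

-- The time-aware STEP rule.
step :: ([Proc], Comms) -> ([Proc], Comms)
step (ps, s) = (take i ps ++ [p'] ++ drop (i + 1) ps, stepS s')
  where
    i        = next ps
    (p', s') = stepP (ps !! i, s)

initial :: ([Proc], Comms)
initial = ([Proc Ping 0, Proc WaitForPing 0], NoMsg)
```

Under these assumptions, `iterate step initial !! 193` is the first state in which the second processor is ready to pong; only two of those 193 transitions do useful work, which makes the cost of the wait reductions concrete.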
<table>
<thead>
<tr>
<th>operation</th>
<th>time</th>
</tr>
</thead>
<tbody>
<tr>
<td>WaitForPing, no ping available</td>
<td>1</td>
</tr>
<tr>
<td>WaitForPong, no pong available</td>
<td>1</td>
</tr>
<tr>
<td>Ping</td>
<td>10</td>
</tr>
<tr>
<td>Pong</td>
<td>10</td>
</tr>
<tr>
<td>WaitForPing, ping available</td>
<td>10</td>
</tr>
<tr>
<td>WaitForPong, pong available</td>
<td>10</td>
</tr>
<tr>
<td>communication delay</td>
<td>100</td>
</tr>
</tbody>
</table>

Figure 6.6: Time costs for the ping-pong system

<table>
<thead>
<tr>
<th>time</th>
<th>system state</th>
</tr>
</thead>
<tbody>
<tr>
<td>INIT</td>
<td>(\(\text{init}_P\) P₀, \(\text{init}_P\) P₁) \(\text{init}_S\) S</td>
</tr>
<tr>
<td>0</td>
<td>(Ping₀, WaitForPing₀) Nothing</td>
</tr>
<tr>
<td>0</td>
<td>(WaitForPong₁₀, WaitForPing₀) NewPing 0</td>
</tr>
<tr>
<td>0</td>
<td>(WaitForPong₁₀, WaitForPing₀) HavePinged 100</td>
</tr>
<tr>
<td>1</td>
<td>(WaitForPong₁₀, WaitForPing₁) HavePinged 100</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>100</td>
<td>(WaitForPong₁₀₁, WaitForPing₁₀₀) HavePinged 100</td>
</tr>
<tr>
<td>101</td>
<td>(WaitForPong₁₀₁, Pong₁₁₀) Nothing</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>110</td>
<td>(WaitForPong₁₁₁, Pong₁₁₀) Nothing</td>
</tr>
<tr>
<td>111</td>
<td>(WaitForPong₁₁₁, WaitForPing₁₂₀) NewPong 110</td>
</tr>
<tr>
<td>111</td>
<td>(WaitForPong₁₁₁, WaitForPing₁₂₀) HavePonged 210</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>210</td>
<td>(WaitForPong₂₁₀, WaitForPing₂₁₀) HavePonged 210</td>
</tr>
<tr>
<td>211</td>
<td>(Ping₂₂₀, WaitForPing₂₁₀) Nothing</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
</tbody>
</table>

Figure 6.7: State transitions for the time-aware ping-pong system
6.2.3 Inter-processor synchronisation

Despite extending the framework to include an explicit model of time, it is still not suitable for modelling a parallel STG machine: the number of wait reductions becomes a significant problem as a system increases in complexity. The first step to eliminating this problem is to note that a processor can be in one of three states:

<table>
<thead>
<tr>
<th>state</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>active</td>
<td>the processor is busy performing useful work.</td>
</tr>
<tr>
<td>waiting</td>
<td>the processor cannot continue until it either receives data via the communication system or it times out.</td>
</tr>
<tr>
<td>stopped</td>
<td>the processor has completed its task and will play no further part in the global computation.</td>
</tr>
</tbody>
</table>

Within the existing framework, the processor’s local clock, \( t_{local} \), is used to decide which processor to step next. While this is correct for active processors, it is often too early for waiting processors, and they then have to undergo a number of wait reductions to bring them to a time at which they can interact with the communication system. This situation is avoided in the framework shown in figure 6.8. The \( \text{next-time} \) function returns the earliest time at which a processor can become active based on the current state of the processor and of the communication system. Before a processor is stepped, its local clock is set to this new time \( t_{next} \), i.e. \( P_{t_{local}=t_{next}} \). Notice also that the communication system may now have to be stepped independently of the processors.

Returning again to the ping-pong example, the system can be upgraded to use the new processor states by defining the functions \( \text{is-active}, \text{is-waiting}, \text{is-stopped} \), and \( \text{comms-time} \), shown below:

\[
\begin{aligned}
\text{(IS\_ACTIVE)} \quad & \text{is-active}\ (\text{WaitForPing}_t, S) \implies \text{false} \\
& \text{is-active}\ (\text{WaitForPong}_t, S) \implies \text{false} \\
& \text{is-active}\ (P, S) \implies \text{true}
\end{aligned}
\]

\[
\begin{aligned}
\text{(IS\_WAITING)} \quad & \text{is-waiting}\ (\text{WaitForPing}_t, S) \implies \text{true} \\
& \text{is-waiting}\ (\text{WaitForPong}_t, S) \implies \text{true} \\
& \text{is-waiting}\ (P, S) \implies \text{false}
\end{aligned}
\]

The reduction cycle is as before, but with all of the trivial waiting reductions removed. The ping-pong example can now be easily extended to include time-outs, as shown in figure 6.9. If either a ping or a pong is lost, then one of the processors will time out and therefore stop. This will then lead to the other processor stopping, as it will also time out waiting for the reply to the lost message. The final state is reached as soon as both processors have stopped.
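The effect of skipping the wait reductions can be sketched as follows. The types below are hypothetical simplifications (a waiting processor carries only its local clock and its time-out deadline), `Nothing` plays the role of an infinite time, and the case in which no message is in flight is handled by assuming that only the time-out can wake the processor; that last choice is an assumption of this sketch rather than part of the rules in figure 6.9.

```haskell
type Time = Int

-- Hypothetical processor states: an active processor carries its local
-- clock; a waiting one also carries the time at which it times out.
data Proc
  = Active Time
  | Waiting Time Time
  | Stopped

-- Hypothetical view of the communication system: either nothing is in
-- flight, or a message becomes available at the given time.
data Comms = NoMsg | Arrives Time

-- Earliest time at which a processor can usefully be stepped.
nextTime :: Proc -> Comms -> Maybe Time
nextTime (Active t)      _           = Just t
nextTime (Waiting t out) (Arrives r) = Just (max t (min out r))
nextTime (Waiting t out) NoMsg       = Just (max t out)  -- assumption
nextTime Stopped         _           = Nothing

-- Index of the processor to step next, if any can still run.
pickNext :: [Proc] -> Comms -> Maybe Int
pickNext ps s =
  case [(t, i) | (i, p) <- zip [0 ..] ps, Just t <- [nextTime p s]] of
    [] -> Nothing
    xs -> Just (snd (minimum xs))
```

For example, a processor waiting at local time 10 with a deadline of 1010 for a message that arrives at time 100 has `nextTime` equal to `Just 100`, so its clock jumps directly to 100 instead of being advanced by ninety separate wait reductions.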

The following two sections look at the modelling of shared-memory and message-passing architectures within the processor framework.

6.2.4 Shared Memory

On one level, modelling shared-memory architectures is straightforward, requiring only the specification of the hierarchy of components:

<table>
<thead>
<tr>
<th>P</th>
<th>S</th>
<th>example systems</th>
</tr>
</thead>
<tbody>
<tr>
<td>heap</td>
<td>task pool</td>
<td>heap</td>
</tr>
<tr>
<td>$h_{\text{local}}$</td>
<td>$w_{\text{local}}$</td>
<td>none</td>
</tr>
<tr>
<td>$h_{\text{local}}$</td>
<td>$w_{\text{local}}$</td>
<td>$h_{\text{global}}$</td>
</tr>
<tr>
<td>none</td>
<td>none</td>
<td>$h_1 \cdots h_n$</td>
</tr>
<tr>
<td>none</td>
<td>none</td>
<td>$h_{\text{global}}$</td>
</tr>
</tbody>
</table>

Shared-memory components are manipulated in exactly the same manner as with the sequential STG machine. The following example illustrates how local heap allocation is preferred in BBN Haskell:

\[
\begin{array}{l}
\text{Eval } \left( \text{let } \{ \text{var} = \text{vs } \pi \text{ xs} \rightarrow \text{exp}_{\text{rhs}} \} \right) \text{ exp } \rho \text{ as rs us } h_1 \cdots h_i \cdots h_n \text{ wps } \sigma\ t \\
\quad \Rightarrow \text{Eval exp } \rho[\text{var} \rightarrow a] \text{ as rs us } h_1 \cdots h'_i \cdots h_n \text{ wps } \sigma\ t + 10 \\
\quad \text{where } i = \text{processor\_id} \\
\quad \phantom{\text{where }} h'_i = h_i[a \rightarrow (\text{vs } \pi \text{ xs} \rightarrow \text{exp}_{\text{rhs}})(\rho \text{ vs})]
\end{array}
\]

In this example, the step prefix has been dropped and $P \equiv (\text{code, as, rs, us, } \sigma, t_{\text{local}})$, and $S \equiv (h_{p_1}, \ldots, h_{p_n}, \text{wps})$. Typically, with shared-memory systems, it takes longer to access memory on a remote machine than it does to access local memory. Again, this is demonstrated using the BBN Haskell system:

\[
\begin{array}{l}
\text{Enter } a \text{ as rs us } h_1 \cdots h_n \text{ wps } \sigma\ t \\
\quad \text{such that } \exists j \bullet h_j[a \rightarrow (\text{vs } \pi \text{ xs} \rightarrow \text{exp})\ w_{sf}] \text{ and } \text{length}(\text{as}) \geq \text{length}(\text{xs}) \\
\quad \Rightarrow \text{Eval exp } \rho \text{ as}' \text{ rs us } h_1 \cdots h_n \text{ wps } \sigma\ t + \begin{cases} 1, & \text{if } i = j \\
10, & \text{otherwise} \end{cases}
\end{array}
\]

where $i = \text{processor\_id}$, ...

\[
\begin{aligned}
\text{(INIT\_P)} \quad & \text{init}_P\ P_0 \implies \text{Ping}_{t_{local}=0} \\
& \text{init}_P\ P_1 \implies \text{WaitForPing}_{t_{local}=0}\ 1000
\end{aligned}
\]

\[
\text{(INIT\_S)} \quad \text{init}_S\ S \implies \text{Nothing}
\]

\[
\begin{aligned}
\text{(STEP\_P)} \quad & \text{step}_P\ (\text{Ping}_t, S) \implies (\text{WaitForPong}_{t+10}\ (t + 1000), \text{NewPing } t) \\
& \text{step}_P\ (\text{Pong}_t, S) \implies (\text{WaitForPing}_{t+10}\ (t + 1000), \text{NewPong } t) \\
& \text{step}_P\ (\text{WaitForPong}_t\ \text{time\_out}, \text{HavePonged } t_{recv}) \text{ such that } t \geq t_{recv} \implies (\text{Ping}_{t+10}, \text{Nothing}) \\
& \text{step}_P\ (\text{WaitForPong}_t\ \text{time\_out}, S) \implies (\text{Stopped}, S) \\
& \text{step}_P\ (\text{WaitForPing}_t\ \text{time\_out}, \text{HavePinged } t_{recv}) \text{ such that } t \geq t_{recv} \implies (\text{Pong}_{t+10}, \text{Nothing}) \\
& \text{step}_P\ (\text{WaitForPing}_t\ \text{time\_out}, S) \implies (\text{Stopped}, S)
\end{aligned}
\]

\[
\begin{aligned}
\text{(STEP\_S)} \quad & \text{step}_S\ (\text{NewPing } t) \implies \text{HavePinged } (t + 100), \text{ 95\% of the time} \\
& \text{step}_S\ (\text{NewPing } t) \implies \text{Nothing}, \text{ otherwise} \\
& \text{step}_S\ (\text{NewPong } t) \implies \text{HavePonged } (t + 100), \text{ 95\% of the time} \\
& \text{step}_S\ (\text{NewPong } t) \implies \text{Nothing}, \text{ otherwise} \\
& \text{step}_S\ S \implies S
\end{aligned}
\]

\[
\begin{aligned}
\text{(IS\_ACTIVE)} \quad & \text{is\_active}\ (\text{WaitForPing}_t\ \text{time\_out}, S) \implies \text{false} \\
& \text{is\_active}\ (\text{WaitForPong}_t\ \text{time\_out}, S) \implies \text{false} \\
& \text{is\_active}\ (P, S) \implies \text{true}
\end{aligned}
\]

\[
\text{(IS\_WAITING)} \quad \text{is\_waiting}\ (P, S) \implies \neg\, \text{is\_active}\ (P, S)
\]

\[
\begin{aligned}
\text{(IS\_STOPPED)} \quad & \text{is\_stopped}\ (\text{Stopped}, S) \implies \text{true} \\
& \text{is\_stopped}\ (P, S) \implies \text{false}
\end{aligned}
\]

\[
\begin{aligned}
\text{(COMMS\_TIME)} \quad & \text{comms\_time}\ (\text{WaitForPing}_{t_{local}}\ \text{time\_out}, \text{HavePinged } t_{recv}) \implies \max (t_{local}, \min (\text{time\_out}, t_{recv})) \\
& \text{comms\_time}\ (\text{WaitForPong}_{t_{local}}\ \text{time\_out}, \text{HavePonged } t_{recv}) \implies \max (t_{local}, \min (\text{time\_out}, t_{recv})) \\
& \text{comms\_time}\ (P, S) \implies \infty
\end{aligned}
\]

\[
\begin{aligned}
\text{(FINAL\_P)} \quad & \text{final}_P\ P \implies \text{is\_stopped}\ P \\
\text{(FINAL\_S)} \quad & \text{final}_S\ S \implies \text{true}
\end{aligned}
\]

Figure 6.9: Adding time outs to the ping-pong system

However, as noted by Bennett [1993], the second-order effects of the processor's cache are almost equally important in terms of overall performance. While it would be possible to extend the heap model to include such factors, section 6.2.2 strongly warns that the primary aim of the operational model is to concisely specify gross behaviour and not to provide accurate estimates of a system’s run-time performance.

Access to global memory often has to be carefully controlled through the use of locks, semaphores, and monitors [Hwang and Briggs, 1985, section 8.1, pages 557–577]. For example, GAML does not lock the global work pool, thereby reducing run-time overheads at the risk of duplicating work [Maranget, 1991, section 4.3]. Unfortunately, as each state transition is atomic, this aspect of an implementation is difficult to specify without fragmenting each rule into a series of closely-coupled steps – again, the resulting complexity would be hard to justify. Roscoe [1997, section 0.1, page 4] summarises the problems with shared-memory systems as follows:

“The main disadvantage from the point of view of modelling general interacting systems is that the communications between components, which are plainly vitally important, happen too implicitly.”

6.2.5 Message-passing architectures

Traditional message-passing systems provide support for two main operations: sending messages, and receiving messages. High level operations, such as barriers and reduction trees [Snir et al., 1994, chapter 5, pp. 90–126], are often also provided, but these are almost always built on top of the point-to-point primitives. Typically, there are two types of send operation [Hwang and Briggs, 1985, section 5.1.3, p. 332]:

asynchronous: an asynchronous send will complete as soon as the message has either been injected into the communication network, or has been stored by the operating system for later transmission;

synchronous: a synchronous send will not complete until the target of the message has acknowledged receipt.

However, a process that commits to receiving a message will wait until either the specified message arrives or the operation times out. To help avoid any potentially long and wasteful delays, a poll function is often used to test if a suitable message has already been received (therefore guaranteeing that a receive operation will complete almost immediately).

Asynchronous message passing

The ping-pong system outlined in the previous sections could be re-written as follows:

\[
\begin{aligned}
P_0 & \equiv \text{repeat } (\text{SEND}_{\text{asynch}}\ P_1\ \text{Ping};\ \text{RECV } P_1\ \text{Pong}) \\
P_1 & \equiv \text{repeat } (\text{RECV } P_0\ \text{Ping};\ \text{SEND}_{\text{asynch}}\ P_0\ \text{Pong})
\end{aligned}
\]

The message-based model for the \text{Ping} transition, shown below, is unsurprisingly similar to that presented in figure 6.9:

\[
\begin{array}{llll}
 & \text{SEND } P_1 \text{ Ping} & t_{local} & (\text{sends}, \text{recvs}) \\
\Rightarrow & \text{RECV } P_1 \text{ Pong} & t_{local} + 100 & (\text{sends} \oplus \langle (P_1, \text{Ping}) \rangle, \text{recvs})
\end{array}
\]

Here, the communication network is represented as a pair of queues: \text{sends} contains the messages generated on this processor that are to be transmitted over the network; and \text{recvs} contains the messages that have arrived for the processor up to the current time period. This is a partial view of S, which includes such pairs for each of the processors, and a network model containing all in-transit messages.
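The queue-based view of an asynchronous send can be sketched as below. The representation is hypothetical: messages are (processor, payload) pairs, the cost of 100 time units is taken from the transition above, and `sendAsync` and `poll` are illustrative names rather than operations defined by the model.

```haskell
type PID  = Int
type Time = Int

data Payload = PingMsg | PongMsg deriving (Eq, Show)

-- A message is tagged with the processor it is addressed to (in sends)
-- or the processor it came from (in recvs).
type Message = (PID, Payload)

-- The processor's partial view of S: (sends, recvs).
type Queues = ([Message], [Message])

-- Asynchronous send: completes as soon as the message has been queued
-- for transmission, advancing the local clock by a fixed cost.
sendAsync :: Message -> (Time, Queues) -> (Time, Queues)
sendAsync msg (t, (sends, recvs)) = (t + 100, (sends ++ [msg], recvs))

-- Poll: has a matching message already arrived for this processor?
poll :: Message -> Queues -> Bool
poll msg (_, recvs) = msg `elem` recvs
```

A sender never inspects `recvs`, so `sendAsync` cannot block; a receiver can call `poll` first to be sure that a subsequent receive will complete almost immediately.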

Message reception can be defined as follows:

- **(PING)**
  \[
  \text{Receive } P_0 \text{ Ping } \ t_{local} \quad (\text{sends, recvs})
  \]
  such that \((P_0, \text{Ping}) \in \text{recvs}\)
  \[
  \Rightarrow \quad \text{Send } P_0 \text{ Pong } \ t_{local} + 100 \quad (\text{sends, recvs}')
  \]
  where \(\text{recvs}' = \text{remove } (P_0, \text{Ping}) \ \text{recvs}\)

- **(IS_ACTIVE)**
  \[
  \text{is-active (Receive } P_i \text{ message) } \Rightarrow \text{false}
  \]

- **(COMMS_TIME)**
  \[
  \text{comms-time } (\text{Receive}_{t_{local}}\ P_j \text{ message}, S) \Rightarrow \max(t_{local}, \text{next}_P\ P_i\ (P_j, \text{message})\ S)
  \]

Note that patterns can be used to specify which messages should be received, with the earliest arrival being returned in the case of multiple matches.

While the previous receive model is superficially correct, the time penalty for receiving a message is paid when the processor commits to receiving it, rather than when the message actually arrives. Despite section 6.2.2's argument against accurate performance modelling, this is a serious flaw. Consider, for example, a centralised transaction server which receives a request every 100 time steps. Using the previous model, and assuming it takes 150 time steps to process and reply to a request, the server will complete a transaction every 250 time steps (100 to receive the request plus 150 to service it). However, on a real multiprocessor, where a new request arrives just as the previous one has been received, the server would be too busy receiving the requests to make any progress towards completing even the first transaction. The following model corrects this problem by immediately copying any new messages across from the communication system into a local queue:

- **(INT_RECV)**
  \[
  \text{code } t_{local} \quad \text{messages} \quad (\text{sends, recvs})
  \]
  such that \(\text{length}(\text{recvs}) > 0\)
  \[
  \Rightarrow \quad \text{code } t_{local} + 100 \quad \text{messages}' \quad (\text{sends, recvs}')
  \]
  where \(\text{messages}' = \text{messages} \oplus \text{message}\)
  \[
  \text{message} = \text{head recvs}
  \]
  \[
  \text{recvs}' = \text{tail recvs}
  \]

Note that any rule which does not match against a specific code mode is analogous to a microprocessor interrupt [Hwang and Briggs, 1985, section 2.5.2, pages 125–126]. Implementing such rules can be problematic, and is discussed in section 8.3.3. Messages are now taken from the local queue rather than directly from the communication system:

- **(RECV)**
  \[
  \text{Receive pattern cont } t_{local} \quad \text{messages} \quad S
  \]
  such that \(\text{pattern} \in \text{messages}\)
  \[
  \Rightarrow \quad \text{cont message} \quad t_{local} + 10 \quad \text{messages}' \quad S
  \]
  where \((\text{message, messages}') = \text{remove pattern messages}\)

The Receive code component takes two arguments: pattern determines which messages are acceptable; and cont specifies what to do with the matching message once it has been received. As an example, for the WaitForPing transition, the pattern would be $(P_0, \text{Ping})$, and the continuation would be $\lambda message.\text{Send} \ (P_0, \text{Pong})$.
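As an informal illustration, the INT_RECV and RECV rules can be animated with a minimal Python sketch. The names `Processor`, `int_recv` and `recv`, the representation of a pattern as a predicate, and the use of `None` to signal that RECV does not apply are all assumptions made for this example; only the two time penalties (100 and 10) are taken from the rules themselves.

```python
from dataclasses import dataclass, field

INTERRUPT_COST = 100  # time charged by INT_RECV for copying one message in
RECEIVE_COST = 10     # time charged by RECV for taking a matched message


@dataclass
class Processor:
    t_local: int = 0
    messages: list = field(default_factory=list)  # local queue
    recvs: list = field(default_factory=list)     # arrivals from the network


def int_recv(p):
    """INT_RECV: applies whenever recvs is non-empty, whatever the code is."""
    if not p.recvs:
        return False
    p.messages.append(p.recvs.pop(0))  # move the head of recvs to the local queue
    p.t_local += INTERRUPT_COST
    return True


def recv(p, pattern, cont):
    """RECV: applies only if a locally queued message matches the pattern."""
    for index, message in enumerate(p.messages):
        if pattern(message):           # earliest arrival wins
            del p.messages[index]
            p.t_local += RECEIVE_COST
            return cont(message)
    return None  # rule does not apply, so the receiver stays blocked


# WaitForPing on P1: pattern (P0, Ping), continuation sends Pong back.
p1 = Processor(recvs=[("P0", "Ping")])
while int_recv(p1):
    pass
reply = recv(p1, lambda m: m == ("P0", "Ping"), lambda m: ("Send", "P0", "Pong"))
```

In this sketch the cost of reception is charged when the message is copied into the local queue, so a processor that is busy elsewhere still pays for every arrival, which is the behaviour the revised model is intended to capture.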

### Synchronous message passing

Synchronous communication can be modelled using pairs of asynchronous sends and blocking receives, as shown below:

\[
\begin{align*}
\text{SEND}_{\text{synch}} P_i \text{ message} & \equiv \text{SEND}_{\text{asynch}} P_i \text{ message; RECV } P_i \text{ Ack} \\
\text{RECV}_{\text{synch}} P_j \text{ message} & \equiv \text{RECV } P_j \text{ message; SEND}_{\text{asynch}} P_j \text{ Ack}
\end{align*}
\]

These equivalence relations can be used to re-write the ping-pong system as follows (with the Pong message being replaced by Ack):

\[
\begin{align*}
P_0 & \equiv \text{repeat } (\text{SEND}_{\text{synch}} \ P_1 \text{ Ping}) \\
P_1 & \equiv \text{repeat } (\text{RECV}_{\text{synch}} \ P_0 \text{ Ping})
\end{align*}
\]

When dealing with large messages, synchronous sends often involve an initial exchange to allow the receiver to allocate a buffer of sufficient size to hold the message:

\[
\begin{align*}
\text{SEND}_{\text{synch}} P_i \text{ message} & \equiv \text{SEND}_{\text{asynch}} P_i \text{ Req length(message); RECV } P_i \text{ Buffer } a; \\
& \quad \text{SEND}_{\text{asynch}} P_i \text{ Data } a \text{ message;} \\
& \quad \text{RECV } P_i \text{ Ack} \\
\text{RECV}_{\text{synch}} P_j \text{ message} & \equiv \text{RECV } P_j \text{ Req message\_length;} \\
& \quad a = \text{ALLOC message\_length;} \\
& \quad \text{SEND}_{\text{asynch}} P_j \text{ Buffer } a; \\
& \quad \text{RECV } P_j \text{ Data } a \text{ message;} \\
& \quad \text{SEND}_{\text{asynch}} P_j \text{ Ack}
\end{align*}
\]
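The simple (unbuffered) equivalences can be read as macro expansions over the asynchronous primitives. The following Python sketch shows this under stated assumptions: operations are represented as `(name, peer, message)` triples, and `handshake_ok` is a hypothetical helper that merely checks that each step on one side is mirrored by the other.

```python
def send_synch(dest, message):
    """SEND_synch dest message == SEND_asynch dest message; RECV dest Ack."""
    return [("SEND_asynch", dest, message), ("RECV", dest, "Ack")]


def recv_synch(src, message):
    """RECV_synch src message == RECV src message; SEND_asynch src Ack."""
    return [("RECV", src, message), ("SEND_asynch", src, "Ack")]


def handshake_ok(sender, receiver, send_ops, recv_ops):
    """Check that every send on one side meets a receive on the other."""
    if len(send_ops) != len(recv_ops):
        return False
    for (op_s, peer_s, msg_s), (op_r, peer_r, msg_r) in zip(send_ops, recv_ops):
        mirrored = {op_s, op_r} == {"SEND_asynch", "RECV"}
        if not (mirrored and peer_s == receiver and peer_r == sender
                and msg_s == msg_r):
            return False
    return True


# One round of the synchronous ping-pong system between P0 and P1.
p0_ops = send_synch("P1", "Ping")
p1_ops = recv_synch("P0", "Ping")
```

The buffered variant could be expanded in the same manner, although the buffer address is only known at run time and would need a richer representation than the triples used here.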

### 6.3 Operational semantics and the STG machine

While a denotational semantics defines a language, an operational semantics can be considered as an abstract implementation of the language. Typically, in the context of parallel functional programming, an operational description will need to address the following issues:

**The evaluation mechanism** specifies the order of evaluation, the argument-passing convention, the return mechanism, and the closure model. The default sequential STG machine is non-strict, has contiguous argument and continuation-based return stacks, and uses the push-enter closure model [Peyton Jones and Salkild, 1989, section 3].

**Communication and synchronisation** ensures that the myriad computational elements co-operate safely and efficiently. The system can be viewed at three levels: single processors, small groups of inter-working processors, and the system as a whole.

**Resource management** includes the definition of the system components (such as the heap and stack), the sharing mechanism, and any high-level tasks, such as the garbage collector, thread scheduler, or load balancer. The sequential STG machine uses a self-updating model for controlling the sharing of thunks [Peyton Jones and Salkild, 1989, section 3.1.2].

**Partitioning and naming** determines the placement of data and functional groups on processors, and the visibility and scoping of variables. Depending upon the nature of the system, both static and dynamic partitioning may have to be considered. The sequential STG machine uses both global and local environments to control scoping and visibility.

These issues are explored in greater depth in sections 6.3.1 to 6.3.4. Moreover, it is also worth considering what is not dealt with at the operational level, but is deferred to the compilation rules (see chapter 8):

**register allocation** determines which values should be stored in a processor’s registers at each point in a program’s execution [Muchnick, 1997, section 16.1].

**closure layout** – while the sequential STG machine uses the push-enter closure model, a number of different implementations are possible. For example, GHC uses reversed information tables to reduce the number of indirections required to enter a closure [Peyton Jones et al., 1993].

**low-level implementation of components** – the operational semantics uses a high-level model of the components, whereas the implementation may use a more complex representation in order to improve efficiency. For example, the three stacks used by the sequential STG machine are actually implemented using just two stacks [Peyton Jones and Salkild, 1989, section 8.2].

**low-level optimisations** such as branch optimisations, unreachable-code elimination, and instruction scheduling [Muchnick, 1997].

### 6.3.1 The evaluation mechanism

The evaluation mechanism used by the sequential STG machine has already been introduced in section 4.8. The remainder of this section, therefore, will concentrate on the ways in which the basic model can be modified. For further details, both the STG report [Peyton Jones and Salkild, 1989] and the implementation taxonomy presented by Douence and Fradet [1995] are highly recommended.

**The code component**

Before investigating the evaluation mechanism in any detail, it is worth reflecting upon the role the code component plays in the sequential STG machine. As described in section 4.8.2, the code component is the primary driving force behind the evaluation process and serves a role similar to that of a microprocessor’s instruction stream. However, unlike its hardware equivalent, the code component also splits the computation into distinct phases:

<table>
<thead>
<tr>
<th>phase</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Eval</td>
<td>co-ordinates the flow of execution within a closure</td>
</tr>
<tr>
<td>Enter</td>
<td>applies a closure to the arguments on the argument stack</td>
</tr>
<tr>
<td>Return</td>
<td>invokes the appropriate return convention for a given type</td>
</tr>
</tbody>
</table>

While these are sufficient for a sequential system, it will often be necessary to add new phases to the evaluation mechanism. For example, the GUM distributed-memory system presented in section 9.3 adds the GetWork and WaitWork phases in order to model the load-balancing algorithm.

\[
\begin{align*}
\text{i.} \quad & E[\text{let\# } var = exp_{rhs} \ exp] \ \rho = \text{case } (E[exp_{rhs}] \ \rho) \text{ of} \\
& \qquad \bot \rightarrow \bot \\
& \qquad \epsilon \rightarrow E[exp] \ (\rho \oplus \{var \mapsto \epsilon\})
\end{align*}
\]

\[
\begin{align*}
\text{ii.} \quad & \text{Eval } (\text{let\# } var = exp_{rhs} \ exp) \ \rho \ \ as \ \ rs \ \ us \ \ h \ \ \sigma \\
\Longrightarrow \quad & \text{Eval } exp_{rhs} \ \rho \ \ as \ \ (return : rs) \ \ us \ \ h \ \ \sigma \\
\text{where} \quad & return = \text{Forced}_{\text{Int\#}} \ var \ exp \ \rho' \\
& exp_{rhs} \text{ is of type Int\#} \\
& \text{dom}(\rho') = \mathcal{FV}[exp]
\end{align*}
\]

\[
\begin{align*}
\text{iii.} \quad & \text{Return}_{\text{Int\#}} \ k \ \ as \ \ (\text{Forced}_{\text{Int\#}} \ var \ exp \ \rho) : rs \ \ us \ \ h \ \ \sigma \\
\Longrightarrow \quad & \text{Eval } exp \ \rho' \ \ as \ \ rs \ \ us \ \ h \ \ \sigma \\
\text{where} \quad & \rho' = \rho \oplus \{var \mapsto k\}
\end{align*}
\]

Figure 6.10: Specifying the order of evaluation of the \text{let#} construct: rule i. is the denotational semantics of the expression, while rules ii. and iii. are, respectively, the required \text{Eval} and \text{Return} phases of the operational semantics

**Order of evaluation**

While it is straightforward to specify the order of evaluation using denotational semantics (see section 5.4.1), reflecting such changes in the operational model is more complex. There are two main reasons for this:

1. The \text{Eval} and \text{Return} phases are closely coupled, requiring that at least two new rules be added to the system. Figure 6.10 illustrates this by presenting both the denotational and operational descriptions required by the addition of the \text{let#} construct (see sections 4.3, 4.3.4, and 4.8.4 for further details).

2. As described in section 4.5.1, the STG machine's aggressive take mechanism [Beemster, 1994] means that the evaluation of a partial application cannot be forced. Therefore, at the operational level, all changes to the order of evaluation can only apply to expressions which are known to return primitive values or algebraic data types. This is an unacceptable restriction to place upon a parallel language which supports both polymorphism and higher-order functions. The modifications required to support the forcing of arbitrary STG' expressions are shown in figure 6.11.

The resulting polymorphic \text{letstrict} construct can be used to alter the default order of evaluation, as demonstrated by the definition of the \text{strict_id} function given below:

```haskell
-- STG' code
strict_id = {} \n {x} -> letstrict y = x in y;
```

The explicit scheduling annotations of Para-functional Haskell discussed in section 9.4 serve as another example of modifying the order of evaluation of the sequential STG machine.

**Passing arguments**

The sequential STG machine uses a contiguous stack on which to pass arguments, as specified by rules 1 (function application), 2 (closure entry), and, to a lesser extent, 17' (updating a partial application). The main alternative to this style is that of using heap-allocated application frames, as used by the New Jersey SML compiler [Appel and Jim, 1990]. For a discussion of the relative advantages and disadvantages of these two systems see Peyton Jones and Salkild [1989, section 3.2.3].

\[
(\text{LETSTRICT-EXP}) \quad
\frac{TE \vdash_{bind} bind : (var, \pi) \qquad LVE = \{var \mapsto \pi\} \qquad TE \oplus LVE \vdash_{exp} exp : \pi_{exp}}
{TE \vdash_{exp} \text{letstrict } bind \ exp : \pi_{exp}}
\]

\[
\begin{align*}
& \text{Eval } (\text{letstrict } (var = exp_{rhs}) \ exp_{body}) \ \rho \ \ as \ \ rs \ \ us \ \ h \ \ \sigma \\
\Longrightarrow \quad & \text{Eval } exp_{rhs} \ \rho \ \ as \ \ (r : rs) \ \ us \ \ h \ \ \sigma \\
\text{where} \quad & r = \text{Forced } var \ exp_{body} \ \rho' \\
& \text{dom}(\rho') = \mathcal{FV}[exp_{body}]
\end{align*}
\]

\[
\begin{align*}
& \text{Return}_{\chi \pi_1, \ldots, \pi_n} \ c \ ws \ \ as \ \ (\text{Forced } var \ exp_{body} \ \rho) : rs \ \ us \ \ h \ \ \sigma \\
\Longrightarrow \quad & \text{Eval } exp_{body} \ \rho' \ \ as \ \ rs \ \ us \ \ h' \ \ \sigma \\
\text{where} \quad & \rho' = \rho \oplus \{var \mapsto a\} \\
& h' = h[a \mapsto (vs \ \backslash r \ \{\} \rightarrow c \ vs, \ ws)] \\
& vs \text{ is a sequence of arbitrary distinct variables} \\
& \text{length}(vs) = \text{length}(ws)
\end{align*}
\]

\[
\begin{align*}
& \text{Enter } a \ \ as \ \ \langle\rangle_{stack} \ \ (as_u, rs_u, a_u) : us \ \ h \ \ \sigma \\
& \text{such that } h[a \mapsto (vs \ \backslash r \ xs \rightarrow e, \ ws)] \text{ and } \text{length}(as) < \text{length}(xs) \\
\Longrightarrow \quad & \text{Return}_{\text{FUN}} \ a_u \ \ (as \mathbin{+\!+} as_u) \ \ rs_u \ \ us \ \ h_u \ \ \sigma \\
\text{where} \quad & xs' = \text{take } \text{length}(as) \ xs \\
& f \text{ is an arbitrary variable} \\
& h_u = h[a_u \mapsto ((f : xs') \ \backslash r \ \{\} \rightarrow f \ xs', \ (a : as))]
\end{align*}
\]

\[
\begin{align*}
& \text{Return}_{\text{FUN}} \ a \ \ as \ \ (\text{Forced } var \ exp_{body} \ \rho) : rs \ \ us \ \ h \ \ \sigma \\
\Longrightarrow \quad & \text{Eval } exp_{body} \ \rho' \ \ as \ \ rs \ \ us \ \ h \ \ \sigma \\
\text{where} \quad & \rho' = \rho \oplus \{var \mapsto a\}
\end{align*}
\]

\[
\begin{align*}
& \text{Return}_{\text{FUN}} \ a \ \ as \ \ (\text{Case}_r \ alts \ \rho) : rs \ \ us \ \ h \ \ \sigma \\
\Longrightarrow \quad & \text{Enter } a \ \ as \ \ (\text{Case}_r \ alts \ \rho) : rs \ \ us \ \ h \ \ \sigma
\end{align*}
\]

Figure 6.11: Modifying the STG machine to allow the forcing of arbitrary boxed expressions

Figure 6.12: Argument passing using heap-allocated application frames

On first inspection of the sequential STG machine, the argument stack appears to
be indispensable – however, figure 6.12 shows the main modifications required to use
application frames. Essentially, the argument stack has been replaced with a frame pointer,
fp, which is the last entry in a singly-linked list of application frames. Instead of pushing a
function’s arguments onto a stack, the values are stored in a new heap-allocated application
frame. The new frame contains a back pointer to the old frame, thereby allowing access
to all of the other unused arguments.
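
The frame-pointer scheme described above can be sketched in Python as follows. This is only an illustration of the linked-frame idea, not a transcription of the rules in figure 6.12: the names `Frame`, `apply_args` and `pending_args` are assumptions introduced for the example.

```python
class Frame:
    """A heap-allocated application frame with a back pointer."""

    def __init__(self, args, back):
        self.args = list(args)
        self.back = back  # the previous frame, or None at the bottom


def apply_args(fp, args):
    """Store a function's arguments in a new frame linked to the old one."""
    return Frame(args, fp)


def pending_args(fp):
    """Walk the back pointers to list all unused arguments, newest first."""
    out = []
    while fp is not None:
        out.extend(fp.args)
        fp = fp.back
    return out


older = apply_args(None, ["c"])
fp = apply_args(older, ["a", "b"])
```

Because a new frame never modifies the frame it points back to, an older frame pointer such as `older` remains valid after further applications, which is one reason the representation suits systems with several independent threads of computation.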

As a further example, Mattson’s speculative evaluation system [Mattson Jr., 1993a]
studied in section 9.2 uses a separate stack for every independent thread of computation
(see rules SCHED₂ and BH₂).

**Returning values**

The sequential STG machine uses the return stack to control the order of evaluation: once a sub-computation has finished, the top continuation from the return stack is invoked and passed the result as one of its arguments (see section 4.8.7). The New Jersey SML compiler [Appel, 1992] makes this behaviour explicit by transforming all user-defined functions so that they take the return continuation as an extra argument.

The STG machine allows each different data type to have a custom return mechanism. For example, returning literal values is very different from the multi-way switch used when dealing with algebraic constructors. Indeed, the ReturnLit rule serves as a template for a number of more specific rules, including ReturnInt, ReturnChar, and ReturnDouble. In this respect, the STG return mechanism subsumes that of traditional imperative languages such as C and Pascal. As an example, section 6.4.2 presents a return mechanism suitable for handling pipeline representations.

Typically, the Return mechanism will initiate another Eval phase. However, it may be necessary to abnormally return, either due to a runtime error or, for example, the current thread having finished. As a simple example, consider the rule for handling the error primitive in the sequential STG' language:

\[
\begin{align*}
& \text{Eval } (\text{error } message) \ \rho \ \ as \ \ rs \ \ us \ \ h \ \ \sigma \\
& \text{such that } (message, a) \in \rho \\
\Longrightarrow \quad & \text{ReturnError\# } a \ \ as \ \ rs \ \ us \ \ h \ \ \sigma
\end{align*}
\]

\[
\begin{align*}
(\text{ERROR-RET}) \quad & \text{ReturnError\# } a \ \ as \ \ rs \ \ us \ \ h \ \ \sigma \\
\Longrightarrow \quad & \text{Stop } \langle\rangle_{\text{stack}} \ \langle\rangle_{\text{stack}} \ \langle\rangle_{\text{stack}} \ h \ \sigma
\end{align*}
\]

Notice that these two rules could be combined, such that the code component switches immediately from the Eval phase to the Stop phase. While this would certainly be more concise, the presented rules offer the possibility of providing a comprehensive exception-handling mechanism [Pitman, 1990] to the STG machine. Furthermore, MacLennan's Regularity rule of language design could be invoked, i.e. Eval phases should always end with a transition to a Return phase. However, a more substantial example of alternative return mechanisms can be found in section 9.2.2, where the issue of thread termination is discussed (see rules SCHED_1 and END_THREAD).

### 6.3.2 Communication and synchronisation

Unlike the other mechanisms described so far in this section, communication and synchronisation is not an end in itself – it simply enables the myriad processing elements to cooperate safely and effectively to achieve a shared goal.

**Black holes**

The closure serves as the primary point of synchronisation for single-processor operations. As an example, on uni-processor systems, a thunk is overwritten with a black hole pending completion of its evaluation. This enables self-referencing code, such as \( x = \{x\} \ \backslash u \ \{\} \to 1 + x \), to be detected and stopped before the heap is exhausted. The rules for this behaviour are shown below:

\[
(15') \quad \text{Enter } a \ \ as \ \ rs \ \ us \ \ h[a \mapsto (vs \ \backslash u \ \{\} \to e, \ ws)] \ \sigma
\implies \text{Eval } e \ \rho \ \ \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ (as, rs, a) : us \ \ h[a \mapsto \text{BlackHole}] \ \sigma
\]

where \( \rho = \{v_1 \mapsto w_1, \ldots, v_n \mapsto w_n\} \) and \( (v_i, w_i) = (vs \ !i, ws \ !i) \)

\[
(\text{BH}) \quad \text{Enter } a \ \ as \ \ rs \ \ us \ \ h[a \mapsto \text{BlackHole}] \ \sigma
\implies \text{Eval } \text{error} \ \ as \ \ rs \ \ us \ \ h \ \sigma
\]

Closures are also used as synchronisation points in threaded systems, and, again, black holes are used. However, entering a black hole now indicates that the current thread cannot proceed until another thread has finished evaluating the original thunk. The current thread, therefore, blocks and another thread is scheduled (a fuller treatment of this style of synchronisation can be found in section 9.2.2):

\[
(\text{BH}') \quad \text{Enter } a \ \ as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h[a \mapsto \text{BlackHole } ts] \ \sigma
\implies \text{GetThread } as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h[a \mapsto \text{BlackHole } ts', \ t_{id} \mapsto \text{TSO } state_1] \ \sigma
\]

where \( ts' = \text{enqueue } t_{id} \ ts \)
\( state_1 = (\text{Enter } a, as, rs, us) \)

In shared-memory systems, using this style of synchronisation may require that locks are used to control access to shared closures. Without locks, for example, it would be possible for two processors to simultaneously start evaluating the same thunk, thereby duplicating work and possibly reducing efficiency.
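
The uni-processor behaviour of rules 15' and BH can be illustrated with a small Python sketch. The heap is modelled as a dictionary from addresses to closures, and the names `Thunk`, `SelfReference` and `enter` are assumptions of this example; the update of the thunk with its value stands in for the machine's update-frame mechanism.

```python
BLACK_HOLE = object()  # marker stored in the heap while a thunk is evaluated


class Thunk:
    def __init__(self, compute):
        self.compute = compute  # function from the heap to a value


class SelfReference(Exception):
    """Raised when a black hole is entered (rule BH)."""


def enter(heap, addr):
    closure = heap[addr]
    if closure is BLACK_HOLE:
        raise SelfReference(addr)
    if isinstance(closure, Thunk):
        heap[addr] = BLACK_HOLE        # rule 15': overwrite before evaluating
        value = closure.compute(heap)
        heap[addr] = value             # update the thunk with its value
        return value
    return closure                     # already a value


# x = 1 + x is detected, whereas z = 1 + y evaluates normally.
looping = {"x": Thunk(lambda h: 1 + enter(h, "x"))}
fine = {"y": Thunk(lambda h: 41), "z": Thunk(lambda h: 1 + enter(h, "y"))}
```

In a threaded system the `SelfReference` branch would instead block the current thread on the closure, as in rule BH'.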

**Processor-processor interactions**

The primary mechanism for inter-processor communication and synchronisation will depend upon the target architecture, i.e. messages or shared data structures. Due to the implicit nature of shared-memory systems (see section 6.2.4), messages will be used throughout this section. However, the principles remain the same for both paradigms.

The GUM’s mechanism for handling references to remote closures will be used to illustrate the basic techniques (see section 9.3.2 for a more comprehensive presentation). Again, the closure is used as the main synchronisation point, with a \text{FetchMe} closure being used to represent a remote reference. When the closure is entered, a message is sent to the owner requesting its value. The current thread suspends, pending a reply from the owner:

\[
(\text{FM}) \quad \text{Enter } a \ \ as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h[a \mapsto \text{FetchMe } j \ a'] \ \sigma \ b_i
\implies \text{GetThread } as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h[a \mapsto \text{Wait } j \ a' \ t_{id}] \ \sigma \ b'_i
\]

where \( b'_i = \text{enqueue } (j, \text{Fetch } a' \ a) \ b_i \)
\( state = (\text{Enter } a, as, rs, us) \)

The \text{Wait} closure is used to prevent multiple \text{Fetch} messages being sent, and \( b_i \) represents the message buffers for processor \( i \). Note the similarity between the entry routines for the \text{FetchMe} and threaded \text{BlackHole} closures.

Upon reception of a \text{Fetch} message, the remote processor will reply with a \text{Resume} message, which contains the closure’s actual value (it may well be a thunk with references to other closures on the remote machine). The rule for handling a Resume message is as follows:

\[
\begin{align*}
(\text{RM}) \quad & code \ \ as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h[a \mapsto \text{Wait } j \ a' \ ts] \ \ \sigma \ \ (b_{in}, b_{out}) \\
& \text{such that } (j, \text{Resume } a \ packed\_closures) \in b_{in} \\
\implies \quad & code \ \ as \ \ rs \ \ us \ \ t_{id} \ \ wp' \ \ h' \ \ \sigma \ \ (b'_{in}, b'_{out}) \\
\text{where} \quad & b'_{in} = \text{dequeue } (j, \text{Resume } a \ packed\_closures) \ b_{in} \\
& b'_{out} = \text{enqueue } (j, \text{Ack } a') \ b_{out} \\
& h' = \text{unpack } packed\_closures \ h \\
& wp' = \text{add\_active } ts \ wp
\end{align*}
\]

The arrival of the Resume message updates the remote closure and releases the blocked threads back into the work pool. An acknowledgement is then sent to the original owner, indicating successful reception of the Resume message. This is another example of an interrupt-driven rule, as first described in section 6.2.5.

Notice that all inter-processor communication needs to carry sufficient context for the receiver to react in an appropriate manner. For example, the FetchMe closure contains both the processor id of the owner and the address at which the closure is stored on the remote processor. This allows the Fetch message to be sent to the correct processor and, when it is received, allows the owner to identify which closure it needs to pack. Similarly, the Wait closure needs to retain a copy of this information to allow it to construct the Ack response.

As can be seen from the previous examples, the techniques used to model communication and synchronisation are similar to those already used in the STG machine. However, managing the interactions between remote processors is sufficiently complex that the rule design can quickly become challenging. To help manage this complexity, UML sequence diagrams [Fowler and Scott, 1997] prove useful. Figure 6.13 shows the annotated sequence diagram for the series of FetchMe interactions. Each processor appears in its own column (often referred to as a swim lane), and the ordering of events is denoted by vertical positioning. Messages are represented by dashed lines.

As a final example of inter-processor communication and synchronisation, consider the GUM work-request mechanism shown below:

\[
\begin{align*}
(\text{SCHED}_1) \quad & \text{GetThread } as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h \ \ \sigma \ \ (b_{in}, b_{out}) \\
& \text{such that } \text{is\_empty}(wp) \\
\implies \quad & \text{WaitWork } as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h \ \ \sigma \ \ (b_{in}, request : b_{out}) \\
\text{where} \quad & request = (j, \text{Fish}) \text{ and } j = 1 + (i \bmod n)
\end{align*}
\]

This demonstrates how structures other than closures can be used to trigger interactions. The rule is invoked when the local processor has exhausted its work pool, and therefore needs to ask its neighbours for additional tasks. Section 6.3.3 looks at load-balancing strategies in more detail.
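
The choice of neighbour in SCHED_1 is simple modular arithmetic, and can be checked with a short Python sketch. The function names and the list-based message buffer are assumptions of this example; returning `None` when the work pool is non-empty simply indicates that the rule does not apply.

```python
def fish_target(i, n):
    """Processor that receives processor i's Fish: j = 1 + (i mod n)."""
    return 1 + (i % n)


def get_thread(i, n, work_pool, b_out):
    """SCHED_1: with an empty work pool, queue a Fish request and wait."""
    if work_pool:
        return None  # rule does not apply, local work is available
    request = (fish_target(i, n), "Fish")
    return ("WaitWork", work_pool, [request] + b_out)


targets = [fish_target(i, 4) for i in range(1, 5)]
```

With processors numbered from 1 to n, each processor therefore asks its successor for work, with processor n wrapping around to processor 1.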

**Global communications**

While the majority of communication and synchronisation will occur at the inter-processor level [Cypher et al., 1993], at times some form of global communication will be necessary. Arguably, the two most common forms of global operations are broadcasts and barriers. To demonstrate the use of these communication primitives, the initialisation and termination phases of a parallel STG machine will be examined.

Surprisingly, the sequential STG machine does not specify a rule for ending the computation. One simple definition would be that the evaluation is complete when all three stacks are empty:

\[
\begin{align*}
(\text{STOP}) \quad & \text{Return}_{\chi} \ c \ ws \ \ \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ h \ \ \sigma \\
\implies \quad & \text{Stop } \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ h \ \ \sigma
\end{align*}
\]

In a parallel system, however, when the main thread terminates, all processors need to be notified that the evaluation has completed. On a DMMP architecture, a message broadcast would be used:

\[
\begin{align*}
(\text{STOP}_1) \quad & \text{Return}_{\chi} \ c \ ws \ \ \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ \langle \text{Finished} \rangle \ \ t_{id} \ \ wp \ \ h \ \ \sigma \ \ b_i \\
\implies \quad & \text{Stop } \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ t_{id} \ \ wp \ \ h \ \ \sigma \ \ b'_i \\
\text{where} \quad & b'_i = \text{broadcast } \text{Stop} \ b_i \\
& \text{broadcast } m \ (b_{in}, b_{out}) = (b_{in}, \ b_{out} \mathbin{+\!+} (\forall j \in \{1, \ldots, n\} \bullet (j, m)))
\end{align*}
\]

\[
\begin{align*}
(\text{STOP}_2) \quad & code \ \ as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h \ \ \sigma \ \ (b_{in}, b_{out}) \\
& \text{such that } (j, \text{Stop}) \in b_{in} \\
\implies \quad & \text{Stop } as \ \ rs \ \ us \ \ t_{id} \ \ wp \ \ h \ \ \sigma \ \ (b'_{in}, b_{out}) \\
\text{where} \quad & b'_{in} = \text{dequeue } (j, \text{Stop}) \ b_{in}
\end{align*}
\]
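
The broadcast helper used by STOP_1 amounts to appending one addressed copy of the message per processor to the outgoing buffer. A minimal Python sketch, assuming that buffers are pairs of lists and that processors are numbered from 1 to n (the parameter `n` is passed explicitly here, whereas the rule leaves it implicit):

```python
def broadcast(m, buffers, n):
    """Append (j, m) to the outgoing buffer for every processor j in 1..n."""
    b_in, b_out = buffers
    return (b_in, b_out + [(j, m) for j in range(1, n + 1)])


def stop_requested(buffers):
    """Side condition of STOP_2: some (j, Stop) is in the incoming buffer."""
    b_in, _ = buffers
    return any(message == "Stop" for _, message in b_in)


after = broadcast("Stop", ([], [(2, "Fish")]), 3)
```

As written, the definition also addresses a copy to the sender itself; whether that copy is delivered or filtered out by the network model is not specified here.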

During the initialisation phase, each processor loads the code and data required to perform the parallel graph reduction. It is important that the evaluation does not start until all processors are ready. Otherwise, there is the risk of races, whereby an early starter attempts to interact with a laggard, resulting in unpredictable and potentially fatal results. The common solution to this problem is the use of a global barrier. Essentially, a barrier maps each processor onto a logical tree, as shown below.

As soon as a leaf is ready, it sends an OK message to its parent, and then awaits the arrival of a Start message. A node, however, must wait for its left and right children to become ready (signalled by the reception of their OK messages) before it can signal its readiness to its parent. Finally, once the root has received messages from its two children, the Start message will then be broadcast, thereby enabling all of the processors to continue. Note also that by inverting the tree, a similar mechanism can be used to implement a more efficient broadcast operation than the version described in the previous section.

Figure 6.14 shows the initialisation rule for a DMMP system using a barrier operation to ensure all processors are ready. The communication roles (is_root, left, parent, etc.) for a 7 processor system are specified as follows (a more robust definition can be found in [Ben-Dyke, 1997]):

<table>
<thead>
<tr>
<th>i</th>
<th>is_leaf(i)</th>
<th>is_node(i)</th>
<th>is_root(i)</th>
<th>left(i)</th>
<th>right(i)</th>
<th>parent(i)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>true</td>
<td>false</td>
<td>false</td>
<td></td>
<td></td>
<td>5</td>
</tr>
<tr>
<td>2</td>
<td>true</td>
<td>false</td>
<td>false</td>
<td></td>
<td></td>
<td>5</td>
</tr>
<tr>
<td>3</td>
<td>true</td>
<td>false</td>
<td>false</td>
<td></td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>4</td>
<td>true</td>
<td>false</td>
<td>false</td>
<td></td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>5</td>
<td>false</td>
<td>true</td>
<td>false</td>
<td>1</td>
<td>2</td>
<td>7</td>
</tr>
<tr>
<td>6</td>
<td>false</td>
<td>true</td>
<td>false</td>
<td>3</td>
<td>4</td>
<td>7</td>
</tr>
<tr>
<td>7</td>
<td>false</td>
<td>false</td>
<td>true</td>
<td>5</td>
<td>6</td>
<td></td>
</tr>
</tbody>
</table>

Similarly, the low-precedence infix operator (;) denotes the sequence function, which provides a convenient shorthand for creating values of the code component:

\[
(;) \equiv \lambda \text{operation.} \lambda \text{continuation.} \text{operation continuation}
\]

\[
\begin{align*}
G & = (P_1, \ldots, P_n) \ \sigma \ b_{init} \\
\text{where} \quad P_i & = (\text{barrier}(i); \text{GetThread}, \ \emptyset, \ \emptyset, \ h, \ t_{none}, \ wp_i) \\
b_{init} & = (\forall i \in \{1, \ldots, n\} \bullet b_i = (\emptyset, \emptyset)) \\
wp_i & = \begin{cases} (\emptyset, (a_{main})), & \text{if } is\_root(i) \\ (\emptyset, \emptyset), & \text{otherwise} \end{cases} \\
h & = \{\forall i \in \{1, \ldots, n\} \bullet a_i \mapsto (vs_i \ \backslash\pi_i \ xs_i \to exp_i, \ \sigma \ vs_i)\} \\
\sigma & = \{g_1 \mapsto a_1, \ldots, g_n \mapsto a_n\}
\end{align*}
\]

**Figure 6.14: Initialisation of a DMMP system using a global barrier operation**
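
The barrier protocol for the 7-processor configuration can be simulated with the following Python sketch. It is an illustration rather than the definition used by the framework: the `PARENT` and `CHILDREN` tables encode the communication roles given earlier (assuming that processors 5 and 6 report to the root, processor 7, as implied by the root's left and right entries), and the function `barrier` and its message log are hypothetical.

```python
PARENT = {1: 5, 2: 5, 3: 6, 4: 6, 5: 7, 6: 7, 7: None}
CHILDREN = {5: (1, 2), 6: (3, 4), 7: (5, 6)}


def barrier(ready_order):
    """Return the messages sent as processors become ready in the given order."""
    oks = {i: set() for i in CHILDREN}  # OK messages received by each node
    ready = set()
    log = []

    def try_signal(i):
        # A processor signals once it is ready and all its children have too.
        if i not in ready or set(CHILDREN.get(i, ())) != oks.get(i, set()):
            return
        parent = PARENT[i]
        if parent is None:  # the root releases everybody
            log.extend(("Start", i, j) for j in sorted(PARENT) if j != i)
        else:
            log.append(("OK", i, parent))
            oks[parent].add(i)
            try_signal(parent)

    for i in ready_order:
        ready.add(i)
        try_signal(i)
    return log


log = barrier([7, 5, 6, 1, 2, 3, 4])
```

Whatever the order in which the processors become ready, no Start message appears in the log until the root has received an OK from both of its children, and hence until every processor is ready.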

Global barriers may also be required during other phases of the evaluation, including, for example, prior to global garbage collection, and termination. Furthermore, the barrier and broadcast templates provide a sound foundation on which to build more complex global operations, such as parallel scans [Lander and Fisher, 1980; Partain, 1991; Springsteel and Stojmenovic, 1989], and gather and scatter [Gropp and Smith, 1993] operations.

### 6.3.3 Resource management

With regard to the STG machine, resources fall into one of the following three categories:

**data** is often not considered a system resource, but its maintenance and distribution are central to the efficient progress of the entire computation. On a uni-processor system, the heap serves as the main data repository, but multi-processor systems offer a spectrum of sharing mechanisms. Furthermore, data can also include system information, such as the location of idle processors, or the availability of specialised hardware.

**space** primarily relates to the processors' memory pool, but can also include on-line storage, such as hard disks. Garbage collection is the main technology used to control a system's usage of space. Typically, most multi-processor implementations will use a two-tiered approach to garbage collection: firstly, each processor will manage its own local memory pool; secondly, when the local collectors fail to reclaim sufficient space, a global collector will be used.

**time** includes both processor and communication time. Techniques for efficiently managing processor time include scheduling and load balancing. The former is concerned with allocating time on a single processor, while the latter deals with sharing clock cycles over a group of processors. However, neither technology will compensate for inefficient algorithms or poor implementations.

Often it is difficult to completely separate the different concerns of resource management. For example, the following rule demonstrates an optimisation used in the GHC compiler whereby small integer values are pre-allocated in the global heap:

\[
\begin{align*}
& \text{Return Int } \langle n \rangle \ \ \langle\rangle_{stack} \ \ \langle\rangle_{stack} \ \ (as_u, rs_u, a_u) : us \ \ h \ \ \sigma \\
& \text{such that } 0 < n < 100 \\
\implies \quad & \text{Return Int } \langle n \rangle \ \ as_u \ \ rs_u \ \ us \ \ h' \ \ \sigma \\
\text{where} \quad & h' = h[a_u \mapsto \text{Ind } a_{\text{small const } n}]
\end{align*}
\]

This can save a large amount of memory as only one closure needs to be allocated for each integer between zero and one hundred. However, this does increase the number of instructions executed during the update phase of integer values. For the GHC compiler, this trade-off between the space saved and the increase in time is deemed to be worthwhile. However, in general, such decisions are difficult to make, although the animation can be used to gather supporting empirical data.
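
The space and time trade-off can be made concrete with a minimal sketch of the update step. The heap model (a dictionary from address strings to tagged tuples) and the names `make_heap`, `update_with_int` and `deref` are assumptions made for illustration only; they are not part of the STG machine or of GHC.

```python
SMALL_LIMIT = 100  # bound taken from the guard 0 < n < 100

def make_heap():
    """Pre-allocate one shared closure per small integer (hypothetical layout)."""
    return {f"small_{n}": ("Int", n) for n in range(1, SMALL_LIMIT)}

def update_with_int(heap, a_u, n):
    """Update the closure at address a_u with the integer result n."""
    if 0 < n < SMALL_LIMIT:
        # The extra test on every update is the time cost of the optimisation.
        heap[a_u] = ("Ind", f"small_{n}")
    else:
        heap[a_u] = ("Int", n)

def deref(heap, a):
    """Follow indirections until a value closure is reached."""
    while heap[a][0] == "Ind":
        a = heap[a][1]
    return heap[a]

heap = make_heap()
update_with_int(heap, "x", 42)   # shares the pre-allocated closure
update_with_int(heap, "y", 42)   # shares the same closure again
update_with_int(heap, "z", 500)  # too large: a fresh closure is stored
```

Both `x` and `y` end up as indirections to the same pre-allocated closure, so no new integer closure is created for them, while `z` pays for its own.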

The following sections discuss the management of space and time, while the discussion of data management is deferred to section 6.3.4.

**Garbage collection**

While the original abstract machine does not explicitly provide support for garbage collection [Wilson, 1992], it is essential that the following rules are adhered to\(^1\):

1. **when invoking the garbage collector, the root set of live closures must be known.**
   The root set serves as the starting point for the collector, from which all other live closures must be traceable. The root set of the STG machine comprises all of the components, with the exception of the heap.

2. **the garbage collector must have access to all of the addresses stored within a closure.**
   If a closure is known to be live, then all of the closures to which it refers must also be live. To this end, GHC uses specialised garbage-collection entry methods for each type of closure (see section 6.4.3).

3. **during garbage collection it must be possible to differentiate between addresses and literals.** Due to the differences in handling of these two types, a mistake either way could lead to system failure. To avoid tagging, GHC partitions the state’s components so that different types rarely appear together, and, when they do, a mask identifies the addresses.

4. **environments must only contain live variables.** This ensures that the garbage collector can remove dead closures as soon as possible. Figure 4.12 illustrates how the free-variable information can be used to safely trim an environment.

\(^1\)Note that most of these restrictions do not apply if a reference-counting system is used. However, due to their inability to reclaim cyclic data structures [Wilson, 1992, section 2.1], it is unlikely that such a system would serve as the main reclamation technology.
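
\[
\begin{array}{l}
\text{Enter } a \; as \; rs \; us \; h \; cafs \; \sigma \\
\quad \text{such that } \text{is\_global } a \text{ and } h[a \mapsto (\emptyset \; u \; \emptyset \to e, \emptyset)] \\
\implies \text{Eval } e \; \emptyset \; \emptyset \; \emptyset \; us' \; h' \; (a : cafs) \; \sigma \\
\quad \text{where } h' = h[a \mapsto \text{FromSpace } a', \; a' \mapsto \text{BlackHole}] \\
\qquad us' = (as, rs, a') : us, \text{ and } a' \notin \text{dom}(h)
\end{array}
\]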

Figure 6.15: Maintaining a list of CAFs for the garbage collector

A number of different collectors have been used with sequential implementations, including a generational scheme [Sansom and Peyton Jones, 1993] and a hybrid compacting collector [Sansom, 1992]. Despite their differences, a collector can be viewed as a transformation between STG-machine states, and so the basic principles remain the same. Therefore, to illustrate the impact of garbage collection on the STG machine, the following section will focus upon the development of a two-space collector [Wilson, 1992].

Although this chapter deals with the basics of uni-processor garbage collection, the issue of distributed garbage collection is beyond the scope of this thesis. However, there is no obvious reason why the techniques explored in this chapter could not be used to investigate such algorithms.

A two-space copying collector

As the name suggests, a two-space collector divides the heap into two equal spaces, called from-space and to-space. During normal operation, closures are allocated in from-space, until the available memory is exhausted. The collector is then invoked, copying all live closures from from-space into to-space. The spaces are then reversed, i.e. from-space becomes to-space and vice versa, and normal operation resumes.

Before framing the collector in terms of the STG machine, the root set needs to be identified. The simplest solution would be to include the entire state, with the exclusion of the heap. However, the global environment would then need to be mutable, as the mapping between top-level variables and their heap addresses could change. This would significantly degrade performance as references to globals would then have to be resolved dynamically. One solution to this problem is to allocate the global closures in a separate block of memory from the heap, and outside the remit of the garbage collector. The various addresses will then remain constant, allowing the compiler to hard-code any references to the variables. However, one problem still remains: constant applicative forms (CAFs, see [Peyton Jones, 1987, section 13.2, page 224]). If evaluated, CAFs will be updated with references into from-space. According to the first rule presented in the previous section, all such references need to be part of the root set. To this end, figure 6.15 shows the rule necessary to automatically maintain a list of active CAFs. Essentially, every time a CAF is entered, it allocates a proxy closure inside from-space and updates itself with a FromSpace indirection. The proxy then becomes the target for the update frame. The heap is now represented by the triple (static, from, to), where static contains the top-level closures, and from and to are from-space and to-space respectively.
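
The CAF bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's executable specification: closures are tagged tuples, the heap spaces are dictionaries, and the address supply `_fresh` and the function name `enter_caf` are hypothetical.

```python
from itertools import count

_fresh = count()  # hypothetical supply of unused from-space addresses

def enter_caf(a, static, from_space, as_, rs, us, cafs):
    """Enter the CAF at static address a; return the new machine components."""
    _, code = static[a]
    a_proxy = f"f{next(_fresh)}"
    from_space[a_proxy] = ("BlackHole",)   # proxy closure awaiting the result
    static[a] = ("FromSpace", a_proxy)     # the CAF now points into from-space
    us = [(as_, rs, a_proxy)] + us         # update frame targets the proxy
    # Evaluation of the body starts with empty stacks; a joins the CAF list.
    return code, [], [], us, [a] + cafs

static = {"primes": ("Thunk", "sieve [2..]")}
from_space = {}
code, as_, rs, us, cafs = enter_caf("primes", static, from_space, ["arg"], [], [], [])
```

After the call, `cafs` names every static closure that refers into from-space, which is exactly the extra information the collector needs in its root set.
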

The collector should be invoked whenever heap space becomes scarce, with `let` expressions being the logical triggers:

\[
(\text{GC}_{\text{init}}) \quad \begin{array}{l}
\text{Eval } e \; \rho \; as \; rs \; us \; h \; cafs \; \sigma \\
\quad \text{such that } \text{is\_let } e \text{ and } \text{limit\_reached } h \\
\Rightarrow \text{Eval } e \; \rho' \; as' \; rs' \; us' \; h' \; cafs \; \sigma \\
\quad \text{where } (\rho', as', rs', us', h') = \text{two\_space } (\rho, as, rs, us, cafs, h)
\end{array}
\]

The top-level collector simply scavenges the *root set*, and then scavenges the closures which have been copied over into *to-space*:

\[
\begin{array}{rl}
\text{two\_space } state \equiv & \text{Scavenge}_{\rho} \; state \mathbin{\$} \\
& \text{Scavenge}_{as} \mathbin{\$} \\
& \text{Scavenge}_{rs} \mathbin{\$} \\
& \text{Scavenge}_{us} \mathbin{\$} \\
& \text{Scavenge}_{cafs} \mathbin{\$} \\
& \text{Scavenge}_{h}
\end{array}
\]

The low-precedence infix operator \((\$)\) denotes a function for reversing the normal order of application, and provides a convenient notation for threading state:

\[
(\$) \equiv \lambda state . \; \lambda operation . \; operation \; state
\]
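
The effect of the reversed application operator is simply a left-to-right fold of the operations over the state. A minimal sketch, in which `thread` and the recording stand-ins returned by `scavenge` are illustrative names rather than parts of the specification:

```python
from functools import reduce

def thread(state, *operations):
    """Apply each operation to the state in turn, left to right."""
    return reduce(lambda s, op: op(s), operations, state)

def scavenge(name):
    """Stand-in for a scavenge routine: it only records that it has run."""
    return lambda state: state + [name]

result = thread([], scavenge("rho"), scavenge("as"), scavenge("rs"),
                scavenge("us"), scavenge("cafs"), scavenge("h"))
```

Here `result` lists the routines in the order in which they were applied, mirroring the order of the definition of the collector.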

The scavenge routines for the elements of the *root set* simply locate all heap references and copy the referenced closures into *to-space* by calling each closure's evacuate method (the location of which will be stored in its info table). This method returns the new *to-space* address, which replaces the original reference. As an example, consider the scavenge routine for the local environment, \(\rho\):

\[
(\text{SCAV}_\rho) \quad \begin{array}{l}
\text{Scavenge}_\rho \; (\{v_1 \mapsto w_1, \ldots, v_n \mapsto w_n\}, as, rs, us, cafs, h_0) \\
\quad = (\{v_1 \mapsto w'_1, \ldots, v_n \mapsto w'_n\}, as, rs, us, cafs, h_n)
\end{array}
\]

where \((w'_i, h_i) = \begin{cases} (w_i, h_{i-1}), & \text{if } \vdash v_i : \nu \\ \text{Evacuate}_a (w_i, h_{i-1}), & \text{otherwise} \end{cases}\)

Notice that type information is used to differentiate between literal values and addresses (see rule 3 from the previous section) – a real compiler would probably generate static masks from the type information rather than using a dynamic lookup.

The evacuate method is necessarily closure dependent, but some of the more common variants are shown below, starting with a standard *from-space* closure:

\[
(\text{EVAC}_{\text{std}}) \quad \begin{array}{l}
\text{Evacuate}_a \, (a, \, \text{(static, from}[a \mapsto (vs \, \pi \, xs \to e, ws)], \, to)) \\
= (a', \, \text{(static, from}[a \mapsto \text{ToSpace} \, a'], \, to'))
\end{array}
\]

where \(a' \not\in \text{dom}(to)\) and \(to' = to[a' \mapsto (vs \, \pi \, xs \to e, ws)]\)

Notice that no attempt is made to scavenge the closure's free variables, as this will take place during \(\text{Scavenge}_h\). Updating the original closure with a ToSpace closure ensures only one copy is ever created in *to-space*:

\[
(\text{EVAC}_{\text{toSpace}}) \quad \begin{array}{l}
\text{Evacuate}_a \, (a, \, \text{(static, from}[a \mapsto \text{ToSpace} \, a'], \, to)) \\
= (a', \, \text{(static, from}, \, to))
\end{array}
\]

Similarly, indirection pointers (see section 6.4.3) do not copy themselves into to-space, but simply forward the evacuation:

\[
(\text{EVAC}_{\text{ind}}) \quad \text{Evacuate}_a (a, (\text{static}, \text{from}[a \mapsto \text{Ind } a'], \text{to})) = \text{Evacuate}_a (a', (\text{static}, \text{from}, \text{to}))
\]

FromSpace closures, created when evaluating CAFs, forward the evacuation to the from-space closure, but then update themselves with the new to-space address:

\[
(\text{EVAC}_{\text{fromspace}}) \quad \text{Evacuate}_a (a, (\text{static}[a \mapsto \text{FromSpace } a'], \text{from}, \text{to})) = (a'', (\text{static}'[a \mapsto \text{FromSpace } a''], \text{from}', \text{to}'))
\]

where \((a'', (\text{static}', \text{from}', \text{to}')) = \text{Evacuate}_a (a', (\text{static}, \text{from}, \text{to}))\)

Finally, all static non-CAF closures can ignore the evacuation:

\[
(\text{EVAC}_{\text{static}}) \quad \text{Evacuate}_a (a, (\text{static}[a \mapsto \text{lambda\_form}], \text{from}, \text{to})) = (a, (\text{static}, \text{from}, \text{to}))
\]
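
The five evacuation variants can be collected into a single dispatch function. The sketch below follows the prose descriptions above; the tagged-tuple closure encoding, the dictionary heaps and the fresh-address scheme based on `len(to)` are assumptions made for illustration.

```python
def evacuate(a, static, frm, to):
    """Return the to-space address for a, copying the closure if necessary."""
    if a in static:
        closure = static[a]
        if closure[0] == "FromSpace":          # CAF: forward, then update itself
            new = evacuate(closure[1], static, frm, to)
            static[a] = ("FromSpace", new)
            return new
        return a                               # static non-CAF: nothing to do
    closure = frm[a]
    if closure[0] == "ToSpace":                # already copied: reuse the address
        return closure[1]
    if closure[0] == "Ind":                    # indirection: forward the request
        return evacuate(closure[1], static, frm, to)
    new = f"t{len(to)}"                        # standard closure: copy verbatim
    to[new] = closure
    frm[a] = ("ToSpace", new)                  # leave a forwarding marker behind
    return new

static = {"caf": ("FromSpace", "f1"), "fn": ("Std", "code", [])}
frm = {"f0": ("Std", "c0", ["f1"]), "f1": ("Std", "c1", []), "f2": ("Ind", "f0")}
to = {}
first = evacuate("f0", static, frm, to)
again = evacuate("f0", static, frm, to)
via_ind = evacuate("f2", static, frm, to)
via_caf = evacuate("caf", static, frm, to)
```

Note that the copied closure `c0` still refers to the from-space address `f1`; as in the rules, its free variables are only corrected when to-space is scavenged.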

The final stage of the two-space collector requires that the to-space closures are scavenged to allow them to update their free variables:

\[
(\text{SCAV}_{h}) \quad \text{Scavenge}_h (\rho, as, rs, us, cafs, (\text{static}, \text{from}, \text{to})) = (\rho, as, rs, us, (\text{static}', \text{to}', \text{empty\_heap}))
\]

where \((\text{static}', \text{from}', \text{to}') = \text{Scavenge}_{\text{tospace}} (a_{\text{start}}, \text{static}, \text{from}, \text{to})\)

\(a_{\text{start}} = \text{get\_first\_address } \text{to}\)

The scavenge process starts at the first closure in to-space and then iterates through each closure until the end of the heap is reached:

\[
(\text{SCAV}_{\text{tospace1}}) \quad \text{Scavenge}_{\text{tospace}} (a, \text{static}, \text{from}, \text{to}) = (\text{static}, \text{from}, \text{to})
\]

such that \(a \notin \text{dom}(\text{to})\)

\[
(\text{SCAV}_{\text{tospace2}}) \quad \text{Scavenge}_{\text{tospace}} (a, \text{static}, \text{from}, \text{to}) = \text{Scavenge}_{\text{tospace}} (a_{\text{next}}, \text{static}', \text{from}', \text{to}')
\]

where \((\text{static}', \text{from}', \text{to}') = \text{Scavenge}_a (a, \text{static}, \text{from}, \text{to})\)

\(a_{\text{next}} = \text{next\_address } a \; \text{to}'\)

As with the evacuation methods, the scavenge routines are closure dependent; however, only a few types of closure will ever appear in to-space. For the sequential STG machine, only standard closures require a scavenge method:

\[
\text{Scavenge}_a (a, \text{static}_0, \text{from}_0, \text{to}_0[a \mapsto (v_1 \cdots v_n \pi xs \rightarrow e, w_1 \cdots w_n)])
\]

\[
= (\text{static}_n, \text{from}_n, \text{to}_n[a \mapsto (v_1 \cdots v_n \pi xs \rightarrow e, w'_1 \cdots w'_n)])
\]

where \((w'_i, (\text{static}_i, \text{from}_i, \text{to}_i))\)

\[
= \begin{cases} 
(w_i, (\text{static}_{i-1}, \text{from}_{i-1}, \text{to}_{i-1})), & \text{if } \vdash v_i : \nu \\
\text{Evacuate}_a (w_i, (\text{static}_{i-1}, \text{from}_{i-1}, \text{to}_{i-1})), & \text{otherwise}
\end{cases}
\]

The pattern is almost exactly the same as for the \(\text{Scavenge}_\rho\) function, whereby the free variables are evacuated and updated with the new to-space addresses.
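
Putting the pieces together, the whole collection cycle can be sketched in a few lines. This is a simplified illustration and not the specification itself: to-space is a list whose indices act as addresses, a side table `forward` stands in for the ToSpace closures, each field carries an explicit address flag in place of the type judgement, and static closures, indirections and CAFs are omitted.

```python
def collect(roots, frm):
    """Copy everything reachable from roots out of frm; return (roots', to)."""
    to = []        # to-space: the list index doubles as the new address
    forward = {}   # from-space address -> to-space address

    def evacuate(a):
        if a not in forward:
            forward[a] = len(to)
            to.append(frm[a])          # copied verbatim; fields are fixed later
        return forward[a]

    def scavenge_fields(fields):
        # Each field is (is_address, value); literals pass through untouched.
        return [(True, evacuate(v)) if is_addr else (False, v)
                for is_addr, v in fields]

    new_roots = scavenge_fields(roots)
    scan = 0
    while scan < len(to):              # to-space may grow while it is scanned
        code, fields = to[scan]
        to[scan] = (code, scavenge_fields(fields))
        scan += 1
    return new_roots, to

frm = {10: ("pair", [(True, 11), (False, 7)]),
       11: ("leaf", [(True, 10)]),     # cyclic reference back to the pair
       12: ("dead", [])}               # unreachable: never copied
new_roots, to = collect([(True, 10), (False, 99)], frm)
```

The cyclic pair is copied exactly once, the unreachable closure is left behind, and the literal root is returned unchanged.
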

**Scheduling**

Typically, there are three levels of scheduling that can be simultaneously active within a parallel functional implementation:

**process scheduling** tries to ensure that a processor is kept busy by de-scheduling inactive processes and re-scheduling active processes. More sophisticated systems may attempt to ensure fairness (see section 5.4.3) by allowing a process to only run for a fixed *time slice* before re-scheduling.

**algorithmic scheduling** ensures that the computation proceeds correctly by managing the interactions of the run-time system. Section 6.3.2 deals with the main synchronisation techniques used by this style of scheduling, with the BH' rule demonstrating how the scheduler can be invoked whenever necessary.

**user-defined or explicit scheduling** attempts to improve the performance of the computation through the use of a priori information. As noted by Burton and Rayward-Smith [1994], without such data it is impossible to develop a fully automatic scheduling strategy that can ensure good performance. Unfortunately, non-strictness interferes with the traditional algorithms for automatically generating explicit schedules [Norman and Thanisch, 1993], and this area has received little attention in the literature (with the exception of algorithmic skeletons). Para-functional Haskell is one of the few functional languages that provides the programmer with explicit scheduling operators, allowing highly complex dependencies to be defined.

As an example of process scheduling, consider the thread-management system presented in section 9.2. As it stands, once a thread is scheduled to run, it will not relinquish control until it either terminates or blocks (on a BlackHole). As mentioned above, this is not fair, but the context-switch rule, CS, shown below, provides a solution:

\[
\begin{array}{l}
code \; as \; rs \; us \; h[a_{tso} \mapsto \text{TSO } state] \; a_{tso} \; wp \; \sigma \; t_{\text{local}} \\
\Rightarrow \text{GetThread} \; as \; rs \; us \; h[a_{tso} \mapsto \text{TSO } state'] \; a_{tso} \; wp \; \sigma \; t'_{\text{local}} \\
\quad \text{where } state' = (code, as, rs, us) \\
\qquad t'_{\text{local}} = t_{\text{local}} + t_{\text{step}}
\end{array}
\]

This is another example of an interrupt-driven rule, as first described in section 6.2.5. Notice that the guard condition will only capture the intended behaviour if \(t_{\text{step}}\) is identical for each rule (a down counter would be required to ensure a fixed period if \(t_{\text{step}}\) varies between rules).
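
The behaviour of a fixed time slice, implemented with the down counter mentioned above, can be sketched with generator-based threads. The function `run`, the thread pool layout and the helper `count` are illustrative assumptions; each `next` call stands for one machine step.

```python
from collections import deque

def run(threads, time_slice):
    """Round-robin over generator-based threads; return the execution trace."""
    pool = deque(threads.items())
    trace = []
    while pool:
        name, thread = pool.popleft()
        for _ in range(time_slice):        # down counter for the current slice
            try:
                trace.append((name, next(thread)))
            except StopIteration:
                break                      # thread terminated: do not requeue
        else:
            pool.append((name, thread))    # slice expired: context switch
    return trace

def count(n):
    """A toy thread that takes n steps."""
    for i in range(n):
        yield i

trace = run({"a": count(3), "b": count(1)}, 2)
```

Thread `a` is pre-empted after two steps, which allows `b` to run before `a` finishes, so neither thread can monopolise the processor.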

**Load balancing**

Just as scheduling attempts to maximise the efficiency of a single processor’s operation, load balancing aims to maximise the efficiency of a collection of processors. Traditionally, there have been two different approaches to load balancing in parallel functional implementations:

**active load balancing** is typified by Alfalfa's diffusion scheduler [Goldberg and Hudak, 1987, section 4.5], which distributes work to neighbouring processors as it is generated. As the processors interact, they swap load information, and this is combined with locality maps to determine which processor should receive the new work. The net result is that work "diffuses" through the system. Perhaps surprisingly, no equivalent to the scheduling directives of para-functional Haskell exists for task placement. Skeletal operators, however, can optimise the load distribution based on knowledge of the precise mechanics of the underlying algorithm.

**passive load balancing** waits until a processor becomes unemployed before attempting to re-distribute the available work. GUM's fishing mechanism (see section 9.3) involves the out-of-work processor sending a work-request message to one of its neighbours. If the neighbour has sufficient extra work, it returns a suitable portion, otherwise the message is forwarded to another candidate.

While both systems have their merits, a combination of the two is probably necessary for optimal performance.

Whatever approach is finally decided upon, it will undoubtedly build upon the techniques described in section 6.3.2. Once more, UML sequence diagrams can significantly reduce the complexity of designing algorithms involving multiple interactions with remote processors. As an example, consider the two sequence diagrams representing GUM's fishing mechanism shown in figures 6.16 and 6.17. The first figure shows the interactions that need to take place before an unemployed processor receives and starts evaluation of a new task. The second figure shows how fish messages are forwarded if the recipient has no spare work. Furthermore, it also shows how the fish message will be re-spawned after a back-off period once the original message completes a cycle and returns to the unemployed processor.
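
The forwarding behaviour of a fish message can be sketched as a walk around the processors. The ring ordering, the `min_spare` threshold and the function name `fish` are assumptions made for illustration; the back-off period and the message buffers are not modelled.

```python
def fish(pools, origin, min_spare=2):
    """Send one fish round the ring; return (donor, hops), donor may be None."""
    n = len(pools)
    target = (origin + 1) % n
    hops = 1
    while target != origin:
        if len(pools[target]) >= min_spare:
            # The recipient has spare work: hand one task to the fisher.
            pools[origin].append(pools[target].pop())
            return target, hops
        target = (target + 1) % n          # no spare work: forward the fish
        hops += 1
    return None, hops                      # the fish returned empty-handed

pools = [[], ["t1"], ["t2", "t3"], []]
first = fish(pools, 0)    # processor 0 is unemployed
second = fish(pools, 3)   # nobody has spare work left
```

The first request is satisfied by processor 2 after one forwarding step, whereas the second completes a full cycle, at which point a real implementation would wait before re-spawning the message.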

While the details of GUM's passive load-balancing system are contained in section 9.3.2, the following section explores Alfalfa's diffusion scheduling.

Diffusion scheduling

Before moving on to consider the work distribution mechanism, it is worth discussing how each processor is informed of the status of the others. Goldberg and Hudak [1987] use specialised messages to communicate this information, and they develop a number of heuristics to determine the frequency of these transmissions. However, there is no reason why this information could not be piggy-backed onto the regular message traffic. As a simple example, the following rule demonstrates how a GVT-style token-ring algorithm [Ben-Dyke, 1997, section 3.1] could be used to disseminate this information:

\[
\begin{array}{l}
code \; as \; rs \; us \; t_{id} \; wp \; status \; h \; \sigma \; (b_{in}, b_{out})_i \\
\quad \text{such that } \text{probe } b_{in} \; (j, \text{StatusToken } new\_status) \\
\Rightarrow code \; as \; rs \; us \; t_{id} \; wp \; new\_status \; h \; \sigma \; (b'_{in}, b'_{out})_i \\
\quad \text{where } b'_{in} = \text{dequeue } (j, \text{StatusToken } new\_status) \; b_{in} \\
\qquad b'_{out} = \text{enqueue } (k, \text{StatusToken } new\_status') \; b_{out} \\
\qquad k = \text{neighbour } i \\
\qquad new\_status' \mathbin{!} l = \begin{cases} \text{size } wp, & \text{if } l = i \\ new\_status \mathbin{!} l, & \text{otherwise} \end{cases}
\end{array}
\]

Upon reception of the token, the processor updates its own copy of the system's status, modifies the token to reflect its current level of activity, and then passes it on to the next processor in the ring.
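
A minimal sketch of a single token-handling step is given below. The list representation of the status vector, the ring ordering used for the neighbour and the name `receive_token` are assumptions made for illustration.

```python
def receive_token(i, token, work_pool, n):
    """Processor i handles the token; return (local status, new token, next)."""
    local_status = list(token)       # adopt the token's view of the system
    new_token = list(token)
    new_token[i] = len(work_pool)    # record this processor's current load
    return local_status, new_token, (i + 1) % n

status, token, neighbour = receive_token(1, [3, 0, 5], ["a", "b"], 3)
```

The processor's local copy reflects the token as it arrived, while the forwarded token carries the processor's own up-to-date work-pool size.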

The status information enables the local processor to determine the best candidate to receive any new work that is generated. The `letpar` construct is a classic example of a task generator:

\[
\text{letpar } v = e_1 \ e_2 \ \rho \ \text{as } rs \ us \ t_{id} \ wp \ status \ h \ \sigma \ b_i
\]

such that \(\text{sufficient\_work \ wp}\)

\[
\implies \text{Eval } e_2 (\rho \ominus \{v \mapsto a\}) \ \text{as } rs \ us \ t_{id} \ wp \ status' \ h' \ \sigma \ b'_i
\]

where \((\text{task, } h') = \text{pack } e_1 \ \rho \ h[a \mapsto \text{Exported } e_1 \ \rho]\)

\[
b'_i = \text{enqueue } (j, \text{Schedule task}) \ b_i
\]

\[
j = \text{select\_target } i \ \text{status } e_1 \ \rho
\]

\[
\text{status'} = \text{inc\_work } j \ \text{status}
\]

The `pack` routine bundles together sufficient context in the hope that the expression \(e_1\) can be evaluated remotely without requiring too much further interaction (see section 6.3.4).

The task is then sent to the identified target, but the local processor keeps sufficient data such that the task can be recreated if the message is lost. Upon reception, the remote processor unpacks the task and then sends an acknowledgement to allow the originator to commit the changes to the closures involved. The acknowledgement will also include the remote address of the task's main closure, allowing the `Exported` closure to be replaced with a `FetchMe` closure (see section 6.3.2). Notice also that the status information for the remote processor is increased to avoid it receiving an avalanche of new tasks.
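
The interplay between target selection and the local status update can be sketched as follows. The least-loaded heuristic in `select_target` is an assumption (the original scheduler also consults locality maps and the expression being exported), and the packing and acknowledgement steps are omitted.

```python
def select_target(i, status):
    """Pick the least-loaded processor other than i (an assumed heuristic)."""
    candidates = [j for j in range(len(status)) if j != i]
    return min(candidates, key=lambda j: status[j])

def inc_work(j, status):
    """Return a copy of status with processor j's load increased by one."""
    return [w + 1 if k == j else w for k, w in enumerate(status)]

def letpar(i, status, out_buffer, task):
    """Export task from processor i; return the updated local status."""
    j = select_target(i, status)
    out_buffer.append((j, ("Schedule", task)))
    return inc_work(j, status)

buffer = []
status1 = letpar(0, [4, 1, 1, 3], buffer, "t1")
status2 = letpar(0, status1, buffer, "t2")
```

Because the local estimate for processor 1 is raised as soon as the first task is sent, the second task goes to processor 2 instead of piling onto the same target.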

Looking at figure 6.16, it should be clear that there is, in fact, a great deal of similarity between diffusion scheduling and GUM's fishing mechanism. For example, compare the `letpar` rule presented here with GUM's `send_wrk` rule from section 9.3.2. The main difference between the two systems is simply the trigger that initiates the transfer of work.

### 6.3.4 Partitioning and naming

As mentioned in section 6.3.3, the maintenance and distribution of data is central to the efficient progress of the STG machine. This section covers the following areas of data management:

**data partitioning** is concerned with striking a balance between the time required to access a particular value, and the amount of time and/or memory dedicated to distributing the data. As with traditional memory management systems [Hwang and Briggs, 1985, section 2.3.1, pages 80–86], the partitioning can either be static (fixed) or dynamic (variable). However, non-strictness again causes problems with abstract analysis, and most modern implementations have to rely on dynamic partitioning.

**scoping** controls the visibility of variables, and the STG' language is lexically scoped: identifiers are only accessible within the expression that defines them. The local and global environments, \(\rho\) and \(\sigma\), are used by the STG machine to implement scoping.

**locating and accessing remote closures** can involve a number of different techniques, depending upon the target architecture. GMSV implementations, for example, have direct access to all closures in the shared heap. However, a number of studies suggest that performance can be improved by moving towards a DMMP design [Hammond and Peyton Jones, 1992; Mattson Jr., 1993b; Islam and Campbell, 1992] where remote values have to be explicitly requested, as with the `FetchMe` closures described previously.

The remainder of this section looks at the first two of these areas.

**Static partitioning**

The initial partitioning of globally-visible closures is determined by the STG machine's `init` rule. The simplest approach is to allocate all of the `lambda-form` closures to one processor, and use `Ind` or `FetchMe` closures on the others (for GMSV and DMMP architectures respectively). To improve locality at the expense of space efficiency, all functions and constants can be safely allocated on all processors. However, the GUM system goes one step further and even copies top-level thunks [Trinder et al., 1996, section 6.2], risking the duplication of work:

\[
\begin{array}{rcl}
G & = & (P_1, \ldots, P_n) \; (S_1, \ldots, S_n) \\[4pt]
\text{where} \quad P_i & = & (\text{GetThread}, \emptyset, \emptyset, \tau_{\text{none}}, wp_i, h_i, \sigma) \\
S_i & = & (\langle \text{in} \rangle, \langle \text{out} \rangle_i) \\
wp_i & = & (\langle \emptyset, \text{sparks}_i \rangle) \\
\text{sparks}_i & = & \begin{cases} \langle a_{\text{main}} \rangle, & \text{if } i = 1 \\ \langle \emptyset \rangle, & \text{otherwise} \end{cases} \\
h_i & = & \begin{cases} h, & \text{if } i = 1 \\ h[a_{\text{main}} \mapsto \text{FetchMe } a_{\text{main}}], & \text{otherwise} \end{cases} \\
h & = & \{ a_1 \mapsto (vs_1 \; \pi_1 \; xs_1 \to exp_1, \sigma \; vs_1), \; \ldots, \; a_n \mapsto (vs_n \; \pi_n \; xs_n \to exp_n, \sigma \; vs_n) \} \\
\sigma & = & \{ g_1 \mapsto a_1, \; \ldots, \; g_n \mapsto a_n \}
\end{array}
\]
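
The shape of this initial state can be sketched directly. The dictionary layout of each processor and the name `init` are illustrative assumptions; the message buffers and the global environment are left out for brevity.

```python
def init(n, heap, a_main="a_main"):
    """Build the initial state of n processors from the top-level heap."""
    processors = []
    for i in range(1, n + 1):
        h = dict(heap)                       # every processor copies the globals
        if i != 1:
            h[a_main] = ("FetchMe", a_main)  # only processor 1 owns main
        sparks = [a_main] if i == 1 else []
        processors.append({"code": "GetThread", "sparks": sparks, "heap": h})
    return processors

top_level = {"a_main": ("Thunk", "main"), "a_f": ("Fun", "f")}
procs = init(3, top_level)
```

Every processor holds its own copy of `a_f`, but only the first holds the real `main` closure and the spark that starts the computation.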

Ideally, some form of automated mapping strategy [Norman and Thanisch, 1993] should be used. As a first step towards this goal, Dennis [1995, section 5.3, pages 155–156] manually generated mapping plans for a Sisal optical-surveillance algorithm. However, in general, traditional static mapping algorithms cannot be readily adapted to work with non-strict systems. Even the large body of work dedicated to strictness analysis has not helped to tame such systems [Bloss and Hudak, 1988; Burn, 1991; Seward, 1992; Beemster, 1994; Peyton Jones and Partain, 1994]. Hence, most modern implementations rely on dynamic partitioning, and only form the crudest of static partitions (as seen previously).

The above discussion ignores the partitioning of the closure-access methods, which typically have to be reproduced on every processor.

**Dynamic partitioning**

Following on from the previous section, it is unlikely that a static partitioning will prove adequate for the duration of an entire computation. Dynamic partitioning attempts to maintain efficiency by moving data to where it is most needed.

As previously seen, load-balancing systems have a side effect on data placement in that they move clusters of closures between processors as part of their work re-distribution (see, for example, the diffusion scheduler's `LETPAR` rule from the previous section). Typically, a `pack` function is used to select which closures to include, and collects them together into a structure suitable for transmission. Hammond and Loidl [1996] examined a number of packing schemes, ranging between incremental fetching and bulk fetching. Incremental fetching packs just one closure per message, and invokes the closure's pack method to generate the data:

\[
\text{pack } a \; j \; h[a \mapsto (vs \; \pi \; xs \to exp, ws)] = (data, h[a \mapsto \text{FetchMe } j \; a])
\]

where

\[
\begin{aligned}
data &= (a, vs \; \pi \; xs \to exp, mask, ws) \\
mask &= mask_1 \cdots mask_n \\
mask_i &= \begin{cases}
0, & \text{if } \vdash (vs \mathbin{!} i) : \nu \\
1, & \text{otherwise}
\end{cases} \\
n &= \text{length } vs
\end{aligned}
\]

Note that the mask field allows the receiver to differentiate between literal values and addresses (see figure 9.21 for the corresponding unpack method). Bulk fetching packs the root closure and as much of its sub-graph as possible using a breadth-first algorithm (see [Trinder, Hammond, Partridge, Peyton Jones and others, 1996] for further details). While these are simple partitioning strategies, they can exhibit good locality of reference, as related values will tend to collect together. However, it is possible for two or more processors to compete for control of a shared thunk, wasting both processor time and communication bandwidth. PAM (the Parallel Abstract Machine [Loogen, Kuchen, Indermark and Damm, 1991]) circumvents this problem by allowing a thread to be migrated once only.
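
Incremental packing and the role of the mask can be sketched as follows. The convention that addresses are strings while literals are numbers is a stand-in for the type judgement, and the names `is_literal`, `pack` and `addresses` are chosen for illustration.

```python
def is_literal(value):
    """Stand-in for the type judgement: here, addresses are strings."""
    return not isinstance(value, str)

def pack(a, j, heap):
    """Pack the closure at a for processor j, leaving a FetchMe behind."""
    code, ws = heap[a]
    mask = [0 if is_literal(w) else 1 for w in ws]
    data = (a, code, mask, ws)
    heap[a] = ("FetchMe", j, a)
    return data

def addresses(data):
    """Receiver side: use the mask to pick out the fields that are addresses."""
    _, _, mask, ws = data
    return [w for bit, w in zip(mask, ws) if bit == 1]

heap = {"a1": ("code", ["a2", 7, "a3"])}
data = pack("a1", 2, heap)
```

The receiver needs no type information of its own: the mask alone tells it that the first and third fields must be treated as remote references.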

Explicit placement expressions, such as para-functional Haskell's `on` construct and algorithmic skeletons, are the other main drivers of dynamic partitioning. However, as can be seen from the rules in sections 9.4 and 9.5, these simply build upon the techniques presented in this chapter.

**Scoping**

Free-variable information can be used to determine the exact extent or lifetime of a variable within an expression [Muchnick, 1997, section 3.1, pages 43-44]. Note that this is a different concept to the lifetime of a closure, as references to a closure may be shared and passed outside the confines of a particular expression. However, reference counting does use extent information to garbage collect non-cyclic data [Wilson, 1992, section 2.1]. Consider, for example, the rule for function application, which increases by one the number of references that exist to the function’s arguments:

\[
\begin{array}{l}
\text{Eval } (f \; x_1 \cdots x_n) \; \rho \; as \; rs \; us \; h_0 \; \sigma \\
\quad \text{such that } \vdash f : \pi \\
\Rightarrow \text{Enter } a \; as' \; rs \; us \; h_n \; \sigma
\end{array}
\]

where

\[
\begin{aligned}
as' &= arg_1 : \cdots : arg_n : as \\
arg_i &= \text{val } \rho \; \sigma \; x_i \\
h_i &= \begin{cases}
h_{i-1}, & \text{if } \vdash x_i : \nu \\
\text{increase\_refs } arg_i \; h_{i-1}, & \text{otherwise}
\end{cases}
\end{aligned}
\]

\[
\text{increase\_refs } a \; h[a \mapsto (closure, refs)] = h[a \mapsto (closure, refs + 1)]
\]

\[
\text{decrease\_refs } a \; h[a \mapsto (closure, refs)] = h'
\]

where

\[
h' = \begin{cases}
h[a \mapsto (closure, refs - 1)], & \text{if } refs > 1 \\
\text{add\_free\_cell } a \; h[a \mapsto (\text{Reclaimed}, 0)], & \text{otherwise}
\end{cases}
\]

The reference counts are decremented at the end of a boxed variable's extent, as illustrated by the rule handling algebraic returns:

\[
\begin{array}{l}
\text{Return}_x \; con \; ws \; as \; (\cdots \; con \; vs \to e \; \cdots, \rho) : rs \; us \; h_0 \; \sigma \\
\implies \text{Eval } e \; \rho_{final} \; as \; (\cdots \; con \; vs \to e \; \cdots, \rho) : rs \; us \; h_n \; \sigma
\end{array}
\]

where

\[
\begin{aligned}
\rho_{final} &= \rho' \setminus \text{dead\_vars} \\
\rho' &= \rho \upharpoonright \{ v_1 \mapsto w_1, \ldots, v_n \mapsto w_n \} \\
\text{live\_vars} &= \mathcal{FV}[e] \\
\text{dead\_vars} &= \text{dom}(\rho') \setminus \text{live\_vars} \\
\text{dead\_vars}_i &= \text{dead\_vars} \mathbin{!} i \\
h_i &= \begin{cases} h_{i-1}, & \text{if } \Gamma \vdash \text{dead\_vars}_i : \nu \\ \text{decrease\_refs } (\rho' \; \text{dead\_vars}_i) \; h_{i-1}, & \text{otherwise} \end{cases}
\end{aligned}
\]

Some expressions will have to both increment and decrement the counts. For example, the `let` expression will increase references during the heap allocation of the closures, and then decrease references to eliminate the dead variables of the body expression. Note that it is important that the reference counts are always incremented before being decremented, to avoid incorrect reclamation of a closure. Furthermore, this technique relies on variable renaming to remove all possible ambiguities with respect to shared variable names.
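
The reference-counting operations and the trimming of dead variables can be sketched as follows. The heap maps an address to a pair of closure and count; the free list, the `is_boxed` predicate and the function `trim` are assumptions introduced for illustration.

```python
def increase_refs(a, heap):
    closure, refs = heap[a]
    heap[a] = (closure, refs + 1)

def decrease_refs(a, heap, free_list):
    closure, refs = heap[a]
    if refs > 1:
        heap[a] = (closure, refs - 1)
    else:
        heap[a] = ("Reclaimed", 0)     # last reference gone: recycle the cell
        free_list.append(a)

def trim(env, live_vars, heap, free_list, is_boxed):
    """Drop dead variables from env, releasing the closures they referred to."""
    for v, w in env.items():
        if v not in live_vars and is_boxed(w):
            decrease_refs(w, heap, free_list)
    return {v: w for v, w in env.items() if v in live_vars}

heap = {"a1": ("c1", 1), "a2": ("c2", 2)}
free_list = []
env = {"x": "a1", "y": "a2", "n": 5}
env = trim(env, {"y"}, heap, free_list, lambda w: isinstance(w, str))
```

Only `y` survives the trim; the closure at `a1` loses its last reference and is reclaimed, while the unboxed value bound to `n` is discarded without touching the heap.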

While reference counting is now rarely used as the main garbage collection technology, it can be combined with a copying collector to achieve safe incremental reclamation [Lester, 1989]. Furthermore, GHC uses very similar rules to implement *stack stubbing* to remove potential space leaks. Instead of decrementing reference counts, the stack slots occupied by any dead variables are overwritten or re-used for storing live variables [Peyton Jones, 1992, section 9.4.1, pages 62-63].

Module systems can introduce further complications with regard to scoping, but this is beyond the scope of this thesis.

### 6.4 Modifying the STG machine

In this section a number of guidelines are presented for integrating the changes made to the STG' language (see section 5.2) into the STG machine. The process involves two interdependent steps: firstly, using the syntax-extension method as an indicator, the rules that need to be added to the state-transition system are identified; secondly, the components needed to support these new rules are developed. To avoid complication, the examples used are all sequential in nature (section 6.2 deals with parallel and architecture-dependent features).

Sections 6.4.1 and 6.4.2 consider the effect of adding a new production rule and a new primitive type (see sections 5.2.1 and 5.2.3 respectively) to the original abstract syntax. Section 6.4.3 then looks at a number of different approaches to implementing the new rules.

#### 6.4.1 New production rules

There are two possible consequences of adding a new production rule to the abstract syntax:

**addition of a new state-transition rule** when a syntax group is the primary focus of an existing set of transition rules, any extension to the group will be mirrored in the STG machine by the addition of a new rule. Note that the existing rules can serve as templates.

Table 6.1: The relationship between the abstract syntax and the STG-machine rules

<table>
<thead>
<tr>
<th>syntax group</th>
<th>new rules</th>
<th>templates</th>
<th>existing rules</th>
</tr>
</thead>
<tbody>
<tr>
<td>program</td>
<td>mode</td>
<td></td>
<td>initial state</td>
</tr>
<tr>
<td>typedefs et al.</td>
<td>see section 6.4.2</td>
<td></td>
<td>initial state, 3, 5, 8', 16-17A</td>
</tr>
<tr>
<td>bindings</td>
<td>Eval let(rec)</td>
<td>3</td>
<td>initial state</td>
</tr>
<tr>
<td>binding</td>
<td></td>
<td></td>
<td>initial state, 3</td>
</tr>
<tr>
<td>simplebind</td>
<td>Eval letstrict</td>
<td>4A</td>
<td>8'</td>
</tr>
<tr>
<td></td>
<td>Eval let#</td>
<td>4B</td>
<td>12'</td>
</tr>
<tr>
<td>lambda_form</td>
<td></td>
<td></td>
<td>initial state, 3, 15-17A</td>
</tr>
<tr>
<td>\(\pi\)</td>
<td>Enter</td>
<td>2, 15</td>
<td>16-17A</td>
</tr>
<tr>
<td>\(exp\)</td>
<td>Eval exp</td>
<td>3-5, 9, 14</td>
<td></td>
</tr>
<tr>
<td>\(alts\)</td>
<td>Eval case</td>
<td>4</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Return</td>
<td>6, 7, 11, 13</td>
<td></td>
</tr>
<tr>
<td>\(lalt\)</td>
<td></td>
<td></td>
<td>4, 11-13</td>
</tr>
<tr>
<td>\(aalt\)</td>
<td></td>
<td></td>
<td>4, 6, 7</td>
</tr>
<tr>
<td>default</td>
<td></td>
<td></td>
<td>4, 6, 7, 11-13</td>
</tr>
<tr>
<td>vars</td>
<td>see the lambda_form and aalt entries</td>
<td></td>
<td></td>
</tr>
<tr>
<td>\(atoms\)</td>
<td>Eval \(\text{var}_{\text{fun}}\)</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Eval \(\text{cons}\)</td>
<td>5</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Eval \(\text{primitive}\)</td>
<td>14</td>
<td></td>
</tr>
<tr>
<td>\(atom\)</td>
<td></td>
<td></td>
<td>1, 5, 14</td>
</tr>
</tbody>
</table>

**modification of an existing rule** if the syntax group is only of minor significance with regards to a rule set, all elements will have to be reviewed. This will result in a list of modifications that must be made in order to incorporate the syntactic extension. If the modifications give rise to complex rules, it is recommended that the whole design be reconsidered (see section 5.2.1 with regards to selecting a suitable alternative).

Either one or both may be applicable, depending upon the syntax group in question – the relationship between the groups and the STG-machine rules is shown in table 6.1. The addition of a new primitive can be treated as if it were an extension of the \(exp\) syntax group, i.e. a new transition rule must be developed, using rule 14 as a template.

Section 6.4.3 describes the methods that may be employed to support the required additions or modifications. If further extensions are to be made, table 6.1 will have to be updated to take the new and modified rules into account.

#### 6.4.2 New primitive types

Incorporating a new type into the STG machine requires the construction of a specialised \(\text{Return}_{\text{new}}\) rule. Obviously, if the resulting rule bears little relation to the others of its class (6-8', 11-13, and 16) then the code component should be extended. For example, Hill [1994, figure 6.2, page 107] uses the \(\text{Merge}_{\text{int}}\) and \(\text{Merge}_{\chi}\) modes to control the return of literal and algebraic PODS (see section 5.2.3). With regard to the \(\text{Eval}\) rules that will initiate the returns, these will be a product of the integration of the primitive functions and production rules that support the type (see section 6.4.1).

If the type is boxed, or if the corresponding values have to be heap allocated [Peyton Jones et al., 1994, primitive arrays, section 1.4], then a new closure must also be designed (the technical details are explained in the following section).

As an example, consider the pipeline type from section 5.2.3. A sequence of addresses could be used to represent a pipeline, with each address pointing to the closure of the function to be performed for that stage. The emptyPipe primitive, therefore, simply returns an empty sequence:

\[
\text{(EMPTY PIPE)} \quad \text{Eval emptyPipe } \rho \text{ as } rs \text{ us } h \sigma \\
\implies \text{Return}_{\alpha_1 \to \alpha_2} (\emptyset ) \text{ as } rs \text{ us } h \sigma
\]

While these issues would have already been addressed by the denotational semantics, it is now necessary to consider how the pipes will be manipulated. For example, should it be possible to deconstruct a particular pipe through the use of a case or equivalent expression? Or is it enough to allow pipes to be specified via chains of let# expressions? For the purpose of this example, the latter approach will be used, and the Return mechanism can now be specified:

\[
\text{(RET PIPE)} \\
\quad \text{Return}_{\alpha_1 \to \alpha_2} ps \text{ as } r : rs \text{ us } h \sigma \\
\text{such that } r \equiv \text{Forced}_{\alpha_1 \to \alpha_2} \text{ var expbody } \rho \\
\implies \text{Eval expbody } \rho' \text{ as } rs \text{ us } h \sigma \\
\text{where } \rho' = \rho \oplus \{\text{var} \mapsto ps\}
\]

Usually the details of the low-level implementation of the sequence should be left to the compilation stage described in chapter 8. However, it is almost certain that the sequence will have to be stored in the heap. This will entail the use of new types of closure, as reflected by the amended rule for emptyPipe, and that for addstagePipe:

\[
\text{(EMPTY PIPE')} \quad \text{Eval emptyPipe } \rho \text{ as } rs \text{ us } h \sigma \\
\implies \text{Return}_{\alpha_1 \to \alpha_2} ps \text{ as } rs \text{ us } h' \sigma \\
\text{where } h' = h[ps \mapsto \text{EmptyPipe}]
\]

\[
\text{(EXTEND PIPE)} \\
\quad \text{Eval (addstagePipe } f \text{ ps) } \rho \text{ as } rs \text{ us } h \sigma \\
\text{such that } (f,a) \in \rho \\
\implies \text{Return}_{\alpha_1 \to \alpha_2} ps' \text{ as } rs \text{ us } h' \sigma \\
\text{where } h' = h[ps' \mapsto \text{Pipe a ps}]
\]
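
The two heap-based rules can be animated quite directly. The following is a minimal sketch, assuming a heap modelled as a `Data.Map` from integer addresses to closures; the names `Closure`, `Heap`, `alloc`, `emptyPipe`, `addStagePipe` and `stages` are illustrative and are not taken from the thesis's own code. The stacks and the global environment are omitted because neither rule changes them.

```haskell
import qualified Data.Map as Map

type Addr = Int

-- Only the two pipeline closures named in the rules are modelled.
data Closure
  = EmptyPipe          -- the empty pipeline
  | Pipe Addr Addr     -- Pipe a ps: stage closure a, followed by pipeline ps
  deriving (Eq, Show)

type Heap = Map.Map Addr Closure

-- Allocate a closure at a fresh address (assumption: nothing is ever
-- deleted, so the current heap size is always an unused address).
alloc :: Closure -> Heap -> (Addr, Heap)
alloc c h = (a, Map.insert a c h)
  where a = Map.size h

-- (EMPTY PIPE'): return the address of a freshly allocated EmptyPipe.
emptyPipe :: Heap -> (Addr, Heap)
emptyPipe = alloc EmptyPipe

-- (EXTEND PIPE): f is assumed to have been looked up in rho already,
-- giving the address a of the stage's closure.
addStagePipe :: Addr -> Addr -> Heap -> (Addr, Heap)
addStagePipe a ps = alloc (Pipe a ps)

-- Follow the chain of Pipe closures, collecting the stage addresses.
stages :: Heap -> Addr -> [Addr]
stages h ps = case Map.lookup ps h of
  Just (Pipe a rest) -> a : stages h rest
  _                  -> []
```

Each call threads the heap explicitly, mirroring the way the rules relate \(h\) to \(h'\).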

#### 6.4.3 Supporting the new state-transition rules

Having identified the modifications that have to be made to the rule set, the task becomes one of implementing the changes. Hence this section outlines a number of example-driven recipes for providing mechanisms that, in isolation or in combination with others, may prove useful. The recipe book is by no means complete.

Extending the state

Arguably, the most obvious approach to extending the STG machine is through the addition of a new state component. The high profile afforded the new field is balanced by the potential cost of dedicating machine resources (see chapter 8) to the new part.

The first step is the specification of the component, followed by its integration into the abstract state. Then, all of the existing rules, including the initial and final states, have to be updated. Fortunately, in most cases, this should be trivial. Finally, if the new field contains heap addresses, the component should be added to the garbage collector’s root set (specific collectors may have additional obligations).

As an example, the TT rule shown below is a specialised instance of rule 2 (closure entry), which, in addition to the usual entry operations, simply increments the new counter field.

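\[
\text{(TT1)} \quad \text{Enter } a \text{ as } rs \text{ us } h[a \mapsto \text{Ind } a'] \; count \; \sigma \\
\implies \text{Eval } (c \; vs) \; \rho \text{ as } rs \text{ us } h \; count + 1 \; \sigma \\
\text{where } \rho = \{v_1 \mapsto w_1, \ldots, v_n \mapsto w_n\} \text{ and } (v_i, w_i) = (vs \,!\, i, ws \,!\, i)
\]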

This is exactly how the AQUA Team [1993, section 9, page 36] implemented GHC’s ticky-ticky profiling.
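
The steps for adding a state component can be sketched in Haskell. This is a minimal illustration only: the machine state is cut down to a code component plus the new `count` field, and the names `State`, `Code`, `stepEnter` and `stepOther` are hypothetical rather than taken from the thesis.

```haskell
-- A cut-down machine state: the code component and the new counter.
data Code = Enter Int | Eval String
  deriving (Eq, Show)

data State = State
  { code  :: Code
  , count :: Int      -- the new state component
  } deriving (Eq, Show)

-- Every existing rule has to carry the new field along unchanged;
-- this stand-in simply leaves the whole state alone.
stepOther :: State -> State
stepOther s = s

-- The specialised entry rule performs the entry and also increments
-- the counter. The body of the closure is left abstract.
stepEnter :: State -> State
stepEnter s@(State { code = Enter a }) =
  s { code = Eval ("body of closure " ++ show a), count = count s + 1 }
stepEnter s = stepOther s
```

Record-update syntax keeps the change to the existing rules small, since fields that are not mentioned are copied automatically.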

New closures

Due to the uniform representation of closures [Peyton Jones, 1992, section 3.1.3], the extension of the closure specification will not interfere with other components of the system. The main work lies in the development of new rules to handle all of the applicable entry methods.

For example, the IND₁ rule shown in figure 6.18 defines the standard entry method used to access an indirection node, Ind a [Peyton Jones, 1987, section 12.4, pages 213–218]. The combination of the new closure and rules provides support for variable-sized closures, a prerequisite of a space-efficient system. The ToSpace α₂ closure used by the rules shown in figure 6.20 serves a similar role to an indirection, but is only used during garbage collection [Sansom, 1992, “two-space copying”, section 2.1, page 314].

This method is a special case of a more general approach, that of extending an existing component. Relevant examples include the Forced continuation used by the letstrict and let# expressions, and the addition of the new Merge mode described in section 6.4.2.

Adding a new computational phase

As with adding a new type of closure, this method is a special case of extending an existing component. A new phase is required whenever a new behaviour cannot be categorised under any of the existing phases. For example, the GUM operational model presented in section 9.3.2 introduces the following phases:

<table>
<thead>
<tr>
<th>phase</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>GetWork</td>
<td>when a processor runs out of local work, it requests additional work from its neighbouring processors.</td>
</tr>
<tr>
<td>WaitWork</td>
<td>having asked for work from a remote processor, the processor simply waits for the arrival of new work.</td>
</tr>
</tbody>
</table>

It may also be worth considering the introduction of an artificial phase to highlight a particular behaviour. For example, the update mechanism is spread across the Return and Enter phases. The following rules show how an Update phase can be used to collect together relevant rules:

\[
\begin{aligned}
(16') \quad & \text{Return } c \; ws \quad \langle\rangle \quad \langle\rangle \quad us \quad h \quad \sigma \\
\implies \quad & \text{Update}_{\pi_1 \ldots \pi_n} \; c \; ws \quad \langle\rangle \quad \langle\rangle \quad us \quad h \quad \sigma
\end{aligned}
\]

\[
\begin{aligned}
(\text{UPD}_1) \quad & \text{Update}_{\pi_1 \ldots \pi_n} \; c \; ws \quad \langle\rangle \quad \langle\rangle \quad (as_u, rs_u, a_u) : us \quad h \quad \sigma \\
\implies \quad & \text{Return } c \; ws \quad as_u \quad rs_u \quad us \quad h_u \quad \sigma \\
\text{where} \quad & vs \text{ is a sequence of arbitrary distinct variables} \\
& \text{length}(vs) = \text{length}(ws) \\
& h_u = h[a_u \mapsto (vs \; \backslash n \rightarrow c \; vs, \; ws)]
\end{aligned}
\]

\[
\begin{aligned}
(17'a) \quad & \text{Enter } a \quad as \quad rs \quad us \quad h \quad \sigma \\
\text{such that} \quad & h \; a = (vs \; \backslash n \; xs \rightarrow e, \; ws_f) \text{ and } \text{length}(as) < \text{length}(xs) \\
\implies \quad & \text{Update}_{\tau_1 \rightarrow \tau_2} \; a \; vs \; xs \; e \; ws_f \quad as \quad rs \quad us \quad h \quad \sigma
\end{aligned}
\]

\[
\begin{aligned}
(\text{UPD}_2) \quad & \text{Update}_{\tau_1 \rightarrow \tau_2} \; a \; vs \; xs \; e \; ws_f \quad as \quad \langle\rangle \quad (as_u, rs_u, a_u) : us \quad h \quad \sigma \\
\implies \quad & \text{Enter } a \quad as' \quad rs_u \quad us \quad h_u \quad \sigma \\
\text{where} \quad & xs_1 ++ xs_2 = xs \\
& \text{length}(xs_1) = \text{length}(as) \\
& as' = as ++ as_u \\
& h_u = h[a_u \mapsto ((vs ++ xs_1) \; \backslash n \; xs_2 \rightarrow e, \; (ws_f ++ as))]
\end{aligned}
\]

As an added advantage, it is now possible for other phases to make use of the update rules without having to duplicate the behaviour. Notice, however, that rules 17'a and UPD2 are closely coupled, in that a large amount of context has to be explicitly passed as an argument to the Update phase.
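
The constructor half of the Update phase (rules 16' and UPD1) can be animated as follows. This is a hedged sketch rather than the thesis's implementation: heap closures are reduced to evaluated constructors, continuations are left abstract as strings, the global environment is omitted, and all of the type and field names are assumptions made for the example.

```haskell
import qualified Data.Map as Map

type Addr = Int

-- Heap closures are reduced to evaluated constructors: (name, fields).
type Heap = Map.Map Addr (String, [Int])

-- An update frame: saved argument stack, saved return stack, and the
-- address of the closure to be updated.
type Frame = ([Int], [String], Addr)

data Code
  = Return String [Int]   -- Return c ws
  | Update String [Int]   -- Update c ws (the artificial phase)
  deriving (Eq, Show)

data State = State
  { code     :: Code
  , argStack :: [Int]
  , retStack :: [String]  -- continuations, left abstract
  , updStack :: [Frame]
  , heap     :: Heap
  } deriving (Eq, Show)

step :: State -> Maybe State
-- (16'): a return with empty argument and return stacks enters Update.
step s@(State { code = Return c ws, argStack = [], retStack = [] }) =
  Just s { code = Update c ws }
-- (UPD1): pop the update frame, overwrite the closure at a_u, restore
-- the saved stacks, and resume the return.
step s@(State { code = Update c ws, updStack = (asU, rsU, aU) : rest }) =
  Just s { code = Return c ws, argStack = asU, retStack = rsU
         , updStack = rest, heap = Map.insert aU (c, ws) (heap s) }
-- No other rule is modelled in this sketch.
step _ = Nothing
```

Because the phase is an ordinary constructor of the code component, any other rule can hand control to it simply by producing an `Update` value.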

Figure 6.19: The state-transition diagram for the Update phase

As the Update example demonstrated, the main considerations when adding a new phase are the entry and exit points. The state diagrams introduced in section 4.8.3 clearly show the possible phase interactions and are therefore highly recommended for this stage of the design. Figure 6.19 shows the state-transition diagram for the Update example.

Adding a new entry method

When access to a closure’s internal representation is required, the only clean solution is to provide a new entry method. However, at one procedure and one word of storage per binding, the associated overhead is high (a number of implementation tricks can reduce both of these costs, see chapter 8). Assuming that the addition cannot be avoided, new rules have to be developed for each type of closure that may be accessed using the new entry method. As an example, figure 6.20 shows the EntryEvac rules for three types of closures: indirections, to-space pointers, and the more usual lambda_form variant.

Extending or modifying an existing rule or component

The arguments for and against the modification of a complex system have already been presented in section 5.2.4, and, as before, caution is recommended. To illustrate the power of this approach, the following example does away with the tagless aspect of the STG machine.

By evaluating the instruction traces of case expressions for a number of modern architectures, Hammond [1992, section 4] notes that a semi-tagging approach to closure representation can achieve a 13% improvement in speed. The rules to effect this change are shown in figure 6.21. With regard to implementation, each closure contains an evaluation-status field which is used to differentiate between thunks (unevaluated expressions), specific constructors, and functions. If the closure is unevaluated, then the original rule (either 4 or 4A) will still apply.

Figure 6.20: Evacuation routines for a two-space compacting collector

\[
\begin{aligned}
& \text{Eval } (\text{case } var \; (\ldots con_i \; xs \rightarrow exp_i \ldots)) \; \rho \; as \; rs \; us \; h \; \sigma \\
\text{such that} \quad & (var, a) \in (\sigma \oplus \rho) \text{ and } h[a \mapsto (vs \; \backslash n \rightarrow con_i \; vs, \; ws)] \\
\implies \quad & \text{Eval } exp_i \; \rho' \; as \; rs \; us \; h \; \sigma \\
\text{where} \quad & \rho' = \rho \oplus \{x_1 \mapsto w_1, \ldots, x_n \mapsto w_n\} \\
& \text{dom}(\rho') = FV[exp_i] \\
& (x_j, w_j) = (xs \,!\, j, \; ws \,!\, j)
\end{aligned}
\]

Figure 6.21: Semi-tagging case and letstrict expressions

The indirection rule IND₂ presented in figure 6.18 could also be considered as a modification of the basic system. However, as it is in keeping with the underlying principles of the STG machine, it is more properly classified as a refinement.
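
The role of the evaluation-status field in the semi-tagged case rule can be illustrated with a small dispatch function. This is a sketch under stated assumptions: the `Status` and `Action` types and the `caseDispatch` function are hypothetical names, constructors are identified by an integer tag, and the fallback action stands for the original rule (4 or 4A) without modelling it.

```haskell
-- The evaluation-status field: a closure is a thunk, a function, or
-- an evaluated constructor carrying its tag and field values.
data Status
  = Thunk
  | Function
  | Constructor Int [Int]
  deriving (Eq, Show)

-- What a case expression does once it has inspected the status field.
data Action
  = SelectAlt Int [Int]   -- go straight to alternative i, binding the fields
  | EnterClosure          -- fall back to the original rule
  deriving (Eq, Show)

caseDispatch :: Status -> Action
caseDispatch (Constructor tag fields) = SelectAlt tag fields
caseDispatch _                        = EnterClosure
```

The saving comes from the first equation, which avoids pushing a continuation and entering a closure that is already in constructor form.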

6.5 Animation and testing

While the development of an operational model of a parallel STG machine can be considered an end in itself, the animation of the description provides useful insight into the system dynamics and generally improves confidence in its correctness. The animations described here are primarily built using the techniques described in section 4.8.10, and, as such, the resulting Haskell code is closely related to the operational description.

```haskell
module SystemSpecification where
import Time

data (Eq p, Show p, Show s) => SystemSpecification p s = SystemSpecification
    { initP      :: Int -> p,
      initS      :: s,
      stepP      :: (p, s) -> (p, s),
      stepS      :: s -> s,
      local_time :: p -> Time,
      comms_time :: (p, s) -> Time,
      set_time   :: p -> Time -> p,
      is_active  :: p -> Bool,
      is_waiting :: p -> Bool,
      is_stopped :: p -> Bool,
      finalP     :: p -> Bool,
      finalS     :: s -> Bool
    }
```

Figure 6.22: The SystemSpecification module

Section 6.5.1 looks at the animation of the processor framework, while section 6.5.2 provides a concrete example based upon the ping-pong model from section 6.2. Section 6.5.3 then examines how an animation can be used to test and verify a system, before sections 6.5.4 and 6.5.5 look at interactive and batch-mode animations.

### 6.5.1 The processor framework

The central data structure used during animation is SystemSpecification, which is reproduced in figure 6.22. This contains all of the support definitions required to model a system, such as the Haskell implementations of STEPp, IS_ACTIVE, etc. As the specification is polymorphic with respect to p and s, it can be used for any system, irrespective of the concrete representations used for the processor and communication states.

The Framework module provides the main tools for manipulating the system specifications, including the interactive and batch-mode simulations used during the testing phase. These tools typically represent a system’s state using the Ensemble type:

```haskell
type Ensemble p s = ([p], s)
```

All of the tools, however, build upon the instantiate function, which derives the three framework operations, INIT, STEP, and FINAL, for a particular system specification. There is a strong correspondence between the implementation and the semi-formal description shown in figure 6.8, as shown by the following fragment:

```haskell
init numProcs = ([initP n | n <- [1..numProcs]], initS)

final (ps, s) = and ([finalP p | p <- ps] ++ [finalS s])

next_time (p, s) | is_active p  = local_time p
                 | is_waiting p = comms_time (p, s)
                 | is_stopped p = infinity
```

The initP, is_stopped, etc. functions are extracted from the system specification. The reduce function defined below shows how the instantiated operations can be used to obtain all of the states generated during a reduction:

```haskell
reduce numProcs specification
  = let (init, step, final) = instantiate specification
        history             = iterate step (init numProcs)
    in takeWhile (not . final) history
```

Note that this style of coding relies on Haskell’s non-strict semantics as history could well be an infinite list. While it may appear inefficient to generate all reduction states, when combined with a suitable consumer process, the resulting code can be linear in terms of space and time, e.g.:

```haskell
putStr $ concat [show e | e <- reduce 2 specification]
```

However, as reported in section 4.8.10, excessive laziness in the system-specification routines can lead to unexpected space leaks, thereby severely damaging performance. To avoid this problem, most of the tools periodically force the evaluation of the entire ensemble.
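
One way in which such periodic forcing might be arranged is sketched below. The thesis does not say how its tools force the ensemble, so the technique shown here (demanding the length of the state's printed form) is an assumption, as are the names `forceVia` and `runForced`; the interval `k` is assumed to be positive.

```haskell
import Data.List (foldl')

-- Force a value by demanding the length of its printed form, which
-- evaluates every part of the value that its Show instance inspects.
forceVia :: Show a => a -> a
forceVia x = length (show x) `seq` x

-- Apply the step function n times, forcing the state every k steps.
runForced :: Show a => Int -> (a -> a) -> Int -> a -> a
runForced k step n s0 = foldl' go s0 [1 .. n]
  where
    go s i
      | i `mod` k == 0 = forceVia (step s)
      | otherwise      = step s
```

For example, `runForced 10 step 1000 ensemble` performs 1000 transitions while preventing a chain of more than ten suspended steps from building up inside the ensemble.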

### 6.5.2 An example animation: the ping-pong system

Having briefly described the animation of the processor framework, this section provides a concrete example of a `SystemSpecification` for the ping-pong system presented in section 6.2.

The first step is to decide on representations for the processors and communication system. The communication model is a good starting point as it is very simple and can be immediately converted into Haskell code:

```haskell
module Communications where
import Time

data Communications = NOTHING
                    | NewPing    Time
                    | HavePinged Time
                    | NewPong    Time
                    | HavePonged Time deriving Show
```

Note that it is suggested that each component be defined in a separate module – this simplifies testing and also improves the chances of re-use between different models.

The processor model is more complex, and needs to record the processor’s number, the current time, and a representation of its state. This leads to the following definition:

```haskell
module Processor where
import Time

data Processor = Processor {
    pid   :: Int,
    time  :: Time,
    state :: ProcessorState
} deriving Show
```

The processor’s state is no more complex than that for the communication model:

```haskell
data ProcessorState = Ping | WaitForPing
                    | Pong | WaitForPong deriving (Show, Eq)
```
In addition to the type definition, a number of support routines are also required. For example, a SystemSpecification requires that the processor model provides an equality operation, \(==\). For the ping-pong system, two processors can be considered equivalent if they have the same identifier:

```haskell
instance Eq Processor where
  p == p' = pid p == pid p'
```

Other support definitions include get and set methods for the processor's time, and the is_active predicate:

```haskell
is_active Processor{state = st} = st /= WaitForPing && st /= WaitForPong
```

Having developed the communication and processor models, it is now possible to create the SystemSpecification shown in figure 6.23. While some of the definitions may look complicated, each operation has been almost directly converted from its operational specification.
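
Figure 6.23 refers to helper functions such as processorGetTime and processorIsWaiting whose definitions are not listed. The following is a minimal sketch of how they might look. The function names are those used in the figure, but the bodies are assumptions, as is the representation of `Time` as an `Int` (the Time module itself is not shown).

```haskell
type Time = Int   -- assumption: stands in for the Time module

data ProcessorState = Ping | WaitForPing | Pong | WaitForPong
  deriving (Show, Eq)

data Processor = Processor
  { pid   :: Int
  , time  :: Time
  , state :: ProcessorState
  } deriving Show

-- Get and set methods for the processor's local time.
processorGetTime :: Processor -> Time
processorGetTime = time

processorSetTime :: Processor -> Time -> Processor
processorSetTime p t = p { time = t }

-- A processor is waiting when it is blocked on an incoming message.
processorIsWaiting :: Processor -> Bool
processorIsWaiting p = state p `elem` [WaitForPing, WaitForPong]

processorIsActive :: Processor -> Bool
processorIsActive = not . processorIsWaiting

-- Assumption: ping-pong processors never halt, so none is ever stopped.
processorIsStopped :: Processor -> Bool
processorIsStopped _ = False
```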

### 6.5.3 Verification and testing

There are three phases associated with the verification and testing of an operational model:

1. **Animation of the model.** The process of converting the operational description into a Haskell program may well reveal problems or faults with the model. The primary aid to the animator is likely to be Haskell's type system. This will ensure that each component is treated in a consistent manner, and that the coupling between different phases is plausible. For example, the type of messages sent and received must match, something that is not necessarily checked in a real implementation. Furthermore, the increased level of detail required by the computer program may well uncover omissions in the system.

2. **Micro-level testing.** Once the generated code compiles correctly, testing can begin in earnest. The main aim of this phase is to check that the animation is faithful to the operational description. This entails testing individual rule transitions, and then moving on to examine sequences of reductions. Fortunately, as the parallel system is built on top of the sequential STG machine, only the new or modified rules need to be considered.

3. **Macro-level testing.** Having established that the various pieces of the animation are correct to a first approximation, the system as a whole must be verified. While the final result of the animation can be validated against the denotational semantics, the gross behaviour of the system is of equal importance. Typical areas of interest may include the total run time, the communication/computation ratio, and patterns of communication. However, it is impossible to anticipate the exact analysis needs for all scenarios.

The last two phases of testing are supported by the animation running in two different modes: interactive, and batch-mode. These are examined in the following sections.
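
Micro-level testing of individual rule transitions can be sketched as a table of before and after states. The example below exercises the stepS rules of the ping-pong system. It is illustrative only: `Time` is assumed to be an `Int`, an `Eq` instance and a catch-all stepS equation have been added for the purposes of the test, and `TestCase` and `runTests` are hypothetical names.

```haskell
type Time = Int   -- assumption: stands in for the Time module

data Communications
  = NOTHING
  | NewPing Time
  | HavePinged Time
  | NewPong Time
  | HavePonged Time
  deriving (Show, Eq)

-- The communication rules, with a catch-all equation added here so
-- that the function is total.
stepS :: Communications -> Communications
stepS (NewPing t) = HavePinged (t + 100)
stepS (NewPong t) = HavePonged (t + 100)
stepS s           = s

-- A test case: a name, a starting state, and the state that the
-- operational rule predicts after one transition.
type TestCase a = (String, a, a)

-- Return the names of the test cases whose transition does not
-- produce the predicted state.
runTests :: Eq a => (a -> a) -> [TestCase a] -> [String]
runTests step cases =
  [name | (name, before, expected) <- cases, step before /= expected]
```

An empty result from `runTests` indicates that every listed transition agrees with the operational description.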

```haskell
module SimplePingPong (systemSimplePingPong) where
import SystemSpecification
import Processor
import Communications
import Time

systemSimplePingPong :: SystemSpecification Processor Communications
systemSimplePingPong = SystemSpecification {
  initP = let initP 1 = Processor {pid = 0, time = 0, state = Ping}
              initP 2 = Processor {pid = 1, time = 0, state = WaitForPing}
          in initP,
  initS = NOTHING,
  stepP = let stepP
           = (Processor id (time + 10) WaitForPong, NewPing time)
           = (Processor id (time + 10) WaitForPong, NewPong time)
           = (Processor id (time + 10) Ping, NOTHING)
           = (Processor id (time + 10) Ping, NOTHING)
           = (Processor id (time + 10) Pong, NOTHING)
           = (Processor id (time + 10) Pong, NOTHING)
           = (Processor id (time + 10) Pong, NOTHING)
           = error "pongpongStepP: no rules matched"
          in stepP,
  stepS = let stepS (NewPing t) = HavePinged (t + 100)
              stepS (NewPong t) = HavePonged (t + 100)
          in stepS,
  local_time = processorGetTime,
  comms_time = let ctime (Processor id time WaitForPing, HavePinged t_recv)
                     = max time t_recv
                   ctime (Processor id time WaitForPong, HavePonged t_recv)
                     = max time t_recv
                   ctime _ = infinity
               in ctime,
  set_time = processorSetTime,
  is_active = processorIsActive,
  is_waiting = processorIsWaiting,
  is_stopped = processorIsStopped,
  finalP = \p -> False,
  finalS = \s -> False
}
```

Figure 6.23: The SystemSpecification for the ping-pong example

<table>
<thead>
<tr>
<th>command</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>load <em>prog</em> <em>n</em></td>
<td>loads the STG' language program, <em>prog</em>, and initialises the system with <em>n</em> processors.</td>
</tr>
<tr>
<td>step <em>n</em> <em>d</em></td>
<td>perform <em>n</em> state transitions, displaying a summary every <em>d</em> steps. Entering an empty line is equivalent to step 1 1.</td>
</tr>
<tr>
<td>unstep <em>n</em></td>
<td>roll-back <em>n</em> state transitions (the system only records the last three states). This command allows a complex or erroneous transition to be re-examined. Furthermore, when used in combination with <em>set</em>, it may be possible to repair the system state and continue with the reduction.</td>
</tr>
<tr>
<td>goto <em>t</em></td>
<td>continue reductions until time <em>t</em> is reached. This is primarily used during debugging to jump straight to a known trouble spot.</td>
</tr>
<tr>
<td>show <em>c</em></td>
<td>display the named component, <em>c</em>. Specialised instances of this command can take additional arguments, enabling them, for example, to display specific heap locations.</td>
</tr>
<tr>
<td>set <em>c</em> <em>v</em></td>
<td>set the value of the named component, <em>c</em>, to <em>v</em>. This is primarily used to create a scenario for exercising a particular reduction sequence.</td>
</tr>
<tr>
<td>focus <em>n</em></td>
<td>modifies the behaviour of the step command, so that only transitions involving processor <em>Pn</em> are counted. When first started, the interactive animation will have no focus.</td>
</tr>
<tr>
<td>ofocus</td>
<td>undoes the effect of any previous focus commands.</td>
</tr>
</tbody>
</table>

Figure 6.24: The command set supported by the interactive animation framework
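
The command set lends itself to a simple algebraic data type and a line parser. The following is a minimal sketch and not the thesis's implementation: the `Command` type and `parseCommand` function are hypothetical, times are assumed to be integers, and component names and values are kept as uninterpreted strings.

```haskell
import Text.Read (readMaybe)

data Command
  = Load String Int        -- load prog n
  | Step Int Int           -- step n d
  | Unstep Int             -- unstep n
  | Goto Int               -- goto t
  | ShowC String [String]  -- show c, with optional extra arguments
  | SetC String String     -- set c v
  | Focus Int              -- focus n
  | OFocus                 -- ofocus
  deriving (Eq, Show)

-- Parse one line of input; Nothing signals an unrecognised command.
parseCommand :: String -> Maybe Command
parseCommand line = case words line of
  []                      -> Just (Step 1 1)   -- empty line = step 1 1
  ["load", prog, n]       -> Load prog <$> readMaybe n
  ["step", n, d]          -> Step <$> readMaybe n <*> readMaybe d
  ["unstep", n]           -> Unstep <$> readMaybe n
  ["goto", t]             -> Goto <$> readMaybe t
  ("show" : c : args)     -> Just (ShowC c args)
  ("set" : c : v@(_ : _)) -> Just (SetC c (unwords v))
  ["focus", n]            -> Focus <$> readMaybe n
  ["ofocus"]              -> Just OFocus
  _                       -> Nothing
```

Keeping the value given to set as a string leaves its interpretation to the component being modified, which matches the polymorphic nature of the framework.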

### 6.5.4 Interactive animation

The interactive mode of the animation provides facilities for the animator to examine and adjust the system state, and to apply or undo reduction steps. As an example, consider the use of the interactive environment with the ping-pong example. The animation starts by creating and displaying the INIT state, and then prompts the user to enter a command:

```
Initial state:
numProcs=2
Processorpid=0, time=0, state=Ping
Processorpid=1, time=0, state=WaitForPing
NOTHING
interactive>
```

The basic set of commands supported by the interactive animation is shown in figure 6.24. Continuing with the example, the user would force a single reduction step as follows:

```
interactive> step 1 1
Processorpid=0, time=10, state=WaitForPong
HavePinged 100
interactive> show P1
Processorpid=1, time=0, state=WaitForPing
```

Single stepping is useful when closely examining short reduction sequences, or when learning about the system. However, often it is useful to skip ahead a number of reductions, as shown below:

```
interactive> step 3 1
Processorpid=1, time=110, state=Pong
NOTHING
Processorpid=1, time=120, state=WaitForPing
HavePonged 210
Processorpid=0, time=220, state=Ping
NOTHING
```

Notice that only the states that change as a result of a reduction rule are displayed. The following sequence shows how the state can be manipulated to create a new scenario:

```
interactive> unstep 1
State:
Processorpid=0, time=10, state=WaitForPong
Processorpid=1, time=120, state=WaitForPing
HavePonged 210
interactive> set s (HavePonged 10000)
HavePonged 10000
```

Now the system cannot proceed until the Pong message has been received:

```
interactive> step 1 1
Processorpid=0, time=10010, state=Ping
NOTHING
```
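
The step, unstep and set commands used in the session above can be modelled with a bounded history of earlier states. This is a sketch under stated assumptions: the `Session` type and its functions are hypothetical names, the limit of three recorded states is taken from the description of unstep, and set is assumed not to record a history entry.

```haskell
-- The current state plus the most recent earlier states, newest first.
data Session a = Session
  { current :: a
  , history :: [a]
  } deriving (Eq, Show)

-- The system only records the last three states.
historyLimit :: Int
historyLimit = 3

-- Apply one transition, remembering the state it replaced.
stepSession :: (a -> a) -> Session a -> Session a
stepSession f (Session cur hist) =
  Session (f cur) (take historyLimit (cur : hist))

-- Roll back up to n transitions, stopping early if the history runs out.
unstepSession :: Int -> Session a -> Session a
unstepSession n s@(Session _ hist)
  | n <= 0    = s
  | otherwise = case hist of
      []       -> s
      (p : ps) -> unstepSession (n - 1) (Session p ps)

-- Overwrite the current state, leaving the history untouched.
setSession :: a -> Session a -> Session a
setSession v s = s { current = v }
```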

### 6.5.5 Batch-mode animation

Unlike the interactive animation, the batch mode performs reductions until the final state is reached, incrementally generating a log file. The log file contains an entry for the initial state, the final state (assuming the reduction ever terminates), and every intermediate state change. The exact format of the state entries is dependent upon the Show instance defined or derived for \( P \) and \( S \). For example, the first few entries of the ping-pong system’s log file are:

```
Processorpid=0, time=0, state=Ping
Processorpid=1, time=0, state=WaitForPing
NOTHING
Processorpid=0, time=10, state=WaitForPong
HavePinged 100
Processorpid=1, time=110, state=Pong
NOTHING
Processorpid=1, time=120, state=WaitForPing
HavePonged 210
Processorpid=0, time=220, state=Ping
NOTHING
Processorpid=0, time=230, state=WaitForPong
HavePinged 320
```

Typical log files can contain millions of entries, and are therefore of little use in themselves. However, when combined with a general-purpose data-analysis tool, the log files can potentially be used to extract any behavioural information. To demonstrate this approach, the remainder of this section will describe how the state-transition graph shown in figure 6.25 was derived from the ping-pong system's log file. As well as serving as a useful example, the inferred graph provides an excellent summary of the test coverage of the reduction rules with respect to a particular scenario.

Figure 6.25: The inferred state-transition diagram for the ping-pong system

Plumber

Plumber [Haines, Longshaw and Morison, 1997] is a visual programming environment for exploratory data analysis. As can be seen from figure 6.26, the tool comprises two different parts, the canvas and the display table. Computations are constructed by drawing diagrams on the canvas. The diagrams are made up of connected processing elements, where the connecting wires represent the flow of data between them. The display table interactively displays the results either of the whole diagram or of selected processing elements, providing valuable feedback and guidance to the developer. While there are a number of similar tools available both publicly and commercially, Plumber offers a number of advantages:

1. Plumber's open architecture allows it to inter-operate with existing tools, such as databases, spreadsheets, and command-line applications;

2. Plumber has a rich set of built-in components which can easily be extended and customised to meet the needs of a new application domain;

3. Plumber provides support for structured types, including lists, dictionaries, records, and sets;

4. Plumber is written in Java [Sun Microsystems, 1998], and can therefore run unmodified on all of the popular machine platforms.

GML: a portable graph file format

The Graph Modelling Language, GML [Himsolt, 1996a], is a textual language for describing and annotating graphs. The main body of a GML description contains the node and edge definitions – the example given below defines a simple two-node, one edge graph:

Figure 6.26: The Plumber diagram for generating state-transition diagrams from logfiles

Further tags can be added to both node and edge definitions, including layout information, and node and line styles. A number of graphing tools can import GML files, including the Graphlet editor [Himsolt, 1996b], which provides an automatic layout feature. The graph shown in figure 6.25 was generated by using Graphlet’s random layout scheme and then fine tuning the positioning using the manual controls.
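
As a hedged illustration of the format, the following Haskell function renders a small directed graph as GML text using only the basic `graph`, `node` and `edge` keys with `id`, `label`, `source` and `target` attributes. The function name `toGML` and the node labels in the usage note are assumptions made for the example, and labels are assumed to be plain ASCII so that Haskell's string quoting is also acceptable as GML.

```haskell
-- Render a directed graph as GML text. Nodes are (id, label) pairs
-- and edges are (source id, target id, label) triples.
toGML :: [(Int, String)] -> [(Int, Int, String)] -> String
toGML nodes edges = unlines $
  ["graph [", "  directed 1"]
  ++ [ "  node [ id " ++ show i ++ " label " ++ show l ++ " ]"
     | (i, l) <- nodes ]
  ++ [ "  edge [ source " ++ show s ++ " target " ++ show t
       ++ " label " ++ show l ++ " ]"
     | (s, t, l) <- edges ]
  ++ ["]"]
```

A two-node, one-edge graph could then be produced with `toGML [(0, "Ping"), (1, "WaitForPong")] [(0, 1, "1")]`.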

Inferring the state-transition diagram

The mechanics of converting the ping-pong log file into a GML description are described below:

1. The log file is split into three streams, each containing the traces for one of the components, \( P_0, P_1, \) and \( S \).

2. The current state of the stream’s component is then determined. For the processor traces, the state is the label following the `state=` string. The communication state is the first text field. Where appropriate, references to specific times are removed.

3. For each stream, the current state and new state are paired to create a state-transition key. The head of each state stream represents the component’s initial state.

4. Each key is then recorded in a counting dictionary, effectively generating a histogram of state transitions.

5. The final dictionary is then converted into a GML description using a Plumber graphing library.

6. The graph definition is written to a file and the Graphlet application invoked on that file. Some manual adjustments may be required to achieve a satisfactory layout of the nodes.

Figure 6.26 shows the Plumber diagram for generating the GML graph for \( P_0 \). Note that minor adjustments may be required when processing other types of log files.
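
Steps 3 to 5 above can be sketched outside Plumber in a few lines of Python. This is only an illustrative stand-in for the Plumber diagram: the function names, the example state stream, and the choice of which GML attributes to emit are assumptions, and the log-file parsing of steps 1 and 2 is omitted because it depends on the trace format.

```python
from collections import Counter


def transition_histogram(states):
    """Pair each state with its successor and count the resulting keys."""
    return Counter(zip(states, states[1:]))


def to_gml(histogram):
    """Render a transition histogram as a GML graph description."""
    # Number the nodes in order of first appearance so the output is stable.
    ids = {}
    for source, target in histogram:
        for state in (source, target):
            ids.setdefault(state, len(ids) + 1)
    lines = ["graph [", "  directed 1"]
    for state, node_id in ids.items():
        lines.append(f'  node [ id {node_id} label "{state}" ]')
    for (source, target), count in histogram.items():
        lines.append(
            f'  edge [ source {ids[source]} target {ids[target]} label "{count}" ]'
        )
    lines.append("]")
    return "\n".join(lines)


# Hypothetical state stream for one processor (not taken from a real log).
p0_states = ["Idle", "Send", "Wait", "Send", "Wait", "Idle"]
histogram = transition_histogram(p0_states)
gml_text = to_gml(histogram)
```

The edge labels carry the transition counts, so the histogram of step 4 survives in the generated graph.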

6.6 Summary

This chapter has concentrated on the extension of the sequential STG machine into the realm of parallel processing. The first step was to develop a flexible operational system capable of expressing parallel interactions, particularly those common in GMSV and DMMP systems. This then provided the framework within which to carry out a systematic investigation of the impact of parallelism on the STG’s evaluation mechanism, communication and synchronisation, resource management, and partitioning and naming. The STG’s language manipulations described in the previous chapter were then considered, and a recipe book developed for integrating them into the STG machine. Finally, a number of techniques for animating and testing the operational models were outlined.
Chapter 7

Simulating the target architecture

7.1 Introduction

This chapter describes the simulator used to test and debug the output of the STG' compiler (see chapter 8). A RISC-like instruction set, based on the DEC Alpha processor family, serves as the interface between the two systems. The simulator is interpretive and is specified using the state-transition notation presented in chapter 6. While overall performance is relatively poor, the extensible nature of the state-transition model is more important for this particular application.

After an overview of the merits of simulation in section 7.2, the basic uniprocessor model is presented in section 7.3. Using this as a building block, section 7.4 discusses the simulation of multiprocessor architectures. The chapter is then summarised in section 7.5.

7.2 Why simulation?

Traditionally, simulation is used when either analytical modelling or physical measurement is inappropriate. For the purposes of testing and debugging the STG' compiler, the former can obviously be ruled out, and Bedichek [1995, section 2.1, pages 14–15] attributes the following advantages to simulation over direct measurements: a simulator can easily be augmented with new measurements and debugging features; it can model “ideal” or unavailable components; it is non-intrusive; and simulation runs are often deterministic and therefore repeatable. Taking physical measurements, on the other hand, is typically faster and yields more accurate results. As the correctness of the compiler is the primary concern, these two issues become less important, and simulation is the preferred approach.

Multiprocessor simulation is an active area of research, and there are a large number of well-established tools, including PROTEUS [Brewer, Dellarocas and Weihl, 1991], FAST [Boothe, 1994], Shade [Cmelik and Keppel, 1994], and Talisman [Bedichek, 1995]. However, rather than using an existing package, or even a simulation language [Dahl and Nygaard, 1966], it was decided to build a new system using the state-transition approach outlined in chapter 6. At the cost of reduced accuracy and performance, the resulting system offers the following advantages:

1. The compiler and the simulator use the same internal representation of the target language, thereby simplifying integration and testing. This also allows the representation to be customised (most modern simulators take as their input an executable or a program written either in assembly code or a high-level language, such as C).

2. Modelling a new target architecture is simply a case of modifying a state-transition system, a topic already covered in section 6.4.1. PROTEUS, for example, allows customisation of its four main modules [Brewer et al., 1991, section 3, pages 4–8] but the interface is static and the code serves as both implementation and specification.

3. By adopting the same general approach to animation as used in chapters 6 and 8, the consistency and coherence of the framework is maintained.

7.3 Modelling a RISC uniprocessor

7.3.1 RISC architectures and the DEC Alpha AXP

The recent trend in computer architecture has been towards RISC (Reduced Instruction Set Computer) systems. The salient features of this class of processor include [Kane and Heinrich, 1992, chapter 1, pages 1–22]: one instruction completed per cycle; simple addressing modes and instruction formats; sufficient on-chip memory (registers and cache) to overcome the processor/memory bottleneck; and a reliance on optimising compilers to obtain the best possible performance.

A large number of commercial RISC systems have been developed, including the MIPS [Kane and Heinrich, 1992] and PowerPC [May, Silha, Simpson and Warren, 1994] architectures. Furthermore, modern parallel computers typically use these uniprocessors as basic computational building blocks. For example, Cray’s T3D [Koeninger, Furtney and Walker, 1992] uses up to 1,024 DEC Alpha AXP microprocessors, while the CM-5 [Hillis and Tucker, 1993] uses a similar number of SPARC processors [Sun Microsystems, 1988].

For the purposes of this study, the Alpha AXP architecture [Sites, 1992] was selected as the basic model for the uniprocessor simulation. The Alpha is well suited to this role because:

- it is, arguably, the fastest commercial processor currently available.
- by avoiding all non-replicated hidden state, including condition codes [RISC Machines Ltd (ARM), 1994, section 4.2, page 20], suppressed-instruction bits [Hewlett-Packard, 1994, section 4, page 4–7], and precise arithmetic exceptions [Kane and Heinrich, 1992, section 9, page 9–2], future designs can take advantage of multiple instruction issue (this also simplifies the design of the state-transition model).
- all operating-system support is handled by privileged software subroutines, called PALcode (see sections 7.3.2 and 7.4).
- shared-memory multiprocessing support is an integral part of the architecture. The \(load_{linked}\) and \(store_{linked}\) instructions (see section F.3) provide a safe mechanism for updating shared addresses.

7.3.2 The state-transition system

The resulting model is straightforward, if not concise, and uses the abstract state given below:

\[(\text{code}, \text{ program counter}, \text{ registers}, \text{ memory}, \text{ semaphore}, \text{ exceptions})\]

¹Clements [1991] argues that the 'R' of RISC should stand for 'Regular' to better reflect the underlying philosophy.
\[
\begin{array}{lll}
2 & & \text{Decode pc registers memory semaphore exceptions} \\
  & \implies & \text{Execute instruction pc}' \text{ registers memory semaphore exceptions} \\
  & \text{where} & \text{instruction} = \text{decode memory(pc)} \\
  & & \text{pc}' = \text{pc} +_{32} 1
\end{array}
\]

\[
\begin{array}{lll}
6 & & \text{Execute load offset(base) target pc registers memory semaphore exceptions} \\
  & \implies & \text{PostExec pc registers}' \text{ memory semaphore exceptions} \\
  & \text{where} & \text{registers}' = \text{registers}[\text{target} \mapsto \text{value}] \\
  & & \text{value} = \text{memory}(\text{address}) \\
  & & \text{address} = \text{offset} +_{32} \text{registers}(\text{base})
\end{array}
\]

\[
\begin{array}{lll}
3 & & \text{PostExec pc registers memory semaphore (pending, mask, counter, trigger)} \\
  & \implies & \text{Decode pc registers memory semaphore (pending}', \text{ mask, counter}', \text{ trigger)} \\
  & \text{where} & \text{counter}' = \text{counter} +_{32} 1 \\
  & & \text{pending}' = \text{pending} \cup \text{clock\_interrupt} \\
  & & \text{clock\_interrupt} = \text{if } (\text{trigger} = \text{counter}) \text{ then } \{\text{Clock}\} \text{ else } \emptyset
\end{array}
\]

Figure 7.1: A selection of RISC state-transition rules

The state components are defined in table 7.1, and a number of example instructions and transition rules are shown in table 7.2 and figure 7.1 respectively (appendices F and G contain the full details of the 49 instructions and 30 transition rules). Note that the exceptions field contains a counter, which is automatically incremented by the PostExec mode (rule 3), and this serves as the main performance metric (see section 8.2).

The instruction pipeline

The code component loosely models a processor’s instruction pipeline [Hwang and Briggs, 1985, chapter 3, page 153], and the transitions will typically proceed as follows: Decode => Execute => PostExec => Decode => · · · (rules 2, 5–30, and 3) – see figure 7.1. If an unmasked exception is raised then the sequence will become Decode => Exception (rule 1). The appropriate PALcode will be invoked to handle the interrupt, and this is responsible for clearing the exception and returning to the Decode mode of operation (rule 4).
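
The Decode, Execute, PostExec cycle can be animated with a small interpreter. The Python sketch below is not the thesis implementation: it assumes a program counter that advances by one per instruction, models only the load instruction, drops the semaphore component for brevity, and stops at the Exception mode instead of invoking PALcode. All names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b):
    """32-bit wrap-around addition, standing in for the +32 operator."""
    return (a + b) & MASK32


def step(state):
    """Apply one transition rule to (code, pc, registers, memory, exceptions)."""
    code, pc, registers, memory, exceptions = state
    pending, mask, counter, trigger = exceptions
    mode = code[0]
    if mode == "Decode":
        if pending - mask:  # rule 1: an unmasked exception is pending
            return (("Exception",), pc, registers, memory, exceptions)
        # rule 2: fetch the instruction and advance the program counter
        return (("Execute", memory[pc]), add32(pc, 1), registers, memory, exceptions)
    if mode == "Execute":
        op, offset, base, target = code[1]
        if op != "load":
            raise ValueError(f"unmodelled instruction: {op}")
        # rule 6: load the value found at offset + registers(base)
        address = add32(offset, registers[base])
        updated = {**registers, target: memory[address]}
        return (("PostExec",), pc, updated, memory, exceptions)
    if mode == "PostExec":
        # rule 3: bump the counter and raise Clock when the trigger matches
        clock = {"Clock"} if trigger == counter else set()
        raised = (pending | clock, mask, add32(counter, 1), trigger)
        return (("Decode",), pc, registers, memory, raised)
    raise ValueError(f"no rule for mode {mode}")


# Hypothetical program: one load of the word at address 2 into register 1.
memory = {0: ("load", 2, 0, 1), 2: 42}
initial = (("Decode",), 0, {0: 0, 1: 0}, memory, (frozenset(), frozenset(), 0, 0))

trace = [initial]
for _ in range(4):
    trace.append(step(trace[-1]))
modes = [state[0][0] for state in trace]
```

Because the trigger equals the initial counter value, the first PostExec step raises Clock and the following Decode step diverts to the Exception mode.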

Accessing operating-system services

The syscall instruction provides the interface between a program and the operating system [DEC, 1992, chapter 9]. Rather than modelling these calls down to the instruction level, a separate transition rule specifies the entire operation, as illustrated by the following example:

\[
\begin{array}{lll}
\text{SET\_TRIGGER} & & \text{Exception pc registers memory semaphore (pending, mask, counter, trigger)} \\
  & \text{such that} & \text{SysCall} \in (\text{pending} \setminus \text{mask}) \text{ and } \text{memory}(\text{pc} +_{32} 1) = \text{syscall SET\_TRIGGER} \\
  & \implies & \text{Decode pc registers memory semaphore (pending}', \text{ mask, counter}', \text{ trigger}') \\
  & \text{where} & \text{counter}' = \text{counter} +_{32} l_{os\_call} +_{32} l_{set\_trigger} \\
  & & \text{trigger}' = \text{registers}(1) \\
  & & \text{pending}' = \text{pending} \setminus \{\text{SysCall}\}
\end{array}
\]
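
The effect of treating a system call as a single transition can be sketched as a function on the exceptions field. The Python below is a hedged illustration only: the function name and the two cost parameters are hypothetical, and the check on the syscall instruction held in memory is omitted.

```python
MASK32 = 0xFFFFFFFF


def set_trigger(registers, exceptions, os_call_cost, set_trigger_cost):
    """Perform the whole SET_TRIGGER call as one step on the exceptions field."""
    pending, mask, counter, trigger = exceptions
    if "SysCall" not in pending - mask:
        raise ValueError("no unmasked SysCall exception is pending")
    # Charge the counter for the call instead of executing handler instructions.
    charged = (counter + os_call_cost + set_trigger_cost) & MASK32
    return (pending - {"SysCall"}, mask, charged, registers[1])


# Hypothetical costs: 20 instructions to enter the OS, 5 to set the trigger.
before = (frozenset({"SysCall"}), frozenset(), 100, 0)
after = set_trigger({1: 500}, before, os_call_cost=20, set_trigger_cost=5)
```

Adjusting the two cost parameters changes the instruction count charged for the call without touching the rest of the model.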

To test this approach, a Unix-style process model [Goodheart and Cox, 1994, chapter 4] has been developed. By adjusting the instruction-count overhead associated with context switches and task creation, the system can also (crudely) simulate user-level threads [Birrell, 1989] and hardware contexts [Weber and Gupta, 1989; Agarwal et al., 1993]. Note that no support for any form of I/O [Goodheart and Cox, 1994, chapter 5] is provided, although adding the necessary interfaces should be straightforward.

<table>
<thead>
<tr>
<th>specification</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>code</td>
<td><strong>Decode</strong> fetch the instruction referenced by the program counter (which is then incremented), decode it, and then pass it on to be executed</td>
</tr>
<tr>
<td></td>
<td><strong>Execute instruction</strong> evaluate the current instruction</td>
</tr>
<tr>
<td></td>
<td><strong>PostExec</strong> increment the timer counter and generate a timer exception if necessary</td>
</tr>
<tr>
<td></td>
<td><strong>Exception</strong> invoke the exception handler</td>
</tr>
<tr>
<td>program counter</td>
<td><strong>address</strong> the address of the next instruction to be executed</td>
</tr>
<tr>
<td>registers</td>
<td><strong>registers(i_{reg}) = value</strong> records the contents of the register file, where \( 0 \leq i \leq 32 \), and \( registers(0) = 0 \)</td>
</tr>
<tr>
<td>memory</td>
<td><strong>memory(address) = value</strong> a model of the processor's main memory</td>
</tr>
<tr>
<td>semaphore</td>
<td><strong>(address, boolean_{stale?})</strong> the address field records the memory location read by the last linked load, and stale? indicates if this location has been updated since the load (see the load_{link} and store_{link} instructions)</td>
</tr>
<tr>
<td>exceptions</td>
<td><strong>(set of exceptions, set of exceptions, i_{counter}, i_{trigger})</strong> the first two fields record which exceptions have been raised and those that should be ignored for the present. The counter is incremented after every instruction, and when it matches the trigger a Clock exception is raised</td>
</tr>
<tr>
<td>instruction</td>
<td><strong>see appendix F</strong> any one of 49 possible instructions, including memory references, branches, operate instructions, and system instructions</td>
</tr>
<tr>
<td>value</td>
<td><strong>address</strong></td>
</tr>
<tr>
<td>exception</td>
<td><strong>Clock</strong> raised when the cycle counter equals the trigger value</td>
</tr>
<tr>
<td></td>
<td><strong>Overflow</strong> raised when an add_{trap} or sub_{trap} instruction causes under- or overflow, but only acted upon when the appropriate barrier_{trap} instruction is executed</td>
</tr>
<tr>
<td></td>
<td><strong>SysCall</strong> raised by a syscall instruction</td>
</tr>
<tr>
<td></td>
<td><strong>Unaligned</strong> raised whenever a memory reference is not word aligned (i.e. the two least-significant bits are non-zero)</td>
</tr>
</tbody>
</table>

Table 7.1: State components of the RISC uniprocessor

<table>
<thead>
<tr>
<th>mnemonic</th>
<th>instruction</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD</td>
<td>load \( \text{offset}_{16}(\text{base}_{reg}) \) \( \text{target}_{reg} \)</td>
<td>load a word from the address (word-aligned) formed by adding the 16-bit signed offset and the contents of the base register. The value is then stored in the target register.</td>
</tr>
<tr>
<td>JSR</td>
<td>jump\(_{link}\) \( \text{offset}_{16,2}(\text{base}_{reg}) \) \( \text{link}_{reg} \)</td>
<td>the 16-bit signed offset is first shifted two places to the left, then added to the base register to form the target address (which must be word aligned). Before the PC is set to this new address, the link register is loaded with the PC's current value, allowing a subroutine to return control back to the caller.</td>
</tr>
<tr>
<td>ADD</td>
<td>add \( \text{value}_{reg} \) \( \text{reg}_{-\text{imm}} \) \( \text{target}_{reg} \)</td>
<td>signed addition of the first two arguments, the result of which is stored in the target register.</td>
</tr>
<tr>
<td>CALL_PAL</td>
<td>syscall \( \text{immediate}_{26} \)</td>
<td>cause a system-call exception.</td>
</tr>
</tbody>
</table>

Table 7.2: A selection of RISC instructions

7.4 Modelling multiprocessor systems

7.4.1 Basic building blocks

The processor model presented in figure 6.1 is still valid, and provides the underlying structure for the multiprocessor simulator – the RISC uniprocessor model fills the role of \( P \), while the communication system, \( S \), is styled after the intended target architecture.

There are two ways of providing a program with access to the communication system: firstly, the syscall interface can be used for large-grained operations, such as message passing (see the SET_TRIGGER example from section 7.3.2); secondly, fine-grained activities, including accessing shared memory, should be directly incorporated into the state-transition rules – for example, the following rule forms part of a crude model of a local cache:

\[
\begin{array}{lll}
6' & & \text{Execute load offset(base) target pc regs cache memory smp exs} \\
   & \text{such that} & (\text{address}, \text{value}) \in \text{cache} \\
   & \implies & \text{PostExec pc regs}' \text{ cache memory smp exs} \\
   & \text{where} & \text{regs}' = \text{regs}[\text{target} \mapsto \text{value}] \\
   & & \text{address} = \text{offset} +_{32} \text{regs}(\text{base})
\end{array}
\]
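
The cache rule can be read as a guarded lookup. In the following Python sketch the hit branch corresponds to rule 6'; the miss branch, which fetches from memory and fills the cache, is an assumption, since only the hit rule is given here. The function name and the dictionary representation of the cache are illustrative.

```python
def cached_load(address, cache, memory):
    """Return (value, cache, hit) for a load through a simple local cache."""
    if address in cache:  # rule 6': the (address, value) pair is cached
        return cache[address], cache, True
    # Assumed miss behaviour: read main memory and record the pair.
    value = memory[address]
    return value, {**cache, address: value}, False


memory = {8: 99}
value, cache, hit = cached_load(8, {}, memory)
again, cache_after, hit_again = cached_load(8, cache, memory)
```
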
7.4.2 Cost models

As each RISC transition rule is equivalent to one instruction step on a real processor, many of the objections raised in section 6.2.2 do not apply, and the instruction count can be used to estimate a program’s run time. As for estimating the instruction count of any communication primitives, the LogP model proposed by Culler et al. [1993] is recommended, whereby algorithms are modelled using the parameters $L$, $o$, $g$ and $P$, which are defined as follows:

$L$ - an upper bound on the latency involved in communicating a word-length message from source to destination.

$o$ - the overhead attributed to the transmission or reception of each message. During this time a processor cannot engage in other activity.

$g$ - the minimum gap allowed between consecutive message transmissions or receptions. The reciprocal gives the per-processor bandwidth.

$P$ - the number of processors. (Each local operation is assumed to take unit time.)
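
As a rough illustration of how these parameters combine, the Python sketch below estimates three common communication costs. The formulas follow the usual reading of the model (a message costs one send overhead, the latency, and one receive overhead); the function names and the parameter values in the example are assumptions, not measurements.

```python
def point_to_point(L, o):
    """Cost of one small message: send overhead, latency, receive overhead."""
    return o + L + o


def remote_read(L, o):
    """A request followed by a reply, each a point-to-point message."""
    return 2 * point_to_point(L, o)


def pipelined_send(n, L, o, g):
    """Time until the last of n back-to-back messages has been received."""
    if n < 1:
        raise ValueError("at least one message is required")
    # Successive sends are separated by the larger of the gap and the overhead.
    return (n - 1) * max(g, o) + point_to_point(L, o)


# Hypothetical parameter values, in instruction counts.
L, o, g = 6, 2, 4
single = point_to_point(L, o)
read = remote_read(L, o)
burst = pipelined_send(3, L, o, g)
```

These estimates can then be added to the instruction count charged by the corresponding system call.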

Results on a variety of different architectures (including dataflow, shared memory and message passing systems) have shown that the model closely reflects the actual performance of algorithms developed this way. If a more detailed timing model is required, then Talisman’s iterative technique [Bedichek, 1995, section 6.1, page 20] should be applied.

7.4.3 A hybrid architecture

To show the viability of the RISC-based framework, the hybrid architecture shown in figure 7.2 has been developed and tested. Shared memory is accessed via the usual load and store operations, with \(load_{linked}\) providing a basic semaphore facility (see sections F.3 and G.5 for a full description of these instructions). The message-passing network is accessed via the system calls shown in table 7.3, which are modelled after the PROTEUS [Brewer, Dellarocas and Weihl, 1991, section 4.4, pages 35–36] application program interface (the send primitive is non-blocking).

<table>
<thead>
<tr>
<th>system call</th>
<th>inputs</th>
<th>outputs</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>send</strong></td>
<td>R19 return address</td>
<td>R18 corrupted</td>
<td>send a message</td>
</tr>
<tr>
<td></td>
<td>R20 destination</td>
<td>R19 –1 on failure</td>
<td></td>
</tr>
<tr>
<td></td>
<td>R21 message buffer</td>
<td>R20 destination</td>
<td></td>
</tr>
<tr>
<td></td>
<td>R22 length of data</td>
<td>R21 message buffer</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>R22 length of data</td>
<td></td>
</tr>
<tr>
<td><strong>recv</strong></td>
<td>R20 return address</td>
<td>R19 corrupted</td>
<td>receive a message</td>
</tr>
<tr>
<td></td>
<td>R21 buffer length</td>
<td>R20 destination</td>
<td></td>
</tr>
<tr>
<td></td>
<td>R22 message buffer</td>
<td>R21 length of data</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>(–1 if no message)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>R22 message buffer</td>
<td></td>
</tr>
<tr>
<td><strong>poll</strong></td>
<td>R21 return address</td>
<td>R21 corrupted</td>
<td>test for arrival</td>
</tr>
<tr>
<td></td>
<td></td>
<td>R21 length of data</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>(–1 if no message)</td>
<td></td>
</tr>
</tbody>
</table>

Table 7.3: The hybrid architecture’s message-passing interface
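
The behaviour of the three calls can be approximated with one queue per processor. The Python class below is a minimal sketch under stated assumptions: it ignores the register conventions of table 7.3, delivers messages instantly, and truncates a message that is longer than the receive buffer. The class and method names are illustrative.

```python
from collections import deque


class Network:
    """A toy message-passing network with one mailbox per processor."""

    def __init__(self, processors):
        self.mailboxes = [deque() for _ in range(processors)]

    def send(self, destination, message):
        """Non-blocking send: queue the message and return 0, or -1 on failure."""
        if not 0 <= destination < len(self.mailboxes):
            return -1
        self.mailboxes[destination].append(list(message))
        return 0

    def poll(self, processor):
        """Return the length of the next waiting message, or -1 if none."""
        box = self.mailboxes[processor]
        return len(box[0]) if box else -1

    def recv(self, processor, buffer_length):
        """Return (length, data) for the next message, or (-1, []) if none."""
        box = self.mailboxes[processor]
        if not box:
            return -1, []
        data = box.popleft()[:buffer_length]  # assumed: excess words are dropped
        return len(data), data


net = Network(2)
status = net.send(1, [7, 8, 9])
waiting = net.poll(1)
received = net.recv(1, 2)
```
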

The traces generated by the message-passing components of the simulator follow the PICL standard [Geist, Heath, Peyton and Worley, 1990; Worley, 1992], and, with some manual editing, are suitable for use with the ParaGraph visualisation tool [Glendinning, Hockney, Pritchard and others, 1993]. As an example, figure 7.3 shows spacetime diagrams [Heath and Finger, 1993, section 5.2.2, pages 22–23] for three distributed-memory GVT algorithms [Ben-Dyke, 1997].

7.5 Summary

This chapter has presented a state-transition model of a multiprocessor architecture using the Alpha RISC processor as the basic computational engine. This is used to test and debug the output of the STG' compiler (see chapter 8).

How does this model compare with existing simulation tools? Unfortunately, using performance and accuracy as the main criteria, the system is a failure. The latter could be corrected as “the level of detail is limited only by the time available for simulation development” [Jain, 1991, section 24.1, page 394]. However, as a tool for rapidly testing the output of the STG' compiler, the flexibility and convenience compensate for these limitations.
Figure 7.3: Comparing the performance of three GVT algorithms using the ParaGraph visualisation tool
Chapter 8

Compilation rules

8.1 Introduction

This chapter describes how the state-transition model can be used to model a compilation system. Particular emphasis is placed on encoding important optimisations, including register allocation, closure layout, and dead-code elimination. The validity of this approach is demonstrated by developing and prototyping a compilation system for a subset of the sequential STG' language.

Section 8.2 motivates the selection of a RISC assembly language as the target for the compilation system, while the proposed state-transition system is described in section 8.3 (all of the rules are collected in appendix H). Section 8.4 then considers the development of the run-time support for the generated code, before the chapter is concluded in section 8.6.

8.2 Targeting a RISC assembly language

The two most common target languages for compilers of functional programming languages are C [Kernighan and Ritchie, 1978] and assembly language. Bartlett [1989] cites the following advantages to using C:

- **portable** most modern computers provide a C compiler.
- **high level** many of the technical aspects of efficient code generation will be handled by the C compiler (and others' improvements to this technology will be passed on to the new system)
- **easy to interface with C** typically, if a system provides an inter-language interface it will be modelled on C's calling convention. For example, Glasgow Haskell provides the ccall and casm primitives [AQUA Team, 1993, section 3.2.3, pages 33–34], and the Unix operating system uses a C-style interface [Leffler, McKusick, Karels and Quarterman, 1989, chapter 1, page 3].

Peyton Jones et al. [1993, section 6.2] also point out the following, unexpected, benefit:

- **debugging** source-level debuggers, such as gdb, simplify the testing process.

The only drawback to the high-level approach is the potential loss of performance, but, where the two language models are similar, this cost is small. For example, depending upon the C compiler used, Bartlett [1989, section 4, pages 20–22] observed either a 5% slowdown or an 8% speedup over a traditional Scheme compiler. On the other hand, GHC requires a jump statement to efficiently implement the Entry mechanism of the STG machine, and going via C would impose a "considerable" overhead [Peyton Jones et al., 1993, section 6.1]. To overcome this problem, GHC uses non-standard features of the GNU C compiler to explicitly manage the register mapping, thereby circumventing the standard calling convention. The complexity of the resulting implementation [AQUA Team, 1994] is such that the decision to target the C language can be questioned.

Therefore, the reasons for the adoption of a RISC-like assembly language as the compiler's target can be stated as follows:

**Expressiveness** assembler traditionally provides finer control over the layout of the data and code components of a program. This allows a number of optimisations to be expressed, including register allocation and reversed info tables.

**Simplicity** a RISC instruction set is regular, thereby simplifying register-allocation and instruction-scheduling algorithms.

**Portability** as a direct result of simplicity, converting to a CISC-like language should be straightforward.

**Accuracy** both Appel [1992, section 15.1, page 182] and Santos [1995, section 2.3, page 13] use the total number of assembler instructions executed as a primary metric for performance evaluation.¹


From a modelling perspective, however, there is one potential problem with the assembly-language approach: developing the run-time support systems can be tedious, error prone, and time consuming. This issue is addressed in section 8.4.

8.3 Prototyping a modern optimising compiler using a state-transition system

The work described in this section is based on the third part of the original STG report [Peyton Jones and Salkild, 1989, sections 6-11, and appendix A], and also draws on the techniques described in chapter nine of the dragon book [Aho, Sethi and Ullman, 1986, pages 513–584].

The prototype system uses the following state to structure the compilation:

\[(\text{code}, \text{ expression stack}, \text{ continuation stack}, \text{ pending bindings}, \text{ code blocks}, \text{ global environment})\]

The individual components are specified in table 8.1, while section 8.3.1 describes the code component in greater depth. The associated rules, collected in appendix H, are only a subset of what would be required for a complete compilation system, with the most notable omissions being the rules dealing with constructors and higher-order functions. The development of the missing rules should not be difficult.

¹The simulator described in chapter 7 originally used a C-like language (see section A.4.3 for further details), but it proved difficult to cost the different statements and expressions.

Table 8.1: The state components of the compiler framework

<table>
<thead>
<tr>
<th>Code Component</th>
<th>Specification</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>expression stack</td>
<td>stack of instructions</td>
<td>results of sub-compilations are returned on this stack</td>
</tr>
<tr>
<td>continuation stack</td>
<td>stack of code</td>
<td>the return stack whereby control reverts to the initiator of a sub-compilation</td>
</tr>
<tr>
<td>pending bindings</td>
<td>set of bind</td>
<td>global and local definitions awaiting compilation</td>
</tr>
<tr>
<td>code blocks</td>
<td>\( \{\text{label} \mapsto \text{instructions}\}_{env} \)</td>
<td>accumulates the output of the compilation system</td>
</tr>
<tr>
<td>global environment</td>
<td>\( \sigma \; \text{var} = \text{label} \)</td>
<td>records the static addresses of all top-level closures</td>
</tr>
<tr>
<td>instructions</td>
<td>sequence of instruction</td>
<td>a basic block [Aho, Sethi and Ullman, 1986, section 9.4]</td>
</tr>
<tr>
<td>instruction</td>
<td>a RISC instruction – see appendix F</td>
<td></td>
</tr>
<tr>
<td>label</td>
<td>an operand – see section 8.3.2</td>
<td></td>
</tr>
</tbody>
</table>

8.3.1 The code component

The system has two modes of operation, compiler control and expression compilation, and both are detailed in table 8.2. The former codes manipulate bindings and provide flow control, while the rules associated with the latter codes are closely related to those of their STG-machine counterparts, as demonstrated by the following rule for compiling literal expressions:

\[
\begin{aligned}
\text{9} & \quad CEval \ (k) \quad \rho \text{ code returns } \text{exps} \ \text{conts} \ \text{pending blocks} \ \sigma \\
\implies & \quad CReturnInt \ k \quad \rho \text{ code returns } \text{exps} \ \text{conts} \ \text{pending blocks} \ \sigma
\end{aligned}
\]

In some cases, however, it has been necessary to introduce an extra stage, as illustrated by the splitting of the STG-machine Enter code into the CEnter and CJoinEnter codes of the compilation system.

²Indeed, the compilation rules have been numbered in accordance with their STG-machine equivalents – see appendix H.
### Compiler control

**Continue** sets the next command to be either:

1. the top item on the continuation stack.
2. if the continuation stack is empty, then select a binding from the set of those pending compilation, and set the next command to *CompileBind*.
3. if the set of pending bindings is empty (and the continuation stack is finished) then the next command is set to *Finish*.

**CompileBind** initiates the compilation of an STG binding.

**ReturnExpression** returns the instructions needed to evaluate an expression sequence and allows fine-tuning at this level (including common peephole optimisations).

**SealEntry** complements *CompileBind* in that it allows any pre- or post-amble to be added to the main body of a function. This could include, for example, stack, heap and argument checks or (simple) interface manipulations.

**ReturnBind** similar to *ReturnExpression*, except the instructions encode a whole binding, either top-level or bound by a let(rec) expression. This command is a good point at which to generate info tables, static heap entries, and specialised garbage-collection routines.

**Finish** indicates that the compilation process has completed successfully.

### Expression compilation

**CEval** generates the RISC instructions needed to evaluate a given expression sequence.

**CEnter** determines the calling mechanism for a non-literal variable

**CJoinEnter** glues together code either side of a non-local application.

**CReturnCon** determines what return mechanism to use for the given type and is the dual of the *CJoinReturns* instruction. Together they realise the behaviour of the *ReturnCon* code of the operational semantics.

**CReturnLit** has the same effect as *ReturnCon* except it deals with literal values.

**CJoinReturns** combines all of the specified alternatives of a case expression into one return method. The spectrum of possible methods is delimited by vector and in-line returns.

Table 8.2: The code component of the compilation state-transition system
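
The three cases of the Continue command amount to a small dispatch function. The Python sketch below is illustrative only: it represents the continuation stack as a list with its top at the end and the pending bindings as a set of names, and it picks the alphabetically smallest pending binding purely to make the choice deterministic.

```python
def continue_step(continuations, pending):
    """Choose the next command following the three cases of Continue."""
    if continuations:
        # Case 1: resume the command on top of the continuation stack.
        return continuations[-1], continuations[:-1], pending
    if pending:
        # Case 2: start compiling one of the pending bindings.
        binding = min(pending)  # any member would do; min is deterministic
        return ("CompileBind", binding), continuations, pending - {binding}
    # Case 3: nothing left to resume or compile.
    return ("Finish",), continuations, pending


resumed = continue_step([("ReturnBind",), ("SealEntry",)], frozenset({"fac"}))
started = continue_step([], frozenset({"map", "fac"}))
finished = continue_step([], frozenset())
```
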
8.3.2 Operands and register allocation

The compilation system uses the following operands:

<table>
<thead>
<tr>
<th>operand</th>
<th>STG-machine C-code equivalent</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>register<sub>n</sub></td>
<td>var</td>
<td>contents of the nth general-purpose register</td>
</tr>
<tr>
<td>stack<sub>n</sub><sup>A</sup></td>
<td>SpA[n], SpB[n]</td>
<td>the value stored in the nth slot of the A (boxed) or B (unboxed) stack</td>
</tr>
<tr>
<td>heap<sub>offset</sub></td>
<td>Hp[<sub>offset</sub>]</td>
<td>the address created by adding offset to the heap pointer</td>
</tr>
<tr>
<td>memory<sub>offset</sub></td>
<td>operand[<sub>offset</sub>]</td>
<td>contents of the memory location specified by adding offset to operand</td>
</tr>
<tr>
<td>label<sub>name</sub></td>
<td>&amp;name</td>
<td>a named label pointing to a static address, which may reference an entry routine, an information table, a jump table etc.</td>
</tr>
<tr>
<td>literal</td>
<td>1, ..., UINT_MAX</td>
<td>a constant integer value</td>
</tr>
</tbody>
</table>

A modern RISC processor will typically provide either 32 or 64 general-purpose registers, although a number of these are reserved by the compilation system for holding important values, as shown below:

<table>
<thead>
<tr>
<th>register</th>
<th>1-23</th>
<th>24</th>
<th>25</th>
<th>26</th>
<th>27</th>
<th>28</th>
<th>29</th>
<th>30</th>
<th>31</th>
</tr>
</thead>
<tbody>
<tr>
<td>use</td>
<td>general purpose</td>
<td>Ret</td>
<td>Np</td>
<td>StkA</td>
<td>StkABase</td>
<td>StkB</td>
<td>StkBBase</td>
<td>HLimit</td>
<td>Hp</td>
</tr>
</tbody>
</table>

The general-purpose registers, in combination with the node pointer, Np, and stack pointers, simulate the local environment of the STG machine.

Furthermore, depending upon the entry and return conventions other registers may have special meanings (see rules 1–2c and 11A–13’). For example, upon entry to a closure (see the following section), register 24 will hold the return address (rule 2c), and the node pointer, Np, will point to the base of the closure (rule 1).

While calling conventions rigidly define the location of certain values upon entry and exit of a basic block, the strategy for making the best use of the general-purpose registers within the block itself is known as register allocation [Aho, Sethi and Ullman, 1986, section 9.7]. As demonstrated by Fraser and Hanson [1992], even simple allocation schemes can be effective. Despite the maturity of such algorithms for imperative languages, their functional counterparts have received little attention, with notable exceptions including the work of Boquist [1995] and Appel [1992, chapter 11].

8.3.3 Counters, timers and interrupts

As discussed in section 9.3.2, it is often useful to interrupt the current thread of control, perform some task, and then continue as before. Unfortunately, this can significantly complicate the run-time code [Axford, 1989, section 1.2], and, therefore, the compilation system itself. The Gambit compiler [Feeley and Miller, 1990] neatly avoids these problems by inserting tests after every basic block – the tests, and any handlers they may invoke, can then assume that the system is in a stable state. To ensure that the tests are performed in a timely manner, it may be necessary for the compilation system to split large blocks into a number of smaller ones.
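
The block-splitting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Gambit implementation or the compilation system described here: the function name `split_block`, the `max_len` bound, and the `poll` pseudo-instruction are all hypothetical.

```python
def split_block(instrs, max_len):
    """Split a basic block into chunks of at most max_len instructions.

    Each chunk is terminated by a hypothetical 'poll' pseudo-instruction,
    which stands for the test inserted after every basic block.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    chunks = []
    for start in range(0, len(instrs), max_len):
        chunks.append(instrs[start:start + max_len] + ["poll"])
    # An empty block still receives a single poll.
    return chunks or [["poll"]]
```

Bounding the chunk length bounds the number of instructions executed between two consecutive tests, at the cost of one extra test per chunk.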
8.3.4 The fac function

Figure 8.1 shows the output of the compilation system for the specialised fac function shown below:

```
fac = [] \r [n] -> case n of
  0# -> 1#;
  _  -> let# n_less_one = minusInt# [n, 1#] in
        let# fac_n_less_one = fac n_less_one
        in timesInt# [n, fac_n_less_one]
```

Figure 8.1: Unoptimised RISC code produced by the compiler for the fac function
Appendix I contains a number of other examples, including the nofib programs, fib, and primes, as well as common prelude functions such as map and quotRem. In addition, it also includes the code required to update partial applications and algebraic constructors.

8.4 Run-time support

Run-time support covers both the traditional operating-system libraries (including message passing and thread management) as well as the more specialised capabilities, such as distributed garbage collection [Lester, 1989; Trinder, Hammond, Partridge, Peyton Jones and others, 1996, section 2.3.3] and load balancing (see section 9.3.2). The compilation rules access both types of libraries using application-program interfaces (API) similar to those described in section 7.4.3, but extended to include the static label of the appropriate code block.

While generating a specific API will be straightforward, the implementation in RISC assembler is likely to be tedious, time consuming, and error prone. While it may be tempting to provide the functionality via the simulator’s syscall interface – thereby enabling the use of Haskell – this should only be used for the operations described in section 7.3.2. Apart from “feeling” wrong, abusing the syscall mechanism could affect the accuracy and correctness of the simulation as each such operation is atomic.

The simple solution to the above-mentioned problem is to use an existing compiler to generate the assembly code from, for example, a C implementation of the function. The lcc re-targetable C compiler [Fraser and Hanson, 1991] is an obvious candidate as it supports cross compilation to MIPS assembler [Kane and Heinrich, 1992]. Moreover, lcc only performs simple peephole optimisations, thereby maintaining the correspondence between the source and output codes. Some editing of the resulting code will be required, but there is a considerable net saving in both time and effort, and increased confidence in the correctness of the generated assembler code.

8.5 Benchmarking the nofib routines

Tables 8.3 and 8.4 contain the RISC instruction counts when running unoptimised and optimised versions of the fib, primes, and queens benchmarks. The total instruction count is broken down into the following categories for each benchmark:

- **computation** includes the numerical and logical operators, add, multiply, exclusive or, shift left, etc. These instructions are primarily used when performing argument checks, trimming the stack, and allocating memory. The computation performed as a result of primitive STG' operations is typically less than 10%.

- **memory** includes both loads and stores. Loads are used to retrieve stack parameters, and access data from the heap (typically info tables and free variables). Stores are used to push data onto the stack and to initialise or update closures. Within the benchmarks, loads and stores each account for approximately 50% of the memory operations (with the exception of the optimised queens, where loads account for 60% of the total).

- **calls** include both branches (BR and JMP) and subroutine calls (BSR and JSR). Branches are used to call known entry points and to return to the correct vector entry. Calls are used to enter closures for single constructor data-types, such as Integer and Boolean. The ratio between branches and calls is typically between 70% and 80%, although for the optimised fib the ratio drops to 60%.
<table>
<thead>
<tr>
<th>benchmark</th>
<th>total</th>
<th>computation</th>
<th>memory</th>
<th>calls</th>
<th>immediates</th>
<th>conditionals</th>
</tr>
</thead>
<tbody>
<tr><td>fib 5</td><td>2676</td><td>31.6%</td><td>41.3%</td><td>13.6%</td><td>6.8%</td><td>6.7%</td></tr>
<tr><td>fib 10</td><td>32565</td><td>31.7%</td><td>41.4%</td><td>13.6%</td><td>6.5%</td><td>6.8%</td></tr>
<tr><td>fib 15</td><td>363927</td><td>31.7%</td><td>41.5%</td><td>13.6%</td><td>6.5%</td><td>6.8%</td></tr>
<tr><td>queens 5</td><td>176314</td><td>28.5%</td><td>46.4%</td><td>12.4%</td><td>6.2%</td><td>6.5%</td></tr>
<tr><td>queens 6</td><td>849691</td><td>28.4%</td><td>46.6%</td><td>12.5%</td><td>6.0%</td><td>6.5%</td></tr>
<tr><td>primes 50</td><td>299406</td><td>27.7%</td><td>47.0%</td><td>11.9%</td><td>7.0%</td><td>6.4%</td></tr>
<tr><td>primes 100</td><td>1077037</td><td>27.6%</td><td>47.1%</td><td>12.0%</td><td>6.9%</td><td>6.4%</td></tr>
</tbody>
</table>

Table 8.3: RISC-instruction counts for the unoptimised benchmarks

<table>
<thead>
<tr>
<th>benchmark</th>
<th>total</th>
<th>computation</th>
<th>memory</th>
<th>calls</th>
<th>immediates</th>
<th>conditionals</th>
</tr>
</thead>
<tbody>
<tr><td>fib 5</td><td>230</td><td>40.4%</td><td>27.8%</td><td>19.6%</td><td>5.2%</td><td>7.0%</td></tr>
<tr><td>fib 10</td><td>2174</td><td>45.3%</td><td>25.3%</td><td>20.7%</td><td>0.6%</td><td>8.2%</td></tr>
<tr><td>fib 15</td><td>23726</td><td>45.8%</td><td>25.0%</td><td>20.8%</td><td>0.1%</td><td>8.3%</td></tr>
<tr><td>queens 5</td><td>42239</td><td>25.1%</td><td>47.8%</td><td>13.4%</td><td>8.2%</td><td>5.5%</td></tr>
<tr><td>queens 6</td><td>167527</td><td>24.9%</td><td>48.1%</td><td>14.0%</td><td>7.6%</td><td>5.4%</td></tr>
<tr><td>primes 50</td><td>190890</td><td>27.9%</td><td>45.6%</td><td>11.4%</td><td>6.9%</td><td>8.2%</td></tr>
<tr><td>primes 100</td><td>673132</td><td>27.7%</td><td>45.9%</td><td>11.4%</td><td>6.7%</td><td>8.3%</td></tr>
</tbody>
</table>

Table 8.4: RISC-instruction counts for the optimised benchmarks

- **immediates** represents the loading of numeric constants via the load-address instructions LA and LAH. These instructions are typically used to load the address of the information tables when initialising heap-allocated closures. As they tend to appear in pairs, halving the number of immediate instructions provides a good estimate of the total number of heap allocations.

- **conditionals** includes both the conditional move, CMOVE, and the conditional jump, CBR. These are used when testing for stack and heap overflows, and for implementing simple case expressions.
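
The halving estimate described for the immediates category can be worked through as follows. This is a rough sketch only: the function name `estimated_allocations` is hypothetical, and because the published percentages are rounded to one decimal place the result is approximate.

```python
def estimated_allocations(total_instructions, immediate_share):
    """Estimate heap allocations as half the number of immediate instructions.

    immediate_share is the fraction of the total instruction count that
    falls into the immediates category (for example, 0.068 for 6.8%).
    """
    immediates = total_instructions * immediate_share
    return round(immediates / 2)


# fib 5, unoptimised: 2676 instructions, of which 6.8% are immediates.
print(estimated_allocations(2676, 0.068))
```

For the unoptimised fib 5 run this gives roughly 182 immediate instructions, and hence an estimate of about 91 heap allocations.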

Table 8.5 compares the total number of RISC instructions for each benchmark to the total number of reduction steps performed by the STG machine. The bracketed numbers denote the ratio between these two counts, and, with the exception of the optimised fib results, the instruction-level simulation performs two to four times the number of steps of the STG machine. Furthermore, the amount of memory required to simulate the RISC machine is up to twenty times greater than that for the STG machine. The net effect is that the RISC simulator runs considerably slower than the STG machine, and can therefore only be used to evaluate smaller problems.
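
The bracketed ratios can be recomputed directly from the two counts. The sketch below uses three rows taken from Table 8.5; the function name `instructions_per_reduction` and the dictionary layout are assumptions made purely for illustration.

```python
def instructions_per_reduction(risc_instructions, stg_reductions):
    """Ratio of RISC instructions to STG reduction steps, to one decimal place."""
    return round(risc_instructions / stg_reductions, 1)


# (STG unoptimised, STG optimised, RISC unoptimised, RISC optimised)
rows = {
    "fib 5": (771, 211, 2676, 230),
    "queens 5": (38630, 16863, 176314, 42239),
    "primes 50": (96374, 79032, 299406, 190890),
}

for name, (stg_u, stg_o, risc_u, risc_o) in rows.items():
    print(name,
          instructions_per_reduction(risc_u, stg_u),
          instructions_per_reduction(risc_o, stg_o))
```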

<table>
<thead>
<tr>
<th>benchmark</th>
<th>STG unoptimised</th>
<th>STG optimised</th>
<th>RISC unoptimised</th>
<th>RISC optimised</th>
</tr>
</thead>
<tbody>
<tr><td>fib 5</td><td>771</td><td>211</td><td>2676 (3.5)</td><td>230 (1.1)</td></tr>
<tr><td>fib 10</td><td>9357</td><td>2479</td><td>32565 (3.5)</td><td>2174 (0.9)</td></tr>
<tr><td>fib 15</td><td>104545</td><td>306475</td><td>363927 (3.5)</td><td>23726 (0.1)</td></tr>
<tr><td>queens 5</td><td>38630</td><td>16863</td><td>176314 (4.6)</td><td>42239 (2.5)</td></tr>
<tr><td>queens 6</td><td>188174</td><td>75102</td><td>849691 (4.5)</td><td>167527 (2.2)</td></tr>
<tr><td>primes 50</td><td>96374</td><td>79032</td><td>299406 (3.1)</td><td>190890 (2.4)</td></tr>
<tr><td>primes 100</td><td>348835</td><td>286485</td><td>1077037 (3.1)</td><td>673132 (2.4)</td></tr>
</tbody>
</table>

Table 8.5: Comparing STG machine reductions and RISC instructions

### 8.6 Summary

This chapter has described a state-transition model of a modern optimising compiler, which is closely related to the STG-machine (see chapter 6). To demonstrate the viability of the resulting rules, a prototype compiler has been developed. The results from benchmarking the compiled versions of the nofib programs (**fib**, **queens**, and **primes**) show that the instruction-level simulation requires between two and four times as many cycles as the STG machine. This reduces the problem size that can be examined at this level of detail.
Chapter 9

Prototyping parallel functional intermediate languages

9.1 Introduction

In this chapter the use of the prototyping framework is illustrated by four case studies. Each of the studies is based upon an existing well-known system, and, between them, they include examples of the main programming abstractions used in modern parallel functional programming (see section 2.4) and cover both GMSV and DMMP architectures (see section 2.2.1). The first (section 9.2) is based upon shared-memory Haskell [Mattson Jr., 1993a], and considers the introduction of parallel threads into the STG' language. This provides a simple overview of the methodology, and serves as a foundation upon which the other case studies build. The second (section 9.3) moves on to consider GUM Haskell [Trinder et al., 1996], essentially a DMMP implementation of the previous study. While the static semantics are very similar to those of the first case study, the operational model is far more complex, and demonstrates how message passing can be modelled by a state-transition system. The third (section 9.4) investigates the data placement primitives of para-functional Haskell [Hudak, 1991]. These prove interesting both in terms of the denotational and operational models. Skeletal parallelism [Cole, 1989] is the subject of the final case study (section 9.5), dealing with farms, pipes and divide-and-conquer skeletons [Darlington et al., 1993].

9.2 Mattson’s speculative evaluation technique

Under the evaluate-and-die model [Peyton Jones, 1989, page 178], a thread is an independent process which computes the value of one expression and then terminates. This approach to thread management has been adopted by most modern systems, including GUM [Trinder et al., 1996, section 2.2], the JUMP* machine [Chakravarty, 1994, section 2.3.2], and the v-STG machine [Hwang and Rushall, 1992, sections 6-8].

Traditionally, only expressions essential to the main computation are candidates for threads. By sparking non-essential expressions, speculative systems increase the number of available threads, thereby decreasing the chance that any processor is idle. However, there is a chance that the time and space expended on the computation will be wasted, and complications arise when a speculative task is detected to be either necessary or irrelevant.

The system presented in this section is primarily based on Mattson’s speculative graph reducer [Mattson Jr., 1993a, section 4.3, pages 69–80] and the GRIP multiprocessor [Peyton Jones et al., 1987; Mattson Jr., 1993b, sections 2–3].

### 9.2.1 The static semantics

Speculative parallelism is introduced into the STG' language by extending the `exp` production rule (see section 5.2.1) as follows:

```
exp → letspec literal simple_bind exp | ⋯ speculative evaluation
```

The `literal` value should be between 0 and 100, and estimates the percentage probability that the bound expression will be required as part of the main computation. Note that this relates to the traditional `letpar` as follows:

```
letpar simple_bind exp ≡ letspec 100 simple_bind exp
```

As an example, figure 9.1 shows a speculative variant of the `map` function, which estimates that the first element of the tail will be required 90% of the time. When a speculative thread evaluates a `letspec` expression, the effective probability of the new thread is the product of the probabilities of the current thread and the specified probability – any thread with a probability less than 10% is ignored. Therefore, the speculative map will evaluate a list up to a maximum depth of 21 elements (if the probability were changed to 50%, then a maximum of 3 elements would be evaluated). The free variables of the new expression are determined by the following equation:

```
\mathcal{FV}_{\text{exp}}[\text{letspec } \text{literal } \text{var} = \text{exp}_{\text{rhs}}\ \text{exp}_{\text{body}}]\ g \\
= \mathcal{FV}_{\text{exp}}[\text{exp}_{\text{rhs}}]\ g \cup (\mathcal{FV}_{\text{exp}}[\text{exp}_{\text{body}}]\ g \setminus \{\text{var}\})
```
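
The depth limits quoted above for the speculative `map` (21 elements at 90%, 3 elements at 50%) can be checked with a short calculation. The sketch below is an illustration only: the function name `max_speculation_depth` is hypothetical, and it assumes real-valued probabilities and a root thread of probability 100 (integer division in the probability product would give a slightly shallower bound).

```python
def max_speculation_depth(prob, threshold=10.0):
    """Count how many nested letspec sparks stay at or above the threshold.

    Starts from a thread of probability 100 and repeatedly multiplies by
    prob/100, stopping as soon as the product would drop below the threshold.
    """
    if not 0 < prob < 100:
        raise ValueError("prob must lie strictly between 0 and 100")
    depth = 0
    current = 100.0
    while current * prob / 100 >= threshold:
        current = current * prob / 100
        depth += 1
    return depth


print(max_speculation_depth(90))  # 21
print(max_speculation_depth(50))  # 3
```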

The denotational semantics of the new expression is shown below:

```
\mathcal{E}[\text{letspec } \text{literal } \text{var} = \text{exp}_{\text{rhs}}\ \text{exp}_{\text{body}}]\ \rho \\
= \text{let } e = \mathcal{E}[\text{exp}_{\text{rhs}}]\ \rho \\
\quad \text{in if } (\text{literal} \geq 100) \land (e = \bot) \\
\quad \text{then } \bot \\
\quad \text{else } \mathcal{E}[\text{exp}_{\text{body}}]\ (\rho \oplus \{\text{var} \mapsto e\})
```

Notice that a test for bottom is only made if the thread is guaranteed to be required. Otherwise, a non-terminating speculative expression can only affect the result of the entire program if it turns out to be required, or if evaluation of the expression causes the system to run out of resources. The denotational semantics captures the first of these conditions, but cannot express the second.
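
The shape of the denotational equation can be mimicked by a small interpreter fragment. This is a sketch under strong assumptions: bottom is modelled by a sentinel object (`BOTTOM`) rather than by genuine non-termination, expressions are represented as Python functions from an environment to a value, and the name `eval_letspec` is hypothetical.

```python
BOTTOM = object()  # sentinel standing in for an undefined (non-terminating) value


def eval_letspec(literal, var, rhs, body, env):
    """Evaluate 'letspec literal var = rhs body' in the environment env.

    rhs and body are functions from an environment (a dict) to a value.
    """
    e = rhs(env)
    # The test for bottom is only made when the binding is certain to be needed.
    if literal >= 100 and e is BOTTOM:
        return BOTTOM
    # Otherwise the body is evaluated with var bound to e, bottom or not.
    return body({**env, var: e})
```

With a probability of 100 an undefined right-hand side makes the whole expression undefined, whereas with a lower probability the result depends only on whether the body actually uses the bound variable.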
The type rule, shown in figure 9.2, asserts that the probability is an unboxed integer, and that the bound expression must be a data constructor. The reasoning behind the latter restriction is described in section 4.5.1. Section 5.3.1 discusses ways in which the range restriction on the percentage probability could be enforced.

9.2.2 The operational model

The abstract state is defined in table 9.1, and the relationship between the code field and the new rules is illustrated in figure 9.3 (see also figure 4.11). An overview of the rules can be found in table 9.2. With the exception of rules BH1 and BH2, all of the rules are additions to the original STG machine – the two black-hole rules replace rules 15 and 16 respectively (which handle the entry and updating of thunks). The following sections look at these rules in greater detail.
<table>
<thead>
<tr><th>component</th><th>specification</th><th>description</th></tr>
</thead>
<tbody>
<tr><td>$G$</td><td>$(P_1, \ldots, P_n)\ wp\ h\ \sigma$</td><td>a collection of processors, $P_i$, all sharing a global work pool, $wp$, memory, $h$, and environment, $\sigma$.</td></tr>
<tr><td>$P$</td><td>$(\text{code}, \ldots, t_{id}, \text{prob})$</td><td>the standard STG abstract state extended to include the $id$ of the currently active thread and its probability. Extensions have also been made to the code, closure, and continuation components.</td></tr>
<tr><td>$t_{id}$</td><td>$a$</td><td>a thread’s identifier is the address of its heap-allocated state object, TSO.</td></tr>
<tr><td>wp</td><td>$(\text{threads}, \text{sparks})$</td><td>the tasks currently available to the system.</td></tr>
<tr><td>threads</td><td>queue of $(t_{id}, \text{prob})$</td><td>a collection of threads ordered by the threads’ probabilities.</td></tr>
<tr><td>sparks</td><td>sequence of $(a, \text{prob})$</td><td>pointers to closures whose values may be required as part of the main computation, ordered by probability.</td></tr>
<tr><td>code</td><td>GetThread</td><td>schedule the next thread to be run.</td></tr>
<tr><td>closure</td><td>TSO prob (code, as, rs, us)</td><td>represents the state of a thread, which comprises its probability, an instruction sequence, and the three standard stacks.</td></tr>
<tr><td></td><td>BlackHole $t_{id}$ threads</td><td>records the $id$ of the thread which created the black hole, and any threads which are awaiting the final value of the closure.</td></tr>
<tr><td></td><td>Active | Stopped</td><td>these values will only ever be stored at address $a_{\text{status}}$, and are used to indicate the current state of the computation.</td></tr>
<tr><td>continuation</td><td>EndThread</td><td>terminate the current thread.</td></tr>
<tr><td></td><td>Finished</td><td>terminate the entire computation.</td></tr>
<tr><td>prob</td><td>$0$–$100$</td><td>likelihood that a thread will be required as part of the main computation. The operational model uses the probability as an indication of a thread’s importance or priority.</td></tr>
</tbody>
</table>

Table 9.1: State components of a thread-management system
<table>
<thead>
<tr>
<th>category</th>
<th>rule</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>evaluation</td>
<td>SPEC</td>
<td>evaluates the <code>letspec</code> expression, creating new sparks for use by the scheduler.</td>
</tr>
<tr>
<td>synchronisation</td>
<td>BH1</td>
<td>black holes thunks upon entry</td>
</tr>
<tr>
<td></td>
<td>BH2</td>
<td>suspends the current thread upon entry to a black hole</td>
</tr>
<tr>
<td></td>
<td>BH3</td>
<td>updates a black hole, releasing all suspended threads.</td>
</tr>
<tr>
<td>resource management</td>
<td>SCHED1</td>
<td>converts a spark to a thread.</td>
</tr>
<tr>
<td></td>
<td>SCHED2</td>
<td>schedules an existing thread.</td>
</tr>
<tr>
<td></td>
<td>SCHED3</td>
<td>busy-wait for new work.</td>
</tr>
<tr>
<td>initialisation/termination</td>
<td>INIT</td>
<td>static partitioning of the STG machine state.</td>
</tr>
<tr>
<td></td>
<td>FINISH1</td>
<td>signal the end of the computation.</td>
</tr>
<tr>
<td></td>
<td>FINISH2</td>
<td>detect the end of the computation.</td>
</tr>
</tbody>
</table>

Table 9.2: Overview of the STG rules for Mattson’s speculative evaluation engine

Thread creation

Thread creation, often referred to as *sparking*, is a two-stage process as shown in figure 9.4. The first step is to identify the necessary and speculative expressions, as demonstrated by the SPEC rule:

\[
\begin{array}{ll}
 & \text{Eval}\ (\text{letspec}\ prob\ v = e_1\ e_2)\ \rho\ as\ rs\ us\ t_{id}\ p\ (tp, spk)\ h\ \sigma \\
\text{such that} & prob' \geq 10 \\
\Rightarrow & \text{Eval}\ e_2\ (\rho \oplus \{v \mapsto a\})\ as\ rs\ us\ t_{id}\ p\ (tp, spk')\ h'\ \sigma \\
\text{where} & prob' = p * prob / 100 \\
 & h' = h[a \mapsto \text{create\_closure}\ e_1\ \rho] \\
 & spk' = \text{insert\_spark}\ (a, prob')\ spk
\end{array}
\]

This operation is very cheap as it only involves a heap allocation and the addition of the closure’s address and probability to the spark pool, $spk$. The `insert_spark` function maintains the correct ordering of the pool, thereby ensuring the spark with the highest probability appears at the head of the queue. On a single processor system, this rule is equivalent to a normal `let` expression, as the spark and thread pool will never be used. The associated closure may be evaluated as part of the normal computation, but this will happen within the main thread.
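
The bookkeeping performed when a `letspec` is evaluated can be sketched as follows. This is a simplified model, not the machine's implementation: the spark pool is a plain Python list of `(address, probability)` pairs, the helper names `effective_probability` and `spec` are hypothetical, and it is assumed that a spark whose effective probability falls below the 10% threshold is simply not added to the pool.

```python
import bisect


def effective_probability(thread_prob, spec_prob):
    """Product of the current thread's probability and the letspec literal."""
    return thread_prob * spec_prob / 100


def insert_spark(spark, pool):
    """Return a new pool containing spark, ordered by decreasing probability."""
    _, prob = spark
    # Keys are negated so that bisect's ascending order yields descending probability.
    keys = [-p for (_, p) in pool]
    index = bisect.bisect_right(keys, -prob)
    return pool[:index] + [spark] + pool[index:]


def spec(thread_prob, spec_prob, addr, pool, threshold=10):
    """Add a spark for the closure at addr, unless its probability is too low."""
    prob = effective_probability(thread_prob, spec_prob)
    if prob < threshold:
        return pool  # assumption: low-probability sparks are ignored
    return insert_spark((addr, prob), pool)
```

Because `bisect_right` is used, a new spark is placed after any existing sparks of equal probability, so equal-priority sparks are kept in arrival order.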

The second part of the *sparking* process involves the closures stored in the spark pool being converted into threads. This occurs when the current thread either blocks or terminates (see the BH2 and END_THREAD rules):

\[
\begin{array}{ll}
 & \text{GetThread}\ \langle\rangle\ \langle\rangle\ \langle\rangle\ t_{id}\ p\ (tp, spk)\ h\ \sigma \\
\text{such that} & (\text{empty}\ tp) \lor (\text{max\_prob}\ tp < p') \\
\Rightarrow & \text{Enter}\ a\ \langle\rangle\ \langle \text{EndThread} \rangle\ \langle\rangle\ t_{newid}\ p'\ (tp, spk')\ h'\ \sigma \\
\text{where} & (a, p') : spk' = spk \\
 & h' = h[t_{newid} \mapsto \text{TSO}\ p'\ \text{init\_tso\_state}]
\end{array}
\]

Observe that a new thread is only created when either the work pool is empty, i.e. all existing threads have either blocked or finished, or if a higher-priority spark is available. The TSO closure is used to preserve the thread’s local state when suspending the thread.
Black holes and thread synchronisation

To prevent duplication of work, whenever a thread enters a potentially shared thunk it updates the closure with a BlackHole:

\[
\begin{array}{ll}
 & \text{Enter}\ a\ as\ rs\ us\ t_{id}\ p\ wp\ h[a \mapsto (vs\ \backslash u\ \langle\rangle \rightarrow e,\ ws)]\ \sigma \\
\Rightarrow & \text{Eval}\ e\ \rho\ \langle\rangle\ \langle\rangle\ ((a, as, rs) : us)\ t_{id}\ p\ wp\ h[a \mapsto \text{BlackHole}\ t_{id}\ \langle\rangle]\ \sigma \\
\text{where} & \rho = \{ v_1 \mapsto w_1, \ldots, v_n \mapsto w_n \},\ vs = \langle v_1, \ldots, v_n \rangle,\ ws = \langle w_1, \ldots, w_n \rangle
\end{array}
\]

Whenever another thread enters the black hole, its local state is saved, the thread is added to the closure’s list of blocked threads, the importance of the thread evaluating the closure is increased, and a new thread is scheduled:

\[
\begin{array}{ll}
 & \text{Enter}\ a\ as\ rs\ us\ t_{id_1}\ p_1\ wp\ h \\
\Rightarrow & \text{GetThread}\ as\ rs\ us\ t_{id_1}\ p_1\ wp\ h \\
\text{where} & ts' = \text{enqueue}\ (t_{id_1}, p_1)\ ts \\
 & state'_{t_2} = (\text{Enter}\ a, as, rs, us) \\
 & p'_{t_2} = \max(p_1, p_2)
\end{array}
\]

Notice that the importance of any threads that \( t_{id_2} \) may have sparked, or any threads upon which \( t_{id_2} \) may be waiting, is not increased – Mattson [1993a, section 3.2.4] calls this the low-impact model of speculative evaluation. Furthermore, there is no mechanism for reverting \( t_{id_2} \)’s priority once the closure has been evaluated (the required changes would be significant and add little to the presentation).

When the black hole is updated all of the blocked threads are added to the work pool:

\[
\begin{array}{ll}
 & \text{Return}_\chi\ c\ ws\ \langle\rangle\ \langle\rangle\ ((a_u, as_u, rs_u) : us)\ t_{id}\ p\ (tp, spk)\ h\ \sigma \\
\text{such that} & h[a_u \mapsto \text{BlackHole}\ t_{id}\ ts] \\
\Rightarrow & \text{Return}_\chi\ c\ ws\ as_u\ rs_u\ us\ t_{id}\ p\ (tp', spk)\ h'\ \sigma \\
\text{where} & tp' = \text{q\_append}\ ts\ tp \\
 & h' = h[a_u \mapsto (vs\ \backslash n\ \langle\rangle \rightarrow c\ vs,\ ws)] \\
 & \text{length}\ vs = \text{length}\ ws \\
 & vs \text{ is a sequence of arbitrary distinct variables}
\end{array}
\]
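
The interplay of the three black-hole rules can be illustrated with a small heap model. The sketch below is a deliberate simplification and makes several assumptions: closures are Python objects rather than STG closures, thread priorities are held in a separate dictionary, released threads are simply appended to the thread pool, and the names `Thunk`, `BlackHole`, `Value`, `enter`, and `update` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Thunk:
    compute: Callable[[], object]


@dataclass
class BlackHole:
    owner: str                                   # thread evaluating the closure
    blocked: list = field(default_factory=list)  # (thread id, probability) pairs


@dataclass
class Value:
    value: object


def enter(heap, addr, thread_id, prob, priorities):
    """Enter the closure at addr on behalf of thread_id."""
    closure = heap[addr]
    if isinstance(closure, Thunk):
        # BH1: overwrite the thunk with a black hole owned by this thread.
        heap[addr] = BlackHole(owner=thread_id)
        return "run", closure.compute
    if isinstance(closure, BlackHole):
        # BH2: block this thread and raise the owner's priority if necessary.
        closure.blocked.append((thread_id, prob))
        priorities[closure.owner] = max(priorities[closure.owner], prob)
        return "blocked", None
    return "value", closure.value


def update(heap, addr, value, thread_pool):
    """BH3: replace the black hole with its value and release blocked threads."""
    blocked = heap[addr].blocked
    heap[addr] = Value(value)
    thread_pool.extend(blocked)
```

A thread that receives `"blocked"` would then hand control to the scheduler, while the owner eventually calls `update`, at which point the blocked threads become runnable again.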

Terminating a thread

Once a thread has evaluated and updated its target closure, the EndThread continuation, pushed by the SCHED1 rule, will be invoked:

\[
\begin{array}{ll}
 & \text{Return}_\chi\ c\ ws\ \langle\rangle\ \langle\rangle\ \langle\rangle\ t_{id}\ p\ wp\ h\ \sigma \\
\Rightarrow & \text{GetThread}\ \langle\rangle\ \langle\rangle\ \langle\rangle\ t_{id}\ p\ wp\ h\ \sigma
\end{array}
\]

The memory occupied by the thread’s TSO closure will eventually be reclaimed by the garbage collector, so explicit de-allocation is not necessary.

Scheduling

Whenever a thread terminates, blocks on a black hole, or a timer interrupt occurs (see section 6.2.2), a new thread is selected from the current work pool (see also SCHED1):

\[
\begin{array}{ll}
 & \text{GetThread}\ \langle\rangle\ \langle\rangle\ \langle\rangle\ t_{id}\ p\ (tp, spk)\ h\ \sigma \\
\text{such that} & (\neg\,\text{empty}\ tp) \land (\text{max\_prob}\ spk \leq p') \\
\Rightarrow & code\ as\ rs\ us\ t_{newid}\ p'\ (tp', spk)\ h'\ \sigma \\
\text{where} & (t_{newid}, p', tp') = \text{dequeue}\ tp \\
 & h' = h[t_{newid} \mapsto \text{TSO}\ p'\ (code, as, rs, us)]
\end{array}
\]

The `dequeue` function determines the style of scheduling amongst equal-priority threads, whether it be FIFO/LIFO (first/last in, first out) [Hammond and Peyton Jones, 1992, section 5.2], or round robin [Trinder et al., 1996, section 2.2]. Parrott [1993] outlines a system which combines risk aversion and stochastic learning which significantly outperforms a random schedule for most workloads. If no work is available, the processor busy waits:

\[
\begin{array}{ll}
 & \text{GetThread}\ \langle\rangle\ \langle\rangle\ \langle\rangle\ t_{id}\ p\ wp\ h\ \sigma \\
\text{such that} & \text{empty}\ tp \\
\Rightarrow & \text{GetThread}\ \langle\rangle\ \langle\rangle\ \langle\rangle\ t_{id}\ p\ wp\ h\ \sigma
\end{array}
\]
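
Taken together, the three scheduling rules amount to a single decision made each time `GetThread` is executed. The following sketch models that decision only; it assumes both pools are Python lists of `(identifier, probability)` pairs already ordered by decreasing probability, treats an empty spark pool as never outranking a thread, and uses the hypothetical function name `get_thread`.

```python
def get_thread(thread_pool, spark_pool):
    """Choose the next piece of work.

    Returns (rule, item, new_thread_pool, new_spark_pool), where rule names
    the scheduling rule that applies.
    """
    best_thread = thread_pool[0][1] if thread_pool else None
    best_spark = spark_pool[0][1] if spark_pool else None

    # SCHED1: no runnable thread, or the best spark outranks the best thread.
    if best_spark is not None and (best_thread is None or best_thread < best_spark):
        return "SCHED1", spark_pool[0], thread_pool, spark_pool[1:]

    # SCHED2: resume the highest-priority existing thread.
    if best_thread is not None:
        return "SCHED2", thread_pool[0], thread_pool[1:], spark_pool

    # SCHED3: nothing to do, so the processor busy waits.
    return "SCHED3", None, thread_pool, spark_pool
```

Ties between a thread and a spark are resolved in favour of the existing thread, which matches the strict inequality in the spark-conversion rule.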
Initialisation and termination

The initial state of an n-processor system is defined as follows:

\[
\begin{array}{lcl}
G & = & (P_1, \ldots, P_n)\ wp\ h\ \sigma \\
P_i & = & (\text{GetThread}, \langle\rangle, \langle\rangle, t_{none}, 0) \\
wp & = & (tp, \langle\rangle) \quad \text{where } tp = \langle (t_{main}, 100) \rangle \\
h & = & \left\{ \begin{array}{l}
  t_{main} \mapsto \text{TSO}\ 100\ (\text{Enter}\ a_{main}, \langle\rangle, \langle \text{Finished} \rangle, \langle\rangle), \\
  a_{status} \mapsto \text{Active}, \\
  a_1 \mapsto (vs_1\ \backslash\pi_1\ xs_1 \rightarrow exp_1,\ \sigma\ vs_1), \\
  \ldots, \\
  a_n \mapsto (vs_n\ \backslash\pi_n\ xs_n \rightarrow exp_n,\ \sigma\ vs_n)
\end{array} \right\} \\
\sigma & = & \{ g_1 \mapsto a_1, \ldots, g_n \mapsto a_n \}
\end{array}
\]

Note that all of the processors will be in competition to steal the main thread of computation. This is acceptable in a GMSV system as all messaging is implicit and guaranteed, so there is no risk of losing important data due to one processor starting before another has finished its initialisation.

The computation is finished whenever the main thread terminates:

\[
\begin{array}{ll}
(\text{FINISH}_1) & \\
 & \text{Return}\ ws\ \langle \text{Finished} \rangle\ t_{main}\ 100\ wp\ h\ \sigma \\
\Rightarrow & \text{Stop}\ \langle\rangle\ \langle\rangle\ t_{main}\ 100\ wp\ h'\ \sigma \\
\text{where} & h' = h[a_{status} \mapsto \text{Stopped}]
\end{array}
\]

By signalling that the computation has ended (via the flag stored at \( a_{\text{status}} \)), the other processors can finish what they’re doing and exit cleanly (using, for example, the broadcast tree outlined in section 6.3.2). As heap allocations are a frequent occurrence, and are also comparatively expensive operations, they provide a convenient place to check the current status:

\[
\begin{array}{ll}
(\text{FINISH}_2) & \\
 & \text{Eval}\ (\text{let}\ bindings\ exp)\ \rho\ as\ rs\ us\ t_{id}\ p\ wp\ h\ \sigma \\
\text{such that} & h[a_{status} \mapsto \text{Stopped}] \\
\Rightarrow & \text{Stop}\ as\ rs\ us\ t_{id}\ p\ wp\ h\ \sigma
\end{array}
\]

9.2.3 Compilation rules

The following sections outline the changes that need to be made to the compilation rules presented in chapter 8 to support the new operational rules.

The register map

The register map is shown in table 9.3 and is superficially similar to that used in a purely sequential context (see section 8.3.2). The status register, Sts, caches the address $a_{status}$ as it is used in every `let` expression. The work pool is accessed infrequently, and so there is no need to waste a register caching its static address. Note that replacing both Hp and HpLimit with just HpVar is an optimisation that is only possible at the assembly-language level.

<table>
<thead>
<tr>
<th>register</th>
<th>1–22</th>
<th>23</th>
<th>24</th>
<th>25</th>
<th>26</th>
<th>27</th>
<th>28</th>
<th>29</th>
<th>30</th>
<th>31</th>
</tr>
</thead>
<tbody>
<tr>
<td>use</td>
<td>general purpose</td>
<td>Ret</td>
<td>Sts</td>
<td>Np</td>
<td>StkA</td>
<td>StkABase</td>
<td>StkB</td>
<td>StkBBase</td>
<td>Tp</td>
<td>HpVar</td>
</tr>
</tbody>
</table>

**Ret**  
stores the address of the return handler for the current evaluation (which may be a generic update-handler, when evaluating a polymorphic thunk).

**Sts**  
stores the address $a_{status}$, which is used during heap allocation to determine if the computation has finished.

**Np**  
points to the closure which is currently being evaluated, and is used to access an expression’s free variables.

**StkB**  
points to the next available slot on the B stack.

**StkBBase**  
points to the upper limit of the B stack, and is used to detect stack underflow.

**Tp**  
points to the current thread’s TSO closure, and can be used, indirectly, to access the thread’s priority.

**HpVar**  
replaces both $Hp$ and $HpLimit$ by pointing to the address where the actual heap pointer is stored. This extra indirection is necessary as any processor can extend the heap at any time, so caching the last value seen by the local processor is not safe. The heap limit is stored in the address directly after the heap pointer, and can therefore also be accessed via the HpVar register.

Table 9.3: The register map for compiling speculative expressions
Figure 9.5: Closure layouts for a speculative GMSV system

**Closure layout**

Figure 9.5 shows the layout of the TSO and BlackHole closures, plus a heap-allocated stack object required to support dynamic thread creation. Note that there is no need to store the Tp or HpVar registers in the TSO closure, as the information they contain is trivial to compute. Furthermore, the standard garbage collection mechanisms can be used to reclaim both TSO closures and the associated stack space.

**Communication and synchronisation**

The main consideration with regard to synchronisation is access to the shared resources, namely the work pool and the global heap. All access to the former will have to be mutually exclusive [Axford, 1989, chapter 3], while only updates to the latter will need to be controlled. To illustrate the basic mechanisms, figure 9.6 shows the instruction sequence used to implement the heap allocation. As mentioned in section 6.2.4, specifying this level of detail in the operational rules would severely limit their usefulness.

**New compilation rules**

One new rule needs to be introduced to handle the letspec expression, and this is a modification of the let rule (see rule 3 in appendix H). The main difference between the two is that after creating the closure, the letspec rule generates code to add the closure’s address to the spark pool:

<table>
<thead>
<tr>
<th>LETSPEC</th>
<th>CEval (letspec prob v = e₁ e₂) ρ code rs es conts pending b σ</th>
</tr>
</thead>
<tbody>
<tr>
<td>→</td>
<td>CEval e₂ ρ' code' rs es conts pending' b σ</td>
</tr>
<tr>
<td>where</td>
<td>ρ' = ρ<sub>moves</sub> \ vars<sub>dead</sub></td>
</tr>
<tr>
<td></td>
<td>code' = code ++ moves ++ add_spark</td>
</tr>
<tr>
<td></td>
<td>pending' = {(v, e₁)} ∪ pending</td>
</tr>
<tr>
<td></td>
<td>(moves, ρ<sub>moves</sub>) = allocate_closure v e₁ ρ σ</td>
</tr>
<tr>
<td></td>
<td>vars<sub>dead</sub> = FV[e₁] \ FV[e₂]</td>
</tr>
</table>

Note that code for checking the status flag has already been added to the heap-allocation routine, so there is no need to update the let compilation routines to implement the FINISH2 rule.

<table>
<thead>
<tr>
<th>system call</th>
<th>inputs</th>
<th>outputs</th>
</tr>
</thead>
<tbody>
<tr>
<td>alloc</td>
<td>R21</td>
<td>R21 address of the allocated memory</td>
</tr>
<tr>
<td></td>
<td>R22</td>
<td>R20 corrupted</td>
</tr>
<tr>
<td></td>
<td></td>
<td>R19 corrupted</td>
</tr>
</tbody>
</table>

`label_alloc:`

- `load (regSts), reg19` // load the current status
- `branch_{x ≤ 0} reg19, label_exit` // terminate the computation if necessary
- `load +4(regHpVar), reg19` // load the heap limit
- `load_linked (regHpVar), reg20` // (link) load the current heap pointer
- `subtract reg19, reg20, reg19` // has there been a heap overflow
- `branch_{x < 0} reg19, label_GC` // if so, invoke the garbage collector
- `add reg21, reg20, reg19` // otherwise, increase the heap pointer
- `store_linked reg19, (regHpVar)` // attempt to update the heap pointer
- `branch_{x = 0} reg19, label_alloc` // retry if the allocation failed
- `move reg19, reg21` // otherwise, set the result parameter
- `jump reg22` // return to the caller

Figure 9.6: Heap allocation in a GMSV system
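
The retry loop at the heart of the allocation sequence can be imitated at a higher level. The sketch below is a model rather than a translation of the assembler: the class `SharedHeap` and the function `alloc` are hypothetical, the lock exists only to make the simulated store-conditional atomic, and the status check and the call to the garbage collector are replaced by raising `MemoryError`. Because the heap pointer in this model only ever grows, comparing against the previously loaded value is sufficient to detect interference.

```python
import threading


class SharedHeap:
    """A shared heap pointer with a simulated load-linked/store-conditional pair."""

    def __init__(self, limit):
        self.pointer = 0
        self.limit = limit
        self._lock = threading.Lock()  # models the atomicity of store-conditional only

    def load_linked(self):
        return self.pointer

    def store_conditional(self, expected, new):
        """Update the pointer only if nobody has changed it since it was loaded."""
        with self._lock:
            if self.pointer != expected:
                return False
            self.pointer = new
            return True


def alloc(heap, size):
    """Reserve size words and return the address of the reserved block."""
    while True:
        old = heap.load_linked()
        if old + size > heap.limit:
            raise MemoryError("heap exhausted (a real system would invoke the collector)")
        if heap.store_conditional(old, old + size):
            return old
        # Another processor moved the heap pointer first, so retry.
```

Only the update of the shared pointer is protected; initialising the reserved block afterwards needs no synchronisation, because no other processor can have been given the same addresses.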

Garbage collection

There are two issues related to garbage collection that need to be considered in a GMSV system. Firstly, the root set of the collector (see section 6.3.3) should be extended to include the thread pools, and scavenging and evacuation routines must be specified for the new TSO and BlackHole closures. Secondly, and more problematically, a strategy for coordinating the collection phase needs to be adopted. A simple, yet inefficient, approach is for each processor to enter a barrier (see section 6.3.2 or [Almasi and Gottlieb, 1993]) once it detects that the heap has been exhausted. As heap allocation is inevitable, all processors must eventually enter the barrier. When all of the processors have entered the barrier, one processor garbage collects the entire memory as for a uniprocessor machine, and afterwards computation continues as before (see section 6.3.3 for further details). More complex schemes [North and Reppy, 1987; Lester, 1989] are beyond the scope of this thesis.
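
The simple barrier scheme can be sketched with Python threads standing in for processors. This is an illustration under stated assumptions rather than a description of the run-time system: the function name `run_collection` is hypothetical, the collection itself is reduced to recording which processor performed it, and a second barrier is used so that no processor resumes before the collection is complete.

```python
import threading


def run_collection(num_processors):
    """Simulate one stop-the-world collection coordinated by a barrier."""
    barrier = threading.Barrier(num_processors)
    collections = []  # identifiers of the processors that ran the collector
    resumed = []      # identifiers of the processors that continued afterwards

    def processor(pid):
        # Heap exhaustion has been detected: wait until every processor arrives.
        index = barrier.wait()
        if index == 0:
            # Exactly one processor receives index 0 and acts as the collector.
            collections.append(pid)
        # Nobody resumes until the collector has finished.
        barrier.wait()
        resumed.append(pid)

    threads = [threading.Thread(target=processor, args=(pid,))
               for pid in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return collections, sorted(resumed)
```

The inefficiency mentioned above is visible in the structure: while the single collector runs, every other processor sits idle at the second barrier.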

9.2.4 Performance

The STG'-equivalents of the parallel (conservative) benchmark programs, fib and queens, are shown in figures 9.8 and 9.10 respectively. Both are derived from the optimised sequential version presented in appendix B (see sections B.2 and B.4 for further details).

The fib benchmark

The fib program is often described as embarrassingly parallel, as it produces a large number of tasks related via a simple tree structure. As such, it is often used to assess the performance of an implementation under near-optimal conditions. The tasks are very fine grained, typically involving just two additions. If necessary, the grain size of the computation can be controlled by restricting the depth of the tree. The relative speedups for the fib program running under the GMSV STG machine are shown in figure 9.9. The curves show that the system achieves near linear speedups for the larger problem sizes. For the smaller problem sizes, the amount of available work is sufficiently small that the speedup reaches a plateau after a fixed number of processors [Gustafson, 1988].

Figure 9.7: Example code generated by the LETSPEC compilation rule

**Figure 9.8: The parallel STG' fib -O benchmark**

These results are consistent with those observed by Mattson Jr. [1993a, section 5.2, pages 93-94] for larger problem sizes. The problem sizes examined here had to be restricted to fifteen and under in order for the simulations to complete on a standard desktop machine (a 266MHz PII PC with 65M of memory, running Windows NT 4.0, and using GHC 4.03). Each run completed in less than a minute. There is no reason why larger problem sizes could not be attempted on a more powerful machine.

The relative speedups for the unoptimised version of the fib program were very similar to those for the optimised version, and even tended to be slightly better due to the increased grain size of the computations. However, there can be no justification for using sub-optimal algorithms when evaluating parallel performance.
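
The effect of grain-size control on the number of parallel tasks can be estimated with a short sequential sketch. The function below is hypothetical and makes two assumptions that are not taken from the benchmark source: fib is the standard definition with fib 0 = 0 and fib 1 = 1, and every call whose argument is at least `threshold` creates exactly one spark. A threshold on the argument is used here as a stand-in for a limit on the depth of the call tree.

```python
def fib_with_sparks(n, threshold=0):
    """Return (fib n, number of sparks created while computing it)."""
    if n < 2:
        return n, 0
    left, left_sparks = fib_with_sparks(n - 1, threshold)
    right, right_sparks = fib_with_sparks(n - 2, threshold)
    # Only calls at or above the threshold spark one of their sub-calls.
    sparked = 1 if n >= threshold else 0
    return left + right, left_sparks + right_sparks + sparked
```

Under these assumptions, `fib_with_sparks(15)` creates 986 sparks, while `fib_with_sparks(15, 10)` creates only 20, each of which represents a correspondingly larger piece of work.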

**The queens benchmark**

The queens benchmark is more demanding than the fib program described in the previous section. Firstly, far fewer tasks are generated: fib 15 creates approximately 2000 tasks, while queens 6 generates just over 150. Secondly, the dependencies between threads are more complex, with the output of one task typically depending upon the outputs of a number of other tasks. Finally, the grain size is variable and difficult to control. Unsurprisingly, this program is often used to demonstrate an implementation’s performance under more challenging conditions. The relative speedups for the queens program running under the GMSV STG machine are shown in figure 9.11. The curves start almost linearly, but then quickly reach a plateau due to the limited number of tasks available.

Looking at the results reported by Mattson Jr. [1993a, section 5.2.2, pages 90-93], the initial parts of the curves are similar. With Mattson’s implementation, however, the speedup drops off as further processors are added once the maximum speedup has been achieved. The STG results, in contrast, remain perfectly flat. This discrepancy arises because the STG simulation does not model resource contention (caused by the locking mechanisms described in section 6.2.4). The RISC simulation, on the other hand, does model the first-order effects of locking, and figure 9.12 compares the STG and RISC speedups for queens 3.

Figure 9.9: Relative speedups for the conservative fib -O benchmark

---

**STG' code**

```haskell
main = [] \u [] -> nsoln.wrk int 5#

gen.wrk = [] \r [nq n] ->
  case n of {
    0# -> nil nil ;
    _ -> let# dec_n' = minusInt# [n, 1#] in
      letstrict bs = gen.wrk nq dec_n' in
      let { qs = [nq] \u [] -> const.Int.enumFromTo one nq; } in
      gen_comprehension nq qs bs ;
  }

gen_comprehension = [] \r [nq one_to_nq dss] -> case dss of
  { Nil -> Nil [] ;
    Cons d ds -> letpar a = gen_comprehension nq one_to_nq ds
      in g a d nq one_to_nq ;
  }
```

Figure 9.10: The core of the parallel STG' queens -O benchmark

Figure 9.11: Relative speedups for the conservative queens -O benchmark

9.2.5 Extensions

Unlike the sequential STG machine, the speculative rule set does not detect erroneously cyclic definitions of the form:

**STG' code**

\[
x = [\,] \ \backslash u \ [\,] \rightarrow \text{const.Int.+} \ x \ \text{one};
\]

One solution would be for the BH₁ rule to record the id of the thread responsible for evaluating a black-holed closure. The BH₂ rule could then check whether the newly blocked thread is directly or indirectly responsible for the evaluation upon which it is waiting. However, the simplest solution would be to debug the algorithm on a sequential implementation, which can easily detect the presence of such cycles.
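
The proposed check amounts to following a chain of "is waiting for" links. The sketch below is a hypothetical illustration rather than part of the rule set: `owner_of` is assumed to map each black hole to the thread recorded as evaluating it, and `waiting_on` is assumed to map each blocked thread to the black hole on which it is suspended.

```python
def creates_cycle(owner_of, waiting_on, blocked_thread, black_hole):
    """Would blocking blocked_thread on black_hole close a dependency cycle?"""
    seen = set()
    thread = owner_of.get(black_hole)
    while thread is not None and thread not in seen:
        if thread == blocked_thread:
            return True                  # the thread is (indirectly) waiting on itself
        seen.add(thread)
        next_hole = waiting_on.get(thread)
        thread = owner_of.get(next_hole) if next_hole is not None else None
    return False
```

For the cyclic definition of `x` shown above, the thread that black-holed `x` immediately re-enters it, so the first iteration of the loop already reports a cycle.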

The rule set also uses a very crude priority-upgrade mechanism (as did Mattson’s implementation), whereby a blocked thread can boost the priority of a speculative thread currently evaluating the associated thunk (see rule BH2). Currently, this increase in status is permanent. Mattson Jr. [1993a, section 3.2.4, pages 54-56] proposes a number of alternatives, including tracking the stack depth at which the speculative task entered the thunk. While the STG animation would provide an excellent environment for testing these strategies, none of Mattson’s benchmark programs suffered due to the false upgrading of speculative threads.

9.2.6 Assessment

Overall, despite being the first case study, the development of the static semantics, STG-machine rules and corresponding animation were straightforward extensions of their sequential counterparts. The compilation and RISC animations were more problematic, but this was mainly due to the large number of support routines that needed to be developed and tested (subsequent studies simply made use of this groundwork). While each new phase of the development process introduced additional details and complexity, the tools and descriptions developed during the previous phase provided a strong foundation upon which to build. The animations helped to test the correctness of the semi-formal specifications, and also provided valuable insight into the system dynamics. Indeed, the operational specification and STG animation were developed iteratively (as was the case with the sequential compilation rules and the RISC compiler described in chapter 8).

As previously mentioned, the development of the RISC animation was probably the most time consuming phase of the development. Fortunately, the STG animation was sufficiently accurate to allow different strategies to be compared and tested, such that only the successful candidates needed to proceed to the final (expensive) phase. However, the RISC animation does not model cache effects, and so can only be used as a rough guide.

The performance results obtained from both the STG and RISC simulations broadly agree with those observed by Mattson Jr. [1993a], although both simulators are only capable of handling significantly smaller problem sizes. The primary limitation of the STG simulations is that they ignore resource contention, and therefore do not exhibit the classic degradation of performance with increasing numbers of surplus processors.

Figure 9.12: Comparing the STG' and RISC animations for the queens -O benchmark

9.3 GUM: Graph reduction for a Unified Machine

GUM [Trinder et al., 1996] (Graph reduction for a Unified Machine) is a DMMP implementation of Haskell, using the classic par operator to identify parallel threads. GUM is built on top of the PVM communication system [Beguelin et al., 1993] and is therefore portable to a range of architectures, including both GMSV and DMMP machines. Notably, absolute speedups over the best sequential compilers have been observed for both high-performance shared-memory machines and clusters of workstations operating over Ethernet.

9.3.1 The static semantics

The static semantics are very similar to those presented in the previous case study, and the necessary details can be found in figure 9.13.

9.3.2 The operational model

This section presents a state-transition model of the asynchronous message-passing features of GUM. The abstract states for the processor, $P$, and communication system, $S$, are shown in table 9.4, and the relationship between the code field and the new rules is illustrated in figure 9.14. An overview of the rules can be found in table 9.5.

Sending and receiving messages

Unlike the previous study, all communication has to be explicitly declared in a DMMP system. The model used here is based closely on that presented in section 6.2.5: all sends are asynchronous, and all receives are blocking. This section presents three of the STG rules – SEND, RECV, and BCAST. These provide convenient abstractions for use by the other rules, hiding the details of the actual network interface. Furthermore, by centralising access to the network, it is possible to modify or enhance the communication system without changing the other rules. For example, it would be straightforward to re-implement the GUM rule set using a shared-memory implementation of the messaging system.
<table>
<thead>
<tr>
<th>specification</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>$G$</td>
<td>$(P_1, \ldots, P_n) S$ a collection of processors, $P_i$, which have to communicate via the message-passing system, $S$.</td>
</tr>
<tr>
<td>$P$</td>
<td>$(\text{code, } \ldots, h, t_{id}, wp, \sigma)$ the standard STG abstract state extended to include support for a local work pool, $wp$ (see table 9.7 for the $wp$-related definitions).</td>
</tr>
<tr>
<td>$S$</td>
<td>$(\text{buffers}_1 \cdots \text{buffers}_n, \text{network})$ the message-passing system, which comprises the processor-network interfaces and a model of the communication hardware.</td>
</tr>
<tr>
<td>buffers</td>
<td>$(\text{buffer}_{\text{in}}, \text{buffer}_{\text{out}})$ the input and output message buffers for a single processor</td>
</tr>
<tr>
<td>buffer</td>
<td>queue of $(i, \text{message})$ $i$ is either the source or destination of the message</td>
</tr>
<tr>
<td>probe buffer</td>
<td>probe $\text{buffer}_{\text{in}}, \text{message}$ search for an entry that matches the message pattern</td>
</tr>
<tr>
<td>code</td>
<td>$\text{Send message code}$ sends the specified message and then invokes the continuation code. Table 9.6 details the messages used by the GUM system.</td>
</tr>
<tr>
<td></td>
<td>$\text{Receive message code}$ indicates the arrival of a message, which interrupted the execution of the specified code.</td>
</tr>
</tbody>
</table>

Table 9.4: State components of a message-passing system
<table>
<thead>
<tr>
<th>category</th>
<th>rule</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>evaluation</td>
<td>PAR</td>
<td>evaluates the <code>letpar</code> expression, creating new sparks for use by the scheduler.</td>
</tr>
<tr>
<td>communications</td>
<td>SEND</td>
<td>send a message to a remote processor.</td>
</tr>
<tr>
<td></td>
<td>BCAST</td>
<td>broadcast a message to all other remote processors.</td>
</tr>
<tr>
<td></td>
<td>RECV</td>
<td>receive a message from a remote processor.</td>
</tr>
<tr>
<td>synchronisation</td>
<td>BH₁</td>
<td>black holes thunks upon entry</td>
</tr>
<tr>
<td></td>
<td>BH₂</td>
<td>suspends the current thread upon entry to a black hole</td>
</tr>
<tr>
<td></td>
<td>BH₃</td>
<td>update a black hole.</td>
</tr>
<tr>
<td></td>
<td>UNBLOCK</td>
<td>re-activate suspended threads and blocked Fetch messages.</td>
</tr>
<tr>
<td>scheduling</td>
<td>SCHED₁</td>
<td>converts a spark to a thread.</td>
</tr>
<tr>
<td></td>
<td>SCHED₂</td>
<td>schedules an existing thread.</td>
</tr>
<tr>
<td></td>
<td>SCHED₃</td>
<td>busy-wait for new work.</td>
</tr>
<tr>
<td>load balancing</td>
<td>FISH₁</td>
<td>request work from a neighbouring processor.</td>
</tr>
<tr>
<td></td>
<td>FISH₂</td>
<td>forward a work-request message to another processor.</td>
</tr>
<tr>
<td></td>
<td>FISH₃</td>
<td>receive a work-request message which originated from the local processor.</td>
</tr>
<tr>
<td></td>
<td>SEND_WORK</td>
<td>send surplus work to a remote processor.</td>
</tr>
<tr>
<td></td>
<td>RECV_WORK</td>
<td>receive work from a remote processor.</td>
</tr>
<tr>
<td></td>
<td>RECV_ACK</td>
<td>receive acknowledgment of the safe arrival of a work packet.</td>
</tr>
<tr>
<td>partitioning</td>
<td>FETCH₁</td>
<td>request the value of a remote closure.</td>
</tr>
<tr>
<td></td>
<td>FETCH₂</td>
<td>return the value of a local closure to a remote processor.</td>
</tr>
<tr>
<td></td>
<td>FETCH₃</td>
<td>receive the value of a closure from a remote processor.</td>
</tr>
<tr>
<td></td>
<td>FETCH₄</td>
<td>suspends a Fetch message when requesting the value of either a BlackHole or FetchMe closure.</td>
</tr>
<tr>
<td>initialisation/termination</td>
<td>INIT</td>
<td>static partitioning of the STG machine state.</td>
</tr>
<tr>
<td></td>
<td>FINISH₁</td>
<td>signal the end of the computation.</td>
</tr>
<tr>
<td></td>
<td>FINISH₂</td>
<td>detect the end of the computation.</td>
</tr>
</tbody>
</table>

Table 9.5: Overview of the GUM STG rules

Another possibility would be to piggy-back status information onto all outgoing messages, as discussed in section 6.3.3.

As shown in table 9.4, the network-interface comprises two buffers per processor, \((b_{in}, b_{out})_i\). The first contains all messages that have been delivered to processor \(i\) but have not yet been received (messages are added by the network and removed by the processor). The second contains all outgoing messages from processor \(i\) that have not yet been injected into the network (messages are added by the processor and removed by the network). The SEND rule, therefore, manipulates \(b_{out}\), and the code continuation indicates what actions should be taken after the message has been sent:

\[
\begin{aligned}
\text{(SEND)} \quad & \text{Send } message \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ (b_{in}, b_{out})_i \\
\Rightarrow \quad & code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ (b_{in}, b'_{out})_i \\
\text{where} \quad & b'_{out} = \text{enqueue } message \ b_{out}
\end{aligned}
\]

The BCAST rule is similar, but sends a copy of the specified message to all of the other processors:

\[
\begin{aligned}
\text{(BCAST)} \quad & \text{Broadcast } body \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ (b_{in}, b_{out})_i \\
\Rightarrow \quad & code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ (b_{in}, b'_{out})_i \\
\text{where} \quad & b'_{out} = \text{enqueue } messages \ b_{out} \\
& messages = \langle \forall \ j \in \{1, \ldots, n\} \land j \neq i \bullet (i, j, body) \rangle
\end{aligned}
\]
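
The two rules above only ever touch the sender's output buffer. The following sketch shows this behaviour with ordinary queues; it is illustrative only. `Buffers`, `send`, `bcast` and `deliver` are hypothetical names, messages are represented as `(source, destination, body)` triples, and `deliver` is an assumed, idealised network step that moves every pending message straight to its destination.

```python
from collections import deque


class Buffers:
    """The pair (b_in, b_out) belonging to one processor."""

    def __init__(self):
        self.b_in = deque()
        self.b_out = deque()


def send(buffers, message):
    # SEND: enqueue the message on the sender's output buffer.
    buffers.b_out.append(message)


def bcast(buffers, i, n, body):
    # BCAST: one copy of the body for every processor other than the sender i.
    for j in range(1, n + 1):
        if j != i:
            send(buffers, (i, j, body))


def deliver(all_buffers):
    # Assumed network model: drain each b_out into the destination's b_in.
    for buffers in all_buffers.values():
        while buffers.b_out:
            message = buffers.b_out.popleft()
            destination = message[1]
            all_buffers[destination].b_in.append(message)
```

Keeping the network step separate from `send` reflects the asynchronous model: the sender continues with its code continuation as soon as the message has been queued.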

<table>
<thead>
<tr>
<th>message</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fish</td>
<td>request work from another processor. The <em>age</em> field denotes the number of times the message can be forwarded before aborting the request and returning it to the <em>originator</em>.</td>
</tr>
<tr>
<td>Schedule</td>
<td>send work to another processor. The <em>mask</em> differentiates between addresses and literals contained within the free variables of the <em>closure</em>.</td>
</tr>
<tr>
<td>Ack</td>
<td>acknowledge receipt of work.</td>
</tr>
<tr>
<td>Fetch</td>
<td>request the value of a closure stored on a remote processor.</td>
</tr>
<tr>
<td>Resume</td>
<td>return a value requested by a <em>Fetch</em> message.</td>
</tr>
<tr>
<td>Exit</td>
<td>shutdown the system when either evaluation is complete or an error has occurred.</td>
</tr>
</tbody>
</table>

In addition to its content, a message also records both its sender and receiver.

Table 9.6: The messages used by the GUM system

As stated previously, all receives are blocking. Rather than committing to a potentially infinite delay, the GUM architecture continually polls the network to determine if a message has arrived. If this is the case, then it is safe to invoke the receive operation:

\[
\begin{aligned}
\text{(RECV)} \quad & code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ (b_{in}, b_{out})_i \\
\text{such that} \quad & \text{probe } b_{in} \ \text{wild-card} \\
\Rightarrow \quad & \text{Receive } message \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ (b'_{in}, b_{out})_i \\
\text{where} \quad & (message, b'_{in}) = \text{dequeue } b_{in}
\end{aligned}
\]

In effect, this rule will be triggered as soon as a message arrives, thereby overriding the normal sequence of transitions (this is analogous to a microprocessor interrupt handler – see section 6.2.3). GUM then invokes a specialised message handler, based on the type of the message received (table 9.6 lists the various message types). The handler is passed the \textit{code} continuation to allow it to resume the interrupted task (if appropriate). The GUM message handlers (and generators) are as follows:

<table>
<thead>
<tr>
<th>category</th>
<th>message</th>
<th>SEND/BCAST rules</th>
<th>RECV rules</th>
</tr>
</thead>
<tbody>
<tr>
<td>load balancing</td>
<td>Fish</td>
<td>FISH₁, FISH₂</td>
<td>SEND_WORK</td>
</tr>
<tr>
<td></td>
<td>Schedule</td>
<td>SEND_WORK₁</td>
<td>RECV_WORK</td>
</tr>
<tr>
<td></td>
<td>Ack</td>
<td>RECV_WORK</td>
<td>RECV_ACK</td>
</tr>
<tr>
<td>remote references</td>
<td>Fetch</td>
<td>FETCH₁</td>
<td>FETCH₂</td>
</tr>
<tr>
<td></td>
<td>Resume</td>
<td>FETCH₂, UNBLOCK</td>
<td>FETCH₃, FETCH₄</td>
</tr>
<tr>
<td>termination</td>
<td>Exit</td>
<td>FINISH₁</td>
<td>FINISH₂</td>
</tr>
</table>
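
The polling receive and the dispatch on message type can be sketched as a single transition function. This is a hypothetical illustration: `step` and `handlers` are not names from the thesis, the body of a message is assumed to be a tuple whose first element is the message type, and the handlers are supplied by the caller rather than being the GUM rules themselves.

```python
from collections import deque


def step(b_in, code, handlers):
    """Perform one transition of a processor whose input buffer is b_in."""
    if not b_in:
        # The probe fails, so the normal sequence of transitions continues.
        return code
    message = b_in.popleft()             # dequeue the oldest pending message
    _source, _destination, (tag, *_arguments) = message
    # The handler receives the interrupted code so that it can resume it.
    return handlers[tag](message, code)
```

A handler that needs to reply would typically finish by queuing a message of its own and then returning the continuation it was given.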

The three communication rules are widely used by the rest of the GUM rule set, as
shown by the following table:

<table>
<thead>
<tr>
<th>action</th>
<th>associated rules</th>
</tr>
</thead>
<tbody>
<tr>
<td>SEND</td>
<td>FISH₁, FETCH₁, UNBLOCK</td>
</tr>
<tr>
<td>RECV</td>
<td>FETCH₄, FINISH₂</td>
</tr>
<tr>
<td>SEND &amp; RECV</td>
<td>FISH₃, SEND_WORK₁, RECV_WORK, RECV_ACK, FETCH₂</td>
</tr>
<tr>
<td>BCAST</td>
<td>FINISH₁</td>
</tr>
</tbody>
</table>

The rules that both send and receive messages are akin to Culler’s active messages [Culler, Goldstein, Schauser and von Eicken, 1992].

**Scheduling**

As with the previous study, GUM uses the evaluate-and-die thread model, and therefore the work-pool definitions are very similar (see table 9.7). Note, however, that each processor has a local pool, as opposed to the centralised structure used by the speculative system. Due to their similarity with the rules described previously, the GUM scheduling rules are presented together in figure 9.15. Note that the SCHED₃ busy-wait can only be broken by the RECV_WORK rule. One alternative to this busy-wait would be to perform a local garbage collection.
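
The choice between the scheduling rules depends only on the contents of the local work pool. The sketch below names the rule that would fire for a given pool, following the descriptions in table 9.5; `get_thread` is a hypothetical helper, and preferring existing threads over sparks is an assumption rather than something stated in the text.

```python
def get_thread(threads, sparks, fishing):
    """Return the name of the rule selected for the pool (threads, sparks, fishing)."""
    if threads:
        return "SCHED2"      # schedule an existing thread (assumed to take priority)
    if sparks:
        return "SCHED1"      # convert a spark into a thread
    if not fishing:
        return "FISH1"       # the pool is empty: request work from a neighbour
    return "SCHED3"          # a request is already outstanding: busy-wait
```

For example, `get_thread([], [], True)` selects the busy-wait, which persists until a message changes the pool.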

**Synchronisation**

Again, as with scheduling, the synchronisation rules are very similar to those used in the previous case study (see figure 9.16). The main difference occurs with the BH₃ rule, which is responsible for updating a shared closure. In addition to releasing any blocked threads, it must also reply to any blocked Fetch request. This is handled by the Unblock mechanism, which is described in greater detail in the remote-referencing section.

<table>
<thead>
<tr>
<th>specification</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>code</td>
<td>GetThread</td>
</tr>
<tr>
<td>wp</td>
<td>(threads, sparks, fishing)</td>
</tr>
<tr>
<td>threads</td>
<td>queue of tid</td>
</tr>
<tr>
<td>sparks</td>
<td>sequence of a</td>
</tr>
<tr>
<td>fishing</td>
<td>true</td>
</tr>
<tr>
<td>tid</td>
<td>a</td>
</tr>
<tr>
<td>closure</td>
<td>BlackHole blocked</td>
</tr>
<tr>
<td></td>
<td>TSO (code, as, rs, us)</td>
</tr>
<tr>
<td>continuation</td>
<td>EndThread</td>
</tr>
<tr>
<td></td>
<td>Finished</td>
</tr>
</tbody>
</table>

Table 9.7: State components of GUM's work pool

Load balancing – an overview

As discussed in section 6.3.3, GUM uses a passive load-balancing strategy: when a processor runs out of work, it sends a Fish message to one of its neighbours requesting additional work. If the receiver has any spare work then it packages it up and returns it. Figure 9.17 shows the necessary interactions for an unemployed processor to receive work. If the receiver had no spare work, then the Fish would be forwarded on, until either a suitable processor is found or the message becomes stale. Stale messages are returned to their originator, as shown in figure 9.18. In summary, upon arrival of a Fish message, there are four possible outcomes:

1. there is sufficient local work, with at least one spare spark available, which is therefore packed and returned to the source of the fish message. (rule SEND_WORK)

2. there is no work, but the fish message is not stale, in which case it is forwarded to another processor. (rule FISH2)

3. the out-of-work processor receives its own fish message, (and assuming no local work has become unblocked) then the fish is regenerated after a suitable timeout period. (rule FISH3)

4. there is no work and the fish has become stale, i.e. has visited too many processors, in which case a stale-fish message is returned to the source of the fish message. (rule

The `FISH₁`, `SEND_WORK`, `RECV_WORK`, and `RECV_ACK` rules form the backbone of the load-balancing mechanism and are discussed in the subsequent sections. The remaining rules, `FISH₂` and `FISH₃`, are shown in figure 9.23.
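
The outcomes listed above can be summarised as a single decision function. The sketch is a hypothetical reading of the overview, not the rule set itself: `on_fish` is an invented name, the body of a Fish message is assumed to be the tuple `("Fish", age, origin)`, the default staleness limit of 3 is arbitrary, and the next processor is chosen with the same ring formula as the neighbour computation in figure 9.19.

```python
def on_fish(i, n, message, has_spare_spark, age_stale=3):
    """Decide how processor i (of n) reacts to an incoming Fish message."""
    sender, _destination, (_tag, age, origin) = message
    if has_spare_spark:
        # Outcome 1: pack a spark and return it to the sender of the message.
        return ("SEND_WORK", sender)
    if origin == i:
        # Outcome 3: our own request came back; stop fishing and retry later.
        return ("FISH3", None)
    if age < age_stale:
        # Outcome 2: forward the request to the next processor in the ring.
        neighbour = 1 + (i % n)
        return ("FISH2", (i, neighbour, ("Fish", age + 1, origin)))
    # Outcome 4: the request is stale, so it is returned to its originator.
    return ("FISH2", (i, origin, ("Fish", age, origin)))
```

With four processors, a fresh request from processor 1 that finds no work on processor 2 is forwarded to processor 3 with its age increased by one.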

Load balancing – asking for work

The load-balancing mechanism is activated whenever a processor becomes idle. This situation typically arises because all local threads have either been fully evaluated or are currently blocked awaiting the arrival of a remote reference. Also, at the start of the computation, only the main processor will have any work (see the `INIT` rule). The initial phases of the evaluation will therefore entail a large number of `Fish` messages. The rule for generating `Fish` messages is shown in figure 9.19. In the real GUM implementation, the `Fish` message is sent to a random processor, rather than to its right-hand side neighbour (the HDG machine employs a neighbour-first strategy [Kingdon, Lester and Burn, 1991, section 3.2, page 293]).

Load balancing – receiving work

Having sent the `Fish` message, the processor will remain idle until either a `Schedule`, `Resume`, or `Exit` message is received. The `RECV_WORK` rule is the handler for `Schedule` messages, and is shown in figure 9.20. Upon arrival of a `Schedule` message, the sequence of events is as follows: the message is unpacked (see figure 9.21) and the closure contained therein is stored at heap address \(a_{\text{local}}\); next, an acknowledgement is sent to the donor processor; and, finally, the closure's standard-entry method is invoked. When the evaluation returns, the `EndThread` continuation will place the system into the `GetThread` mode, thereby re-starting the work-request cycle.

Figure 9.18: GUM load balancing: an unsuccessful work-request cycle

\[
\begin{aligned}
& \text{GetThread} \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{such that} \quad & \text{is\_empty } wp \ \text{ and } \ \neg\, \text{is\_fishing } wp \\
\Rightarrow \quad & \text{Send } request \ \text{GetThread} \ as \ rs \ us \ h \ t_{id} \ wp' \ \sigma \ b_i \\
\text{where} \quad & request = (i, neighbour, \text{Fish}) \\
& neighbour = 1 + (i \bmod n) \\
& wp' = \text{set\_fishing true } wp
\end{aligned}
\]

Figure 9.19: Initiating a GUM work-request cycle

\[
\begin{aligned}
& \text{Receive } message \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{such that} \quad & message = (j, i, \text{Schedule } a_{\text{remote}} \ (closure, mask)) \\
\Rightarrow \quad & \text{Send } ack \ code \ as \ rs \ us \ h' \ t_{id} \ wp' \ \sigma \ b_i \\
\text{where} \quad & ack = (i, j, \text{Ack } a_{\text{local}} \ a_{\text{remote}}) \\
& wp = (threads, sparks, fishing) \\
& wp' = (threads, sparks', \text{false}) \\
& sparks' = \text{insert}_{\text{spark}} \ a_{\text{local}} \ sparks \\
& (a_{\text{local}}, h') = \text{unpack } j \ closure \ mask \ h
\end{aligned}
\]

Figure 9.20: Receiving work from a remote processor

\[
\begin{aligned}
& \text{pack } a \ j \ h[a \mapsto (vs \ \pi \ xs \rightarrow exp, ws)] = (data, h') \\
\text{where} \quad & h' = \begin{cases} h[a \mapsto \text{Exported } j \ closure \ bk_{\text{empty}}], & \text{if } (\pi = u) \\ h, & \text{otherwise} \end{cases} \\
& data = (a, vs \ \pi \ xs \rightarrow exp, mask, ws) \\
& mask = mask_1 \cdots mask_n \\
& mask_i = \begin{cases} 0, & \text{if } \vdash (vs \,!\, i) : \nu \\ 1, & \text{otherwise} \end{cases} \\
& n = \text{length } vs \\
& bk_{\text{empty}} = (\emptyset_{\text{threads}}, \emptyset_{\text{fetches}})
\end{aligned}
\]

\[
\begin{aligned}
& \text{unpack } j \ closure \ mask \ h_0 = (a_{\text{local}}, h'_n) \\
\text{where} \quad & h'_n = h_n[a_{\text{local}} \mapsto (\lambda form, w'_1 \cdots w'_n)] \\
& (w'_i, h_i) = \begin{cases} (w_i, h_{i-1}), & \text{if } mask_i = 0 \\ (a_i, h_{i-1}[a_i \mapsto \text{FetchMe } j \ w_i \ bk_{\text{empty}}]), & \text{otherwise} \end{cases} \\
& closure = (\lambda form, w_1 \cdots w_n) \\
& mask = mask_1 \cdots mask_n \\
& bk_{\text{empty}} = (\emptyset_{\text{threads}}, \emptyset_{\text{fetches}})
\end{aligned}
\]

Figure 9.21: Incremental fetching: packing and unpacking Schedule messages

The unpacking process replaces all heap references contained within the new closure with local pointers to FetchMe closures. In addition to the main closure, GUM also packs some of the “nearby” reachable graph into each Schedule message [Trinder et al., 1996, section 2.4]. This improves the locality of reference, and reduces the impact of the fixed overhead of sending the message.
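
The role of the mask can be seen in a small sketch of packing and unpacking. It is a simplification under stated assumptions: the names `Addr`, `fresh_addresses`, `pack` and `unpack` are hypothetical, heaps are dictionaries, a closure is a pair of a code label and a list of free-variable values, the mask is derived from the run-time representation rather than from static types, and the conversion to an Exported closure and the blocking queues are omitted.

```python
import itertools
from collections import namedtuple

Addr = namedtuple("Addr", "value")       # distinguishes heap addresses from literals


def fresh_addresses(start):
    # Hypothetical allocator: an endless supply of unused local addresses.
    return (Addr(k) for k in itertools.count(start))


def pack(heap, a):
    code, ws = heap[a]
    mask = [1 if isinstance(w, Addr) else 0 for w in ws]   # 0 = literal, 1 = address
    return (a, code, mask, ws)


def unpack(heap, j, data, fresh):
    _a_remote, code, mask, ws = data
    local_ws = []
    for mask_i, w in zip(mask, ws):
        if mask_i == 0:
            local_ws.append(w)                 # literals are copied unchanged
        else:
            a_i = next(fresh)
            heap[a_i] = ("FetchMe", j, w)      # remote reference back to processor j
            local_ws.append(a_i)
    a_local = next(fresh)
    heap[a_local] = (code, local_ws)
    return a_local
```

After unpacking, the receiving heap holds the closure's code and literals directly, while every pointer field refers to a local FetchMe closure that names the donor processor and the original address.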

Load balancing – answering a request for work

When a processor receives a Fish request, and has surplus work, it returns a Schedule message containing a thunk for evaluation on the unemployed processor:

\[
\begin{aligned}
& \text{Receive } message \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{such that} \quad & message = (j, i, \text{Fish } age \ origin) \\
\text{and} \quad & \neg\, \text{is\_empty}_{\text{sparks}} \ wp \\
\Rightarrow \quad & code' \ as \ rs \ us \ h' \ t_{id} \ wp' \ \sigma \ b_i \\
\text{where} \quad & code' = \begin{cases} send\_work, & \text{if } (\pi = u) \\ retry, & \text{otherwise} \end{cases} \\
& send\_work = \text{Send } work \ code \\
& work = (i, j, \text{Schedule } a \ data) \\
& (data, h') = \text{pack } spark \ j \ h \\
& (vs \ \pi \ xs \rightarrow e, ws) = h \ spark \\
& (spark, wp') = \text{dequeue}_{\text{spark}} \ wp \\
& retry = \text{Receive } message \ code
\end{aligned}
\]

The **pack** routine converts the local thunk into a form suitable for transmission (see figure 9.21 for details). Furthermore, to avoid duplicating work, **pack** will convert thunks into **Exported** closures. This is a temporary measure until the destination processor acknowledges receipt of the **Schedule** message. If the message is lost, or the recipient cannot unpack the message for any reason, the original closure can be recovered.\(^1\) Assuming that nothing does go wrong, the acknowledgement is handled by rule **RECV_ACK**, as shown in figure 9.22.

**Representing and requesting remote references**

Values stored on remote processors are represented by **FetchMe** closures [Trinder et al., 1996, figure 2, section 2.3]. The related STG definitions are shown in table 9.8. Upon entry to a remote reference, the processor will send the owner a **Fetch** request, and suspend the current thread pending arrival of the value. Figure 9.25 shows these interactions, which are initiated by the **FETCH₁** rule, shown in figure 9.24.
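
The entry behaviour described above can be sketched as follows. This is a hypothetical simplification: `enter_fetch_me` is an invented name, a FetchMe closure is represented as the tuple `("FetchMe", owner, a_remote, blocked_threads)`, and the request is assumed to be sent only by the first thread to block, so that later threads simply join the queue.

```python
def enter_fetch_me(heap, i, a_local, tid, out_buffer):
    """Thread tid on processor i enters the FetchMe closure stored at a_local."""
    tag, owner, a_remote, blocked = heap[a_local]
    first_to_block = not blocked
    # Suspend the current thread on the closure.
    heap[a_local] = (tag, owner, a_remote, blocked + [tid])
    if first_to_block:
        # Only the first blocked thread triggers a Fetch request to the owner.
        out_buffer.append((i, owner, ("Fetch", a_remote, a_local)))
    return "GetThread"                   # the processor looks for other work
```

Recording the local address in the request allows the eventual reply to identify which closure should be updated.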

How are these remote references created in the first place? There are three main sources: the partitioning of the top-level bindings as specified by the **INIT** rule; closure migration as a result of a **Schedule** message; and the directed allocation of dynamic values by, for example, para-functional Haskell’s **on** expression [Mirani and Hudak, 1995, section 4].

\(^1\)Mattson’s *grey hole* [Mattson Jr., 1993a, figure 4.3, page 79] is another example of a reversible update.

\[
\begin{aligned}
& \text{Receive } message \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{such that} \quad & message \equiv (j, i, \text{Fish } age \ origin) \\
\text{and} \quad & \text{is\_empty}_{\text{sparks}} \ wp \\
\Rightarrow \quad & \text{Send } fish' \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{where} \quad & fish' = \begin{cases} (i, neighbour, \text{Fish } (age + 1) \ origin), & \text{if } age < age_{\text{stale}} \\ (i, origin, \text{Fish } age \ origin), & \text{otherwise} \end{cases} \\
& neighbour = 1 + (i \bmod n)
\end{aligned}
\]

\[
\begin{aligned}
& \text{Receive } message \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{such that} \quad & message \equiv (j, i, \text{Fish } age \ origin) \\
\text{and} \quad & (i = origin) \\
\Rightarrow \quad & code \ as \ rs \ us \ h \ t_{id} \ wp' \ \sigma \ b_i \\
\text{where} \quad & wp' = \text{set\_fishing false } wp
\end{aligned}
\]

Figure 9.23: GUM load balancing: the other FISH rules

<table>
<thead>
<tr>
<th>specification</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>code</td>
<td>Unblock blocked code: used to re-activate blocked threads and Fetch messages upon arrival of a remote-closure’s value (also used when updating black holes).</td>
</tr>
<tr>
<td>closure</td>
<td>FetchMe $i$ $a_{remote}$ blocked: a reference to a value stored on a remote processor</td>
</tr>
<tr>
<td></td>
<td>Exported $i$ closure blocked: work that has been exported to processor $i$ in response to a Fish message, but receipt of which has not yet been acknowledged</td>
</tr>
<tr>
<td>blocked</td>
<td>(threads, fetches): used to store details of any threads and Fetch messages which have become blocked on a closure. When the closure is updated, the threads will be re-awakened, and replies made to the Fetch messages.</td>
</tr>
<tr>
<td>threads</td>
<td>queue of $t_{id}$: an unordered collection of threads.</td>
</tr>
<tr>
<td>fetches</td>
<td>queue of $(i, a_{remote})$: an unordered collection of Fetch requests</td>
</tr>
</tbody>
</table>

Table 9.8: Representing remote references with GUM

\[
\begin{aligned}
& \text{Enter } a_{\text{local}} \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{such that} \quad & h[a_{\text{local}} \mapsto \text{FetchMe } j \ a_{\text{remote}} \ blocked] \\
\Rightarrow \quad & code \ as \ rs \ us \ h' \ t_{id} \ wp' \ \sigma \ b_i \\
\text{where} \quad & h' = h \begin{bmatrix} a_{\text{local}} & \mapsto & \text{FetchMe } j \ a_{\text{remote}} \ blocked' \\ t_{id} & \mapsto & \text{TSO } (\text{Enter } a_{\text{local}}, as, rs, us) \end{bmatrix} \\
& blocked' = \text{enqueue}_{\text{thread}} \ t_{id} \ blocked \\
& code = \begin{cases} \text{Send } fetch \ \text{GetThread}, & \text{if is\_empty } blocked \\ \text{GetThread}, & \text{otherwise} \end{cases} \\
& fetch = (i, j, \text{Fetch } a_{\text{remote}} \ a_{\text{local}})
\end{aligned}
\]

Figure 9.24: Handling remote references in a distributed-memory architecture

Figure 9.25: Accessing remote references with GUM

**Replying to a Fetch request**

The reply to a request for a local value is very similar to that used to send work to an unemployed processor (see the **SEND_WORK** rule). Essentially, both messages contain a packaged closure, although **Schedule** messages will contain thunks, while **Resume** messages will typically contain closures in head-normal form (i.e. they will be re-entrant, and so their update flags will be \( r \)). The following rule handles the packing and reply:

\[
\begin{aligned}
& \text{Receive } message \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{such that} \quad & message \equiv (j, i, \text{Fetch } a_{\text{local}} \ a_{\text{remote}}) \ \text{ and } \ (\pi = r) \\
\Rightarrow \quad & \text{Send } resume \ code \ as \ rs \ us \ h' \ t_{id} \ wp \ \sigma \ b_i \\
\text{where} \quad & resume = (i, j, \text{Resume } a_{\text{remote}} \ data) \\
& (data, h') = \text{pack } a_{\text{local}} \ j \ h \\
& (vs \ \pi \ xs \rightarrow e, ws) = h \ a_{\text{local}}
\end{aligned}
\]

**Remote references – receiving remote values**

Upon arrival of the Resume message, the remote-value is unpacked, and the FetchMe closure updated with an indirection to the new closure:

\[
\begin{aligned}
& \text{Receive } message \ code \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\text{such that} \quad & message \equiv (j, i, \text{Resume } a_{\text{local}} \ (closure, mask)) \\
\text{and} \quad & h[a_{\text{local}} \mapsto \text{FetchMe } j \ a_{\text{remote}} \ blocked] \\
\Rightarrow \quad & \text{Unblock } blocked \ code \ as \ rs \ us \ h'' \ t_{id} \ wp \ \sigma \ b_i \\
\text{where} \quad & h'' = h'[a_{\text{local}} \mapsto \text{Ind } a'] \\
& (a', h') = \text{unpack } j \ closure \ mask \ h
\end{aligned}
\]

The Unblock phase is responsible for awakening any threads that were waiting for the remote value (there will be at least one, otherwise the Fetch would never have been sent). In addition, it also replies to any blocked fetches (the following section details how this can happen):

\[
\begin{array}{ll}
& \text{Unblock blocked code}_0 \ as \ rs \ us \ h_0 \ t_{id} \ wp \ \sigma \ b_i \\
\Longrightarrow & \text{code}_n \ as \ rs \ us \ h_n \ t_{id} \ wp' \ \sigma \ b_i \\
\text{where} & wp' = \text{insert}_{\text{threads}} \ \text{threads} \ wp \\
& \text{code}_k = \text{Send resume}_k \ \text{code}_{k-1} \\
& \text{resume}_k = (i, \text{source}_k, \text{Resume } a_k \ \text{data}_k) \\
& (\text{data}_k, h_k) = \text{pack } a_u \ \text{source}_k \ h_{k-1} \\
& (\text{source}_k, a_k) = \text{fetches} \, ! \, k \\
& n = \text{length fetches} \\
& (\text{threads}, \text{fetches}) = \text{blocked}
\end{array}
\]

The rule for updating shared thunks, \( BH_3 \), uses the unblock rule to re-awaken the threads and fetches which have been waiting for the local evaluation to complete.

**Remote references – requesting black-holed values**

Having described the basic mechanism for dealing with remote references, one complication remains. It is possible that a processor is asked for a value which is still being evaluated, i.e. it has been black holed. In this case, the Fetch message is simply added to the black hole's blocking pool:

\[
\begin{array}{lll}
(\text{FETCH}_4) & & \text{Receive message code } as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
& \text{such that} & \text{message} \equiv (j, i, \text{Fetch } a_{\text{local}} \ a_{\text{remote}}) \\
& \text{and} & h[a_{\text{local}} \mapsto \text{BlackHole blocked}] \\
& \Longrightarrow & \text{code } as \ rs \ us \ h' \ t_{id} \ wp \ \sigma \ b_i \\
& \text{where} & h' = h[a_{\text{local}} \mapsto \text{BlackHole blocked}'] \\
& & \text{blocked}' = \text{insert}_{\text{fetches}} \ (j, a_{\text{remote}}) \ \text{blocked}
\end{array}
\]

When the closure is finally updated, via the \( BH_3 \) rule, the UNBLOCK rule will ensure the suspended Fetch messages are replied to. Figure 9.26 provides an example of this sort of interaction.
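The interplay between the fetch, resume, unblock and black-hole rules can be made concrete with a small executable model. The following Haskell fragment is only a minimal sketch under simplifying assumptions: the heap is a finite map, a packed closure is reduced to a single `Int`, and the names `Closure`, `Envelope`, `receive`, `update`, `unblock` and `freshAddr` are hypothetical and do not come from the GUM implementation. Each step returns the new heap, the messages to send, and the threads to wake.

```haskell
import qualified Data.Map as M

type Addr = Int
type Pid  = Int
type Tid  = Int

-- Blocking pool: waiting local threads and suspended fetches (source, remote address).
type Blocked = ([Tid], [(Pid, Addr)])

data Closure
  = Value Int                 -- stands in for an evaluated, packable closure (assumption)
  | BlackHole Blocked         -- closure currently under evaluation
  | FetchMe Pid Addr Blocked  -- reference to a closure held by another processor
  | Ind Addr                  -- indirection to a local closure
  deriving (Eq, Show)

data Message
  = Fetch Addr Addr           -- Fetch a_local a_remote
  | Resume Addr Int           -- Resume a_local data
  deriving (Eq, Show)

type Envelope = (Pid, Pid, Message)  -- (source, destination, message)
type Heap     = M.Map Addr Closure

-- Wake the blocked threads and answer every suspended fetch.
unblock :: Pid -> Int -> Blocked -> ([Tid], [Envelope])
unblock i v (threads, fetches) =
  (threads, [(i, source, Resume a v) | (source, a) <- fetches])

-- A fresh heap address (hypothetical allocation policy).
freshAddr :: Heap -> Addr
freshAddr h = if M.null h then 0 else fst (M.findMax h) + 1

-- Message reception on processor i.
receive :: Pid -> Envelope -> Heap -> (Heap, [Envelope], [Tid])
receive i (j, _, Fetch aLocal aRemote) h =
  case M.lookup aLocal h of
    Just (Value v) -> (h, [(i, j, Resume aRemote v)], [])  -- pack and reply
    Just (BlackHole (ts, fs)) ->                           -- suspend the fetch
      (M.insert aLocal (BlackHole (ts, fs ++ [(j, aRemote)])) h, [], [])
    _ -> (h, [], [])
receive i (_, _, Resume aLocal v) h =
  case M.lookup aLocal h of
    Just (FetchMe _ _ blocked) ->
      let a'        = freshAddr h
          h'        = M.insert a' (Value v) h       -- unpack the value
          h''       = M.insert aLocal (Ind a') h'   -- overwrite the FetchMe
          (ts, out) = unblock i v blocked
      in (h'', out, ts)
    _ -> (h, [], [])

-- Updating a black hole with its value releases everything that was waiting.
update :: Pid -> Addr -> Int -> Heap -> (Heap, [Envelope], [Tid])
update i a v h =
  case M.lookup a h of
    Just (BlackHole blocked) ->
      let (ts, out) = unblock i v blocked
      in (M.insert a (Value v) h, out, ts)
    _ -> (M.insert a (Value v) h, [], [])
```

In this model a Fetch that arrives at a black hole produces no reply; the reply is generated only when `update` later overwrites the black hole, which mirrors the behaviour described for the \( BH_3 \) and UNBLOCK rules.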

**Initialisation**

As discussed in section 6.3.4, GUM replicates all top-level closures on all processors. While this is expensive in terms of memory, it does improve locality, thereby avoiding processors becoming inundated with requests for "popular" global values. By copying constant applicative forms (CAFs [Peyton Jones, 1987, section 13.2, page 224]), there is a risk of duplicating work. However, if this should become a problem, it is straightforward to re-write an STG' program such that the CAF becomes a local shared value. GUM's INIT rule is shown in figure 9.27.
\[\begin{align*}
G &= (P_1, \ldots, P_n) (b_1, \ldots, b_n) \\
\text{where} & \quad P_i = (\text{GetThread}, \emptyset, \emptyset, \emptyset, \text{t\_none}, \text{wp}_i, h_i, \sigma) \\
b_i &= (\emptyset, \text{in}, \text{out})_i \\
\text{wp}_i &= (\emptyset, \text{sparks}_i, \text{false}) \\
\text{sparks}_i &= \begin{cases}
\langle a_{\text{main}} \rangle, & \text{if } i = 1 \\
\emptyset, & \text{otherwise}
\end{cases} \\
h_i &= \begin{cases}
\text{h}_{\text{at\_main}} \mapsto \text{FetchMe} 1 a_{\text{main}}, & \text{if } i = 1 \\
h_i &\quad \text{otherwise}
\end{cases} \\
h &= \begin{cases}
a_1 &\mapsto (\text{vs}_1 \pi_1 \text{vs}_1 \rightarrow \text{exp}_1, \sigma \text{vs}_1) \\
\cdots \\
a_n &\mapsto (\text{vs}_n \pi_n \text{vs}_n \rightarrow \text{exp}_n, \sigma \text{vs}_n)
\end{cases} \\
\sigma &= \begin{cases}
g_1 &\mapsto a_1, \\
\cdots \\
g_n &\mapsto a_n
\end{cases}
\end{align*}\]

Figure 9.27: GUM initialisation

**Termination**

As with the previous case study, the computation is finished whenever the main thread terminates:

\[\begin{align*}
&\text{Return}_{\chi} \text{ctws} \quad \emptyset \quad \langle \text{Finish} \rangle \quad \emptyset \\
\Rightarrow &\quad \text{Broadcast Exit Stop} \\
\text{where } h' &= h[\sigma_{\text{status}} \mapsto \text{Stopped}]
\end{align*}\]

However, rather than relying on a global variable to indicate the end of the evaluation, an Exit message is broadcast to all other processors:

\[\begin{align*}
&\text{Receive Exit code as } rs \quad us \quad h \quad t_{\text{id}} \quad \text{wp} \quad \sigma \quad b_i \\
\Rightarrow &\quad \text{Stop as } rs \quad us \quad h \quad t_{\text{id}} \quad \text{wp} \quad \sigma \quad b_i
\end{align*}\]

**Distributed garbage collection**

The remote-reference mechanisms described in the previous sections completely ignored the implications of global garbage collection. While it would be possible to develop a distributed collector that could handle this situation, it is likely that it would be horribly inefficient. The real GUM implementation provides better support for its global collector by maintaining three tables [Trinder, Hammond, Partridge, Peyton Jones and others, 1996, sections 2.3.1 and 2.3.2]:

**GIT** the global indirection table identifies all local closures that are globally visible.

**GA→LA** this maps a remote reference to a local closure.

**LA→GA** this maps local addresses to their global addresses.

This information allows the collector to identify all global addresses and to efficiently determine whether any of them can be reclaimed. While it would be possible to extend the GUM model to record this information, it is beyond the scope of this thesis.

Table 9.9: The register map for compiling GUM expressions

<table>
<thead>
<tr><th>number</th><th>register</th><th>use</th></tr>
</thead>
<tbody>
<tr><td>1–22</td><td></td><td></td></tr>
<tr><td>23</td><td>Ret</td><td>stores the address of the return handler for the current evaluation (which may be a generic update-handler, when evaluating a polymorphic thunk).</td></tr>
<tr><td>24</td><td>Np</td><td>points to the closure which is currently being evaluated, and is used to access an expression’s free variables.</td></tr>
<tr><td>25</td><td>StkA</td><td>points to the next available slot on the A stack. This is used in conjunction with StkB to detect stack overflow.</td></tr>
<tr><td>26</td><td>StkABase</td><td>points to the lower limit of the A stack, and is used to detect stack underflow.</td></tr>
<tr><td>27</td><td>StkB</td><td>points to the next available slot on the B stack.</td></tr>
<tr><td>28</td><td>StkBBase</td><td>points to the upper limit of the B stack, and is used to detect stack underflow.</td></tr>
<tr><td>29</td><td>Tp</td><td>points to the current thread’s TSO closure.</td></tr>
<tr><td>30</td><td>HLimit</td><td>identifies the maximum extent of the local heap, and is used in conjunction with Hp to determine if the garbage collector should be invoked.</td></tr>
<tr><td>31</td><td>Hp</td><td>points to the next word of available memory in the local heap. Allocation simply involves incrementing the pointer and then using the space reserved (plus the necessary heap-overflow check).</td></tr>
</tbody>
</table>

9.3.3 Compilation rules

Developing the GUM compilation rules was straightforward as it primarily involved minor modifications to the compilation rules and run-time support developed as part of the previous case study. The only significant changes included the addition of a number of extra entry points in the info tables (to support packing and fetching), and the integration of the message-passing routines. These are discussed in the following sections, and use the register map shown in table 9.9.

**Sending and receiving messages**

The API for the send, receive, and poll primitives used by the architecture simulator is shown in table 7.3. These deal with blocks of words, onto which GUM imposes the following structure:

<table>
<thead>
<tr>
<th>word</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3+</th>
</tr>
</thead>
<tbody>
<tr>
<td>content</td>
<td>source</td>
<td>destination</td>
<td>message tag</td>
<td>message-specific content</td>
</tr>
</tbody>
</table>

This format is slightly inefficient in that the sender/receiver pair occupies two words, when it could be packed into one or two bytes (depending upon the total number of processors). However, this change would increase the complexity of the message-handling routines for only a small return in space saved. Note that each message is tagged with its type, allowing the receiver to efficiently dispatch the message to the correct handler. Figure 9.28 shows
the RISC code which implements the operational RECV rule, and the corresponding tags are shown below:

<table>
<thead>
<tr>
<th>message</th>
<th>Fish</th>
<th>Schedule</th>
<th>Ack</th>
<th>Fetch</th>
<th>Resume</th>
<th>Exit</th>
</tr>
</thead>
<tbody>
<tr>
<td>tag</td>
<td>0</td>
<td>4</td>
<td>8</td>
<td>12</td>
<td>16</td>
<td>20</td>
</tr>
</tbody>
</table>
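The word layout and tag values above can be captured in a few lines of Haskell. This is a minimal sketch rather than the simulator's API: the names `Msg`, `encode`, `decode`, `tagWord` and `wordTag` are hypothetical, and words are modelled as plain `Int` values. The tags are spaced four apart, as in the table; one plausible reading, which is an assumption here, is that a tag can then be used directly as a byte offset into a table of word-sized handler addresses.

```haskell
-- Message kinds in the order of the tag table.
data Tag = Fish | Schedule | Ack | Fetch | Resume | Exit
  deriving (Eq, Show, Enum, Bounded)

-- Tags are multiples of four: Fish = 0, Schedule = 4, ..., Exit = 20.
tagWord :: Tag -> Int
tagWord t = 4 * fromEnum t

-- Recover the message kind from a tag word, rejecting unknown values.
wordTag :: Int -> Maybe Tag
wordTag w
  | w `mod` 4 == 0 && k >= 0 && k <= fromEnum (maxBound :: Tag) = Just (toEnum k)
  | otherwise = Nothing
  where k = w `div` 4

data Msg = Msg
  { source      :: Int    -- word 0
  , destination :: Int    -- word 1
  , tag         :: Tag    -- word 2
  , payload     :: [Int]  -- words 3+
  } deriving (Eq, Show)

-- Flatten a message into a block of words.
encode :: Msg -> [Int]
encode (Msg s d t ws) = s : d : tagWord t : ws

-- Rebuild a message from a block of words, if it is well formed.
decode :: [Int] -> Maybe Msg
decode (s : d : w : ws) = fmap (\t -> Msg s d t ws) (wordTag w)
decode _                = Nothing
```

For example, `encode (Msg 1 2 Fetch [10, 5])` yields the block `[1, 2, 12, 10, 5]`, and `decode` inverts it.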

**Packing and fetching**

The behaviour of the fetching and packing mechanisms is closure-specific. For example, the STG FETCH\(_2\) and FETCH\(_4\) rules specify how to fetch standard closures and black holes respectively. Rather than tagging each closure, the standard approach to handling closure-specific code is to add a new entry method (see section 6.4.3). To this end, figure 9.29 shows the info table for a standard closure. Note that the pack and fetch methods are bundled together with the garbage collection operations (see section 6.3.3). While each STG' binding will generate a unique info table (and associated entry code), they can share a small collection of GC and communication methods\(^2\). Figure 9.30 shows the pack operation for handling re-entrant closures.

\(^2\)The literal- and boxed-counts stored in the main info table make this sharing possible. Given these two pieces of information, it is possible to infer the exact layout of the closure.
Figure 9.29: Layout of the GUM info tables for a standard closure

---

_RISC code_

```risc
// pack API:
// R19 return continuation
// R21 buffer
// R22 buffer length

Lpack_reentrant_closure:
  load (RNp), R1;       // load the info table
  store R1, +4(R21);   // and store it in the buffer
  load +12(R1), R2;    // load the number of literals
  load +16(R1), R1;    // and the number of boxed values
  add R1, R2, R1;      // find the total size
  store R1, (R21);     // store it in the buffer
  add R21, +8, R21;    // move the index forward
  subtract R22, +8, R22; // decrement the space remaining (two words written)
  add RNp, +4, R2;     // point to the free vars

Lpack_re_loop:
  branch_x=0 R1, Lfinish_re_pack; // exit if no more values
  load (R2), R3;          // obtain the next value
  store R3, (R21);       // pack it
  add R2, +4, R2;        // move the index forward
  add R21, +4, R21;      // and the buffer index
  subtract R1, +1, R1;   // decrement the counter
  subtract R22, +4, R22; // and the space remaining
  branch Lpack_re_loop;  // and repeat.

Lfinish_re_pack:
  jump R19;
```

Figure 9.30: The RISC implementation of the _pack_ method for re-entrant closures
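The buffer layout produced by the pack method can be summarised functionally. Reading the stores in figure 9.30, the first word of the packed closure holds the total number of values, the second holds the info-table pointer, and the free variables follow. The Haskell fragment below is a sketch of that layout only; the record `Info`, its field names, the 4-byte `wordSize` and the function `packReentrant` are assumptions made for illustration, and the sketch charges the buffer for every word written.

```haskell
-- Hypothetical view of the info-table fields read by the pack method.
data Info = Info
  { infoPtr   :: Int  -- address of the info table
  , nLiterals :: Int  -- number of literal (unboxed) free variables
  , nBoxed    :: Int  -- number of boxed free variables
  }

-- Word size in bytes (assumed from the +4 offsets in the RISC code).
wordSize :: Int
wordSize = 4

-- Pack a closure as [total size, info pointer, free variables...].
-- Returns the packed words and the buffer space (in bytes) left over.
packReentrant :: Info -> [Int] -> Int -> ([Int], Int)
packReentrant info freeVars space = (packed, space - wordSize * length packed)
  where
    total  = nLiterals info + nBoxed info
    packed = total : infoPtr info : take total freeVars
```

Because the literal and boxed counts are read from the info table, the same routine serves every closure with this layout, which is the sharing that the footnote describes.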
9.3.4 Other message-passing systems

Although a quarter of the 92 $\nu$-STG machine rules involve some form of message passing, the abstract state [Hwang and Rushall, 1992, section 3] does not include a communications component. Instead, sending is specified via a (side-effecting) auxiliary function, `sendMessage`, and a dedicated mode handles the implicit reception of each kind of message. The capabilities of the resulting system are, however, similar to those outlined in this section.

The Alfalfa system [Goldberg and Hudak, 1987, section 4.6, pages 106–107] uses three types of messages: `system messages`, containing load information and other administrative details used by the scheduling system; `reducer messages`, similar to GUM’s `Schedule`, except they are sent pro-actively; and `storage messages`, which are the main component of the reference-counting garbage collector, and are generated whenever a reference to a closure is either replicated or deleted (GUM uses a `Free` message to implement a similar system).

Concurrent Clean [Nöcker, Smetsers, Plasmeijer and van Eekelen, 1991, section 5.0, pages 215–216] uses `channel` nodes to handle remote references. These are almost identical to GUM’s `FetchMe` closures.

9.3.5 Performance

As with the previous case study, the `fib` and `queens` benchmarks are used to evaluate the performance of the GUM model. Both the STG and RISC animations allow the costs of sending and receiving messages to be modified (using the `LogP` communication model [Culler et al., 1993]; see section 2.2.1). As such, the performance evaluations consider the effect of transmission time on the model's performance. To simplify the presentation of the results, the following categories are used to describe the various costs:

<table>
<thead>
<tr>
<th>cost</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0–50</td>
<td>very low</td>
</tr>
<tr>
<td>50–150</td>
<td>low</td>
</tr>
<tr>
<td>150–500</td>
<td>medium</td>
</tr>
<tr>
<td>500+</td>
<td>high</td>
</tr>
</tbody>
</table>

As the STG and RISC animations produce very similar results, only those for the STG simulation are presented here.
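For readers unfamiliar with LogP, the model charges a single small message the sender's overhead, the network latency, and the receiver's overhead, i.e. \(L + 2o\). The following Haskell sketch records the four LogP parameters and classifies a cost according to the bands in the table above. The names `LogP`, `messageTime` and `category` are hypothetical, and because the table's bands share their end points, the sketch assumes that a boundary value belongs to the higher band.

```haskell
-- The four LogP parameters: latency, overhead, gap, and processor count.
data LogP = LogP
  { latency    :: Int
  , overhead   :: Int
  , gap        :: Int
  , processors :: Int
  }

-- End-to-end time for one small message: send overhead + latency + receive overhead.
messageTime :: LogP -> Int
messageTime p = latency p + 2 * overhead p

data Category = VeryLow | Low | Medium | High
  deriving (Eq, Show)

-- Classify a cost using the bands in the table.
-- Assumption: shared boundaries (50, 150, 500) fall into the higher band.
category :: Int -> Category
category c
  | c < 50    = VeryLow
  | c < 150   = Low
  | c < 500   = Medium
  | otherwise = High
```

With these definitions, `messageTime (LogP 50 10 4 15)` is 70, and `category 50` is `Low` under the stated boundary assumption.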

**The `fib` benchmark**

The relative speedup curve for the optimised `fib` 15 STG benchmark is shown in figure 9.32 (with medium communication costs). The system achieves a maximum speedup of just over two, which compares poorly with the speculative GMSV model (this runs approximately eighteen times faster on twenty processors). However, the STG animation does exhibit the performance trail-off associated with adding surplus processors. With the speculative system, only the RISC animation was accurate enough to reproduce this phenomenon.

Further analysis of the animation traces quickly revealed that the work-distribution mechanism was stripping processors of their `fib_n_less1` and `fib_n_less2` sparks, leaving them with just the addition operation. This could not proceed until the two sparks were evaluated, so many processors spent significant portions of their time idling. The solution was to re-write the `fib` benchmark, resulting in the `fib2` program shown in figure 9.31. This simple change ensures that a processor retains a significant portion of the work for itself, irrespective of the distribution mechanism. This produced the speedup curves shown in figure 9.33, which exhibit far better scalability than the fib benchmark. By re-writing the SEND_WORK rule to retain sufficient local work, it is possible to achieve similar results for the fib benchmark. However, it is easy to envisage programs where this policy would be equally damaging.

Figure 9.31: The parallel STG' fib2 -O benchmark

The sensitivity of the system to changes in the grain size and ordering is not surprising. The communication overheads are sufficiently high that a spark has to represent a significant amount of work before it is worth distributing. Consequently, there is a significant body of work dealing with estimating the grain size of general expressions; Sands [1990] provides an excellent introduction to this field.

Figure 9.34 shows the speedup curves for fib2 15 for a range of communication costs (the key details the message-latency parameter, $L$). All curves achieve significant speedups, with similar results being observed for low numbers of processors. The best overall result is obtained when the costs are very low, and performance is only slightly inferior to that for the GMSV system. However, the results for the low-cost situation are poor when compared to both the medium- and high-cost situations.

Further investigation revealed the source of this unexpected result: the load-distribution mechanism. The passive load-balancing works well when there is sufficient work available for all of the processors. This is the situation in the early and mid phases of the computation, or when the number of processors is low. However, as soon as work becomes scarce, a large number of Fish messages are injected into the system. This hinders the processors that are performing useful computations. With low communication costs, more Fish messages can be generated and re-spawned within a fixed period than in the medium- and high-cost scenarios. Figure 9.35 lists the total number of messages sent during two particular runs, and figure 9.36 histograms the number of Fish messages for a range of costs (each run used 15 processors). The figures show that, on average, a processor will receive over ten times the number of Fish messages with low-cost communications. As the overhead parameter, $o$, is comparable for all of the runs (except for the very low cost model), this causes the observed poor performance.
Figure 9.32: Relative speedups for the conservative fib -O benchmark

Figure 9.33: Relative speedups for the conservative fib2 -O benchmark
Figure 9.34: The impact of message latency on the fib2 -O benchmark

<table>
<thead>
<tr>
<th>message type</th>
<th>number of messages</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>$L = 50$</td>
</tr>
<tr>
<td>Fish</td>
<td>7020</td>
</tr>
<tr>
<td>Schedule</td>
<td>217</td>
</tr>
<tr>
<td>Ack</td>
<td>217</td>
</tr>
<tr>
<td>Fetch</td>
<td>264</td>
</tr>
<tr>
<td>Resume</td>
<td>264</td>
</tr>
<tr>
<td>Exit</td>
<td>14</td>
</tr>
</tbody>
</table>

Figure 9.35: Total messages sent during the fib2 15
Figure 9.36: Communication costs and GUM load-balancing messages
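The dominance of load-balancing traffic can be read directly from the $L = 50$ counts in figure 9.35. The short Haskell calculation below is an illustrative sketch (the names `counts`, `total` and `share` are hypothetical); it shows that Fish messages make up roughly 88% of the 7996 messages sent in that run. The 14 Exit messages are also consistent with one processor broadcasting to the other fourteen in a 15-processor run.

```haskell
-- Message counts for the L = 50 run, copied from figure 9.35.
counts :: [(String, Int)]
counts =
  [ ("Fish", 7020), ("Schedule", 217), ("Ack", 217)
  , ("Fetch", 264), ("Resume", 264), ("Exit", 14) ]

-- Total number of messages sent during the run.
total :: Int
total = sum (map snd counts)

-- Fraction of the traffic contributed by one message type.
share :: String -> Maybe Double
share name =
  fmap (\n -> fromIntegral n / fromIntegral total) (lookup name counts)
```

Here `total` evaluates to 7996, and `share "Fish"` is approximately 0.878.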
**The queens benchmark**

Figure 9.37 shows the speedup curve for the queens 6 benchmark. The results are worse than even those for the unmodified fib benchmark. As with fib2, queens2 is a modified version of the benchmark, which attempts to increase the grain size of the computation by strictly evaluating the list comprehensions for the subproblems:

```
gen_comprehension' = \r [nq ds] ->
  case ds of {
    Nil  -> Nil [];  
    : x xs -> letpar tl = gen_comprehension' nq xs in
      let { qs = [nq] \u [] -> const.Int.enumFromTo one nq; } in
      let { n = [tl x nq qs] \u [] -> sc.TBSn tl x nq qs; } in
      case length n of { Int x -> n; }};
```

The length method is used to force the computation of the entire list, and therefore plays the role of an evaluation transformer, as described by Burn [1991] (see section 2.4.4). The speedup curve for queens2 is also shown in figure 9.37, and is almost twice as efficient, despite performing some unnecessary work. However, these results are still poor. The combination of the low number of available threads and the high degree of interaction between them is such that GUM’s unstructured placement and scheduling of tasks is sub-optimal. It is likely, however, that increasing the problem size would significantly improve the speedup curves: queens 10 is typically used for benchmarking real GMSV and DMMP implementations. The HDG-machine [Kingdon, Lester and Burn, 1991] is one of the few DMMP exceptions, achieving a speedup of just under three for queens 6 on four processors. However, as the HDG-machine uses a primitive model of graph-reduction, combined with the fast Transputer communication network, any comparison would be unfair.
9.3.6 Assessment

GUM's operational model is considerably more complex than that of the speculative GMSV case study. Fortunately, the scheduling and synchronisation rules from the previous study could be re-used after only minor modifications. However, designing and testing the load-balancing and remote-reference mechanisms was sufficiently involved that it would have been almost impossible without the use of the UML interaction diagrams and the STG animation. Using these tools, most of the complexity disappeared, and the techniques described in chapter 6 proved satisfactory. Indeed, the animation quickly revealed unanticipated run-time interactions, and led directly to the development of the FETCH and UNBLOCK rules. Throughout the development, the denotational semantics provided reference points against which the operational model could be tested for correctness.

The main limitation of the testing was the problem size that could be handled by the STG animation (the RISC animation was used only to confirm that the STG animation was producing credible results). While this is a common problem with simulations, the STG animation could cope with larger problem sizes if used on a more powerful workstation (see table 4.8, which includes sequential results for queens 8). Furthermore, larger problem sizes tend to hide the effect of inefficient use of the communication network, and it could be argued that small problems are therefore better for debugging performance.

9.4 Para-functional Haskell: data placement

Para-functional Haskell [Hudak, 1991] is the composition of two meta-languages, one for defining what is to be computed (Haskell), and the other for specifying how it is to be evaluated, i.e. a co-ordination language [Gelernter and Carriero, 1992]. The co-ordination operators fall into two categories: data placement, for controlling where the evaluation should take place; and scheduling, for specifying the order of evaluation. This case study is concerned with data placement, and, more specifically, the on operator.

While para-functional Haskell has only been implemented on a GMSV architecture (the Encore Multimax), ParAfl (a precursor to para-functional Haskell [Hudak, 1988]) was ported to the Intel iPSC (DMMP). Both implementations exhibited significant relative speedups [Hudak, 1988, figure 2, page 57] for a matrix-multiplication benchmark. The GMSV system achieved almost linear speedup with up to twelve processors, while the DMMP implementation managed a maximum speedup of just under three on fifteen processors. More recently, Mirani and Hudak [1995] have used monads [Peyton Jones and Wadler, 1993] to structure and enhance the communication language. The GMSV implementation (running on a sixteen processor Silicon Graphics Challenge) has demonstrated relative speedups on a range of benchmarks: matrix multiplication, fib, queens, and a sorting algorithm.

9.4.1 Static semantics

This section looks at the static semantics of a para-functional STG' language. The following informal description of para-functional Haskell will serve as the motivator for the remainder of the section.

**The informal semantics of para-functional Haskell**

Para-functional Haskell uses mapped expressions to control the placement (and evaluation) of expressions: `exp on proc`. This declares that `exp` should be evaluated on the processor identified by `proc`. Consider the following example:

```haskell
-- para-functional Haskell
let x   = f 2
    y   = f 3
    f a = a * a
in ((x on 1) + (y on 2)) on 3
```

The (simplified) allocation of tasks to processors is as follows:

<table>
<thead>
<tr>
<th>processor</th>
<th>task</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2 * 2</td>
</tr>
<tr>
<td>2</td>
<td>3 * 3</td>
</tr>
<tr>
<td>3</td>
<td>4 + 9</td>
</tr>
</tbody>
</table>

However, the evaluation will still proceed sequentially. Para-functional Haskell uses sched expressions to spark parallel tasks. For example, `f x sched Dx` denotes that `x` can be evaluated in parallel with the application `f x`. To allow the construction of topologies, the `self` operator returns the id of the local processor:

```haskell
-- para-functional Haskell
divconq split combine endtest endval = f
  where
    f x | endtest x = endval x
        | otherwise = combine left right sched Dleft|Dright
      where (l, r) = split x
            left   = f l on self - 1
            right  = f r on self + 1
```

Note that these are virtual topologies, as para-functional Haskell does not provide access to the current number of real processors. The run-time system is responsible for the mapping between real and virtual processors.
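To see which virtual processors the `divconq` skeleton touches, the placement of its tasks can be computed directly. The Haskell sketch below is illustrative only: `Tree`, `placement`, `leaves` and `realPid` are hypothetical names, the problem is assumed to split a fixed number of times, and the wrap-around mapping onto real processors is just one possible policy, since the text says only that the run-time system is responsible for the mapping.

```haskell
-- The virtual processor on which each divconq task is evaluated.
data Tree = Leaf Int | Node Int Tree Tree
  deriving (Eq, Show)

-- Placement for a problem that splits `depth` times, starting on `self`:
-- the left half is mapped to self - 1 and the right half to self + 1.
placement :: Int -> Int -> Tree
placement self depth
  | depth <= 0 = Leaf self
  | otherwise  = Node self (placement (self - 1) (depth - 1))
                           (placement (self + 1) (depth - 1))

-- The virtual processors that evaluate the base cases, left to right.
leaves :: Tree -> [Int]
leaves (Leaf p)     = [p]
leaves (Node _ l r) = leaves l ++ leaves r

-- Hypothetical run-time policy: wrap virtual ids around n real processors.
realPid :: Int -> Int -> Int
realPid n v = v `mod` n
```

For instance, `leaves (placement 0 2)` is `[-2, 0, 0, 2]`, and mapping these ids onto four real processors with `realPid 4` gives `[2, 0, 0, 2]`. Virtual ids can therefore be negative, and distinct tasks may share a processor.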

**The para-functional STG' language**

Having had an informal look at the on, sched, and self expressions, there are two possible approaches to developing an STG' variant. The first would be to preserve the distinction between mapping and scheduling expressions, while the second would combine them. Both would result in new variants of the let expression. However, in the limited collection of para-functional programs available, the on operator never appears apart from the sched construct. The second approach was therefore selected, and figure 9.38 shows the abstract syntax, free-variable rules, and type-inference rules of the leton expression.

With regard to the self construct, either an expression or primitive function could be used. However, section 5.2 clearly recommends the use of an expression, and figure 9.39 shows the necessary definitions. Note that processor ids are represented by unboxed integers, allowing the full complement of arithmetic operators to be used to manipulate them. While the type system could have been extended to include a `Pid#` type, this would complicate the specification of virtual topologies for no real benefit.

The new expressions are related to the traditional letpar as follows:

```haskell
letpar simple_bind exp ≡ let# pid = self in
                  leton pid simple_bind exp
```
Figure 9.38: The \texttt{leton} expression: abstract syntax, free variables, and type inference

\[
\begin{array}{ll}
\text{abstract syntax} & exp \rightarrow \text{leton } atom \ simple\_bind \ exp \qquad \text{(task-mapping expression)} \\[1ex]
\text{free variables} & FV_{exp}[\text{leton } pid \ var = exp_{rhs} \ exp_{body}] \, g = FV_{atom}[pid] \, g \cup FV_{exp}[exp_{rhs}] \, g \cup (FV_{exp}[exp_{body}] \, g \setminus \{var\}) \\[1ex]
\text{type inference} & \text{LETON-EXP} \quad \dfrac{\begin{array}{l}
TE \vdash pid : \text{Int}\# \\
TE \vdash simplebind : (var, \chi \, \pi_1 \ldots \pi_v) \\
LVE = \{var \mapsto \chi \, \pi_1 \ldots \pi_v\} \\
TE \oplus LVE \vdash exp : \tau_{exp}
\end{array}}{TE \vdash \text{leton } pid \ simplebind \ exp : \tau_{exp}}
\end{array}
\]

Figure 9.39: The \texttt{self} expression: abstract syntax, free variables, and type inference

\[
\begin{array}{ll}
\text{abstract syntax} & exp \rightarrow \text{self} \qquad \text{(processor id)} \\[1ex]
\text{free variables} & FV_{exp}[\text{self}] \, g = \{\} \\[1ex]
\text{type inference} & \text{SELF-EXP} \quad \dfrac{}{TE \vdash \text{self} : \text{Int}\#}
\end{array}
\]
**Denotational semantics**

Based on the informal description given at the start of the section, it would appear that the self expression has introduced non-determinism into the language. To avoid this unwanted situation, Hudak [1986] extended the denotational semantics of a simple functional language to include the notion of location. Most of the semantic functions are extended to take the current processor id as an additional argument:

\[
\mathcal{E}[\text{letrec binds main}] \{\}_{env} 0_{pid}
\]

At the start of evaluation, the current processor is set to zero:

\[
\begin{aligned}
\text{Program}[\text{program}] &: \text{Val} \\
\text{Program}[\text{typedecls binds}] &= \mathcal{E}[\text{letrec binds main}] \ \{\}_{env} \ 0_{pid}
\end{aligned}
\]

The majority of the semantic equations either ignore the processor id, or simply thread it through their sub-expressions. Only the leton and self expressions actually manipulate the new parameter:

\[
\begin{array}{l}
\mathcal{E}[\text{self}] \ \rho \ cpid = cpid \\[1ex]
\mathcal{E}[\text{leton } pid \ var = exp_{rhs} \ exp_{body}] \ \rho \ cpid = \text{case } \mathcal{E}[exp_{rhs}] \ \rho \ cpid' \text{ of} \\
\qquad \bot \rightarrow \bot \\
\qquad \epsilon \rightarrow \mathcal{E}[exp_{body}] \ \rho' \ cpid \\
\qquad \text{where } \rho' = \rho \oplus \{var \mapsto \epsilon\} \\
\qquad \phantom{\text{where }} cpid' = \text{Atom}[pid] \ \rho
\end{array}
\]

The following para-functional STG' program, for example, denotes the value \texttt{Int 0}:

```stg'
main = [] [] -> let# x = self# in
       let# right = plusInt# [x, int 1#] in
       let# left = minusInt# [x, int 1#] in
       leton right a = let# here = self# in Int [here] in
       leton left b = let# here = self# in Int [here] in
       const.Int.+ a b;
```
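The claim that this program denotes `Int 0` can be checked with a small interpreter that threads the current processor id through evaluation, in the style of the equations above. The Haskell fragment below is a minimal sketch, not the thesis's semantics: it covers only the constructs used in the example, conflates boxed and unboxed integers into `Int`, and the names `Exp`, `Atom`, `eval`, `atom` and `example` are hypothetical.

```haskell
type Pid = Int
type Env = [(String, Int)]

data Atom = Var String | Lit Int

-- A tiny fragment of the para-functional STG' expression language.
data Exp
  = Self                       -- self
  | AtomE Atom                 -- an atom used as an expression
  | Plus Atom Atom             -- plusInt#
  | Minus Atom Atom            -- minusInt#
  | LetHash String Exp Exp     -- let# var = rhs in body
  | LetOn Atom String Exp Exp  -- leton pid var = rhs in body

atom :: Env -> Atom -> Int
atom _   (Lit n) = n
atom env (Var v) = maybe (error ("unbound variable: " ++ v)) id (lookup v env)

-- eval exp env cpid: only Self and LetOn inspect or change the processor id.
eval :: Exp -> Env -> Pid -> Int
eval Self            _   cpid = cpid
eval (AtomE a)       env _    = atom env a
eval (Plus a b)      env _    = atom env a + atom env b
eval (Minus a b)     env _    = atom env a - atom env b
eval (LetHash v r b) env cpid =
  let x = eval r env cpid                -- strict binding on the current processor
  in x `seq` eval b ((v, x) : env) cpid
eval (LetOn p v r b) env cpid =
  let cpid' = atom env p                 -- the right-hand side runs on processor pid
      x     = eval r env cpid'
  in x `seq` eval b ((v, x) : env) cpid  -- the body stays on the current processor

-- The example program: a is evaluated on self + 1, b on self - 1.
example :: Exp
example =
  LetHash "x" Self $
  LetHash "right" (Plus (Var "x") (Lit 1)) $
  LetHash "left" (Minus (Var "x") (Lit 1)) $
  LetOn (Var "right") "a" (LetHash "here" Self (AtomE (Var "here"))) $
  LetOn (Var "left") "b" (LetHash "here" Self (AtomE (Var "here"))) $
  Plus (Var "a") (Var "b")
```

Starting on processor zero, `eval example [] 0` binds `a` to 1 and `b` to -1, so the result is 0, in agreement with the text. The result depends only on the initial processor id, which is how the location parameter keeps `self` deterministic.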

**Execution-tree semantics**

In addition to the standard denotational semantics, Hudak [1986, section 5, pages 113–119] also provided an execution-tree semantics for a simple para-functional language. Loosely based on his work on pomsets [Hudak and Anderson, 1987], the semantics generates an evaluation history for the program. However, it does not model sharing within the programs, so the execution path will always be a tree. This history can be visualised and used to ensure the operational model and compilation rules are behaving correctly. For example, figures 9.40 and 9.41 show the execution trees for \texttt{fib 5} and \texttt{queens 2}. Notice that the \texttt{queens} benchmark has a significantly more complex and irregular tree than that for the \texttt{fib} benchmark. It is highly likely that the execution-tree semantics could be refined to provide programmers with a tool for identifying and controlling potential parallel tasks.

The execution-tree semantics, \(T[\text{exp}]\), builds upon the denotational semantics, using it to determine which case alternatives will be selected during evaluation. Figure 9.42 shows the additional domain equations used by the semantics. A behaviour comprises two parts: the first is the execution tree, and the second is an abstract behaviour. The abstract behaviour is used by let and case expressions to model the non-strict evaluation ordering. Using Hudak’s notation, for a behaviour $b$, $bt$ denotes its execution tree, and $bf$ denotes its abstract behaviour.

Figure 9.40: The execution tree for fib 5

Figure 9.41: The execution tree for queens 2

\[
\begin{array}{rcll}
\text{Beh} & = & \text{Etree} \times \text{AbsBeh} & \\
\text{Etree} & = & \text{Pid} & \\
& \cup & (\text{Pid} \times \text{Etree}) & \text{Simple evaluation} \\
& \cup & (\text{Pid} \times \text{Etree} \times \text{Etree}) & \text{Sub-expression evaluation} \\
\text{Pid} & \equiv & \text{I}^\# & \\
\text{AbsBeh} & = & \text{Beh} \to \text{Val} \to \text{Pid} \to \text{Beh} & \\
& \cup & \text{Id} \to \text{Beh} & \text{Function behaviours} \\
\text{BEnv} & = & \text{Id} \to \text{Beh} & \text{Behaviour env.}
\end{array}
\]

Figure 9.42: Recursive execution-tree domain equations

Figure 9.43: Execution-tree semantics of para-functional STG' programs and bindings

Each semantic equation typically returns a behaviour, and, in addition to the value environment, $\rho$, and current processor id, $cp$, maintains a behaviour environment, $be$:

$$T[exp] : BEnv \rightarrow Env \rightarrow Pid \rightarrow Beh$$

Figures 9.43, 9.44, and 9.45 show the various equations used by the execution-tree semantics. In most instances, there is a close correspondence between Hudak’s definitions and those used here (for a comprehensive description of the ideas and techniques, the interested reader is referred to [Hudak, 1986]). The main extension corresponds to the STG’ language’s use of algebraic constructors. While let(rec) expressions are the only source of delayed evaluation within the STG’ language, constructors and case expressions need to maintain the associated non-strict behaviours across function calls. To this end, the domain of abstract behaviours has been extended to include a constructor map (see the constructor and algebraic-alternative equations in figures 9.44 and 9.45). Further modifications were required to model the strict evaluation of let#, letstrict, and leton expressions.

9.4.2 The operational semantics

The operational semantics presented here builds upon that of the previous section. The most obvious strategy is to modify the LETPAR rule so that it sends a Schedule message to the targeted processor, as shown in figure 9.46. The LETON1 rule ensures that work targeted for the local processor is directly added to the work pool. The LETON2 rule packs up the work and sends it to the specified processor. The SELF rule returns the processor’s id when evaluating a self expression. Note, however, that this simple approach is only correct with regard to the denotational semantics when there are more physical processors than virtual processors. Consider, for instance, the following code:
\[ T[\text{let binds exp}] \mathbin{=} T[\text{exp}] \mathbin{\circ} \mathbin{\text{binds}} \]  
where \( b' = b \mathbin{\mathbin{\oplus}} T[\text{binds}] b \)  
and \( \rho' = \rho \mathbin{\mathbin{\oplus}} B[\text{binds}] \rho \)

\[ T[\text{letrec binds exp}] \mathbin{=} T[\text{exp}] \mathbin{\circ} \mathbin{\text{binds}} \]  
where \( (b', \rho') = \text{fix} (\lambda(b', \rho'). (T[\text{binds}] b' \mathbin{\mathbin{\circ}} b') (\rho \mathbin{\mathbin{\oplus}} \rho')) \)

\[ T[\text{let} \# \text{ var } = \text{exprhs} \text{ exp}] \mathbin{=} T[\text{exprhs}] \mathbin{\mathbin{\circ}} \mathbin{\text{var}} \mathbin{\circ} cp \mathbin{=} (cp : (b_1, b_2), b_2)
\]

\[ T[\text{letstrict} \text{ var } = \text{exprhs} \text{ expbody}] \mathbin{=} T[\text{exprhs}] \mathbin{\mathbin{\circ}} \mathbin{\text{var}} \mathbin{\circ} cp \mathbin{=} (cp : (b_1, b_2), b_2)
\]

\[ T[\text{let} \text{ pid var } = \text{exprhs} \text{ expbody}] \mathbin{=} T[\text{exprhs}] \mathbin{\mathbin{\circ}} \mathbin{\text{var}} \mathbin{\circ} cp \mathbin{=} (cp : (b_1, b_2), b_2)
\]

\[ T[\text{case} \text{ exp alts default}] \mathbin{=} T[\text{exprhs}] \mathbin{\mathbin{\circ}} \mathbin{\text{var}} \mathbin{\circ} cp \mathbin{=} (cp : (b_1, b_2), b_2)
\]

\[ T[\text{cons atom}_1 \ldots \text{atom}_n] \mathbin{=} T[\text{atom}_1 \ldots \text{atom}_n] \mathbin{\circ} cp \mathbin{=} (cp : (b_1, b_2), b_2)
\]

\[ T[\text{literal}] \mathbin{=} T[\text{var}] \mathbin{\circ} cp \mathbin{=} (cp, \text{err})
\]

\[ T[\text{self}] \mathbin{=} T[\text{var}] \mathbin{\circ} cp \mathbin{=} (cp, \text{err})
\]

\[ T[\text{default}] \mathbin{=} T[\text{var}] \mathbin{\circ} cp \mathbin{=} (cp, \text{err})
\]

\[ T[\text{atom}] \mathbin{=} T[\text{atom}] \mathbin{\circ} cp \mathbin{=} (cp, \text{err})
\]
Figure 9.45: Execution-tree semantics of para-functional STG' case alternatives

(LETON1)

\[
\text{Eval } (\text{leton } pid \ v = e_1 \ e_2) \ \rho \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \quad \text{such that } pid' = i \\
\Rightarrow \text{Eval } e_2 \ (\rho \oplus \{v \mapsto a\}) \ as \ rs \ us \ h' \ t_{id} \ wp' \ \sigma \ b_i
\]

where

- h' = h[a \mapsto \text{create\_closure } e_1 \ \rho]
- wp' = \text{insert\_spark } a \ wp
- pid' = (\text{val } \rho \ \sigma \ pid) \% \ n

(SELF)

\[
\text{Eval } \text{self} \ \rho \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i \\
\Rightarrow \text{Return}_{\text{int}} \ i \ as \ rs \ us \ h \ t_{id} \ wp \ \sigma \ b_i
\]
The denotational semantics specify that the correct result is \texttt{Int 1000}, while a one-processor operational model will produce the result \texttt{Int 0}. The load-balancing mechanism also causes problems by transporting thunks to other processors. There are three possible solutions to this problem:

1. extend the denotational semantics to take an extra argument: the maximum number of available processors. An attempt to create a virtual topology with more processors would result in an error. The STG' language would also have to be extended to provide a way for the programmer to determine the number of available processors.\(^3\)

2. modify the operational behaviour so that it correctly implements the denotational semantics.

3. accept the fact that the operational model is a weak model of the denotational semantics and is also non-deterministic (due to the load balancer).

From a language-design perspective, the second option is the only acceptable alternative as both of the other solutions fall short of the goals set out in table 5.1.
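The mismatch behind these options can be sketched with a small calculation. The code below is an illustration under stated assumptions: it considers a hypothetical program that evaluates `self` inside work mapped to virtual processor 1000, and the function names are invented for the sketch. It compares the denotational reading with the naive operational reading in which `self` yields the physical processor selected by `pid' = vpid % n`.

```haskell
type Pid = Int

-- Physical processor chosen for a virtual id, as in pid' = vpid % n,
-- where n is the number of physical processors (assumed positive).
physical :: Int -> Pid -> Pid
physical n vpid = vpid `mod` n

-- Denotational reading: self, evaluated inside work mapped to a
-- virtual processor, yields that virtual id.
selfDenotational :: Pid -> Pid
selfDenotational vpid = vpid

-- Naive operational reading: self yields the id of the physical
-- processor that happens to run the work.
selfNaive :: Int -> Pid -> Pid
selfNaive n vpid = physical n vpid
```

With a single physical processor, `selfNaive 1 1000` is `0` whereas `selfDenotational 1000` is `1000`; the two readings agree only when `n` exceeds the virtual id, and even then only if load balancing does not move the work.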

In order to provide support for virtual topologies, a number of minor modifications need to be made to the GUM operational model. Essentially, each mapped thunk is annotated with its virtual processor id. Whenever such a thunk is entered, the processor updates its virtual id, and it is this value that the \texttt{self} expression evaluates to. This strategy copes with a mismatch between the number of virtual and real processors, and is also not affected by dynamic load balancing. However, one problem remains: after evaluation of a mapped thunk, the processor needs to revert its virtual id to the value it was using before the thunk was entered. The solution is to store the original virtual id inside the thunk’s update frame, where it can be recovered after evaluation is complete. The modifications to GUM’s state components are shown in table 9.10, and a selection of the associated rules are shown in figure 9.47.
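The save-and-restore discipline described above can be sketched as follows. This is a deliberately reduced model, not the GUM state of table 9.10: the `Machine` record, its field names and the three functions are assumptions introduced for illustration, and everything except the virtual id and the update stack is omitted.

```haskell
type Pid = Int

-- Hypothetical, heavily simplified processor state.
data Machine = Machine
  { vpid    :: Pid    -- current virtual processor id
  , updates :: [Pid]  -- virtual ids saved in update frames, newest first
  } deriving (Eq, Show)

-- Entering a mapped thunk annotated with virtual id `target`:
-- save the current virtual id in a new update frame, then switch.
enterMapped :: Pid -> Machine -> Machine
enterMapped target m =
  m { vpid = target, updates = vpid m : updates m }

-- Completing the thunk: pop the update frame and restore the saved id.
finishMapped :: Machine -> Machine
finishMapped m = case updates m of
  saved : rest -> m { vpid = saved, updates = rest }
  []           -> m  -- no frame to pop: leave the state unchanged

-- self evaluates to the current virtual id, never the physical one.
evalSelf :: Machine -> Pid
evalSelf = vpid
```

Because the saved id travels with the update frame, nested mapped thunks restore correctly in last-in, first-out order, regardless of which physical processor runs them.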

\(^3\)Such an extension could also be useful in an implementation supporting truly virtual topologies. It would allow programmers to generate topologies optimised for the available number of processors. It would be almost identical in form to the \texttt{self} expression, and figures 9.39 and 9.46 could serve as a template.
(LETON'2)

\[
\text{Eval } (\text{leton } \text{pid } v = e_1 \langle e_2 \rangle) \rho \text{ as } rs \ us \ h \ t_{id} \ wp \ \sigma \ vp_{id} \ b_i \\
\Rightarrow \text{Send schedule afterwards as } rs \ us \ h' \ t_{id} \ wp \ \sigma \ vp_{id} \ b_i
\]

where

- \text{afterwards } = \text{Eval } e_2 (\rho \oplus \{v \mapsto a\})
- \text{schedule } = (i, \text{pid}', \text{Schedule } a (\text{pack closure}))
- \text{closure } = \text{VThunk } vp_{id} \ e_1 \ \rho'
- \text{dom}(\rho') = \mathcal{FV}[e_1] \ \sigma
- \text{h'} = h[a \mapsto \text{Exported pid' closure bk_empty}]
- \text{vp}_{id}' = \text{val } \rho \sigma \ \text{pid}
- \text{pid}' = \text{vp}_{id}' \% \ n
- \text{bk_empty} = (\langle \text{threads}, \langle \text{fetches} \rangle \text{blocked})

(BH3)

\[
\text{Enter } a \text{ as } rs \ us \ h \ t_{id} \ wp \ \sigma \ vp_{id} \ b_i
\]

\[
\Rightarrow \text{Eval } \exp(\rho) (\langle \text{stack} \rangle \langle \text{stack} \rangle \text{us'} h' t_{id} \ wp \ \sigma \ vp_{id}' \ b_i)
\]

where

- \text{us'} = (as, rs, a, vp_{id}) : us
- \text{h'} = h[a \mapsto \text{BlackHole bk_empty}]
- \text{bk_empty} = (\langle \text{threads}, \langle \text{fetches} \rangle \text{blocked})

(BH'3)

\[
\text{Return}_\chi c\ ws \ \langle \rangle \ \langle \rangle : us \ h \ t_{id} \ wp \ \sigma \ vp_{id} \ b_i
\]

\[
\Rightarrow \text{unblock } as_u \ r_{sa} \ us \ h' \ t_{id} \ wp \ \sigma \ vp_{id}' \ b_i
\]

where

- \text{unblock } = \text{Unblock blocked (Return}_\chi c\ ws\)
- \text{h'} = h[a_u \mapsto (vs x \mapsto c \ vs, \ ws)]
- \text{length } vs = \text{length } us
- vs is a sequence of arbitrary distinct variables

(SELF')

\[
\text{Eval } \langle \text{self} \rangle \rho \text{ as } rs \ us \ h \ t_{id} \ wp \ \sigma \ vp_{id} \ b_i
\]

\[
\Rightarrow \text{Return}_{\text{int}} \ vp_{id} \text{ as } rs \ us \ h \ t_{id} \ wp \ \sigma \ vp_{id} \ b_i
\]

Figure 9.47: Operational rules for supporting virtual topologies
9.4.3 Compilation rules

The compilation rules should follow almost directly from those used with the GUM system. However, a more detailed exposition is beyond the scope of this thesis.

9.4.4 Performance

If the leton expression is used only to schedule tasks on the local processor, then the performance is exactly the same as for the GUM system. However, by using the execution-tree semantics to structure the computation, improvements over the GUM system are possible. For example, figure 9.48 shows a specialised version of the fib benchmark, explicitly using three virtual processors and relying on load balancing to provide work for any remaining physical processors. The results for this benchmark and a four-processor variant, sfib4, are shown in figure 9.49. Both variants achieve modest improvements over their unstructured counterpart, and do not suffer from a performance degradation when the number of surplus processors is increased. This benefit is solely down to a reduced initialisation phase, where the available work spreads to the idle processors more quickly under the para-functional scheme. In effect, the first round of Fish messages is avoided, and the targeted processors are able to make their sparks available sooner. It is worth noting that to benefit from this speedup, the burden of supplying the mapping directives is placed upon the programmer (or, possibly, generated via an automatic mapping algorithm). Furthermore, with more complex benchmarks, running for greater periods of time, it is likely that the reduction in initialisation would only account for a minor fraction of the total runtime. However, as the complexity of the program increases, so does the scope to use the mapping directives – unfortunately a comprehensive study of the benefit of mapped expressions is beyond the scope of this thesis.
9.4.5 Related work

Burton [1987] was probably one of the first in the functional-programming community to look at imposing greater structure onto the traditional par combinator. However, his work was purely theoretical, and did not directly result in any real implementations. More practically, the GranSim simulator of Hammond, Loidl and Partridge [1995a] implemented the parAt operator, which has very similar semantics to Hudak’s on expressions. However, this was not the primary focus of their work, and no results were reported. For the benchmark programs they were using, and the respective problem sizes, the dynamic load balancing was sufficiently effective that further annotations were probably deemed unnecessary.

As previously mentioned, Mirani and Hudak [1995] have extended the work on para-functional Haskell to incorporate the recent advances in monads. This has allowed them to promote their scheduling annotations into first-class citizens, and provides access to run-time values (such as the current processor load) without compromising determinacy. It would be straightforward to modify the models presented here to reflect their advances.

Finally, Parrott [1993] avoided the problems associated with hand annotation by using profiling information to drive a heuristic scheduling algorithm. While his work concentrated upon when to run particular tasks, there is no obvious reason why it could not also take task locality into account. This could provide a pragmatic approach to generating initial mappings, which could then be refined by a programmer, significantly reducing the burden associated with such schemes.

9.4.6 Assessment

Following on from the GUM case study, incorporating mapped expressions into the STG' language was straightforward: the denotational semantics are very similar, and the majority of the operational rules were used unmodified. Furthermore, the development of the execution-tree semantics provided an insight into the structure of the benchmark programs, and could be used to provide a useful tool for the parallel functional programmer.
Finally, the STG animation demonstrated the benefits in reduced initialisation offered by the simple leton mapped expression. However, a question does remain as to the scalability and general applicability of mapped expressions.

9.5 Kelly’s Skeletons

This section concentrates on three particular skeletons used by Darlington, Field, Harrison, Kelly and others [1993]: pipe, farm, and dc (see section 2.4.3 for a general overview of algorithmic skeletons). These provide a representative sample of the current skeletal population, and Kelly’s work is one of the few to attempt to integrate skeletons into a non-strict language. The following Haskell definitions are used to provide an informal sequential semantics of the three skeletons:

Haskell

```haskell
-- Pipeline: compose the stages (the first-listed stage is applied last).
pipe :: [a -> a] -> (a -> a)
pipe = foldr1 (.)

-- Processor farm: apply f, with a shared environment, to every element.
farm :: (a -> b -> c) -> a -> ([b] -> [c])
farm f env = map (f env)

-- Divide and conquer.
dc :: (a -> Bool) -> (a -> b) -> (a -> (a, a)) -> (b -> b -> b) -> a -> b
dc endtest endval split combine x
    | endtest x = endval x
    | otherwise = let (l, r) = split x
                  in combine (dc endtest endval split combine l)
                             (dc endtest endval split combine r)
```
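As a brief illustration of how the three skeletons behave sequentially, the sketch below repeats the definitions so that it is self-contained and applies each one to a small input. It assumes a curried `combine` argument for `dc`; the helper `sumRange` and the example values are hypothetical and are not taken from Kelly's work.

```haskell
pipe :: [a -> a] -> (a -> a)
pipe = foldr1 (.)

farm :: (a -> b -> c) -> a -> ([b] -> [c])
farm f env = map (f env)

dc :: (a -> Bool) -> (a -> b) -> (a -> (a, a)) -> (b -> b -> b) -> a -> b
dc endtest endval split combine x
  | endtest x = endval x
  | otherwise =
      let (l, r) = split x
      in combine (dc endtest endval split combine l)
                 (dc endtest endval split combine r)

-- Hypothetical example: sum the integers in a non-empty range
-- (lo <= hi) by repeatedly halving it.
sumRange :: (Int, Int) -> Int
sumRange = dc isSingle fst halve (+)
  where
    isSingle (lo, hi) = lo == hi
    halve (lo, hi) =
      let mid = (lo + hi) `div` 2
      in ((lo, mid), (mid + 1, hi))

-- pipe applies the last stage in the list first: (+1) ((*2) 5).
pipeExample :: Int
pipeExample = pipe [(+ 1), (* 2)] 5

-- farm shares the environment 10 between all applications of (+).
farmExample :: [Int]
farmExample = farm (+) 10 [1, 2, 3]
```

Here `pipeExample` is 11, `farmExample` is `[11, 12, 13]`, and `sumRange (1, 10)` is 55.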

9.5.1 Static semantics

Following the guidelines laid down in section 5.2.1, the first attempt at incorporating skeletons into the STG language is shown below:

```
exp      ::= ...
          |  skeleton                                           -- skeletal expression
skeleton ::= farm var_fun exp                                    -- processor farm
          |  dc var_end var_single var_divide var_combine exp    -- divide and conquer
          |  pipe var_stage1 ... var_stagen exp                  -- process parallelism
```

Notice that the farm skeleton, unlike Kelly’s version, does not need to take the env parameter as the STG language’s sharing mechanism can be used to define an appropriate replacement function:

```
STG' code
let { f' = [env] \r [x] -> f env x; } in farm f' xs
```

Also, the dc skeleton looks suspiciously complex, and can, in fact, be implemented as successfully using the mapping expressions from the previous case study – see figure 9.50. The benefit of maintaining this skeletal expression cannot be justified at the intermediate level – it is merely an artifact to be used during the initial phases of the compilation process. Having decided that it may not be necessary to represent all skeletons within the intermediate language, the farm expression warrants further investigation. The denotational semantics is presented in figure 9.51, and is, in effect, a highly strict version of the map function. Again, it is easy to generate para-functional code to duplicate this behaviour:
```plaintext
STG' code

divacon = [] \r [endtest endval split combine] ->
  letrec {
    f = [split combine endtest endval] \r [x] -> case endtest x of
      { True  -> endval x;
        False -> case split x of
          { Tup2 l r ->
              let# here = self# in
              let# left_neighbour  = minusInt# [here, 1#] in
              let# right_neighbour = plusInt# [here, 1#] in
              leton left_neighbour  left  = f l in
              leton right_neighbour right = f r in
              combine left right;
          };
      };
  } in f;
```

Figure 9.50: A para-functional STG' replacement for the dc skeleton

```
Skeleton[farm var_fun exp] ρ = let function  = ρ var_fun
                                   function' = compose ξ∞(χ π1...πn) function
                                   arguments = ξ∞(List π) (map function arguments)

ξ∞(χ π1...πn) :: Val → Val

ξ∞(χ π1...πn) ε = case ε of
    ⊥ → ⊥
    ε → ε

ξ∞(List π) :: Val → Val

ξ∞(List π) ε = case ε of
    ⊥              → ⊥
    ⟨Nil⟩          → ⟨Nil⟩
    ⟨Cons, x, xs⟩  → ⟨Cons, x, ξ∞(List π) xs⟩
```

Figure 9.51: Denotational semantics of the farm skeleton
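One sequential reading of "a highly strict version of the map function" can be sketched in Haskell. The names `forceList` and `strictFarm` are assumptions introduced here; the sketch forces the whole spine of the result and each element to weak head normal form before the list is returned, which approximates, rather than reproduces, the denotational definition in the figure.

```haskell
-- Force every cons cell and every element (to weak head normal form)
-- before handing back the list.
forceList :: [a] -> [a]
forceList xs = go xs `seq` xs
  where
    go []       = ()
    go (y : ys) = y `seq` go ys

-- A map whose result is fully traversed as soon as it is demanded.
strictFarm :: (a -> b) -> [a] -> [b]
strictFarm f = forceList . map f
```

Unlike the ordinary `map`, demanding the result of `strictFarm` to weak head normal form diverges if any element, or the list itself, is undefined; on fully defined finite lists the two functions agree.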
Finally, the following transformation shows how pipe expressions can also be removed from the intermediate language:

\[
pipe \text{var}_{stage_1} \cdots \text{var}_{stage_n} \text{ exp} \quad \Rightarrow \quad \text{farm fun exp}
\]

where

\[
\begin{align*}
\text{fun} &= \left[ \text{r} \left[ \text{var}_{arg_1} \right] \rightarrow \text{fun'} \text{ var}_{arg_1} \right] \\
\text{fun'} &= \left[ \text{r} \left[ \text{var}_x \right] \rightarrow \text{compose}_n \text{ var}_{stage_1} \cdots \text{ var}_{stage_n} \text{ var}_x \right]
\end{align*}
\]
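The idea behind this transformation can be checked sequentially in Haskell. The sketch below is an assumption-laden illustration: it treats the expression argument as a list, models farm by `map`, and uses the hypothetical names `pipeStream` and `farmOfComposition`. It compares pushing a list through the stages one at a time with farming the composed stages over the list.

```haskell
-- Push the whole list through each stage in turn. As with foldr1 (.),
-- the last stage in the list is applied first.
pipeStream :: [a -> a] -> [a] -> [a]
pipeStream stages xs = foldr map xs stages

-- Compose the stages into a single function and farm it over the list.
farmOfComposition :: [a -> a] -> [a] -> [a]
farmOfComposition stages = map (foldr (.) id stages)
```

Both functions compute the same list; the second exposes only farm-style parallelism, which is why the pipe expression need not survive into the intermediate language.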

9.5.2 Assessment

Perhaps surprisingly, it turns out that skeletal expressions should not form part of a parallel functional intermediate language. They are simply high-level constructs which provide the programmer with convenient abstractions for developing parallel algorithms. Any skeleton-based compiler will certainly manipulate skeletal expressions, but they will have been reduced to more basic operations before the operational semantics has to be considered.

9.6 Summary

This chapter has presented four case studies of the prototyping framework:

**Mattson’s speculative evaluation** The first case study dealt with the development of low-level synchronisation and scheduling constructs for a GMSV architecture. The performance results showed the expected near-ideal speedups associated with shared-memory architectures. However, only the RISC animation exhibited the second-order effects introduced by the necessary locking operations.

**GUM Haskell** This study built upon the work of the previous investigation, and extended it to DMMP architectures. The primary focus was the use of explicit communication to implement load-balancing and resource-sharing mechanisms. UML interaction diagrams, therefore, became an essential part of the development process. The performance results again closely agreed with those observed in real implementations. However, the STG animation was only capable of simulating relatively small benchmark problems. It was suggested that this should not be a cause for concern: larger problem sizes tend to exhibit better performance on parallel architectures, therefore hiding any inefficiencies of the operational model. The RISC animation was only used to verify that the STG animation was producing credible performance estimates.

**Para-functional Haskell** This extended the GUM model to include explicit mapped expressions. It was shown that such annotations can reduce the initialisation phase of the evaluation, and therefore lead to moderate improvements in performance. However, the added burden placed upon the programmer, combined with the inability to animate larger problems, raised some doubt as to the general applicability of mapped expressions. Finally, Hudak's execution-tree semantics was identified as a potentially useful tool to aid the parallel functional programmer.

**Kelly’s skeletons** The final case study considered the representation of skeletal operators in the context of a parallel functional intermediate language. It was decided that the role of such expressions should be limited to the initial phases of the compilation process and should not interfere with the operational model of the intermediate language.
Chapter 10

Summary, evaluation and further work

10.1 Introduction

This chapter will begin with a brief summary of the work which has been presented in this thesis (section 10.2), followed by an evaluation of this work in section 10.3. Finally, in section 10.4, the limitations of this work are briefly discussed, and further potential avenues of exploration are suggested.

10.2 Summary

The contributions of this thesis are as follows:

- the presentation of a framework for rapidly prototyping parallel functional intermediate languages, driven by the development of semantic models for the three phases of a traditional compiler – the source, intermediate, and target languages.

- a number of prescriptive methods for animating denotational semantics, Hindley–Milner type-inference algorithms, and state-transition systems in the functional programming language Haskell.

- the development and informal validation of a static semantics for the sequential STG' language. In addition, the development of an execution-tree semantics [Hudak, 1986] for a para-functional variant of the STG' language.

- the use of a state-transition system to model multiprocessor systems, using shared-memory and/or message passing as the primary communication mechanisms.

- a state-transition model of an optimising compilation system for the STG' language, closely based on the operational model.

10.3 Evaluation of the prototyping framework

This evaluation starts by comparing and contrasting the prototyping framework with the related techniques reviewed in section 2.5.3. It then goes on to attempt to evaluate the success of the framework both as a prototyping tool for parallel functional intermediate languages and as an animation system for static and operational semantics.
10.3.1 Comparison with other relevant work

The Haskell approach

pH [Nikhil et al., 1995, section 1, page 1], a Haskell derivative extended to include explicit parallelism, has as one of its goals:

“To share infrastructure (compilers, systems, application programs), and to facilitate interesting research topics, such as comparing lazy evaluation vs. lenient evaluation...”

However, by necessity, the resulting compilers are written primarily for speed and efficiency, possibly at the expense of clarity – based on personal experience, this is certainly true of GHC! Moreover, the system will be sufficiently complex that familiarisation and development will take a significant amount of time.

Direct implementations

A number of compilation systems, based upon explicitly parallel versions of the STG language, have been developed, and each of the extension techniques described in section 2.4 is represented: Hill’s data-parallel Haskell [Hill, 1994] introduces the POD (parallel object with arbitrary dimensions) data type and associated primitive functions; Chakravarty’s Jump* machine [Chakravarty, 1994] extends the exp rule with the letpar construct; Hwang and Rushall [1992] alter the semantics of the case expression in their ν-STG machine (this corresponds to the if construct in the language we have presented); and Hammond, Mattson Jr. and Peyton Jones [1994] add par and seq as primitive functions. The primary aim of these systems has been to demonstrate the usefulness of the implementor's favourite language extension or run-time algorithm. In each case, little justification is provided as to why a particular approach was taken, and no real effort has gone into comparing and contrasting the features offered by each of these systems.

The approach outlined in this thesis enables the rapid prototyping of a wide range of languages, which, in turn, allows one to examine these very issues. The case studies from chapter 9 demonstrated how a system could be developed incrementally, allowing competing approaches to be evaluated fairly. Furthermore, the generation of the semantic descriptions will serve as excellent documentation of the various design decisions and experiments.

Prototyping versus simulation

To date, simulation has been widely used by the community as a substitute for prototyping [Joy and Axford, 1992; Bennett, 1993; Hammond, Loidl and Partridge, 1995a]. However, as acknowledged by Deschner [1990, section 1, page 227], such systems tend to allow only a limited design space to be explored:

“Although initially the system is only capable of simulating conservative parallelism, with major adjustment it could also be used to analyse speculative evaluation strategies.”

The work of Hammond, Loidl and Partridge [1995b] bears the closest resemblance to this work. They use an accurate multi-architecture simulation system, based on the Glasgow Haskell compiler, to study the effects of language annotations [Burton, 1984] on task granularity. Their overall aim is to develop heuristics for use with an automatically parallelising compiler. This work, on the other hand, encompasses the derivation of parallelism using both implicit and explicit techniques. In addition, their intermediate language identifies parallelism only through the use of primitive functions, and ignores the many alternatives.

10.3.2 Evaluation of the success of the prototyping framework

The four case studies from chapter 9 demonstrated the utility and sophistication of the framework. Based upon the experience gained during these case studies, it is not unreasonable to claim that the prototyping framework could be effectively applied to re-engineering any of the current crop of parallel implementations. Furthermore, the work of chapters 5 and 6 showed how the semantic models could cope with advances both in terms of language idioms and implementation techniques.

10.3.3 Animating static and operational semantics

As outlined by Goodman [1995, section 7.3.3], there are two distinct approaches to evaluating the success of an animation system: firstly, rating the framework against a number of theoretical concerns, including coverage, efficiency, and sophistication; and, secondly, evaluating its practical success as a method of software development. However, Goodman's assessment of the difficulties in implementing the latter approach is sufficiently complete (and relevant) for the purpose of this evaluation that only the former approach is pursued here.

Goodman uses the following eight concerns to rate an animation system: coverage, efficiency, sophistication, interactivity, transparency, operational equivalence, usability and utility. The animations used by the prototyping framework are scored as follows:

- **Coverage.** Good, as Haskell’s semantics are very close to those used to model the static and operational semantics.

- **Efficiency.** The primary aim of the animation techniques is to maintain a close correspondence between the semantic descriptions and the Haskell code. While the efficiency of the resulting programs could be improved, this would interfere with the method’s operational equivalence.

- **Sophistication.** Good. Examples include the case studies from chapter 9, and the work of Booth, Bruce and Ben-Dyke [1996] on the animation of an imperative parallel object-oriented language.

- **Interactivity.** The animation of the static semantics results in a non-interactive program, and must therefore score poorly. However, the state-transition animations are highly interactive, and the overall score can therefore be deemed fair.

- **Transparency.** Reasonable. Haskell’s pattern-matching semantics, combined with the libraries developed during the case studies, simplify the conversion process.

- **Operational equivalence.** As for transparency.

- **Usability.** Good. Few problems were encountered during the animation of the various case studies.

- **Utility.** Good. The framework used for the state-transition animations has been used successfully on a range of applications including the STG machine, a hybrid RISC architecture, and a compilation system.

10.4 Limitations and further work

The following important issues have been largely ignored throughout this thesis, but are deserving of further attention:

- the development of accurate yet concise models of the behaviour of shared-memory systems with respect to locking and concurrency control (see section 6.2.4).

- the development of one or more domain-specific languages for simplifying the construction of the various semantic models (with the possibility of automatically animating the results). Relevant research includes Navel [Michaelson, 1993] and Actress [Brown, Moura and Watt, 1992].

- the verification of the presented semantics, and the development of a prescriptive approach to proving the correctness of the various rule sets.

- the expansion of the framework to include imperative languages, or using a different approach to type inference. (An initial feasibility study has already been carried out [Booth, Bruce and Ben-Dyke, 1996].)
Appendix A

On the design of parallel functional intermediate languages

This paper [Ben-Dyke and Axford, 1995] was originally presented at HiPC '95, the International Conference on High Performance Computing [Sahni, Prasanna and Bhatkar, 1995]. The contribution of the two authors was as follows: Andy Ben-Dyke, 80%; Tom Axford, 20%. 
A.1 Introduction

A.2 Defining the language
A.3 Developing the parser

A.4 Code generation

A.4.1 Operational semantics
A.4.2 Compilation rules
A.4.3 Architecture simulator
A.5 Relationship to other work

A.6 Concluding remarks
Appendix B

Example STG' programs

This section presents a number of example STG' programs, as generated by the Glasgow Haskell compiler (see section C.2 for further qualification). The Haskell source code is primarily taken from either the Haskell standard prelude [Hudak et al., 1992, appendix A], or the imaginary subset of the nofib benchmark suite (see appendix G).

Section B.1 looks at some of the prelude operations used to support integers, booleans, and lists. Three nofib programs, fib, primes, and queens, are then presented in sections B.2 through B.4. Finally, a solution to Hamming’s problem, as developed by Hudak and Anderson [1988, section 3], is converted into STG' code.

B.1 Prelude operations

This section looks at the STG' definitions needed to support the three main data types of the Haskell language, namely integers, booleans, and lists. Where applicable, the equivalent Haskell code is also included. All of the STG' bindings have been taken directly from the library of test routines used by the prototyping system (see section 3.4).

B.1.1 Integers

Primitive wrappers

The following data declaration and selected associated operations make up the interface for the primitive integer type, \texttt{Int#}:

```
STG' code
data Int = Int Int#;
zero = [] \r [] -> Int [0#];
one = [] \r [] -> Int [1#];
const.Int.+ = [] \r [x y] -> case x of
  { Int x' -> case y of { Int y' -> let# xy = plusInt# [x', y'] in Int [xy]; }; };
const.Int.> = [] \r [x y] ->
  case x of { Int x' -> case y of { Int y' -> gtInt# [x', y']; }; };
```

Obviously, no Haskell equivalents exist for any of these definitions.
Quotients and signs

The STG' definitions given below are based on Int#-specialised versions of the following prelude functions:

**Haskell**

```haskell
quotRem :: Int -> Int -> (Int, Int)
quotRem n d = (n `quot` d, n `rem` d)

signum :: Int -> Int
signum x | x == 0    = 0
         | x > 0     = 1
         | otherwise = -1
```

const.Int.quotRem is a straightforward conversion of quotRem, but const.Int.signum makes use of the wrapper/worker optimisation [Peyton Jones and Launchbury, 1991, section 5.1]:

**STG' code**

```haskell
const.Int.quotRem = [] \r [n d] ->
  let { q = [n d] \u [] -> const.Int.quot n d;
        r = [n d] \u [] -> const.Int.rem n d;
      } in Tup2 [q, r];

const.Int.signum = [] \r [x] -> case x of { Int x' -> const.Int.signum.wrk x'; };

const.Int.signum.wrk = [] \r [x] -> case x of
  { 0# -> Int [0#];
    default -> case gtInt# [x, 0#] of { True -> Int [1#]; False -> Int [-1#]; };
  };
```
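The same wrapper/worker split can be written directly in GHC Haskell using unboxed integers. The sketch below is an illustration rather than compiler output: the names `signumInt` and `signumWorker` are hypothetical, and it assumes a GHC version (7.8 or later) in which `(>#)` returns an `Int#` that is tested with `isTrue#`.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), Int#, isTrue#, (>#))

-- Wrapper: unbox the argument once, then call the worker.
signumInt :: Int -> Int
signumInt (I# x) = signumWorker x

-- Worker: operates on the primitive Int# and boxes only the result.
signumWorker :: Int# -> Int
signumWorker 0# = I# 0#
signumWorker x
  | isTrue# (x ># 0#) = I# 1#
  | otherwise         = I# (-1#)
```

Callers that already hold an unboxed value can call the worker directly and skip the case analysis on the box, which is the benefit the optimisation is designed to expose.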

B.1.2 Booleans

**Haskell**

```haskell
data Bool = False | True

otherwise :: Bool
otherwise = True

(&&) :: Bool -> Bool -> Bool
(&&) False x = False
(&&) True x = x

not :: Bool -> Bool
not True = False
not False = True
```

Again, after the pattern-matching syntactic sugar is removed, the STG' code is similar to the Haskell versions:

**STG' code**

```haskell
data Bool = True | False;

otherwise = [] \r [] -> True [];

&& = [] \r [x y] -> case x of { False -> False [] ; True -> y ; } ;

not = [] \r [x] -> case x of { True -> False [] ; False -> True [] ; };
```

B.1.3 Lists

Rather than introducing special syntactic support, the following STG' declaration is used to define the List algebraic data type:

```
STG' code
  data List a = Cons a (List a) | Nil;
```

The following sections look at some of Haskell's PreludeList [Hudak et al., 1992, section A.5, pages 106-114] functions.

Nil, null, head, and tail

```
Haskell
nil :: [a]
nil = []

null :: [a] -> Bool
null [] = True
null (_:_ ) = False

head :: [a] -> a
head (x:_) = x

tail :: [a] -> [a]
tail (_:xs) = xs
```

```
STG' code
nil = [] \ r [] -> Nil [];
null = [] \ r [xss] -> case xss of { Nil -> True [] ; Cons x xs -> False [] ; };
head = [] \ r [xss] -> case xss of { Cons x xs -> x ; Nil -> error# [] ; };
tail = [] \ r [xss] -> case xss of { Cons x xs -> xs ; Nil -> error# [] ; };
```

Append

```
Haskell
(++) :: [a] -> [a] -> [a]
[] ++ ys = ys
(x:xs) ++ ys = x:(xs++ys)
```

```
STG' code
++ = [] \ r [xss yss] -> case xss of
{ Nil -> yss ;
  Cons x xs -> let { xs' = [yss xs] \ u [] -> ++ xs yss ; } in Cons [x, xs'] ;
};
```

Length

Rather than use the foldl-based Haskell version, the more traditional version is used:

```
Haskell
length :: [a] -> Int
length [] = 0
length (_:xs) = 1 + length xs
```
STG' code

```haskell
length = [] \ r [xss] -> case xss of
    { Nil -> Int [0#] ;
      Cons x xs -> case length xs of { Int l -> let# l' = plusInt# [1#, l] in Int [l'] ; } ;
    };
```

Map

Haskell

```haskell
map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x:xs) = f x : map f xs
```

STG' code

```haskell
map = [] \ r [f xss] -> case xss of
    { Nil -> Nil [] ;
      Cons x xs -> let { x' = [f x] \ u [] -> f x ;
        xs' = [f xs] \ u [] -> map f xs ; } in Cons [x', xs'] ;
    };
```

Foldl

Haskell

```haskell
foldl :: (a -> b -> a) -> a -> [b] -> a
foldl f z [] = z
foldl f z (x:xs) = foldl f (f z x) xs
```

STG' code

```haskell
foldl = [] \ r [f z xss] -> case xss of
    { Nil -> z;
      Cons x xs -> let { x' = [f z x] \ u [] -> f z x ; } in foldl f x' xs;
    };
```

Filter

Again, the foldr-based version is ignored in favour of the traditional definition:

Haskell

```haskell
filter :: (a -> Bool) -> [a] -> [a]
filter p [] = []
filter p (x:xs) | p x       = x : filter p xs
                | otherwise = filter p xs
```

STG' code

```haskell
filter = [] \ r [p xss] -> case xss of
    { Nil -> Nil [] ;
      Cons x xs -> case p x of { True -> let { xs' = [p xs] \ u [] -> filter p xs ; } in Cons [x, xs'] ;
        False -> filter p xs;
      };
    };
```
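
The correspondence between the sugared Haskell definitions and their STG' counterparts may be easier to see with the pattern matching written as explicit case expressions over the List type. The following is a minimal sketch in ordinary Haskell; the names fromList, toList, lengthL and filterL are hypothetical helpers introduced only for this illustration.

```haskell
-- The List type, declared exactly as in the STG' prelude above.
data List a = Cons a (List a) | Nil

-- Conversions to and from built-in lists (illustrative helpers).
fromList :: [a] -> List a
fromList = foldr Cons Nil

toList :: List a -> [a]
toList xss = case xss of
  Nil       -> []
  Cons x xs -> x : toList xs

-- length with a single explicit case, mirroring the STG' version.
lengthL :: List a -> Int
lengthL xss = case xss of
  Nil       -> 0
  Cons _ xs -> 1 + lengthL xs

-- filter with the guard turned into a case on the Boolean result.
filterL :: (a -> Bool) -> List a -> List a
filterL p xss = case xss of
  Nil       -> Nil
  Cons x xs -> case p x of
    True  -> Cons x (filterL p xs)
    False -> filterL p xs
```

What the sketch leaves implicit, and the STG' code makes explicit, is the allocation of thunks for the recursive calls and the boxing of the integer results.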
B.2 Generating Fibonacci numbers

```haskell
fib :: Int -> Int
fib n = if n <= 1 then 1 else fib (n-1) + fib (n-2) + 1
```

Unoptimised version

```stg'
fib = [] \r [n] -> case const.Int.<= n one of
{ True -> one ;
  False -> let { sum_2_fibs = [n] \u [] ->
        let { fib_n_less_2 = [n] \u [] ->
                let { n_less_2 = [n] \u [] -> const.Int.- n two ; }
                in fib n_less_2 ;
              fib_n_less_1 = [n] \u [] ->
                let { n_less_1 = [n] \u [] -> const.Int.- n one ; }
                in fib n_less_1 ; }
        in const.Int.+ fib_n_less_1 fib_n_less_2 ; }
    in const.Int.+ sum_2_fibs one ;
};
```

Optimised version

```stg'
fib = [] \r [n] -> case n of { Int n' -> fib.wrk n' ; };
fib.wrk = [] \r [n'] -> case leInt# [n', 1#] of
{ True -> Int [1#];
  False -> let# n'_less_1 = minusInt# [n', 1#] in
    case fib.wrk n'_less_1 of { Int fib_n'_less_1 ->
      let# n'_less_2 = minusInt# [n', 2#] in
      case fib.wrk n'_less_2 of { Int fib_n'_less_2 ->
        let# sum_2_fibs' = plusInt# [fib_n'_less_1, fib_n'_less_2] in
        let# result = plusInt# [sum_2_fibs', 1#] in Int [result];
    }; };
};
```
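
For comparison, the optimised program can be approximated in GHC Haskell using unboxed integers. This is a sketch only, assuming the MagicHash extension and the primitives exported by GHC.Exts; fibWorker is a hypothetical name standing in for fib.wrk.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, isTrue#, (+#), (-#), (<=#))

-- Wrapper: unbox the argument and call the worker.
fib :: Int -> Int
fib (I# n) = fibWorker n

-- Worker: all arithmetic is on unboxed values; the two recursive
-- results are unboxed by case expressions, as in the STG' code.
fibWorker :: Int# -> Int
fibWorker n
  | isTrue# (n <=# 1#) = I# 1#
  | otherwise =
      case fibWorker (n -# 1#) of
        I# a -> case fibWorker (n -# 2#) of
          I# b -> I# (a +# b +# 1#)
```

As in the STG' version, no thunks are built for the recursive calls; each is evaluated immediately by the enclosing case.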

B.3 Generating prime numbers – the sieve of Eratosthenes

```haskell
test :: Int -> Int
test a = let primes = map head (iterate the_filter (iterate succ 2))
         in primes !! a

the_filter :: [Int] -> [Int]
the_filter (n:ns) = filter (isdivs n) ns

isdivs :: Int -> Int -> Bool
isdivs n x = mod x n /= 0

succ :: Int -> Int
succ x = x + 1
```
Unoptimised version

```haskell
STG' code

test = [] \r [a] -> let { primes = [] \u [] ->
  let { xs = [] \u [] ->
    let { from_2 = [] \u [] -> iterate succ two;}
    in iterate the_filter from_2; }
    in map head xs; }
  in !! primes a;

the_filter = [] \r [nss] -> case nss of { Cons n ns ->
  let { isdivs_n = [n] \r [x] -> isdivs n x; } in filter isdivs_n ns; }

isdivs = [] \r [n x] -> let { mod_x_n = [n x] \u [] -> const.Int.mod x n; } in
  const.Int./= mod_x_n zero;

succ = [] \r [x] -> const.Int.+ x one;
```

Optimised version

```haskell
STG' code

test = [] \r [a] -> case a of { Int a' -> test.wrk a'; }

test.wrk = [] \r [a'] -> let { from_2 = [] \u [] -> iterate succ two; } in
  letstrict forced.xs = iterate the_filter from_2 in
  letstrict forced.primes = map head forced.xs in
  !!.wrk forced.primes a';

the_filter = [] \r [nss] -> case nss of { Cons n ns ->
  let { isdivs_n = [n] \r [x] -> case n of { Int n' -> case x of { Int x' ->
    isdivs.wrk n' x'; }; }; }
  in filter isdivs_n ns; }

isdivs = [] \r [n x] ->
  case n of { Int n' -> case x of { Int x' ->
    isdivs.wrk n' x'; }; }

succ = [] \r [x] ->
  case x of { Int x' -> let# succ.x = plusInt# [x', 1#] in Int [succ.x]; }

isdivs.wrk = [] \r [n' x'] -> case const.Int.mod.wrk x' n' of { Int mod' ->
  case mod' of { 0# -> False [] ;
    _ -> True [] }; }
```

B.4 The queens problem

```haskell
Haskell

nsoln :: Int -> Int
nsoln nq = length (gen nq nq)

safe :: Int -> Int -> [Int] -> Bool
safe x d [] = True
safe x d (q:l) = x /= q && x /= q+d && x /= q-d && safe x (d+1) l

gen :: Int -> Int -> [[Int]]
gen nq 0 = [[]]
gen nq n = [(q:b) | b <- gen nq (n-1), q <- [1..nq], safe q 1 b]
```

Unoptimised version

```haskell
STG' code

nsoln = [] \r [nq] -> let { solutions = [nq] \u [] -> gen nq nq; } in length solutions;
```
To improve readability, the definition of \( g \), given below, was removed from the body of \( \text{gen} \).

```haskell
STG' code

safe = [] \r [x d ds] -> case ds of
{ Nil -> True [];
  Cons q l -> let
    { c1 = [x d q l] \u [] -> let
      { c2 = [x d q l] \u [] -> let
        { c3 = [x d l] \u [] -> let
          { d_plus_1 = [d] \u [] -> const.Int.+ d one;
          } in safe x d_plus_1 l;
        } in let
        { c4 = [x d q] \u [] -> let
          { q_less_d = [d q] \u [] -> const.Int.- q d;
          } in const.Int./= x q_less_d;
        } in && c4 c3;
      } in let
      { c5 = [x d q] \u [] -> let
        { q_plus_d = [d q] \u [] -> const.Int.+ q d;
        } in const.Int./= x q_plus_d;
      } in && c5 c2;
    } in let
    { c6 = [x q] \u [] -> const.Int./= x q;
    } in && c6 c1;
};

STG' code

zero_soln = [] \r [] -> Cons [nil, nil];
gen = [] \r [nq ds] -> case ds of
{ Int ds' -> case ds' of
  { 0# -> zero_soln;
    _ -> letrec { f = [f nq] \u [] -> let
            { one_to_nq = [nq] \u [] -> const.Int.enumFromTo one nq; } in
            let
            { g = ** see below ** } in g; } in
         let
         { d = [ds nq] \u [] -> let
           { ds_less_1 = [ds] \u [] -> const.Int.- ds one; } in
           gen nq ds_less_1; }
         in f d;
  };
};

g = [f one_to_nq nq] \r [xss] -> case xss of
{ Nil -> Nil [];
  Cons x xs ->
    letrec { h = [f h x nq xs] \u [] ->
      let
      { a = [f xs] \u [] -> f xs; } in
      let
      { i = [x h nq a] \r [yss] -> case yss of
        { Nil -> a ;
          Cons y ys -> case safe y one x of
          { True -> let
            { b = [] \r [] -> Cons [y, x];
              c = [ys h] \u [] -> h ys; } in
            Cons [b, c];
            False -> h ys;
          };
        };
      } in i ;
    } in h one_to_nq;
};
```
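
The shape of the STG' code for gen follows from removing the list comprehension. A sketch of one such desugaring in plain Haskell is given below; the local names outer and inner are hypothetical, and correspond only loosely to g and h, since the STG' version also threads the generator list and free variables explicitly.

```haskell
safe :: Int -> Int -> [Int] -> Bool
safe _ _ []      = True
safe x d (q : l) = x /= q && x /= q + d && x /= q - d && safe x (d + 1) l

-- gen with the comprehension replaced by two nested traversals:
-- outer walks the partial solutions, inner walks the candidates 1..nq.
gen :: Int -> Int -> [[Int]]
gen _ 0  = [[]]
gen nq n = outer (gen nq (n - 1))
  where
    outer []       = []
    outer (b : bs) = inner [1 .. nq]
      where
        -- When the candidates run out, continue with the next partial solution.
        inner []       = outer bs
        inner (q : qs)
          | safe q 1 b = (q : b) : inner qs
          | otherwise  = inner qs

nsoln :: Int -> Int
nsoln nq = length (gen nq nq)
```

Under these definitions nsoln 8 evaluates to 92, the familiar number of solutions to the eight-queens problem.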
Optimised version

```haskell
STG' code

nsoln = [] \r [nq] -> case nq of { Int nq' -> nsoln.wrk nq'; };

nsoln.wrk = [] \r [nq'] -> let { nq = [] \r [] -> Int [nq']; } in
  letstrict solutions = gen.wrk nq nq'
  in length solutions;
```

```haskell
STG' code

safe = [] \r [x d ds] -> case ds of
{ Nil -> True [];
  Cons q l -> case x of { Int x' -> case q of { Int q' ->
    case neInt# [x', q'] of
    { False -> False [];
      True -> case d of { Int d' ->
        let# q_plus_d = plusInt# [q', d'] in
        case neInt# [x', q_plus_d] of
        { False -> False [];
          True -> let# q_less_d = minusInt# [q', d'] in
            case neInt# [x', q_less_d] of
            { False -> False [];
              True -> let# d_plus_1 = plusInt# [d', 1#] in
                let { d_plus_1 = [] \r [] -> Int [d_plus_1]; }
                in safe x d_plus_1 l;
            };
        }; };
    }; }; };
};
```

```haskell
STG' code

gen = [] \r [nq ds] -> case ds of { Int ds' -> gen.wrk nq ds'; };

gen.wrk = [] \r [nq upk] -> case upk of
{ 0# -> zero_soln ;
  _  -> let# upk_less_1 = minusInt# [upk, 1#] in
        letstrict bs = gen.wrk nq upk_less_1 in
        let { one_to_nq = [nq] \u [] -> const.Int.enumFromTo one nq; }
        in gen_comprehension nq one_to_nq bs;
};

gen_comprehension = [] \r [nq one_to_nq dss] -> case dss of
{ Nil -> Nil [];
  Cons d ds -> let { a = [nq one_to_nq ds] \u [] ->
                       gen_comprehension nq one_to_nq ds; }
               in g a d nq one_to_nq;
};

g = [] \r [a d nq one_to_nq] -> case one_to_nq of
{ Nil -> a ;
  Cons x xs -> case safe x one d of
    { False -> g a d nq xs;
      True -> let { b = [] \r [] -> Cons [x, d];
                    c = [a d xs nq] \u [] -> g a d nq xs; }
              in Cons [b, c];
    };
};
```
B.5 Hamming’s problem

The following program is directly based on that presented by Hudak and Anderson [1988, section 3].

```haskell
hamming :: [Int] -> [Int]
hamming primes = 1 : (foldl f [] primes)
    where f xs p = let h = merge (scale p (1 : h)) xs in h

merge :: [Int] -> [Int] -> [Int]
merge [] bss = bss
merge ass [] = ass
merge ass@(a:as) bss@(b:bs)
    | a < b = a : (merge as bss)
    | otherwise = b : (merge ass bs)

scale :: Int -> [Int] -> [Int]
scale p xs = map (* p) xs

isdivs :: Int -> Int -> Bool
isdivs n x = mod x n /= 0

the_filter :: [Int] -> [Int]
the_filter (n:ns) = filter (isdivs n) ns

test :: Int -> Int -> Int
test cut_off no_primes = length sequence
    where sequence   = takeWhile (< cut_off) (hamming few_primes)
          primes     = map head (iterate the_filter (iterate succ 2))
          few_primes = take no_primes primes
```
Unoptimised version

---

**STG' code**

```haskell
hamming = [] \r [primes] -> let { as = [primes] \u [] -> foldl f Nil primes; }
          in Cons [one, as];

f = [] \r [xs p] ->
     letrec { h = [h xs p] \u [] ->
                let { a = [h p] \u [] ->
                          let { xs = [] \r [] -> Cons [one, h]; }
                          in scale p xs; }
                in merge a xs; }
     in h ;

merge = [] \r [ass bss] -> case ass of
     { Nil -> bss;
       Cons a as -> case bss of
         { Nil -> Cons [a, as];
           Cons b bs -> case const.Int.< a b of
             { True -> let { xs = [bss as] \u [] -> merge as bss; }
                       in Cons [a, xs];
               False -> case otherwise of
                 { True -> let { ys = [ass bs] \u [] -> merge ass bs; }
                           in Cons [b, ys];
                   False -> error; }; }; }; };

scale = [] \r [p xs] -> let { g = [p] \r [a] -> const.Int.* a p; }
            in map g xs;

isdivs = [] \r [n x] -> let { a = [n x] \u [] -> const.Int.mod x n; }
            in const.Int./= a zero;

the_filter = [] \r [ds] -> case ds of
    { Nil -> error;
      Cons n ns -> let { a = [n] \r [x] -> isdivs n x; }
                   in filter a ns;
    };
```

---

**STG' code**

```haskell
test = [] \r [cut_off no_primes] ->
        let { primes = [] \u [] ->
                let { as = [] \u [] -> let { from_two = [] \r [] -> iterate succ two; }
                                       in iterate the_filter from_two; }
                in map head as; }
        in let { few_primes = [primes no_primes] \u [] -> take no_primes primes; }
           in let { sequence = [few_primes cut_off] \u [] ->
                      let { bs = [few_primes] \u [] -> hamming few_primes;
                            p = [cut_off] \r [x] -> const.Int.< x cut_off; }
                      in takeWhile p bs; }
              in length sequence;
```
Optimised version

---

STG' code

```haskell
hamming = [] \r [primes] -> let { as = [primes] \u [] -> foldl f Nil primes; } in Cons [one, as];
f = [] \r [xs p] -> letrec { h = [p h xs] \u [] ->
      let { as = [] \r [] -> Cons [one, h]; } in
      letstrict ys = scale p as
      in merge ys xs; }
    in h;

merge = [] \r [ass bss] -> case ass of
{ Nil -> bss; Cons a as -> case bss of
  { Nil -> Cons [a, as]; Cons b bs -> case a of
    { Int a' -> case b of
      { Int b' ->
        case ltInt# [a', b'] of
        { True -> let { cs = [bss as] \u [] -> merge as bss; }
                  in Cons [a, cs];
          False -> let { cs = [ass bs] \u [] -> merge ass bs; }
                   in Cons [b, cs];
        }; }; }; }; };

scale = [] \r [p xs] ->
  let { g = [p] \r [a] -> case a of
        { Int a' -> case p of
          { Int p' ->
            let# a_times_p = timesInt# [a', p']
            in Int [a_times_p];
          }; }; }
  in map g xs;

isdivs = [] \r [n x] -> case n of
{ Int n' -> case x of
  { Int x' ->
    isdivs.wrk n' x'; };
};
isdivs.wrk = [] \r [n' x'] -> case const.Int.mod.wrk x' n' of
{ Int mod_x_n ->
  case mod_x_n of
  { 0# -> False [];
    _ -> True [];
  };
};

the_filter = [] \r [nss] -> case nss of
{ Nil -> error; Cons n ns -> let { is_divs_n = [n] \r [x] -> case n of
    { Int n' ->
      case x of
      { Int x' ->
        isdivs.wrk n' x'; }; }; }
    in filter is_divs_n ns;
};

---

STG' code

```
Appendix C

The STG' language and the nofib benchmark suite

When developing a compiler for a particular language it is often helpful to have a feel for how the language's constructs are typically used. To provide this empirical information for the STG' language, elements of the nofib benchmark suite were compiled to STG' code and statically analysed (the dynamic aspects of the benchmark suite have been explored by Santos [1995]). The collected data covers the distribution of arguments and free variables, algebraic data types, case and letrec expressions, etc.

Sections C.1 and C.2 look at the nofib benchmark suite and the gathering of the data, sections C.3 to C.6 present the results, and section C.7 points out the limitations of the method.

C.1 The nofib benchmark suite

The nofib benchmark suite [Partain, 1993] is a publicly available collection of small to large Haskell programs, split into three categories: the imaginary subset, containing toy programs useful for testing the correctness of a compilation system but of no real benchmarking worth; the real subset, made up of programs written to perform a useful task; and the spectral subset, which contains everything else, and includes the benchmark programs used by Hartel [1994]. For the purpose of this study, only the real subset of the suite has been used, a brief overview of which is given in table C.1.

C.2 Gathering the data

In order to generate the required statistics, version 0.23 of the Glasgow Haskell compiler was modified to bring its concrete syntax (of the STG language) into line with that used within this report, and the -flet-no-escape option was removed from the optimisation package. Each of the benchmarks was then compiled to STG' code (ghc -O -ddump-stg) and the resulting programs analysed using a combination of Unix shell scripts and Emacs Lisp macros.
<table>
<thead>
<tr>
<th>program</th>
<th>description</th>
<th>STG' lines</th>
</tr>
</thead>
<tbody>
<tr>
<td>anna</td>
<td>strictness analyser</td>
<td>53 220</td>
</tr>
<tr>
<td>bspt</td>
<td>BSP-tree modeller</td>
<td>17 288</td>
</tr>
<tr>
<td>compress</td>
<td>text compression</td>
<td>2 224</td>
</tr>
<tr>
<td>compress2</td>
<td>text compression</td>
<td>2 026</td>
</tr>
<tr>
<td>ebnf2ps</td>
<td>BNF grammar to postscript utility</td>
<td>23 311</td>
</tr>
<tr>
<td>fluid</td>
<td>fluid-dynamics program</td>
<td>20 616</td>
</tr>
<tr>
<td>fulsom</td>
<td>solid modelling</td>
<td>13 223</td>
</tr>
<tr>
<td>gg</td>
<td>graphs from GRIP statistics</td>
<td>12 449</td>
</tr>
<tr>
<td>grep</td>
<td>simple version of the Unix command</td>
<td>861</td>
</tr>
<tr>
<td>hidden</td>
<td>hidden-line removal</td>
<td>7 415</td>
</tr>
<tr>
<td>hpg</td>
<td>Haskell program generator</td>
<td>7 485</td>
</tr>
<tr>
<td>infer</td>
<td>Hindley-Milner type inference</td>
<td>3 090</td>
</tr>
<tr>
<td>lift</td>
<td>fully-lazy lambda lifter</td>
<td>4 707</td>
</tr>
<tr>
<td>maillist</td>
<td>mailing-list generator</td>
<td>596</td>
</tr>
<tr>
<td>mkhprog</td>
<td>Haskell program skeletons</td>
<td>1 759</td>
</tr>
<tr>
<td>parser</td>
<td>partial Haskell parser</td>
<td>14 537</td>
</tr>
<tr>
<td>pic</td>
<td>particle in cell</td>
<td>5 285</td>
</tr>
<tr>
<td>prolog</td>
<td>“mini-Prolog” interpreter</td>
<td>2 606</td>
</tr>
<tr>
<td>reptile</td>
<td>Escher tiling program</td>
<td>12 527</td>
</tr>
<tr>
<td>rsa</td>
<td>RSA encryption/decryption</td>
<td>1 016</td>
</tr>
<tr>
<td>symalg</td>
<td>variable-precision calculator</td>
<td>11 114</td>
</tr>
<tr>
<td>veritas</td>
<td>theorem-prover</td>
<td>32 551</td>
</tr>
</tbody>
</table>

Table C.1: The real subset of the nofib benchmark suite
C.3 Algebraic data types

The number of constructors per data type for the 155 non-prelude definitions is distributed as follows:

<table>
<thead>
<tr>
<th>constructors</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>54</td>
<td>26</td>
<td>17</td>
<td>19</td>
<td>11</td>
<td>18</td>
<td>8</td>
<td>2</td>
<td>36</td>
</tr>
<tr>
<td>percentage</td>
<td>34.8</td>
<td>16.8</td>
<td>11.0</td>
<td>12.3</td>
<td>7.1</td>
<td>11.6</td>
<td>5.2</td>
<td>1.3</td>
<td></td>
</tr>
</tbody>
</table>

For the 21 prelude types used by the benchmark programs (Array, Assoc, Bool, IOError, List, Ratio, Request, Response, Tup0, Tup2–Tup10, Tup12, and Tup19) and the Glasgow-specific _CMP_TAG, the distribution is as follows:

<table>
<thead>
<tr>
<th>constructors</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>15</td>
<td>2</td>
<td>1</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>18</td>
</tr>
<tr>
<td>percentage</td>
<td>71.4</td>
<td>9.5</td>
<td>4.8</td>
<td>0</td>
<td>9.5</td>
<td>0</td>
<td>4.8</td>
<td>0</td>
<td></td>
</tr>
</tbody>
</table>

The lifted versions of the primitive types, such as Int and Char, are not included in this data.
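
The percentage rows in this appendix are each absolute count divided by the row total, rounded to one decimal place. As a worked check, the helper below (a hypothetical function, not part of the original tool chain, which used shell scripts and Emacs Lisp) recomputes such a row; it assumes a non-empty row with a non-zero total.

```haskell
-- Recompute a percentage row from a row of absolute counts,
-- rounding each entry to one decimal place.
percentages :: [Int] -> [Double]
percentages counts = map toPercent counts
  where
    total :: Double
    total = fromIntegral (sum counts)

    toPercent :: Int -> Double
    toPercent c =
      fromIntegral (round (1000 * fromIntegral c / total) :: Integer) / 10
```

Applied to the prelude-type counts above, percentages [15, 2, 1, 0, 2, 0, 1, 0] reproduces the row 71.4, 9.5, 4.8, 0, 9.5, 0, 4.8, 0.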

The distribution of the number of arguments of the 600 non-prelude constructors is as follows:

<table>
<thead>
<tr>
<th>arguments</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>233</td>
<td>203</td>
<td>85</td>
<td>42</td>
<td>15</td>
<td>10</td>
<td>9</td>
<td>3</td>
<td>20</td>
</tr>
<tr>
<td>percentage</td>
<td>38.8</td>
<td>33.8</td>
<td>14.2</td>
<td>7.0</td>
<td>1.9</td>
<td>1.3</td>
<td>1.1</td>
<td>0.4</td>
<td></td>
</tr>
</tbody>
</table>

The distribution for the 50 prelude constructors is:

<table>
<thead>
<tr>
<th>arguments</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>10</td>
<td>18</td>
<td>12</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>5</td>
<td>2</td>
<td>19</td>
</tr>
<tr>
<td>percentage</td>
<td>20.0</td>
<td>36.0</td>
<td>24.0</td>
<td>2.0</td>
<td>2.0</td>
<td>2.0</td>
<td>10.0</td>
<td>4.0</td>
<td></td>
</tr>
</tbody>
</table>

C.4 Bindings and let(rec) expressions

There are a total of 15 878 global definitions, 17 354 let bindings, and 127 letrec bindings, in addition to the 2 245 letstrict expressions and 2 125 let# expressions. The relative mixture of functions, constructors and thunks is shown below:

<table>
<thead>
<tr>
<th>closure</th>
<th>constructor</th>
<th>function</th>
<th>thunk</th>
<th>other</th>
<th>total</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>13 235</td>
<td>6 571</td>
<td>12 861</td>
<td>692</td>
<td>33 359</td>
</tr>
<tr>
<td>percentage</td>
<td>39.7</td>
<td>19.7</td>
<td>38.6</td>
<td>2.1</td>
<td></td>
</tr>
</tbody>
</table>

The other category is primarily niladic functions.

Of the 127 letrec expressions, the distribution of the number of bindings is as follows:

<table>
<thead>
<tr>
<th>bindings</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>82</td>
<td>21</td>
<td>10</td>
<td>4</td>
<td>1</td>
<td>4</td>
<td>5</td>
<td>20</td>
</tr>
<tr>
<td>percentage</td>
<td>61.2</td>
<td>15.7</td>
<td>7.5</td>
<td>3.0</td>
<td>0.7</td>
<td>3.0</td>
<td>3.7</td>
<td></td>
</tr>
</tbody>
</table>

The distribution of the length of allocation chains (any uninterrupted series of let and letrec expressions) is shown below:

<table>
<thead>
<tr>
<th>bindings</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>5 006</td>
<td>1 395</td>
<td>599</td>
<td>313</td>
<td>120</td>
<td>276</td>
<td>69</td>
<td>30</td>
<td>114</td>
</tr>
<tr>
<td>percentage</td>
<td>64.1</td>
<td>17.9</td>
<td>7.7</td>
<td>4.0</td>
<td>1.5</td>
<td>3.5</td>
<td>0.9</td>
<td>0.4</td>
<td></td>
</tr>
</tbody>
</table>

In addition to explicit allocation, it may be necessary to heap allocate large constructors.
C.4.1 Free variables

By definition, global definitions do not have free variables (see section 4.5.4), so the information presented here relates to only the non-global bindings.

The distribution for the 1208 function bindings is as follows:

<table>
<thead>
<tr>
<th>free variables</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>651</td>
<td>314</td>
<td>110</td>
<td>37</td>
<td>25</td>
<td>47</td>
<td>19</td>
<td>5</td>
<td>36</td>
</tr>
<tr>
<td>percentage</td>
<td>53.9</td>
<td>26.0</td>
<td>9.1</td>
<td>3.1</td>
<td>2.1</td>
<td>3.9</td>
<td>1.6</td>
<td>0.4</td>
<td></td>
</tr>
</tbody>
</table>

For the 8401 thunks, the free-variable distribution is:

<table>
<thead>
<tr>
<th>free variables</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>4078</td>
<td>2336</td>
<td>1063</td>
<td>414</td>
<td>226</td>
<td>253</td>
<td>29</td>
<td>2</td>
<td>21</td>
</tr>
<tr>
<td>percentage</td>
<td>48.5</td>
<td>27.8</td>
<td>12.7</td>
<td>4.9</td>
<td>2.7</td>
<td>3.0</td>
<td>0.3</td>
<td>0.0</td>
<td></td>
</tr>
</tbody>
</table>

The Glasgow Haskell compiler treats constructor bindings (anything of the form \( \text{var} = [\,] \ \backslash\text{r} \ [\,] \rightarrow \text{cons atoms} \)) as a special case, so free-variable information was not recorded for these cases.

C.4.2 Function arguments

The distribution of the number of arguments for the 6571 functions is given below:

<table>
<thead>
<tr>
<th>arguments</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>3213</td>
<td>1856</td>
<td>732</td>
<td>320</td>
<td>152</td>
<td>264</td>
<td>32</td>
<td>2</td>
<td>22</td>
</tr>
<tr>
<td>percentage</td>
<td>48.9</td>
<td>28.2</td>
<td>11.1</td>
<td>4.9</td>
<td>2.3</td>
<td>4.0</td>
<td>0.5</td>
<td>0.0</td>
<td></td>
</tr>
</tbody>
</table>

C.5 case expressions

Note that for the purpose of this section, the original definition of the case expression is used (i.e. let# and letstrict are considered to be case expressions using named defaults).

Of the 12709 case expressions which scrutinise prelude constructors, the number of constructors associated with the data type of the scrutinee is distributed as follows:

<table>
<thead>
<tr>
<th>constructors</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>6991</td>
<td>5626</td>
<td>78</td>
<td>0</td>
<td>14</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td>percentage</td>
<td>55.0</td>
<td>44.3</td>
<td>0.6</td>
<td>0</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
</tbody>
</table>

With regard to the unary constructors, 3687 of the case expressions are used to de-construct the lifted primitive types Int, Char, etc., and 492 de-construct the tuples used to support type classes.

The 3279 case expressions which scrutinise the user-defined algebraic data types have the following distribution:

<table>
<thead>
<tr>
<th>constructors</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>1473</td>
<td>497</td>
<td>263</td>
<td>364</td>
<td>108</td>
<td>325</td>
<td>231</td>
<td>18</td>
<td>36</td>
</tr>
<tr>
<td>percentage</td>
<td>44.9</td>
<td>15.2</td>
<td>8.0</td>
<td>11.1</td>
<td>3.3</td>
<td>9.9</td>
<td>7.0</td>
<td>0.5</td>
<td></td>
</tr>
</tbody>
</table>

The distribution for both prelude-defined and user-defined data types (15988 case expressions in total) is given below:

<table>
<thead>
<tr>
<th>constructors</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6–10</th>
<th>11–20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>8464</td>
<td>6123</td>
<td>341</td>
<td>364</td>
<td>122</td>
<td>325</td>
<td>231</td>
<td>18</td>
<td>36</td>
</tr>
<tr>
<td>percentage</td>
<td>52.9</td>
<td>38.3</td>
<td>2.1</td>
<td>2.3</td>
<td>0.8</td>
<td>2.0</td>
<td>1.4</td>
<td>0.1</td>
<td></td>
</tr>
</tbody>
</table>
Of these, 2245 (14.0 percent) are letstrict expressions.

The 2245 literal case expressions take the following types:

<table>
<thead>
<tr>
<th>type</th>
<th>Char#</th>
<th>Int#</th>
<th>Float#</th>
<th>Double#</th>
<th>Word#</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>209</td>
<td>1407</td>
<td>266</td>
<td>360</td>
<td>3</td>
</tr>
<tr>
<td>percentage</td>
<td>9.3</td>
<td>62.7</td>
<td>11.8</td>
<td>16.0</td>
<td>0.1</td>
</tr>
</tbody>
</table>

Of these, 2125 (94.7 percent) are let# expressions.

C.6 Constructor application

The distribution of the number of arguments of the 3161 user-defined constructor applications is given below:

<table>
<thead>
<tr>
<th>arguments</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6-10</th>
<th>11-20</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>197</td>
<td>1348</td>
<td>928</td>
<td>355</td>
<td>122</td>
<td>131</td>
<td>74</td>
<td>6</td>
<td>20</td>
</tr>
<tr>
<td>percentage</td>
<td>6.2</td>
<td>42.6</td>
<td>29.4</td>
<td>11.2</td>
<td>3.9</td>
<td>4.1</td>
<td>2.3</td>
<td>0.2</td>
<td></td>
</tr>
</tbody>
</table>

As for the prelude constructors, of which there are 7998, the distribution is:

<table>
<thead>
<tr>
<th>arguments</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6-10</th>
<th>11-20</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>1725</td>
<td>12</td>
<td>1683</td>
<td>4257</td>
<td>193</td>
<td>24</td>
<td>101</td>
<td>3</td>
<td>19</td>
</tr>
<tr>
<td>percentage</td>
<td>21.6</td>
<td>0.2</td>
<td>21.0</td>
<td>53.2</td>
<td>2.4</td>
<td>0.3</td>
<td>1.3</td>
<td>0.0</td>
<td></td>
</tr>
</tbody>
</table>

The following distribution illustrates the total number of constructors that belong to the same user-defined type as the constructor being applied:

<table>
<thead>
<tr>
<th>constructors</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6-10</th>
<th>11-20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>716</td>
<td>661</td>
<td>334</td>
<td>218</td>
<td>289</td>
<td>542</td>
<td>375</td>
<td>26</td>
<td>36</td>
</tr>
<tr>
<td>percentage</td>
<td>22.7</td>
<td>20.9</td>
<td>10.6</td>
<td>6.9</td>
<td>9.1</td>
<td>17.1</td>
<td>11.9</td>
<td>0.8</td>
<td></td>
</tr>
</tbody>
</table>

The same distribution for the prelude constructors is:

<table>
<thead>
<tr>
<th>constructors</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6-10</th>
<th>11-20</th>
<th>21+</th>
<th>max.</th>
</tr>
</thead>
<tbody>
<tr>
<td>absolute</td>
<td>2283</td>
<td>5612</td>
<td>81</td>
<td>1</td>
<td>21</td>
<td>0</td>
<td>21</td>
<td>0</td>
<td>18</td>
</tr>
<tr>
<td>percentage</td>
<td>28.5</td>
<td>70.2</td>
<td>1.0</td>
<td>0.0</td>
<td>0.3</td>
<td>0</td>
<td>0.3</td>
<td>0</td>
<td></td>
</tr>
</tbody>
</table>

C.7 Limitations of the analysis

A number of criticisms can be levelled at the data presented in the previous sections:

- static analysis is a poor indicator of what would actually happen during execution
- the style of STG' code generated is an artifact of the Glasgow Haskell compiler, and should not be used to infer general patterns
- most of the nofib benchmarks were coded with a sequential architecture in mind, so the data has no meaning in a parallel context
- larger programs, such as anna and veritas, will dominate the results

To a certain extent, all of these points are valid. But as the data is only intended to serve as a rough guide, the problems of the collection method can be overlooked.
Appendix D

Polymorphic type rules for the STG' language

This appendix presents the type rules discussed in section 4.5.3, with the order of presentation closely following that of the abstract syntax (see figure 4.1).

D.1 Terminology

The notation adopted here is based on that used by Peyton Jones and Wadler [1992].

Type rules

All of the rules take the following form:

\[
\boxed{\text{type signature}}
\]

\[
\text{Name} \quad \frac{\begin{array}{c} \text{premise}_1 \\ \vdots \\ \text{premise}_n \end{array}}{\text{conclusion}}
\]

Usually, both the premises and conclusion will be judgement forms:

\[
\text{environment} \quad \vdash \quad \text{construct} : \text{type}
\]

More generally, for rules which make use of or generate more complex data, the judgement form will look like:

\[
\text{inherited} \quad \vdash \quad \text{construct} : \text{synthesised}
\]
Environments

An environment is a finite mapping, usually from identifiers to types, either explicitly constructed, e.g. \( \{x \mapsto \tau_1, y \mapsto \tau_2\} \), or created by merging two existing environments:

\[
(env_1 \oplus env_2) \, var = \begin{cases} 
env_1 \, var, & \text{if } var \in \text{dom}(env_1) \text{ and } var \notin \text{dom}(env_2) \\
env_2 \, var, & \text{if } var \in \text{dom}(env_2) \text{ and } var \notin \text{dom}(env_1)
\end{cases}
\]

\[
(env_1 \mapsto env_2) \, var = \begin{cases} 
env_1 \, var, & \text{if } var \in \text{dom}(env_1) \text{ and } var \notin \text{dom}(env_2) \\
env_2 \, var, & \text{if } var \in \text{dom}(env_2)
\end{cases}
\]

where \( \text{dom}(env) \) returns the domain of the environment. As shown previously, an identifier’s value can be retrieved by applying the environment to it, \( (env \, var) \), but the preferred method is to treat the mapping as a set of tuples and test for membership, i.e. \( (id, value) \in env \).
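
The two ways of merging environments can be illustrated with association lists. The sketch below is an assumption-laden model rather than the representation used by the type checker itself: Env, disjointMerge and overrideMerge are hypothetical names, with disjointMerge standing for the merge that requires disjoint domains and overrideMerge for the one in which the second environment takes precedence.

```haskell
-- An environment modelled as a list of (identifier, value) tuples.
type Env v = [(String, v)]

dom :: Env v -> [String]
dom = map fst

-- Disjoint merge: defined only when the two domains do not overlap.
disjointMerge :: Env v -> Env v -> Maybe (Env v)
disjointMerge env1 env2
  | any (`elem` dom env2) (dom env1) = Nothing
  | otherwise                        = Just (env1 ++ env2)

-- Overriding merge: a binding in env2 hides any binding of the
-- same identifier in env1.
overrideMerge :: Env v -> Env v -> Env v
overrideMerge env1 env2 =
  env2 ++ [binding | binding@(var, _) <- env1, var `notElem` dom env2]
```

With this representation, retrieving a value is a call to the Prelude function lookup, and the membership test is a call to elem.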

The environments used by the algorithm, as summarised in table 4.2, are as follows:

- **constructor environment** for the purpose of typing, a constructor \( \text{cons} \, \tau_1 \ldots \tau_n \), belonging to the algebraic data type \( \chi \, \pi_1 \ldots \pi_m \), is treated as a function of type \( \tau_1 \rightarrow \cdots \rightarrow \tau_n \rightarrow \chi \, \pi_1 \ldots \pi_m \).

- **primitive environment** rather than providing an explicit rule for every primitive function, this environment maps primitives to polymorphic types.

- **general type environment** is used to store the polytype of all bound polymorphic variables currently in scope, including top-level definitions and all variables defined by \texttt{let} or \texttt{letrec} expressions.

- **local type environment** stores the monotype of the formal arguments of the current binding, and any additional variables introduced by case alternatives, and \texttt{letstrict} or \texttt{let#} expressions.

- **type-constructor environment** records the arity (the number of type arguments required) of each type constructor, along with the number and sequence of its constituent constructors.

Free variables

The free type variables of either a language construct or a whole environment may be determined using a named rule of the \( \mathcal{FV}[\_] \) algorithm.

Implicit conditions

To reduce the size of the presented rules, a number of conditions have been left implicit:

- where a type attribute has been inferred by two or more different rules, each of the resulting values must unify. The unified type is then used as the final result.

- occurrences of \( env_1 \oplus env_2 \) require that: \( \text{dom}(env_1) \cap \text{dom}(env_2) = \emptyset \)
The partition function first constructs a dependency graph of the top-level definitions and uses it to break the bindings up into strongly-connected components, each of which is a group of mutually recursive definitions. The resulting groups are then sorted into topological order. The overall effect of this ordering is to convert the top-level bindings into a series of nested letrec expressions. This minimises the impact of the monomorphism restriction, as described in section 4.5.3.
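The dependency analysis can be sketched with the `Data.Graph` module from the containers package. The `Binding` representation (a name paired with the names it refers to) and the function name `partitionBinds` are assumptions made for illustration; this is not the thesis implementation.

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)

-- | A top-level binding reduced to its name and the names it refers to
-- (hypothetical representation; right-hand sides are elided).
type Binding = (String, [String])

-- | Split the bindings into strongly-connected groups. The groups are
-- returned with dependencies first, so each group can become one
-- letrec nested inside the groups that precede it.
partitionBinds :: [Binding] -> [[String]]
partitionBinds binds = map flattenSCC (stronglyConnComp nodes)
  where
    -- stronglyConnComp expects (node, key, keys of successors)
    nodes = [ (name, name, deps) | (name, deps) <- binds ]
```

`stronglyConnComp` returns its components in reverse topological order, which is exactly the "definitions before uses" order needed here, so no separate sorting pass is required.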

D.3 Algebraic data types

Type declarations

\[
\boxed{\vdash \text{tygendecls} : (TCE, CE)}
\]

\[
\frac{TCE \vdash \text{tygendecl}_i : (TCE_i, CE_i) \quad (1 \leq i \leq n)}{\vdash \text{tygendecl}_1 \ldots \text{tygendecl}_n : (TCE, CE)}
\]

Individual type declarations

\[
\boxed{TCE \vdash \text{tygendecl} : (TCE, CE)}
\]

\[
\frac{\begin{array}{c}
\mathcal{FV}_{\text{condecls}}[\text{condecls}] = \{ \alpha_1, \ldots, \alpha_v \} \\
TCE ; \chi \, \alpha_1 \ldots \alpha_v \vdash \text{condecls} : (CE, n, \langle \text{cons}_1, \ldots, \text{cons}_n \rangle) \\
TCE' = \{ \chi \mapsto (v, n, \langle \text{cons}_1, \ldots, \text{cons}_n \rangle) \}
\end{array}}{TCE \vdash \text{data } \chi \, \alpha_1 \ldots \alpha_v = \text{condecls} : (TCE', CE)}
\]
Constructor declarations

\[
\boxed{TCE ; \tau_\chi \vdash \text{condecls} : (CE, n, \langle \text{cons} \rangle)}
\]

\[
\text{CONDECLS} \quad \frac{\begin{array}{c}
TCE ; \tau_\chi \vdash \text{condecl}_i : (CE_i, \text{cons}_i) \\
CE = \bigoplus_{1 \leq i \leq n} CE_i
\end{array}}{TCE ; \tau_\chi \vdash \text{condecl}_1 \ldots \text{condecl}_n : (CE, n, \langle \text{cons}_1, \ldots, \text{cons}_n \rangle)}
\]

Individual constructor declarations

\[
\boxed{TCE ; \tau_\chi \vdash \text{condecl} : (CE, \text{cons})}
\]

\[
\text{CONDECL} \quad \frac{\begin{array}{c}
TCE \vdash \tau_i \quad (1 \leq i \leq f) \\
\emptyset \vdash_{\text{gen}} \tau_1 \rightarrow \cdots \rightarrow \tau_f \rightarrow \tau_\chi : \sigma \\
CE = \{ \text{cons} \mapsto (f, \sigma) \}
\end{array}}{TCE ; \tau_\chi \vdash \text{cons} \, \tau_1 \ldots \tau_f : (CE, \text{cons})}
\]

Monotypes

\[
\boxed{TCE \vdash_{\text{monotype}} \tau}
\]

\[
\text{BOXED-MONO} \quad \frac{TCE \vdash_{\text{boxedtype}} \pi}{TCE \vdash_{\text{monotype}} \pi}
\qquad
\text{UNBOXED-MONO} \quad \frac{}{TCE \vdash_{\text{monotype}} \nu}
\]
Boxed types

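\[
\boxed{TCE \vdash_{\text{boxedtype}} \pi}
\]

\[
\text{BOXED-VAR} \quad \frac{}{TCE \vdash_{\text{boxedtype}} \alpha}
\]

\[
\text{BOXED-FUN} \quad \frac{TCE \vdash_{\text{monotype}} \tau_i \quad i \in \{1, 2\}}{TCE \vdash_{\text{boxedtype}} \tau_1 \rightarrow \tau_2}
\]

\[
\text{BOXED-CON} \quad \frac{TCE \vdash_{\text{boxedtype}} \pi_i \quad (1 \leq i \leq v)}{TCE \vdash_{\text{boxedtype}} \chi \, \pi_1 \ldots \pi_v}
\]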

D.4 Bindings and lambda forms

Recursive bindings

\[
\boxed{TE \vdash_{\text{recbinds}} \text{binds} : GVE}
\]

\[
\text{REC-BINDS} \quad \frac{\begin{array}{c}
TE \circ LVE \vdash_{\text{binds}} \text{binds} : GVE \\
LVE = \{ \text{var}_i \mapsto \tau_i \mid (\text{var}_i, \sigma_i) \in GVE, \; \vdash_{\text{spec}} \sigma_i : \tau_i \}
\end{array}}{TE \vdash_{\text{recbinds}} \text{binds} : GVE}
\]

Bindings

\[
\boxed{TE \vdash_{\text{binds}} \text{binds} : GVE}
\]

\[
\text{BINDS} \quad \frac{\begin{array}{c}
TE \vdash \text{bind}_i : (\text{var}_i, \tau_i) \\
TE \vdash_{\text{gen}} \tau_i : \sigma_i \\
GVE = \bigoplus_{i \leq n} \{ \text{var}_i \mapsto \sigma_i \}
\end{array}}{TE \vdash_{\text{binds}} \text{bind}_1 \ldots \text{bind}_n : GVE}
\]

Individual bindings

\[
\boxed{TE \vdash \text{bind} : (\text{var}, \tau)}
\]

\[
\text{BIND} \quad \frac{TE \vdash \text{lambda\_form} : \tau}{TE \vdash \text{var} = \text{lambda\_form} : (\text{var}, \tau)}
\]
Simple bindings

\[
\boxed{TE \vdash \text{simplebind} : (\text{var}, \tau)}
\]

\[
\text{SIMPLE-BIND} \quad \frac{TE \vdash \text{exp} : \tau}{TE \vdash \text{var} = \text{exp} : (\text{var}, \tau)}
\]

Lambda forms

\[
\boxed{TE \vdash \text{lambda\_form} : \tau}
\]

\[
\text{LAMBDA} \quad \frac{\begin{array}{c}
LVE = \bigoplus_{i \leq n} \{ \text{arg}_i \mapsto \tau_i \} \\
TE \circ LVE \vdash \text{exp} : \tau_{\text{exp}}
\end{array}}{TE \vdash \text{vars}_{\text{free}} \; \pi \; \text{arg}_1 \ldots \text{arg}_n \rightarrow \text{exp} : \tau_1 \rightarrow \cdots \rightarrow \tau_n \rightarrow \tau_{\text{exp}}}
\]

D.5 Expressions

\[
TE \vdash \text{exp} : \tau
\]

The let expression

\[
\text{LET-EXP} \quad \frac{\begin{array}{c}
TE \vdash_{\text{binds}} \text{bindings} : GVE \\
TE \circ GVE \vdash \text{exp} : \tau_{\text{exp}}
\end{array}}{TE \vdash \text{let} \; \text{bindings} \; \text{exp} : \tau_{\text{exp}}}
\]

The letrec expression

\[
\text{LETREC-EXP} \quad \frac{\begin{array}{c}
TE \vdash_{\text{recbinds}} \text{bindings} : GVE \\
TE \circ GVE \vdash \text{exp} : \tau_{\text{exp}}
\end{array}}{TE \vdash \text{letrec} \; \text{bindings} \; \text{exp} : \tau_{\text{exp}}}
\]

The let# expression

\[
\text{LET\#-EXP} \quad \frac{\begin{array}{c}
TE \vdash \text{simplebind} : (\text{var}, \nu) \\
LVE = \{ \text{var} \mapsto \nu \} \\
TE \circ LVE \vdash \text{exp} : \tau_{\text{exp}}
\end{array}}{TE \vdash \text{let\#} \; \text{simplebind} \; \text{exp} : \tau_{\text{exp}}}
\]
### The letstrict expression

\[
\text{LETSTRICT-EXP} \quad \frac{\begin{array}{c}
TE \vdash \text{simplebind} : (\text{var}, \chi \, \pi_1 \ldots \pi_v) \\
LVE = \{ \text{var} \mapsto \chi \, \pi_1 \ldots \pi_v \} \\
TE \circ LVE \vdash \text{exp} : \tau_{\text{exp}}
\end{array}}{TE \vdash \text{letstrict} \; \text{simplebind} \; \text{exp} : \tau_{\text{exp}}}
\]

### The case expression

\[
\text{CASE-EXP} \quad \frac{\begin{array}{c}
TE \vdash \text{exp} : \tau_{\text{exp}} \\
TE \vdash \text{alts} : \tau_{\text{exp}} \rightarrow \tau_{\text{result}} \;\land\; \text{no\_overlap} \; \text{alts} \\
TE \vdash \text{default} : \tau_{\text{exp}} \rightarrow \tau_{\text{result}}
\end{array}}{TE \vdash \text{case} \; \text{exp} \; \text{of} \; \text{alts} \; \text{default} : \tau_{\text{result}}}
\]

The `no_overlap` function examines the left-hand side of each alternative making sure that there is no repetition.
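A minimal sketch of such a check is given below. It assumes that two alternatives overlap when they match the same literal or the same constructor; the `AltLhs` type and the names `altKey` and `noOverlap` are hypothetical.

```haskell
import Data.List (nub)

-- | Left-hand side of a case alternative (assumed representation).
data AltLhs
  = LitLhs Integer          -- a literal alternative
  | ConLhs String [String]  -- a constructor applied to pattern variables
  deriving (Eq, Show)

-- | The part of a left-hand side that must be unique: the literal
-- itself, or the constructor name (pattern variables are ignored).
altKey :: AltLhs -> Either Integer String
altKey (LitLhs n)      = Left n
altKey (ConLhs cons _) = Right cons

-- | True when no literal or constructor is matched more than once.
noOverlap :: [AltLhs] -> Bool
noOverlap alts = length (nub keys) == length keys
  where
    keys = map altKey alts
```

For example, `noOverlap [ConLhs "Nil" [], ConLhs "Cons" ["x", "xs"]]` holds, whereas two alternatives for `Cons` with differently named pattern variables are rejected.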

### Variable application

\[
\text{APPLY-EXP} \quad \frac{\begin{array}{c}
TE \vdash \text{var}_{\text{fun}} : \tau_1 \rightarrow \cdots \rightarrow \tau_n \rightarrow \tau_{\text{result}} \\
TE \vdash \text{atom}_i : \tau_i \quad (1 \leq i \leq n)
\end{array}}{TE \vdash \text{var}_{\text{fun}} \; \text{atom}_1 \ldots \text{atom}_n : \tau_{\text{result}}}
\]

### Constructor application

\[
\text{CONS-EXP} \quad \frac{\begin{array}{c}
(\text{cons}, (n, \sigma)) \in CE \\
\vdash_{\text{spec}} \sigma : \tau_1 \rightarrow \cdots \rightarrow \tau_n \rightarrow \chi \, \pi_1 \ldots \pi_v \\
TE \vdash \text{atom}_i : \tau_i \quad (1 \leq i \leq n)
\end{array}}{TE \vdash \text{cons} \; \text{atom}_1 \ldots \text{atom}_n : \chi \, \pi_1 \ldots \pi_v}
\]

### Primitive functions

\[
\text{PRIM-EXP} \quad \frac{\begin{array}{c}
(\text{primitive}, (n, \sigma)) \in PE \\
\vdash_{\text{spec}} \sigma : \tau_1 \rightarrow \cdots \rightarrow \tau_n \rightarrow \tau_{\text{result}} \\
TE \vdash \text{atom}_i : \tau_i \quad (1 \leq i \leq n)
\end{array}}{TE \vdash \text{primitive} \; \text{atom}_1 \ldots \text{atom}_n : \tau_{\text{result}}}
\]

### Literal values

\[
\text{LIT-EXP} \quad \frac{TE \vdash_{\text{literal}} \text{literal} : \nu}{TE \vdash_{\text{exp}} \text{literal} : \nu}
\]
D.6 case alternatives

General alternatives

\[
\boxed{TE \vdash \text{alts} : \tau \rightarrow \tau}
\]

\[
\text{LIT-ALTS} \quad \frac{TE \vdash \text{lalt}_i : \nu \rightarrow \tau \quad (1 \leq i \leq n)}{TE \vdash \text{lalt}_1 \ldots \text{lalt}_n : \nu \rightarrow \tau}
\]

\[
\frac{TE \vdash \text{aalt}_i : \chi \, \pi_1 \ldots \pi_v \rightarrow \tau \quad (1 \leq i \leq n)}{TE \vdash \text{aalt}_1 \ldots \text{aalt}_n : \chi \, \pi_1 \ldots \pi_v \rightarrow \tau}
\]

Literal alternatives

\[
\boxed{TE \vdash \text{lalt} : \tau \rightarrow \tau}
\]

\[
\text{LIT-ALT} \quad \frac{\begin{array}{c}
TE \vdash \text{literal} : \nu \\
TE \vdash \text{exp} : \tau_{\text{exp}}
\end{array}}{TE \vdash \text{literal} \rightarrow \text{exp} : \nu \rightarrow \tau_{\text{exp}}}
\]

Algebraic alternatives

\[
\boxed{TE \vdash \text{aalt} : \tau \rightarrow \tau}
\]

\[
\text{ALG-ALT} \quad \frac{\begin{array}{c}
TE \vdash \text{pattern} : (\chi \, \pi_1 \ldots \pi_v, LVE) \\
TE \circ LVE \vdash \text{exp} : \tau_{\text{exp}}
\end{array}}{TE \vdash \text{pattern} \rightarrow \text{exp} : \chi \, \pi_1 \ldots \pi_v \rightarrow \tau_{\text{exp}}}
\]

Constructor patterns

\[
\boxed{TE \vdash \text{cons} \; \text{vars} : (\chi \, \pi_1 \ldots \pi_v, LVE)}
\]

\[
\text{PATTERN} \quad \frac{\begin{array}{c}
(\text{cons}, (n, \sigma)) \in CE \\
\vdash_{\text{spec}} \sigma : \tau_1 \rightarrow \cdots \rightarrow \tau_n \rightarrow \chi \, \pi_1 \ldots \pi_v \\
LVE = \bigoplus_{i \leq n} \{ \text{var}_i \mapsto \tau_i \}
\end{array}}{TE \vdash \text{cons} \; \text{var}_1 \ldots \text{var}_n : (\chi \, \pi_1 \ldots \pi_v, LVE)}
\]
Default expressions

\[
\boxed{TE \vdash \text{default} : \tau \rightarrow \tau}
\]

\[
\text{DEFAULT} \quad \frac{TE \vdash \text{exp} : \tau_{\text{exp}}}{TE \vdash \text{default} \rightarrow \text{exp} : \tau \rightarrow \tau_{\text{exp}}}
\]

D.7 Atoms, variables and literals

Atoms

\[
\boxed{TE \vdash \text{atom} : \tau}
\]

\[
\text{VAR-ATOM} \quad \frac{TE \vdash_{\text{var}} \text{var} : \tau}{TE \vdash_{\text{atom}} \text{var} : \tau}
\qquad
\text{LIT-ATOM} \quad \frac{TE \vdash_{\text{literal}} \text{literal} : \nu}{TE \vdash_{\text{atom}} \text{literal} : \nu}
\]

Variables

\[
\boxed{TE \vdash \text{var} : \tau}
\]

\[
\text{LOCAL-VAR} \quad \frac{(\text{var}, \tau) \in LVE}{TE \vdash \text{var} : \tau}
\qquad
\text{GENERAL-VAR} \quad \frac{(\text{var}, \sigma) \in GVE \qquad \vdash_{\text{spec}} \sigma : \tau}{TE \vdash \text{var} : \tau}
\]

Literal values

\[
\boxed{TE \vdash \text{literal} : \nu}
\]

\[
\begin{array}{ll}
\text{INT-LIT} & \dfrac{}{TE \vdash \text{int} : \texttt{Int\#}} \\[2ex]
\text{FLOAT-LIT} & \dfrac{}{TE \vdash \text{float} : \texttt{Float\#}} \\[2ex]
\text{CHAR-LIT} & \dfrac{}{TE \vdash \text{char} : \texttt{Char\#}} \\[2ex]
\text{STRING-LIT} & \dfrac{}{TE \vdash \text{string} : \texttt{String\#}} \\[2ex]
\text{MACH-LIT} & \dfrac{}{TE \vdash \text{mach} : \texttt{Mach\#}} \\[2ex]
\text{ADDR-LIT} & \dfrac{}{TE \vdash \text{addr} : \texttt{Addr\#}}
\end{array}
\]
D.8 Generalisation and specialisation

\[
\boxed{\vdash_{\text{spec}} \sigma : \tau}
\]

\[
\text{SPEC} \quad \frac{\tau \leq \sigma}{\vdash_{\text{spec}} \sigma : \tau}
\]

\[
\boxed{TE \vdash_{\text{gen}} \tau : \sigma}
\]

\[
\text{GEN} \quad \frac{\forall \alpha_i \bullet \alpha_i \in (\mathcal{FV}_{\text{monotype}}[\tau] \setminus \mathcal{FV}_{\text{type-env}}[TE]) \quad (1 \leq i \leq n)}{TE \vdash_{\text{gen}} \tau : \forall \alpha_1 \ldots \alpha_n . \tau}
\]
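Generalisation and specialisation can be sketched in Haskell as follows. The `Type` and `Scheme` data types are simplified stand-ins for the monotypes and polytypes of the language, the environment is reduced to an association list, and `specialise` takes an explicit substitution rather than inventing fresh type variables; all of these are assumptions made for illustration.

```haskell
import Data.List (nub, (\\))

-- | A simplified monotype.
data Type
  = TVar String         -- type variable
  | TFun Type Type      -- function type
  | TCon String [Type]  -- type constructor applied to arguments
  deriving (Eq, Show)

-- | A polytype: the quantified variables and the underlying monotype.
data Scheme = Forall [String] Type
  deriving (Eq, Show)

-- | Free type variables of a monotype (without duplicates).
ftvType :: Type -> [String]
ftvType (TVar a)    = [a]
ftvType (TFun a b)  = nub (ftvType a ++ ftvType b)
ftvType (TCon _ ts) = nub (concatMap ftvType ts)

-- | Free type variables of a polytype: quantified variables are bound.
ftvScheme :: Scheme -> [String]
ftvScheme (Forall as t) = ftvType t \\ as

-- | GEN: quantify the variables free in the type but not in the environment.
generalise :: [(String, Scheme)] -> Type -> Scheme
generalise env t = Forall (ftvType t \\ envVars) t
  where
    envVars = nub (concatMap (ftvScheme . snd) env)

-- | SPEC: replace quantified variables using the supplied substitution.
specialise :: [(String, Type)] -> Scheme -> Type
specialise sub (Forall as t) = go t
  where
    go (TVar a)
      | a `elem` as = maybe (TVar a) id (lookup a sub)
      | otherwise   = TVar a
    go (TFun a b)  = TFun (go a) (go b)
    go (TCon c ts) = TCon c (map go ts)
```

With an environment in which only `b` occurs free, generalising `a -> b` quantifies `a` alone, so a later specialisation may replace `a` but must leave `b` untouched.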
Appendix E

Free variables of the STG language

This chapter presents the free-variable algorithm discussed in section 4.5.4, with the order of presentation closely following that of the abstract syntax (see figure 4.1).

E.1 Programs

\[ \mathcal{FV}_{\text{program}}[\text{program}] : \text{program} \rightarrow \{ \text{var} \} \]

\[
\mathcal{FV}_{\text{program}} \left[ \begin{array}{l} \text{var}_1 = \text{lambda}_1 \\ \vdots \\ \text{var}_n = \text{lambda}_n \end{array} \right]
\begin{array}{ll}
= \{ \} & \text{(definition)} \\
= \bigcup_{1 \leq i \leq n} \mathcal{FV}_{\text{lambda}}[\text{lambda}_i] \; \{ \text{var}_1, \ldots, \text{var}_n \} & \text{(derived)}
\end{array}
\]

E.2 Algebraic data types

Constructor declarations

\[ \mathcal{FV}_{\text{condecls}}[\text{condecls}] : \text{condecls} \rightarrow \{ \alpha \} \]

\[ \mathcal{FV}_{\text{condecls}}[\text{condecl}_1 \ldots \text{condecl}_n] = \bigcup_{i \leq n} \mathcal{FV}_{\text{condecl}}[\text{condecl}_i] \]

Individual constructor declarations

\[ \mathcal{FV}_{\text{condecl}}[\text{condecl}] : \text{condecl} \rightarrow \{ \alpha \} \]

\[ \mathcal{FV}_{\text{condecl}}[\text{cons} \, \tau_1 \ldots \tau_f] = \bigcup_{i \leq f} \mathcal{FV}_{\text{monotype}}[\tau_i] \]

Monotypes

\[ \mathcal{FV}_{\text{monotype}}[\tau] : \tau \rightarrow \{ \alpha \} \]

\[ \mathcal{FV}_{\text{monotype}}[\pi] = \mathcal{FV}_{\text{boxedtype}}[\pi] \]
\[ \mathcal{FV}_{\text{monotype}}[\nu] = \{ \} \]
Boxed types

\[ \mathcal{FV}_{\text{boxedtype}}[\pi] : \pi \rightarrow \{\alpha\} \]

\[ \mathcal{FV}_{\text{boxedtype}}[\alpha] = \{\alpha\} \]
\[ \mathcal{FV}_{\text{boxedtype}}[\tau_1 \rightarrow \tau_2] = \mathcal{FV}_{\text{monotype}}[\tau_1] \cup \mathcal{FV}_{\text{monotype}}[\tau_2] \]
\[ \mathcal{FV}_{\text{boxedtype}}[\chi \, \pi_1 \ldots \pi_n] = \bigcup_{1 \leq i \leq n} \mathcal{FV}_{\text{boxedtype}}[\pi_i] \]
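As a small worked example, consider a function type whose result is a type constructor \( \chi \) applied to two type variables (the particular type is chosen purely for illustration). Unfolding the equations one step at a time gives:

\[
\begin{aligned}
\mathcal{FV}_{\text{boxedtype}}[\alpha \rightarrow \chi \, \beta \, \alpha]
 &= \mathcal{FV}_{\text{monotype}}[\alpha] \cup \mathcal{FV}_{\text{monotype}}[\chi \, \beta \, \alpha] \\
 &= \mathcal{FV}_{\text{boxedtype}}[\alpha] \cup \mathcal{FV}_{\text{boxedtype}}[\chi \, \beta \, \alpha] \\
 &= \{\alpha\} \cup (\mathcal{FV}_{\text{boxedtype}}[\beta] \cup \mathcal{FV}_{\text{boxedtype}}[\alpha]) \\
 &= \{\alpha\} \cup (\{\beta\} \cup \{\alpha\}) \\
 &= \{\alpha, \beta\}
\end{aligned}
\]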

E.3 Lambda forms

\[ \mathcal{FV}_{\text{lambda}}[\text{lambda\_form}] : \text{lambda\_form} \rightarrow \{\text{var}\} \rightarrow \{\text{var}\} \]

\[
\begin{array}{lll}
\mathcal{FV}_{\text{lambda}}[\text{vars}_{\text{free}} \; \pi \; \text{vars}_{\text{args}} \rightarrow \text{exp}] \; g
 & = \{\text{var}_{\text{free}_1}, \ldots, \text{var}_{\text{free}_m}\} & \text{(definition)} \\
 & = \mathcal{FV}_{\text{exp}}[\text{exp}] \; g' \setminus \text{vars}_{\text{args}} & \text{(derived)}
\end{array}
\]

where

\[ g' = g \setminus \text{vars}_{\text{args}} \]

and

\[ \text{vars}_{\text{args}} = \{\text{var}_{\text{arg}_1}, \ldots, \text{var}_{\text{arg}_n}\} \]

E.4 Expressions

\[ \mathcal{FV}_{\exp}[\exp] : \exp \to \{\text{var}\} \to \{\text{var}\} \]

The let expression

\[
\mathcal{FV}_{\text{exp}} \left[ \begin{array}{l}
\text{let} \\
\quad \text{var}_1 = \text{lambda}_1 \\
\quad \vdots \\
\quad \text{var}_n = \text{lambda}_n \\
\text{exp}
\end{array} \right] g = (\text{free}_{\text{exp}} \setminus \text{vars}_{\text{bound}}) \cup \text{free}_{\text{lambdas}}
\]

where

\[
\begin{array}{lcl}
\text{free}_{\text{exp}} & = & \mathcal{FV}_{\text{exp}}[\text{exp}] \; g' \\
\text{free}_{\text{lambdas}} & = & \bigcup_{1 \leq i \leq n} \mathcal{FV}_{\text{lambda}}[\text{lambda}_i] \; g' \\
g' & = & g \setminus \text{vars}_{\text{bound}} \\
\text{vars}_{\text{bound}} & = & \{\text{var}_1, \ldots, \text{var}_n\}
\end{array}
\]

The letrec expression

\[
\mathcal{FV}_{\text{exp}} \left[ \begin{array}{l}
\text{letrec} \\
\quad \text{var}_1 = \text{lambda}_1 \\
\quad \vdots \\
\quad \text{var}_n = \text{lambda}_n \\
\text{exp}
\end{array} \right] g = (\text{free}_{\text{exp}} \cup \text{free}_{\text{lambdas}}) \setminus \text{vars}_{\text{bound}}
\]

where

\[
\begin{array}{lcl}
\text{free}_{\text{exp}} & = & \mathcal{FV}_{\text{exp}}[\text{exp}] \; g' \\
\text{free}_{\text{lambdas}} & = & \bigcup_{1 \leq i \leq n} \mathcal{FV}_{\text{lambda}}[\text{lambda}_i] \; g' \\
g' & = & g \setminus \text{vars}_{\text{bound}} \\
\text{vars}_{\text{bound}} & = & \{\text{var}_1, \ldots, \text{var}_n\}
\end{array}
\]
The let# expression

\[
\mathcal{FV}_{\text{exp}}[\text{let\#} \; \text{var} = \text{exp}_{\text{rhs}} \; \text{exp}_{\text{body}}] \; g =
\mathcal{FV}_{\text{exp}}[\text{exp}_{\text{rhs}}] \; g \cup (\mathcal{FV}_{\text{exp}}[\text{exp}_{\text{body}}] \; g' \setminus \{\text{var}\})
\]

where \( g' = g \setminus \{\text{var}\} \)

The letstrict expression

\[
\mathcal{FV}_{\text{exp}}[\text{letstrict} \; \text{var} = \text{exp}_{\text{rhs}} \; \text{exp}_{\text{body}}] \; g =
\mathcal{FV}_{\text{exp}}[\text{exp}_{\text{rhs}}] \; g \cup (\mathcal{FV}_{\text{exp}}[\text{exp}_{\text{body}}] \; g' \setminus \{\text{var}\})
\]

where \( g' = g \setminus \{\text{var}\} \)

The case expression

\[
\mathcal{FV}_{\text{exp}}[\text{case} \; \text{exp} \; \text{of} \; \text{alts} \; \text{default}] \; g =
\mathcal{FV}_{\text{exp}}[\text{exp}] \; g \cup \mathcal{FV}_{\text{alts}}[\text{alts}] \; g \cup \mathcal{FV}_{\text{default}}[\text{default}] \; g
\]

Variable application

\[
\mathcal{FV}_{\text{exp}}[\text{var}_{\text{fun}} \; \text{atoms}] \; g = \mathcal{FV}_{\text{atoms}}[\text{atoms}] \; g \cup \mathcal{FV}_{\text{var}}[\text{var}_{\text{fun}}] \; g
\]

Constructor application

\[
\mathcal{FV}_{\text{exp}}[\text{cons} \; \text{atoms}] \; g = \mathcal{FV}_{\text{atoms}}[\text{atoms}] \; g
\]

Primitive functions

\[
\mathcal{FV}_{\text{exp}}[\text{primitive} \; \text{atoms}] \; g = \mathcal{FV}_{\text{atoms}}[\text{atoms}] \; g
\]

Literal values

\[
\mathcal{FV}_{\text{exp}}[\text{literal}] \; g = \{\}
\]

E.5 case alternatives

General alternatives

\[
\mathcal{FV}_{\text{alts}}[\text{alts}] : \text{alts} \rightarrow \{\text{var}\} \rightarrow \{\text{var}\}
\]

\[
\mathcal{FV}_{\text{alts}}[\text{lalt}_1 \ldots \text{lalt}_n] \; g = \bigcup_{1 \leq i \leq n} \mathcal{FV}_{\text{lalt}}[\text{lalt}_i] \; g
\]

\[
\mathcal{FV}_{\text{alts}}[\text{aalt}_1 \ldots \text{aalt}_n] \; g = \bigcup_{1 \leq i \leq n} \mathcal{FV}_{\text{aalt}}[\text{aalt}_i] \; g
\]

Literal alternatives

\[
\mathcal{FV}_{\text{lalt}}[\text{lalt}] : \text{lalt} \rightarrow \{\text{var}\} \rightarrow \{\text{var}\}
\]

\[
\mathcal{FV}_{\text{lalt}}[\text{literal} \rightarrow \text{exp}] \; g = \mathcal{FV}_{\text{exp}}[\text{exp}] \; g
\]
Algebraic alternatives

\[ \mathcal{FV}_{\text{aalt}}[\text{aalt}] : \text{aalt} \rightarrow \{\text{var}\} \rightarrow \{\text{var}\} \]

\[ \mathcal{FV}_{\text{aalt}}[\text{cons var}_1 \ldots \text{var}_n \rightarrow \text{exp}] \ g = \mathcal{FV}_{\text{exp}}[\text{exp}] \ g' \ \setminus \ \text{vars}_{\text{bound}} \]

where

\[ g' = g \ \setminus \ \text{vars}_{\text{bound}} \]

and

\[ \text{vars}_{\text{bound}} = \{\text{var}_1, \ldots, \text{var}_n\} \]

Default expressions

\[ \mathcal{FV}_{\text{default}}[\text{default}] : \text{default} \rightarrow \{\text{var}\} \rightarrow \{\text{var}\} \]

\[ \mathcal{FV}_{\text{default}}[\rightarrow \text{exp}] \ g = \mathcal{FV}_{\text{exp}}[\text{exp}] \ g \]

E.6 Atoms and variables

Atoms

\[ \mathcal{FV}_{\text{atoms}}[\text{atoms}] : \text{atoms} \rightarrow \{\text{var}\} \rightarrow \{\text{var}\} \]

\[ \mathcal{FV}_{\text{atoms}}[\text{atom}_1 \ldots \text{atom}_n] \ \text{globals} = \bigcup_{i \leq n} \mathcal{FV}_{\text{atom}}[\text{atom}_i] \ \text{globals} \]

Individual atoms

\[ \mathcal{FV}_{\text{atom}}[\text{atom}] : \text{atom} \rightarrow \{\text{var}\} \rightarrow \{\text{var}\} \]

\[ \mathcal{FV}_{\text{atom}}[\text{var}] \ \text{globals} = \mathcal{FV}_{\text{var}}[\text{var}] \ \text{globals} \]

\[ \mathcal{FV}_{\text{atom}}[\text{literal}] \ \text{globals} = \{\} \]

Variables

\[ \mathcal{FV}_{\text{var}}[\text{var}] : \text{var} \rightarrow \{\text{var}\} \rightarrow \{\text{var}\} \]

\[ \mathcal{FV}_{\text{var}}[\text{var}] \ \text{globals} = \{\text{var}\} \ \setminus \ \text{globals} \]
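The equations of this appendix translate almost directly into Haskell. The sketch below covers only a cut-down expression type (let, letrec, variable application and literals); the data types and function names are assumptions made for illustration, and sets are represented with `Data.Set`.

```haskell
import qualified Data.Set as Set
import           Data.Set (Set)

type Var = String

data Atom = AVar Var | ALit Integer

-- | Lambda form: declared free variables, arguments and body.
data Lambda = Lambda [Var] [Var] Expr

data Expr
  = Let    [(Var, Lambda)] Expr
  | LetRec [(Var, Lambda)] Expr
  | App Var [Atom]
  | Lit Integer

-- | A variable is free unless it is a global.
fvVar :: Var -> Set Var -> Set Var
fvVar v globals = Set.singleton v `Set.difference` globals

fvAtom :: Atom -> Set Var -> Set Var
fvAtom (AVar v) globals = fvVar v globals
fvAtom (ALit _) _       = Set.empty

-- | The derived free variables of a lambda form: arguments shadow globals.
fvLambda :: Lambda -> Set Var -> Set Var
fvLambda (Lambda _ args body) g = fvExp body g' `Set.difference` bound
  where
    bound = Set.fromList args
    g'    = g `Set.difference` bound

-- | Pieces shared by the let and letrec equations:
-- (free_exp, free_lambdas, vars_bound).
parts :: [(Var, Lambda)] -> Expr -> Set Var -> (Set Var, Set Var, Set Var)
parts binds body g = (freeExp, freeLambdas, bound)
  where
    bound       = Set.fromList (map fst binds)
    g'          = g `Set.difference` bound
    freeExp     = fvExp body g'
    freeLambdas = Set.unions [ fvLambda lam g' | (_, lam) <- binds ]

fvExp :: Expr -> Set Var -> Set Var
fvExp (Let binds body) g =
  let (e, ls, b) = parts binds body g
  in (e `Set.difference` b) `Set.union` ls    -- bound names visible in body only
fvExp (LetRec binds body) g =
  let (e, ls, b) = parts binds body g
  in (e `Set.union` ls) `Set.difference` b    -- bound names visible everywhere
fvExp (App f atoms) g = Set.unions (fvVar f g : [ fvAtom a g | a <- atoms ])
fvExp (Lit _) _       = Set.empty
```

The only difference between the two binding forms is where the bound variables are removed: a right-hand side that mentions its own name has a free variable under `Let` but none under `LetRec`.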
Appendix F

The RISC target language

F.1 Introduction

This chapter presents the instruction set used by the RISC-processor model outlined in chapter 7. Based on the Alpha instruction formats [DEC, 1992, figures 3-1 through 3-6, pages 3-8 to 3-12], the instructions are split into four categories: memory references, branches, operate instructions, and system instructions. The Haskell representation of the instruction set is as follows:

\[
\begin{array}{lcl}
\texttt{data Instruction} & = & \texttt{MemoryInst MemoryOpCode Register Register MemoryOffset} \\
 & | & \texttt{BranchInst BranchOpCode Register Register MemoryOffset} \\
 & | & \texttt{Op1Inst OperateOpCode Register Register Register} \\
 & | & \texttt{Op2Inst OperateOpCode Register Word Register} \\
 & | & \texttt{SysInst SysOpCode Word}
\end{array}
\]

F.2 Operand notation

The notation for the instruction-set operands is described in the following table:

<table>
<thead>
<tr>
<th>notation</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>\( name_{reg} \)</td>
<td>one of the thirty-two general-purpose registers, which will be associated with the given \( name \)</td>
</tr>
<tr>
<td>\( immediate_{x} \)</td>
<td>a signed integer, made up of \( x \) bits</td>
</tr>
<tr>
<td>\( offset_{x} \)</td>
<td>a signed integer, made up of \( x \) bits, used as an address offset</td>
</tr>
<tr>
<td>\( offset_{x-y} \)</td>
<td>a signed integer, made up of \( x \) bits, which will be shifted \( y \) bits to the left and used as an address offset</td>
</tr>
<tr>
<td>\( reg\_imm_{x} \)</td>
<td>either the contents of a general-purpose register or an \( x \)-bit signed integer</td>
</tr>
</table>

F.3 Memory references

\[
\text{data MemoryOpCode} = \text{LD} | \text{LL} | \text{LA} | \text{LAH} | \text{ST} | \text{SC}
\]
<table>
<thead>
<tr>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>LD</strong></td>
<td>\( load \;\, offset_{16}(base_{reg}) \;\, target_{reg} \)<br>loads a word from the (word-aligned) address formed by adding the 16-bit signed offset to the contents of the base register. The value is then stored in the target register.</td>
</tr>
<tr>
<td><strong>LL</strong></td>
<td>\( load_{linked} \;\, offset_{16}(base_{reg}) \;\, target_{reg} \)<br>in addition to performing a regular load (LD), the instruction indicates the start of a semaphore action. If the address is accessed between the execution of this instruction and the matching conditional store (SC), the conditional store will fail.</td>
</tr>
<tr>
<td><strong>LA</strong></td>
<td>\( load_{address} \;\, offset_{16}(base_{reg}) \;\, target_{reg} \)<br>this instruction does not access memory; it simply loads the target register with the sum of the offset and the base register.</td>
</tr>
<tr>
<td><strong>LH</strong></td>
<td>\( load_{high} \;\, offset_{16}(base_{reg}) \;\, target_{reg} \)<br>similar to the load address instruction, except that the offset is first (arithmetically) shifted sixteen bits to the left.</td>
</tr>
<tr>
<td><strong>ST</strong></td>
<td>\( store \;\, offset_{16}(base_{reg}) \;\, value_{reg} \)<br>stores the word contained in the value register at the address formed by adding the 16-bit signed offset to the contents of the base register.</td>
</tr>
<tr>
<td><strong>SC</strong></td>
<td>\( store_{linked} \;\, offset_{16}(base_{reg}) \;\, value_{reg} \)<br>this is the second instruction of a semaphore pair. If the memory location has not been accessed since the linked load, the word held in the value register will be written to the memory address, and the value register will be set to one. If, however, the address has been accessed, no store will take place, and the value register will be set to zero.</td>
</tr>
</tbody>
</table>
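The interaction between the linked load and the conditional store can be sketched with a very small machine model. Everything here is an assumption made for illustration: the `Machine` record, the single reservation slot, and the use of an ordinary store to stand in for "any other access" are not taken from the processor model of chapter 7.

```haskell
import qualified Data.Map as Map
import           Data.Map (Map)
import           Data.Word (Word32)

type Address = Word32

-- | Hypothetical machine state: word memory plus the address (if any)
-- currently reserved by a linked load.
data Machine = Machine
  { memory   :: Map Address Word32
  , reserved :: Maybe Address
  } deriving (Eq, Show)

-- | LL: an ordinary load that also records a reservation on the address.
loadLinked :: Address -> Machine -> (Word32, Machine)
loadLinked addr m =
  (Map.findWithDefault 0 addr (memory m), m { reserved = Just addr })

-- | An intervening access, modelled as a plain store: it clears a
-- reservation held on the same address.
store :: Address -> Word32 -> Machine -> Machine
store addr w m = m { memory = Map.insert addr w (memory m), reserved = reserved' }
  where
    reserved'
      | reserved m == Just addr = Nothing
      | otherwise               = reserved m

-- | SC: returns the new contents of the value register (one on success,
-- zero on failure) together with the updated machine.
storeConditional :: Address -> Word32 -> Machine -> (Word32, Machine)
storeConditional addr w m
  | reserved m == Just addr =
      (1, m { memory = Map.insert addr w (memory m), reserved = Nothing })
  | otherwise = (0, m { reserved = Nothing })
```

A semaphore update is then a loop: perform `loadLinked`, compute the new value, attempt `storeConditional`, and retry whenever the returned flag is zero.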

### F.4 Branch instructions

```haskell
data BranchOpCode = JMP | JSR | BR | BSR | CBR RISCCondition
```

The \( condition_x \) functions (RISCCondition) are described in section F.7.

#### Unconditional branches

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>JMP</strong></td>
<td>\( jump \;\, offset_{16}(base_{reg}) \;\, target_{reg} \)<br>the 16-bit signed offset is first shifted two places to the left, then added to the base register to form the target address (which must be word aligned). The PC is set to this new address.</td>
</tr>
<tr>
<td><strong>JSR</strong></td>
<td>\( jump_{link} \;\, offset_{16}(base_{reg}) \;\, link_{reg} \)<br>in addition to performing a regular jump (JMP), the link register is loaded with the value of the current PC, allowing a subroutine to return control to the caller.</td>
</tr>
<tr>
<td><strong>BR</strong></td>
<td>\( branch \;\, offset_{21}(base_{reg}) \;\, target_{reg} \)<br>similar to a jump, but the (larger) offset is added to the current PC to form the target address.</td>
</tr>
<tr>
<td><strong>BSR</strong></td>
<td>\( branch_{link} \;\, offset_{21}(base_{reg}) \;\, link_{reg} \)<br>loads the link register with the value of the current PC before branching.</td>
</tr>
</tbody>
</table>
### Conditional branches

<table>
<thead>
<tr>
<th>instruction</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>BEQ</code></td>
<td><code>branch_{x=0}</code></td>
</tr>
<tr>
<td><code>BNE</code></td>
<td><code>branch_{x≠0}</code></td>
</tr>
<tr>
<td><code>BLT</code></td>
<td><code>branch_{x&lt;0}</code></td>
</tr>
<tr>
<td><code>BLE</code></td>
<td><code>branch_{x≤0}</code></td>
</tr>
<tr>
<td><code>BGT</code></td>
<td><code>branch_{x&gt;0}</code></td>
</tr>
<tr>
<td><code>BGE</code></td>
<td><code>branch_{x≥0}</code></td>
</tr>
<tr>
<td><code>BLBC</code></td>
<td><code>branch_{bit0_clear}</code></td>
</tr>
<tr>
<td><code>BLBS</code></td>
<td><code>branch_{bit0_set}</code></td>
</tr>
</tbody>
</table>
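The branch conditions and the jump-target calculation can be sketched as follows. The `Cond` type is a hypothetical stand-in for `RISCCondition` (whose actual constructors are given in section F.7), and 32-bit registers are assumed.

```haskell
import Data.Bits (shiftL, testBit)
import Data.Int  (Int32)

-- | Hypothetical stand-in for RISCCondition: one constructor per
-- conditional branch in the table.
data Cond
  = CondEQ | CondNE | CondLT | CondLE | CondGT | CondGE | CondLBC | CondLBS
  deriving (Eq, Show)

-- | Does the condition hold for register value x?
holds :: Cond -> Int32 -> Bool
holds CondEQ  x = x == 0
holds CondNE  x = x /= 0
holds CondLT  x = x <  0
holds CondLE  x = x <= 0
holds CondGT  x = x >  0
holds CondGE  x = x >= 0
holds CondLBC x = not (testBit x 0)  -- low bit clear
holds CondLBS x = testBit x 0        -- low bit set

-- | Target of JMP/JSR: the signed offset is shifted two places to the
-- left and added to the contents of the base register.
jumpTarget :: Int32 -> Int32 -> Int32
jumpTarget offset base = base + (offset `shiftL` 2)
```

For instance, an offset of 3 and a base register holding 100 gives a target of 112, which is word aligned whenever the base is.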

### F.5 Operate instructions

The \( condition_x \) functions (RISCCondition) are described in section F.7.

#### Arithmetic operations

<table>
<thead>
<tr>
<th>instruction</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>ADD</code></td>
<td>\( add \;\, value_{reg} \;\, reg\_imm \;\, target_{reg} \)<br>signed addition of the first two arguments, the result of which is stored in the target register</td>
</tr>
<tr>
<td><code>ADDT</code></td>
<td>\( add_{trap} \;\, value_{reg} \;\, reg\_imm \;\, target_{reg} \)<br>as for <code>ADD</code>, but an overflow or underflow will generate an exception (which must be explicitly trapped with a \( barrier_{trap} \) instruction, <code>TRAPB</code>)</td>
</tr>
<tr>
<td><code>S2ADD</code></td>
<td>\( add_{shift\,2} \;\, value_{reg} \;\, reg\_imm \;\, target_{reg} \)<br>before performing the addition, the second argument is shifted left by two bits</td>
</tr>
<tr>
<td><code>SUB</code></td>
<td>\( subtract \;\, value_{reg} \;\, reg\_imm \;\, target_{reg} \)<br>signed subtraction of the second argument from the first, the result of which is stored in the target register</td>
</tr>
<tr>
<td><code>SUBT</code></td>
<td>\( subtract_{trap} \;\, value_{reg} \;\, reg\_imm \;\, target_{reg} \)<br>as for <code>SUB</code>, but can generate an exception (see <code>ADDT</code> for further details)</td>
</tr>
<tr>
<td><code>S2SUB</code></td>
<td>\( subtract_{shift\,2} \;\, value_{reg} \;\, reg\_imm \;\, target_{reg} \)<br>before performing the subtraction, the second argument is shifted left by two bits</td>
</tr>
<tr>
<td><code>MUL</code></td>
<td>\( multiply \;\, value_{reg} \;\, reg\_imm \;\, target_{reg} \)<br>signed multiplication of the two arguments (no exception generation)</td>
</tr>
<tr>
<td><code>DIV</code></td>
<td>\( divide \;\, value_{reg} \;\, reg\_imm \;\, target_{reg} \)<br>signed division of the two arguments (exception generated when dividing by zero)</td>
</tr>
</tbody>
</table>
Move instructions

<table>
<thead>
<tr>
<th>instruction</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMOVEQ</td>
<td>\( move_{x=0} \;\, x_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>if \( x \) is zero then set the target to the value of the second argument</td>
</tr>
<tr>
<td>CMOVNE</td>
<td>\( move_{x \neq 0} \;\, x_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>perform the move if \( x \) is not zero</td>
</tr>
<tr>
<td>CMOVLE</td>
<td>\( move_{x \leq 0} \;\, x_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>perform the move if \( x \) is less than or equal to zero</td>
</tr>
<tr>
<td>CMOVGT</td>
<td>\( move_{x > 0} \;\, x_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>perform the move if \( x \) is greater than zero</td>
</tr>
<tr>
<td>CMOVGE</td>
<td>\( move_{x \geq 0} \;\, x_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>perform the move if \( x \) is greater than or equal to zero</td>
</tr>
<tr>
<td>CMOVLBC</td>
<td>\( move_{bit\_clear} \;\, x_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>perform the move if the low bit of \( x \) is zero</td>
</tr>
<tr>
<td>CMOVLBS</td>
<td>\( move_{bit\_set} \;\, x_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>perform the move if the low bit of \( x \) is one</td>
</tr>
</tbody>
</table>

An unconditional move from \( x_{reg} \) to \( y_{reg} \) is effected by the instruction \( move_{x=0} \;\, zero_{reg} \;\, x_{reg} \;\, y_{reg} \). The \( condition_x \) functions (RISCCondition) are described in section F.7.

Logical instructions

<table>
<thead>
<tr>
<th>instruction</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>AND</td>
<td>\( and \;\, value_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>perform a bit-wise logical and of the two arguments and store the result in the target</td>
</tr>
<tr>
<td>BIS</td>
<td>\( or \;\, value_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>as for \( and \), but use the bit-wise logical or operation</td>
</tr>
<tr>
<td>XOR</td>
<td>\( xor \;\, value_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>as for \( and \), but use the bit-wise logical xor operation</td>
</tr>
<tr>
<td>BIC</td>
<td>\( and\_not \;\, value_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>complement the second argument before performing the \( and \) operation</td>
</tr>
<tr>
<td>ORNOT</td>
<td>\( or\_not \;\, value_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>complement the second argument before performing the \( or \) operation</td>
</tr>
<tr>
<td>EQV</td>
<td>\( xor\_not \;\, value_{reg} \;\, reg\_imm_{8} \;\, target_{reg} \)<br>complement the second argument before performing the \( xor \) operation</td>
</tr>
</tbody>
</table>

Comparisons

<table>
<thead>
<tr>
<th>instruction</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMPEQ</td>
<td>compare_z=y , value_reg , reg_imm8 , target_reg \textit{if the two values are equal then set the target register to one, otherwise set it to zero}</td>
</tr>
<tr>
<td>CMPLT</td>
<td>compare_z&lt;y , value_reg , reg_imm8 , target_reg \textit{if the first argument is less than the second then set the target register to one, otherwise set it to zero}</td>
</tr>
<tr>
<td>CMPLE</td>
<td>compare_z\leq y , value_reg , reg_imm8 , target_reg \textit{if the first argument is less than or equal to the second then set the target register to one, otherwise set it to zero}</td>
</tr>
</tbody>
</table>

Negation of register \( x\_reg \) is effected by the or\_not \, zero\_reg \, x\_reg instruction.
## Shift instructions

<table>
<thead>
<tr>
<th>instruction</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>SLL $shift_{left}$ $x_{reg}$ $reg_{imm}$ $target_{reg}$</td>
<td>the first argument is shifted left by the number of bits specified by the second argument (up to a maximum of 32 places) and the result stored in the target</td>
</tr>
<tr>
<td>SRL $shift_{right}$ $x_{reg}$ $reg_{imm}$ $target_{reg}$</td>
<td>as for $shift_{left}$, but shift to the right</td>
</tr>
<tr>
<td>SRA $shift_{arithmetic}$ $x_{reg}$ $reg_{imm}$ $target_{reg}$</td>
<td>as for $shift_{right}$, but the sign bit is preserved</td>
</tr>
</tbody>
</table>

## F.6 System instructions

**Haskell**

```haskell
data SysOpCode = CALL_PAL | TRAPB | MB | MBW
```

<table>
<thead>
<tr>
<th>instruction</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>CALL_PAL syscall immediate</td>
<td>cause a system-call exception</td>
</tr>
<tr>
<td>TRAPB barriertrap</td>
<td>if an arithmetic exception is pending, then skip the next instruction</td>
</tr>
<tr>
<td>MB barrierread</td>
<td>wait until all outstanding reads have completed (only applicable in a shared-memory environment)</td>
</tr>
<tr>
<td>MBW barrierwrite</td>
<td>wait until all outstanding writes have completed (only applicable in a shared-memory environment)</td>
</tr>
</tbody>
</table>

## F.7 Condition codes

**Haskell**

```haskell
data RISCCondition = EQ | NE | LT | LE | GT | GE | LBC | LBS
```

<table>
<thead>
<tr>
<th>condition</th>
<th>branch instruction</th>
<th>move instruction</th>
<th>condition $x$</th>
</tr>
</thead>
<tbody>
<tr>
<td>EQ</td>
<td>BEQ</td>
<td>CMOVEQ</td>
<td>$(x = 0)$</td>
</tr>
<tr>
<td>NE</td>
<td>BNE</td>
<td>CMOVNE</td>
<td>$(x \neq 0)$</td>
</tr>
<tr>
<td>LT</td>
<td>BLT</td>
<td>CMOVLT</td>
<td>$(x &lt; 0)$</td>
</tr>
<tr>
<td>LE</td>
<td>BLE</td>
<td>CMOVLE</td>
<td>$(x \leq 0)$</td>
</tr>
<tr>
<td>GT</td>
<td>BGT</td>
<td>CMOVGT</td>
<td>$(x &gt; 0)$</td>
</tr>
<tr>
<td>GE</td>
<td>BGE</td>
<td>CMOVGE</td>
<td>$(x \geq 0)$</td>
</tr>
<tr>
<td>LBC</td>
<td>BLBC</td>
<td>CMOVLBC</td>
<td>$(x \mod 2 = 0)$</td>
</tr>
<tr>
<td>LBS</td>
<td>BLBS</td>
<td>CMOVLBS</td>
<td>$(x \mod 2 \neq 0)$</td>
</tr>
</tbody>
</table>
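
The table above can be read as a function from a condition code to a predicate on a register value. The following is a minimal sketch of that reading, not the thesis's own definition: the name `condition`, the choice of `Int32` for register values, and the derived instances are assumptions made for illustration. The `Prelude` constructors `EQ`, `LT` and `GT` are hidden so that the data type can be declared exactly as above.

```haskell
import Prelude hiding (EQ, GT, LT)
import Data.Bits (testBit)
import Data.Int (Int32)

-- Condition codes, as declared in section F.7.
data RISCCondition = EQ | NE | LT | LE | GT | GE | LBC | LBS
  deriving (Show, Eq)

-- | Evaluate a condition code against a register value
--   (assumed here to be a signed 32-bit integer).
condition :: RISCCondition -> Int32 -> Bool
condition EQ  x = x == 0
condition NE  x = x /= 0
condition LT  x = x < 0
condition LE  x = x <= 0
condition GT  x = x > 0
condition GE  x = x >= 0
condition LBC x = not (testBit x 0)  -- low bit clear: x mod 2 = 0
condition LBS x = testBit x 0        -- low bit set:   x mod 2 /= 0
```

With this reading, `condition EQ` applied to the contents of the zero register always holds, which is why a move guarded by it behaves as an unconditional move.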
Appendix G

State-transition rules for modelling a RISC processor

G.1 Introduction

This chapter presents the state-transition rules needed to complete the RISC-uniprocessor model outlined in chapter 7. The terminology follows that presented in section 4.8.2.

G.2 Decoding instructions

Pending exceptions

\[ \text{Decode } pc \text{ registers memory semaphore (pending, mask, counter, trigger)} \]

such that \( \text{pending} \setminus (\text{mask} \cup \{\text{Overflow}\}) \neq \emptyset \)

\[ \Rightarrow \text{Exception } pc \text{ registers memory semaphore (pending, mask, counter, trigger)} \]

Instruction fetch and decode

\[ \begin{array}{c|l}
1 & \text{Decode } pc \text{ registers memory semaphore exceptions} \\
\hline
& \Rightarrow \text{Execute instruction } pc' \text{ registers memory semaphore exceptions} \\
& \text{where } \text{instruction} = \text{decode memory}(pc) \\
& \phantom{\text{where }} pc' = pc + 32
\end{array} \]

G.3 The post-execution phase

\[ \begin{array}{c|l}
3 & \text{PostExec } pc \text{ registers memory semaphore (pending, mask, counter, trigger)} \\
\hline
& \Rightarrow \text{Decode } pc \text{ registers memory semaphore (pending', mask, counter', trigger)} \\
& \text{where } \text{counter'} = \text{counter} + 32 \\
& \phantom{\text{where }} \text{pending'} = \text{pending} \cup \text{clock\_interrupt} \\
& \phantom{\text{where }} \text{clock\_interrupt} = \text{if } (\text{trigger} = \text{counter}) \text{ then } \{\text{Clock}\} \text{ else } \emptyset
\end{array} \]
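
The exception bookkeeping in the pending-exceptions rule and in rule 3 can be sketched in Haskell as follows. This is an illustrative sketch only: the names `ExceptionKind`, `raisesException` and `postExec` are hypothetical, sets are represented as lists, and the counter is assumed to advance by one step of 32-bit (wrapping) arithmetic, which is one possible reading of the increment in rule 3.

```haskell
import Data.List (union)
import Data.Word (Word32)

-- Exception kinds named in the rules of this appendix.
data ExceptionKind
  = Overflow | Clock | SysCall | UnalignedData | UnalignedInstruction
  deriving (Show, Eq)

-- | (pending, mask, counter, trigger), with sets modelled as lists.
type Exceptions = ([ExceptionKind], [ExceptionKind], Word32, Word32)

-- | Side condition of the pending-exceptions rule:
--   pending \ (mask `union` {Overflow}) is non-empty.
raisesException :: Exceptions -> Bool
raisesException (pending, mask, _, _) =
  any (`notElem` masked) pending
  where masked = mask `union` [Overflow]

-- | Rule 3: advance the counter and raise a clock interrupt when the
--   old counter value equals the trigger.
postExec :: Exceptions -> Exceptions
postExec (pending, mask, counter, trigger) =
  (pending `union` clockInterrupt, mask, counter + 1, trigger)
  where clockInterrupt = if trigger == counter then [Clock] else []
```

Note that a pending `Overflow` never forces the exception state on its own; it is only observed by the arithmetic trap barrier of section G.8.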

G.4 Exceptions

4 | Exception pc registers memory semaphore exceptions

⇒ Exception pc registers memory semaphore exceptions

G.5 Memory references

Unaligned access

5 | Execute load/store offset(base) pc registers memory semaphore (pending, mask, ctr, tr) such that (load/store ∈ {load, loadlinked, store, storelinked}) and (address mod 4 ≠ 0)

⇒ PostExec pc registers memory semaphore (pending', mask, ctr, tr)
where pending' = pending ∪ \{Unaligned_{data}\}
address = offset + 32 registers(base)

Load instruction

6 | Execute load offset(base) target pc registers memory semaphore exceptions

⇒ PostExec pc registers' memory semaphore exceptions
where registers' = registers[target → value]
value = memory(address)
address = offset + 32 registers(base)

Linked loads

7 | Execute loadlinked offset(base) target pc registers memory semaphore exceptions

⇒ PostExec pc registers' memory (address, false) exceptions
where registers' = registers[target → value]
value = memory(address)
address = offset + 32 registers(base)

Store instructions

8 | Execute store source offset(base) pc registers memory (address_{sem}, stale?) exceptions

⇒ PostExec pc registers memory' (address_{sem}, stale?)' exceptions
where memory' = memory[address → registers(source)]
stale?' = if (address = address_{sem}) then true else stale?
address = offset + 32 registers(base)

Conditional store instruction

Clean address

9 | Execute storelink source offset(base) pc registers memory (address_{sem}, stale?) except.
such that (address = address_{sem}) and (stale? = false)

⇒ PostExec pc registers' memory' (address_{sem}, true) except.
where memory' = memory[address → registers(source)]
registers' = registers[source → 1]
address = offset + 32 registers(base)
### Dirty or a non-equal address

10 | Execute storelink source offset(base) pc registers memory (address_{sem}, stale?) except.

⇒ PostExec pc registers' memory (address_{sem}, true) except.
where registers' = registers[source → 0]
address = offset + 32 registers(base)
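
Rules 7 to 10 describe a load-linked/store-conditional protocol over the semaphore pair (address_{sem}, stale?). The sketch below restates them in Haskell under some simplifying assumptions: memory is modelled as a function from addresses to words, the effective address is passed in already computed, and the value written to the source register by a conditional store is returned as the first component of the result. All names are hypothetical.

```haskell
import Data.Word (Word32)

type Address   = Word32
type Memory    = Address -> Word32
-- | (address_sem, stale?) as used in rules 7 to 10.
type Semaphore = (Address, Bool)

-- | memory[address -> value]
write :: Address -> Word32 -> Memory -> Memory
write a v mem = \a' -> if a' == a then v else mem a'

-- | Rule 7: read the value and start watching the address.
loadLinked :: Address -> Memory -> Semaphore -> (Word32, Semaphore)
loadLinked a mem _ = (mem a, (a, False))

-- | Rule 8: an ordinary store marks the watched address as stale.
store :: Address -> Word32 -> Memory -> Semaphore -> (Memory, Semaphore)
store a v mem (sem, stale) = (write a v mem, (sem, stale || a == sem))

-- | Rules 9 and 10: the store succeeds (flag 1) only for a clean,
--   matching address; otherwise memory is unchanged (flag 0).
storeLinked :: Address -> Word32 -> Memory -> Semaphore
            -> (Word32, Memory, Semaphore)
storeLinked a v mem (sem, stale)
  | a == sem && not stale = (1, write a v mem, (sem, True))
  | otherwise             = (0, mem, (sem, True))
```

In both outcomes the semaphore is left stale, so a second conditional store without an intervening linked load always fails.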

### Load address

11 | Execute load_{address} offset(base) target pc registers memory semaphore exceptions

⇒ PostExec pc registers' memory semaphore exceptions
where registers' = registers[target → address]
address = offset + 32 registers(base)

### Load address high

12 | Execute load_{address high} offset(base) target pc registers memory semaphore exceptions

⇒ PostExec pc registers' memory semaphore exceptions
where registers' = registers[target → address]
address = offset' + 32 registers(base)
offset' = shift_left offset 16

### G.6 Branch instructions

#### Unaligned computed jumps

13 | Execute jump offset(base) pc registers memory semaphore (pending, mask, ctr, tr) such that (jump ∈ {jump, jump_{link}}) and (address mod 4 ≠ 0)

⇒ PostExec pc registers memory semaphore (pending', mask, ctr, tr)
where pending' = pending ∪ \{Unaligned_{instruction}\}
address = offset + 32 registers(base)

#### Computed jumps

14 | Execute jump offset(base) pc registers memory semaphore exceptions

⇒ PostExec pc' registers memory semaphore exceptions
where pc' = offset + 32 registers(base)

#### Linked computed jumps

15 | Execute jump_{link} offset(base) link pc registers memory semaphore exceptions

⇒ PostExec pc' registers' memory semaphore exceptions
where pc' = offset + 32 registers(base)
registers' = registers[link → pc]
Unconditional branches

<table>
<thead>
<tr>
<th>16</th>
<th>Execute branch offset pc registers memory semaphore exceptions</th>
</tr>
</thead>
<tbody>
<tr>
<td>➞</td>
<td>PostExec pc' registers memory semaphore exceptions</td>
</tr>
<tr>
<td>where</td>
<td>pc' = offset' + 32 pc</td>
</tr>
<tr>
<td></td>
<td>offset' = shift_left offset 2</td>
</tr>
</tbody>
</table>
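
The target computation in rule 16 can be sketched as follows. This assumes that the addition in the rule is 32-bit wrapping addition and that the offset is held as a two's-complement word, so that backward branches fall out of the wrap-around; the name `branchTarget` is hypothetical.

```haskell
import Data.Bits (shiftL)
import Data.Word (Word32)

-- | pc' = offset' + pc, where offset' = shift_left offset 2.
--   The offset counts instructions; shifting by two converts it to bytes.
branchTarget :: Word32 -> Word32 -> Word32
branchTarget pc offset = (offset `shiftL` 2) + pc
```

For example, an offset of 3 from a program counter of 100 gives 112, while an offset of -2 (encoded as a 32-bit word) gives 92.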

Linked branches

<table>
<thead>
<tr>
<th>17</th>
<th>Execute branch link offset link pc registers memory semaphore exceptions</th>
</tr>
</thead>
<tbody>
<tr>
<td>➞</td>
<td>PostExec pc' registers' memory semaphore exceptions</td>
</tr>
<tr>
<td>where</td>
<td>pc' = offset' + 32 pc</td>
</tr>
<tr>
<td></td>
<td>registers' = registers[link -&gt; pc]</td>
</tr>
<tr>
<td></td>
<td>offset' = shift_left offset 2</td>
</tr>
</tbody>
</table>

Conditional branches

Condition satisfied

<table>
<thead>
<tr>
<th>18</th>
<th>Execute branch condition x offset pc registers memory semaphore exceptions</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>such that (condition registers(x) == true) and (address mod 4 == 0)</td>
</tr>
<tr>
<td>➞</td>
<td>PostExec pc' registers memory semaphore exceptions</td>
</tr>
<tr>
<td>where</td>
<td>pc' = offset' + 32 pc</td>
</tr>
<tr>
<td></td>
<td>offset' = shift_left offset 2</td>
</tr>
</tbody>
</table>

Condition not met

<table>
<thead>
<tr>
<th>19</th>
<th>Execute branch condition x offset pc registers memory semaphore exceptions</th>
</tr>
</thead>
<tbody>
<tr>
<td>➞</td>
<td>PostExec pc registers memory semaphore exceptions</td>
</tr>
</tbody>
</table>

G.7 Operate instructions

Trapped addition

No overflow

<table>
<thead>
<tr>
<th>20</th>
<th>Execute addtrap register1 { immediate}_{register2} target pc registers memory semaphore ex</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>such that (sign argument1 ≠ sign argument2) or (sign argument1 = sign result)</td>
</tr>
<tr>
<td>➞</td>
<td>PostExec pc registers' memory semaphore ex</td>
</tr>
<tr>
<td>where</td>
<td>registers' = registers[target -&gt; result]</td>
</tr>
<tr>
<td></td>
<td>result = argument1 + 32 argument2</td>
</tr>
<tr>
<td></td>
<td>argument1 = registers(register1)</td>
</tr>
<tr>
<td></td>
<td>argument2 = { immediate}_{registers(register2)}</td>
</tr>
</tbody>
</table>
**Overflow occurs**

<table>
<thead>
<tr>
<th>21</th>
<th>Execute addtrap pc registers memory semaphore (pending, mask, counter, trigger)</th>
</tr>
</thead>
<tbody>
<tr>
<td>➞</td>
<td>PostExec pc registers' memory semaphore (pending', mask, counter, trigger)</td>
</tr>
<tr>
<td>where</td>
<td>pending' = pending ∪ {Overflow}</td>
</tr>
</tbody>
</table>

**Trapped subtraction**

**No overflow**

<table>
<thead>
<tr>
<th>22</th>
<th>Execute subtracttrap reg_1 {imm | reg_2} target pc registers memory semaphore exceptions</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>such that (sign argument_1 = sign argument_2) or (sign argument_1 = sign result)</td>
</tr>
<tr>
<td>➞</td>
<td>PostExec pc registers' memory semaphore exceptions</td>
</tr>
<tr>
<td>where</td>
<td>registers' = registers[target -&gt; result]</td>
</tr>
<tr>
<td></td>
<td>result = argument_1 - argument_2</td>
</tr>
<tr>
<td></td>
<td>argument_1 = registers(reg_1)</td>
</tr>
<tr>
<td></td>
<td>argument_2 = {imm | registers(reg_2)}</td>
</tr>
</tbody>
</table>

**Overflow occurs**

<table>
<thead>
<tr>
<th>23</th>
<th>Execute subtracttrap pc registers memory semaphore (pending, mask, counter, trigger)</th>
</tr>
</thead>
<tbody>
<tr>
<td>➞</td>
<td>PostExec pc registers' memory semaphore (pending', mask, counter, trigger)</td>
</tr>
<tr>
<td>where</td>
<td>pending' = pending ∪ {Overflow}</td>
</tr>
</tbody>
</table>
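
The side conditions of rules 20 and 22 are the usual sign tests for two's-complement overflow. The sketch below checks them on `Int32`, whose arithmetic wraps modulo 2^32 in GHC; the names `addTrap`, `subtractTrap` and `Trap` are hypothetical, and the overflow case simply reports the trap rather than modelling the pending set.

```haskell
import Data.Int (Int32)

data Trap = Overflow
  deriving (Show, Eq)

-- | The sign of a value: True when the sign bit is set.
negative :: Int32 -> Bool
negative = (< 0)

-- | Rules 20 and 21: addition cannot overflow when the argument signs
--   differ, or when the result keeps the sign of the first argument.
addTrap :: Int32 -> Int32 -> Either Trap Int32
addTrap a b
  | negative a /= negative b || negative a == negative result = Right result
  | otherwise                                                  = Left Overflow
  where result = a + b  -- wraps on overflow

-- | Rules 22 and 23: subtraction cannot overflow when the argument signs
--   agree, or when the result keeps the sign of the first argument.
subtractTrap :: Int32 -> Int32 -> Either Trap Int32
subtractTrap a b
  | negative a == negative b || negative a == negative result = Right result
  | otherwise                                                  = Left Overflow
  where result = a - b  -- wraps on overflow
```

For instance, `addTrap maxBound 1` reports `Left Overflow`, whereas `addTrap (-1) 1` yields `Right 0`.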

**Move instructions**

**Condition satisfied**

<table>
<thead>
<tr>
<th>24</th>
<th>Execute move_{condition} reg_1 {imm | reg_2} target pc registers memory semaphore except.</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>such that (condition argument_1 = true)</td>
</tr>
<tr>
<td>➞</td>
<td>PostExec pc registers' memory semaphore except.</td>
</tr>
<tr>
<td>where</td>
<td>registers' = registers[target -&gt; argument_2]</td>
</tr>
<tr>
<td></td>
<td>argument_1 = registers(reg_1)</td>
</tr>
<tr>
<td></td>
<td>argument_2 = {imm | registers(reg_2)}</td>
</tr>
</tbody>
</table>

**Condition not met**

<table>
<thead>
<tr>
<th>25</th>
<th>Execute move_{condition} reg_1 {imm | reg_2} target pc registers memory semaphore exceptions</th>
</tr>
</thead>
<tbody>
<tr>
<td>➞</td>
<td>PostExec pc registers memory semaphore exceptions</td>
</tr>
</tbody>
</table>
Arithmetic and logical operations, shifts and comparisons

The remaining operations are regular, so we only need to define one reduction rule:

\[
\begin{align*}
&\text{Execute operator register}_1 \{ \text{immediate} \} \text{ target pc registers memory semaphore ex} \\
&\text{such that } (\text{operator} \in \{ \text{add, addshift}_2, \text{subtract, subtractshift}_2, \text{multiply, divide,} \\
&\qquad \text{and, andnot, or, ornot, xor, xornot,} \\
&\qquad \text{shiftright, shiftleft, shiftarithmetic, compare}_{\text{condition}} \}) \\
&\Rightarrow \text{PostExec pc registers' memory semaphore ex} \\
&\text{where } \text{registers'} = \text{registers}[\text{target} \mapsto \text{result}] \\
&\qquad \text{result} = \text{operator argument}_1 \text{ argument}_2 \\
&\qquad \text{argument}_1 = \text{registers}(\text{register}_1) \\
&\qquad \text{argument}_2 = \{ \text{immediate} \}
\end{align*}
\]

G.8 System instructions

System calls

\[
\begin{align*}
\text{Execute syscall arg pc registers memory semaphore (pending, mask, counter, trigger)} \\
\Rightarrow \text{PostExec pc registers memory semaphore (pending', mask, counter, trigger)} \\
\text{where } \text{pending'} = \text{pending} \cup \{ \text{SysCall} \}
\end{align*}
\]

Arithmetic trap

\[
\begin{align*}
&\text{Execute barrier}_{\text{trap}} \text{ pc registers memory semaphore (pending, mask, counter, trigger)} \\
&\Rightarrow \text{PostExec } pc' \text{ registers memory semaphore (pending', mask, counter, trigger)} \\
&\text{where } pc' = \text{if } (\text{Overflow} \in \text{pending}) \text{ then } (pc + 32) \text{ else } pc \\
&\qquad \text{pending'} = \text{pending} \setminus \{ \text{Overflow} \}
\end{align*}
\]

Read barrier

\[
\begin{align*}
&\text{Execute barrier}_{\text{read}} \text{ pc registers memory semaphore exceptions} \\
&\Rightarrow \text{PostExec pc registers memory semaphore exceptions}
\end{align*}
\]

Write barrier

\[
\begin{align*}
&\text{Execute barrier}_{\text{write}} \text{ pc registers memory semaphore exceptions} \\
&\Rightarrow \text{PostExec pc registers memory semaphore exceptions}
\end{align*}
\]
## G.9 Condition codes

<table>
<thead>
<tr>
<th>condition</th>
<th>branch instruction</th>
<th>move instruction</th>
<th>$condition \; x$</th>
</tr>
</thead>
<tbody>
<tr>
<td>EQ</td>
<td>BEQ</td>
<td>CMOVEQ</td>
<td>$(x = 0)$</td>
</tr>
<tr>
<td>NE</td>
<td>BNE</td>
<td>CMOVNE</td>
<td>$(x \neq 0)$</td>
</tr>
<tr>
<td>LT</td>
<td>BLT</td>
<td>CMOVLT</td>
<td>$(x &lt; 0)$</td>
</tr>
<tr>
<td>LE</td>
<td>BLE</td>
<td>CMOVLE</td>
<td>$(x \leq 0)$</td>
</tr>
<tr>
<td>GT</td>
<td>BGT</td>
<td>CMOVGT</td>
<td>$(x &gt; 0)$</td>
</tr>
<tr>
<td>GE</td>
<td>BGE</td>
<td>CMOVGE</td>
<td>$(x \geq 0)$</td>
</tr>
<tr>
<td>LBC</td>
<td>BLBC</td>
<td>CMOVLBC</td>
<td>$(x \mod 2) = 0$</td>
</tr>
<tr>
<td>LBS</td>
<td>BLBS</td>
<td>CMOVLBS</td>
<td>$(x \mod 2) \neq 0$</td>
</tr>
</tbody>
</table>
Appendix H

Compilation rules of the STG'-machine language

This chapter presents the state-transition rules used to prototype a modern optimising compiler for functional languages. The notation used is described in section 4.8.2, while the rules themselves are introduced and described in chapter 8.

H.1 The initial state

<table>
<thead>
<tr>
<th>INIT</th>
<th>Expression-code</th>
<th>Continuation</th>
<th>Pending</th>
<th>Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>Code</td>
<td>stack</td>
<td>stack</td>
<td>bindings</td>
<td>blocks</td>
</tr>
</tbody>
</table>

Continue $\langle stack \rangle_{stack} \langle stack \rangle_{stack} pending blocks \sigma$

where

pending $= \{bind_1, \ldots, bind_n\}$
blocks $= \{label_{node_{g_1}} \mapsto \langle label_{info_{g_1}}, 0\rangle, \ldots, label_{node_{g_n}} \mapsto \langle label_{info_{g_n}}, 0\rangle\}$
$\sigma = \{g_1 \mapsto label_{node_{g_1}}, \ldots, g_n \mapsto label_{node_{g_n}}\}$
bind$_i$ $\equiv (g_i = lambda\_form_i)$

H.2 The compiler framework

1. $F_{1a}$ Continue $exps$ $next : conts$ $pending blocks \sigma$

   $\Rightarrow$ $next$ $exps$ $conts$ $pending blocks \sigma$

2. $F_{1b}$ Continue $\langle stack \rangle_{stack} \langle stack \rangle_{stack} pending blocks \sigma$

   such that $bind \in pending$

   $\Rightarrow$ CompileBind $bind$ $\langle stack \rangle_{stack} \langle stack \rangle_{stack}$ pending' blocks $\sigma$

   where pending' $= pending \setminus bind$

3. $F_{1c}$ Continue $\langle stack \rangle_{stack} \langle stack \rangle_{stack} \langle stack \rangle_{stack}$ blocks $\sigma$

   $\Rightarrow$ Finish $\langle stack \rangle_{stack} \langle stack \rangle_{stack} \langle stack \rangle_{stack}$ blocks $\sigma$
\[ \text{CompileBind} \quad \text{bind} \quad \text{exps} \quad \text{conts} \quad \text{pending blocks} \quad \sigma \]

such that \( \vdash \text{var}_{\text{fun}} : \tau_1 \rightarrow \cdots \rightarrow \tau_n \rightarrow \text{Int}, \ (n \geq 1) \)

\[ \Rightarrow \ \text{CEval exp}_{\text{fun}} \ \rho_{\text{init}} \ (\text{return}_{\text{init}}) \text{stack exps} \text{conts' pending blocks'} \ \sigma \]

where

\[ \rho_{\text{init}} = \rho_{\text{args}} \oplus \rho_{\text{frees}} \oplus \{ \text{var}_{\text{return}} \mapsto \text{register}_{24}, \text{var}_{\text{node}} \mapsto \text{register}_{25} \} \]

\[ \rho_{\text{args}} = \{ \text{var}_{\text{arg}1} \mapsto \text{operand}_{\text{arg}1}, \ldots, \text{var}_{\text{arg}n} \mapsto \text{operand}_{\text{arg}n} \} \]

\[ \text{operand}_{\text{arg}i} = \begin{cases} \text{stack}_{1}^{\alpha} & \text{var} \quad \text{var}_{\text{arg}i} : \alpha \\ \text{stack}_{2}^{\alpha} & \text{var}_{\text{arg}i} : \text{Int}\# \end{cases} \]

\[ \rho_{\text{frees}} = \{ \text{var}_{\text{free}1} \mapsto \text{operand}_{\text{free}1}, \ldots, \text{var}_{\text{free}m} \mapsto \text{operand}_{\text{free}m} \} \]

\[ \text{operand}_{\text{free}i} = \begin{cases} \text{memory}_{\text{node}} & \text{var}_{\text{free}i} : \alpha \\ \text{memory}_{\text{node}} & \text{var}_{\text{free}i} : \text{Int}\# \end{cases} \]

\[ \text{return}_{\text{init}} = \{ \text{var}_{\text{return}}, \text{register}_{1} \} \_\text{set} \]

\[ \text{conts'} = (\text{SealEntry bind}) : (\text{ReturnBind bind}) : \text{conts} \]

\[ \text{bind} = (\text{var}_{\text{fun}} = \text{var}_{\text{free}1} \cdots \text{var}_{\text{free}m}, \text{r}_{\text{var}_{\text{arg}1}} \cdots \text{r}_{\text{var}_{\text{arg}n}} \rightarrow \text{exp}_{\text{fun}}) \]

\[ \Rightarrow \ \text{Continue} \quad \text{code'} : \text{exps} \quad \text{conts} \quad \text{pending blocks'} \quad \sigma \]

\[ \Rightarrow \ \text{SealEntry bind} \quad \text{code} : \text{exps} \quad \text{conts} \quad \text{pending blocks'} \quad \sigma \]

where \( \text{code'} = \text{check_args} + \text{check_stacks} + \text{check_heap} + \text{code} \)

\[ \Rightarrow \ \text{ReturnBind} \quad (\text{var} = \text{lambda_form}) \quad \text{code} : \text{exps} \quad \text{conts} \quad \text{pending blocks'} \quad \sigma \]

where \( \text{blocks'} = \text{blocks} \oplus \{ \text{label}_{\text{enter-var}} \mapsto \text{code}, \ldots, \text{label}_{\text{info-var}} \mapsto \text{info}\_\text{table} \} \)

\[ \text{info}\_\text{table} = \{ \text{label}_{\text{enter-var}}, \text{label}_{\text{update-var}}, \ldots, \text{label}_{\text{pc-var}} \} \]

H.3 Applications

\[ \text{CEval} \ (f (\text{atom}_{1}, \ldots, \text{atom}_{n})) \quad \rho \quad \text{code'} \quad \text{returns} \quad \text{exps} \quad \text{conts} \quad \text{pending blocks} \quad \sigma \]

such that \( \vdash f : \alpha \)

\[ \Rightarrow \ \text{CEnter register}_{25} \quad \rho' \quad \text{code'} \quad \text{returns} \quad \text{exps} \quad \text{conts} \quad \text{pending blocks} \quad \sigma \]

where

\[ \text{moves'}, \rho' \quad = \text{combine_moves moves} \quad \rho \quad \sigma \]

\[ \text{code'} \quad = \text{code} + \text{moves'} \]

\[ \text{moves} \quad = \text{load_node} + \text{push_args} + \text{save_volatile_vars} + \text{stub_dead_A_slots} \]

\[ \text{load_node} \quad = \{ \text{move} (\text{val} \ \rho \ \sigma \ f), \text{register}_{25} \} \]

\[ \text{push_args} \quad = \{ \text{push}_{\text{arg}1}, \ldots, \text{push}_{\text{arg}n} \} \]

\[ \text{push}_{\text{arg}i} \quad = \{ \text{move} (\text{atom}_{\text{to-oper}} \text{atom}_{i}), \text{operand}_{\text{arg}i} \} \]

\[ \text{operand}_{\text{arg}i} \quad = \begin{cases} \text{stack}_{1}^{\alpha} & \text{atom} \quad \text{atom}_{i} : \alpha \\ \text{stack}_{2}^{\alpha} & \text{atom} \quad \text{atom}_{i} : \text{Int}\# \end{cases} \]
The auxiliary function, \( \text{atom\_to\_operand} \), is defined below:

\[
\begin{align*}
\text{atom\_to\_operand } \rho \sigma \text{ literal} & = \text{literal} \\
\text{atom\_to\_operand } \rho \sigma \text{ var} & = \text{val } \rho \sigma \text{ var}
\end{align*}
\]
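
A possible Haskell rendering of this auxiliary function is sketched below. The definition of \( \text{val} \) is not given in this appendix, so the version here is an assumption: it looks a variable up in the local environment \( \rho \) first and falls back to the global label map \( \sigma \) (which, following section H.1, maps each global to its node label). The types `Atom`, `Operand`, `LocalEnv` and `GlobalEnv` are hypothetical.

```haskell
-- Hypothetical representations of atoms and machine operands.
data Atom    = Literal Int | Var String
  deriving (Show, Eq)

data Operand = Immediate Int | Register Int | Label String
  deriving (Show, Eq)

type LocalEnv  = [(String, Operand)]  -- rho: local variables to operands
type GlobalEnv = [(String, String)]   -- sigma: global variables to node labels

-- | Assumed behaviour of val: local bindings shadow global ones.
val :: LocalEnv -> GlobalEnv -> String -> Operand
val rho sigma v =
  case lookup v rho of
    Just operand -> operand
    Nothing      ->
      case lookup v sigma of
        Just label -> Label label
        Nothing    -> error ("unbound variable: " ++ v)

-- | Literals become immediates; variables are resolved through val.
atomToOperand :: LocalEnv -> GlobalEnv -> Atom -> Operand
atomToOperand _   _     (Literal k) = Immediate k
atomToOperand rho sigma (Var v)     = val rho sigma v
```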

2A \hspace{1cm} \text{Enter operand } \rho \text{ code returns exps conts pending blocks } \sigma

\[
\Rightarrow \text{ReturnExpression code'} \quad \text{exps conts pending blocks } \sigma
\]

where

- \( \text{code'} = \text{code ++ load\_return ++ jump} \)
- \( \text{load\_return} = \text{move } (\text{val } \rho \, \sigma \, \text{var}_{\text{return}}), \text{register}_{24} \)
- \( \text{jump} = \text{move memory}_{0} \text{ operand}, \text{register}_{\text{tmp}}; \)
- \( \phantom{\text{jump} =} \text{jump}_{\text{link}} \, (\text{register}_{\text{tmp}}, \text{register}_{24}) \)

2B \hspace{1cm} \text{Enter operand } \rho \text{ code returns exps conts pending blocks } \sigma

\[
\Rightarrow \text{ReturnInt register1, } \rho \emptyset \text{ returns exps conts' pending blocks } \sigma
\]

where \( \text{conts'} = (\text{CJoinEnter } \text{(val } \rho \sigma \text{ varnode) code'}) : \text{conts} \)

2C \hspace{1cm} \text{CJoinEnter operand node code\_pre\_entry code\_post\_entry : exps conts pending blocks } \sigma

\[
\Rightarrow \text{ReturnExpression code'} \quad \text{exps conts pending blocks } \sigma
\]

where

- \( \text{code'} = \text{code\_pre\_entry ++ jump ++ code\_post\_entry} \)
- \( \text{jump} = \text{move memory}_{0} \text{ operand\_node}, \text{register}_{\text{tmp}}; \)
- \( \phantom{\text{jump} =} \text{jump}_{\text{link}} \, (\text{register}_{\text{tmp}}, \text{register}_{24}) \)

\[\text{H.4 \hspace{1cm} let(rec) expressions}\]

3 \hspace{1cm} \text{CEval (let bindings in exp) } \rho \text{ code returns exps conts pending blocks } \sigma

\[
\Rightarrow \text{CEval exp } \rho' \text{ code'} \quad \text{returns exps conts pending' blocks } \sigma
\]

where

- \( \rho' = \rho \setminus \text{vars}_{\text{dead}} \)
- \( \text{code'} = \text{code ++ moves} \)
- \( \text{pending'} = \{\text{binding}_1, \ldots, \text{binding}_n\} \cup \text{pending} \)
- \( \text{vars}_{\text{dead}} = FV[\text{bindings}] \setminus FV[\text{exp}] \)

The rule for \( \text{letrec} \) expressions is almost identical, requiring only a minor modification of the \( \text{allocate\_closures} \) rule (see the description of this function for further details).
H.4.1 Variable bindings

\[
\begin{aligned}
\text{allocate\_closures} \left( \begin{array}{c}
\text{var}_1 = \text{lambda}_1 \\
\vdots \\
\text{var}_n = \text{lambda}_n
\end{array} \right) \; \rho \; \sigma &= \text{combine\_moves } \{ \text{moves}_1, \ldots, \text{moves}_n \} \; \rho_{\text{binds}} \; \sigma \\
\text{where} \quad \rho_{\text{binds}} &= \rho \oplus \{ \text{var}_1 \mapsto \text{heap}(\text{offset}_1), \ldots, \text{var}_n \mapsto \text{heap}(\text{offset}_n), \text{heap}_{\text{max}} \mapsto \text{offset}_{n+1} \} \\
\text{moves}_i &= \text{create\_closure var}_i \text{ lambda}_i \text{ offset}_i \; \rho_{\text{rhs}} \; \sigma \\
\rho_{\text{rhs}} &= \rho \\
\text{offset}_1 &= \text{val } \rho \; \sigma \; \text{heap}_{\text{max}} \\
\text{offset}_i &= \text{offset}_1 + \sum_{j<i} \max(\text{closure\_size lambda}_j, \text{closure\_size}_{\text{min}})
\end{aligned}
\]

The rule for recursive bindings is almost identical, except that \(\rho_{\text{rhs}}\) is defined to be \(\rho_{\text{binds}}\) instead of \(\rho\).
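
The offset calculation used by the closure allocation function is a running sum over the padded closure sizes, which a left scan expresses directly. The sketch below is illustrative only: `closureOffsets` is a hypothetical name, and sizes and offsets are assumed to be plain word counts.

```haskell
-- | Given the current heap limit (offset_1), the minimum closure size and
--   the sizes of the n closures being allocated, return the n + 1 offsets
--   offset_1 .. offset_{n+1}; the last one becomes the new heap limit.
closureOffsets :: Int -> Int -> [Int] -> [Int]
closureOffsets heapMax minSize sizes =
  scanl (+) heapMax (map (max minSize) sizes)
```

For example, `closureOffsets 100 2 [3, 1, 5]` gives `[100, 103, 105, 110]`: the second closure is padded from one word to the minimum of two.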

H.4.2 Closure layout

\[
\begin{array}{l}
\text{create\_closure var base } (\text{vars}_{\text{free}} \; \pi \; \text{vars}_{\text{args}} \rightarrow \text{exp}) = \text{move label}_{\text{info\_table var}}, \text{ memory}_{0} \text{ base,} \\
\quad \text{move operand}_1, \text{ memory}_{1} \text{ base,} \\
\quad \vdots \\
\quad \text{move operand}_n, \text{ memory}_{n} \text{ base} \\
\text{where} \quad \text{operand}_i = \text{val } \rho \; \sigma \left\{ \begin{array}{l}
i^{th} \text{ free variable of type } \pi \\
(n-i)^{th} \text{ free variable of type } \nu \quad (1 \leq i \leq \text{length vars}_{\text{free}})
\end{array} \right.
\end{array}
\]

H.5 Case expressions

\[
\begin{array}{ll}
\begin{array}{l}
\text{CEval (case exp of alts default)} \quad \rho \text{ code returns exps conts pending blocks } \sigma \\
\Rightarrow \text{CEval exp} \quad \rho \text{ code returns' exps conts pending blocks } \sigma \\
\text{where} \quad \text{returns'} = (\text{alts, default, vars}{}_{\text{free}}) : \text{returns} \\
\text{vars}{}_{\text{free}} = FV[\text{exp}] \cup FV[\text{alts}] \cup FV[\text{default}] \\
\end{array}
\end{array}
\]

\[
\begin{array}{ll}
\begin{array}{l}
\text{CEval (let\# (var = exp}_{\text{rhs}}) \text{ exp}_{\text{body}}) \quad \rho \text{ code returns exps conts pending blocks } \sigma \\
\Rightarrow \text{CEval exp}_{\text{rhs}} \quad \rho \text{ code returns' exps conts pending blocks } \sigma \\
\text{where} \quad \text{returns'} = (\text{var, exp}_{\text{body}}, \text{vars}{}_{\text{dead}}) : \text{assign : returns} \\
\text{vars}{}_{\text{dead}} = FV[\text{exp}_{\text{rhs}}] \setminus FV[\text{exp}_{\text{body}}] \\
\end{array}
\end{array}
\]

The rule for \textit{letstrict} (rule 4A, see figure 4.12) is almost identical, with just the initial expression requiring modification.

H.6 Built-in operations

\[
\begin{array}{ll}
\begin{array}{l}
\text{CEval (k)} \quad \rho \text{ code returns exps conts pending blocks } \sigma \\
\Rightarrow \text{CReturnInt } k \quad \rho \text{ code returns exps conts pending blocks } \sigma \\
\end{array}
\end{array}
\]
10 \[ CEval(f()) \quad \rho \text{ code returns } \text{exps} \text{ conts pending blocks } \sigma \]

such that \( \text{var} \vdash f : \text{Int} \# \)

\[ \Rightarrow C\text{ReturnInt}(\text{val } \rho \sigma f) \quad \rho \text{ code returns } \text{exps} \text{ conts pending blocks } \sigma \]

11A \[ C\text{ReturnInt operand} \quad \rho \text{ code returns } \text{exps} \text{ conts pending blocks } \sigma \]

such that \( \text{returns} \equiv \left( (\text{var}_{\text{return}}, \text{register}_{\text{return Int} \#}) \right) \text{stack} \)

\[ \Rightarrow \text{ReturnExpression code'} \quad \text{exps conts pending blocks } \sigma \]

where \( \text{code'} = \text{code ++ moves ++ trim\_stacks ++ jump} \)

\( (\text{moves}, \rho') = \text{combine\_moves } \left( \text{move operand}, \text{register}_{\text{return Int}\#}; \; \text{move } (\text{val } \rho \, \sigma \, \text{var}_{\text{return}}), \text{register}_{\text{return}\#} \right) \; \rho \; \sigma \)

\( \text{jump} = \text{jump } (\text{register}_{\text{return}\#}) \)

11B \[ C\text{ReturnInt operand} \quad \rho \text{ code return : returns } \text{exps conts pending blocks } \sigma \]

such that \( \text{return} \equiv (k_1 \rightarrow \text{exp}_1 \ldots \ k_n \rightarrow \text{exp}_n, \ldots \rightarrow \text{exp}_d, \text{vars}_{\text{free}}) \text{case} \)

\[ \Rightarrow \text{cont}_1 \quad \text{exps conts' pending blocks } \sigma \]

where \( \text{cont}_1 = CEval \text{ exp}, \rho \_ () \text{ returns} \)

\( \text{conts}' = \text{conts}_2 : \ldots : \text{conts}_n : \text{conts}_d : \text{join returns : conts} \)

\( \rho'_1 = (\rho \_ \{ \text{vars}_{\text{free}} \_ \text{EV} [\text{exp}] \}) \)

\( \text{join returns} = C\text{JoinReturns operand code return} \)

12 \[ C\text{ReturnInt operand} \quad \rho \text{ code return : returns } \text{exps conts pending blocks } \sigma \]

such that \( \text{return} \equiv (\text{var}, \text{exp}_{\text{body}}, \text{vars}_{\text{dead}}) \text{assign} \)

\[ \Rightarrow CEval \text{ exp}_{\text{body}} \quad \rho' \quad \text{code returns } \text{exps conts pending blocks } \sigma \]

where \( \rho' = (\rho \setminus \text{vars}_{\text{dead}}) \oplus \{ \text{var} \mapsto \text{operand} \} \)

13' \[ C\text{JoinReturns operand code return exps conts pending blocks } \sigma \]

\[ \Rightarrow \text{ReturnExpression code'} \quad \text{exps' conts pending blocks' } \sigma \]

where \( \text{code'} = \text{code ++ select\_alt} \)

\[
\begin{array}{rl}
\text{select\_alt} = & \text{move } k_1, \text{register}_{\text{tmp1}}; \\
& \text{subtract register}_{\text{tmp1}}, \text{operand}, \text{register}_{\text{tmp1}}; \\
& \text{branch}_{z=0} \text{ register}_{\text{tmp1}}, \text{label}_{\text{unique 1}}; \\
& \quad \vdots \\
& \text{move } k_n, \text{register}_{\text{tmp1}}; \\
& \text{subtract register}_{\text{tmp1}}, \text{operand}, \text{register}_{\text{tmp1}}; \\
& \text{branch}_{z=0} \text{ register}_{\text{tmp1}}, \text{label}_{\text{unique n}}; \\
& \text{jump label}_{\text{unique d}}
\end{array}
\]

\( \text{blocks'} = \text{blocks} \oplus \{ \text{label}_{\text{unique 1}} \mapsto \text{code}_1, \ldots, \text{label}_{\text{unique n}} \mapsto \text{code}_n, \text{label}_{\text{unique d}} \mapsto \text{code}_d \} \)

\( \text{return} \equiv (k_1 \rightarrow \text{exp}_1 \ldots k_n \rightarrow \text{exp}_n, \ldots \rightarrow \text{exp}_d, \text{vars}_{\text{free}})_{\text{case}} \)

\( \text{exps} \equiv \text{code}_d : \text{code}_n : \ldots : \text{code}_1 : \text{exps'} \)
Note, the above rule uses a linear search, which will be inefficient when dealing with large numbers of alternatives – Bernstein [1985] describes an algorithm for generating the optimal combination of linear and binary searches, and jump tables.
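
The linear search emitted for a case expression can be sketched as a small code generator. The instruction type and all names below are hypothetical stand-ins for the RISC instructions of appendix F, and registers and labels are represented as plain strings; the sketch only shows the shape of the generated sequence, namely one move/subtract/branch triple per alternative followed by a jump to the default.

```haskell
-- A hypothetical subset of the instruction set, sufficient for the search.
data Instr
  = Move Int String                -- move k, register
  | Subtract String String String  -- subtract x, y, target
  | BranchZero String String       -- branch to label if register is zero
  | Jump String                    -- unconditional jump to label
  deriving (Show, Eq)

-- | Test the scrutinee against each alternative in turn, then fall
--   through to the default alternative.
selectAlt :: String -> [(Int, String)] -> String -> [Instr]
selectAlt operand alts defaultLabel =
  concatMap test alts ++ [Jump defaultLabel]
  where
    tmp = "register_tmp1"
    test (k, label) =
      [ Move k tmp
      , Subtract tmp operand tmp
      , BranchZero tmp label
      ]
```

The generated code has length 3n + 1 for n alternatives, and in the worst case every triple is executed, which is the inefficiency referred to above.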

\[ CEval \text{ primIntPlus\# atom}_1 \text{ atom}_2 \quad \rho \text{ code returns exps conts pending blocks } \sigma \]

\[ \Rightarrow CReturnInt \text{ register}_{\text{tmp1}} \quad \rho \text{ code' returns exps conts pending blocks } \sigma \]

where

- code' = code ++ add\(_{atoms}\)
- add\(_{atoms}\) = move (atom\(_{toOperand}\) \(\rho\) \(\sigma\) atom\(_1\)), register\(_{tmp1}\);
  move (atom\(_{toOperand}\) \(\rho\) \(\sigma\) atom\(_2\)), register\(_{tmp2}\);
  add register\(_{tmp1}\), register\(_{tmp2}\), register\(_{tmp1}\);
Appendix I

Example RISC programs

This section presents a number of RISC implementations of the STG' routines from chapter B, as generated by the compilation routines from chapter 8 (including some hand editing).

Section I.1 looks at some of the prelude operations used to support integers, booleans, and lists. Two nofib programs, fib and primes, are then presented in sections I.2 and I.3. The remaining sections, I.4 and I.5, look at updating polymorphic algebraic constructors and partial applications respectively.

I.1 Prelude operations

This section looks at the RISC definitions needed to support the three main data types of the Haskell language, namely integers, booleans, and lists. Where applicable, the equivalent STG' code is also included. All of the RISC bindings have been taken directly from the library of test routines used by the prototyping system (see section 3.4).

I.1.1 Integers

Constants

The following STG' declarations define the constants zero and one:

\begin{verbatim}
STG' code
zero = [] \r [] -> Int [0#] ;
one  = [] \r [] -> Int [1#] ;
\end{verbatim}

The equivalent RISC code consists of two closure definitions, the reversed info table for integers, and the corresponding update code:

\begin{verbatim}
RISC code
closure zero Linfo_table_Int, 0;
closure one  Linfo_table_Int, 1;

Linfo_table_Int:
  dw Lupdate_Int;        // update routine
  dw Linfo_table_Int;    // fast entry
  dw Linfo_table_Int;    // stnd entry
  load +4(RNp), R1;      // load the integer value into R1
  jump +4 RRet;          // and return
\end{verbatim}
RISC code

Lupdate_Int:

load_high Linfo_table_Int(R0), R2; // load the integer info table
store R2, (RNp); // and overwrite the closure's
store R1, +4(RNp); // save the integer value
jump +4 RRet; // invoke the actual return address

Addition

The STG' definition for the addition operator is as follows:

```
const.Int.+ = [] \r [x y] -> case x of
  { Int x'  -> case y of { Int y'  -> let# xy = plusInt# [x', y'] in Int [xy] ; };
}
```
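The evaluation order imposed by the two nested case expressions can be sketched with a toy model of updatable closures. The `Thunk` class and `int_plus` function are hypothetical illustrations, not part of the prototyping system; they assume only that forcing a closure evaluates it at most once.

```python
class Thunk:
    """A suspended computation that is overwritten with its value once forced."""

    def __init__(self, compute):
        self._compute = compute
        self._value = None
        self._forced = False

    def force(self):
        if not self._forced:
            self._value = self._compute()
            self._compute = None  # the code is no longer needed after the update
            self._forced = True
        return self._value


def int_plus(x, y):
    x_val = x.force()      # case x of { Int x' -> ...
    y_val = y.force()      # case y of { Int y' -> ...
    return x_val + y_val   # let# xy = plusInt# [x', y'] in Int [xy]
```

The first argument is always forced before the second, which matches the three-part structure of the RISC code: evaluate `x`, evaluate `y`, then add.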

The RISC equivalent includes the obligatory static closure, the reversed info table, and the associated code. The code itself is split into three main parts. The first part ensures there are sufficient arguments available to complete the operation, prepares the return vector (including saving the location of the second argument), and then initiates the evaluation of the first argument:

```
closure const.Int.+ Linfo_table_Int.+;
Linfo_table_Int.+:

  dw Lupdate_Int.+; // update routine
  dw Linfo_table_Int.+ +12; // fast entry
  dw Linfo_table_Int.+; // stnd entry

  subtract RStkA, RStkABase, R1; // calculate the number of args
  subtract R1, +8, R1; // are there at least two?
  branch_x<0 R1, Lupdate_PAP; // if not, perform an update
  load -8(RStkA), RNp; // load the node pointer of arg1
  load -4(RStkA), R1; // load the node pointer of arg2
  subtract RStkB, +4, RStkB; // trim the B stack
  store R1, -4(RStkA); // ...and save arg2
  store RRet, +4(RStkB); // save the return pointer
  load (RNp), R1; // get the info table of arg1
  jump_link R1, RRet; // enter the closure
  branch Lupdate_Int; // handle an update request
```
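The argument check at the start of the routine is simple pointer arithmetic. A minimal sketch, assuming four-byte stack slots (as the `+4` and `+8` offsets suggest) and an A stack that grows upwards from its base; the function names are hypothetical:

```python
WORD = 4  # bytes per stack slot, inferred from the +4/+8 offsets in the listings


def args_available(stk_a, stk_a_base):
    """Number of arguments currently in the A stack frame."""
    return (stk_a - stk_a_base) // WORD


def needs_partial_application(stk_a, stk_a_base, arity):
    """Mirror of: subtract RStkA, RStkABase, R1; subtract R1, +8, R1; branch_x<0 R1, ...

    Returns True when too few arguments are present, i.e. when the RISC code
    would branch to the partial-application update routine.
    """
    return (stk_a - stk_a_base) - arity * WORD < 0
```

With a base of 100 and a stack pointer of 108 there are two arguments, so a two-argument function such as the addition operator proceeds; with a stack pointer of 104 it would branch to the update routine instead.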

The second part is called when the first argument has been evaluated, and it recovers the address of the second argument, prepares another return vector (including saving the integer value of the first argument), and initiates the evaluation of the second argument:

```
load -4(RStkA), RNp; // load the node pointer of arg2
subtract RStkB, +4, RStkB; // re-allocate stack space (from
subtract RStkA, +4, RStkA; // A to B)
store R1, +4(RStkB); // save the value of R1 on stack B
load (RNp), R1; // get the info table of arg2
jump_link R1, RRet; // enter the closure
branch Lupdate_Int; // handle an update request
```

Finally, the two integers are added together and the return continuation invoked:

```risc
RISC code
load +4(RStkB), R2; // recover the value of arg1
add R1, R2, R1; // add the two values
load +8(RStkB), RRet; // recover the return register
add RStkB, +8, RStkB; // trim the B stack
jump +4 RRet; // and return normally
```

The less-than operator

```stg'
STG' code
const.Int.< = [] \r [x y] ->
case x of { Int x' -> case y of { Int y' -> ltInt# [x', y'] ; } ; } ;
```

The structure of the RISC code is almost identical to that of the addition operation from the previous section. The only major difference is the final operation performed on the two arguments:

```risc
RISC code
closure const.Int.< Linfo_table_Int_<;
Linfo_table_Int_<:
dw Lupdate_Int_<; // update routine
dw Linfo_table_Int_< +12; // fast entry
dw Linfo_table_Int_<; // stnd entry
...
load +4(RStkB), R2; // recover the value of arg1
compare_x<y R2, R1, R1; // compare the two values
load +8(RStkB), RRet; // recover the return register
add RStkB, +8, RStkB; // trim the B stack
jump +4 RRet; // and return normally
```

Quotients

The quotient function, quotRem, demonstrates the basic techniques of stack allocation and tail calling. The RISC definition given below is based on Int#-specialised versions of the following prelude function:

```stg'
STG' code
const.Int.quotRem = [] \r [n d] ->
let { q = [n d] \u [] -> const.Int.quot n d;
r = [n d] \u [] -> const.Int.rem n d; } in Tup2 [q, r];
```

The RISC code allocates two thunks, each containing the addresses of the two arguments, and returns a pair containing the thunks' locations:

```risc
RISC code
closure const.Int.quotRem Linfo_table_Int_quotRem;
Linfo_table_Int_quotRem:
dw Lupdate_Int_quotRem; // update routine
dw Linfo_table_Int_quotRem +12; // fast entry
dw Linfo_table_Int_quotRem; // stnd entry
subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +8, R1; // are there at least two?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update
```
After checking that there are sufficient arguments available, heap space for the two thunks is allocated (3 words of space for each):

```
add RHp, +24, RHp; // allocate space for 2 closures
compare_x<y RHLimit, RHp, R1; // ensure there's space
branch_bit0_set R1, Lgarbage_collect; // otherwise invoke the GC
```
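The allocation sequence above is a bump allocator: the heap pointer is advanced first, checked against the limit, and the new closures are then addressed at negative offsets from the updated pointer. A minimal sketch under those assumptions; `allocate`, `field_offsets` and `HeapExhausted` are hypothetical names, and raising an exception stands in for the branch to the garbage collector:

```python
class HeapExhausted(Exception):
    """Stands in for the branch to Lgarbage_collect in the RISC code."""


def allocate(hp, hp_limit, nbytes):
    """Advance the heap pointer by nbytes, or signal that a collection is needed."""
    new_hp = hp + nbytes          # add RHp, +24, RHp
    if hp_limit < new_hp:         # compare_x<y RHLimit, RHp, R1
        raise HeapExhausted()     # branch_bit0_set R1, Lgarbage_collect
    return new_hp


def field_offsets(nbytes, word=4):
    """Offsets (relative to the new heap pointer) of each word just allocated."""
    return [-(nbytes - i) for i in range(0, nbytes, word)]
```

For the two three-word thunks this gives the offsets -24, -20, -16 for `q` and -12, -8, -4 for `r`, matching the stores in the code that follows.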

The thunk for q is then filled in:

```
load_high Linfo_table_Int_quotRem_1(R0), R1;
load_address +0(R1), R1;
store R1, -24(RHp); // set q's info table
load -8(RStkA), R1; // recover the location of n
store R1, -20(RHp); // store n as a free variable
load -4(RStkA), R2; // recover the location of d
store R2, -16(RHp); // store d as a free variable
```

Next, the thunk for r is filled in:

```
load_high Linfo_table_Int_quotRem_2(R0), R3;
load_address +0(R3), R3;
store R3, -12(RHp); // set r's info table
store R1, -8(RHp); // store n as a free variable
store R2, -4(RHp); // store d as a free variable
```

Finally, the tuple is constructed and the appropriate return entry called:

```
subtract RHp, +24, R1; // set q as the fst pointer
subtract RHp, +12, R2; // set r as the snd pointer
subtract RStkA, +8, RStkA; // trim the A stack
jump +4 RRet; // and return
```

The info tables and corresponding code for the two thunks, quotRem_1 and quotRem_2 are very similar, so only that for determining the quotient is reproduced here:

```
Linfo_table_Int_quotRem_1:
dw Lupdate_Int_quotRem_1; // update routine
dw Linfo_table_Int_quotRem_1 +12; // fast entry
dw Linfo_table_Int_quotRem_1; // stnd entry
```

Upon entry to the thunk, an update frame is created:

```
subtract RStkB, +16, RStkB; // decrease the B stack frame
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
store RStkABase, +16(RStkB); // the A stack pointer
store RStkBBase, +12(RStkB); // the B stack pointer
store RNp, +8(RStkB); // the node pointer
store RRet, +4(RStkB); // the current return vector
move RStkA, RStkABase;
move RStkB, RStkBBase;
move RUpdate, RRet; // set the return to an update
```
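The update frame can be pictured as four words pushed onto the downward-growing B stack, after which both stack bases are reset and the return register is redirected to the update handler. The sketch below follows the comments in the listing (A stack base, B stack base, node pointer, return vector); the `Machine` record and the use of a dictionary for memory are illustrative assumptions only.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    stk_a: int
    stk_b: int
    stk_a_base: int
    stk_b_base: int
    node: str
    ret: str
    memory: dict = field(default_factory=dict)


def push_update_frame(m):
    """Push a four-word update frame and start a fresh pair of stack frames."""
    m.stk_b -= 16                          # subtract RStkB, +16, RStkB
    if m.stk_b < m.stk_a:                  # the two stacks have collided
        raise OverflowError("stack overflow")
    m.memory[m.stk_b + 16] = m.stk_a_base  # the A stack base
    m.memory[m.stk_b + 12] = m.stk_b_base  # the B stack base
    m.memory[m.stk_b + 8] = m.node         # the closure to be updated
    m.memory[m.stk_b + 4] = m.ret          # the current return vector
    m.stk_a_base = m.stk_a                 # both frames are now empty
    m.stk_b_base = m.stk_b
    m.ret = "RUpdate"                      # returns now trigger an update
```

Because the bases are reset, the argument check of any function entered from the thunk sees an empty frame, which is what allows an under-applied function to detect the update frame.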

Then, sufficient space is allocated to allow the two free variables, n and d, to be pushed onto the A stack:

```
add RStkA, +8, RStkA;
compare_x<y RStkB, RStkA, R1;  // check for stack overflow
branch_bit0_set R1, Lstack_overflow;  // overflow error handler
```

The two free variables are then recovered, and pushed onto the stack, before the division function is tail called.

Signs

The sign function, `signum`, demonstrates how the RISC code handles conditionals, and the use of worker functions. The RISC definitions given below are based on `Int#`-specialised versions of the following prelude functions:

```
const.Int.signum = [] \r [x] -> case x of {Int x' -> const.Int.signum.wrk x';};
const.Int.signum.wrk = [] \r [x] -> case x of
  {0# -> Int [0#];
  _ -> case gtInt# [x, 0#] of { True -> Int [ 1#]; False -> Int [ -1#]; } }
```

The first part of the code forces evaluation of the argument, and then tail calls the worker function, `LInt_signum_wrk`:

```
closure const.Int.signum    Linfo_table_Int_signum;
Linfo_table_Int_signum:
  dw Lupdate_Int_signum;  // update routine
  dw Linfo_table_Int_signum +12;  // fast entry
  dw Linfo_table_Int_signum;  // stdn entry
  subtract RStkA, RStkABase, R1;  // calculate the number of args
  subtract R1, +4, R1;  // is there at least one?
  branch_x<0 R1, Lupdate_PAP;  // if not, perform an update
  load -4(RStkA), RNp;  // pop the arg
  load (RNp), R1;
  subtract RStkA, +4, RStkA;  // re-allocate stack space
  subtract RStkB, +4, RStkB;
  store RRet, +4(RStkB);  // save the return register
  jump_link R1, RRet;  // evaluate the arg
  branch Lupdate_Int;  // handle an update request
  load +4(RStkB), RRet;  // recover the return register
  add RStkB, +4, RStkB;  // trim the stack
  branch LInt_signum_wrk;  // call the worker function
```
The worker function then determines whether the value is zero, negative or positive, and returns the corresponding integer value:

```
RISC code
LInt_signum_wrk:
  branch_x=0 R1, LInt_signum_wrk_1;  // return 0 if it's zero
  branch_x<0 R1, LInt_signum_wrk_2;  // return -1 if it's negative
  add R0, +1, R1;
  jump +4 RRet;

LInt_signum_wrk_1:
  move R0, R1;
  jump +4 RRet;

LInt_signum_wrk_2:
  subtract R0, +1, R1;
  jump +4 RRet;
```
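The worker's control flow is a three-way branch on an unboxed integer. A direct transcription, with `signum_wrk` as a hypothetical name mirroring the label `LInt_signum_wrk`:

```python
def signum_wrk(x):
    """Three-way branch on an unboxed integer, in the order used by the RISC code."""
    if x == 0:     # branch_x=0 R1, LInt_signum_wrk_1
        return 0
    if x < 0:      # branch_x<0 R1, LInt_signum_wrk_2
        return -1
    return 1       # fall through: add R0, +1, R1
```

Note that the RISC code tests for a negative value directly, whereas the STG' source tests `gtInt#` against zero; the two orderings give the same result for every integer.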

I.1.2 Booleans

```
STG' code
data Bool = True | False;
true = [] \r [] -> True [];
false = [] \r [] -> False [];
otherwise = [] \r [] -> True [];

RISC code

closure true Linfo_table_Bool_True;
closure false Linfo_table_Bool_False;
closure otherwise Linfo_table_Bool_True;
```

There are two info tables for dealing with boolean values: one for true, and the other for false. Note that the code uses the convention that true is represented by the integer one and false by zero:

```
RISC code
Linfo_table_Bool_True:
  dw Lupdate_Bool;  // update routine
  dw Linfo_table_Bool_True;  // fast entry
  dw Linfo_table_Bool_True;  // stnd entry
  add R0, +1, R1;
  jump +4 RRet;

Linfo_table_Bool_False:
  dw Lupdate_Bool;  // update routine
  dw Linfo_table_Bool_False;  // fast entry
  dw Linfo_table_Bool_False;  // stnd entry
  move R0, R1;
  jump +4 RRet;
```

The update code for boolean values is straightforward, simply overwriting the thunk's info table with either that for true or false:
Logical negation

\begin{verbatim}
STG' code
not = [] \r [x] -> case x of { True -> False [] ; False -> True [] ; };
\end{verbatim}

First, there's the usual argument check:

\begin{verbatim}
// calculate the number of args
// are there at least two?
// if not, perform an update
\end{verbatim}

Then the argument is evaluated:

\begin{verbatim}
// retrieve the return vector
// trim the stack
// if false, return true
// return false
// return true
\end{verbatim}
I.1.3 Lists

Rather than introducing special syntactic support, the following STG' declaration is used to define the List algebraic data type:

```
STG' code
data List a = Cons a (List a) | Nil;
```


Nil and null

The nil value represents an empty list:

```
STG' code
nil = [] \r [] -> Nil [];
```

The RISC code simply calls the nil entry from the return vector:

```
RISC code
closure nil Linfo_table_Nil;
Linfo_table_Nil:
    dw Lupdate_Nil; // update routine
    dw Linfo_table_Nil; // fast entry
    dw Linfo_table_Nil; // stnd entry
    load_high Lnil_head(R0), R1; // load dummy values into the
    load_high Lnil_tail(R0), R2; // head and tail to help debugging
    load -4(RRet), R3; // select the nil return entry
    jump R3; // and return
```

To illustrate how the return vector is constructed and used, consider the null operator:

```
STG' code
null = [] \r [xss] -> case xss of { Nil -> True []; Cons x xs -> False []; };
```

The RISC code performs the usual argument checks and then forces the evaluation of its argument:

```
RISC code
closure null Linfo_table_null;
Linfo_table_null:
    dw Lupdate_null; // update routine
    dw Linfo_table_null +12; // fast entry
    dw Linfo_table_null; // stnd entry
    subtract RStkA, RStkABase, R1; // calculate the number of args
    subtract R1, +4, R1; // is there at least one?
    branch_x<0 R1, Lupdate_PAP; // if not, perform an update
    load -4(RStkA), RNp; // fetch the arg
    load (RNp), R1;
    subtract RStkA, +4, RStkA; // re-organise the stacks
```

However, the return is set to a custom return vector which correctly handles the nil and non-nil lists:

```
RISC code

store RRet, +4(RStkB);  // save the return register
load_high Lnull_return_1(R0), RRet;  // set the return register
load_address +0(RRet), RRet;
jump R1;
```

The return vector is specified as follows:

```
RISC code

Lnull_return_1:

dw Lupdate_List;
dw Lnull_return_List;
dw Lupdate_Nil;
dw Lnull_return_Nil;
```

The odd entries point to the associated update routines, while the even entries point to the code to handle the various cases (nil and non-nil lists). Nil-returns are handled as follows (the following two sections will look at the update entries):

```
RISC code

Lnull_return_Nil:

load +4(RStkB), RRet; // recover the return reg
add RStkB, +4, RStkB; // trim the stack
add R0, +1, R1; // set the return to true
jump +4 RRet; // perform a normal return
```

Non-nil returns simply return false:

```
RISC code

Lnull_return_List:

load +4(RStkB), RRet; // recover the return reg
add RStkB, +4, RStkB; // trim the stack
move R0, R1; // set the return to false
jump +4 RRet; // perform a normal return
```

Updating empty lists

As for boolean values, updating a thunk with a nil list is simply a matter of resetting its info table, and then invoking the nil return from the original return vector:

```
RISC code

Lupdate_Nil:

load_high Linfo_table_Nil(R0), R1;
store R1, (RNp);
load -4(RRet), R1;
jump R1;
```
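The negative offsets used to select an entry (`-4(RRet)` for the nil return here, `-12(RRet)` for the list return in the list update routine) suggest that the return register points just past the end of the four-word vector. The following sketch models that addressing scheme; this reading of the layout is an inference from the listings, and `make_return_vector` and `entry_at` are hypothetical names.

```python
WORD = 4  # bytes per table entry


def make_return_vector(update_list, return_list, update_nil, return_nil):
    """Entries in the order given by the dw directives of a return vector."""
    return [update_list, return_list, update_nil, return_nil]


def entry_at(vector, offset):
    """Resolve a negative byte offset relative to the address just past the vector."""
    if offset >= 0 or offset % WORD != 0:
        raise ValueError("offset must be a negative multiple of the word size")
    return vector[len(vector) + offset // WORD]
```

Under this model the update entries sit at offsets -16 and -8, and the return entries at -12 and -4, which agrees with the description of odd entries as update routines and even entries as case code.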

Updating lists

Updating lists, on the other hand, is more involved. First a cons cell is allocated, and the head and tail are stored into it. Then the thunk is overwritten with an indirection to the new cons cell, before the non-nil return is invoked from the original return vector:

--- RISC code ---

```plaintext
Lupdate_List:

add RHp, +12, RHp; // allocate space for a cons cell
compare_x<y RHLimit, RHp, R3; // ensure there's space
branch_bit0_set R3, Lgarbage_collect; // otherwise invoke the GC

load_high Linfo_table_List(R0), R3;
store R3, -12(RHp);
store R1, -8(RHp);
store R2, -4(RHp);

load_high Linfo_table_Ind(R0), R3; // create an indirection
load_address +0(R3), R3; // to the new closure
store R3, (RNp);
subtract RHp, +12, R3;
store R3, +4(RNp);

load -12(RRet), R3; // invoke the regular return
jump R3;
```

The info table and code for the cons cell is shown below:

Selecting the head of a list

--- STG' code ---

```
head = [] \r [xss] -> case xss of { Cons x xs -> x ; Nil -> error# [] ; };
```

Again, as for null, the code first forces the evaluation of the list, using a custom return vector:

--- RISC code ---

closure head Linfo_table_head;

Linfo_table_head:

```plaintext
dw Lupdate_head; // update routine
dw Linfo_table_head +12; // fast entry
dw Linfo_table_head; // stnd entry

subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +4, R1; // is there at least one?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update

load -4(RStkA), RNp; // fetch the arg
load (RNp), R1;
subtract RStkA, +4, RStkA; // re-organise the stacks
```
The return vector is specified as follows:

```
Lhead_return_1:
  dw Lupdate_List;
  dw Lhead_return_List;
  dw Lupdate_Nil;
  dw Lhead_return_Nil;  // will cause an error!
```

The nil return simply throws an error, ending the current evaluation. However, the list return simply forces the evaluation of the head of the list:

```
Lhead_return_List:
  load +4(RStkB), RRet;  // recover the return reg
  add RStkB, +4, RStkB;  // trim the stack
  move R1, RNp;          // set the node pointer
  load (RNp), R1;        // fetch the entry code
  jump R1;              // evaluate the head of the list
```

Length

Rather than use the foldl-based version, the more traditional version is used:

```
length = [] \r [xs] -> case xs of
  { Nil -> Int [0#] ;
    Cons x xs -> case length xs of { Int l -> let# l' = plusInt# [1#, l]
      in Int [l'] ; }; };
```

The RISC implementation demonstrates the use of recursion and the fast-entry method (effectively skipping the argument check). The method starts as before, by evaluating its argument:

```
closure length Linfo_table_length;
Linfo_table_length:
  dw Lupdate_length;                     // update routine
  dw Linfo_table_length +12;             // fast entry
  dw Linfo_table_length;                 // stnd entry
  subtract RStkA, RStkABase, R1;         // calculate the number of args
  subtract R1, +4, R1;                   // is there at least one?
  branch_x<0 R1, Lupdate_PAP;            // if not, perform an update
  load -4(RStkA), RNp;                   // fetch the arg
  load (RNp), R1;
  subtract RStkB, +4, RStkB;             // re-organise the stacks
```

This time, however, both nil and non-nil entries of the return vector are used:

```
RISC code
Llength_return_1:
    dw Lupdate_List;
    dw Llength_return_List;
    dw Lupdate_Nil;
    dw Llength_return_Nil;
```

A nil-return simply returns a zero length:

```
RISC code
Llength_return_Nil:
    load +4(RStkB), RRet; // recover the return address
    add RStkB, +4, RStkB; // trim the stack
    move R0, R1; // set length = 0
    jump +4 RRet; // and return
```

A list return, however, retrieves the tail of the list and fast-calls the length method (bypassing the argument check and pre-loading the necessary arguments into the correct register):

```
RISC code
Llength_return_List:
    add RStkA, +4, RStkA;
    compare_x<y RStkB, RStkA, R1; // check for stack overflow
    branch_bit0_set R1, Lstack_overflow; // overflow error handler
    store R2, -4(RStkA); // push the tail
    branch_link Linfo_table_length +12, RRet; // and calculate its length
    branch Lupdate_Int; // handle the update
```

Upon return, one is added to the tail's length:

```
RISC code
add R1, +1, R1; // increment the result
load +4(RStkB), RRet; // recover the return address
add RStkB, +4, RStkB; // trim the stack
jump +4 RRet; // and return
```

Map

```
STG' code
map = [] \r [f xss] -> case xss of
  { Nil -> Nil [] ;
    Cons x xs -> let { x' = [f x] \u [] -> f x ;
                       xs' = [f xs] \u [] -> map f xs ; } in Cons [x', xs'] ;
  };
```

The return vector is specified below:

```
Lmap_return_1:
    dw Lupdate_List;
    dw Lmap_return_List;
    dw Lupdate_Nil;
    dw Lmap_return_Nil;
```

A nil return results in another nil return:

```
Lmap_return_Nil:
    load +4(RStkB), RRet;           // recover the return address
    subtract RStkA, +4, RStkA;     // trim the stack
    add RStkB, +4, RStkB;          // trim the stack
    load -4(RRet), R1;             // fetch the nil return
    jump R1;                       // and return
```

A non-nil return, however, results in the creation of the x' and xs' closures (represented by the map_1 and map_2 thunks respectively), which are then returned as a cons pair:

```
Lmap_return_List:
    add RHp, +24, RHp;              // allocate two closures
    compare_x<y RHLimit, RHp, R3;  // ensure there's space
    branch_bit0_set R3, Lgarbage_collect; // otherwise invoke the GC
    load_high Linfo_table_map_1(R0), R3; // set the info table
    load_address +0(R3), R3;
    store R3, -24(RHp);
    load -4(RStkA), R3;            // recover f
    store R3, -20(RHp);            // store f
    store R1, -16(RHp);            // store x
```
The x' thunk, when evaluated, pushes an update frame, retrieves its free variables and invokes the function on the list element:

RISC code

Linfo_table_map_1:

dw Lupdate_map_1;  // update routine
dw Linfo_table_map_1 +12;  // fast entry
dw Linfo_table_map_1;  // stnd entry

subtract RStkB, +16, RStkB;  // decrease the B stack frame
compare_x<y RStkB, RStkA, R1;  // check for stack overflow
branch_bit0_set R1, Lstack_overflow;  // overflow error handler
store RStkABase, +16(RStkB);  // the A stack pointer
store RStkBBase, +12(RStkB);  // the B stack pointer
store RNp, +8(RStkB);  // the node pointer
store RRet, +4(RStkB);  // the current return vector
move RStkA, RStkABase;
move RStkB, RStkBBase;
move RUpdate, RRet;  // set the return to an update

add RStkA, +4, RStkA;
compare_x<y RStkB, RStkA, R1;  // check for stack overflow
branch_bit0_set R1, Lstack_overflow;  // overflow error handler
load +8(RNp), R1;  // fetch x
store R1, -4(RStkA);  // push x
load +4(RNp), RNp;  // fetch f
load (RNp), R1;
jump R1;  // enter f
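The way map builds one thunk for the new head and one for the new tail, without applying the function until the head is demanded, can be sketched as follows. This is an illustrative model only: `Thunk`, `from_list` and `lazy_map` are hypothetical names, cons cells are modelled as Python pairs, and `None` stands for the empty list.

```python
class Thunk:
    """A suspended computation, evaluated at most once."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._compute = None
            self._done = True
        return self._value


NIL = None  # the empty list


def from_list(values):
    """Build a lazy list whose spine and elements are all thunks."""
    if not values:
        return Thunk(lambda: NIL)
    return Thunk(lambda: (Thunk(lambda: values[0]), from_list(values[1:])))


def lazy_map(f, xss):
    cell = xss.force()                       # case xss of ...
    if cell is NIL:
        return NIL                           # Nil -> Nil []
    x, xs = cell
    x_new = Thunk(lambda: f(x.force()))      # x'  = [f x]  \u [] -> f x
    xs_new = Thunk(lambda: lazy_map(f, xs))  # xs' = [f xs] \u [] -> map f xs
    return (x_new, xs_new)                   # Cons [x', xs']
```

Calling `lazy_map` on a three-element list performs no applications of `f` at all; each application happens only when the corresponding head thunk is forced, and only once.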

The xs' thunk, when evaluated, pushes an update frame, retrieves its free variables and tail-calls map via its fast entry point:

RISC code

Linfo_table_map_2:

dw Lupdate_map_2;  // update routine
dw Linfo_table_map_2 +12;  // fast entry
dw Linfo_table_map_2;  // stnd entry
I.2 Generating Fibonacci numbers

Unoptimised version

The main entry point implements the following code:

```
STG' code

fib = [] \r [n] -> case const.Int.<= n one of
  { True -> one ;
    False -> let { sum_2_fibs = ... } in const.Int.+ sum_2_fibs one;
  };
```

As seen before, the standard entry point checks that there are sufficient arguments and evaluates the argument:

```
RISC code

Linfo_table_Fib:

dw Lupdate_Fib; // update routine
dw Linfo_table_Fib +12; // fast entry
dw Linfo_table_Fib; // stnd entry
subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +4, R1; // is there at least one?
branch_x<0 R1, Lupdate_Fib_PAP; // if not, perform an update
subtract RStkB, +4, RStkB; // make room for the return pointer
add RStkA, +8, RStkA; // push the args
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
store RRet, +4(RStkB); // save the return pointer
load -12(RStkA), R1; // load the argument n
store R1, -8(RStkA); // push the arg n
load_high Lone(R0), R1; // load the closure one
load_address +0(R1), R1; // (low bits)
store R1, -4(RStkA); // push the arg one
branch_link Linfo_table_Int_<=, RRet; // call <= n one
branch Lupdate_Bool; // handle an update request
```
The simple case is handled by the code stored at Lfib_1; otherwise, the `sum_2_fibs` closure is allocated (represented by the Fib_1 thunk) and the addition operator called:

\begin{verbatim}
 RISC code

 branch_x<0 R1, Lfib_1; // return one if it's zero
 add RHp, +8, RHp; // heap allocate sum_fibs
 compare_x<y RHLimit, RHp, R1; // ensure there's space
 branch_bit0_set R1, Lgarbage_collect; // otherwise invoke the GC
 load_high Linfo_table_Fib_1(R0), R1; // load the info-table ptr
 add R1, +0, R1; // (low bits)
 store R1, -8(RHp); // store it in the closure
 load -4(RStkA), R1; // recover the ptr to n
 store R1, -4(RHp); // store it in the closure
 load +4(RStkB), RRet; // recover the return ptr
 add RStkB, +4, RStkB; // re-allocate stack space
 add RStkA, +4, RStkA; // (from B to A)
 load_high Lone(R0), R1; // load the address of one
 load_address +0(R1), R1; // (low bits)
 store R1, -8(RStkA); // push one
 subtract RHp, +8, R1; // calculate the closure's heap
 store R1, -4(RStkA); // address, and push it as an arg
 branch Linfo_table_Int_+; // add them

\end{verbatim}

The following code handles the simple case whereby the argument is less than or equal to one, and simply returns the value one:

\begin{verbatim}
 RISC code

 Lfib_1:

 load +4(RStkB), RRet; // recover the return register
 add RStkB, +4, RStkB; // trim the B stack
 subtract RStkA, +4, RStkA; // trim the A stack
 add R0, +1, R1; // set value to one
 jump +4 RRet; // return

\end{verbatim}

The code for the `sum_2_fibs` closure implements the following STG' code:

\begin{verbatim}
 STG' code

 sum_2_fibs = [n] \u [] -> let { fib_n_less_2 = [n] \u [] -> ...;
                                 fib_n_less_1 = [n] \u [] -> ...; }
              in const.Int.+ fib_n_less_1 fib_n_less_2;
\end{verbatim}

The code pushes an update frame, and then heap allocates the two closures before tail-calling the addition operator:

\begin{verbatim}
 RISC code

 Linfo_table_Fib_1:

 dw Lupdate_Fib_1; // update routine
 dw Linfo_table_Fib_1; // fast entry
 dw Linfo_table_Fib_1; // stnd entry
 subtract RStkB, +16, RStkB; // decrease the B stack frame
 compare_x<y RStkB, RStkA, R1; // check for stack overflow
 branch_bit0_set R1, Lstack_overflow; // overflow error handler
 store RStkABase, +16(RStkB); // the A stack pointer
 store RStkBBase, +12(RStkB); // the B stack pointer
 store RNp, +8(RStkB); // the node pointer
 store RRet, +4(RStkB); // the current return vector
 move RStkA, RStkABase; // clear the A stack frame
 move RStkB, RStkBBase; // clear the B stack frame
 move RUpdate, RRet; // set the return to an update

\end{verbatim}
The \texttt{fib\_n\_less\_2} and \texttt{fib\_n\_less\_1} closures are represented by the \texttt{Fib\_2} and \texttt{Fib\_3} thunks:

\begin{verbatim}
add RHp, +16, RHp; // increase the heap pointer
compare_x<y RHLimit, RHp, R1; // check if there's room
branch_bit0_set R1, Lgarbage_collect; // otherwise, invoke the GC
load_high Linfo_table_Fib_2(R0), R1; // load the info-table ptr
load_address +0(R1), R1; // (low bits)
store R1, -16(RHp); // store it in the closure
load +4(RNp), R1; // load the FV n
store R1, -12(RHp); // ...and store it in 1
store R1, -4(RHp); // ...and 2
load_high Linfo_table_Fib_3(R0), R1; // load the info-table ptr
load_address +0(R1), R1; // (low bits)
store R1, -8(RHp); // set the closure's info table
add RStkA, +8, RStkA; // allocate 2 arg slots
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
subtract RHp, +16, R1; // calc the first heap addr
store R1, -8(RStkA); // push fib n-1
subtract RHp, +8, R1; // calc the second heap addr
store R1, -4(RStkA); // push fib n-2
branch Linfo_table_Int_+; // and add them
\end{verbatim}

The \texttt{fib\_n\_less\_2} closure implements the following STG' code:

\begin{verbatim}
fib_n_less_2 = [n] \u [] -> let { n_less_2 = [n] \u [] -> const.Int.- n two; } in fib n_less_2;
\end{verbatim}

The corresponding RISC code pushes an update frame, heap allocates \texttt{n\_less\_2} (represented by the \texttt{Fib\_4} thunk) and then tail calls fib:

\begin{verbatim}
add RHp, +16, RHp; // increase the heap pointer
compare_x<y RHLimit, RHp, R1; // check if there's room
branch_bit0_set R1, Lgarbage_collect; // otherwise, invoke the GC
load_high Linfo_table_Fib_4(R0), R1; // load the info-table ptr
load_address +0(R1), R1; // (low bits)
store R1, -8(RHp); // store it in the closure
load +4(RHp), R1; // load the FV n
store R1, -4(RHp); // ...and 2
\end{verbatim}
The n_less_2 closure implements the following STG' code:

\begin{verbatim}
STG' code
n_less_2 = [n] \u [] -> const.Int.- n two;
\end{verbatim}

It is implemented as follows:

```
RISC code
add RStkA, +4, RStkA;
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
subtract RHp, +8, R1;
store R1, -4(RStkA);
branch Linfo_table_Fib; // call Fib
```

The fib_n_less_1 closure is sufficiently similar to fib_n_less_2 that it is not reproduced here.

**Optimised version**

```
STG' code
fib = [] \r [n] -> case n of { Int n' -> fib.wrk n'; };
```

```
RISC code
Linfo_table_Fib:
subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +4, R1; // is there at least one?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update
load -4(RStkA), RNp;
load (RNp), R1; // get the entry code
subtract RStkA, +4, RStkA; // free up the arg. slot,
subtract RStkB, +4, RStkB; // but claim the space back for
store RRet, +4(RStkB); // saving the return pointer
jump_link R1, RRet; // evaluate n
branch Lupdate_Int; // handle an update request
load +4(RStkB), RRet; // recover the return vector
add RStkB, +4, RStkB; // and trim the stack
branch Linfo_table_Fib'; // tail-call fib.wrk
```

STG' code:

```plaintext
fib.wrk = [ ] \r [n'] -> case leInt# [n', 1#] of
{ True -> Int [1#];
False -> let# n'_less_1 = minusInt# [n', 1#] in
  case fib.wrk n'_less_1 of { Int fib_n'_less_1 ->
    let# n'_less_2 = minusInt# [n', 2#] in
    case fib.wrk n'_less_2 of { Int fib_n'_less_2 ->
      let# sum_2_fibs' = plusInt# [fib_n'_less_1, fib_n'_less_2] in
      let# result = plusInt# [sum_2_fibs', 1#] in
      Int [result];
    };
  };};
```
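The worker is easier to follow once the unboxed arithmetic is written out directly. The sketch below is a transcription of the STG' definition above into Python, with `fib_wrk` as a hypothetical name for `fib.wrk`:

```python
def fib_wrk(n):
    """Transcription of fib.wrk: both recursive results are summed, plus one."""
    if n <= 1:                      # case leInt# [n', 1#] of True -> Int [1#]
        return 1
    fib_n_less_1 = fib_wrk(n - 1)   # case fib.wrk n'_less_1 of ...
    fib_n_less_2 = fib_wrk(n - 2)   # case fib.wrk n'_less_2 of ...
    return fib_n_less_1 + fib_n_less_2 + 1
```

Because of the extra increment, the values are 1, 1, 3, 5, 9, 15, ... rather than the usual Fibonacci sequence; the result satisfies the same recurrence as the number of calls made to the worker.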

RISC code

```plaintext
Linfo_table_Fib':
```

```plaintext
dw Lupdate_Fib'; // update routine
dw Linfo_table_Fib'; // fast entry
dw Linfo_table_Fib'; // stnd entry
```

```plaintext
compare_x<=y R1, +1, R2; // test if n' <= 1
branch_bit0_set R2, LFib'_1; // return one if it is
```

```plaintext
subtract R1, +1, R2; // calculate n_less_one'
subtract RStkB, +8, RStkB; // allocate space for n'
compare_x<y RStkB, RStkA, R3; // check for stack overflow
branch_bit0_set R3, Lstack_overflow; // overflow error handler
store R1, +4(RStkB); // save n'
store RRet, +8(RStkB); // save the return vector
move R2, R1;
```

```plaintext
branch_link Linfo_table_Fib', RRet; // recursive call to Fib'
branch Lupdate_Int; // handle an update request
load +4(RStkB), R2; // recover fib' (n' - 1)
add R1, R2, R1; // sum the two values
add R1, +1, R1; // and increment
load +8(RStkB), RRet; // recover the return register
add RStkB, +8, RStkB; // trim the stack
jump +4 RRet;
```
I.3 Generating prime numbers – the sieve of Eratosthenes

Unoptimised version

```
primes = [] \r [a] -> let { primes' = [] \u [] -> ...; } in !! primes' a;
```
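For orientation, the bindings listed in this section (`primes`, `xs`, `from_2`, `the_filter`, `isdivs`, `inc`) can be read together as one lazy program. The following Python sketch models it with generators standing in for lazy lists. It is an illustration under stated assumptions rather than the thesis's own code: the Python function names simply mirror the STG' binding names, and `map head xs !! a` is realised by applying `the_filter` `a` times to `from_2` and taking the head, which is the only part of the structure that lazy evaluation would demand.

```python
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def iterate(f: Callable[[T], T], x: T) -> Iterator[T]:
    """Lazy stream x, f(x), f(f(x)), ... (stands in for `iterate`)."""
    while True:
        yield x
        x = f(x)


def inc(x: int) -> int:
    """inc: add one to an integer."""
    return x + 1


def isdivs(n: int, x: int) -> bool:
    """isdivs n x: true when x is not a multiple of n (mod x n /= 0)."""
    return x % n != 0


def the_filter(nss: Iterator[int]) -> Iterator[int]:
    """the_filter (Cons n ns) = filter (isdivs n) ns."""
    n = next(nss)                       # the head of the stream
    return (x for x in nss if isdivs(n, x))


def primes(a: int) -> int:
    """primes a = (map head (iterate the_filter from_2)) !! a."""
    stream = iterate(inc, 2)            # from_2 = iterate inc two
    for _ in range(a):                  # the a-th element of xs
        stream = the_filter(stream)
    return next(stream)                 # head of that element
```

For example, `[primes(i) for i in range(5)]` yields the first five primes, 2, 3, 5, 7 and 11.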

```
Linfo_table_Primes:

dw Lupdate_Primes;          // update routine
dw Linfo_table_Primes +12;  // fast entry
dw Linfo_table_Primes;      // stnd entry
subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +4, R1;         // is there at least one?
branch_x<0 R1, Lupdate_PAP;  // if not, perform an update
add RHp, +8, RHp;
compare_x<y RHLimit, RHp, R1; // ensure there's space
branch_bit0_set R1, Lgarbage_collect; // otherwise invoke the GC
load_high Linfo_table_Primes_1(R0), R1;
load_address +0(R1), R1;
store R1, -8(RHp);  // set the info table;
load -4(RStkA), R1;    // pop a
add RStkA, +4, RStkA;
compare_x<y RStkB, RStkA, R2; // check for stack overflow
branch_bit0_set R2, Lstack_overflow; // overflow error handler
subtract RHp, +8, R2;   // calculate primes'
store R2, -8(RStkA);    // push primes'
store R1, -4(RStkA);    // push a
branch Linfo_table_!! +12; // tail call !!
```

```
primes' = [] \u [] -> let { xs = [] \u [] -> ...; } in map head xs;
```

```
Linfo_table_Primes_1:

dw Lupdate_Primes_1;        // update routine
dw Linfo_table_Primes_1 +12; // fast entry
dw Linfo_table_Primes_1;     // stnd entry
subtract RStkB, +16, RStkB;  // decrease the B stack frame
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
store RStkABase, +16(RStkB);  // the A stack pointer
store RStkBBase, +12(RStkB);  // the B stack pointer
store RNp, +8(RStkB);         // the node pointer
store RRet, +4(RStkB);        // the current return vector
move RStkA, RStkABase;
move RStkB, RStkBBase;
move RUpdate, RRet; // set the return to an update

add RHp, +8, RHp;
compare_x<y RHLimit, RHp, R1; // ensure there's space
branch_bit0_set R1, Lgarbage_collect; // otherwise invoke the GC
load_high Linfo_table_Primes_2(R0), R1;
load_address +0(R1), R1;
store R1, -8(RHp); // set the info table;

add RStkA, +8, RStkA;
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
load_high Lhead(R0), R1;
load_address +0(R1), R1;
store R1, -8(RStkA); // push head
subtract RHp, +8, R1; // calculate xs
store R1, -4(RStkA); // push xs
branch Linfo_table_map +12; // tail call map
```

---

**STG' code**

```plaintext
xs = [] \u [] -> let { from_2 = [] \u [] -> iterate inc two; }
                 in iterate the_filter from_2;
```

---

**RISC code**

```plaintext
Linfo_table_Primes_2:

dw Lupdate_Primes_2; // update routine
dw Linfo_table_Primes_2 +12; // fast entry
dw Linfo_table_Primes_2; // stnd entry

subtract RStkB, +16, RStkB; // decrease the B stack frame
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
store RStkABase, +16(RStkB); // the A stack pointer
store RStkBBase, +12(RStkB); // the B stack pointer
store RNp, +8(RStkB); // the node pointer
store RRet, +4(RStkB); // the current return vector
move RStkA, RStkABase;
move RStkB, RStkBBase;
move RUpdate, RRet; // set the return to an update

add RHp, +8, RHp;
compare_x<y RHLimit, RHp, R1; // ensure there's space
branch_bit0_set R1, Lgarbage_collect; // otherwise invoke the GC
load_high Linfo_table_Primes_3(R0), R1;
load_address +0(R1), R1;
store R1, -8(RHp); // set the info table;

add RStkA, +8, RStkA;
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
load_high Lthe_filter(R0), R1;
load_address +0(R1), R1;
store R1, -8(RStkA); // push the filter
subtract RHp, +8, R1; // calculate from_2
store R1, -4(RStkA); // push from_2
branch Linfo_table_iterate +12; // tail call iterate
```
--- STG' code ---

from_2 = [] \u [] -> iterate inc two

--- RISC code ---

Linfo_table_Primes_3:

dw Lupdate_Primes_3; // update routine
dw Linfo_table_Primes_3 +12; // fast entry
dw Linfo_table_Primes_3; // stnd entry

subtract RStkB, +16, RStkB; // decrease the B stack frame
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
store RStkABase, +16(RStkB); // the A stack pointer
store RStkBBase, +12(RStkB); // the B stack pointer
store RNp, +8(RStkB); // the node pointer
store RRet, +4(RStkB); // the current return vector
move RStkA, RStkABase;
move RStkB, RStkBBase;
move RUpdate, RRet; // set the return to an update

add RStkA, +8, RStkA;
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
load_high Linc(R0), R1;
load_address +0(R1), R1;
store R1, -8(RStkA); // push inc
load_high Ltwo(R0), R1;
load_address +0(R1), R1;
store R1, -4(RStkA); // push two
branch Linfo_table_iterate +12; // tail call iterate

--- STG' code ---

the_filter = [] \r [nss] -> case nss of { Cons n ns ->
       let { isdivs_n = [n] \r [x] -> isdivs n x; } in filter isdivs_n ns; };
RISC code

LThe_Filter_return_1:

dw Lupdate_List;
dw LThe_Filter_return_List;
dw Lupdate_Nil;
dw LThe_Filter_return_Nil_error;

RISC code

LThe_Filter_return_List:

add RHp, +8, RHp;
compare_x<y RHLimit, RHp, R3; // ensure there's space
branch_bit0_set R3, Lgarbage_collect; // otherwise invoke the GC
load_high Linfo_table_The_Filter_1(R0), R3;
load_address +0(R3), R3;
store R3, -8(RHp); // set the info table;
store R1, -4(RHp); // store n
load +4(RStkB), RRet; // recover the return vector
add RStkA, +8, RStkA; // allocate stack space
add RStkB, +4, RStkB;
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
subtract RHp, +8, R1; // calculate isdivs_n
store R1, -8(RStkA); // push isdivs_n
store R2, -4(RStkA); // push ns
branch Linfo_table_filter +12; // tail call filter

STG' code

isdivs_n = [n] \r [x] -> isdivs n x

RISC code

Linfo_table_The_Filter_1:

dw Lupdate_The_Filter_1; // update routine
dw Linfo_table_The_Filter_1 +12; // fast entry
dw Linfo_table_The_Filter_1; // std entry
subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +4, R1; // is there at least one?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update
load -4(RStkA), R1; // pop x
add RStkA, +4, RStkA;
compare_x<y RStkB, RStkA, R2; // check for stack overflow
branch_bit0_set R2, Lstack_overflow; // overflow error handler
load +4(RNp), R2; // fetch n
store R2, -8(RStkA); // push n
store R1, -4(RStkA); // push x
branch Linfo_table_isdivs +12; // tail call isdivs

STG' code

isdivs = [] \r [n x] -> let { mod_x_n = [n x] \u [] -> const.Int.mod x n; } in const.Int./= mod_x_n zero;
RISC code

Linfo_table_isdivs:

dw Lupdate_isdivs;  // update routine
dw Linfo_table_isdivs +12;  // fast entry
dw Linfo_table_isdivs;  // std entry

subtract RStkA, RStkABase, R1;  // calculate the number of args
subtract R1, +8, R1;  // are there at least two?
branch_x<0 R1, Lupdate_PAP;  // if not, perform an update

add RHp, +12, RHp;
compare_x<y RHLimit, RHp, R1;  // ensure there's space
branch_bit0_set R1, Lgarbage_collect;  // otherwise invoke the GC
load_high Linfo_table_isdivs_1(R0), R1;
load_address +0(R1), R1;
store R1, -12(RHp);
store R1, -8(RHp);
store R1, -4(RHp);
subtract RHp, +12, R1;
store R1, -8(RStkA);
load_high Lzero(R0), R1;
load_address +0(R1), R1;
store R1, -4(RStkA);
branch Linfo_table_Int_m_/= +12;  // tail call /=

STG' code

mod_x_n = [n x] \u [] -> const.Int.mod x n;

RISC code

Linfo_table_isdivs_1:

dw Lupdate_isdivs_1;  // update routine
dw Linfo_table_isdivs_1 +12;  // fast entry
dw Linfo_table_isdivs_1;  // std entry

subtract RStkB, +16, RStkB;  // decrease the B stack frame
compare_x<y RStkB, RStkA, R1;  // check for stack overflow
branch_bit0_set R1, Lstack_overflow;  // overflow error handler
store RStkABase, +16(RStkB);  // the A stack pointer
store RStkBBase, +12(RStkB);  // the B stack pointer
store RNp, +8(RStkB);  // the node pointer
store RRet, +4(RStkB);  // the current return vector
move RStkA, RStkABase;
move RStkB, RStkBBase;
move RUpdate, RRet;  // set the return to an update

add RStkA, +8, RStkA;  // allocate stack space
compare_x<y RStkB, RStkA, R1;  // check for stack overflow
branch_bit0_set R1, Lstack_overflow;  // overflow error handler
load +8(RNp), R1;  // fetch x
store R1, -8(RStkA);  // push x
load +4(RNp), R1;  // fetch n
store R1, -4(RStkA);  // push n
branch Linfo_table_Int_mod +12;  // tail call mod
STG' code

inc = [] \r [x] -> case x of { Int x' -> let# succ_x = plusInt# [x', 1#] in Int [succ_x]; };

RISC code

Linfo_table_inc:

dw Lupdate_Int_inc; // update routine
dw Linfo_table_inc +12; // fast entry
dw Linfo_table_inc; // stnd entry
subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +4, R1; // is there at least one?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update
load -4(RStkA), RNp; // load the node pointer of arg
subtract RStkA, +4, RStkA; // trim the A stack
subtract RStkB, +4, RStkB; // trim the B stack
store RRet, +4(RStkB); // save the return pointer
load (RNp), R1; // get the info table of arg
jump_link R1, RRet; // enter the closure
branch Lupdate_Int; // handle an update request
add R1, +1, R1; // increase the value
load +4(RStkB), RRet; // recover the return register
add RStkB, +4, RStkB; // trim the B stack
jump +4 RRet; // and return normally

Optimised version

STG' code

primes = [] \r [a] -> case a of { Int a' -> primes.wrk a'; };

RISC code

Linfo_table_Primes:

dw Lupdate_Primes; // update routine
dw Linfo_table_Primes +12; // fast entry
dw Linfo_table_Primes; // stnd entry
subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +4, R1; // is there at least one?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update
load -4(RStkA), RNp; // fetch n
load (RNp), R1;
subtract RStkA, +4, RStkA; // re-organise the stacks
subtract RStkB, +4, RStkB;
store RRet, +4(RStkB); // save the return register
jump_link R1, RRet; // evaluate n
branch Lupdate_Int; // handle the update
load +4(RStkB), RRet; // recover the return reg
store R1, +4(RStkB); // push n'
branch Lprimes.wrk +12; // tail call the worker

STG' code

primes.wrk = [] \r [a'] -> let { from_2 = [] \u [] -> iterate inc two; } in
    letstrict forced_xs = iterate the_filter from_2 in
    letstrict forced_primes = map head forced_xs in !!.wrk forced_primes a';

RISC code

Lprimes.wrk:

```plaintext
subtract RStkBBase, RStkB, R1;  // calculate the number of args
subtract R1, +4, R1;  // is there at least one?
branch_x<0 R1, Lupdate_PAP;  // if not, perform an update

add RHp, +8, RHp;
compare_x<y RHLimit, RHp, R1;  // ensure there's space
branch_bit0_set R1, Lgarbage_collect;  // otherwise invoke the GC
load_high Linfo_table_Primes_1(R0), R1;
load_address +0(R1), R1;
store R1, -8(RHp);  // set the info table;

add RStkA, +8, RStkA;
subtract RStkB, +4, RStkB;
compare_x<y RStkB, RStkA, R1;  // check for stack overflow
branch_bit0_set R1, Lstack_overflow;  // overflow error handler
load_high Lthe_filter(R0), R1;
load_address +0(R1), R1;
store R1, -8(RStkA);  // push the filter

subtract RHp, +8, R1;  // calculate from_2
store R1, -4(RStkA);  // push from_2
store RRet, +4(RStkB);  // save the return vector
load_high Lprimes_return_1(R0), RRet;  // set the return register
load_address +0(RRet), RRet;
branch Linfo_table_iterate +12;  // tail call iterate
```

RISC code

Lprimes_return_1:

```plaintext
dw Lupdate_List;
dw Lprimes_return_List_1;
dw Lupdate_Nil;
dw Lprimes_return_Nil_1;
```

RISC code

Lprimes_return_Nil_1:

```plaintext
add RHp, +8, RHp;
compare_x<y RHLimit, RHp, R1;  // ensure there's space
branch_bit0_set R1, Lgarbage_collect;  // otherwise invoke the GC
load_high Linfo_table_Nil(R0), R1;
load_address +0(R1), R1;
store R1, -8(RHp);
subtract RHp, +8, R1;  // calculate forced_xs
branch Lprimes_join_1;
```

RISC code

Lprimes_return_List_1:

```plaintext
add RHp, +12, RHp;
compare_x<y RHLimit, RHp, R3;  // ensure there's space
branch_bit0_set R3, Lgarbage_collect;  // otherwise invoke the GC
load_high Linfo_table_List(R0), R3;  // list info table
store R3, -12(RHp);
store R1, -8(RHp);  // store x
store R2, -4(RHp);  // store xs
subtract RHp, +12, R1;  // calculate forced_xs
branch Lprimes_join_1;
```
RISC code

Lprimes_join_1:

```
add RStkA, +8, RStkA;  // know there are min. 2 slots
load_high Lhead(R0), R2;
load_address +0(R2), R2;
store R2, -8(RStkA);
store R1, -4(RStkA);
load_high Lprimes_return_2(R0), RRet;  // set the return register
load_address +0(RRet), RRet;
branch Linfo_table_map +12;  // tail call
```

RISC code

Lprimes_return_2:

```
dw Lupdate_List;
dw Lprimes_return_List_2;
dw Lupdate_Nil;
dw Lprimes_return_Nil_2;
```

RISC code

Lprimes_return_Nil_2:

```
add RHp, +8, RHp;
compare_x<y RHLimit, RHp, R1;  // ensure there's space
branch_bit0_set R1, Lgarbage_collect;  // otherwise invoke the GC
load_high Linfo_table_Nil(R0), R1;  // nil info table
store R1, -8(RHp);
subtract RHp, +8, R1;  // calculate forced_primes
branch Lprimes_join_2;
```

RISC code

Lprimes_return_List_2:

```
add RHp, +12, RHp;
compare_x<y RHLimit, RHp, R3;  // ensure there's space
branch_bit0_set R3, Lgarbage_collect;  // otherwise invoke the GC
load_high Linfo_table_List(R0), R3;  // list info table
store R3, -12(RHp);
store R1, -8(RHp);
store R2, -4(RHp);
subtract RHp, +12, R1;  // calculate forced_primes
branch Lprimes_join_2;
```

RISC code

Lprimes_join_2:

```
load +4(RStkB), RRet;  // recover the return register
add RStkB, +4, RStkB;  // re-allocate stack space
add RStkA, +4, RStkA;
store R1, -4(RStkA);
branch L!!.wrk +24;  // tail call !!
```

STG' code

```
from_2 = [] \u [] -> iterate inc two;
```
RISC code

Linfo_table_Primes_1:

```assembly
    dw Lupdate_Primes_1;           // update routine
    dw Linfo_table_Primes_1 +12;    // fast entry
    dw Linfo_table_Primes_1;        // std entry

    subtract RStkB, +16, RStkB;     // decrease the B stack frame
    compare_x<y RStkB, RStkA, R1;   // check for stack overflow
    branch_bit0_set R1, Lstack_overflow; // overflow error handler
    store RStkABase, +16(RStkB);    // the A stack pointer
    store RStkBBase, +12(RStkB);    // the B stack pointer
    store RNp, +8(RStkB);           // the node pointer
    store RRet, +4(RStkB);          // the current return vector
    move RStkA, RStkABase;
    move RStkB, RStkBBase;
    move RUpdate, RRet;             // set the return to an update

    add RStkA, +8, RStkA;           // allocate stack space
    compare_x<y RStkB, RStkA, R1;   // check for stack overflow
    branch_bit0_set R1, Lstack_overflow; // overflow error handler
    load_high Linc(R0), R1;
    load_address +0(R1), R1;
    store R1, -8(RStkA);            // push inc
    load_high Ltwo(R0), R1;
    load_address +0(R1), R1;
    store R1, -4(RStkA);            // push two
    branch Linfo_table_iterate +12; // tail call iterate
```

STG' code

```
the_filter = [] \r [nss] -> case nss of { Cons n ns ->
    let { isdivs_n = [n] \r [x] -> ...; } in filter isdivs_n ns; };
```

RISC code

Linfo_table_The_Filter:

```assembly
    dw Lupdate_The_Filter;           // update routine
    dw Linfo_table_The_Filter +12;    // fast entry
    dw Linfo_table_The_Filter;        // std entry

    subtract RStkA, RStkABase, R1;   // calculate the number of args
    subtract R1, +4, R1;             // is there at least one?
    branch_x<0 R1, Lupdate_PAP;      // if not, perform an update

    load -4(RStkA), RNp;             // fetch the arg
    load (RNp), R1;
    subtract RStkA, +4, RStkA;       // re-organise the stacks
    subtract RStkB, +4, RStkB;
    store RRet, +4(RStkB);           // save the return register
    load_high LThe_Filter_return_1(R0), RRet;
    load_address +0(RRet), RRet;     // set the return register
    jump R1;
```

RISC code

LThe_Filter_return_1:

```assembly
    dw Lupdate_List;
    dw LThe_Filter_return_List;
    dw Lupdate_Nil;
    dw LThe_Filter_return_Nil_error;
```
RISC code

LThe_Filter_return_List:

```plaintext
add RHp, +8, RHp;
compare_x<y RHLimit, RHp, R3; // ensure there's space
branch_bit0_set R3, Lgarbage_collect; // otherwise invoke the GC
load_high Linfo_table_The_Filter_1(R0), R3;
load_address +0(R3), R3;
store R3, -8(RHp); // set the info table;
store R1, -4(RHp); // store n

load +4(RStkB), RRet; // recover the return vector
add RStkA, +8, RStkA; // allocate stack space
add RStkB, +4, RStkB;
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler

subtract RHp, +8, R1; // calculate isdivs_n
store R1, -8(RStkA); // push isdivs_n
store R2, -4(RStkA); // push ns
branch Linfo_table_filter +12; // tail call filter
```

STG' code:

```plaintext
isdivs_n = [n] \r [x] -> case n of { Int n' ->
    case x of { Int x' -> isdivs.wrk n' x'; }; };
```

RISC code

Linfo_table_The_Filter_1:

```plaintext
dw Lupdate_The_Filter_1; // update routine
dw Linfo_table_The_Filter_1 +12; // fast entry
dw Linfo_table_The_Filter_1; // std entry

subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +4, R1; // is there at least one?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update

load +4(RNp), RNp; // recover n
subtract RStkB, +4, RStkB;
compare_x<y RStkB, RStkA, R1; // check for stack overflow
branch_bit0_set R1, Lstack_overflow; // overflow error handler
store RRet, +4(RStkB); // save the return
load (RNp), R1;
jump_link R1, RRet; // evaluate n
branch Lupdate_Int;

load -4(RStkA), RNp; // recover x
subtract RStkA, +4, RStkA;
subtract RStkB, +4, RStkB;
store R1, +4(RStkB); // save n'
load (RNp), R1;
jump_link R1, RRet; // evaluate x
branch Lupdate_Int;

load +8(RStkB), RRet; // recover return vector
load +4(RStkB), R2;
store R2, +8(RStkB); // push n'
store R1, +4(RStkB); // push x'
branch Lisdivs.wrk +12; // tail call isdivs.wrk
```
STG' code

isdivs = [] \r [n x] ->
    case n of { Int n' -> case x of { Int x' -> isdivs.wrk n' x'; }; };

RISC code

Linfo_table_isdivs:

dw Lupdate_isdivs; // update routine
dw Linfo_table_isdivs +12; // fast entry
dw Linfo_table_isdivs; // stnd entry
subtract RStkA, RStkABase, R1; // calculate the number of args
subtract R1, +8, R1; // are there at least two?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update
load -8(RStkA), RNp; // pop n
load -4(RStkA), R1; // pop x
subtract RStkA, +4, RStkA; // re-allocate stack space
subtract RStkB, +4, RStkB;
store RRet, +4(RStkB); // save the return vector
store R1, -4(RStkA); // save x
load (RNp), R1;
jump_link R1, RRet; // evaluate n
branch Lupdate_Int;
load -4(RStkA), RNp; // recover x
subtract RStkA, +4, RStkA; // re-allocate stack space
load (RNp), R1;
jump_link R1, RRet; // evaluate x
branch Lupdate_Int;
load +8(RStkB), RRet; // recover return vector
load +4(RStkB), R2;
store R2, +8(RStkB); // push n'
store R1, +4(RStkB); // push x'
branch Lisdivs.wrk +12; // tail call isdivs.wrk

STG' code

isdivs.wrk = [] \r [n' x'] ->
    case const.Int.mod.wrk x' n' of { Int mod' ->
        case mod' of { 0# -> False []; _ -> True []; }; };

RISC code

Lisdivs.wrk:

subtract RStkBBase, RStkB, R1; // calculate the number of args
subtract R1, +8, R1; // are there at least two?
branch_x<0 R1, Lupdate_PAP; // if not, perform an update
load +4(RStkB), R1; // pop x'
load +8(RStkB), R2; // pop n'
add RStkB, +8, RStkB;
remainder R1, R2, R1; // calculate mod x' n'
branch_x=0 R1, Lisdivs.wrk +12; // if mod == 0 return false
add R0, +1, R1; // return true
jump +4 RRet;

I.4 Updating algebraic constructors

The following return vector is suitable for updating polymorphic expressions, and will catch and correctly handle all forms of algebraic constructors:

```
RISC code
Lupdate_constr:

dw Lupdate_vector_8_chained;
dw Lupdate_vector_8;
dw Lupdate_vector_7_chained;
dw Lupdate_vector_7;
dw Lupdate_vector_6_chained;
dw Lupdate_vector_6;
dw Lupdate_vector_5_chained;
dw Lupdate_vector_5;
dw Lupdate_vector_4_chained;
dw Lupdate_vector_4;
dw Lupdate_vector_3_chained;
dw Lupdate_vector_3;
dw Lupdate_vector_2_chained;
dw Lupdate_vector_2;
dw Lupdate_vector_1_chained;
dw Lupdate_vector_1;

branch Lupdate_vector_0_chained;
load +4(RStkB), RRet;  // recover the return ptr
load +8(RStkB), RNp;   // recover the node pointer
load +12(RStkB), RStkBBase;  // recover the B stack frame
load +16(RStkB), RStkABase;  // recover the A stack frame
add RStkB, +16, RStkB;  // pop the update frame
jump RRet;                // invoke the 'update' return
```

```
RISC code
Lupdate_vector_0_chained:

load +4(RStkB), RRet;  // recover the return ptr
load +12(RStkB), RStkBBase;  // recover the B stack frame
load +16(RStkB), RStkABase;  // recover the A stack frame
add RStkB, +16, RStkB;  // pop the update frame
jump RRet;                // invoke the 'update' return
```
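The routines above pop a four-word update frame from the B stack: the return pointer at offset +4, the node pointer at +8, the B stack base at +12 and the A stack base at +16. The Python sketch below models that layout. It is a simplified illustration, not the thesis's implementation: the `Machine` class, the dictionary-based memory and the function names are assumptions introduced here, and the push side is inferred from the prologues of the updatable closures in the earlier listings.

```python
from dataclasses import dataclass, field
from typing import Dict

WORD = 4  # bytes per word, as implied by the +4/+8/+12/+16 offsets


@dataclass
class Machine:
    """Hypothetical register file plus a byte-addressed memory."""
    regs: Dict[str, int] = field(default_factory=dict)
    mem: Dict[int, int] = field(default_factory=dict)


def push_update_frame(m: Machine) -> None:
    """Build an update frame; the B stack grows towards lower addresses."""
    m.regs["RStkB"] -= 4 * WORD                # subtract RStkB, +16, RStkB
    sp = m.regs["RStkB"]
    m.mem[sp + 16] = m.regs["RStkABase"]       # the A stack pointer
    m.mem[sp + 12] = m.regs["RStkBBase"]       # the B stack pointer
    m.mem[sp + 8] = m.regs["RNp"]              # the node pointer
    m.mem[sp + 4] = m.regs["RRet"]             # the current return vector
    m.regs["RStkABase"] = m.regs["RStkA"]      # start fresh stack frames
    m.regs["RStkBBase"] = m.regs["RStkB"]
    m.regs["RRet"] = m.regs["RUpdate"]         # set the return to an update


def pop_update_frame(m: Machine) -> None:
    """Restore the saved registers and discard the frame."""
    sp = m.regs["RStkB"]
    m.regs["RRet"] = m.mem[sp + 4]             # recover the return ptr
    m.regs["RNp"] = m.mem[sp + 8]              # recover the node pointer
    m.regs["RStkBBase"] = m.mem[sp + 12]       # recover the B stack frame
    m.regs["RStkABase"] = m.mem[sp + 16]       # recover the A stack frame
    m.regs["RStkB"] = sp + 4 * WORD            # pop the update frame
```

Pushing a frame and then popping it leaves `RRet`, `RNp`, both stack bases and `RStkB` with their original values, which is the invariant the return vectors rely on.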
I.5 Updating partial applications

--- RISC code ---

Lupdate_PAP:

load +4(RStkBBase), RRet; // recover the return reg
load +8(RStkBBase), RNp; // recover the node pointer
load_high Linfo_table_PAP(R0), R2; // and overwrite it with an indirection
load_address +0(R2), R2;
store R2, (RNp);
store RHp, +4(RNp); // indirect to the PAP
subtract RStkA, RStkABase, R1; // number of A args on stack
subtract RStkBBase, RStkB, R2; // number of B args on stack
add R1, R2, R3; // total no of args
add R3, +16, R3; // factor in 4 extra words
add RHp, R3, RHp; // allocate R3 words
compare_x<y RHLimit, RHp, R5; // ensure there's space
branch_bit0_set R5, Lgarbage_collect; // otherwise invoke the GC
subtract RHp, R3, R4; // calculate address of the PAP
load_high Linfo_table_PAP(R0), R5; // create the PAP
load_address +0(R5), R5;
store R5, +0(R4); // set the info table
store RNp, +4(R4); // save the node pointer
store R1, +8(R4); // save the no A args
store R2, +12(R4); // save the no B args

--- RISC code ---

Lupdate_vector_1:

load +4(RStkB), RRet; // recover the return ptr
load +8(RStkB), RNp; // recover the node pointer
load +12(RStkB), RStkBBase; // recover the B stack frame
load +16(RStkB), RStkABase; // recover the A stack frame
add RStkB, +16, RStkB; // pop the update frame
load -8(RRet), RIp; // select the correct vector
jump RIp; // invoke the ‘update’ return

--- RISC code ---

Lupdate_vector_1_chained:

load +8(RStkB), RNp; // recover the node pointer
load_high Linfo_table_Ind(R0), RRet; // overwrite the existing closure
load_address (RRet), RRet; // with an indirection
store RRet, (RNp); // store the address of the node
store RIp, +4(RNp); // pointer which will be updated
load +4(RStkB), RRet; // recover the return ptr
load +12(RStkB), RStkBBase; // recover the B stack frame
load +16(RStkB), RStkABase; // recover the A stack frame
add RStkB, +16, RStkB; // pop the update frame
load -8(RRet), RIp;
jump RIp; // invoke the ‘update’ return

--- RISC code ---

Lupdate_PAP (continued):

add R4, +16, R4; // set R4 = heap ptr
move RStkABase, R5; // set R5 = stack ptr
load +16(RStkBBase), RStkABase; // recover the A stack limit

branch_x=0 R1, +6; // skip forward
load (R5), R6; // get the first A entry
store R6, (R4); // store it in the closure
add R5, +4, R5; // increase stk ptr
add R4, +4, R4; // increase the heap ptr
subtract R1, +4, R1; // decrement the count
branch_x>0 R1, -6; // re-enter the loop

move RStkBBase, R5; // set R5 = stack ptr
load +12(RStkBBase), RStkBBase; // recover RStkBBase
add RStkB, +16, RStkB; // pop the update frame

branch_x=0 R2, +7; // possibly skip forward
load (R5), R7; // get the first B entry
store R7, (R4); // store it in the closure
store R7, +16(R5); // slide it down the stack
add R4, +4, R4; // increase heap ptr
subtract R5, +4, R5; // decrease stk ptr
subtract R2, +4, R2; // decrement the count
branch_x>0 R2, -7; // re-enter the loop

load (RNp), R1; // re-enter the function
jump R1;

The info table for a partial application is shown below:

RISC code

Linfo_table_PAP:

dw LUpdate_PAP_closure; // update routine
dw Linfo_table_PAP; // fast entry
dw Linfo_table_PAP; // stnd entry

load +8(RNp), R1; // fetch the no of A args
load +12(RNp), R2; // fetch the no of B args
move RStkA, R3; // save RStkA
move RStkB, R4; // and RStkB
add RStkA, R1, RStkA; // increase the stacks
subtract RStkB, R2, RStkB; //
compare_x<y RStkB, RStkA, R5; // check for stack overflow
branch_bit0_set R5, Lstack_overflow; // overflow error handler

add RNp, +16, R5; // set the closure ptr
branch_x=0 R1, +6;

load (R5), R6; // recover the argument
store R6, (R3); // push the argument
add R3, +4, R3; // advance the stk ptr
add R5, +4, R5; // advance the closure ptr
subtract R1, +4, R1; // reduce the no words
branch_x>0 R1, -6; // and repeat

branch_x=0 R2, +6;
load (R5), R6; // fetch the arg
store R6, (R4); // push the arg
subtract R4, +4, R4; // decrease the stack ptr
add R5, +4, R5; // advance the closure ptr
subtract R2, +4, R2;
branch_x>0 R2, -6; // and repeat
load +4(RNp), RNp; // recover the function pointer
load (RNp), R1; // fetch its info table
jump R1; // and call it
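
The entry code of the partial application closure copies the saved arguments back onto the two stacks and then enters the original function. A minimal Python sketch of that behaviour follows. It is an illustration under simplifying assumptions: the `PAP` class and `enter_pap` are hypothetical names, Python lists stand in for the A and B stacks (the opposite growth directions and byte counts of the real stacks are ignored), and the function pops its first argument from the top of the A stack.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PAP:
    """Hypothetical model of a partial-application closure.

    Holds the function (node) pointer together with the A-stack and
    B-stack arguments that were present when the closure was built.
    """
    function: Callable[[List[object], List[int]], object]
    a_args: List[object]   # saved A-stack entries, bottom to top
    b_args: List[int]      # saved B-stack entries, bottom to top


def enter_pap(pap: PAP, a_stack: List[object], b_stack: List[int]) -> object:
    """Re-push the saved arguments, then enter the underlying function."""
    a_stack.extend(pap.a_args)   # first copy loop: closure -> A stack
    b_stack.extend(pap.b_args)   # second copy loop: closure -> B stack
    return pap.function(a_stack, b_stack)


def digits3(a_stack: List[object], b_stack: List[int]) -> int:
    """Example three-argument function; the first argument is on top."""
    x = a_stack.pop()
    y = a_stack.pop()
    z = a_stack.pop()
    return x * 100 + y * 10 + z


# digits3 applied to 1 and 2 only; the third argument arrives later.
partial = PAP(function=digits3, a_args=[2, 1], b_args=[])
result = enter_pap(partial, a_stack=[3], b_stack=[])  # 123
```

The saved arguments end up above the newly supplied one, so the function sees its parameters in the same order as it would for a saturated call.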