1 jsem: parallelism semaphores for GHC¶
We propose to extend GHC to allow many GHC processes to cooperate in sharing
compute resources. Clients, such as cabal-install
and stack
, may
exploit this feature to execute build plans in shorter wall-time durations.
For the proposal to offer any value, the protocol herein should be implemented
by cabal-install
and stack
. Therefore we seek advice and agreement from
the community.
1.1 Motivation¶
1.1.1 Background¶
GHC, when invoked as ghc --make
, compiles a single package component
(or unit) consisting of many modules. Build tools, such as cabal-install
and stack
, compute the dependencies of a unit, compose a build plan, and
execute many ghc --make
subprocesses; first to build the dependencies of the
target unit and then the unit itself.
The ghc --make
command accepts a -jN
option to instruct it to compile up
to N
modules in parallel [1] . Each of these modules consumes one OS
thread (a GHC capability).
cabal-install
and stack
both also accept a -jN
option [2] [3] ,
however the meaning here is different. It instructs the build tool to build up
to N
units in parallel. Each of those units will be built by an invocation
of ghc --make -j1
.
In the sequel (and for brevity), we will use cabal
commands in the examples
and exposition.
1.1.2 Problem Description¶
While both GHC and cabal provide ways to parallelise their workloads, these two mechanisms do not compose well [4] .
Consider how to compile the following build plans using 8 capabilities:
Many small units. In this case, we can use the invocation
cabal build -j8 --ghc-options=-j1
, which compiles 8 units at a time, with modules from each unit being compiled serially.A single large unit. In this case, we can use the invocation
cabal build -j1 --ghc-options=-j8
, which compiles the unit in a singleghc
invocation, with GHC compiling 8 modules at a time in parallel.
In practice, however, a build plan is “wide” in some parts and “tall” in others.
Some units, e.g. vector
, have many large modules, while a build plan may
have many small units depending on vector
.
Consider the following build plan:
The optimal build strategy here is to assign all cores to building the bottom unit. Once that is complete, build all the middle units in parallel, each on a single core. Finally, compile the top unit, in parallel.
Crucially, in order to saturate all the cores, we need to be able to dynamically assign a number of capabilities to compile each unit. No single command of the form:
cabal build -j<n> --ghc-options=-j<m>
would be suitable.
Note that cabal always uses --ghc-options=-j1
, even when compiling the
“top” unit, so a top-level application with 500 modules is, by default,
always compiled serially even though many more capabilities might be available.
1.2 Proposed Change Specification¶
We want to allow the build tool and individual invocations of GHC to share
capabilities, by communicating through a semaphore. To do this, we introduce
the -jsem <sem>
flag, which specifies by name (a string) a system semaphore
through which GHC invocations can acquire and release capabilities.
All changes are gated behind this -jsem
flag:
The
-jsem
and-j
flags override eachother to determine which mechanism to use to control paralellism.If no semaphore named
<sem>
exists, GHC reports an error.
1.2.1 Change to user manual¶
.. ghc-flag:: -jsem ⟨sem⟩
:shortdesc: When compiling with :ghc-flag:`--make`, coordinate with
other processes through the semaphore ⟨sem⟩ to compile
modules in parallel.
:type: dynamic
:category: misc
Perform compilation in parallel when possible, coordinating with other
processes through the semaphore ⟨sem⟩ (specified as a string).
Error if the semaphore doesn't exist.
Use of ``-jsem`` will override use of :ghc-flag:``-j[⟨n⟩]``,
and vice-versa.
1.2.2 GHC Jobserver Protocol¶
This proposal introduces the GHC Jobserver Protocol. This protocol allows a server to dynamically invoke many instances of a client process, while restricting all of those instances to use no more than <n> capabilities. This is achieved by coordination over a system semaphore (either a POSIX semaphore [6] in the case of Linux and Darwin, or a Win32 semaphore [7] in the case of Windows platforms).
There are two kinds of participants in the GHC Jobserver protocol:
The jobserver creates a new system semaphore with a certain number of available tokens.
Each time the jobserver wants to spawn a new jobclient subprocess, it must first acquire a single token from the semaphore, before spawning the subprocess. This token must be released once the subprocess terminates.
When spawning the subprocess, the jobserver should pass on the name of the semaphore tha the jobserver created to the spawned subprocess, in order to allow the subprocess to request tokens from the semaphore.
Once work is finished, the jobserver must destroy the semaphore it created.
A jobclient is a subprocess spawned by the jobserver or another jobclient.
Each jobclient starts with one available token (its implicit token, which was acquired by the parent which spawned it), and can request more tokens through the Jobserver Protocol by waiting on the semaphore.
Each time a jobclient wants to spawn a new jobclient subprocess, it must pass on a single token to the child jobclient. This token can either be the jobclient’s implicit token, or another token which the jobclient acquired from the semaphore. It should also pass on the name of the semaphore, so that the child jobclient may acquire additional tokens from the semaphore.
Each jobclient must release exactly as many tokens as it has acquired from the semaphore (this does not include the implicit token).
All communication between the jobserver and the jobclient happens through the semaphore. This ensures modularity of the protocol: jobclients don’t need to know anything about the jobserver that has spawned them.
1.2.3 Implementation of the protocol¶
The implementation relies on an interface for system semaphores. For this, we
propose to using the functionality provided by the unix
and Win32
libraries. In practice, our implementation ships with a compatibility layer
which provides an abstraction over those two libraries. This handles creation of
fresh semaphores, acquisition, and release of tokens, including a mechanism for
interrupting a wait on a semaphore when a token is no longer needed.
The implementation consists of two separate parts: jobservers and jobclients.
We want GHC to act as a jobclient and cabal
/stack
to be jobservers.
The implementation of the jobserver protocol is very straightforward, as it
mostly consists of switching to using a system semaphore to control the tokens.
This means that the implementation in cabal
is non-invasive and easy to
maintain.
The implementation of the jobclient protocol is more complex, while still
remaining very non-invasive, as it can be implemented in a single standalone
module which contains all the logic for interacting with the semaphore. The only
other changes required consist of threading through the -jsem
information
through the driver.
GHC’s implementation of the jobclient protocol should have the following characteristics:
GHC requests one token for each unit of work it can do concurrently. (In the current implementation, a unit of work is the compilation of a single module. This means that GHC wants to have as many tokens as it can compile modules in parallel. In the future, we could envision GHC being able to use more than one token per module, e.g. if one is able to parallelise the simplifier workload.)
Note that all work is done in parallel with waiting on the semaphore. For example, if GHC has 3 tokens (including the implicit token) but could use more, it will continue compiling with 3 concurrent jobs while it waits on the semaphore for more tokens.
GHC always returns all the tokens it has acquired from the semaphore, either upon successful completion or when an exception is raised, by using the
bracket
pattern.GHC should adjust its number of capabilities, via
setNumCapabilities
[8] , to the number of tokens it is using (up to the number of available CPU cores). This is because there is a hidden cost in having a GHC program (such asghc
itself) run on fewer CPU cores than its capabilities: the stop-the-world cost of garbage collection becomes much more expensive. If we were to give GHC a fixed numberN
of capabilities, then parallel garbage collections in that GHC would recruitN
OS threads. Note that, when using-jN
instead of-jsem
, GHC already callssetNumCapabilities N
; so this extends the behaviour to-jsem
.GHC should rate-limit the release of semaphore tokens (the precise mechanism is left unspecified here). This achieves the following:
It avoids rapidly adjusting the number of capabilities (as per B), as this may have adverse effects.
It skews the balance in favour of in-unit parallelism (one unit with many capabilities) against compiling many units in parallelm (many units each being compiled using a single capability).
This allows us to prioritise completing a single large unit before moving on to other work.
The justification is that the memory used by compiling units can be released before starting another parallel process. Were GHC to release semaphore tokens too eagerly, it could end up compiling a large number of units in parallel which each have a large loaded EPS. Combined, this will use a significant amount of memory.
In practice, we suggest to implement the token acquire/release mechanism by having GHC use a local pool of tokens, in order to avoid excessive communication with the semaphore: instead of systematically releasing a token to the semaphore once a job is done, we can instead re-use this token if we have other pending jobs. This prioritises the compilation of a single unit.
1.3 Examples¶
1.3.1 Illustration of the functionality¶
Let us explain how we envision cabal
handle the following build plan, with
8 capabilities.
To start,
cabal
would create a new semaphore⟨sem⟩
, with 8 available tokens.
Next, we compile the
Bot
unit, which is a large unit, with many modules, which sits at the bottom of the dependency graph and must thus be compiled before anything else.cabal
acquires one token from the semaphore and spawns oneghc --make -jsem ⟨sem⟩
invocation.This invocation of
ghc
notices it has a lot of work to do (many modules to compile from theBot
unit), so it requests more resources from the semaphore: at least one token per module it can compile concurrently. As no other processes are competing for semaphore tokens, and all modules can be compiled in parallel (in this example), this GHC invocation obtains the remaining 7 tokens.ghc
finishes compiling theBot
unit, releasing the 7 tokens it acquired.cabal
notices theghc
subprocess has terminated, and releases the final (8th) token to the semaphore.
After that, we move to compiling the middle units.
cabal
will acquire tokens from the semaphore and spawnghc --make -jsem ⟨sem⟩
invocations.Assuming
ghc
requests a single token per module it can compile concurrently, each of theseghc
invocations won’t query for more tokens, as each unit contains a single module. As a result, socabal
will manage running 8 concurrentghc
processes, spawning new ones as previous ones terminate.
Once all the middle units are compiled,
cabal
will move on to compiling the top unit, which will proceed as in (2) with a singleghc --make -jsem ⟨sem⟩
invocation compiling 8 modules in parallel.Once all
ghc
processes have terminated, we are done, andcabal
destroys⟨sem⟩
.
In this situation, cabal
is the jobserver: it manages the semaphore and
spawns ghc
subprocesses. The ghc
subprocesses are jobclients, and they
communicate by use of the semaphore.
1.3.2 Performance results¶
Preliminary benchmarking results confirm the expected benefit of -jsem
over any possible combination cabal -jN, ghc -jM
.
For example, we noted:
a 29% speedup in compiling
lens
using 8 tokens with-jsem
versuscabal -j8, ghc -j1
(118s vs 152s).a 42% speedup in compiling
pandoc
using 8 tokens with-jsem
versuscabal -j8, ghc -j1
(556s vs 788s).
Note that, in both of these examples, cabal -j8, ghc -j1
outperformed all
other combination of the form cabal -jN, ghc -jM
.
1.4 Effects and Interactions¶
The implementation in GHC is self-contained, and doesn’t impact the rest of the
compiler much. It does however add a new flag (which interacts with -j
),
and a complete implementation requires coordination with jobservers such as
cabal
and stack
. However, these changes are small and non-invasive,
as it usually only involves switching over to using a system semaphore
to control the behaviour of -j
.
The GHC jobserver protocol specifies that all communication happens through the semaphore. This means that it doesn’t matter which jobserver created the semaphore. If nothing else is competing for resources on the semaphore, GHC will acquire as many tokens as it can make use of.
If GHC can’t acquire any tokens from the semaphore, compilation will proceed serially (as if running with
-j1
). This is because jobclients always have their implicit token. As resource acquisition is done in parallel, we won’t block the world just because we are indefinitely waiting on the semaphore.Different jobserver invocations create distinct semaphores (with different names) through which their respective child jobclients communicate. (In practice, the uniqueness is achieved by generating new uniques names that don’t clash with any existing semaphore names.) To enable multiple invocations of e.g.
cabal
to cooperate over resources, these invocations should themselves be spawned by an overarching jobserver. In that case, some jobserver would create a unique semaphore, spawncabal
jobclient processes (to which the semaphore name is passed), and thecabal
processes would in turn spawn GHC jobclient process (to which the semaphore name is passed).
1.5 Alternatives¶
1.5.1 Multiple home units¶
Support for multiple home units
(not yet fully implemented in cabal
) would provide an alternative way
to saturate the number of available capabilities.
This is because compilation with multiple home units is achieved using a single
GHC invocation, which thus doesn’t have to worry about contention with
other processes.
In general, it would be preferred to use multiple home units when possible, as
it is expected to be more performant than -jsem
:
no scheduling between different GHC invocations is necessary;
modules are loaded directly into the home unit graph, which avoids having to load the same interface files in different GHC invocations,
it doesn’t require the entire unit to finish compiling before compilation can start on another unit that depends on it: we can begin as soon as all the modules we need have been compiled.
However, it’s not always possible to compile everything with a single GHC
invocation, e.g. if the build plan involves non-Haskell dependencies somewhere
in the middle. In comparison, the -jsem
functionality can fit into any build
system that one might be using, so it supports a wider range of use cases.
The implementation of jsem is also significantly simpler, as the changes required
to jobservers (such as cabal
and stack
) are minimal.
1.5.2 One-shot mode¶
The other option for build systems is to use GHC in one-shot mode. In that case,
the build system can control the scheduling of all the jobs. This is what
hadrian
and rules_haskell
do when building projects (cabal
currently does not).
However, modifying cabal
to support this workflow would be a significant
undertaking. Morever, --make
mode is in general more performant than one-shot
mode, as one retains more information in memory, as opposed to needing to re-obtain
the information by reading interface files.
1.5.3 Other jobserver protocols¶
GNU make supports a Jobserver protocol [9] [5] which is the same as the GHC Jobserver protocol described above, except that:
it uses POSIX pipes to exchange token’s between processes.
participants in the protocol learn about it through environment variables and the state of file descriptors on process entry.
For example, rust’s cargo
implement the GNU make Jobserver protocol [13] .
A prototype implementation of the GNU make Jobserver protocol for GHC was also
made by Ellie Hermaszewska [15] .
However, we have decided to depart from this design, for the following reasons:
Other communities have considered the Make jobserver, and decided that some aspects of the protocol are unsuitable (OCaml [10] [11] , Nix [12] , ninja [16] ). To summarise:
The protocol relies on spawned processes cooperating and returning tokens on termination. If this doesn’t happen, semaphore tokens can be lost entirely. In comparison, with the approach described in this proposal, implicit tokens are controlled by the server; this means that, at worst, the build will continue with reduced parallelism, and mitigation strategies are available.
The protocol uses anonymous file descriptors to communicate between processes. This seems to be fragile, with many edge-cases. In 2019, changes to the Linux kernel broke the Make jobserver protocol, due to subtle changes in the semantics of pipes [18].
We expect
cabal-install
andstack
to be the only users of this feature in the near term. We think the proposed protocol is adequate for this use case.-jsem
doesn’t provide a general solution either for mixed-language code bases which require coordination with other build tools but then there isn’t a widely adopted solution which does.We can extend GHC to use the GNU make Jobserver protocol in the future, if there are users for it.
Some operating systems also have OS-specific methods of mediating parallelism between processes. For example, MacOS’s “Grand Central Dispatch” mechanism allows applications to queue up tasks to be run in parallel, and handles the scheduling. In order to implement parallelism at this level, it seems necessary to modify* the RTS and GHC’s own thread scheduling algorithms. Not only this, the implementation would be specific to a platform.
1.6 Unresolved Questions¶
What should the name of the command-line flag be? Perhaps
-juse-jobserver
?Should we also offer configuration of this feature via environment variables?
1.7 Implementation Plan¶
Douglas Wilson, Sam Derbyshire and Matthew Pickering have implemented a prototype at [14] .
Matthew Pickering has implemented the feature in cabal-install
in [19] .
Ongoing work from Well-Typed LLP is funded by Hasura.