1 jsem: parallelism semaphores for GHC¶
We propose to extend GHC to allow many GHC processes to cooperate in sharing
compute resources. Clients, such as cabal-install and stack, may
exploit this feature to execute build plans in shorter wall-time durations.
For the proposal to offer any value, the protocol herein should be implemented
by cabal-install and stack. Therefore we seek advice and agreement from
the community.
1.1 Motivation¶
1.1.1 Background¶
GHC, when invoked as ghc --make, compiles a single package component
(or unit) consisting of many modules. Build tools, such as cabal-install
and stack, compute the dependencies of a unit, compose a build plan, and
execute many ghc --make subprocesses; first to build the dependencies of the
target unit and then the unit itself.
The ghc --make command accepts a -jN option to instruct it to compile up
to N modules in parallel [1] . Each of these modules consumes one OS
thread (a GHC capability).
cabal-install and stack both also accept a -jN option [2] [3] ,
however the meaning here is different. It instructs the build tool to build up
to N units in parallel. Each of those units will be built by an invocation
of ghc --make -j1.
In the sequel (and for brevity), we will use cabal commands in the examples
and exposition.
1.1.2 Problem Description¶
While both GHC and cabal provide ways to parallelise their workloads, these two mechanisms do not compose well [4] .
Consider how to compile the following build plans using 8 capabilities:
Many small units. In this case, we can use the invocation
cabal build -j8 --ghc-options=-j1, which compiles 8 units at a time, with modules from each unit being compiled serially.A single large unit. In this case, we can use the invocation
cabal build -j1 --ghc-options=-j8, which compiles the unit in a singleghcinvocation, with GHC compiling 8 modules at a time in parallel.
In practice, however, a build plan is “wide” in some parts and “tall” in others.
Some units, e.g. vector, have many large modules, while a build plan may
have many small units depending on vector.
Consider the following build plan:
The optimal build strategy here is to assign all cores to building the bottom unit. Once that is complete, build all the middle units in parallel, each on a single core. Finally, compile the top unit, in parallel.
Crucially, in order to saturate all the cores, we need to be able to dynamically assign a number of capabilities to compile each unit. No single command of the form:
cabal build -j<n> --ghc-options=-j<m>
would be suitable.
Note that cabal always uses --ghc-options=-j1, even when compiling the
“top” unit, so a top-level application with 500 modules is, by default,
always compiled serially even though many more capabilities might be available.
1.2 Proposed Change Specification¶
We want to allow the build tool and individual invocations of GHC to share
capabilities, by communicating through a semaphore. To do this, we introduce
the -jsem <sem> flag, which specifies by name (a string) a semaphore identifier
through which GHC invocations can acquire and release capabilities.
All changes are gated behind this -jsem flag:
The
-jsemand-jflags override each other to determine which mechanism to use to control parallelism.If the semaphore identified by
<sem>cannot be opened (e.g. it does not exist, or is incompatible), GHC emits a warning (-Wsemaphore-version-mismatch) and falls back to compiling sequentially (as if-j1had been specified). This is safe because each jobclient always has its implicit token.
1.2.1 Change to user manual¶
.. ghc-flag:: -jsem ⟨sem⟩
:shortdesc: When compiling with :ghc-flag:`--make`, coordinate with
other processes through the semaphore ⟨sem⟩ to compile
modules in parallel.
:type: dynamic
:category: misc
Perform compilation in parallel when possible, coordinating with other
processes through the semaphore ⟨sem⟩ (specified as a string).
If the semaphore cannot be opened or is incompatible, GHC emits a
warning and compiles sequentially.
Use of ``-jsem`` will override use of :ghc-flag:``-j[⟨n⟩]``,
and vice-versa.
1.2.2 GHC Jobserver Protocol¶
This proposal introduces the GHC Jobserver Protocol. This protocol allows
a server to dynamically invoke many instances of a client process,
while restricting all of those instances to use no more than <n> capabilities.
This is achieved by coordination over a semaphore implementation provided
by the semaphore-compat Haskell library.
There are two kinds of participants in the GHC Jobserver protocol:
The jobserver instantiates a new semaphore with a certain number of available tokens. This semaphore is identified by a string, called its semaphore identifier.
Each time the jobserver wants to spawn a new jobclient subprocess, it must first acquire a single token from the semaphore, before spawning the subprocess. This token must be released once the subprocess terminates.
When spawning the subprocess, the jobserver should pass on the identifier of the semaphore that the jobserver created to the spawned subprocess, in order to allow the subprocess to request tokens from the semaphore.
Once work is finished, the jobserver must destroy the semaphore it created. At this point, the semaphore identifier is not valid for any operations and attempting to use it will produce an exception.
A jobclient is a subprocess spawned by the jobserver or another jobclient.
Each jobclient can interact with the semaphore by using a valid semaphore identifier provided to it by a jobserver.
Each jobclient starts with one available token (its implicit token, which was acquired by the parent which spawned it), and can request more tokens through the Jobserver Protocol by waiting on the semaphore.
Each time a jobclient wants to spawn a new jobclient subprocess, it must pass on a single token to the child jobclient. This token can either be the jobclient’s implicit token, or another token which the jobclient acquired from the semaphore. It should also pass on the identifier of the semaphore, so that the child jobclient may acquire additional tokens from the semaphore.
Each jobclient must release exactly as many tokens as it has acquired from the semaphore (this does not include the implicit token).
The semaphore identifier must begin with a prefix v<N>- where <N>
is a positive integer specifying the protocol version of the semaphore.
For instance, a semaphore created with semaphore-compat-2.x uses protocol
version 2 and could be called v2-some-semaphore12345.
Note that the protocol version is distinct from the PVP package version of
semaphore-compat. All releases within a major version series of
semaphore-compat (e.g. 2.0.0 through 2.9.9) must use the same
protocol version and be wire-compatible with each other. A protocol version
bump (and corresponding major version bump of the library) requires approval
from the GHC Steering Committee.
An identifier that lacks the v<N>- prefix is treated as protocol version 1
(for backwards compatibility with semaphore-compat-1.0.0, which predates
the versioning scheme).
Version compatibility is checked by versionsAreCompatible, which requires
an exact match (==). On platforms where the underlying mechanism hasn’t
changed (e.g. Win32 named semaphores), semaphoreVersion stays at the same
value across library releases, so the check always succeeds. If a future
version changes the mechanism on a given platform, the version number bumps
and old code correctly rejects the mismatch.
Jobclients that receive an incompatible semaphore identifier must emit
a warning and continue the build without requesting any additional tokens,
using only the implicit token they inherited from their parent process.
In GHC, this warning is -Wsemaphore-version-mismatch (diagnostic code
[GHC-56206]).
Jobservers should check the protocol version supported by the jobclient
before passing -jsem. For GHC, the protocol version is available via the
Semaphore version field in ghc --info; absence of this field indicates
protocol version 1. If the versions are incompatible, the jobserver should
fall back to invoking the jobclient without -jsem and emit a warning.
1.2.3 The semaphore-compat library¶
All communication between jobservers and jobclients happens through
the semaphore-compat Haskell library
(documentation).
Tools that wish to participate in the GHC Jobserver Protocol must use this
library (or a compatible reimplementation).
The core API is:
-- Protocol version
newtype SemaphoreProtocolVersion =
SemaphoreProtocolVersion { getSemaphoreProtocolVersion :: Int }
-- Semaphore name (protocol version + unversioned name)
data SemaphoreName = SemaphoreName
{ semaphoreProtocolVersion :: !SemaphoreProtocolVersion
, unversionedSemaphoreNameString :: !String
}
-- Server-side
createSemaphore :: String -> Int -> IO (Either SemaphoreError ServerSemaphore)
freshSemaphore :: String -> Int -> IO (Either SemaphoreError ServerSemaphore)
destroyServerSemaphore :: ServerSemaphore -> IO ()
-- Client-side
openSemaphore :: String -> IO (Either SemaphoreError ClientSemaphore)
destroyClientSemaphore :: ClientSemaphore -> IO ()
-- Token operations
waitOnSemaphore :: ClientSemaphore -> IO SemaphoreToken -- interruptible
tryWaitOnSemaphore :: ClientSemaphore -> IO (Maybe SemaphoreToken)
releaseSemaphoreToken :: SemaphoreToken -> IO ()
withSemaphoreToken :: ClientSemaphore -> (SemaphoreToken -> IO a) -> IO a
-- Version negotiation
semaphoreVersion :: SemaphoreProtocolVersion -- 2 on POSIX, 1 on Windows
versionsAreCompatible :: SemaphoreProtocolVersion -> SemaphoreProtocolVersion -> Bool
semaphoreIdentifier :: SemaphoreName -> String -- the -jsem identifier string
semaphore-compat is free to implement the semaphore mechanism in any way it
chooses, provided all releases within the same major version series are
wire-compatible. The current v2 implementation uses Unix domain sockets on
POSIX and Win32 named semaphores on Windows.
GHC’s implementation of the jobclient protocol should have the following characteristics:
GHC requests one token for each unit of work it can do concurrently. (In the current implementation, a unit of work is the compilation of a single module. This means that GHC wants to have as many tokens as it can compile modules in parallel. In the future, we could envision GHC being able to use more than one token per module, e.g. if one is able to parallelise the simplifier workload.)
Note that all work is done in parallel with waiting on the semaphore. For example, if GHC has 3 tokens (including the implicit token) but could use more, it will continue compiling with 3 concurrent jobs while it waits on the semaphore for more tokens.
GHC always returns all the tokens it has acquired from the semaphore, either upon successful completion or when an exception is raised, by using the
bracketpattern.GHC should adjust its number of capabilities, via
setNumCapabilities[8] , to the number of tokens it is using (up to the number of available CPU cores). This is because there is a hidden cost in having a GHC program (such asghcitself) run on fewer CPU cores than its capabilities: the stop-the-world cost of garbage collection becomes much more expensive. If we were to give GHC a fixed numberNof capabilities, then parallel garbage collections in that GHC would recruitNOS threads. Note that, when using-jNinstead of-jsem, GHC already callssetNumCapabilities N; so this extends the behaviour to-jsem.GHC should rate-limit the release of semaphore tokens (the precise mechanism is left unspecified here). This achieves the following:
It avoids rapidly adjusting the number of capabilities (as per B), as this may have adverse effects.
It skews the balance in favour of in-unit parallelism (one unit with many capabilities) against compiling many units in parallel (many units each being compiled using a single capability).
This allows us to prioritise completing a single large unit before moving on to other work.
The justification is that the memory used by compiling units can be released before starting another parallel process. Were GHC to release semaphore tokens too eagerly, it could end up compiling a large number of units in parallel which each have a large loaded EPS. Combined, this will use a significant amount of memory.
In practice, we suggest to implement the token acquire/release mechanism by having GHC use a local pool of tokens, in order to avoid excessive communication with the semaphore: instead of systematically releasing a token to the semaphore once a job is done, we can instead re-use this token if we have other pending jobs. This prioritises the compilation of a single unit.
1.2.4 Backwards Compatibility¶
Motivation for protocol v2.
The original version of this proposal (GHC 9.8, semaphore-compat-1.0.0)
used POSIX system semaphores, whose shared memory layout is specific to each
libc implementation. When GHC (linked against glibc) and cabal-install
(statically linked against musl) used different libc implementations, the
incompatible shared memory layouts caused crashes ([20], [21]).
This v1-with-mixed-libc case cannot be fixed retroactively — both sides
are already released. Protocol v2 (semaphore-compat-2.0.0) replaces
POSIX system semaphores with Unix domain sockets, eliminating the libc
dependency entirely.
``ghc –info`` version field.
GHC versions that ship semaphore-compat-2.x must include a
Semaphore version field in their ghc --info output, reporting
the value of semaphoreVersion (e.g. ("Semaphore version","2")
on POSIX, ("Semaphore version","1") on Windows). Older GHC
releases (9.8–9.14) lack this field; jobservers treat the absence as
protocol version 1.
Compatibility matrix (POSIX).
Compatibility on Windows.
On Windows, the Win32 named semaphore mechanism is unchanged between v1
and v2, so semaphoreVersion remains at 1 on Windows. Because the
version is 1, semaphoreIdentifier produces a bare name (no v<N>-
prefix), matching the format used by v1 clients. As a result, all
version combinations work on Windows without fallback, including
library-v2 Cabal + v1 GHC and v1 Cabal + library-v2 GHC.
Graceful degradation.
When GHC falls back to sequential compilation (ignoring -jsem), it uses
only its implicit token — acquired by the jobserver before spawning GHC.
No additional tokens are acquired from or released to the semaphore, so the
semaphore invariants are preserved and no tokens are leaked. The build
proceeds at reduced parallelism but is always correct.
1.3 Examples¶
1.3.1 Illustration of the functionality¶
Let us explain how we envision cabal handle the following build plan, with
8 capabilities.
To start,
cabalwould create a new semaphore⟨sem⟩, with 8 available tokens.
Next, we compile the
Botunit, which is a large unit, with many modules, which sits at the bottom of the dependency graph and must thus be compiled before anything else.cabalacquires one token from the semaphore and spawns oneghc --make -jsem ⟨sem⟩invocation.This invocation of
ghcnotices it has a lot of work to do (many modules to compile from theBotunit), so it requests more resources from the semaphore: at least one token per module it can compile concurrently. As no other processes are competing for semaphore tokens, and all modules can be compiled in parallel (in this example), this GHC invocation obtains the remaining 7 tokens.ghcfinishes compiling theBotunit, releasing the 7 tokens it acquired.cabalnotices theghcsubprocess has terminated, and releases the final (8th) token to the semaphore.
After that, we move to compiling the middle units.
cabalwill acquire tokens from the semaphore and spawnghc --make -jsem ⟨sem⟩invocations.Assuming
ghcrequests a single token per module it can compile concurrently, each of theseghcinvocations won’t query for more tokens, as each unit contains a single module. As a result, socabalwill manage running 8 concurrentghcprocesses, spawning new ones as previous ones terminate.
Once all the middle units are compiled,
cabalwill move on to compiling the top unit, which will proceed as in (2) with a singleghc --make -jsem ⟨sem⟩invocation compiling 8 modules in parallel.Once all
ghcprocesses have terminated, we are done, andcabaldestroys⟨sem⟩.
In this situation, cabal is the jobserver: it manages the semaphore and
spawns ghc subprocesses. The ghc subprocesses are jobclients, and they
communicate by use of the semaphore.
1.3.2 Performance results¶
Preliminary benchmarking results confirm the expected benefit of -jsem
over any possible combination cabal -jN, ghc -jM.
For example, we noted:
a 29% speedup in compiling
lensusing 8 tokens with-jsemversuscabal -j8, ghc -j1(118s vs 152s).a 42% speedup in compiling
pandocusing 8 tokens with-jsemversuscabal -j8, ghc -j1(556s vs 788s).
Note that, in both of these examples, cabal -j8, ghc -j1 outperformed all
other combination of the form cabal -jN, ghc -jM.
1.4 Effects and Interactions¶
The implementation in GHC is self-contained, and doesn’t impact the rest of the
compiler much. It does however add a new flag (which interacts with -j),
and a complete implementation requires coordination with jobservers such as
cabal and stack. However, these changes are small and non-invasive,
as it usually only involves switching over to using semaphore-compat
to control the behaviour of -j.
The GHC jobserver protocol specifies that all communication happens through the semaphore. This means that it doesn’t matter which jobserver created the semaphore. If nothing else is competing for resources on the semaphore, GHC will acquire as many tokens as it can make use of.
If GHC can’t acquire any tokens from the semaphore, compilation will proceed serially (as if running with
-j1). This is because jobclients always have their implicit token. As resource acquisition is done in parallel, we won’t block the world just because we are indefinitely waiting on the semaphore.Different jobserver invocations create distinct semaphores (with different identifiers) through which their respective child jobclients communicate. (In practice, the uniqueness is achieved by generating new uniques names that don’t clash with any existing semaphore names.) To enable multiple invocations of e.g.
cabalto cooperate over resources, these invocations should themselves be spawned by an overarching jobserver. In that case, some jobserver would create a unique semaphore, spawncabaljobclient processes (to which the semaphore identifier is passed), and thecabalprocesses would in turn spawn GHC jobclient process (to which the semaphore identifier is passed).
1.5 Alternatives¶
1.5.1 Multiple home units¶
Support for multiple home units
(not yet fully implemented in cabal) would provide an alternative way
to saturate the number of available capabilities.
This is because compilation with multiple home units is achieved using a single
GHC invocation, which thus doesn’t have to worry about contention with
other processes.
In general, it would be preferred to use multiple home units when possible, as
it is expected to be more performant than -jsem:
no scheduling between different GHC invocations is necessary;
modules are loaded directly into the home unit graph, which avoids having to load the same interface files in different GHC invocations,
it doesn’t require the entire unit to finish compiling before compilation can start on another unit that depends on it: we can begin as soon as all the modules we need have been compiled.
However, it’s not always possible to compile everything with a single GHC
invocation, e.g. if the build plan involves non-Haskell dependencies somewhere
in the middle. In comparison, the -jsem functionality can fit into any build
system that one might be using, so it supports a wider range of use cases.
The implementation of jsem is also significantly simpler, as the changes required
to jobservers (such as cabal and stack) are minimal.
1.5.2 One-shot mode¶
The other option for build systems is to use GHC in one-shot mode. In that case,
the build system can control the scheduling of all the jobs. This is what
hadrian and rules_haskell do when building projects (cabal
currently does not).
However, modifying cabal to support this workflow would be a significant
undertaking. Morever, --make mode is in general more performant than one-shot
mode, as one retains more information in memory, as opposed to needing to re-obtain
the information by reading interface files.
1.5.3 Other jobserver protocols¶
GNU make supports a Jobserver protocol [9] [5] which is the same as the GHC Jobserver protocol described above, except that:
it uses POSIX pipes to exchange token’s between processes.
participants in the protocol learn about it through environment variables and the state of file descriptors on process entry.
For example, rust’s cargo implement the GNU make Jobserver protocol [13] .
A prototype implementation of the GNU make Jobserver protocol for GHC was also
made by Ellie Hermaszewska [15] .
However, we have decided to depart from this design, for the following reasons:
Other communities have considered the Make jobserver, and decided that some aspects of the protocol are unsuitable (OCaml [10] [11] , Nix [12] , ninja [16] ). To summarise:
The protocol relies on spawned processes cooperating and returning tokens on termination. If this doesn’t happen, semaphore tokens can be lost entirely. In comparison, with the approach described in this proposal, implicit tokens are controlled by the server; this means that, at worst, the build will continue with reduced parallelism, and mitigation strategies are available.
The protocol uses anonymous file descriptors to communicate between processes. This seems to be fragile, with many edge-cases. In 2019, changes to the Linux kernel broke the Make jobserver protocol, due to subtle changes in the semantics of pipes [18].
We expect
cabal-installandstackto be the only users of this feature in the near term. We think the proposed protocol is adequate for this use case.-jsemdoesn’t provide a general solution either for mixed-language code bases which require coordination with other build tools but then there isn’t a widely adopted solution which does.We can extend GHC to use the GNU make Jobserver protocol in the future, if there are users for it.
Some operating systems also have OS-specific methods of mediating parallelism between processes. For example, MacOS’s “Grand Central Dispatch” mechanism allows applications to queue up tasks to be run in parallel, and handles the scheduling. In order to implement parallelism at this level, it seems necessary to modify* the RTS and GHC’s own thread scheduling algorithms. Not only this, the implementation would be specific to a platform.
1.6 Unresolved Questions¶
What should the name of the command-line flag be? Perhaps
-juse-jobserver?Should we also offer configuration of this feature via environment variables?
1.7 Implementation Plan¶
Douglas Wilson, Sam Derbyshire and Matthew Pickering have implemented a prototype at [14] .
Matthew Pickering has implemented the feature in cabal-install in [19] .
Ongoing work from Well-Typed LLP is funded by Hasura.