1 jsem: parallelism semaphores for GHC

We propose to extend GHC to allow many GHC processes to cooperate in sharing compute resources. Clients, such as cabal-install and stack, may exploit this feature to execute build plans in shorter wall-time durations.

For the proposal to offer any value, the protocol herein should be implemented by cabal-install and stack. Therefore we seek advice and agreement from the community.

1.1 Motivation

1.1.1 Background

GHC, when invoked as ghc --make, compiles a single package component (or unit) consisting of many modules. Build tools, such as cabal-install and stack, compute the dependencies of a unit, compose a build plan, and execute many ghc --make subprocesses; first to build the dependencies of the target unit and then the unit itself.

The ghc --make command accepts a -jN option to instruct it to compile up to N modules in parallel [1] . Each of these modules consumes one OS thread (a GHC capability).

cabal-install and stack both also accept a -jN option [2] [3] , however the meaning here is different. It instructs the build tool to build up to N units in parallel. Each of those units will be built by an invocation of ghc --make -j1.

In the sequel (and for brevity), we will use cabal commands in the examples and exposition.

1.1.2 Problem Description

While both GHC and cabal provide ways to parallelise their workloads, these two mechanisms do not compose well [4] .

Consider how to compile the following build plans using 8 capabilities:

  1. Many small units. In this case, we can use the invocation cabal build -j8 --ghc-options=-j1, which compiles 8 units at a time, with modules from each unit being compiled serially.

  2. A single large unit. In this case, we can use the invocation cabal build -j1 --ghc-options=-j8, which compiles the unit in a single ghc invocation, with GHC compiling 8 modules at a time in parallel.

In practice, however, a build plan is “wide” in some parts and “tall” in others. Some units, e.g. vector, have many large modules, while a build plan may have many small units depending on vector.

Consider the following build plan:

Example dependency graph: big units at the top and bottom, and many small units in the middle.

The optimal build strategy here is to assign all cores to building the bottom unit. Once that is complete, build all the middle units in parallel, each on a single core. Finally, compile the top unit, in parallel.

Crucially, in order to saturate all the cores, we need to be able to dynamically assign a number of capabilities to compile each unit. No single command of the form:

cabal build -j<n> --ghc-options=-j<m>

would be suitable.

Note that cabal always uses --ghc-options=-j1, even when compiling the “top” unit, so a top-level application with 500 modules is, by default, always compiled serially even though many more capabilities might be available.

1.2 Proposed Change Specification

We want to allow the build tool and individual invocations of GHC to share capabilities, by communicating through a semaphore. To do this, we introduce the -jsem <sem> flag, which specifies by name (a string) a semaphore identifier through which GHC invocations can acquire and release capabilities.

All changes are gated behind this -jsem flag:

  • The -jsem and -j flags override each other to determine which mechanism to use to control parallelism.

  • If the semaphore identified by <sem> cannot be opened (e.g. it does not exist, or is incompatible), GHC emits a warning (-Wsemaphore-version-mismatch) and falls back to compiling sequentially (as if -j1 had been specified). This is safe because each jobclient always has its implicit token.

1.2.1 Change to user manual

.. ghc-flag:: -jsem ⟨sem⟩
    :shortdesc: When compiling with :ghc-flag:`--make`, coordinate with
                other processes through the semaphore ⟨sem⟩ to compile
                modules in parallel.
    :type: dynamic
    :category: misc

    Perform compilation in parallel when possible, coordinating with other
    processes through the semaphore ⟨sem⟩ (specified as a string).
    If the semaphore cannot be opened or is incompatible, GHC emits a
    warning and compiles sequentially.

    Use of ``-jsem`` will override use of :ghc-flag:``-j[⟨n⟩]``,
    and vice-versa.

1.2.2 GHC Jobserver Protocol

This proposal introduces the GHC Jobserver Protocol. This protocol allows a server to dynamically invoke many instances of a client process, while restricting all of those instances to use no more than <n> capabilities. This is achieved by coordination over a semaphore implementation provided by the semaphore-compat Haskell library.

There are two kinds of participants in the GHC Jobserver protocol:

  • The jobserver instantiates a new semaphore with a certain number of available tokens. This semaphore is identified by a string, called its semaphore identifier.

    Each time the jobserver wants to spawn a new jobclient subprocess, it must first acquire a single token from the semaphore, before spawning the subprocess. This token must be released once the subprocess terminates.

    When spawning the subprocess, the jobserver should pass on the identifier of the semaphore that the jobserver created to the spawned subprocess, in order to allow the subprocess to request tokens from the semaphore.

    Once work is finished, the jobserver must destroy the semaphore it created. At this point, the semaphore identifier is not valid for any operations and attempting to use it will produce an exception.

  • A jobclient is a subprocess spawned by the jobserver or another jobclient.

    Each jobclient can interact with the semaphore by using a valid semaphore identifier provided to it by a jobserver.

    Each jobclient starts with one available token (its implicit token, which was acquired by the parent which spawned it), and can request more tokens through the Jobserver Protocol by waiting on the semaphore.

    Each time a jobclient wants to spawn a new jobclient subprocess, it must pass on a single token to the child jobclient. This token can either be the jobclient’s implicit token, or another token which the jobclient acquired from the semaphore. It should also pass on the identifier of the semaphore, so that the child jobclient may acquire additional tokens from the semaphore.

    Each jobclient must release exactly as many tokens as it has acquired from the semaphore (this does not include the implicit token).

The semaphore identifier must begin with a prefix v<N>- where <N> is a positive integer specifying the protocol version of the semaphore. For instance, a semaphore created with semaphore-compat-2.x uses protocol version 2 and could be called v2-some-semaphore12345.

Note that the protocol version is distinct from the PVP package version of semaphore-compat. All releases within a major version series of semaphore-compat (e.g. 2.0.0 through 2.9.9) must use the same protocol version and be wire-compatible with each other. A protocol version bump (and corresponding major version bump of the library) requires approval from the GHC Steering Committee.

An identifier that lacks the v<N>- prefix is treated as protocol version 1 (for backwards compatibility with semaphore-compat-1.0.0, which predates the versioning scheme).

Version compatibility is checked by versionsAreCompatible, which requires an exact match (==). On platforms where the underlying mechanism hasn’t changed (e.g. Win32 named semaphores), semaphoreVersion stays at the same value across library releases, so the check always succeeds. If a future version changes the mechanism on a given platform, the version number bumps and old code correctly rejects the mismatch.

Jobclients that receive an incompatible semaphore identifier must emit a warning and continue the build without requesting any additional tokens, using only the implicit token they inherited from their parent process. In GHC, this warning is -Wsemaphore-version-mismatch (diagnostic code [GHC-56206]).

Jobservers should check the protocol version supported by the jobclient before passing -jsem. For GHC, the protocol version is available via the Semaphore version field in ghc --info; absence of this field indicates protocol version 1. If the versions are incompatible, the jobserver should fall back to invoking the jobclient without -jsem and emit a warning.

1.2.3 The semaphore-compat library

All communication between jobservers and jobclients happens through the semaphore-compat Haskell library (documentation). Tools that wish to participate in the GHC Jobserver Protocol must use this library (or a compatible reimplementation).

The core API is:

-- Protocol version
newtype SemaphoreProtocolVersion =
  SemaphoreProtocolVersion { getSemaphoreProtocolVersion :: Int }

-- Semaphore name (protocol version + unversioned name)
data SemaphoreName = SemaphoreName
  { semaphoreProtocolVersion       :: !SemaphoreProtocolVersion
  , unversionedSemaphoreNameString :: !String
  }

-- Server-side
createSemaphore        :: String -> Int -> IO (Either SemaphoreError ServerSemaphore)
freshSemaphore         :: String -> Int -> IO (Either SemaphoreError ServerSemaphore)
destroyServerSemaphore :: ServerSemaphore -> IO ()

-- Client-side
openSemaphore          :: String -> IO (Either SemaphoreError ClientSemaphore)
destroyClientSemaphore :: ClientSemaphore -> IO ()

-- Token operations
waitOnSemaphore       :: ClientSemaphore -> IO SemaphoreToken  -- interruptible
tryWaitOnSemaphore    :: ClientSemaphore -> IO (Maybe SemaphoreToken)
releaseSemaphoreToken :: SemaphoreToken -> IO ()
withSemaphoreToken    :: ClientSemaphore -> (SemaphoreToken -> IO a) -> IO a

-- Version negotiation
semaphoreVersion      :: SemaphoreProtocolVersion  -- 2 on POSIX, 1 on Windows
versionsAreCompatible :: SemaphoreProtocolVersion -> SemaphoreProtocolVersion -> Bool
semaphoreIdentifier   :: SemaphoreName -> String   -- the -jsem identifier string

semaphore-compat is free to implement the semaphore mechanism in any way it chooses, provided all releases within the same major version series are wire-compatible. The current v2 implementation uses Unix domain sockets on POSIX and Win32 named semaphores on Windows.

GHC’s implementation of the jobclient protocol should have the following characteristics:

  1. GHC requests one token for each unit of work it can do concurrently. (In the current implementation, a unit of work is the compilation of a single module. This means that GHC wants to have as many tokens as it can compile modules in parallel. In the future, we could envision GHC being able to use more than one token per module, e.g. if one is able to parallelise the simplifier workload.)

    Note that all work is done in parallel with waiting on the semaphore. For example, if GHC has 3 tokens (including the implicit token) but could use more, it will continue compiling with 3 concurrent jobs while it waits on the semaphore for more tokens.

  2. GHC always returns all the tokens it has acquired from the semaphore, either upon successful completion or when an exception is raised, by using the bracket pattern.

  3. GHC should adjust its number of capabilities, via setNumCapabilities [8] , to the number of tokens it is using (up to the number of available CPU cores). This is because there is a hidden cost in having a GHC program (such as ghc itself) run on fewer CPU cores than its capabilities: the stop-the-world cost of garbage collection becomes much more expensive. If we were to give GHC a fixed number N of capabilities, then parallel garbage collections in that GHC would recruit N OS threads. Note that, when using -jN instead of -jsem, GHC already calls setNumCapabilities N; so this extends the behaviour to -jsem.

  4. GHC should rate-limit the release of semaphore tokens (the precise mechanism is left unspecified here). This achieves the following:

    1. It avoids rapidly adjusting the number of capabilities (as per B), as this may have adverse effects.

    2. It skews the balance in favour of in-unit parallelism (one unit with many capabilities) against compiling many units in parallel (many units each being compiled using a single capability).

      This allows us to prioritise completing a single large unit before moving on to other work.

      The justification is that the memory used by compiling units can be released before starting another parallel process. Were GHC to release semaphore tokens too eagerly, it could end up compiling a large number of units in parallel which each have a large loaded EPS. Combined, this will use a significant amount of memory.

In practice, we suggest to implement the token acquire/release mechanism by having GHC use a local pool of tokens, in order to avoid excessive communication with the semaphore: instead of systematically releasing a token to the semaphore once a job is done, we can instead re-use this token if we have other pending jobs. This prioritises the compilation of a single unit.

1.2.4 Backwards Compatibility

Motivation for protocol v2. The original version of this proposal (GHC 9.8, semaphore-compat-1.0.0) used POSIX system semaphores, whose shared memory layout is specific to each libc implementation. When GHC (linked against glibc) and cabal-install (statically linked against musl) used different libc implementations, the incompatible shared memory layouts caused crashes ([20], [21]). This v1-with-mixed-libc case cannot be fixed retroactively — both sides are already released. Protocol v2 (semaphore-compat-2.0.0) replaces POSIX system semaphores with Unix domain sockets, eliminating the libc dependency entirely.

``ghc –info`` version field. GHC versions that ship semaphore-compat-2.x must include a Semaphore version field in their ghc --info output, reporting the value of semaphoreVersion (e.g. ("Semaphore version","2") on POSIX, ("Semaphore version","1") on Windows). Older GHC releases (9.8–9.14) lack this field; jobservers treat the absence as protocol version 1.

Compatibility matrix (POSIX).

Compatibility on Windows. On Windows, the Win32 named semaphore mechanism is unchanged between v1 and v2, so semaphoreVersion remains at 1 on Windows. Because the version is 1, semaphoreIdentifier produces a bare name (no v<N>- prefix), matching the format used by v1 clients. As a result, all version combinations work on Windows without fallback, including library-v2 Cabal + v1 GHC and v1 Cabal + library-v2 GHC.

Graceful degradation. When GHC falls back to sequential compilation (ignoring -jsem), it uses only its implicit token — acquired by the jobserver before spawning GHC. No additional tokens are acquired from or released to the semaphore, so the semaphore invariants are preserved and no tokens are leaked. The build proceeds at reduced parallelism but is always correct.

1.3 Examples

1.3.1 Illustration of the functionality

Let us explain how we envision cabal handle the following build plan, with 8 capabilities.

Same dependency graph as before: a big unit at the top and bottom, and many single-module units in the middle.
  1. To start, cabal would create a new semaphore ⟨sem⟩, with 8 available tokens.

  1. Next, we compile the Bot unit, which is a large unit, with many modules, which sits at the bottom of the dependency graph and must thus be compiled before anything else.

    1. cabal acquires one token from the semaphore and spawns one ghc --make -jsem ⟨sem⟩ invocation.

    2. This invocation of ghc notices it has a lot of work to do (many modules to compile from the Bot unit), so it requests more resources from the semaphore: at least one token per module it can compile concurrently. As no other processes are competing for semaphore tokens, and all modules can be compiled in parallel (in this example), this GHC invocation obtains the remaining 7 tokens.

    3. ghc finishes compiling the Bot unit, releasing the 7 tokens it acquired.

    4. cabal notices the ghc subprocess has terminated, and releases the final (8th) token to the semaphore.

  2. After that, we move to compiling the middle units.

    1. cabal will acquire tokens from the semaphore and spawn ghc --make -jsem ⟨sem⟩ invocations.

    2. Assuming ghc requests a single token per module it can compile concurrently, each of these ghc invocations won’t query for more tokens, as each unit contains a single module. As a result, so cabal will manage running 8 concurrent ghc processes, spawning new ones as previous ones terminate.

  3. Once all the middle units are compiled, cabal will move on to compiling the top unit, which will proceed as in (2) with a single ghc --make -jsem ⟨sem⟩ invocation compiling 8 modules in parallel.

  4. Once all ghc processes have terminated, we are done, and cabal destroys ⟨sem⟩.

In this situation, cabal is the jobserver: it manages the semaphore and spawns ghc subprocesses. The ghc subprocesses are jobclients, and they communicate by use of the semaphore.

1.3.2 Performance results

Preliminary benchmarking results confirm the expected benefit of -jsem over any possible combination cabal -jN, ghc -jM.

For example, we noted:

  • a 29% speedup in compiling lens using 8 tokens with -jsem versus cabal -j8, ghc -j1 (118s vs 152s).

  • a 42% speedup in compiling pandoc using 8 tokens with -jsem versus cabal -j8, ghc -j1 (556s vs 788s).

Note that, in both of these examples, cabal -j8, ghc -j1 outperformed all other combination of the form cabal -jN, ghc -jM.

1.4 Effects and Interactions

The implementation in GHC is self-contained, and doesn’t impact the rest of the compiler much. It does however add a new flag (which interacts with -j), and a complete implementation requires coordination with jobservers such as cabal and stack. However, these changes are small and non-invasive, as it usually only involves switching over to using semaphore-compat to control the behaviour of -j.

  • The GHC jobserver protocol specifies that all communication happens through the semaphore. This means that it doesn’t matter which jobserver created the semaphore. If nothing else is competing for resources on the semaphore, GHC will acquire as many tokens as it can make use of.

  • If GHC can’t acquire any tokens from the semaphore, compilation will proceed serially (as if running with -j1). This is because jobclients always have their implicit token. As resource acquisition is done in parallel, we won’t block the world just because we are indefinitely waiting on the semaphore.

  • Different jobserver invocations create distinct semaphores (with different identifiers) through which their respective child jobclients communicate. (In practice, the uniqueness is achieved by generating new uniques names that don’t clash with any existing semaphore names.) To enable multiple invocations of e.g. cabal to cooperate over resources, these invocations should themselves be spawned by an overarching jobserver. In that case, some jobserver would create a unique semaphore, spawn cabal jobclient processes (to which the semaphore identifier is passed), and the cabal processes would in turn spawn GHC jobclient process (to which the semaphore identifier is passed).

1.5 Alternatives

1.5.1 Multiple home units

Support for multiple home units (not yet fully implemented in cabal) would provide an alternative way to saturate the number of available capabilities. This is because compilation with multiple home units is achieved using a single GHC invocation, which thus doesn’t have to worry about contention with other processes.

In general, it would be preferred to use multiple home units when possible, as it is expected to be more performant than -jsem:

  • no scheduling between different GHC invocations is necessary;

  • modules are loaded directly into the home unit graph, which avoids having to load the same interface files in different GHC invocations,

  • it doesn’t require the entire unit to finish compiling before compilation can start on another unit that depends on it: we can begin as soon as all the modules we need have been compiled.

However, it’s not always possible to compile everything with a single GHC invocation, e.g. if the build plan involves non-Haskell dependencies somewhere in the middle. In comparison, the -jsem functionality can fit into any build system that one might be using, so it supports a wider range of use cases. The implementation of jsem is also significantly simpler, as the changes required to jobservers (such as cabal and stack) are minimal.

1.5.2 One-shot mode

The other option for build systems is to use GHC in one-shot mode. In that case, the build system can control the scheduling of all the jobs. This is what hadrian and rules_haskell do when building projects (cabal currently does not).

However, modifying cabal to support this workflow would be a significant undertaking. Morever, --make mode is in general more performant than one-shot mode, as one retains more information in memory, as opposed to needing to re-obtain the information by reading interface files.

1.5.3 Other jobserver protocols

GNU make supports a Jobserver protocol [9] [5] which is the same as the GHC Jobserver protocol described above, except that:

  • it uses POSIX pipes to exchange token’s between processes.

  • participants in the protocol learn about it through environment variables and the state of file descriptors on process entry.

For example, rust’s cargo implement the GNU make Jobserver protocol [13] . A prototype implementation of the GNU make Jobserver protocol for GHC was also made by Ellie Hermaszewska [15] .

However, we have decided to depart from this design, for the following reasons:

  • Other communities have considered the Make jobserver, and decided that some aspects of the protocol are unsuitable (OCaml [10] [11] , Nix [12] , ninja [16] ). To summarise:

    • The protocol relies on spawned processes cooperating and returning tokens on termination. If this doesn’t happen, semaphore tokens can be lost entirely. In comparison, with the approach described in this proposal, implicit tokens are controlled by the server; this means that, at worst, the build will continue with reduced parallelism, and mitigation strategies are available.

    • The protocol uses anonymous file descriptors to communicate between processes. This seems to be fragile, with many edge-cases. In 2019, changes to the Linux kernel broke the Make jobserver protocol, due to subtle changes in the semantics of pipes [18].

  • We expect cabal-install and stack to be the only users of this feature in the near term. We think the proposed protocol is adequate for this use case. -jsem doesn’t provide a general solution either for mixed-language code bases which require coordination with other build tools but then there isn’t a widely adopted solution which does.

  • We can extend GHC to use the GNU make Jobserver protocol in the future, if there are users for it.

Some operating systems also have OS-specific methods of mediating parallelism between processes. For example, MacOS’s “Grand Central Dispatch” mechanism allows applications to queue up tasks to be run in parallel, and handles the scheduling. In order to implement parallelism at this level, it seems necessary to modify* the RTS and GHC’s own thread scheduling algorithms. Not only this, the implementation would be specific to a platform.

1.6 Unresolved Questions

  • What should the name of the command-line flag be? Perhaps -juse-jobserver?

  • Should we also offer configuration of this feature via environment variables?

1.7 Implementation Plan

Douglas Wilson, Sam Derbyshire and Matthew Pickering have implemented a prototype at [14] .

Matthew Pickering has implemented the feature in cabal-install in [19] .

Ongoing work from Well-Typed LLP is funded by Hasura.

1.8 Footnotes