Wednesday, July 31, 2019

Project Loom: Fibers and Continuations for the Java Virtual Machine

This week, I'd like to share an article about the integration of fibers (Fibers) into the Java platform:

The concept of a fiber is not new; it has existed for a long time in other programming languages: 
https://fr.wikipedia.org/wiki/Fibre_(informatique) 

The JVM can handle on the order of 1,000 Java threads (which are mapped onto OS threads). The aim is to handle 10,000,000 fibers: a scalability ratio of 10,000. It is this thread performance problem that fibers solve. 

With scalable fibers, the goal is to be able to do synchronous programming with the scalability of asynchronous programming. Synchronous code is simpler for humans to write and to debug. 

The integration of fibers into Java will shake up all the current dynamic patterns known as "reactive".

When threads are managed by a container, the switch from threads to fibers will be transparent to application code.

A second prototype of the fibers was released in July 2019, and this one is usable. It is based on JDK 14, which does not mean that the final version will land in JDK 14 (March 2020). 


Worth following closely... 


https://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.html

Project Loom: Fibers and Continuations for the Java Virtual Machine 
Overview
Project Loom's mission is to make it easier to write, debug, profile and maintain concurrent applications meeting today's requirements. Threads, provided by Java from its first day, are a natural and convenient concurrency construct (putting aside the separate question of communication among threads) which is being supplanted by less convenient abstractions because their current implementation as OS kernel threads is insufficient for meeting modern demands, and wasteful in computing resources that are particularly valuable in the cloud. Project Loom will introduce fibers as lightweight, efficient threads managed by the Java Virtual Machine, that let developers use the same simple abstraction but with better performance and lower footprint. We want to make concurrency simple(r) again! A fiber is made of two components — a continuation and a scheduler. As Java already has an excellent scheduler in the form of ForkJoinPool, fibers will be implemented by adding continuations to the JVM.
Motivation
Many applications written for the Java Virtual Machine are concurrent — meaning, programs like servers and databases, that are required to serve many requests, occurring concurrently and competing for computational resources. Project Loom is intended to significantly reduce the difficulty of writing efficient concurrent applications, or, more precisely, to eliminate the tradeoff between simplicity and efficiency in writing concurrent programs.
One of Java's most important contributions when it was first released, over twenty years ago, was the easy access to threads and synchronization primitives. Java threads (either used directly, or indirectly through, for example, Java servlets processing HTTP requests) provided a relatively simple abstraction for writing concurrent applications. These days, however, one of the main difficulties in writing concurrent programs that meet today's requirements is that the software unit of concurrency offered by the runtime — the thread — cannot match the scale of the domain's unit of concurrency, be it a user, a transaction or even a single operation. Even if the unit of application concurrency is coarse — say, a session, represented by a single socket connection — a server can handle upward of a million concurrent open sockets, yet the Java runtime, which uses the operating system's threads for its implementation of Java threads, cannot efficiently handle more than a few thousand. A mismatch of several orders of magnitude has a big impact.
Programmers are forced to choose between modeling a unit of domain concurrency directly as a thread, losing considerably in the scale of concurrency a single server can support, or implementing concurrency on a finer-grained level than threads (tasks) and supporting concurrency by writing asynchronous code that does not block the thread running it.
Recent years have seen the introduction of many asynchronous APIs to the Java ecosystem, from asynchronous NIO in the JDK, asynchronous servlets, and many asynchronous third-party libraries. Those APIs were created not because they are easier to write and to understand, for they are actually harder; not because they are easier to debug or profile — they are harder (they don't even produce meaningful stacktraces); not because they compose better than synchronous APIs — they compose less elegantly; not because they fit better with the rest of the language or integrate well with existing code — they are a much worse fit, but just because the implementation of the software unit of concurrency in Java — the thread — is insufficient from a footprint and performance perspective. This is a sad case of a good and natural abstraction being abandoned in favor of a less natural one, which is overall worse in many respects, merely because of the runtime performance characteristics of the abstraction.
While there are some advantages to using kernel threads as the implementation of Java threads — most notably because all native code is supported by kernel threads, and so Java code running in a thread can call native APIs — the disadvantages mentioned above are too great to ignore, and result either in hard-to-write, expensive-to-maintain code, or in a significant waste of computing resources, that is especially costly when code runs in the cloud. Indeed, some languages and language runtimes successfully provide a lightweight thread implementation, most famous are Erlang and Go, and the feature is both very useful and popular.
The main goal of this project is to add a lightweight thread construct, which we call fibers, managed by the Java runtime, which would be optionally used alongside the existing heavyweight, OS-provided, implementation of threads. Fibers are much more lightweight than kernel threads in terms of memory footprint, and the overhead of task-switching among them is close to zero. Millions of fibers can be spawned in a single JVM instance, and programmers need not hesitate to issue synchronous, blocking calls, as blocking will be virtually free. In addition to making concurrent applications simpler and/or more scalable, this will make life easier for library authors, as there will no longer be a need to provide both synchronous and asynchronous APIs for a different simplicity/performance tradeoff. Simplicity will come with no tradeoff.
As we will see, a thread is not an atomic construct, but a composition of two concerns — a scheduler and a continuation. It is our current intention to separate the two concerns, and implement Java fibers on top of those two building blocks, and, although fibers are the main motivation for this project, to also add continuations as a user facing abstraction, as continuations have other uses, too (e.g. Python's generators).

Goals and Scope
Fibers can provide a low-level primitive upon which interesting programming paradigms can be implemented, like channels, actors and dataflow, but while those uses will be taken into account, it is not the goal of this project to design any of those higher level constructs, nor to suggest new programming styles or recommended patterns for the exchange of information among fibers (e.g. shared memory vs. message passing). As the issue of limiting memory access for threads is the subject of other OpenJDK projects, and as this issue applies to any implementation of the thread abstraction, be it heavyweight or lightweight, this project will probably intersect with others.
It is the goal of this project to add a lightweight thread construct — fibers — to the Java platform. What user-facing form this construct may take will be discussed below. The goal is to allow most Java code (meaning, code in Java class files, not necessarily written in the Java programming language) to run inside fibers unmodified, or with minimal modifications. It is not a requirement of this project to allow native code called from Java code to run in fibers, although this may be possible in some circumstances. It is also not the goal of this project to ensure that every piece of code would enjoy performance benefits when run in fibers; in fact, some code that is less appropriate for lightweight threads may suffer in performance when run in fibers.
It is the goal of this project to add a public delimited continuation (or coroutine) construct to the Java platform. However, this goal is secondary to fibers (which require continuations, as explained later, but those continuations need not necessarily be exposed as a public API).
It is the goal of this project to experiment with various schedulers for fibers, but it is not the intention of this project to conduct any serious research in scheduler design, largely because we think that ForkJoinPool can serve as a very good fiber scheduler.
As adding the ability to manipulate call stacks to the JVM will undoubtedly be required, it is also the goal of this project to add an even lighter-weight construct that will allow unwinding the stack to some point and then invoke a method with given arguments (basically, a generalization of efficient tail-calls). We will call that feature unwind-and-invoke, or UAI. It is not the goal of this project to add an automatic tail-call optimization to the JVM.
This project will likely involve different components of the Java platform, with features believed to be divided thus:
  • Continuations and UAI will be done in the JVM and exposed as very thin Java APIs. 
  • Fibers will be mostly implemented in Java in the JDK libraries, but may require some support in the JVM. 
  • JDK libraries making use of native code that blocks threads would need to be adapted to be able to run in fibers. In particular this implies changing the java.io classes. 
  • JDK libraries that make use of low-level thread synchronization (and in particular the LockSupport class), such as java.util.concurrent, will need to be adapted to support fibers, but the amount of work required will depend on the fiber API, and, in any event, is expected to be small (as fibers expose a very similar API to threads). 
  • Debuggers, profilers and other serviceability services would need to be aware of fibers to provide a good user experience. This means that JFR and JVMTI would need to accommodate fibers, and relevant platform MBeans may be added. 
  • At this point we do not foresee a need for a change in the Java language.
It is early days for this project, and so everything — including its scope — is subject to change.


Terminology
As kernel threads and lightweight threads are just different implementations of the same abstraction, some confusion over terminology is bound to ensue. This document will adopt the following convention, and every correspondence in the project should follow suit: 
  • The word thread will refer to the abstraction only (which will be explored shortly) and never to a particular implementation, so thread may refer either to any implementation of the abstraction, whether done by the OS or by the runtime. 
  • When a particular implementation is referred to, the terms heavyweight thread, kernel thread and OS thread can be used interchangeably to mean the implementation of thread provided by the operating system kernel. The terms lightweight thread, user-mode thread, and fiber can be used interchangeably to mean an implementation of threads provided by the language runtime — the JVM and JDK libraries in the case of the Java platform. Those words do not (at least in these early stages, when the API design is unclear) refer to specific Java classes. 
  • The capitalized words Thread and Fiber would refer to particular Java classes, and will be used mostly when discussing the design of the API rather than of the implementation.
What Threads Are
A thread is a sequence of computer instructions executed sequentially. As we are dealing with operations that may involve not just calculations but also IO, timed pauses, and synchronization — in general, instructions that cause the stream of computation to wait for some event external to it — a thread, then, has the ability to suspend itself, and to automatically resume when the event it waits for occurs. While a thread waits, it should vacate the CPU core, and allow another to run.
These capabilities are provided by two different concerns. A continuation is a sequence of instructions that execute sequentially, and may suspend itself (a more thorough treatment of continuations is given later, in the section Continuations). A scheduler assigns continuations to CPU cores, replacing a paused one with another that's ready to run, and ensuring that a continuation that is ready to resume will eventually be assigned to a CPU core. A thread, then, requires two constructs: a continuation and a scheduler, although the two may not necessarily be separately exposed as APIs.
Again, threads — at least in this context — are a fundamental abstraction, and do not imply any programming paradigm. In particular, they refer only to the abstraction allowing programmers to write sequences of code that can run and pause, and not to any mechanism of sharing information among threads, such as shared memory or passing messages.
As there are two separate concerns, we can pick different implementations for each. Currently, the thread construct offered by the Java platform is the Thread class, which is implemented by a kernel thread; it relies on the OS for the implementation of both the continuation and the scheduler.
A continuation construct exposed by the Java platform can be combined with existing Java schedulers — such as ForkJoinPool, ThreadPoolExecutor or third-party ones — or with ones especially optimized for this purpose, to implement fibers.
It is also possible to split the implementation of these two building-blocks of threads between the runtime and the OS. For example, modifications to the Linux kernel done at Google (video, slides) allow user-mode code to take over scheduling kernel threads, thus essentially relying on the OS just for the implementation of continuations, while having libraries handle the scheduling. This has the benefits offered by user-mode scheduling while still allowing native code to run on this thread implementation, but it still suffers from the drawbacks of relatively high footprint and non-resizable stacks, and isn't available yet. Splitting the implementation the other way — scheduling by the OS and continuations by the runtime — seems to have no benefit at all, as it combines the worst of both worlds.
But why would user-mode threads be in any way better than kernel threads, and why do they deserve the appealing designation of lightweight? It is, again, convenient to separately consider both components, the continuation and the scheduler. 
In order to suspend a computation, a continuation is required to store an entire call-stack context, or simply put, store the stack. To support native languages, the memory storing the stack must be contiguous and remain at the same memory address. While virtual memory does offer some flexibility, there are still limitations on just how lightweight and flexible such kernel continuations (i.e. stacks) can be. Ideally, we would like stacks to grow and shrink depending on usage. As a language runtime implementation of threads is not required to support arbitrary native code, we can gain more flexibility over how to store continuations, which allows us to reduce footprint.
The much bigger problem with the OS implementation of threads is the scheduler. For one, the OS scheduler runs in kernel mode, and so every time a thread blocks and control is returned to the scheduler, a non-cheap user/kernel switch must occur. For another, OS schedulers are designed to be general-purpose and schedule many different kinds of program threads. But a thread running a video encoder behaves very differently from one serving requests coming over the network, and the same scheduling algorithm will not be optimal for both. Threads handling transactions on servers tend to present certain behavior patterns that present a challenge to a general-purpose OS scheduler. For example, it is a common pattern for a transaction-serving thread A to perform some action on the request, and then pass data on to another thread, B, for further processing. This handoff requires some synchronization between the two threads, which can involve either a lock or a message queue, but the pattern is the same: A operates on some data x, hands it over to B, wakes B up and then blocks until it is handed another request from the network or another thread. This pattern is so common that we can assume that A will block shortly after unblocking B, and so scheduling B on the same core as A will be beneficial, as x is already in the core's cache; in addition, adding B to a core-local queue doesn't require any costly contended synchronization. Indeed, a work-stealing scheduler like ForkJoinPool makes this precise assumption, as it adds tasks scheduled by a running task into a local queue. The OS kernel, however, cannot make such an assumption. As far as it knows, thread A may want to continue running for a long while after waking up B, and so it would schedule the recently unblocked B to a different core, thus both requiring some synchronization, and causing a cache-fault as soon as B accesses x.

Fibers
Fibers are, then, what we call Java's planned user-mode threads. This section will list the requirements of fibers and explore some design questions and options. It is not meant to be exhaustive, but merely present an outline of the design space and provide a sense of the challenges involved.
In terms of basic capabilities, fibers must run an arbitrary piece of Java code, concurrently with other threads (lightweight or heavyweight), and allow the user to await their termination, namely, join them. Obviously, there must be mechanisms for suspending and resuming fibers, similar to LockSupport's park/unpark. We would also want to obtain a fiber's stack trace for monitoring/debugging, as well as its state (suspended/running), etc. In short, because a fiber is a thread, it will have a very similar API to that of heavyweight threads, represented by the Thread class. With respect to the Java memory model, fibers will behave exactly like the current implementation of Thread. While fibers will be implemented using JVM-managed continuations, we may also want to make them compatible with OS continuations, like Google's user-scheduled kernel threads.
There are a few capabilities unique to fibers: we want a fiber to be scheduled by a pluggable scheduler (either fixed at the fiber's construction, or changeable when it is paused, e.g. with an unpark method that takes a scheduler as a parameter), and we'd like fibers to be serializable (discussed in a separate section).
In general, the fiber API will be nearly identical to that of Thread as the abstraction is the same, and we'd also like to run code that so far has run in kernel threads to run in fibers with little or no modification. This immediately suggests two design options: 
1. Represent fibers as a Fiber class, and factor out the common API for Fiber and Thread into a common super-type, provisionally called Strand. Thread-implementation-agnostic code would be programmed against Strand, so that Strand.currentStrand would return a fiber if the code is running in a fiber, and Strand.sleep would suspend the fiber if the code is running in a fiber. 
2. Use the same Thread class for both kinds of threads — user-mode and kernel-mode — and choose an implementation as a dynamic property set in a constructor or a setter called prior to invoking start.
A separate Fiber class might allow us more flexibility to deviate from Thread, but would also present some challenges. Because a user-mode scheduler does not have direct access to CPU cores, assigning a fiber to a core is done by running it in some worker kernel thread, and so every fiber has an underlying kernel thread, at least while it is scheduled to a CPU core, although the identity of underlying kernel thread is not fixed, and may change if the scheduler decides to schedule the same fiber to a different worker kernel thread. If the scheduler is written in Java — as we want — every fiber even has an underlying Thread instance. If fibers are represented by the Fiber class, the underlying Thread instance would be accessible to code running in a fiber (e.g. with Thread.currentThread or Thread.sleep), which seems inadvisable.
If fibers are represented by the same Thread class, a fiber's underlying kernel thread would be inaccessible to user code, which seems reasonable but has a number of implications. For one, it would require more work in the JVM, which makes heavy use of the Thread class, and would need to be aware of a possible fiber implementation. For another, it would limit our design flexibility. It also creates some circularity when writing schedulers, that need to implement threads (fibers) by assigning them to threads (kernel threads). This means that we would need to expose the fiber's (represented by Thread) continuation for use by the scheduler.
Because fibers are scheduled by Java schedulers, they need not be GC roots, as at any given time a fiber is either runnable, in which case a reference to it is held by its scheduler, or blocked, in which case a reference to it is held by the object on which it is blocked (e.g. a lock or an IO queue), so that it can be unblocked.
Another relatively major design decision concerns thread locals. Currently, thread-local data is represented by the (Inheritable)ThreadLocal class(es). How do we treat thread-locals in fibers? Crucially, ThreadLocals have two very different uses. One is associating data with a thread context. Fibers will probably need this capability, too. Another is to reduce contention in concurrent data structures with striping. That use abuses ThreadLocal as an approximation of a processor-local (more precisely, a CPU-core-local) construct. With fibers, the two different uses would need to be clearly separated, as now a thread-local over possibly millions of threads (fibers) is not a good approximation of processor-local data at all. This requirement for a more explicit treatment of thread-as-context vs. thread-as-an-approximation-of-processor is not limited to the actual ThreadLocal class, but to any class that maps Thread instances to data for the purpose of striping. If fibers are represented by Threads, then some changes would need to be made to such striped data structures. In any event, it is expected that the addition of fibers would necessitate adding an explicit API for accessing processor identity, whether precisely or approximately.
An important feature of kernel threads is timeslice-based preemption (which will be called forceful, or forced preemption here, for brevity). A kernel thread that computes for a while without blocking on IO or synchronization will be forcefully-preempted after some time. While at first glance this seems to be an important design and implementation issue for fibers — and, indeed, we may decide to support it; JVM safepoints should make it easy — not only is it not important, but having this feature doesn't make much of a difference at all (so it is best to forgo it). The reason is as follows: unlike kernel threads, the number of fibers may be very large (hundreds of thousands or even millions). If many fibers require so much CPU time that they need to often be forcefully preempted then as the number of threads exceeds the number of cores by several orders of magnitude, the application is under-provisioned by orders of magnitude, and no scheduling policy will help. If many fibers need to run long computations infrequently, then a good scheduler will work around this by assigning fibers to available cores (i.e. worker kernel threads). If a few fibers need to run long computations frequently, then it is better to run that code in heavyweight threads; while different thread implementations provide the same abstraction, there are times where one implementation is better than the other, and it is not necessary for our fibers to be preferable to kernel threads in every circumstance.
A real implementation challenge, however, may be how to reconcile fibers with internal JVM code that blocks kernel threads. Examples range from hidden code, like loading classes from disk, to user-facing functionality, such as synchronized and Object.wait. As the fiber scheduler multiplexes many fibers onto a small set of worker kernel threads, blocking a kernel thread may take out of commission a significant portion of the scheduler's available resources, and should therefore be avoided.
On one extreme, each of these cases will need to be made fiber-friendly, i.e., block only the fiber rather than the underlying kernel thread if triggered by a fiber; on the other extreme, all cases may continue to block the underlying kernel thread. In between, we may make some constructs fiber-blocking while leaving others kernel-thread-blocking. There is good reason to believe that many of these cases can be left unchanged, i.e. kernel-thread-blocking. For example, class loading occurs frequently only during startup and only very infrequently afterwards, and, as explained above, the fiber scheduler can easily schedule around such blocking. Many uses of synchronized only protect memory access and block for extremely short durations — so short that the issue can be ignored altogether. We may even decide to leave synchronized unchanged, and encourage those who surround IO access with synchronized and block frequently in this way to change their code to make use of the j.u.c constructs (which will be fiber-friendly) if they want to run the code in fibers. The same goes for the use of Object.wait, which isn't common in modern code anyway (or so we believe at this point), as modern code tends to use j.u.c.
In any event, a fiber that blocks its underlying kernel thread will trigger some system event that can be monitored with JFR/MBeans.
While fibers encourage the use of ordinary, simple and natural synchronous blocking code, it is very easy to adapt existing asynchronous APIs, turning them into fiber-blocking ones. Suppose that a library exposes this asynchronous API for some long-running operation, foo, which returns a String:
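Such an API might look like this (asyncFoo and FooException are illustrative names, not confirmed signatures):

    // Starts the long-running operation foo; the result is delivered
    // asynchronously to the given completion handler.
    void asyncFoo(FooCompletion callback);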

where the callback, or completion handler FooCompletion is defined like so:
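A plausible sketch, consistent with foo returning a String and failing with a hypothetical FooException:

    interface FooCompletion {
        void success(String result);
        void failure(FooException exception);
    }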
We will provide an async-to-fiber-blocking construct that may look something like this:
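One possible shape for that construct, sketched against the provisional fiber API discussed in this document (_Fiber, _AsyncToBlocking and the park/unpark semantics are all provisional):

    abstract class _AsyncToBlocking<T, E extends Throwable> {
        private _Fiber f;       // the fiber blocked waiting for the result
        private T result;
        private E exception;

        protected void _complete(T result) { this.result = result; unpark(); }
        protected void _fail(E exception)  { this.exception = exception; unpark(); }

        // Subclasses register 'this' as the completion handler with the async API.
        abstract void register();

        // Called from a fiber: registers the handler, parks the current fiber
        // until _complete or _fail is invoked, then returns the result or throws.
        public T run() throws E {
            this.f = _Fiber.current();
            register();
            _Fiber.park();
            if (exception != null)
                throw exception;
            return result;
        }

        private void unpark() { f.unpark(); }
    }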
We can then create a blocking version of the API by first defining the following class:
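Continuing the sketch:

    class BlockingFoo extends _AsyncToBlocking<String, FooException>
            implements FooCompletion {
        @Override public void success(String result)   { _complete(result); }
        @Override public void failure(FooException ex) { _fail(ex); }
        @Override void register() { asyncFoo(this); } // hook into the async API
    }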
which we then use to wrap the asynchronous API with a synchronous version:
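Possibly along these lines, reusing the hypothetical names above:

    String syncFoo() throws FooException {
        return new BlockingFoo().run(); // blocks only the fiber, not a kernel thread
    }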
We can include such ready integrations for common asynchronous classes, such as CompletableFuture.

Continuations
The motivation for adding continuations to the Java platform is for the implementation of fibers, but continuations have some other interesting uses, and so it is a secondary goal of this project to provide continuations as a public API. The utility of those other uses is, however, expected to be much lower than that of fibers. In fact, continuations don't add expressivity on top of that of fibers (i.e., continuations can be implemented on top of fibers).
In this document and everywhere in Project Loom, the word continuation will mean a delimited continuation (also sometimes called a coroutine [1]). Here we will think of delimited continuations as sequential code that may suspend (itself) and resume (be resumed by a caller). Some may be more familiar with the point of view that sees continuations as objects (usually subroutines) representing "the rest" or "the future" of a computation. The two describe the very same thing: a suspended continuation is an object that, when resumed or "invoked", carries out the rest of some computation.
A delimited continuation is a sequential sub-program with an entry point (like a thread), which we'll call simply the entry point (in Scheme, this is the reset point), which may suspend or yield execution at some point, which we'll call the suspension point or the yield point (the shift point in Scheme). When a delimited continuation suspends, control is passed outside of the continuation, and when it is resumed, control returns to the last yield point, with the execution context up to the entry point intact. There are many ways to present delimited continuations, but to Java programmers, the following rough pseudocode would explain it best:
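Roughly, with numbered markers matching the walkthrough below (this is a sketch, not a real API):

    foo() {                    // (2)
        ...
        bar()
        ...
    }

    bar() {
        ...
        suspend                // (3)
        ...                    // (5)
    }

    main() {
        c = continuation(foo)  // (0)
        c.continue()           // (1)
        c.continue()           // (4)
    }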

A continuation is created (0), whose entry point is foo; it is then invoked (1) which passes control to the entry point of the continuation (2), which then executes until the next suspension point (3) inside the bar subroutine, at which point the invocation (1) returns. When the continuation is invoked again (4), control returns to the line following the yield point (5).
The continuations discussed here are "stackful", as the continuation may block at any nested depth of the call stack (in our example, inside the function bar which is called by foo, which is the entry point). In contrast, stackless continuations may only suspend in the same subroutine as the entry point. Also, the continuations discussed here are non-reentrant, meaning that any invocation of the continuation may change the "current" suspension point. In other words, the continuation object is stateful.
The main technical mission in implementing continuations — and indeed, of this entire project — is adding to HotSpot the ability to capture, store and resume callstacks not as part of kernel threads. JNI stack frames will likely not be supported.
As continuations serve as the basis for fibers, if continuations are exposed as a public API, we will need to support nested continuations, meaning code running inside a continuation must be able to suspend not only the continuation itself, but an enclosing one (e.g., suspend the enclosing fiber). For example, a common use for continuations is in the implementation of generators. A generator exposes an iterator, and the code running inside the generator produces another value for the iterator every time it yields. It should therefore be possible to write code like this:
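For example, something along these lines, where generator, produce and readLine are hypothetical names; produce suspends the generator's own (inner) continuation, while the fiber-blocking readLine suspends the enclosing fiber's (outer) continuation:

    // Runs inside a fiber. Each produce(...) yields a value by suspending the
    // generator's continuation; readLine(socket) may suspend the whole fiber
    // without terminating or disturbing the generator.
    Iterator<String> lines = generator(out -> {
        String line;
        while ((line = readLine(socket)) != null)
            out.produce(line);
    });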

In the literature, nested continuations that allow such behavior are sometimes called "delimited continuations with multiple named prompts", but we'll call them scoped continuations. See this blog post for a discussion of the theoretical expressivity of scoped continuations (to those interested, continuations are a "general effect", and can be used to implement any effect — e.g. assignment — even in a pure language that has no other side-effect; this is why, in some sense, continuations are the fundamental abstraction of imperative programming).
Code running inside a continuation is not expected to have a reference to the continuation, and the scopes normally have some fixed names (so suspending scope A would suspend the innermost enclosing continuation of scope A). However, the yield point provides a mechanism to pass information from the code to the continuation instance and back. When a continuation suspends, no try/finally blocks enclosing the yield point are triggered (i.e., code running in a continuation cannot detect that it is in the process of suspending).
One of the reasons for implementing continuations as a construct independent of fibers (whether or not they are exposed as a public API) is a clear separation of concerns. Continuations, therefore, are not thread-safe and none of their operations creates cross-thread happens-before relations. Establishing the memory visibility guarantees necessary for migrating continuations from one kernel thread to another is the responsibility of the fiber implementation.
A rough outline of a possible API is presented below. Continuations are a very low-level primitive that will only be used by library authors to build higher-level constructs (just as java.util.Stream implementations leverage Spliterator). It is expected that classes making use of continuations will have a private instance of the continuation class, or even, more likely, of a subclass of it, and that the continuation instance will not be directly exposed to consumers of the construct.
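One possible outline, with provisional names (_Continuation, _Scope); the semantics of run and suspend are described next:

    class _Continuation {
        public _Continuation(_Scope scope, Runnable target)

        // Runs the continuation until it suspends or terminates;
        // returns true if it terminated, false if it suspended.
        public boolean run()

        // Suspends the innermost enclosing continuation of the given scope.
        // The ccc callback receives the continuation instance being suspended;
        // the return value is that same instance.
        public static _Continuation suspend(_Scope scope, Consumer<_Continuation> ccc)
    }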

The run method returns true when the continuation terminates, and false if it suspends. The suspend method allows passing information from the yield point to the continuation (using the ccc callback that can inject information into the continuation instance it is given), and back from the continuation to the suspension point (using the return value, which is the continuation instance itself, from which information can be queried).
To demonstrate how easily fibers can be implemented in terms of continuations, here is a partial, simplistic implementation of a _Fiber class representing a fiber. As you'll note, most of the code maintains the fiber's state, to ensure it doesn't get scheduled more than once concurrently:
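A sketch under the provisional API outlined above (_FIBER_SCOPE is an assumed fiber scope; any java.util.concurrent.Executor serves as the scheduler):

    class _Fiber {
        private enum State { NEW, LEASED, RUNNABLE, PAUSED, DONE }

        private final _Continuation cont;
        private final Executor scheduler;
        private final AtomicReference<State> state = new AtomicReference<>(State.NEW);
        private final Runnable task;

        public _Fiber(Runnable target, Executor scheduler) {
            this.scheduler = scheduler;
            this.cont = new _Continuation(_FIBER_SCOPE, target);
            this.task = () -> {
                while (!cont.run()) {  // run until the continuation suspends or ends
                    if (park0())
                        return;        // truly parked; a later unpark reschedules us
                    // else an unpark raced with the park: keep running
                }
                state.set(State.DONE);
            };
        }

        public void start() {
            if (!state.compareAndSet(State.NEW, State.LEASED))
                throw new IllegalStateException("already started");
            scheduler.execute(task);
        }

        // Called by code running in the fiber; suspends its continuation.
        public static void park() {
            _Continuation.suspend(_FIBER_SCOPE, c -> {});
        }

        // After the continuation has suspended: LEASED means "running",
        // RUNNABLE means "an unpark is pending". Returns true if the fiber
        // is now parked, false if it should keep the current worker thread.
        private boolean park0() {
            if (state.compareAndSet(State.LEASED, State.PAUSED))
                return true;
            state.compareAndSet(State.RUNNABLE, State.LEASED); // consume the permit
            return false;
        }

        public void unpark() {
            while (true) {
                if (state.compareAndSet(State.PAUSED, State.LEASED)) {
                    scheduler.execute(task); // reschedule exactly once
                    return;
                }
                if (state.compareAndSet(State.LEASED, State.RUNNABLE))
                    return;                  // running: record the pending unpark
                State s = state.get();
                if (s == State.RUNNABLE || s == State.DONE || s == State.NEW)
                    return;
            }
        }
    }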

Schedulers
As mentioned above, work-stealing schedulers like ForkJoinPools are particularly well-suited to scheduling threads that tend to block often and communicate over IO or with other threads. Fibers, however, will have pluggable schedulers, and users will be able to write their own ones (the SPI for a scheduler can be as simple as that of Executor). Based on prior experience, it is expected that ForkJoinPool in asynchronous mode can serve as an excellent default fiber scheduler for most uses, but we may want to explore one or two simpler designs, as well, such as a pinned-scheduler, that always schedules a given fiber to a specific kernel thread (which is assumed to be pinned to a processor).

Unwind-and-Invoke
Unlike continuations, the contents of the unwound stack frames are not preserved, and there is no need for any object reifying this construct.
TBD
Additional Challenges

While the main motivation for this goal is to make concurrency easier/more scalable, a thread implemented by the Java runtime and over which the runtime has more control has other benefits. For example, such a thread could be paused and serialized on one machine and then deserialized and resumed on another. This is useful in distributed systems where code could benefit from being relocated closer to the data it accesses, or in a cloud platform offering function-as-a-service, where the machine instance running user code could be terminated while that code awaits some external event, and later resumed on another instance, possibly on a different physical machine, thus making better use of available resources and reducing costs for both host and client. A fiber would then have methods like parkAndSerialize and deserializeAndUnpark.
As we want fibers to be serializable, continuations should be serializable as well. If they are serializable, we might as well make them cloneable, as the ability to clone continuations actually adds expressivity (as it allows going back to a previous suspension point). It is, however, a very serious challenge to make continuation cloning useful enough for such uses, as Java code stores a lot of information off-stack, and to be useful, cloning would need to be "deep" in some customizable way.

Other Approaches
An alternative solution to that of fibers to concurrency's simplicity vs. performance issue is known as async/await, and has been adopted by C# and Node.js, and will likely be adopted by standard JavaScript. Continuations and fibers dominate async/await in the sense that async/await is easily implemented with continuations (in fact, it can be implemented with a weak form of delimited continuations known as stackless continuations, that don't capture an entire call-stack but only the local context of a single subroutine), but not vice-versa.
While implementing async/await is easier than full-blown continuations and fibers, that solution falls far too short of addressing the problem. While async/await makes code simpler and gives it the appearance of normal, sequential code, like asynchronous code it still requires significant changes to existing code, explicit support in libraries, and does not interoperate well with synchronous code. In other words, it does not solve what's known as the "colored function" problem.


[1] Whether we'll call it continuation or coroutine going forward is TBD — there is a difference in meaning, but the nomenclature does not seem to be fully standardized, and continuation seems to be used as the more general term. 


https://jaxenter.com/openjdk-loom-160508.html

OpenJDK update: Early access Loom builds available 
Sarah Schlothauer | July 29, 2019
The JDK and Project Loom hit a new milestone. According to a July 29, 2019 announcement by Oracle's Ron Pressler on the OpenJDK mailing list, early access OpenJDK binaries that include Project Loom are now available for download.
Let’s take a quick look at what this project is and what kind of user feedback will help it grow.

Project Loom
What is Project Loom? According to the OpenJDK Wiki, "Project Loom is intended to explore, incubate and deliver Java VM features and APIs built on top of them for the purpose of supporting easy-to-use, high-throughput lightweight concurrency and new programming models on the Java platform".
It uses lightweight user-mode threads called fibers, which allow synchronous code to be scheduled efficiently. Read more about the structured concurrency approach that Project Loom explores and its benefits.
For more information about the project, watch this 2018 video from Java with Oracle’s Ron Pressler and Alan Bateman. In the video, they discuss the project, its implementation and design, and show a demo.

Giving feedback
For those interested in helping out, you can help this project reach maturity by testing the current prototype. Project Loom is under active development and is not stable. Any user feedback and notification of bugs and/or issues will be helpful for future improvements and releases.
The announcement states that the best way to help is by downloading the early access binaries and simply testing it out.
In the OpenJDK mailing list, Pressler notes:
We are mostly interested in feedback on the following:
– Given current Loom limitations (such as held monitors pinning threads), how hard is it to migrate existing code to work well with fibers?
– Structured concurrency: is it helpful? What can be improved?
– Which use cases are important to you, and are they well-served by Loom?
A small write-up about your specific use cases for Project Loom will help.
Current missing features include:
  • Yielding while a native VM frame is on the stack, in the case of a privileged action, reflective invocation or MethodHandle invocation.
  • Cloning continuations 
  • Serialization of fiber/continuation 
Information on how to download is available here. Currently, Loom supports Mac and Linux x86-64. The source code is available here.
All info subject to change as the project evolves.

Thursday, July 11, 2019

Non-Volatile Mapped Byte Buffers

I'd like to share an article about the integration, in JDK 14, of a byte buffer class (ByteBuffer) connected to non-volatile memory: the contents of this memory are not lost when the power is cut (this is non-volatile RAM, not an SSD).

This non-volatile memory can be used for data persistence: the data survives the program's shutdown.


Analysis:
With this hardware evolution (non-volatile main memory), we can see that advances in hardware technology demand changes in software solutions. These changes should impact the technical foundations (the platform layer, in the classic sense of the term), but should not impact functional code: this holds true when there is a clean separation between functional code and non-functional properties such as persistence.

The solution chosen for data persistence may well change with this technology, the functionality staying identical: persistence is a requirement (a need); a file system, a database or non-volatile memory are possible solutions.

If this non-volatile memory technology becomes mainstream (price included), we can imagine a 64-bit processor with 1 TB of non-volatile main memory attached directly to its data bus (and not via PCI Express, like today's SSDs).

Operating systems will have to be modified so that they erase whatever must not be persisted (the right to be forgotten ;-) ).
Persistence will be the default; will non-persistence have to be added explicitly?

To be continued...

https://openjdk.java.net/jeps/352

JEP 352: Non-Volatile Mapped Byte Buffers
Summary
Add new JDK-specific file mapping modes so that the FileChannel API can be used to create MappedByteBuffer instances that refer to non-volatile memory.

Goals
This JEP proposes to upgrade MappedByteBuffer to support access to non-volatile memory (NVM). The only API change required is a new enumeration employed by FileChannel clients to request mapping of a file located on an NVM-backed file system rather than a conventional file storage system. Recent changes to the MappedByteBuffer API mean that it supports all the behaviours needed to allow direct memory updates and provide the durability guarantees needed for higher-level Java client libraries to implement persistent data types (e.g. block file systems, journaled logs, persistent objects, etc.). The implementations of FileChannel and MappedByteBuffer need revising to be aware of this new backing type for the mapped file.
The primary goal of this JEP is to ensure that clients can access and update NVM from a Java program efficiently and coherently. A key element of this goal is to ensure that individual writes (or small groups of contiguous writes) to a buffer region can be committed with minimal overhead i.e. to ensure that any changes which might still be in cache are written back to memory.
A second, subordinate goal is to implement this commit behaviour using a restricted, JDK-internal API defined in class Unsafe, allowing it to be re-used by classes other than MappedByteBuffer that may need to commit NVM.
A final, related goal is to allow buffers mapped over NVM to be tracked by the existing monitoring and management APIs.
N.B. It is already possible to map an NVM device file to a MappedByteBuffer and commit writes using the current force() method, for example using Intel's libpmem library as device driver or by calling out to libpmem as a native library. However, with the current API both those implementations provide a "sledgehammer" solution. A force cannot discriminate between clean and dirty cache lines and requires a system call or JNI call to implement each writeback. For both those reasons the existing capability fails to satisfy the efficiency requirement of this JEP.
The target OS/CPU platform combinations for this JEP are Linux/x64 and Linux/AArch64. This restriction is imposed for two reasons. This feature will only work on OSes that support the mmap system call MAP_SYNC flag, which allows synchronous mapping of non-volatile memory. That is true of recent Linux releases. It will also only work on CPUs that support cache line writeback under user space control. x64 and AArch64 both provide instructions meeting this requirement.

Non-Goals
The goals of this JEP do not extend beyond providing access to and durability guarantees for NVM. In particular, it is not a goal of this JEP to cater for other important behaviours such as atomic update of NVM, isolation of readers and writers, or consistency of independently persisted memory states.
Recent Windows/x64 releases do support the mmap MAP_SYNC flag. However, the goal of providing this capability for that OS/CPU combination (or any other possible platforms) is deferred to a later update.

Success Metrics
The efficiency goal is hard to quantify precisely. However, the cost of persisting data to memory should be significantly lowered relative to two existing alternatives. Firstly, it should significantly improve on the cost incurred by writing the data to conventional file storage synchronously, i.e., including the usual delays required to ensure that individual writes are guaranteed to hit disk. Secondly, the cost should also be significantly lower than that incurred by writing to NVM using a driver-based solution reliant on system calls such as libpmem. Costs might reasonably be expected to be lowered by an order of magnitude relative to synchronous file writes and by a factor of two relative to using system calls.

Motivation
NVM offers the opportunity for application programmers to create and update program state across program runs without incurring the significant copying and/or translation costs that output to and input from a persistent medium normally implies. This is particularly significant for transactional programs, where regular persistence of in-doubt state is required to enable crash recovery.
Existing C libraries (such as Intel's libpmem) provide C programs with highly efficient access to NVM at the base level. They also build on this to support simple management of a variety of persistent data types. Currently, use of even just the base library from Java is costly because of the frequent need to make system calls or JNI calls to invoke the primitive operation which ensures memory changes are persistent. The same problem limits use of the higher-level libraries and is exacerbated by the fact that the persistent data types provided in C are allocated in memory not directly accessible from Java. This places Java applications and middleware (for example, a Java transaction manager) at a severe disadvantage compared with C or languages which can link into C libraries at low cost.
This proposal attempts to remedy the first problem by allowing efficient writeback of NVM mapped to a ByteBuffer. Since ByteBuffer-mapped memory is directly accessible to Java this allows the second problem to be addressed by implementing client libraries equivalent to those provided in C to manage storage of different persistent data types.

Description
Preliminary Changes
This JEP makes use of two related enhancements to the Java SE API:
1. Support implementation-defined Map Modes (JDK-8221397)
2. MappedByteBuffer::force method to specify range (JDK-8221696)

Proposed JDK-Specific API Changes
1. Expose new MapMode enumeration values via a public API in a new module
A new module, jdk.nio.mapmode, will export a single new package of the same name. A public extension enumeration ExtendedMapMode will be added to this package:
package jdk.nio.mapmode;

. . .

public class ExtendedMapMode {
    private ExtendedMapMode() { }

    public static final MapMode READ_ONLY_SYNC = . . .
    public static final MapMode READ_WRITE_SYNC = . . .
}
The new enumeration values are used when calling the FileChannel::map method to create, respectively, a read-only or read-write MappedByteBuffer mapped over an NVM device file. An UnsupportedOperationException will be thrown if these flags are passed on platforms which do not support mapping of NVM device files. On supported platforms, it is only appropriate to pass these new values as arguments when the target FileChannel instance is derived from a file opened via an NVM device. In any other case an IOException will be thrown.
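For illustration only, mapping a region of such a file and committing a small update might look like the sketch below; the path is assumed to be a file on a DAX-mounted, NVM-backed file system, and the ranged force call is the one added by JDK-8221696:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import jdk.nio.mapmode.ExtendedMapMode;

    public class NvmWriteExample {
        public static void main(String[] args) throws IOException {
            Path file = Path.of("/mnt/pmem/data"); // assumed NVM-backed (DAX) file
            try (FileChannel fc = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buf = fc.map(ExtendedMapMode.READ_WRITE_SYNC, 0, 1024);
                buf.putLong(0, 42L);      // a direct store to persistent memory
                buf.force(0, Long.BYTES); // write back only the dirty range
            }
        }
    }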
2. Publish a BufferPoolMXBean tracking persistent MappedByteBuffer statistics
The ManagementFactory class provides method List<T> getPlatformMXBeans(Class<T>) which can be used to retrieve a list of BufferPoolMXBean instances tracking count, total_capacity and memory_used for the existing categories of mapped or direct byte buffers. It will be modified to return an extra, new BufferPoolMXBean with name "mapped - 'non-volatile memory'", which will track the above stats for all MappedByteBuffer instances currently mapped with mode ExtendedMapMode.READ_ONLY_SYNC or ExtendedMapMode.READ_WRITE_SYNC. The existing BufferPoolMXBean with name mapped will continue only to track stats for MappedByteBuffer instances currently mapped with mode MapMode.READ_ONLY, MapMode.READ_WRITE or MapMode.PRIVATE.
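For example, the new pool would then show up alongside the existing ones when iterating the platform's buffer-pool beans (this snippet uses only the existing java.lang.management API):

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class BufferPoolStats {
        public static void main(String[] args) {
            // On a JDK with this JEP, the list would also contain a pool named
            // "mapped - 'non-volatile memory'" next to "mapped" and "direct".
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%s: count=%d capacity=%d used=%d%n",
                        pool.getName(), pool.getCount(),
                        pool.getTotalCapacity(), pool.getMemoryUsed());
            }
        }
    }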

Proposed Internal JDK API Changes
1. Add new method writebackMemory to class jdk.internal.misc.Unsafe
public void writebackMemory(long address, long length)
A call to this method ensures that any modifications to memory in the address range starting at address and continuing up to (but not necessarily including) address + length are guaranteed to have been written back from cache to memory. The implementation must guarantee that all stores by the current thread that i) are pending at the point of call and ii) address memory in the target range are included in the writeback (i.e., there is no need for the caller to perform any memory fence operation before the call). It must also guarantee that writeback of all addressed bytes has completed before returning (i.e., there is no need for the caller to perform any memory fence operation after the call).
The writeback memory operation will be implemented using a small number of intrinsics recognised by the JIT compiler. The goal is to implement writeback of each successive cache line in the specified address range using an intrinsic that translates to a processor cache line writeback instruction, reducing the cost of persisting data to the bare minimum. The envisaged design also employs a pre-writeback and post-writeback memory synchronization intrinsic. These may translate to a memory synchronization instruction or to a no-op depending upon the specific choice of instruction for the processor writeback (x64 has three possible candidates) and the ordering requirements that choice entails.
N.B. A good reason for implementing this capability in class Unsafe is that it is likely to be of more general use, say for alternative data persistence implementations employing non-volatile memory.

Alternatives
Two alternatives were tested in the original prototype.
One option was to use libpmem in driver mode, i.e., 1) install libpmem as the driver for the NVM device, 2) map the file as per any other MappedByteBuffer, and 3) rely on the force method to do the update.
The second alternative was to use libpmem (or some fragment thereof) as a JNI native library to provide the required buffer mapping and writeback behaviour.
Both options proved very unsatisfactory. The first suffered from the high cost of system calls and the overhead involved in forcing the whole mapped buffer rather than some subset of it. The second suffered from the high cost of the JNI interface. Successive iterations of the second approach (adding first registered natives and then implementing them as intrinsics) provided similar performance benefits to the current draft implementation.
A third alternative that was considered is to wait for Project Panama to provide access to foreign libraries and foreign datatypes mapped over NVRAM without incurring the overheads of JNI. While this is still considered to be a worthwhile option for the future it was decided that the current proposal is worth pursuing for two reasons: firstly, to allow users to experiment with the use of NVRAM from Java immediately, as it begins to become available; and secondly, to ease such a transition by supporting a model for use of NVRAM derived from the existing, familiar MappedByteBuffer API.

Testing
Testing will require an x64 or AArch64 host fitted with an NVM device and running a suitably up to date Linux kernel (4.16).
Testing on AArch64 may not be possible until suitable NVM devices are available for this architecture. As an alternative, testing may need to proceed by mapping volatile memory and using it to simulate the behaviour of an NVM device.
Testing on both target architectures may be difficult; in particular, it may suffer from false positives. A failure in the writeback code can only be detected if it is possible to kill a JVM with those pending changes unflushed and then to detect that omission at restart.
This situation may be difficult to arrange when employing a normal JVM exit (normal shutdown may end up causing those pending changes to be written back). Given that the JVM does not have total control over the operation of the memory system it may even prove difficult to detect a problem when an abnormal exit (say a kill -KILL termination) is performed.

Risks and Assumptions
This implementation allows for management of NVM as an off-heap resource via a ByteBuffer. A related enhancement, JDK-8153111, is looking at the use of NVM for heap data. It may also be necessary to consider use of NVM to store JVM metadata. These different modes of NVM management may turn out to be incompatible or, possibly, just inappropriate when used in combination.
The proposed API can only deal with mapped regions up to 2GB. It may be necessary to revise the proposed implementation so that it conforms to changes proposed in JDK-8180628 to overcome this restriction.
The ByteBuffer API is mostly focused on position-relative (cursor) access which limits opportunities for concurrent updates to independent buffer regions. These require locking of the buffer during update as detailed in JDK-5029431, which also implemented a remedy. The problem is mitigated to some degree by the provision of primitive value accessors which operate at an absolute index without reference to a cursor, permitting unlocked access; also by the option to use ByteBuffer slices and MethodHandles to perform concurrent puts/gets of primitive values.