Welcome to This Week in D! Each week, we'll summarize what's been going on in the D community and write brief advice columns to help you get the most out of the D Programming Language.
The D Programming Language is a general purpose programming language that offers modern convenience, modeling power, and native efficiency with a familiar C-style syntax.
This Week in D has an RSS feed.
This Week in D is edited by Adam D. Ruppe. Contact me with any questions, comments, or contributions.
Several Phobos functions have been "range-ified", which changes the signature but should mostly work the same way. The change means fewer memory allocations will happen in the Phobos library.
std.traits now has a hasUDA function, to make checking user-defined attributes easier.
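As a quick illustration of how hasUDA can be used, here is a minimal sketch. The Serializable attribute and the Config struct are made up for this example; only hasUDA itself comes from std.traits.

```d
import std.traits : hasUDA;

// Hypothetical attribute type, used only for this illustration.
struct Serializable {}

// Hypothetical struct with one attributed and one plain field.
struct Config
{
    @Serializable int port; // carries the attribute
    int scratch;            // does not
}

// hasUDA is evaluated at compile time, so static assert can check it.
static assert( hasUDA!(Config.port, Serializable));
static assert(!hasUDA!(Config.scratch, Serializable));
```

Since the check happens at compile time, it can also drive static if branches in generic code.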
Vladimir has released forum.dlang.org, version 2 (BETA) which will probably become the new web interface. Chime in now if you have comments!
DConf 2015 happened recently! Over 30 men gathered in person at Utah Valley University for about nine hours a day over three days to discuss D, with the majority of the conference also being livestreamed over YouTube to many other people.
The conference was also professionally recorded and those videos will be made available later, once editing is finished.
This Week in D summarized the Wednesday morning session last week. This week, we'll continue our coverage.
See last week's issue.
After lunch, we reconvened for three additional talks and one long Q&A session.
The first talk after lunch was by Liran Zvibel, discussing the use of D at his company, Weka.io. Weka is based in Israel and has been using D since early 2014 to write maximum-performance, high-availability primary storage software. His slides are here.
After introducing his company and work, Liran briefly described their old infrastructure: a mix of C and Python with a lot of auto-generated code. They wanted better and moved to D.
For maximum performance, they wrote memory-efficient code with zero copying, no allocation, and no locks. For more understandable code, they used fibers, a reactor, and an RPC framework based on D's compile-time reflection.
They had a number of debugging challenges: bugs must be fixed, but the system can't go down and reproducing errors is too expensive. They couldn't use extensive text tracing because writing to a log is too slow and too bloated. To solve this problem, they wrote a custom tracing framework that uses Brian Schott's libdparse and static code generation to instrument their code and make a binary storage system. For example, it would store ID integers instead of strings for maximum efficiency. A lockless move system to shared memory allows it to quickly log the useful information without pausing the system.
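To make the "ID integers instead of strings" idea concrete, here is a small sketch of my own. It is not Weka's framework: every name in it is hypothetical, and it swaps their libdparse-based code generation for a simple compile-time hash of the format string (which, unlike generated IDs, could collide).

```d
// Hypothetical names throughout; this is not Weka's framework.

/// FNV-1a hash, standing in for a generated message ID.
uint messageId(string fmt) pure nothrow @nogc @safe
{
    uint h = 2166136261u;
    foreach (char c; fmt)
    {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

/// One fixed-size binary record: no string is stored at runtime.
struct TraceRecord
{
    uint id;  // identifies the format string
    long arg; // a single integer payload, for brevity
}

/// Fixed-capacity ring buffer; logging never allocates.
struct TraceBuffer(size_t capacity)
{
    TraceRecord[capacity] records;
    size_t next;

    void log(string fmt)(long arg) nothrow @nogc @safe
    {
        enum id = messageId(fmt); // enum forces compile-time evaluation
        records[next % capacity] = TraceRecord(id, arg);
        ++next;
    }
}
```

A call such as buf.log!"read block %d"(42) then writes only two integers. A viewer that knows the same format strings can map the ID back to text offline.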
Weka uses a custom log viewer to give their developers access to the stored trace data. He noted that access to linker sections would be nice to remove a step in the custom trace process, but it is not necessary to get the system working.
Next, Liran described the IPC system, which uses compile-time reflection over regular D interface declarations to generate the code that communicates with remote nodes. Editor's note: chapters eight and nine of my D Cookbook describe techniques that can form the foundation of such a system, though Weka's implementation goes further than the book does. I'd also add that while this creates some beautiful code that is easy to alter, it does come at a cost: with today's compiler, whose implementation is currently inefficient, such reflection and codegen can have a major effect on compile times (Liran mentions slow builds later in the talk), so it isn't necessarily the right solution all the time. You might want to use the codegen to make the interface but, instead of mixing it in immediately, use pragma(msg) to print the code during compilation and have your makefile save that to a file which you can cache between builds, or something along those lines.
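For readers who have not seen the technique, here is a rough sketch of generating a stub from an interface with compile-time reflection. It is my own illustration, not Weka's code: the Storage interface, the generateStub helper, and the sent array (which stands in for a real network transport) are all hypothetical.

```d
import std.conv : to;
import std.traits : Parameters, ReturnType;

// Hypothetical service interface, for illustration only.
interface Storage
{
    int read(ulong block);
    void write(ulong block, int value);
}

/// Builds, at compile time, the source text of a stub class for I.
/// Each generated method just records its own name in `sent`.
string generateStub(I)()
{
    string code = "class " ~ I.stringof ~ "Stub : " ~ I.stringof ~ "\n{\n";
    code ~= "    string[] sent; // stand-in for a network transport\n";
    foreach (name; __traits(allMembers, I))
    {
        alias Ret = ReturnType!(__traits(getMember, I, name));
        alias Args = Parameters!(__traits(getMember, I, name));

        // Build a parameter list such as "ulong a0, int a1".
        string params;
        foreach (i, Arg; Args)
            params ~= (i ? ", " : "") ~ Arg.stringof ~ " a" ~ i.to!string;

        code ~= "    " ~ Ret.stringof ~ " " ~ name ~ "(" ~ params ~ ")\n    {\n";
        code ~= "        sent ~= \"" ~ name ~ "\";\n";
        static if (!is(Ret == void))
            code ~= "        return " ~ Ret.stringof ~ ".init;\n";
        code ~= "    }\n";
    }
    return code ~ "}\n";
}

enum stubSource = generateStub!Storage();
// pragma(msg, stubSource); // uncomment to print the generated code while compiling
mixin(stubSource);          // declares class StorageStub
```

Adding a method to Storage is enough to get a matching stub method. The commented-out pragma(msg) line is the hook for the caching idea mentioned in the note: print the generated source once, save it, and compile the saved file instead of regenerating it on every build.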
Liran mentioned that this approach is much easier to use than implementing it in C with external tools, while remaining extremely efficient at runtime.
Next, we heard about some custom code they wrote, including their own assertion and reflection helpers that build on what D has built in. They created fiber-local storage and fiber debugging helpers. They also wrote a number of no-GC efficient data structures, gc_hacks for getting private GC statistics to help with their optimization, and reflection-based accessors which notify of changes to member variables.
Liran took a moment now to exclaim that "what you should take of [this talk] is: D is great" and that you should adopt D. He mentioned that one of the interview questions they ask is to give them 30 lines of C++ and see if they can tell what it does. Given the complexity of C++, this is harder than it seems, even for experienced C++ developers. When people see D, however, they tend to find it much easier to read, even as a newbie, but especially after they get used to it.
He then talked about some challenges they faced. Many of them stemmed from their strict latency requirements being incompatible with garbage collection, so they had to avoid it. The compiler took too much memory and did not scale to use all the cores on their dev machine; compiling all at once simply didn't work for them. (This relates to a number of known bugs in the compiler, which generally have workarounds, but they don't work as well as, for example, the existing workarounds for C++'s slow builds.) The build also produced a huge executable. Finally, they hit a few bugs in the compilers, especially gdc and ldc, meaning they stuck with dmd in production despite its poorer code optimization.
Walter mentioned they could try .di files for the build, but this doesn't solve the fundamental problem they faced. Editor's note: when he said that on the mic, the thought that crossed my mind was "that is like micro-optimizing a bubble sort".... Andrei mentioned trying package-at-a-time compilation, which can be better parallelized and avoids many of the rougher edges, like the template instantiation bugs that plague module-at-a-time compilation. Liran said they haven't been able to try that in practice yet. I would also note that fixing the parallelizing and memory usage is slated to come somewhat soon after the move to ddmd, which should happen in a few months (see Daniel Murphy's talk from the Wednesday morning session, summarized last week).
Liran also mentioned private imports in functions which are theoretically more efficient... but in practice, a pain to use, needing to reimport everything per function instead of per module, and thus not often used by developers.
Another challenge was that many C functions that need to be inlined for good performance are not inlined when called from D. Since the D compiler doesn't see the C source, the function is always called, never inlined. To solve this, Weka ported some important C functions to D. Editor's note: link time optimization in GDC and LDC may solve this, but dmd doesn't implement it. Weka didn't use gdc and ldc due to some unique bugs with those compilers, but for many projects, they work excellently and you might try them before rewriting functions.
He also found that module constructors and destructors were of limited use in practice thanks to import cycles and ordering issues, and that using integer types smaller than int (short, byte) was painful due to an explosion of casts - value range propagation doesn't go far enough to make these both correct and convenient to use yet.
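The small-integer complaint is easy to reproduce. The two functions below are my own minimal example, not code from the talk: the first needs a cast, while the second is a case value range propagation can already prove safe.

```d
/// Adds two shorts. The operands are promoted to int, so the result
/// has to be cast back, and the cast silently truncates on overflow.
short sum(short a, short b)
{
    // short c = a + b;        // rejected: cannot implicitly convert int to short
    return cast(short)(a + b);
}

/// Extracts the low byte of an int.
ubyte lowByte(int x)
{
    // No cast needed: value range propagation proves that
    // x & 0xFF always fits in a ubyte.
    ubyte result = x & 0xFF;
    return result;
}
```

The cast in sum is what multiplies across a code base that uses short and byte heavily, and it hides real overflows just as well as harmless ones.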
Liran summarized by generally praising D: it gives them a single language for both paths, able to replace both C and Python, it has given them a huge productivity boost and they are heavily using D's unique features, and it is paying off. Their only main downside was that large real time projects could have better support... but, remember, this talk is about their success in building a large, real time project with D!
Finally, Liran said he is looking for D freelancers to help build better infrastructure and give back to the community at the same time. If you're interested, email him at firstname.lastname@example.org.
The second after-lunch speaker was David Nadlinger (aka klickverbot), talking about druntime's implementation. You can see his slides here.
David opened with an overview of the various packages that make up druntime and some of the fundamental classes such as TypeInfo and ModuleInfo. He then described exception handling, noting that it is both compiler- and platform-specific, and gave an overview of how D's mark-and-sweep garbage collector works.
At this point, the talk got more specific, discussing just how thread-local storage is implemented and a challenge in getting it to work with shared libraries - the fact that it won't necessarily have a static per-thread table indicating where the storage is found. (Shared libraries may be loaded dynamically into several different programs.)
Walter took to the mic briefly to mention that TLS globals are not actually very efficient due to these indirect lookups - local variables are far more efficient to access.
David went back to discussing how it works, including a custom TLS implementation in dmd on Mac OS X, since that operating system didn't support it natively until version 10.7. This custom implementation used functions from the rt.sections_osx module along with special linker sections generated by the compiler to store the data. The LDC implementation of TLS on OS X uses the default LLVM implementation, but with some Apple-specific extensions for GC ranges.
David also described how fibers work (including noting that TLS and exception handling implementation with fibers is harder than it sounds), then briefly described the C startup model. He made the important note that a C program doesn't quite start at main - it actually starts at _start, which is found inside the C library. (Editor's note: it can actually start at any symbol, since you can override this with a linker script, but _start is the default one and this trivia doesn't change David's point.) He cleared up a common misconception that C programs don't have a runtime: they do, that's how global constructors and destructors are called, and how the environment is set up!
Next, he got into more details on shared libraries and module registration. He described how the _Dmodule_ref system worked - a linked list of modules created by the compiler using C global constructors. (Editor's note: if you have ever played with kernel code in D, you've probably seen this before. The reason it doesn't work there, the reason the reference is null, is that you are probably skipping the C runtime initialization by writing your own _start... meaning those constructors are never run!) This is a simple system that works portably... but does not work right with shared libraries.
The shared library support on Posix needs special effort. It uses a shared druntime and needs to detect module conflicts (two different versions of the same module from different libraries)... while remaining easy to use (so custom linker scripts are out) and working around myriad linker bugs and incompatibilities. (Editor's note: he is talking about the GNU and LLVM linkers here, i.e., nothing D-specific!)
The solution in druntime now is a function called _d_dso_registry, a compiler generated call mostly written by Martin Nowak, which handles these issues.
David also talked about --gc-sections, an option to the linker which, in theory, should lead to smaller executables. This, again, ran into pain with ease of use, linker incompatibilities, and linker bugs. However, when it does work, it can result in executables from ldc about 1/4 the size of the ones from dmd. David noted that gdc's executables tend to be huge just because it adds debugging info, which can be easily stripped out.
Page 30 of David's slides has a number of good references to better understand the topics he talked about, and he mentioned that link time optimization may continue to improve going forward.
Amaury Sechet (aka deadalnix) was the next speaker and talked about memory, CPU caches, concurrency fencing, and more. His slides are here.
The first point he made is that memory is slow. It takes about 300 cpu cycles to read from main memory and this situation has hit a wall: latency happens because an electronic signal simply has to take time to travel across the circuit.
Editor's note: if you've been programming a while, you'll remember when memory was so much faster than CPUs that it'd make sense to store as much as you can there (though you didn't have infinite memory back then either!). The situation is different now: storing things in main memory is not necessarily a speedup compared to calculating it again on demand. It comes down to carefully managing the cpu cache, which is what deadalnix's talk is about.
Amaury then talked about the solution to slow memory: fast caches stored on the CPU itself, physically closer so the signal has less distance to travel. The CPU also prefetches data it expects to be used, so it is available without a wait. His slides list types of cpu cache.
Amaury listed a few big tips to help mitigate the slow memory problem, including: pack your data so it takes less memory, put what you can on the stack (which is likely to be in the hot cache), access data in a linear fashion avoiding indirections and code branching (to help the prefetcher predict the right stuff to grab), and size your data so it fits on a cache line.
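Two of these tips, packing data and accessing it linearly, can be shown in a few lines. This is my own sketch with made-up struct names, not an example from the slides.

```d
// Hypothetical record types, for illustration only.

/// Field order forces padding around the two bools.
struct Loose
{
    bool active; // 1 byte, then padding before the next field
    long id;
    bool dirty;  // 1 byte, then tail padding
}

/// Same fields, largest first: the bools share one padded slot.
struct Packed
{
    long id;
    bool active;
    bool dirty;
}

// Reordering alone shrinks the struct, so more records fit per cache line.
static assert(Packed.sizeof < Loose.sizeof);

/// Walks a contiguous array front to back: a predictable, linear
/// access pattern with no pointer chasing.
long sumIds(const(Packed)[] items) pure nothrow @nogc @safe
{
    long total = 0;
    foreach (ref item; items)
        total += item.id;
    return total;
}
```

The exact sizes depend on the platform's alignment rules, which is why the sketch only asserts that the packed layout is smaller.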
He briefly explained the memory management unit in a CPU before moving on to discussing multicore processors' interaction with memory, which took up the bulk of the talk.
He opened this part by noting that multicore processors are everywhere, including in mobile devices, and that their presence is very visible to programmers. With older processors, each faster serial chip was a benefit programmers could reap for free - the same programs would just run faster. With multicore, changing the program may be necessary to get benefits. Editor's note: though one free benefit you might see is from the operating system: your process could get a free core while the other processes the user runs are on other ones. But, indeed, to really utilize it all, you do need to do some work!
Since multicore is visible to the programmer, old languages have trouble adapting to it and newer languages need to do something to help. This tends to be enforcing semantics to keep up a single-threaded illusion for most code, limiting where memory sharing happens.
In a multicore environment, each core has its own cache, which works asynchronously. As a result, reads and writes back to main memory may happen out of order and one core may overwrite another core's write. The x86 architecture is kind to programmers in this respect, but other architectures expose you to all the gory details.
The basic idea behind cache coherency in the CPU is that a core takes ownership of a cache line, and shares it when it is not dirty. The tip that follows from this is to avoid writing to shared memory. In D, this means share immutable data and write to thread local data whenever possible. Since thread local data is the default in D, this is encouraged by the language too.
Since it works on 64 byte cache lines, if you need to share mutable memory across many threads, you may want to pad the data - make sure the shared block takes up the full 64 bytes so two shared variables in the same block don't get thrashed across different threads.
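Padding is simple to express in D. The following is a minimal sketch with hypothetical names, and it assumes a 64 byte cache line, which is common on x86 but not universal.

```d
// Assumption: 64 byte cache lines. Check your target CPU.
enum cacheLineSize = 64;

/// A counter padded out to occupy a full cache line, so two
/// counters in an array never share one.
struct PaddedCounter
{
    long value;
    ubyte[cacheLineSize - long.sizeof] padding; // unused filler
}

static assert(PaddedCounter.sizeof == cacheLineSize);

// One slot per worker thread (the count of 4 is arbitrary).
shared PaddedCounter[4] counters;
```

Note that this only controls the size of each element. For the full effect the array itself also has to start on a cache line boundary, which this sketch does not enforce.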
While x86 tries to keep the memory consistency automatically, it also cannot compromise performance too much. One type of operation, the StoreLoad memory barrier, does need to be explicit on x86. This is triggered with the mfence instruction (editor's note: which D exposes through inline assembly and, I believe, intrinsics). ARM requires you to explicitly specify all memory barriers.
Amaury then said we're all doomed, as testing for these problems is difficult... but then went into more detail about how D helps, with its thread-local default and explicit sharing, and noted that transitive immutable can also be relied upon in a multicore environment.
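The classic case that needs a StoreLoad barrier is a store to one location followed by a load from another, as in the sketch below. It is my own example with hypothetical flag names, and it uses druntime's core.atomic rather than inline assembly. I am assuming atomicFence compiles to a full barrier on your target.

```d
import core.atomic : MemoryOrder, atomicFence, atomicLoad, atomicStore;

// Hypothetical flags for two cooperating threads, A and B.
shared bool wantsA;
shared bool wantsB;

/// Thread A announces its intent, then checks whether B did the same.
/// Returns true if A saw no competing request.
bool tryEnterA()
{
    atomicStore!(MemoryOrder.raw)(wantsA, true);  // store ...
    atomicFence();                                // ... full barrier ...
    return !atomicLoad!(MemoryOrder.raw)(wantsB); // ... then load
}

/// Mirror image of tryEnterA, for thread B.
bool tryEnterB()
{
    atomicStore!(MemoryOrder.raw)(wantsB, true);
    atomicFence();
    return !atomicLoad!(MemoryOrder.raw)(wantsA);
}
```

Without the fences, each core may satisfy its load before its own store becomes visible to the other core, so both functions could return true at once. With them, at most one caller can see the other's flag still unset.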
shared is kinda hard to use in D... but that's a good thing because sharing is hard to use in any case, so making it look easy would be a leaky and probably wrong abstraction. It ought to be used with care.
The next topic was the garbage collector and how it needs more work to be really multicore friendly, as well as the API between the druntime and the compiler.
Amaury's project, sdc, a rewrite of the D compiler, has a proof of concept inspired by jemalloc to make a thread-local heap. It uses write barriers to help with gc pauses, but it also uses more memory than the stock D gc. The key to making it good is to generally only share immutable values...
...but, there are still a few other problems with it: an immutable delegate might have a mutable context pointer in D, exceptions may cross thread lines when a thread is terminated (the Thread.join called from the parent may return a thrown exception that killed a thread), and the result of a pure function can be promoted to immutable after allocation, so how would the GC know to allocate it in the immutable heap?
He proposed some possible solutions: make a generational gc for immutable info, add better ref escape checks so that non-escaping allocations can be freed automatically or moved to the stack where possible, and improve the inliner to get the maximum effect from these other ideas.
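The pure function case is worth seeing in code, since it looks so innocent. This is a small example of my own, with hypothetical function names.

```d
/// A pure function that allocates and fills an ordinary mutable array.
int[] makeTable(size_t n) pure
{
    auto table = new int[](n); // allocated as plain mutable memory
    foreach (i, ref slot; table)
        slot = cast(int)(i * i);
    return table;
}

/// The caller freezes the result after the fact.
immutable(int)[] squares()
{
    // Implicit conversion: makeTable is pure and takes no mutable
    // references, so nothing else can alias the returned array.
    immutable(int)[] frozen = makeTable(4);
    return frozen;
}
```

At the time new int[](n) runs, nothing tells the allocator that the memory will end up immutable, which is exactly the difficulty for a GC that keeps a separate immutable heap.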
Finally, we closed out Wednesday with a long ask us anything segment, where people grilled Walter and Andrei.
The following are my notes from the session, roughly formatted. It is NOT a transcript and likely includes some mistakes by me, but should give you the gist of the discussion.
The remaining days of DConf will be summarized in later issues, and we'll revisit links to videos once the professionally recorded files are made available. Keep reading next week!
A new page has been added to the D Wiki listing open D jobs. Take a look if you're interested, and add yours if you know of one that is available!
See more at digitalmars.D.announce.