This Week in D June 28, 2015

Welcome to This Week in D! Each week, we'll summarize what's been going on in the D community and write brief advice columns to help you get the most out of the D Programming Language.

The D Programming Language is a general purpose programming language that offers modern convenience, modeling power, and native efficiency with a familiar C-style syntax.

This Week in D has an RSS feed.

This Week in D is edited by Adam D. Ruppe. Contact me with any questions, comments, or contributions.

Statistics

Next DMD release

A beta for dmd 2.068 was released this week.

DConf 2015

The videos are starting to be released. They can be found on UVU's YouTube channel and on the DConf website talk pages.

When they are all finished, I'll make a list here, too.

Open D Jobs

A page has been added to the D Wiki listing open D jobs. Take a look if you're interested, and add yours if you know of one that is available!

In the community

Community announcements

See more at digitalmars.D.announce.

Significant Forum Discussions

std.experimental.color, request reviews introduces a new Phobos proposal for a color module. The thread also includes a discussion of package layout: the old assumption that package.d should simply import everything in the package seems to be falling apart in actual usage, where it is more valuable for it to expose a common subset of the package for typical users.

See more at forum.dlang.org and keep up with community blogs at Planet D.

Community Interview

Joakim conducted another interview this month and contributed it to This Week in D, this time with Dmitry Olshansky. The following is their exchange, text provided by Joakim.

Dmitry Olshansky is the author of two of the showcase modules of the D standard library, Phobos: std.regex, which provides regular expressions, and std.uni, which provides Unicode algorithms.

Q: How was D particularly suited for std.regex or unsuited? I mean how much did the language help you versus holding you back?

A: One thing about D is how well suited it is for parsers, with its slices. Other advantages are the built-in provisions for Unicode character types and powerful yet accessible meta-programming. It's hard to overestimate the effect of easy templates in D: we don't even call them templates, just compile-time parameters. Taking a peek at some other engines, they either stick with UTF-8 or UTF-16. With D and a bit of templates, I managed to get all three character widths with unified code. The sheer size-to-power ratio is also quite good: ~7K lines for 3 character widths and 3 engines, plus unittests for all of them, vs. ~23K+ lines of C for the newest Python regex module.

How would I do that in, say, C? Note that adding compile-time regex was a relatively small amount of code. With the power of "static if," code reuse in generic primitives becomes trivial. This is something even C++ is not capable of, not without a disproportionate amount of effort. If I were to name the key enabling feature in D, it must be "static if".
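
To illustrate the kind of reuse "static if" enables, here is a minimal sketch (not std.regex's actual code): one generic function serves all three D string widths, and only the branch matching each instantiation is compiled in.

```d
// Hypothetical example: one template covers char, wchar, and dchar inputs.
string widthOf(Char)(const(Char)[] input)
{
    static if (Char.sizeof == 1)
        return "UTF-8";
    else static if (Char.sizeof == 2)
        return "UTF-16";
    else
        return "UTF-32";
}

void main()
{
    assert(widthOf("hi") == "UTF-8");   // string  = immutable(char)[]
    assert(widthOf("hi"w) == "UTF-16"); // wstring = immutable(wchar)[]
    assert(widthOf("hi"d) == "UTF-32"); // dstring = immutable(dchar)[]
}
```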

The ugly side of D is quality of implementation. Things get better at a steady pace, but something more radical has to happen with CTFE. CTFE is still slow, leaks memory, and is otherwise unfit for serious work. I still run the test suite for static regex by splitting it into four separate runs so as not to run out of 4 GB of memory. In the same vein, std.regex is developed against the latest DMD, yet for performance-sensitive code I need to compile with LDC or GDC, which lag behind DMD. The mismatch of frontends is a constant nuisance when benchmarking std.*.

The other problem in D is leaving too much to optional optimizations. For instance, I know some tricks to get a function inlined in C++; the dance might be different for different compilers, but it does work. D simply has no facility that would guarantee that something is inlined, even when I *do* know better. @force_inline seems to be an accepted enhancement, and I hope to see it added soon. (Update: Dmitry notes that it has since landed in the master branch as "pragma(inline, true)".)
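
Once available, the landed form looks like this (a short sketch; pragma(inline, true) requests that the function always be inlined rather than leaving it as an optional optimization):

```d
// pragma(inline, true) asks the compiler to always inline this function.
pragma(inline, true)
int addOne(int x) { return x + 1; }

void main()
{
    assert(addOne(41) == 42);
}
```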

Closer to where it hits std.regex: tail calls. If I could enforce tail calls, the interpreter's dispatch could be made roughly three times faster by leveraging the so-called "threaded code" scheme. Since a tail call is treated as an optional optimization, one may not rely on it happening in debug builds, nor is there any way to force it. However, if a tail call does not happen, the threaded-code interpreter will just overflow the stack and crash. Thus it affects semantics, but I don't think I've convinced Walter that this is a problem. There is still a way to do a poor man's JIT and get direct threaded code, but that is messy and unportable.
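
To make the stack-growth point concrete, here is a toy dispatcher in the "each handler calls the next" style (a hypothetical sketch, not std.regex's interpreter). The final call in opAdd must become a tail call for the scheme to run in constant stack space, and D does not guarantee that:

```d
// Toy bytecode: opcode 0 adds the following operand to acc, opcode 1 halts.
alias Handler = void function(size_t pc, const(int)[] code, ref int acc);

void dispatch(size_t pc, const(int)[] code, ref int acc)
{
    Handler h = code[pc] == 0 ? &opAdd : &opHalt;
    h(pc, code, acc);
}

void opAdd(size_t pc, const(int)[] code, ref int acc)
{
    acc += code[pc + 1];
    dispatch(pc + 2, code, acc); // hopefully a tail call; not guaranteed in D
}

void opHalt(size_t pc, const(int)[] code, ref int acc) {}

void main()
{
    int acc = 0;
    dispatch(0, [0, 5, 0, 7, 1], acc); // add 5, add 7, halt
    assert(acc == 12);
}
```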

Q: What was the thought process for the design of std.regex in the beginning and how did it evolve over time? Take us a bit through the evolution of how you got to the design of std.regex today, including any war stories you have in mind.

A: The central idea was to try multiple strategies and engines, to see which one would win on the common benchmarks. The two candidates were:

- backtracking (nearly everybody uses that), which has the dreaded exponential worst-case complexity
- Thompson NFA, which has stable performance on any pattern

The original idea was to later add other strategies. In particular, I wanted to generate Deterministic Finite Automata (DFA) at compile time; they would be exceptionally fast but quite limited with respect to features (not even sub-matches, ouch).

One thing I knew for sure was that I was going for a universal flat internal representation that would be enough for both, yet avoid redundancy and be suitable for optimizer passes, should they come later. Essentially, it can be viewed as bytecode for a very specific machine, and the whole idea is a VM approach.

One remarkably bad design decision I recall was the way I handled look-behind in regex. The idea was to execute the bytecode of a look-behind backwards, on a range that is lazily reversed. The bytecode supported bidirectional traversal, so it looked like a neat idea. It turned out to require duplicating the main loops of all engines (counting the compile-time one), a large maintenance burden. A trivial solution was executed only a year or two later: just swap the bytecode in place during the pattern compilation step and execute it normally.

Interestingly, getting 2 interpreters of bytecode off the ground (including the duplication) was quite easy compared to implementing what the Unicode standard calls "basic Unicode support, level 1". I had absolutely no idea how much extra work that would turn out to be. Once I scrolled down http://unicode.org/reports/tr18/, I knew I was in for quite some work. During the two-week rush to get it all done and fast, I managed to make a few lucky guesses, it seems. A year later with std.uni, it took me a few weeks of tinkering to get performance on par with the brutally simple two-stage tables used in std.regex.
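
For readers unfamiliar with the two-stage tables mentioned here, a toy version for ASCII letters shows the idea (real Unicode tables are far larger, but the savings come from the same trick: blocks with identical classifications share one second-stage row):

```d
// Stage 1 maps each 16-code-point block to a row of stage 2; identical
// rows (here, the three distinct patterns) are stored only once.
immutable bool[16][3] stage2 = [
    [false, false, false, false, false, false, false, false,
     false, false, false, false, false, false, false, false], // no letters
    [false, true,  true,  true,  true,  true,  true,  true,
     true,  true,  true,  true,  true,  true,  true,  true ], // A-O / a-o blocks
    [true,  true,  true,  true,  true,  true,  true,  true,
     true,  true,  true,  false, false, false, false, false], // P-Z / p-z blocks
];
immutable ubyte[8] stage1 = [0, 0, 0, 0, 1, 2, 1, 2];

bool isAsciiLetter(char c)
{
    // Two dependent loads instead of one big flat table per character class.
    return c < 0x80 && stage2[stage1[c >> 4]][c & 0xF];
}

void main()
{
    assert(isAsciiLetter('a') && isAsciiLetter('Z'));
    assert(!isAsciiLetter('0') && !isAsciiLetter('@'));
}
```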

Lastly, static regex got some hype, as it managed to beat V8's regex and Google's RE2 on the regex-dna benchmark, but it was essentially hacked together in about a week, plus or minus pauses between compiler fixes. I recall Andrei saying, closer to the end of the GSoC period, "You know, it'd be cool if we could get that compile-time regex done."

The parser was (sort of) working with CTFE in a few days, by carefully side-stepping D frontend bugs. The compile-time engine itself was a code generator that walked the same bytecode and spat out D code for that pattern; that source was then mixed in, with a bit of "static if", to the back-tracking engine. The end result turned out to be a great deal faster, unless it hit a weakness of backtracking. In the end, it turned out that UTF decoding had become the bottleneck, and that is soon to be removed. The original idea of generating a static DFA sadly went nowhere, though I think it might be great for a certain class of patterns.

Q: Having contributed large modules to phobos, what do you like or dislike about the phobos review process?

A: std.regex is hardly representative of today's process. It's been 3+ years, and std.regex was among the few to get a relatively swift review. Being exceptionally good on a popular benchmark also sped things up. Still, as with any process, the good part is the extra scrutiny spent on every minor detail and a significant push for better tests, documentation, annotations, style, proper benchmarks, etc. It may seem like the last 10%, but it can take 50% of the whole effort. For instance, the std.uni review was a superb experience with respect to raising quality, particularly of the documentation.

On the minus side, we don't have enough experts to cover every domain, so big issues are rarely discussed unless they are common knowledge. To wit, std.logger took so looong to get even to an experimental stage because everybody had their own idea of what logging should look like, and presented real issues and matters of taste in swaths.

Overall, going for the standard library is hard and once something lands there it should be considered almost written in stone, API-wise. So for anything new, I'm sticking to the dub repository as a good testing ground.

As part of std, you get to make sure your stuff works on all platforms, some of which you might never use yourself; indeed, std.regex uncovered a compiler bug specific to OS X. Needless to say, I had no burning desire to buy a Mac just to debug and fix it, so it was an intriguing session of wild guesses, with exchanges of printouts over email.

Q: What D code have you written outside of std.regex and how was the experience?

A: Well, back in 2010 I was so fond of D2 that I overlooked its alpha quality and did my research project in it.

The subject was yet another solution to the so-called SLAM problem, which stands for Simultaneous Localization and Mapping: a process whereby a mobile robot updates its knowledge of its surroundings and then its location within them. I have to say fighting bugs in the compiler and the OpenGL bindings wasn't quite pleasant, but the thing (an emulator, a mapper, and a controller for the real robot) did work, and I published a (mediocre) paper on the method.

Since std.regex, I write either libraries or tiny scripts. Part of the reason is that I need a lot of missing building blocks, or I simply have to do it in some other language. Secondly, I've found joy in building tools (libraries) to see what others can achieve with them, and IMO D has what it takes to create awesome libraries.

Another sizable library I did was a spin-off of std.regex, which is now known as std.uni. Again designed during GSoC 2012, it admittedly didn't enjoy any of std.regex's popularity; I think that might be because Unicode is the kind of stuff nobody wants to know deeply or be bothered to use beyond the basics. Still, I'm quite proud of the packed trie data structures for character classification that I implemented in there.

I built a bunch of text processing utilities in D, with heavy use of std.regex. Among recent stuff, I published a tiny tool called gchunt, which post-processes the compiler's -vgc output and then "parses" D code backwards to highlight functions that contain GC usage and produces a wiki page with a kind of checklist for the developer.

Q: Please tell us about yourself: who you are and where you're from, what programming languages you used before D, and take us from your experience first discovering and using D to getting involved with std.regex, specifically what it was about redoing regex that attracted you.

A: I'm a Russian D enthusiast, a postgraduate student, and a CS researcher (~hacker) with broad areas of interest, the highlights being AI techniques, robotics, compiler construction, parallel and concurrent programming, networking, and distributed systems. My BSc is in applied physics and math, even though I started programming as a kid and have enjoyed it ever since. Later on, I finally switched over and completed a CS MSc.

I think the first remotely interesting programs I wrote were in Turbo Pascal for DOS; I also messed with x86 assembly. Then came C on FreeBSD at my school, the first capable program being a full-fledged archiver with a couple of algorithms: LZW or BWT + arithmetic coding.

The first language I decided to truly master was C++. Oh boy, what a choice, but being young, stubborn, and stupid has its benefits. Looking back, I had to learn the low-level stuff: stack, heap, OS interfaces, and what a dangling pointer feels like. At the same time, I came to appreciate the power of good abstractions - ADTs, RAII, operator overloading, templates and generic programming, separation of concerns as in e.g. the STL's algorithms + iterators combo.

Around the same time, I played with Python (2.3 I think) and loved its easy exploratory nature, yet despised the lack of low-level access to the OS and all the magic built-ins it relied on.

My first criterion for my language of choice was that it be expressive enough to let the programmer extend existing primitives and implement new ones with usability on par with the built-ins. The second was that it be low-level enough to efficiently implement itself - or, more generally, leave no room below itself for anything but assembly. The third was that the language must help the programmer set constraints (types or otherwise) to catch most "unforced" mistakes before runtime.

I think I first stumbled upon D around 2007, and the impression I got was of a native Java/C# wannabe, so I summarily dismissed it. It wasn't until 2010, following a year or two of heavy (ab)use of modern C++ design, that I finally got fed up with C++. At that time, D2 was hot in the forge, bubbling with all manner of cool features. Frankly, I think Andrei Alexandrescu's work on ranges, featured on the front page, is what got me sold.

The first thing I tried was to port my toy C++ 2D game engine to D, it went surprisingly well. I cut the code size by half, the header files were gone, and there was a feeling of succinctness about the end result that I especially loved. That toy stuff is still somewhere on my drive, though it would hardly compile with the latest D compiler. I think it broke on ~DMD 2.049. ;)

Closer to the end of 2010, I got involved in the newsgroup discussions, having previously only read the archives. Following my interest in compilers at university, I dug up DMDScript - a JavaScript interpreter written in D - first ported it from D1, then hacked on it to improve its correctness to the max. That work was merged upstream, and it compiles with the latest D, though it's hardly a match for modern JS engines:

https://github.com/DigitalMars/DMDScript/commits/master

My passion for writing parsers/lexers hadn't burned out, and in 2011, as D migrated to GitHub, I picked up std.regex. It was in a remarkably sorry state: not only did it have problems with Unicode, but the logic inside surprisingly wasn't quite regular expressions at all. I started fixing it and even pushed in a few patches that made it that much closer to being correct. Then I came to the realization that it'd be simpler to just scrap it and write my own grand thing that's better, faster, harder - perhaps as a stand-alone library.

Around the same time, D got accepted into Google's Summer of Code (GSoC). I had a wild thought: I'm an MSc student, I love D, and I'd love to experiment with regex engines. So I wrote a GSoC proposal to replace the std one, presenting the benefits as best I could. It was a bit of a gamble, and I certainly had no idea what I was getting myself into. Following an interview, I was chosen for one of the 3 slots dlang had. And there I was, coding my open-source thing in my language of choice and earning some bucks in the process. The whole situation felt too good to be true.

That GSoC was a particularly fantastic experience for me, largely thanks to Fawzi Mohamed, my primary mentor. Our fruitful discussions preceded most of the key decisions in the current std.regex design. Not to diminish the work of others - all in all, I believe we had a brilliant team of mentors, working with them added a thrill to the process.

Q: Do you make money writing D? What do you do for a living?

A: I might be one of the first guys to earn some coin with modern D (D2), due to the Google sponsorship during GSoC. Getting back to real life, I haven't done anything in D that would cover any of my bills. In a way, D affected all of my work, being a constant source of inspiration for good design.

At the moment, I'm working as the CTO of a tiny web-app startup that has yet to launch its product into the wild. A shaky position, to say the least, and full of challenges, especially when combined with PhD studies. Of course, I've sneaked some D into the camp in the form of a few tools here and there, but by and large the stack is Scala/Spray for REST services + Python/Django for the user interface.

Why not D? The reasons are many; for starters, the performance of D's HTTP app servers is neither as good nor as stable as that of mature JVM ones. Even leaving performance aside, the absence of memory leaks and heap/stack corruption is another good side of the JVM that we wholly enjoy.

Q: What's left for std.regex?

A: There are many aspects to highlight. I'll paint a broad picture with a few examples.

Capabilities:

- Ability to use different alphabets, including user-defined character types

- Adding custom atoms, that is, being able to save, say, [a-zA-Z_][a-zA-Z0-9_]* as id and reuse it in other patterns as e.g. {:id}

- Matching multiple patterns at once, to make it suitable as a quick-and-dirty lexer

- More convenience, e.g. a scanf-style primitive that uses patterns instead of %* specifiers

- An API for building regex patterns without messing with strings and the parsing thereof
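
For reference, this is roughly where the current std.regex API stands before any of the additions above (basic runtime usage; ctRegex!`...` is the compile-time variant):

```d
import std.regex;

void main()
{
    auto re = regex(`([a-z]+)([0-9]+)`); // pattern compiled at runtime
    auto m = matchFirst("abc123", re);
    assert(m[0] == "abc123"); // whole match
    assert(m[1] == "abc");    // first capture group
    assert(m[2] == "123");    // second capture group
}
```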

Optimizations: there are plenty, including something I've never heard of being implemented elsewhere. The key one is to remove UTF decoding and match directly on the encoded chars; this required significant groundwork in std.uni, which is now done.

A more accessible example would be what I call a fence-post optimization. In a nutshell, for each contiguous set of braces, it suffices to only store one index. For example, in (([a-z]*)([0-9]*))([a-z]*), it's sufficient to save only 4 indices internally (fence posts). This would save a lot of memory, making the rest that much hotter in the CPU cache. The mapping between these 4 indices and the 5 start-end pairs (counting the whole match) for groups is required only at the very end, and thus may be done with a lookup table at virtually no cost.
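
A sketch of the bookkeeping this implies (hypothetical, not std.regex's internals): the engine records only the boundary positions, and a pattern-derived table recovers each group's (start, end) pair at the very end. For (([a-z]*)([0-9]*))([a-z]*) matched against "abc123xyz":

```d
// The four fence posts separating "abc" | "123" | "xyz" in "abc123xyz".
immutable size_t[4] posts = [0, 3, 6, 9];

// For each group (0 = whole match, 1-4 = capture groups), which posts mark
// its start and end. This depends only on the pattern, so it can be built
// once, at pattern-compilation time.
immutable size_t[2][5] groupMap = [
    [0, 3], // whole match   "abc123xyz"
    [0, 2], // group 1       "abc123"
    [0, 1], // group 2       "abc"
    [1, 2], // group 3       "123"
    [2, 3], // group 4       "xyz"
];

size_t[2] group(size_t n)
{
    return [posts[groupMap[n][0]], posts[groupMap[n][1]]];
}

void main()
{
    assert(group(0) == [0, 9]); // 4 stored indices yield all 5 pairs
    assert(group(3) == [3, 6]);
}
```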

Meanwhile, the architecture of std.regex has to be improved; it has accumulated some hacks and necessary evils. Also, I believe it could be split into clean, composable pieces that an advanced user of the library could use to construct and fine-tune their own flavor of the engine.

I'm still considering going for a universal PEG parser before putting all that effort into std.regex, as I have a good hunch that it's both simpler for the end-user and more powerful. And it can be made exceptionally fast with CTFE, which is intriguing...

Q: What do you love about D? Hate? This is more general than the specific std.regex-related question above.

A: It's more of an addiction than love. And I suffer from withdrawal when programming in other languages. ;)

While doing some work in C++11 recently, I found that I'm completely hooked on the many minor comforts D brings to the table: clean polymorphic lambdas, simple fixed-size integer types, _ in numeric literals as in 1_000_000, default initialization, UFCS, a clean syntax for templates, nested functions, built-in strings. Each of these in isolation is minor and doesn't enable anything new, but combined they translate to big time savings, especially on my nerves. Another part of the addiction is that in D the boilerplate to just get something running is very low, much like in any scripting language out there: a bunch of "import this and that" and you are ready to roll. I'd even say I'm more productive with D than with dynamic languages, even on small programs, _because_ of type safety.
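
A few of those comforts in one trivial sketch:

```d
import std.algorithm : filter, map, sum;
import std.range : iota;

void main()
{
    auto million = 1_000_000; // underscores in numeric literals
    int untouched;            // default-initialized to 0, never garbage

    // UFCS chaining with clean lambdas: sum of the even squares below 100
    auto total = iota(10).map!(x => x * x).filter!(x => x % 2 == 0).sum;

    assert(million == 1000000 && untouched == 0 && total == 120);
}
```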

The other part is that I love things that go meta, as in generic programming, meta-programming, and, naturally, code generators. Striving to express concepts concisely is in our human nature, much as getting bored of pointless repetition. D does deliver lots of tools to produce high-quality, concise code that performs well, be it simple array slices, mixin templates, or the uber combo of UDAs + CTFE + mixins. Many great library designs are enabled by unique D features like opDispatch; multiple subtyping with alias this is bound to enable even more.

Hate... There is not much I really "hate," but there are some decisions in D's design that I do not believe carry their weight or are even good ideas. But I'll leave out the minor stuff and matters of taste and focus on the big-picture matters, with concrete examples.

The big theme is the schizophrenic nature of certain language features (partly due to "forced" evolution since the D1 days).

For instance, a class is a reference type, yet it cannot have a mutable reference to an immutable payload, unlike a pointer. Generally, immutable and classes don't mix well, and immutable and struct postblits don't mix well. Similarly, there is a whole synchronized statement dedicated to synchronizing on an object's monitor field, keeping in mind that objects are thread-local by default in D2 anyway(!). Some stuff even the compiler seems to be confused about, e.g. static opCall vs. constructors. There is opApply (internal iteration) vs. the range API (external iteration), which simply don't mix. Template constraints versus specialization:

func(T)(T a) if(is(T : int)){ ... } versus func(T:int)(T a) { ... }

Then there is inout, a wildcard for type qualifiers, aiming to solve the mutable/const/immutable code-bloat/boilerplate problem for virtual functions (a narrow goal for a new qualifier). The end result (to me) is that it wreaks havoc both in templates instantiated with inout(T) types and when it occurs more than twice in an expression:

struct C {
    auto func(A, B)(inout(A)[] a, inout(B) function(inout(B)[]) fn)
    inout { // which qualifier - A's or B's - is copied here? Is it the same? I can add even more inouts
        struct X(T) { T value; }
        X!(typeof(a[0])) x;  // boom, something is simply not defined to work (yet?)
        return x.value;
    }
}

And that's without bringing delegates into the mix, another pre-D2 feature that has unresolved problems with type qualifiers. TLS by default in D2 is muddied by the GC kicking in to call dtors for your objects, adding races to your originally race-free code (that is fixable in the D runtime, though).

Another of my issues with D is the rigidity of its fixed-function features: very few of the syntactic forms are extensible by the user. The good guys are "if," which works with any type that has opCast(T : bool), and "foreach," which works with any range or anything that can be sliced to get a range. On the other hand, both kinds of "switch" are limited to built-in types, and closures are fixed to the GC allocator. The same goes for associative arrays: fixed, while it would be better if they were pluggable. "new" is probably the most disappointing keyword, as it just means "allocated by the GC," a non-extensible notion; "delete" even more so, but that is to be phased out. And there is the fixed-function RTTI with TypeInfo and Object.factory, which one cannot opt out of (again D1-originated).

Finally, let me close by saying that some of the above is fixable in the runtime or the language. When I have truly hated something about D, it has usually been a compiler bug that prevented my next shiny library artifact from working as designed. It's very frustrating to have something perfectly modeled in D, yet not working because of some defect in the compiler's implementation. :)

Learn more about D

To learn more about D and what's happening in D: