MLIR was so close

I first heard about MLIR when attending a workshop on quantum computing software in 2022. Multiple talks mentioned how they used MLIR to represent quantum programs and I was hyped immediately. What they described sounded strikingly close to a platform architecture I had been hoping to build one day, but this was fully realised, with institutional backing, and growing industry adoption! I hoped that MLIR could have been an implementation-independent standard to develop, compose, reuse, research, and deploy compilers that span all levels of abstraction, from high-level transformations to hardware-specific instructions. In that vision, you would automatically get robust and mature tooling and could integrate your experimental dialects seamlessly with solidly maintained infrastructure.

Unfortunately my excitement faded when I had a closer look. MLIR is built upon a set of brilliant ideas and it is an impressive feat of engineering. Some of its stated goals overlap significantly with what I wished for, but its demonstrated goals seem to have been more immediately pragmatic. So it ended up being closer to an extensible LLVM than to the general platform for compiler research I had hoped for.

More importantly, MLIR has in practice acquired a second role that is now hard to avoid: it increasingly functions as the common reference point against which new IRs, dialects, compiler tools, and research systems are defined. Once a system occupies that position, its implementation choices begin to shape the entire ecosystem. The criticism in this post is aimed at MLIR in that de facto role.

One Implementation To Compile Them All

Let’s think about how we would build something like MLIR. We would reasonably start by specifying a data model: what are operations, how do they compose into dataflow and control-flow regions, how are their inputs and outputs connected, what is the type system, and how can metadata be attached to operations? We’d then design a text format and a binary format with which the data model can be serialised. The text format is there primarily so that humans (or LLMs nowadays) can write and inspect IR snippets. The binary format is optimised to efficiently store and load large programs. We would then pick our favourite programming language and implement the data model, a parser, a pretty printer, a serialiser, and a deserialiser.
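To make this concrete, here is a deliberately minimal Python sketch of such a data model with a generic serialiser. All names are hypothetical and the text format only loosely imitates MLIR's generic form; the point is that the data model comes first and the implementation language stays out of it:

```python
from dataclasses import dataclass, field

@dataclass
class Value:
    name: str   # SSA name, e.g. "%0"
    type: str   # e.g. "i32"; a real system would model types structurally

@dataclass
class Operation:
    name: str                 # fully qualified, e.g. "arith.addi"
    operands: list            # Values consumed by this operation
    results: list             # Values produced by this operation
    attributes: dict = field(default_factory=dict)   # attached metadata
    regions: list = field(default_factory=list)      # nested control flow

def to_text(op: Operation) -> str:
    """Serialise one operation in a generic, dialect-agnostic text format."""
    args = ", ".join(v.name for v in op.operands)
    in_tys = ", ".join(v.type for v in op.operands)
    out_tys = ", ".join(v.type for v in op.results)
    return f'"{op.name}"({args}) : ({in_tys}) -> ({out_tys})'

a, b = Value("%0", "i32"), Value("%1", "i32")
add = Operation("arith.addi", operands=[a, b], results=[Value("%2", "i32")])
print(to_text(add))  # "arith.addi"(%0, %1) : (i32, i32) -> (i32)
```

Nothing here depends on Python specifics; the same model could be written down as a neutral specification and implemented in any language.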

This does not imply a strict waterfall process by which we would only start writing code once we have figured out a spec. In practice there will be a process of discovery in which all parts of the system are adjusted iteratively. Conceptually, however, the implementation remains downstream from the data model. While we would of course make sure not to design a specification that is impossible to implement, we would prevent concepts from the implementation (such as the syntax of the implementation language or its memory management strategy) from bleeding into the specification of the IR.

A useful test is to ask whether it would be straightforward to write the compiler in a different programming language. Even if there will only ever be one implementation of the compiler (as appears to have been the idea for MLIR), this test is valuable as it warns us when we are choosing easy over simple. No doubt it will be a little bit quicker now to borrow aspects of the host language’s type system, but this will inevitably lead to an IR that is more difficult to work with in the future. Realistically, our system will have APIs in at least two programming languages anyway: one systems programming language (C, C++, or Rust) and the inevitable Python bindings.

MLIR appears to have been built exactly the other way around. The C++ implementation is the source of truth. On the website the MLIR Language Reference appears secondary to pages describing implementation details. New dialects are specified directly in C++ code. Since that was recognised as too tedious, you can also use the C++ code generation tool TableGen. Traits of operations (such as side effect profiles or constraints) are specified via the C++ inheritance hierarchy. The declarative subsystems all rely heavily on C++ escape hatches. All of this ties MLIR as a language not only to C++ but also to concrete implementation details of the compiler.

I understand that the mission of MLIR was to concentrate efforts so that we don’t have to reimplement the same SSA infrastructure over and over. Even the possibility of an alternative implementation seems contrary to that goal, and so one could argue that it works exactly as intended. But instead of a grand unification around MLIR, I have seen the IR ecosystem continue to fragment. In the cases that I’ve observed directly, this is due to MLIR presenting a huge complexity cliff. Because there is only one implementation, there can only be one choice of tradeoffs. You can’t start out by prototyping with a slower but simpler version and gradually adopt more complexity.

This difficulty led to the xdsl project: a Python-based compiler toolkit for prototyping and research that aims to be compatible with MLIR. Their paper [1] describes the need for such an alternative implementation and calls it a “sidekick compiler framework”. To achieve compatibility, the developers of xdsl had to reverse engineer an implicit specification from MLIR’s codebase and now have to continuously track the behaviour of the canonical C++ implementation. Some subsystems had to be replaced with new language-independent or Python-centric abstractions. For example, the TableGen format for specifying MLIR dialects is tightly coupled to C++, so xdsl introduces an alternative description language called IRDL [2]. But even IRDL relies on including C++ code as an escape hatch:

irdl.type @op_with_attr {
  %0 = irdl.c_pred "::llvm::isa<::mlir::IntegerAttr>($_self)"
  irdl.parameters(%0)
}

The ecosystem now contains multiple partially compatible representations of the same ideas. This does not demonstrate that MLIR is easily portable. Rather, the amount of effort that has been poured into this shows that there is a strong need for an IR infrastructure ecosystem that allows for a diversity of implementations with different tradeoffs. Without an implementation-independent specification, this ecosystem will rely on reverse engineering and observational equivalence with MLIR, or remain a collection of isolated ad-hoc IRs that MLIR sought to unify.

xdsl picked ease of use, teaching, and experimentation over performance, but I can think of other sets of tradeoffs that would be very worthwhile to explore in alternative implementations. The Zig and Carbon projects have demonstrated that data-oriented design techniques can be a good fit for compilers. By borrowing tricks from column-oriented databases, these compilers achieve impressive speedups by aligning with the realities of modern hardware. Consider e.g. this talk on the design decisions for the Carbon compiler. Could these ideas be used to develop an implementation of MLIR that is much faster than the current one? The analogy with databases could be embraced even further: a substantial part of many compiler datastructures (e.g. the def/use and use/def chains) looks very similar to database indexes, and much code is tasked with incrementally maintaining those navigational structures. What if we based an MLIR implementation entirely on an in-memory database? Could we use known techniques for incremental view maintenance to automatically derive incremental versions of analyses? I find these questions fascinating, and I wish the design of MLIR were less hostile to alternative implementations so that they could be explored much more easily.
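As a toy illustration of the database analogy, here is a sketch (entirely hypothetical, in Python for brevity rather than performance) of operations stored column-wise in the struct-of-arrays style, with the use/def chain maintained the way a database maintains an index over a column:

```python
class OpTable:
    """Operations stored column-wise (struct-of-arrays); the `uses`
    map plays the role of a database index over the operand column."""

    def __init__(self):
        self.names = []      # column: operation name
        self.operands = []   # column: tuple of value ids consumed
        self.results = []    # column: tuple of value ids produced
        self.uses = {}       # index: value id -> ids of ops using it

    def add(self, name, operands, results):
        op_id = len(self.names)
        self.names.append(name)
        self.operands.append(tuple(operands))
        self.results.append(tuple(results))
        for v in operands:   # incrementally maintain the "index"
            self.uses.setdefault(v, []).append(op_id)
        return op_id

ops = OpTable()
c0 = ops.add("arith.constant", [], [0])
c1 = ops.add("arith.constant", [], [1])
s = ops.add("arith.addi", [0, 1], [2])
m = ops.add("arith.muli", [0, 2], [3])
print(ops.uses[0])  # value 0 is used by the add and the mul
```

In a data-oriented implementation these columns would be flat arrays of integers, cache-friendly to scan; and an incremental view maintenance engine could in principle derive the `uses` bookkeeping automatically instead of it being hand-written as above.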

Open-Ended Syntax Extensions

MLIR is explicitly not intended to be used as a source language for end-users to write kernels in. This is a great design decision, since it allows the language to be designed in a way that is more adapted to tooling and infrastructure than to the preferences of human programmers. It therefore came as a surprise to me that the MLIR text format is heavily extensible. Every operation, type, and attribute can specify its own special grammar and provide a custom parser and printer via C++ code.

This is used to implement syntactic sugar which admittedly is very convenient if you write or read the IR manually. Even if the IR is not meant for end-users, the ergonomics of the text syntax are important for debugging and demonstration purposes. But the arbitrary extensibility implemented by MLIR comes at the cost of making tooling much more complex and some types of tools outright impossible.

Operations always admit a generic format, but the ecosystem emphasises the custom formats so heavily that supporting them is pretty much mandatory. Attributes and types also have a generic format: opaque strings for the text format or opaque blobs for the binary format. If your instance of MLIR was not compiled with all dialects that appear in an IR text file, it cannot in general even tell where one operation ends and the next begins.

Tree-sitter grammars for MLIR include the standard dialects, but necessarily fall apart the moment any dialect is used that hasn’t been baked into the grammar. The same applies to the LSP server, which requires you to compile a custom binary with your dialects. The binaries that are shipped with your package manager can be used to perform optimisations on MLIR files only if they are restricted entirely to upstream dialects, since otherwise they cannot even parse your .mlir files.

Of course, general tooling would be unable to serve entirely dialect-specific functions without being aware of the dialects that it is dealing with. However, the IR could be designed in a way that degrades gracefully. For instance, the entirely arbitrary syntax for dialect attributes does not allow an LSP server to detect a reference to a symbol within an attribute of an unknown dialect; to the LSP the attribute is just an opaque string that cannot be inspected any further. If attributes at least followed a common tree structure, the LSP could dig past the unknown parent nodes and find the symbol reference hidden within.
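To illustrate, here is a hypothetical sketch of what graceful degradation could look like: attributes share a common tree structure, and a dialect-agnostic tool can recover symbol references even when the enclosing node kinds belong to a dialect it has never seen. The node kinds and field names here are all invented:

```python
def find_symbol_refs(attr):
    """Collect symbol references from an attribute tree, even when the
    enclosing node kinds belong to a dialect the tool does not know."""
    if isinstance(attr, dict):
        if attr.get("kind") == "symbol_ref":
            yield attr["symbol"]
        for child in attr.get("children", []):
            yield from find_symbol_refs(child)

# An attribute from an unknown dialect, but with the common tree shape:
unknown_attr = {
    "kind": "mydialect.schedule",   # opaque to the tool
    "children": [
        {"kind": "int", "value": 4},
        {"kind": "symbol_ref", "symbol": "@kernel_entry"},
    ],
}
print(list(find_symbol_refs(unknown_attr)))  # ['@kernel_entry']
```

The tool has no idea what a `mydialect.schedule` means, but it can still offer go-to-definition on `@kernel_entry`; with MLIR's opaque attribute strings this is impossible.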

For a project that is based so heavily on extensibility this is puzzling. When using the upstream dialects that are included in the project, everything works well. However once you actually make use of the extensibility and develop a downstream dialect, you have to maintain a parallel universe with a custom built version of every tool. I suppose that this issue isn’t felt that much by the core developers since they can simply upstream every dialect they want to work with, but it is certainly a barrier that is felt by everyone who wants to use MLIR with their own dialect.

Let’s assume that even after a lot of tweaking the generic syntax would still remain unpalatable to human eyes. In that case, there is a middle ground between tedious syntax and open-ended parser plugins. Many user-facing programming languages avoid arbitrarily extensible syntax in favour of a small amount of widely applicable syntactic sugar.

If high-level programming languages that focus on developer experience manage just fine without arbitrary syntax extensions, arguably a compiler IR debug representation could have done without them as well and in exchange reaped the benefits of much simpler tooling. The type of syntactic sugar that is appropriate for an IR might be very different and is likely underexplored, but I am confident that after struggling with the generic syntax for a while some promising options would have emerged organically.

Or perhaps a declarative macro system inspired by Rust’s macro_rules! would have been sufficient. Any Rust developer will have noticed that syntax highlighting and Rust LSP functions like go-to-definition still mostly work within the custom syntax provided by declarative macros. Tools would need access to a definition file for the dialect that specifies its syntax, but at least they wouldn’t have to be recompiled. If you are opposed to the idea of having a declarative macro system in an IR, then surely you should also be opposed to a fully extensible parser. There could even have been a dialect that describes the macro rules.

MLIR does ship with the declarative assembly format with which the grammar of an operation can be described. However, this format is still tightly coupled to backing C++ code and is not required for all operations. It serves to conveniently generate custom parsers and is not designed as a standalone declarative specification that tools can use to parse otherwise unknown operations. Perhaps it could be adapted to fill this role in the future.
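As a sketch of what a standalone declarative format could enable, here is a toy parser that parses an operation's custom syntax from a format description alone, with no compiled-in dialect code. This is a drastic simplification: the format string only supports `$variable` captures and literal tokens, which merely gestures at MLIR's assembly format directives:

```python
import re

def parse_with_format(fmt, text):
    """Parse an operation's custom syntax purely from a declarative
    format string (whitespace-separated $vars and literal tokens)."""
    result = {}
    pos = 0
    for tok in fmt.split():
        rest = text[pos:].lstrip()
        pos = len(text) - len(rest)          # skip leading whitespace
        if tok.startswith("$"):              # capture a variable
            m = re.match(r"[%@\w.]+", rest)
            result[tok[1:]] = m.group(0)
            pos += m.end()
        else:                                # match a literal token
            assert rest.startswith(tok), f"expected {tok!r}"
            pos += len(tok)
    return result

# Hypothetical format for a binary op's custom syntax:
print(parse_with_format("$lhs , $rhs : $type", "%a , %b : i32"))
# {'lhs': '%a', 'rhs': '%b', 'type': 'i32'}
```

A tool shipping only this interpreter plus per-dialect format files could parse downstream operations it has never been compiled against, which is exactly what the C++-backed declarative assembly format cannot offer today.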

Dialect Interoperability

Dialects in MLIR enjoy a great amount of freedom. Their encodings of attributes, types, and operations in the text and binary formats are almost entirely unconstrained. The semantics and type systems of dialects are also subject to very few restrictions. This sounds desirable at first: it allows concepts from fundamentally different and possibly exotic dialects to be mapped directly into MLIR.

But this freedom comes at a price: I can’t expect anything of a dialect in the abstract. It is central to the idea of MLIR that multiple dialects are supposed to coexist and interact in one MLIR module. But how would I develop a dialect whose semantics are compatible with other dialects in an open-world extension system if I cannot know what to expect of the other dialects that will be present? How would I develop tools that can deal with dialects that I have not explicitly accounted for?

When dialects don’t have to follow rules, the platform degenerates into the XML/JSON of compiler IRs. This issue of ambiguous dialect semantics is addressed by declaring the dialect’s C++ implementation of verifiers, analyses, canonicalisers, lowerings, and optimisation passes to be the single source of truth. The semantics of a dialect combination then emerges naturally by compiling their supporting code into one binary and seeing what happens. I would expect the result to be unpredictable, which is why dialects are typically combined into stacks that contain a closed set of dialects that are expected to work together. A dialect stack is essentially a self-contained programming language that happens to use MLIR under the hood. It appears to me that this is the de facto target use case of MLIR.

I know well that my personal bias steers me towards heavy theory and that industry tends to have its focus elsewhere. However, I still believe that, especially in the realm of compilers, judiciously applying tools from programming language theory can lead to designs that work better in practice. When I worked at Quantinuum, I proposed several tweaks to the compiler IR that were entirely motivated by my own private concerns about categorical semantics. These tweaks turned out to allow many new useful features to be integrated into the design with substantially less friction than in previous, more ad-hoc designs. There is a rich library of literature on programming language semantics and type theory which could have been relied upon to formulate an abstract machine model that MLIR would then implement.

The developers of the Ratte fuzzing system [3] ran into these issues as well. A fuzzing tool can only identify deviations from the intended behaviour of a piece of software if that intended behaviour is well specified. The Ratte project appears to have pieced together a formal semantics (in the form of a definitional interpreter) for a set of common dialects, and in the process identified a number of bugs in the code and ambiguities in the specification. Ratte doesn’t solve the specification problem for the MLIR ecosystem, however. Its approach works for a specific set of dialects, and while it is in principle compositional, the ecosystem is under no obligation to develop dialects, or stacks of dialects, whose semantics can be cleanly captured in this or any other system. There is still no standard compositional semantics for MLIR dialects.
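To illustrate what a definitional interpreter is, here is my own toy sketch, not Ratte's actual system: each dialect registers the meaning of its operations as plain functions, and the interpreter composes those meanings without knowing about any particular dialect. The op encoding is invented; the names loosely follow MLIR's arith dialect:

```python
# Registry mapping operation names to their denotations.
SEMANTICS = {}

def register(op_name):
    def wrap(fn):
        SEMANTICS[op_name] = fn
        return fn
    return wrap

@register("arith.constant")
def _const(env, op):
    return op["attrs"]["value"]

@register("arith.addi")
def _addi(env, op):
    return env[op["operands"][0]] + env[op["operands"][1]]

def interpret(ops):
    """Evaluate a straight-line op list compositionally: the meaning of
    the program is assembled from the meaning of each operation."""
    env = {}
    for op in ops:
        env[op["result"]] = SEMANTICS[op["name"]](env, op)
    return env

program = [
    {"name": "arith.constant", "result": "%0", "attrs": {"value": 2}, "operands": []},
    {"name": "arith.constant", "result": "%1", "attrs": {"value": 3}, "operands": []},
    {"name": "arith.addi", "result": "%2", "attrs": {}, "operands": ["%0", "%1"]},
]
print(interpret(program)["%2"])  # 5
```

The compositionality is the crucial property: a new dialect only has to register denotations in the same style, and programs mixing dialects get a meaning for free. The open question for MLIR is that nothing requires dialects to fit any such style.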

As with xdsl earlier, one could argue that Ratte’s formal specification demonstrates that the MLIR platform provides a solid foundation and that whatever is missing can and will be supplied in time. I am skeptical of that interpretation. For one, the team behind Ratte had to expend a lot of effort trying to retrofit semantics onto a system that was not designed with such things in mind. It strikes me as an impolite practice to move fast, establish your tool as the de facto standard, and then expect others to navigate the jungle of corner cases that this engineering pragmatism produced. I am sure this effect emerges from the best intentions, but it irks me still.

Retrofitting a specification, formal semantics, or type systems to an existing programming language that was designed without those concerns in mind tends to take herculean effort and produce complex results. For example, since TypeScript had to capture the programming patterns that were typical in preexisting untyped JavaScript code, its type system is necessarily enormous and complex. I have seen much more complicated types in common TypeScript libraries than I have ever seen in even the most excessively clever Haskell code. In the case of TypeScript there was no choice: in order to be widely adopted, it had to be seamlessly compatible with JavaScript and therefore accommodate all of the interesting APIs that people come up with in dynamically typed languages. In the case of MLIR, there was the chance to take a minute to make sure that the language is amenable to building simple static analysis tools in the future.

Dialect interoperability in terms of a syntax, semantics, and type system with clear, simple, and regular composition rules appears to me to be the most important piece when trying to unify the landscape of intermediate representations so that duplicate work can be avoided. When done right, this can reduce the cost of producing implementations optimised for varying tradeoffs and encourage the building of powerful and general tooling that is broadly compatible. Unfortunately this does not appear to have been a priority.

What Now?

I haven’t given up on the dream that one day we will have the kind of compiler design toolkit which I originally thought MLIR to be. The presence of MLIR makes developing such a platform harder, as it makes it more difficult to justify the need to build something new. This social problem can partially be transformed into a technical one by trying to maintain compatibility with MLIR, but that brings along its own challenges. This post doesn’t really convey the idea of what I have in mind, but I hope to someday find the resources to take on this challenge head-on instead of just tinkering on it in my free time.

Bibliography

  • [1] M. Fehr et al., “Sidekick Compilation with xDSL,” no. arXiv:2311.07422. arXiv, June 2024. doi: 10.48550/arXiv.2311.07422.
  • [2] M. Fehr, J. Niu, R. Riddle, M. Amini, Z. Su, and T. Grosser, “IRDL: An IR Definition Language for SSA Compilers,” in Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, San Diego CA USA: ACM, June 2022, pp. 199–212. doi: 10.1145/3519939.3523700.
  • [3] P. Yu, N. Wu, and A. F. Donaldson, “Ratte: Fuzzing for Miscompilations in Multi-Level Compilers Using Composable Semantics,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, Rotterdam Netherlands: ACM, Mar. 2025, pp. 966–981. doi: 10.1145/3676641.3716270.