
Concepts in Streaming SQL

SQLstream's marketing head Rick Saletta just wrote a layman's guide to streaming SQL. It's short, sweet, entirely buzzword-free, and a good introduction to streaming queries. So I thought I'd share the whole post:
A streaming SQL query is a continuous, standing query that executes over streaming data. Data streams are processed using familiar SQL relational operators, augmented to handle time-sensitive data. Streaming queries are similar to database queries in how they analyze data; they differ by operating continuously on data as they arrive and by updating results in real time.

Streaming SQL queries process dynamic, flowing data, in contrast to traditional RDBMSs, which process static, stored data with repeated single-shot queries. Streaming SQL is simple to configure using existing IT skills, dramatically reducing integration cost and complexity. Combining the intuitive power of SQL with this simplicity of configuration enables much faster implementation of business ideas, while retaining the scalability and investment protection important for business-critical systems.

By processing transactions continuously, streaming SQL directly addresses the real-time business needs for low latency, high volume, and rapid integration. Complex, time-sensitive transformations and analytics, operating continuously across multiple input data sources, are simple to configure and generate streaming-analytics answers as input data arrive. Sources can include any application inputs or outputs, or any of the data feeds processed or generated within an enterprise. Examples include financial trading data, internet clickstream data, sensor data, and exception events. SQL can process multiple input and output streams of data, for multiple publishers and subscribers.
If you want to learn more, download the Concepts in Streaming SQL white paper.

Numbers everyone should know

Jeffrey Dean recently gave a talk "Building Software Systems at Google and Lessons Learned" at Stanford (video). One of his slides was the following list of numbers:

L1 cache reference                              0.5 ns
Branch mispredict                                 5 ns
L2 cache reference                                7 ns
Mutex lock/unlock                                25 ns
Main memory reference                           100 ns
Compress 1K bytes w/ cheap algorithm          3,000 ns
Send 2K bytes over 1 Gbps network            20,000 ns
Read 1 MB sequentially from memory          250,000 ns
Round trip within same datacenter           500,000 ns
Disk seek                                10,000,000 ns
Read 1 MB sequentially from disk         20,000,000 ns
Send packet CA->Netherlands->CA         150,000,000 ns

Everyone who wants to design high-performance, scalable systems should memorize these numbers. There are many, many lessons to be learned.

Architectural shuffling in mondrian's XMLA and olap4j servers

As a software architect, some of my most interesting work doesn't deliver any additional functionality to end-users, but reorganizes the architecture to make great things possible in future. Since mondrian is an open source project, those great things will, likely as not, be dreamt up by someone else; my job as leader of the mondrian project is to reorganize things to make that possible.

Case in point: my recent check-in, change 13929. It contains three new pieces of functionality.

Make mondrian's XMLA server run off the olap4j API

Mondrian's legacy API (mondrian.olap.Connection, etc.) has been deprecated for some time; olap4j is the official API by which applications should speak to mondrian. Mondrian's XMLA server, which takes incoming SOAP requests over HTTP to execute queries or retrieve metadata, processes them using mondrian, and returns the results as SOAP or JSON over HTTP, had not used the olap4j API until now.

As part of this change, I converted the XMLA server to use olap4j. In the process, I achieved some beneficial side effects. First, I discovered and fixed a few bugs in mondrian's olap4j driver; this will make the olap4j driver more stable for everyone.

Second, I discovered a few essential pieces of metadata that the olap4j API does not return. I have not yet extended olap4j to include them: that may happen as we move towards olap4j 1.1 or olap4j 2.0, if they make sense for other olap4j stakeholders. I created the XmlaExtra interface as a loophole, to allow the XMLA server to get at mondrian's legacy API; this interface serves to document what's missing from olap4j.

Third, and most exciting, the XMLA server should now run against any olap4j driver. It needs to be repackaged a bit — it still lives within the mondrian codebase, in the mondrian.xmla package — but if you are developing an olap4j driver, contact me, and we can consider spinning it out.

Make mondrian into a real server — for those who want one

You'll notice that I tend to refer to mondrian as an OLAP engine. I've always hesitated to call it a 'server', because a server has an independent existence (its own process id, for instance), configuration information, and services such as authentication.

This is no accident: I deliberately architected mondrian as an engine, so that it could be embedded into another application or server and inherit those services from that application. That's why you need to tell mondrian the URI of the catalog, the JDBC information of the data warehouse, and the role that you would like mondrian to use to execute queries. It has no concept of users and passwords, because it assumes that the enclosing application is performing authentication, then mapping authenticated users to roles.

This architecture makes as much sense now as it did when I started, and it isn't going to change. Core mondrian will remain an engine. But the XMLA server, as its name suggests, performs some of the functions that one associates with a server. In particular, it reads a datasources.xml file that contains the name, catalog URI, and JDBC information of multiple catalogs. My idea was to create an alternate olap4j driver, MondrianOlap4jEngineDriver, that extends the default driver MondrianOlap4jDriver, and move the catalog functionality from the XMLA server to the new olap4j driver.

The new driver is added as part of this change, but is not complete. In a later change, I will move the catalog functionality out of the XMLA server. I don't have plans to add other server features, such as mechanisms to authenticate users or map user names to roles. But I've provided the hook where this functionality can be added, and I encourage you in the mondrian community to contribute that functionality.

Lock box

Last, I came up with an elegant (I think) solution to a problem that has been perplexing us for a while. The problem is that the JDBC API requires all parameters to be passed as strings when you are making a connection. If you are creating an olap4j connection to mondrian, and access to the underlying data warehouse is via a javax.sql.DataSource object, not a connect string, then you cannot pass in that DataSource. If you have created your own Role object to do customized access-control, you cannot pass in the object, you have to pass in the name of a role already defined in the mondrian schema (or a comma-separated list of role names).

I invented a LockBox class, which acts as a special kind of map with some of the characteristics of a directory service. There is one lock box per server. If you have an object you wish to pass in, you register it with the lock box, and the lock box gives you a string moniker to reference that object. That moniker is unique for the duration of the server, and near impossible for an unauthorized component to guess. You can pass it to other components, and they can access the object.

The lock box automatically garbage collects unused objects. When an object is registered, the lock box returns an entry object to the caller that contains both the string moniker and the object itself. The entry is the key to a WeakHashMap, so when the client forgets the entry, the object is eligible to be garbage-collected out of the lock box. This guarantees that the lock box will not fill up over time due to clients forgetting to deregister objects.
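
To make the idea concrete, here is a minimal sketch of how such a lock box could work. The class and method names are mine, not Mondrian's actual LockBox API, and the sketch keys weak references by moniker rather than using a WeakHashMap directly, but the garbage-collection behavior is the same idea: once the caller drops its entry, the object can be collected.

import java.lang.ref.WeakReference;
import java.security.SecureRandom;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Minimal sketch of a lock box: register an object, receive an
 *  unguessable string moniker, and retrieve the object later by that
 *  moniker. Names are illustrative, not Mondrian's actual API. */
public class SimpleLockBox {

    /** The caller keeps this Entry; once it is forgotten, the weak
     *  reference below clears and the object can be collected. */
    public static final class Entry {
        public final String moniker;
        public final Object value;
        Entry(String moniker, Object value) {
            this.moniker = moniker;
            this.value = value;
        }
    }

    private final SecureRandom random = new SecureRandom();
    private final ConcurrentMap<String, WeakReference<Entry>> map =
        new ConcurrentHashMap<>();

    /** Registers an object and returns an entry containing its moniker. */
    public Entry register(Object value) {
        String moniker =
            "lockbox$" + random.nextLong() + "$" + random.nextLong();
        Entry entry = new Entry(moniker, value);
        map.put(moniker, new WeakReference<>(entry));
        return entry;
    }

    /** Looks up an object by moniker; returns null if it was never
     *  registered or has already been garbage-collected. */
    public Object get(String moniker) {
        WeakReference<Entry> ref = map.get(moniker);
        Entry entry = (ref == null) ? null : ref.get();
        return (entry == null) ? null : entry.value;
    }
}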

LockBox does not purport to be a full directory service — in particular, objects are only accessible within the same JVM — but it carries out a simple purpose, efficiently and elegantly, and may be useful to other applications.

An experiment with the Linux scheduler

I was curious to see how the Linux scheduler would manifest from a program's perspective, so today I did an experiment.

I wrote a single-threaded program running a simple loop. All the loop does is to compute the number of milliseconds since the last iteration, and store the result in a histogram. We are not so much interested in the performance of the loop (it does about a million iterations per second) but in the variations in the intervals between loop iterations. These variations are presumably caused by the Linux scheduler.
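
For reference, here is roughly what that measurement loop looks like; this is my reconstruction of the experiment, not the original program.

/** Tallies the gaps between successive iterations of a tight loop.
 *  A reconstruction of the experiment, not the original program. */
public class SchedulerJitter {
    public static void main(String[] args) {
        final int maxMillis = 64;
        long[] histogram = new long[maxMillis + 1];
        long last = System.currentTimeMillis();
        long end = last + 450 * 1000L;   // run for 7.5 minutes
        while (System.currentTimeMillis() < end) {
            long now = System.currentTimeMillis();
            int gap = (int) Math.min(now - last, maxMillis);
            histogram[gap]++;            // almost always 0; occasionally more
            last = now;
        }
        for (int i = 0; i <= maxMillis; i++) {
            if (histogram[i] > 0) {
                System.out.println(i + "\t" + histogram[i]);
            }
        }
    }
}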

Here are the numbers I achieved, and the same numbers in a chart (with a logarithmic y-axis).


Interval (milliseconds)   Frequency
0                       450,080,302
1                           909,044
2                             4,642
3                             1,696
4                               853
5                               561
6                               557
7                               335
8                             1,098
9                               152
10                               86
11                               52
12                               98
13                               17
14                               13
15                                6
16                               21
17                                5
19                                3
20                                1
21                                2
22                                2
23                                0
24                                2
25                                0
26                                0
27                                0
28                                1
29                                0

The vast majority of iterations occur zero milliseconds after the previous iteration. No surprise there: Java's clock granularity is 1 millisecond, which is plenty of time to execute over a million instructions, so consecutive iterations usually see the same clock value.

If the thread was never interrupted, one would expect the loop to tick forward 1 millisecond 1,000 times per second, or about 500,000 times in all. It actually ticks 909,044 times, so interrupts are at work: about 400,000 of them.

Longer delays also occur: 2, 3, 4, up to 28 milliseconds, occurring with exponentially decreasing frequency. Only 8 delays of 20 milliseconds or longer occur in the 7.5 minute run. The chart shows the exponential decay clearly: it plots log-frequency, and the trend is indeed a straight line from 2 milliseconds onwards, so it is accurate to characterize the decay as exponential.

The one surprising thing: significant bumps at 8, 12 and 16 milliseconds. Although the trend of the line is pretty consistently down, each of those interval durations has distinctly more occurrences than the preceding interval. Does anyone know anything about the Linux scheduler that might explain this?

"Just another big pile of data"

Jeff Jonas writes about the challenges of managing data privacy when the data concerned is Big Data. He advocates taking a real-time approach to auditing user behavior:
Real-time active audits.  It is now going to be essential that user activity be more rigorously analyzed, in real-time, for inappropriate behavior.  Audit logs have actually been part of the problem – just another big pile of data – evidence of misuse hiding in plain sight against the backdrop of millions and millions of benign audit records.
I must say, it hadn't occurred to me that privacy management could be seen as a real-time data problem. But he's right that large data sets, when not acted upon immediately, can become part of the problem. In his words, "just another big pile of data".

Oracle, Hudson and Jenkins

I've been following the furore about the Hudson open source project with some interest and amusement. Oracle owns the trademark on the name Hudson (because the original developer worked for Sun at the time the project was created) and the community is spooked by the possibility that Oracle will enforce its trademark rights in future.

Trademark rights are indeed a big deal for an open source project, just as they are for a commercial product. An open source project builds its brand by several years of high-quality releases and effective support in its community. Whoever owns the trademark of a project controls that brand.

Here is Oracle's proposal for the future of the project, and the response of one of the project's lead developers.

It's an interesting study in the fragile dynamics of an open source project's community. Oracle clearly don't understand how fragile the power balance is. The community is spooked; not so much by Oracle's ability to enforce its trademark (they claim they would never do that) but by their presumption that they have more of a say in the project than anyone else.

My two cents? Oracle are not evil, but they are being naive and are coming across as complete dicks. If I were a member of the active Hudson community (I'm a happy user of Hudson, since it powers Pentaho's continuous integration site, but I wouldn't say that makes me a community member) I'd certainly give my +1 to fork and change the name of the project to Jenkins. There's little reason not to.

Scalable caching in Mondrian

Wouldn't it be great if Mondrian's cache could be shared between several Mondrian instances, use memory outside the JVM or even across several machines, and scale as the data size or computation effort increases? That is the vision of Pentaho's "enterprise cache" initiative.

Mondrian cell-caching architecture, including pluggable external cache.

Luc Boudreau has been leading this effort, has just checked in the first revision of the new mondrian.rolap.agg.SegmentCache interface, and has written a blog post describing how it will work. (Note: This SPI is likely to change before we release it.)
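
To give a feel for the shape of such an SPI, and only that, here is an illustrative sketch. The interface and method names below are mine, not the actual mondrian.rolap.agg.SegmentCache that Luc checked in, which is richer and, as noted, still in flux.

import java.util.List;
import java.util.concurrent.Future;

/** Illustrative sketch only: the method names and types are mine, not
 *  those of the actual mondrian.rolap.agg.SegmentCache interface. */
public interface CellSegmentCache {
    /** Asynchronously fetches a segment's body by key, or completes with
     *  null if the cache does not hold it; the Future matters because a
     *  call may involve network I/O. */
    Future<byte[]> get(String segmentKey);

    /** Stores a segment body; the Future reports whether it was accepted. */
    Future<Boolean> put(String segmentKey, byte[] segmentBody);

    /** Lists the keys of the segments currently held by the cache. */
    List<String> getKeys();
}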

Pluggable caching will be in Mondrian release 3.3, probably Q2 or Q3 this year. The community edition will include the SPI and a default implementation that uses JVM memory; of course, the community will be able to contribute alternative implementations. The enterprise edition of Mondrian 3.3 will include a scalable, highly manageable implementation based on something like Terracotta BigMemory, Ehcache or JBoss Infinispan.

In future releases, you can expect to see further work in the area. Maybe alternative implementations of the caching SPI, and certainly tuning of Mondrian's caching and evaluation strategies, as we apply Mondrian to some of the biggest data sets out there.

olap4j version 1.0 released

Today we launched version 1.0 of olap4j, the open standard API for accessing analytic databases.

It's worth mentioning that version 1.0 is a big deal for an open source project. The tag implies maturity and stability, both of which are true for olap4j. The project is over 4 years old, has two robust driver implementations, and many applications in production.

The olap4j driver for Mondrian has been the official way to access Mondrian since version 3.0, and the olap4j driver for XML/A allows access to many XML/A-compliant analytic engines, including Microsoft SQL Server Analysis Services, Mondrian, Palo, and SAP BW.

olap4j was created to address the lack of an open standard API for Java access to OLAP servers. Microsoft had created APIs for the Windows platform (OLE DB for OLAP, and later ADOMD.NET) and for web services (XML for Analysis) and in due course other vendors adopted those APIs as standards, but on Java, the main platform for enterprise applications, you were always tied to the API provided by your OLAP server vendor.

There had been previous attempts to create Java APIs for OLAP, but they foundered because the main vendors could not — or would not — overcome the technical differences between their products. Since OLAP is concerned with constructing dynamic queries to assist an end-user in interactively exploring a data set, most vendors constructed queries using a complex proprietary API to "build" a query using a sequence of transforms.

Relational database APIs such as ODBC and JDBC take a different approach: the query is a string in the SQL language. This allowed the APIs to be simpler, because the semantics of the query language need to be understood by the SQL parser and validator on the server, not by the API itself. And it has allowed the query language to be standardized without affecting the API too much. But the OLAP vendors maintained that such a simplifying approach could not be applied to OLAP.

Microsoft started to prove them wrong when, in 1998, they launched SQL Server OLAP Services, the OLE DB for OLAP API, and the MDX query language. This was the first time (to my knowledge) that an OLAP vendor had built its API around a query language as opposed to a set of transforms. MDX played a major role in the success of XML/A: a web services API would have been much harder to use if the queries had been built using an object model. Other vendors started to adopt OLE DB for OLAP and XML/A, leaving a void on the one platform Microsoft had no interest in: Java.

Those of us in the open source world felt that void most acutely. Open source projects are organized into discrete components, each talking a standard API, and able to replace a proprietary component by being better, cheaper, faster. If there are no standard APIs, the product stacks sprawl across many components, from client-side to server-side, all made by the same vendor; there is nowhere for open source to get a foothold, and the customer has no choice but to accept the whole hog sold by the vendor.

To redress this, I decided to create a new API. The software would be developed as an open source project, but perhaps more importantly, the specification would be created using an open standards process. As a result, the participants in olap4j read as a who's who of open source BI. Barry Klawans, then chief architect of JasperSoft, co-authored the original draft; Pentaho's chief geek, James Dixon, authored the query model; Luc Boudreau, first with the University of Montreal, then with SQL Power, and now at Pentaho, is the XMLA driver's most active committer and co-leads the project; Paul Stoellberger and Tom Barber have proven and showcased olap4j by developing the first graphical client, Saiku. Paul has also got the XMLA driver working against SAP BW. And we've worked closely with Palo developers: Michael Raue worked with us on the spec, and Vladislav Malicevic has gotten the XMLA driver working against Palo.

I knew that to be successful, olap4j needed to be simple and familiar, so I mandated that it would be an extension to JDBC and would use MDX as its query language. The other participants in the specification process took it from there.

Because olap4j is an extension to JDBC, any developer who has accessed databases from Java can easily pick it up. And it can leverage standard JDBC services such as connection pools and driver managers.
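
For example, obtaining a connection and running an MDX statement looks very much like plain JDBC. The connect string and query below are made up for illustration, but the olap4j classes (OlapConnection, OlapStatement, CellSet) are the real ones.

import java.sql.Connection;
import java.sql.DriverManager;
import org.olap4j.CellSet;
import org.olap4j.OlapConnection;
import org.olap4j.OlapStatement;
import org.olap4j.Position;

public class Olap4jExample {
    public static void main(String[] args) throws Exception {
        // Load the Mondrian olap4j driver (the XMLA driver works the same
        // way); the connect string below is illustrative only.
        Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
        Connection connection = DriverManager.getConnection(
            "jdbc:mondrian:Jdbc=jdbc:mysql://localhost/foodmart;"
            + "Catalog=file:FoodMart.xml");
        OlapConnection olapConnection =
            connection.unwrap(OlapConnection.class);

        // Execute MDX just as you would execute SQL through JDBC.
        OlapStatement statement = olapConnection.createStatement();
        CellSet cellSet = statement.executeOlapQuery(
            "SELECT {[Measures].[Unit Sales]} ON COLUMNS,\n"
            + " {[Product].Children} ON ROWS\n"
            + "FROM [Sales]");

        // Walk the result: one line per Product member with its Unit Sales.
        Position column = cellSet.getAxes().get(0).getPositions().get(0);
        for (Position row : cellSet.getAxes().get(1).getPositions()) {
            System.out.println(
                row.getMembers().get(0).getName() + ": "
                + cellSet.getCell(column, row).getFormattedValue());
        }
        connection.close();
    }
}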

Microsoft had proven that an API could be built around the MDX language; there were differences between servers, but these would be mostly in the dialect of MDX supported; just about any server could support the basic metamodel of catalogs, cubes, dimensions, and measures. Some clients would want to build their own queries, and parse existing MDX queries; for these, we added a query model and an MDX parser to olap4j. Use of the query model and MDX parser is optional: if you have an MDX query string, you can just execute it.

We have recently added more advanced features such as scenarios (write-back) and notifications. These features are still experimental (unlike the rest of the API, they may change post-1.0) and are optional for any olap4j provider. But we hope to see more providers implementing them, and clients making use of them. And we hope to see more features added to olap4j in future versions.

The goal of olap4j was to foster development of analytic clients, servers, and integrated analytic apps by providing an open standard for connectivity. That goal has been realized. There is a native driver for mondrian and an XMLA driver that works against Microsoft SQL Server Analysis Services, SAP BW, and Jedox Palo. There are several clients, both open and closed source: several components in Pentaho's own suite, the Community Dashboard Framework (CDF), Saiku, ADANS, SQL Power Wabit, and more.

People are using olap4j in ways that I couldn't imagine when I started the project four years ago. That's the exciting thing when an open source project becomes successful and starts to gain momentum: you can expect the unexpected.

Thank you to everyone who helped us get to this milestone.

Visit www.olap4j.org, and download release 1.0 of the specification and the software.

Scripted plug-ins in LucidDB and Mondrian

I saw a demo last week of scripted user-defined functions in LucidDB, and was inspired this weekend to add them to Mondrian.

Kevin Secretan of DynamoBI has just contributed some extensions to LucidDB to allow you to call script code (such as JavaScript or Python) in any place where you can have a user-defined function, procedure, or transform. This feature builds on a JVM feature introduced in Java 1.6, scripting engines.
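
As a reminder of what that facility gives you, here is a tiny, self-contained example of evaluating and invoking a JavaScript function from Java via javax.script. It is separate from LucidDB's and Mondrian's actual plug-in machinery, which layer their own SPIs on top.

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

/** Standalone demonstration of the Java 1.6 scripting facility that
 *  scripted plug-ins are built upon. */
public class ScriptEngineDemo {
    public static void main(String[] args) throws Exception {
        // Java 6 bundles Rhino; newer JDKs may need a JavaScript engine
        // on the classpath for getEngineByName to return non-null.
        ScriptEngineManager manager = new ScriptEngineManager();
        ScriptEngine engine = manager.getEngineByName("JavaScript");

        // Compiled on the fly: the script is just a string -- no jar,
        // no classpath change, no server restart.
        engine.eval("function factorial(n) {"
            + "  return n <= 1 ? 1 : n * factorial(n - 1);"
            + "}");

        // Call the scripted function from Java.
        Invocable invocable = (Invocable) engine;
        Object result = invocable.invokeFunction("factorial", 5);
        System.out.println("5! = " + result);   // 120 (as a Double)
    }
}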

Scripted functions may be a little slower than Java user-defined functions, but what they lose in performance they more than make up in flexibility. Writing user-defined functions in Java has always been laborious: you need to write a Java class, compile it, put it in a jar, put the jar on the server's class path, and restart the server. Each time you find a bug, you need to repeat that process, and that can easily take a number of minutes each cycle. Because scripted functions are compiled on the fly, you can cycle faster, and spend more of your valuable time working on the actual application.

I am speaking about LucidDB (and SQLstream) here, but the same problems exist for Mondrian plug-ins. Scripting is an opportunity to radically speed up development of application extensions, because everything can be done in the schema file. (Or via the workbench... but that part isn't implemented yet.)

Mondrian has several plug-in types, all today implemented using a Java SPI. I chose to make scriptable those plug-ins that are defined in a mondrian schema file: user-defined function, member formatter, property formatter, and cell formatter. A small syntax change to the schema file lets you choose whether to implement these plug-ins by specifying the name of a Java class (as before) or an inline script.

As an example, here is the factorial function defined in JavaScript:

<UserDefinedFunction name="Factorial">
  <Script language="JavaScript">
    function getParameterTypes() {
      return new Array(new mondrian.olap.type.NumericType());
    }
    function getReturnType(parameterTypes) {
      return new mondrian.olap.type.NumericType();
    }
    function execute(evaluator, arguments) {
      var n = arguments[0].evaluateScalar(evaluator);
      return factorial(n);
    }
    function factorial(n) {
      return n <= 1 ? 1 : n * factorial(n - 1);
    }
  </Script>
</UserDefinedFunction>

A user-defined function ironically requires several functions in order to provide the metadata needed by the MDX type system. The member, property and cell formatters are simpler. They require just one function, so mondrian dispenses with the function header, and requires just the 'return' expression inside the Script element. For example, here is a member formatter:

<Level name="name" column="column">
  <MemberFormatter>
    <Script language="JavaScript">
      return member.getName().toUpperCase();
    </Script>
  </MemberFormatter>
</Level>

You can of course write multiple statements, if you wish. Since JavaScript is embedded in the JVM, your code can call back into Java methods, and use the full runtime Java library.

There are examples of cell formatters and property formatters in the latest schema guide.

If you are concerned about performance, you could always translate this code back to a Java UDF when it is fully debugged. However, you might be pleasantly surprised by the performance of JavaScript: I was able to invoke a script function about 20,000 times per second. And I hear that there is a Janino "scripting engine" that compiles Java code into bytecode on the fly. In principle, it should be as fast as a real Java UDF.

I'd love to hear how Janino, or in fact any other scripting engine, works with Mondrian or LucidDB scripted functions.

By the way, you can expect to see scripted functions in a release of SQLstream not too far in the future. The Eigenbase project makes it easy to propagate features between projects, and this feature is too good not to share.

Removing Mondrian's 'high cardinality dimension' feature

I would like to remove the 'high cardinality dimension' feature in mondrian 4.0.

To specify that a dimension is high-cardinality, you set the highCardinality attribute of the Dimension element to true. This will cause mondrian to scan over the dimension, rather than trying to load all of the children of a given parent member into memory.

The goal is a worthy one, but the implementation — making iterators look like lists — has a number of architectural problems: it duplicates code; because it allows backtracking for a fixed amount, it works with small dimensions but unpredictably fails with larger ones; and because lists are based on iterators, re-starting an iteration multiple times (e.g. from within a crossjoin) can re-execute complex SQL statements.

There are other architectural features designed to help with large dimensions. Many functions can operate in an 'iterable' mode (except that here the iterators are explicit). And for many of the most data-intensive operators, such as crossjoin, filter, semijoin (non-empty), and topcount, we can push down the operator to SQL, and thereby reduce the number of records coming out of the RDBMS.

It's always hard to remove a feature. But over the years we have seen numerous inconsistencies, and if we removed this feature in mondrian 4.0, we could better focus our resources.

If you are using this feature and getting significant performance benefit, I would like to hear from you. I would like to understand your use case, and either direct you to another feature that solves the problem, or try to develop an alternative solution in mondrian 4.0. The best place to make comments about these use cases is on the Jira case MONDRIAN-949.

Roll your own high-performance Java collections classes

The Java collections framework is great. You can create maps, sets, lists with various element types, various performance characteristics (e.g. if you want O(1) insert, use a linked list), iterate over them, and you can decorate them to give them other behaviors.

But suppose that you want to create a high-performance, memory efficient immutable list of integers? You'd write

List<Integer> list =
  Collections.unmodifiableList(
    new ArrayList<Integer>(
      Arrays.asList(1000, 1001, 1002)));


There will be 6 objects allocated in the JVM: three Integer objects, an array Object[3] to hold the Integers, an ArrayList, and an UnmodifiableRandomAccessList. Not to mention the Arrays.ArrayList and Integer[3] used to construct the list and quickly thrown away.

The resulting list is no longer high-performance. A call to say 'int n = list.get(2)' requires 3 method calls (UnmodifiableRandomAccessList.get, ArrayList.get, Integer.intValue) and 3 indirections. And the sheer number of objects created reduces the chance that a given stretch of code will be able to operate solely from the contents of L1 cache.

So, what next? Should I write my own class, like this?

public class UnmodifiableNativeIntArrayList
  implements List<Integer>
{
  ...
}


Well, maybe. But there are rather a lot of variations to cover, and each one needs to be hand-coded and tested.

Do I use library code? I searched and turned up Apache Commons Primitives, Primitive Collections for Java (PCJ), and GNU Trove (trove4j). Of these, only GNU Trove is still active.

None of the libraries supports features such as maps with two or more keys, unmodifiable collections, synchronized collections, flat collections (similar to Apache Flat3Map). It's not surprising that they don't: each combination of features would require its own class, so the size of the jar file would grow exponentially.

So, I'd like to propose an alternate approach. You configure a factory, specifying the precise kind of collection you would like, and the factory generates the collection class in bytecode. You can use the factory to quickly create as many instances of the collection as you wish. The collection implements the Java collections interfaces, plus additional interfaces that allow you to efficiently access the collection without boxing/unboxing.

The above example would be written as follows:

// Initialize the factory when the program is loaded.
// Then the bytecode gets generated just once.
static final Factory factory =
  new FactoryBuilder()
    .list()
    .elementType(Integer.TYPE)
    .modifiable(false)
    .factory();

int[] ints = {1000, 1001, 1002};
IntList list = factory.createIntList(ints);
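
As for the IntList interface used above, I imagine something like the following. This is a guess at a design rather than an existing API: the generated class implements the ordinary List<Integer> for interoperability, plus unboxed accessors for hot paths.

import java.util.List;

/** Hypothetical interface that a generated primitive-int list could
 *  implement: the standard List<Integer> view for interoperability,
 *  plus unboxed accessors for performance-critical callers. */
public interface IntList extends List<Integer> {
    /** Returns the element at the given position without boxing. */
    int getInt(int index);

    /** Appends a value without boxing; returns true if the list changed
     *  (an unmodifiable implementation would throw instead). */
    boolean addInt(int value);

    /** Copies the contents into a primitive array. */
    int[] toIntArray();
}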


Variants are expressed as FactoryBuilder methods:
  • FactoryBuilder FactoryBuilder.list()
  • FactoryBuilder FactoryBuilder.map()
  • FactoryBuilder FactoryBuilder.set()
  • FactoryBuilder FactoryBuilder.keyType(Class...) (for maps only)
  • FactoryBuilder FactoryBuilder.valueType(Class...) (for maps only)
  • FactoryBuilder FactoryBuilder.elementType(Class...) (for list and set only)
  • FactoryBuilder FactoryBuilder.sorted(boolean) (cf. the difference between Set and SortedSet)
  • FactoryBuilder FactoryBuilder.deterministic(boolean) (cf. the difference between HashMap and LinkedHashMap)
  • FactoryBuilder FactoryBuilder.modifiable(boolean)
  • FactoryBuilder FactoryBuilder.fixedSize(boolean) (cf. the difference between Flat3Map and Map)
  • FactoryBuilder FactoryBuilder.synchronized(boolean)

And so forth. Additional variants could be added as the project evolved. Templates could be fine-tuned for particular combinations of variants.

The projects I mentioned above clearly use a template system, and we could use and extend those templates. The Janino facility can easily convert the generated Java code into bytecode. And the JVM would be able to apply JIT (just-in-time compilation) to these classes; in fact, these classes would be more amenable to compilation, because they would be compact and final.

The existing projects have invested a lot of effort designing high-performance collections. I'd like to build on that work; this project could even be an extension to those projects.

I'd like to hear if you're interested in working with me on this.

Yellowfin BI release 5.2 moves to olap4j

According to their press release, Yellowfin BI version 5.2 "includes a significant OLAP overhaul, with the introduction of OLAP4j and support for PALO, BW as well as enhanced connectivity for SQL Server 2005+".

Nice to see olap4j gaining wider adoption. Though not too surprising, given the connectivity options that it opens up. And bear in mind that because olap4j is open source, for every product that mentions olap4j in a press release, there may be dozens or hundreds of others that are using it and not talking about it publicly.

Increased adoption is good, whether or not vendors choose to announce it. We know that if vendors run into issues, they will log them and someone will fix them. It makes olap4j better for everyone.

Real-Time Seismic Monitoring

Marc Berkowitz wrote a blog post describing an application of SQLstream to power a seismic monitoring project that is a collaboration between several leading research institutions.

The project is interesting in several respects:

  • The project involves signal processing. Unlike the "event-processing" applications that we see most often at SQLstream, events arrive at a regular rate (generally 40 readings every second, per sensor). In signal processing, events are more likely to be processed using complex mathematical formulas (such as Fourier transforms) than by boolean logic (event A happened, then event B happened). Using SQLstream's user-defined function framework, we were easily able to accommodate this form of processing.
  • It illustrates how a stream-computing "fabric" can be created, connecting multiple SQLstream processing nodes using RabbitMQ.
  • One of the reasons for building a distributed system was to allow an agile approach. Researchers can easily deploy new algorithms without affecting the performance or correctness of other algorithms running in the cloud.
  • Another goal of the distributed system was performance and scalability. Nodes can easily be added to accommodate greater numbers of sensors. The system is not embarrassingly parallel, but we were still able to parallelize the solution effectively.
  • Lastly, the system needs to be both continuous and real-time. "Continuous" meaning that data is processed as it arrives; a smoother, more predictable and more efficient mode of operation than ETL. "Real-time" because some of the potential outputs of the system, such as tsunami alerts, need to be delivered as soon as possible in order to be useful.
In all, a very interesting case study of what SQLstream is capable of. Marc plans to make follow-up posts describing the solution in more detail, so stay tuned.

How Mondrian names hierarchies


You may or may not be aware of the property mondrian.olap.SsasCompatibleNaming. It controls the naming of elements, in particular how Mondrian names hierarchies when there are multiple hierarchies in the same dimension.

Let's suppose that there is a dimension called 'Time', and it contains hierarchies called 'Time' and 'Weekly'.

If SsasCompatibleNaming is false, the dimension and the first hierarchy will both be called '[Time]', and the other hierarchy will be called '[Time.Weekly]'.

If SsasCompatibleNaming is true, the dimension will be called '[Time]', the first hierarchy be called '[Time].[Time]', and the other hierarchy will be called '[Time].[Weekly]'.

As you can see, SsasCompatibleNaming makes life simpler, if slightly more verbose, because it gives each element a distinct name. There are knock-on effects, beyond the naming of hierarchies. The most subtle and confusing effect is in the naming of levels when the dimension, hierarchy and level all have the same name. If SsasCompatibleNaming is false, then [Gender].[Gender].Members is asking for the members of the gender level, and yields two members. If SsasCompatibleNaming is true, then [Gender].[Gender].Members is asking for the members of the gender hierarchy, and yields three members (all, F and M).

Usually, however, Mondrian is forgiving in how it resolves names, and if elements have different names, it will usually find the element you intend.

The default value is false. However, that leads to naming behavior which is not compatible with other MDX implementations, in particular Microsoft SQL Server Analysis Services (versions 2005 and later).

From mondrian-4 onwards, the property will be set to true. (You won't be able to set it to false.) This makes sense, because in mondrian-4, with attribute-hierarchies, there will typically be several hierarchies in each dimension. We will really need to get our naming straight.

What do we recommend? If you are using Pentaho Analyzer, Saiku or JPivot today, we recommend that you use the default value, false. But if you are writing your own MDX (or have built your own client), try setting the value to true. The new naming convention actually makes more sense, and moving to it now will minimize the disruption when you move to mondrian-4.

I am just about to check in a change that uses a new and better name resolution algorithm. It will be more forgiving, and more standards-compliant, in how it resolves the names of calculated members. However, it might break compatibility, so it will only be enabled if SsasCompatibleNaming is true.

Are you using this property today? Let us know how it's working for you.

Changes to Mondrian's caching architecture


I checked in some architectural changes to Mondrian's cache this week.

First the executive summary:

1. Mondrian should do the same thing as it did before, but scale up better to more concurrent queries and more cores.

2. Since this is a fairly significant change in the architecture, I'd appreciate if you kicked the tires, to make sure I didn't break anything.

Now the longer version.

Since we introduced external caches in Mondrian 3.3, we have been aware that we were putting a strain on the caching architecture. The caching architecture had needed modernization for a while, and external caches made it worse. First, a call to an external cache can take a significant amount of time: depending on the cache, it might do a network I/O, and so take several orders of magnitude longer than a memory access. Second, along with external caching we introduced in-cache rollup, and for both of these we had to beef up the in-memory indexes needed to organize the cache segments.

Previously we'd used a critical section approach: any thread that wanted to access an object in the cache locked out the entire cache. As the cache data structures became more complex, those operations were taking longer. To improve scalability, we adopted a radically different architectural pattern, called the Actor Model. Basically, one thread, called the Cache Manager, is dedicated to looking after the cache index. Any query thread that wants to find a segment in the cache, add a segment to the cache, create a segment by rolling up existing segments, or flush the cache, sends a message to the Cache Manager.

Ironically, the cache manager does not get segments from external caches. As I said earlier, external cache accesses can take a while, and the cache manager is super-busy. The cache manager tells the client the segment key to ask the external cache for, and the client does the asking. When a client gets a segment, it stores it in its private storage (good for the duration of a query) so it doesn't need to ask the cache manager again. Since a segment can contain thousands of cells, even large queries typically only make a few requests to the cache manager.

The external cache isn't just slow; it is also porous. It can have a segment one minute, and forget it the next. The Mondrian query thread that gets the cache miss will tell the cache manager to remove the segment from its index (so Mondrian doesn't ask for it again), and formulate an alternative strategy to find it. Maybe the required cell exists in another cached segment; maybe it can be obtained by rolling up other segments in cache (but they, too, could have gone missing without notice). If all else fails, we can generate SQL to populate the required segment from the database (a fact table, or if possible, an aggregate table). 

Since the cache manager is too busy to talk to the external cache, it is certainly too busy to execute SQL statements. From the cache manager's perspective, SQL queries take an eternity (several million CPU cycles each), so it farms out SQL queries to a pool of worker threads. The cache manager marks that segment as 'loading'. If another query thread asks the cache manager for a cell that would be in that segment, it receives a Future<SegmentBody> that will be populated as soon as the segment arrives. When that segment returns, the query thread pushes the segment into the cache, and tells the cache manager to update the state of that segment from 'loading' to 'ready'.
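
Stripped of all Mondrian specifics, the pattern looks roughly like the following sketch of an actor with a mailbox, a privately owned index, and a worker pool. The class and method names are illustrative, not Mondrian's actual cache manager.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch of the Actor Model applied to a cache manager: one thread owns
 *  the cache index, query threads post messages to its mailbox, and slow
 *  work (SQL) is farmed out to a worker pool. Illustrative names only. */
public class CacheManagerActor implements Runnable {

    /** A message from a query thread: "give me segment 'key'". */
    private static final class Request {
        final String key;
        final CompletableFuture<String> result = new CompletableFuture<>();
        Request(String key) { this.key = key; }
    }

    private final BlockingQueue<Request> mailbox = new LinkedBlockingQueue<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    // Owned exclusively by the actor thread: plain HashMap, no locks.
    private final Map<String, CompletableFuture<String>> index = new HashMap<>();

    /** Called by query threads; returns a future for the segment body. */
    public CompletableFuture<String> requestSegment(String key)
            throws InterruptedException {
        Request request = new Request(key);
        mailbox.put(request);
        return request.result;
    }

    /** The actor loop: the only thread that ever touches the index. */
    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Request request = mailbox.take();
                CompletableFuture<String> segment = index.get(request.key);
                if (segment == null) {
                    // Cache miss: mark the segment as 'loading' and farm
                    // the slow work (e.g. a SQL query) out to a worker.
                    CompletableFuture<String> loading = new CompletableFuture<>();
                    workers.submit(() -> {
                        loading.complete("segment body for " + request.key);
                    });
                    index.put(request.key, loading);
                    segment = loading;
                }
                // Chain the (possibly still loading) segment to the caller.
                segment.whenComplete((value, err) -> {
                    if (err != null) {
                        request.result.completeExceptionally(err);
                    } else {
                        request.result.complete(value);
                    }
                });
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}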

The Actor Model is a radically different architecture. First, let's look at the benefits. Since one thread is managing an entire subsystem, you can just remove all locking. This is liberating. Within the subsystem, you can code things very simply, rather than perverting your data structures for thread-safety. You don't even need to use concurrency-safe data structures like CopyOnWriteArrayList; you can just use the fastest data structure that does the job. Once you remove concurrency controls such as 'synchronized' blocks, and access the structure from only one thread, it becomes miraculously faster. How can that be? The data structure now resides in the thread's cache, and when you removed the concurrency controls, you also removed the memory barriers that forced changes to be written through L1 and L2 cache to RAM, which is up to 200 times slower.

Migrating to the Actor Model wasn't without its challenges. First of all, you need to decide which data structures and actions should be owned by the actor. I believe we got that one right. I found that most of the same things needed to be done, but by different threads than previously, so the task was mainly about moving code around. We needed to refine the data structures that were passed between "query", "cache manager" and "worker" threads, to make sure that they were immutable. If, for instance, you want the query thread to find other useful work to do while it is waiting for a segment, it shouldn't be modifying a data structure that it put into the cache manager's request queue. In a future blog post, I'll describe in more detail the challenges & benefits of migrating one component of a complex software system to the Actor Model.

Not all caches are equal. Some, like JBoss Infinispan, are able to share cache items (in our case, segments containing cell values) between nodes in a cluster, and to use redundancy to ensure that cache items are never lost. Infinispan calls itself a "data grid", which first I dismissed as mere marketing, but I became convinced that it is genuinely a different kind of beast than a regular cache. To support data grids, we added hooks so that a cache can tell Mondrian about segments that have been added to other nodes in a cluster. This way, Mondrian becomes a genuine cluster. If I execute query X on node 1, it will put segments into the data grid that will make the query you are about to submit, query Y on node 2, execute faster.

As you can tell by the enthusiastic length of this post, I am very excited about this change to Mondrian's architecture. Outwardly, Mondrian executes the same MDX queries the same as it ever did. But the internal engine can scale better when running on a modern CPU with many cores; with external caches, the cache behaves much more predictably; and you can create clusters of Mondrian nodes that share their work and memory.

The changes will be released soon as Mondrian version 3.4, but you can help by downloading from the main line (or from CI), kicking the tires, and letting us know if you find any problems.

[Edited 2011/1/16, to fix version number.]

olap4j releases version 1.0.1, switches to Apache license


I am pleased to announce the release of olap4j version 1.0.1.

As the version number implies, this is basically a maintenance release. It is backwards compatible with version 1.0.0, meaning that any driver or application written for olap4j 1.0.0 should work with 1.0.1.

There is a year's worth of bug fixes, which should help the stability and performance of the XMLA driver in particular.

But more significant than the code changes is the change in license. Olap4j is now released under the Apache License, Version 2.0 (ASL). Our goal is to maximize the number of applications that use olap4j, and the number of drivers. ASL is a more permissive license than olap4j's previous license, the Eclipse Public License (EPL), and so helps drive adoption.

For instance, under ASL, if you create a driver by forking an existing driver, you are not required to publish your modified source code, and you may embed the driver in a non-ASL project or product. We hope that this will increase the number of commercial olap4j drivers. (Of course, we hope you will see the wisdom of contributing back your changes, but you are not obliged to.)

Before you ask: it is quite coincidental that this license change occurred in the same week that Pentaho Data Integration (Kettle) also switched to the Apache Software License. Although I'm sure that Pentaho's motivations were similar to ours.

Thanks to everyone who has contributed fixes and valuable feedback since olap4j 1.0.0, and in particular to Luc for wrangling the release out of the door.

Auto-generated date dimension tables

It seems that whenever I have a cross-continent flight, Mondrian gets a new feature. This particular flight was from Florida back home to California, and this particular feature is a time-dimension generator.

I was on the way home from an all-hands at Pentaho's Orlando, Florida headquarters, where new CEO Quentin Gallivan had outlined his strategy for the company. I also got to spend time with the many smart folks from all over the world who work for Pentaho, among them Roland Bouman, formerly an evangelist for MySQL, now with Pentaho, but still passionately advocating for open source databases, open source business intelligence, and above all, keeping it simple.

Roland and I got talking about how to map Mondrian onto operational schemas. Though not designed as star schemas, some operational schemas nevertheless have a structure that can support a cube, with a central fact table surrounded by star or snowflake dimension tables. Often the one thing missing is a time dimension table. Since these time dimension tables look very much the same, how easy would it be for Mondrian to generate them on the fly? Not that difficult, I thought, as the captain turned off the "fasten seatbelts" sign and I opened my laptop. Here's what I came up with.

Here's how you declare a regular time dimension table in Mondrian 4:

<PhysicalSchema>
  <Table name='time_by_day'/>
  <!-- Other tables... -->
</PhysicalSchema>


Mondrian sees the table name 'time_by_day', checks that it exists, and finds the column definitions from the JDBC catalog. The table can then be used in various dimensions in the schema.

An auto-generated time dimension is similar:

<PhysicalSchema>
  <AutoGeneratedDateTable name='time_by_day_generated' startDate='2012-01-01' endDate='2014-01-31'/>
  <!-- Other tables... -->
</PhysicalSchema>


The first time Mondrian reads the schema, it notices that the table is not present in the database, and creates and populates it. Here is the DDL and data it produces.

CREATE TABLE `time_by_day_generated` (
  `time_id` Integer NOT NULL PRIMARY KEY,
  `yymmdd` Integer NOT NULL,
  `yyyymmdd` Integer NOT NULL,
  `the_date` Date NOT NULL,
  `the_day` VARCHAR(20) NOT NULL,
  `the_month` VARCHAR(20) NOT NULL,
  `the_year` Integer NOT NULL,
  `day_of_month` VARCHAR(20) NOT NULL,
  `week_of_year` Integer NOT NULL,
  `month_of_year` Integer NOT NULL,
  `quarter` VARCHAR(20) NOT NULL)

JULIAN   YYMMDD  YYYYMMDD  DATE        DAY_OF_WEEK_NAME  MONTH_NAME  YEAR  DAY_OF_MONTH  WEEK_OF_YEAR  MONTH  QUARTER
2455928  120101  20120101  2012-01-01  Sunday            January     2012  1             1             1      Q1
2455929  120102  20120102  2012-01-02  Monday            January     2012  2             1             1      Q1
2455930  120103  20120103  2012-01-03  Tuesday           January     2012  3             1             1      Q1

The columns present are all of the time-dimension domains:

Domain            Default column name  Default data type  Example     Description
JULIAN            time_id              Integer            2454115     Julian day number (0 = January 1, 4713 BC). Additional attribute 'epoch', if specified, changes the date at which the value is zero.
YYMMDD            yymmdd               Integer            120219      Decimal date with two-digit year
YYYYMMDD          yyyymmdd             Integer            20120219    Decimal date with four-digit year
DATE              the_date             Date               2012-12-31  Date literal
DAY_OF_WEEK_NAME  the_day              String             Friday      Name of day of week
MONTH_NAME        the_month            String             December    Name of month
YEAR              the_year             Integer            2012        Year
DAY_OF_MONTH      day_of_month         String             31          Day ordinal within month
WEEK_OF_YEAR      week_of_year         Integer            53          Week ordinal within year
MONTH             month_of_year        Integer            12          Month ordinal within year
QUARTER           quarter              String             Q4          Name of quarter

Suppose you wish to choose specific column names, or have more control over how values are generated. You can do that by including a <ColumnDefs> element within the table, and <ColumnDef> elements within that — just like a regular <Table> element.

For example,

<PhysicalSchema>
  <AutoGeneratedDateTable name='time_by_day_generated' startDate='2008-01-01' endDate='2020-01-31'>
    <ColumnDefs>
      <ColumnDef name='time_id'>
        <TimeDomain role='JULIAN' epoch='1996-01-01'/>
      </ColumnDef>
      <ColumnDef name='my_year'>
        <TimeDomain role='year'/>
      </ColumnDef>
      <ColumnDef name='my_month'>
        <TimeDomain role='MONTH'/>
      </ColumnDef>
      <ColumnDef name='quarter'/>
      <ColumnDef name='month_of_year'/>
      <ColumnDef name='week_of_year'/>
      <ColumnDef name='day_of_month'/>
      <ColumnDef name='the_month'/>
      <ColumnDef name='the_date'/>
    </ColumnDefs>
    <Key>
      <Column name='time_id'/>
    </Key>
  </AutoGeneratedDateTable>
  <!-- Other tables... -->
</PhysicalSchema>


The first three columns have nested <TimeDomain> elements that tell the generator how to populate them.

The other columns have the standard column name for a particular time domain, and therefore the <TimeDomain> element can be omitted. For instance,

<ColumnDef name='month_of_year'/>

is shorthand for

<ColumnDef name='month_of_year' type='int'>
  <TimeDomain role="month"/>
</ColumnDef>


The nested <Key> element makes that column valid as the target of a link (from a foreign key in the fact table, for instance), and also declares the column as a primary key in the CREATE TABLE statement. This has the pleasant side-effect, on all databases I know of, of creating an index. If you need other indexes on the generated table, create them manually.

The <TimeDomain> element could be extended further. For instance, we could add a locale attribute. This would allow different translations of month and weekday names, and also support locale-specific differences in how week-in-day and day-of-week numbers are calculated.

Note that this functionality is checked into the mondrian-lagunitas branch, so will only be available as part of Mondrian version 4. That release is still pre-alpha. We recently started to regularly build the branch using Jenkins, and you should see the number of failing tests dropping steadily over the next weeks and months. Already over 80% of tests pass, so it's worth downloading the latest build to kick the tires on your application.

How should Mondrian get table and column statistics?

When evaluating queries, Mondrian sometimes needs to make decisions about how to proceed, and in particular, what SQL to generate. One decision is which aggregate table to use for a query (or whether to stick with the fact table), and another is whether to "round out" a cell request for, say, 48 states and 10 months of 2011 to the full segment of 50 states and 12 months.

These decisions are informed by the volume of actual data in the database. The first decision uses row counts (the numbers of rows in the fact and aggregate tables) and the second uses column cardinalities (the number of distinct values in the "month" and "state" columns).

Gathering statistical information is an imperfect science. The obvious way to get the information is to execute some SQL queries:
-- row count of the fact table
select count(*) from sales_fact_1997;

-- count rows in an aggregate table
select count(*) from agg_sales_product_brand_time_month;

-- cardinality of the [Customer].[State] attribute
select count(distinct state) from customer;
These queries can be quite expensive. (On many databases, a row count involves reading every block of the table into memory and summing the number of rows in each. A query for a column's cardinality involves scanning the entries of an index; or, worse, a table scan followed by an expensive sort if there is no such index.)

Mondrian doesn't need the exact value, but it does need an approximate value (say, correct within a factor of 3) in order to proceed with the query.

Mondrian has a statistics cache, so the statistics calls only affect the "first query of the day", when Mondrian has been re-started, or is using a new schema. (If you are making use of a dynamic schema processor, it might be that every user effectively has their own schema. In this case, every user will experience their own slow "first query of the day".)

We have one mechanism to prevent expensive queries: you can provide estimates in the Mondrian schema file. When you are defining an aggregate table, specify the approxRowCount attribute of the <AggName> XML element, and Mondrian will skip the row count query. When defining a level, if you specify the approxRowCount attribute of the <Level> XML element (the <Attribute> XML element in mondrian-4), Mondrian will skip the cardinality query. But it is time-consuming to fill in those counts, and they can go out of date as the database grows.

I am mulling over a couple of features to ease this problem. (These features are not committed for any particular release, or even fully formed. Your feedback to this post will help us prioritize them, shape them so that they are useful for how you manage Mondrian, and hopefully trim their scope so that they are reasonably simple for us to implement.)

Auto-populate volume attributes

The auto-populate feature would read a schema file, run queries on the database to count every fact table, aggregate table, and the key of every level, and populate the approxRowCount attributes in the schema file. It might also do some sanity checks, such as that the primary key of your dimension table doesn't have any duplicate values, and warn you if they are violated.

Auto-populate is clearly a time-consuming task. It might take an hour or so to execute all of the queries. You could run it say once a month, at a quiet time of day. But at the end, the Mondrian schema would have enough information that it would not need to run any statistics queries at run time.

Auto-populate has a few limitations. Obviously, you need to schedule it, as a manual task, or a cron job. Then you need to make sure that the modified schema file is propagated into the solution repository. Lastly, if you are using a dynamic schema processor to generate or significantly modify your schema file, auto-populate clearly cannot populate sections that have not been generated yet.

Pluggable statistics

The statistics that Mondrian needs probably already exist. Every database has a query optimizer, and every query optimizer needs statistics such as row counts and column cardinalities to make its decisions. So, that ANALYZE TABLE (or equivalent) command that you ran after you populated the database (you did run it, didn't you?) probably calculated these statistics and stored them somewhere.

The problem is that that "somewhere" is different for each and every database. In Oracle, they are in ALL_TAB_STATISTICS and ALL_TAB_COL_STATISTICS tables; in MySQL, they are in INFORMATION_SCHEMA.STATISTICS. And so forth.

JDBC claims to provide the information through the DatabaseMetaData.getIndexInfo method. But it doesn't work for all drivers. (The only one I tried, MySQL, albeit a fairly old version, didn't give me any row count statistics.)

Let's suppose we introduced an SPI to get table and column statistics:

package mondrian.spi;

import javax.sql.DataSource;

interface StatisticsProvider {
   int getColumnCardinality(DataSource dataSource, String catalog, String schema, String table, String[] columns);
   int getTableCardinality(DataSource dataSource, String catalog, String schema, String table);
}
and several implementations:
  • A fallback implementation SqlStatisticsProvider that generates "select count(distinct ...) ..." and "select count(*) ..." queries.
  • An implementation JdbcStatisticsProvider that uses JDBC methods such as getIndexInfo
  • An implementation that uses each database's specific tables, OracleStatisticsProvider, MySqlStatisticsProvider, and so forth.
Each Dialect could nominate one or more implementations of this SPI, and try them in order. (Each method can return -1 to say 'I don't know'.)
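
As a rough illustration, the fallback SqlStatisticsProvider could be as simple as the following sketch. The JDBC plumbing here is illustrative, since a real implementation would go through Mondrian's dialect and connection machinery, and multi-column distinct counts would need per-dialect SQL.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

/** Fallback statistics provider that issues count queries directly.
 *  A sketch against the proposed SPI, not Mondrian's actual code. */
public class SqlStatisticsProvider implements StatisticsProvider {

    public int getTableCardinality(
            DataSource dataSource, String catalog, String schema, String table) {
        return queryForInt(dataSource,
            "select count(*) from " + qualify(catalog, schema, table));
    }

    public int getColumnCardinality(
            DataSource dataSource, String catalog, String schema, String table,
            String[] columns) {
        // For a single column this is "select count(distinct col) from t";
        // multi-column distinct counts are not portable across dialects.
        return queryForInt(dataSource,
            "select count(distinct " + String.join(", ", columns) + ") from "
            + qualify(catalog, schema, table));
    }

    private static String qualify(String catalog, String schema, String table) {
        StringBuilder buf = new StringBuilder();
        if (catalog != null) buf.append(catalog).append('.');
        if (schema != null) buf.append(schema).append('.');
        return buf.append(table).toString();
    }

    private static int queryForInt(DataSource dataSource, String sql) {
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(sql)) {
            return resultSet.next() ? resultSet.getInt(1) : -1;
        } catch (SQLException e) {
            return -1;  // "I don't know"; let the next provider try.
        }
    }
}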

Conclusion

Statistics are an important issue for Mondrian. In the real world, missing statistics are more damaging than somewhat inaccurate statistics. If statistics are inaccurate, Mondrian will execute queries inefficiently, but the difference from optimal performance is negligible if the statistics are within an order of magnitude; missing statistics cause Mondrian to issue potentially expensive SQL statements to gather them, especially during that all-important first query of the day.

A couple of solutions are proposed.

The auto-population tool would solve the problem in one way, at the cost of logistical effort to schedule the running of the tool.

The statistics provider leverages databases' own statistics. It solves the problem of diversity the usual open source way: it provides an SPI and lets the community provide implementations of that SPI for their favorite database.

"Big Data" is dead... long live Big Data Architecture

Now that just about every data-management and business intelligence product claims that it handles "Big Data", the term is approaching zero information content.

So, I'm shorting the term "Big Data". In the next few months, the marketers will realize that their audience realizes that the term means nothing and, in accordance with Monash's First Law of Commercial Semantics, they'll start coming up with new terms.

Have any of those terms been spotted in the wild yet?

Though I'm still not clear what exactly Big Data is, I am fond of the term "Big Data Architecture". That term describes — fairly concisely, to the people who I want to understand me — the idea of a system where scalability is so important that it's best not to assume that there is only one of anything; where scalability is so important that it's worth revisiting all your assumptions; and where the raw performance of each component in the system is not paramount, because if the components can be composed in a scalable fashion, the system will meet its performance goals.

This architecture is going to be the standard for the kind of systems I build, so I think I'll be using the term "Big Data Architecture" for many years to come. If you can come up with a good alternative to that one, I might just buy you a pint.

Data-oriented programming for the rest of us

I have been a fan of LINQ for several years (my Saffron project covered many of the same themes) but I've had difficulty explaining why it isn't just a better Hibernate. In his article “Why LINQ Matters: Cloud Composability Guaranteed” (initially in ACM Queue, now in April's CACM), Brian Beckman puts his finger on it.

The idea is composability.

He writes:
Encoding and transmitting such trees of operators across tiers of a distributed system have many specific benefits, most notably:
  • Bandwidth savings from injecting filters closer to producers of data and streams, avoiding transmission of unwanted data back to consumers.
  • Computational efficiency from performing calculations in the cloud, where available computing power is much greater than in clients.
  • Programmability from offering generic transform and filter services to data consumers, avoiding the need for clairvoyant precanning of queries and data models at data-producer sites.
Databases have been doing this kind of stuff for years. There is a large performance difference between stored and in-memory data, and often several ways to access it, so the designers of the first databases took the decision about which algorithm to use out of the hands of the programmer. They created a query language out of a few theoretically well-behaved (and, not coincidentally, composable) logical operators, a set of composable physical operators to implement them, and a query planner to convert from one to the other. (Some call this component a “query optimizer”, but I prefer the more modest term.) Once the query planner was in place, they could re-organize not only the algorithms, but also the physical layout of the data (such as indexes and clustered tables) and the physical layout of the system (SMP and shared-nothing databases).

These days, there are plenty of other programming tasks that can benefit from the intervention of a planner that understands the algorithm. The data does not necessarily reside in a database (indeed, may not live on disk at all), but needs to be processed on a distributed system, connected by network links of varying latency, by multi-core machines with lots of memory.

What problems benefit from this approach? Problems whose runtime systems are complex, and where the decisions involve large factors. For example, “Is it worth writing my data to a network connection, which has 10,000x the latency of memory, if this will allow me to use 1000x more CPUs to process it?”. Yes, there are a lot of problems like that these days.

Composability

Beckman's shout-out to composability is remarkable because it is something the database and programming language communities can agree on. But though they may agree about the virtues of composability, they took it in different directions. The database community discovered composability years ago, but then set their query language into stone, so you couldn't add any more operators. Beckman is advocating writing programs using composable operators, but does not provide a framework for optimizing those operator trees.

LINQ stands for “Language-INtegrated Query”, but for these purposes, the important thing about LINQ is not that it is “language integrated”. It really doesn't matter whether the front end to a LINQ system uses a “select”, “where” and “from” operator reminiscent of SQL:


var results = from c in SomeCollection
              where c.SomeProperty < 10
              select new {c.SomeProperty, c.OtherProperty};
or higher-order operators on collections:
var results = SomeCollection
    .Where(c => c.SomeProperty < 10)
    .Select(c => new {c.SomeProperty, c.OtherProperty});
or actual SQL embedded in JDBC:
ResultSet results = statement.executeQuery(
    "SELECT SomeProperty, OtherProperty\n"
    + "FROM SomeCollection\n"
    + "WHERE SomeProperty < 10");
All of the above formulations are equivalent, and each can be converted into the same intermediate form, a tree of operators.
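As a rough sketch of that intermediate form (the class names and the predicate are invented for illustration, not any real API), the tree behind all three formulations might look like this in Java:

// Hypothetical operator-tree classes, invented purely for illustration.
abstract class RelNode {
}

class ScanNode extends RelNode {
    final String collection;
    ScanNode(String collection) {
        this.collection = collection;
    }
}

class FilterNode extends RelNode {
    final RelNode input;
    final String predicate;
    FilterNode(RelNode input, String predicate) {
        this.input = input;
        this.predicate = predicate;
    }
}

class ProjectNode extends RelNode {
    final RelNode input;
    final String[] fields;
    ProjectNode(RelNode input, String... fields) {
        this.input = input;
        this.fields = fields;
    }
}

public class PlanExample {
    // The tree that all three formulations reduce to:
    // Project(Filter(Scan(SomeCollection), predicate), fields)
    static final RelNode PLAN =
        new ProjectNode(
            new FilterNode(
                new ScanNode("SomeCollection"), "SomeProperty < 10"),
            "SomeProperty", "OtherProperty");
}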

What matters is what happens next: a planner behind the scenes converts the operator tree into an optimal algorithm. The planner understands what the programmer is asking for, the physical layout of the data sources, the statistics about the size and structure of the data, the resources available to process the data, and the algorithms that are available to accomplish the task. The effect is that the program always executes efficiently, even if the data and system are re-organized after the program has been written.

Query planner versus compiler

Composability is the secret sauce that powers query planners, including the one in LINQ. At first sight, a query planner seems to have a similar purpose to a programming language compiler. But a query planner is aiming to reap the large rewards, so it needs to consider radical changes to the operator tree. Those changes are only possible if the operators are composable, and sufficiently well-behaved to be described by a small number of transformation rules. A compiler does not consider global changes, so does not need a simple, composable language.

The differences between compiler and query planner go further. They run in different environments, and have different goals. Compared to a typical programming language compiler, a query planner...

  • ... plans later. A compiler optimizes at the time the program is compiled; query planners optimize just before the query is executed.
  • ... uses more information. A compiler uses only the structure of the program; query planners also draw on the dynamic state of the system.
  • ... is involved in task scheduling. Whereas a compiler is quite separate from the task scheduler in the language's runtime environment, the line between query planners and query schedulers is blurred. Resource availability is crucial to query planning.
  • ... optimizes over a greater scope. A compiler optimizes individual functions or modules; query planners optimize the whole query, or even the sequence of queries that make up a job.
  • ... deals with a simpler language. Programming languages aim to be expressive, so have many times more constructs than query languages. Query languages are (not by accident) simple enough to be optimized by a planner. (This property is what Beckman calls “composability”.)
  • ... needs to be more extensible. A compiler's optimizer only needs to change when the language or the target platform changes, whereas a query planner needs to adapt to new front-end languages, algorithms, cost models, back-end data systems and data structures.
These distinctions over-generalize a little, but I am trying to illustrate a point. And I am also giving query planners an unfair advantage, contrasting a “traditional” compiler with a “still just a research project” planner. (Modern compilers, in particular just-in-time (JIT) compilers, share some of the dynamic aspects of query planners.) The point is that a compiler and a planner have different roles, and one should not imagine that one can do the job of the other.

The compiler allows you to write your program at a high level of abstraction in a rich language; its task is to translate that complex programming language into a simpler machine representation. The planner allows your program to adapt to its runtime environment, by looking at the big picture. LINQ allows you to have both; its architecture provides a clear call-out from the compiler to the query planner. But it can be improved upon, and points to a system superior to LINQ, today's database systems, and other data management systems such as Hadoop.

A manifesto

1. Beyond .NET. LINQ only runs on Microsoft's .NET framework, yet Java is arguably the standard platform for data management. There should be front-ends for Java and for other JVM-based languages such as Scala and Clojure.

2. Extensible planner. Today's database query planners work with a single query language (usually SQL), with a fixed set of storage structures and algorithms, usually requiring that data be brought into their database before they will query it. Planners should allow application developers to add operators and rules. By these means, a planner could accept various query languages, target various data sources and data structures, and use various runtime engines.

3. Rule-driven. LINQ has already rescued data-oriented programming from the database community, and proven that a query planner can exist outside of a database. But to write a LINQ planner, you need to be a compiler expert. Out of the frying pan and into the fire. Planners should be configurable by people who are neither database researchers nor compiler writers, by writing simple rules and operators. That would truly be data-oriented programming for the rest of us.
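To gesture at how small such a rule could be, here is a sketch in the spirit of the hypothetical operator-tree classes above (again, the names and structure are invented for illustration, not any real planner API): a rule that pushes a filter into the scan beneath it.

// Hypothetical rule, reusing the RelNode/ScanNode/FilterNode classes
// sketched earlier. A real planner would match patterns declaratively
// and use a cost model to decide whether to keep the rewritten tree.
class FilteredScanNode extends RelNode {
    final String collection;
    final String predicate;
    FilteredScanNode(String collection, String predicate) {
        this.collection = collection;
        this.predicate = predicate;
    }
}

class PushFilterIntoScanRule {
    RelNode apply(RelNode node) {
        if (node instanceof FilterNode) {
            FilterNode filter = (FilterNode) node;
            if (filter.input instanceof ScanNode) {
                ScanNode scan = (ScanNode) filter.input;
                // Merge the predicate into the scan, so the filter can be
                // evaluated close to the data (or pushed to the source).
                return new FilteredScanNode(scan.collection, filter.predicate);
            }
        }
        return node; // rule does not match; leave the tree unchanged
    }
}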
