Julian Hyde on Streaming Data, Open Source OLAP. And stuff.

A first look at linq4j

This is a sneak peek of an exciting new data management technology. linq4j (short for "Language-Integrated Query for Java") is inspired by Microsoft's LINQ technology, previously only available on the .NET platform, and adapted for Java. (It also builds upon ideas I had in my earlier Saffron project.)



I launched the linq4j project less than a week ago, but already you can do select, filter, join and groupBy operations on in-memory and SQL data.

In this demo, I write and execute sample code against the working system, and explain the differences between the key interfaces Iterable, Enumerable, and Queryable.
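Roughly, the distinction between those interfaces is as follows. (This is a sketch only, reusing the Predicate1 filter from the query below; the comments are the important part.)

// Iterable<T>   - plain Java: all you can do is loop over the elements.
// Enumerable<T> - linq4j: adds operators such as where, join and groupBy,
//                 evaluated in memory, element by element.
// Queryable<T>  - also carries an expression tree, so a provider (such as
//                 the SQL provider) can translate the whole query into its
//                 own language instead of evaluating it in Java.
Enumerable<Employee> managers =
    provider.emps
        .where(
            new Predicate1<Employee>() {
                public boolean apply(Employee v1) {
                    return v1.manager;
                }
            });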

For those of you who want to get a closer look at the real code, here's one of the queries shown in the demo:
DatabaseProvider provider =
    new DatabaseProvider(Helper.MYSQL_DATA_SOURCE);
provider.emps
    .where(
        new Predicate1<Employee>() {
            public boolean apply(Employee v1) {
                return v1.manager;
            }
        })
    .join(
        provider.depts,
        new Function1<Employee, Integer>() {
            public Integer apply(Employee a0) {
                return a0.deptno;
            }
        },
        new Function1<Department, Integer>() {
            public Integer apply(Department a0) {
                return a0.deptno;
            }
        },
        new Function2<Employee, Department, String>() {
            public String apply(Employee v1,
                                Department v2) {
                return v1.name + " works in " + v2.name;
            }
        }
    )
    .foreach(
        new Function1<String, Void>() {
            public Void apply(String a0) {
                System.out.println(a0);
                return null;
            }
        }
    );
and here is its (not yet implemented) sugared syntax:
List<String> strings =
    from emp in provider.emps,
        join dept in provider.depts on emp.deptno == dept.deptno
    where emp.manager
    orderBy emp.name
    select emp.name + " works in " + dept.name;
For more information, visit the linq4j project's home page.

From the ashes of the database revolution...


With NoSQL and Hadoop, the database world has undergone a revolution. The fighting reached its peak a couple of years ago, but things have calmed down since, and now is a good time to take stock of old and new style data management technologies.

From this revolution, we can learn a lot about what databases should, and should not be. At the end of this post, I propose a system, called Optiq, that would restore to NoSQL/Hadoop systems some of the good features of databases.

Learning from history

Revolutions tend to follow patterns. George Orwell allegorized the progress of the Russian Revolution in his novel Animal Farm. He described the injustices that were the trigger for the revolution, the new egalitarian value system established after the revolution, and the eventual corruption of those values. Revolutions are an opportunity to introduce new ideas, not all of them good ones. For example, the French revolution put in place a decimal system, and though they kept the kilogramme and the metre, they were forced to quickly relinquish the 10 hour day and the 10 day week when the workers discovered that they'd been conned out of 30% of their weekend time.

We see all of these forces at play in the database revolution. The triggers for the revolution were the requirements that traditional RDBMSs could not meet (or were not meeting at the time). The revolution adopted a new paradigm, introduced some new ideas, and threw out some old ideas. I am interested in which of those old ideas should be reinstated under the new regime.

I am a database guy. I was initially skeptical about the need for a revolution, but a couple of years ago I saw that Hadoop and NoSQL were gaining traction, had some good ideas, growing momentum, and were here to stay. My prediction is that traditional and new data management systems will grow more and more similar in appearance over the next 5-10 years. Traditional RDBMSs will adopt some of the new ideas, and the new systems will support features that make them palatable to customers accustomed to using traditional databases.

But first, some terminology. (As George Orwell would agree, what you call something is almost as important as what it is.)

  • I call the new breed of systems "data management systems", not "databases". The shift implies something less centralized and more distributed, concerned with processing data as well as just storing and querying it. Or maybe I'm confusing terminology with substance.
  • I distinguish NoSQL systems from Hadoop, because Hadoop is not a data management system. Hadoop is a substrate upon which many awesome things can, and will, be built, including ETL and data management systems.
  • NoSQL systems are indeed databases, but they throw out several of the key assumptions of traditional databases.
  • I'm steering clear of the term "Big data" for reasons I've already made clear.

The good stuff

In the spirit of post-revolutionary goodwill, let's steer clear of our pet gripes and list out what is best about the old and new systems.

Good features from databases

  • SQL language allows integration with other components, especially components that generate queries and need to work on multiple back-ends.
  • Management of redundant data (such as indexes and materialized views), and physically advantageous data layout (sorted data, clustered data, partitioned tables)
  • ACID transactions
  • High-performance implementations of relational operators
  • Explicit schema, leading to concise, efficient queries

Good features from Hadoop/NoSQL systems

  • Easy scale-out on commodity hardware
  • Non-relational data
  • User-defined and non-relational operators
  • Data is not constrained by schema

Scale and transactions

Scale-out is key. The new data systems all run at immense scale. If traditional databases scaled easily and cheaply, the revolution would probably not have happened.

There are strong arguments for and against supporting ACID transactions. Everyone agrees that transactions have high value: without them, it is more difficult to write bug-free applications. But the revolutionaries assert that ACID transactions have to go, because it is impossible to implement them efficiently. Newer research suggests that there are ways to implement transactions at acceptable cost.

In my opinion, transactions are not the main issue, but are being scapegoated because of the underlying problem of scalability. We would not be having the debate — indeed, the whole NoSQL movement may not have occurred — if conventional databases had been able to scale as their users wanted.

To be honest, I don't have a lot of skin in this game. As an analytic database technology, Optiq is concerned more with scalability than transactions. But it's interesting that transactions, like the SQL language, were at first declared to be enemies of the revolution, and are now being rehabilitated.

Schema

Relational databases require a fixed schema. If your data has a schema, and the schema does not change over time, this is a good thing. Your queries can be more concise because you are not defining the same fields, types, and relationships every time you write a query.

Hadoop data does not have a schema (although you can impose one, after the event, using tools such as Pig and Hive).

The ideal would seem to be that you can provide a schema if the data conforms to a fixed format, provide a loose schema if, say, records have variable numbers of fields, or operate without one. In Hadoop, as in ETL tools, data is schema-less in the early stages of a pipeline; stronger typing is applied in later stages, as fields are parsed and assigned names, and records that do not conform to the required schema are eliminated.
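Here is a minimal sketch of that "schema applied late" pattern (the class, fields and file format are hypothetical, not part of Optiq or Hadoop): early stages pass raw lines through untouched; a later stage parses them into typed records and drops anything that does not conform.

static final class EmpRecord {
    final int deptno;
    final String name;

    EmpRecord(int deptno, String name) {
        this.deptno = deptno;
        this.name = name;
    }
}

/** Parses a raw comma-separated line; returns null if the record
 * does not conform to the schema, so that it can be eliminated. */
static EmpRecord parse(String rawLine) {
    String[] fields = rawLine.split(",");
    try {
        return new EmpRecord(
            Integer.parseInt(fields[0].trim()), fields[1].trim());
    } catch (RuntimeException e) {
        return null;
    }
}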

Location, control and organization of data

Traditional databases own their storage. Their data resides in files, sometimes entire file systems, that can only be accessed by the database. This allows the database to tightly control the access to and organization of the data. But it means that the data cannot be shared between systems, even between databases made by the same vendor.

Hadoop clusters are expensive, but if several applications share the same cluster, the utilization is kept high, and the cost is spread across more departments' budgets. Applications may share data sets, not just processing resources, but they access the data in place. (That place may or may not be HDFS.) Compared to copying data into an RDBMS, sharing data in place saves both time and money.

Lastly, the assumption that data is shared encourages applications to use straightforward formats for their data. A wide variety of applications can read the data, even those not envisioned when the data format was designed.

SQL and query planning

SQL is the hallmark of an RDBMS (at least for those of us too young to remember QUEL). SQL is complicated to implement, so upstart open source projects, in their quest to implement the simplest thing that could possibly work, have been inclined to make do with less powerful "SQL-like" languages. Those languages tend to disappoint when it comes to interoperability and predictability.

But I contend that SQL support is a consequence of a solid data management architecture, not the end in itself. A data management system needs to accept new data structures and organizations, and apply them without rewriting application code. It therefore needs a query planner. A query planner, in turn, requires a metadata catalog and a theoretically well-behaved logical language, usually based on relational algebra, for representing queries. Once you have built these pieces, it is not a great leap to add SQL support.

The one area where SQL support is essential is tool integration. Tools, unless written for that specific database, want to generate SQL as close to the SQL standard as possible. (I speak from personal experience, having written Mondrian dialects for more than a dozen "standards compliant" databases.) Computer-generated SQL is not very smart — for example, you will often see trivial conditions like "WHERE 1 = 1" and duplicate expressions in the SELECT clause — and therefore needs to be optimized.

Flat relational data

There is no question that (so-called "flat") relational data is easier for the database to manage. And, we are told, Ted Codd decreed forty years ago that relational data is all we should ever want. Yet I think that database users deserve better.

Codd's rules about normalization have been used to justify a religious war, but I think his point was this. If you maintain multiple copies of the same information, you'll get into trouble when you try to update it. One particular, and insidious, form of redundant information is the implicit information in ordered or nested data.

That said, we're grown-ups. We know that there are risks to redundancy, but there are also significant benefits. The risks are reduced if the DBMS helps you manage that redundancy (what are indexes, anyway?), and the benefits are greater if your database is read much more often than it is updated. Why should the database not return record sets with line-items nested inside their parent orders, if that's what the application wants? No reason that I can think of.

In summary, a data management system should allow "non-flat" data, and operations on that data, while keeping a semantics based, as far as possible, on the relational algebra.

Introducing Optiq

Optiq aims to add the "good ideas" from traditional databases onto a new-style Hadoop or NoSQL architecture.

To a client application, Optiq appears to be a database that speaks SQL and JDBC, but Optiq is not a database. Whereas a database controls storage, processing, resource allocation and scheduling, Optiq cedes these powers to the back-end systems, which we call data providers.

Optiq is not a whole data management system. It is a framework that can mediate with one or more data management systems. (Optiq could be configured and distributed with a scheduler, metadata layer, data structures, and algorithms, so that it comes out of the box looking like a database. In fact, we hope and expect that some people will use it that way. But that is not the only way it is intended to be used.)

The core of the framework is the extensible query planner. It allows providers to specify their own type systems, operators, and optimizations (for example, switching to a materialized view, or eliminating a sort if the underlying file is already sorted). It also allows applications to define their own functions and operators, so that their application logic can run in the query-processing fabric.

An example

You might describe Optiq as a database with the hood open, accessible to anyone who wants to tinker with the engine. Here is a simple example:

Class.forName("net.hydromatic.optiq.jdbc.Driver");
Connection connection =
  DriverManager.getConnection("jdbc:optiq:");
OptiqConnection optiqConnection =
  connection.unwrap(OptiqConnection.class);
JavaTypeFactory typeFactory = optiqConnection.getTypeFactory();
optiqConnection.getRootSchema().add(
  "HR",
  new CsvSchema("/var/flatfiles/hr", typeFactory));
ResultSet resultSet =
  connection.createStatement().executeQuery(
    "SELECT e.name, e.sal, d.name AS department\n" +
    "FROM hr.emps AS e, hr.depts AS d\n" +
    "WHERE e.deptno = d.deptno\n" +
    "ORDER BY e.empno");
while (resultSet.next()) {
  System.out.println(
    "emp=" + resultSet.getString(1) +
    ", sal=" + resultSet.getInt(2) +
    ", department=" + resultSet.getString(3));
}
resultSet.close();

The program requires a directory, /var/flatfiles/hr, containing the files EMPS.csv and DEPTS.csv. Each file has a header record describing the fields, followed by several records of data.
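For reference, the files might look something like this. (The contents are illustrative; the column names are implied by the query above, and the exact header syntax expected by CsvSchema may differ, for instance by also encoding column types.)

EMPS.csv:
EMPNO,NAME,SAL,DEPTNO
100,Fred,3000,10
110,Eric,4000,20

DEPTS.csv:
DEPTNO,NAME
10,Sales
20,Marketing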

There is no other data or metadata, and in fact CsvSchema is an extension, not a built-in part of the system.

When the connection is opened, the virtual database is empty. There are no tables, nor even any schemas. The getRootSchema().add( ... ) call registers a schema with a given name. It is like mounting a file-system.

Once the CsvSchema is registered with the connection with the name "HR", Optiq can retrieve the table and column metadata to parse and optimize the query. When the query is executed, Optiq calls CsvSchema's implementations of linq4j's Enumerable interface to get the contents of each table, applies built-in Java operators to join and sort the records, and returns the results through the usual JDBC ResultSet interface.

This example shows that Optiq contains a full SQL parser, planner and implementations of query operators, but it makes so few assumptions about the form of data and location of metadata that you can drop in a new storage plugin in a few lines of code.

Design principles

The design of the Optiq framework is guided by the following principles.

  • Do not try to control the data, but if you know about the data organization, leverage it.
  • Do not require a schema, but if you know about the shape of the data, leverage it.
  • Provide the SQL query language and JDBC interface, but allow other languages/interfaces.
  • Support linq4j as a backend, but allow other protocols.
  • Delegate policy to the data providers.

Let's see how Optiq brings the "good ideas" of databases to a NoSQL/Hadoop provider.

Applying these principles to schemas, Optiq can operate with no, partial, or full schema. Data providers can determine their own type system, but are generally expected to be able to operate on records of any type: that may be a single string or binary field, and may contain nested collections of records. Since Optiq does not control the data, if operating on a schema-less provider like Hadoop, Optiq would apply its schema to already loaded data, as Pig and Hive do. If Optiq is assured that the data is clean (for example, a particular field is always an integer) then it may be able to optimize.

Optiq's type system allows records to contain nested records, and provides operators to construct and destruct nested collections. Whereas SQL/JDBC queries do not stretch the type system, linq4j gives Optiq a workout: it needs to support the Java type system and operations such as selectMany and groupBy that operate on collection types.
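For a flavor of what that means, here is a minimal sketch of a selectMany call in linq4j's anonymous-class style. (The deptList variable, and the Department and Employee classes with a nested employees collection, are hypothetical; I am also assuming linq4j's selectMany takes a LINQ-like Function1 returning an Enumerable.)

// deptList is a hypothetical List<Department>, where each Department
// holds a nested List<Employee> called employees.
Enumerable<Department> depts = Linq4j.asEnumerable(deptList);
Enumerable<Employee> allEmps =
    depts.selectMany(
        new Function1<Department, Enumerable<Employee>>() {
            public Enumerable<Employee> apply(Department d) {
                // Flatten the nested collection of employees in each
                // department into a single Enumerable of employees.
                return Linq4j.asEnumerable(d.employees);
            }
        });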

Lastly, on breaking down the rigid boundary between database and application code.

My goal in data-oriented programming is to allow applications, queries, and extension functions and operators to be written in the same language — and if possible using the same programming model, and on the same page of code — and distributed to where query processing is taking place.

The paradigms should be the same, as far as possible. (MapReduce fails this test. Even though MapReduce is Java, one would not choose to write algorithms in this way if there was not the payoff of a massively scalable, fault-tolerant execution infrastructure. Scalding is an example of a DSL that succeeds in making queries fairly similar to "ordinary programming".)

That said, Optiq is not going to fully solve this problem. It will be a research area for years to come. LINQ made a good start. Optiq has a query planner, and is open and extensible for front-end query languages, user-defined operators, and user-defined rules. Those tools should allow us to efficiently and intelligently push user code into the fabric of the query-processing system.

Conclusion

Optiq attempts to create a high-level abstraction on top of Hadoop/NoSQL systems that behaves like a database but does not dilute the strengths of the data provider. But it brings in only those features of databases necessary to create that abstraction; it is a framework, not a database.

Watch this space for further blog posts and code. Or catch me at Hadoop Summit next week and ask me for a demo.

Pesky quoted identifiers in SQL


The SQL that Mondrian generates has, until now, been different from the SQL that most people would write by hand. Most people don't use spaces or punctuation in table and column names, and don't enclose identifiers in quotation marks when writing SQL DDL, DML or queries. Mondrian, on the other hand, religiously quotes every identifier, whether it needs it or not.

The two styles are not compatible because on many databases (Oracle is one example) unquoted identifiers are implicitly converted to upper-case. If you use lower-case table and column names in Mondrian's schema, they will not match the upper-case identifiers created during DDL.

For instance, if you create a table in Oracle using

CREATE TABLE emp (
  empno INTEGER,
  ename VARCHAR2(30),
  deptno INTEGER);

then Oracle creates a table called EMP with columns EMPNO, ENAME and DEPTNO. When you query it using

SELECT ename FROM emp WHERE deptno = 20;

the effect is as if you had written

SELECT ENAME FROM EMP WHERE DEPTNO = 20;

Now, if you've told Mondrian that the table is called "emp", Mondrian tries to be helpful. It generates the query

SELECT "ename" FROM "emp" WHERE "deptno" = 20;

Of course, there is no table called "emp", only one called "EMP", so on case-sensitive databases such as Oracle this causes an error. You then need to go back to your schema and change

<Table name="emp"/>
to
<Table name="EMP"/>

and all other table and column names in your schema. Yuck!

There is now a simpler way. The Schema XML element has a quoteSql attribute:

<Schema name='FoodMart' metamodelVersion='4.0' quoteSql='false'>

If you set quoteSql='false', Mondrian will not quote identifiers when generating SQL. (Actually, it will still quote them if they contain spaces and such. But we recommend that if you use quoteSql='false', you use sensible table names containing only alphanumeric characters and '_'.)

More details can be found in MONDRIAN-887. It is only fixed in the lagunitas branch (i.e., mondrian-4.0 alpha), and only in new-style schemas (not mondrian-3 style schemas automatically upgraded). Give it a try and let me know how it works for you.

Mondrian in Action

I am delighted to announce an upcoming book all about Mondrian, called "Mondrian in Action". Some chapters are available in electronic form now, and the final print version is scheduled to hit the shelves in Spring, 2013. For one day only, there is a 50% discount if you pre-order the book and join the early-access program.

Whoever heard of a successful open source project that didn't have a book? Mondrian has become successful without one... but our long-suffering users have had to piece together documentation from the scrappy online documentation, forum posts and mailing list archives. For years, my answer has been, "A book is a great idea, but I'm just too busy writing software!" Finally, I've teamed up with two Mondrian and Pentaho experts, Bill Back (@billbackbi) and Nicholas Goodman (@nagoodman), and we've set out to create the definitive guide. Now all the information will be in one place, right there on your desk between your keyboard and your coffee mug.

Mondrian in Action serves several audiences. It explains to end-users and CIOs how Mondrian analytics can unlock the value in business data. For schema developers and DBAs, it describes in depth how to create and administer a Mondrian system. Mondrian in Action covers the upcoming Mondrian 4 release, and includes chapters on security, multi-tenancy, integration with other tools such as the Pentaho Business Analytics suite and Saiku, advanced analytics and visualizations, and integration with Big Data technologies.

The book is part of the acclaimed "In Action" series from Manning Publications. Books in this series are known for their direct approach to the subject, and concrete, practical examples. For countless open source projects, the definitive guide is an "In Action" book.

Speaking of open source, the Manning Early-Access Program (MEAP) is a 'release early, release often' process. This helps the community shape the book while it is being written. One chapter of the book is available to all. If you pre-order the book, you will get electronic access to draft chapters as they are completed, and a print copy when it is released. There is a forum where you can post questions and give feedback on the book.

Manning are offering a 50% discount for one day only. (It expires at midnight, eastern time, on November 1st.) Go to the publisher's site, http://www.manning.com/back, and enter discount code dotd1101au.

Mondrian in Action in Action

I couldn't resist. When the Mondrian in Action book is published in the spring, it will look something like this...

Artist's conception of the upcoming bestseller, "Mondrian in Action"

Explaining holidays to a 3 year old

Your birthday involves presents and cake. The day after your birthday you are allowed to say "I am x", where x is the number of candles on your cake.

4th of July is the birthday of our country, which is called America. America has decided that it doesn't want presents or a cake, just fireworks.

Thanksgiving is a big meal with your family. We eat a turkey, which is basically a very large chicken.

Christmas celebrates the first birthday, a very long time ago, of a baby who was very special, a bit like a king but without a palace or anything, who grew up to be very, very nice indeed, so nice that he made everyone else a bit nicer. And there were shepherds and animals. Also, kind of like a birthday for everyone, because everyone gets presents, although giving them is the important part. Also, we have an indoor tree with lights and things on it. Also, we have a meal exactly like Thanksgiving.

New Year. We all need new calendars tomorrow. Train calendars are best.

Happy New Year, everyone. I hope you get the train calendar your heart desires.

Announcing Lingual

The last few months, I've been collaborating on a project with Chris Wensel, the author of Cascading. Last week we announced Lingual, an open source project that puts a SQL interface on top of Cascading.

Architecturally, Lingual combines the Cascading engine with my own Optiq framework. Optiq provides the SQL interface (including JDBC) and reads table and column definitions from Cascading's metadata store, and a few custom Optiq rules map relational operations (project, filter, join and so forth) onto a Cascading operator graph. The queries are executed, on top of Hadoop, using Cascading's existing runtime engine.

Not everyone has heard of Cascading, so let me explain what it is, and why I think it fits well with Optiq. Cascading is a Java API for defining data flows. You write a Java program to build data flows using constructs such as pipes, filters, and grouping operators, Cascading converts that data flow to a MapReduce job, and runs it on Hadoop. Cascading was established early, picked the right level of abstraction to be simple and useful, and has grown to industry strength as it matured.

As a result, companies who are doing really serious things with Hadoop often use Cascading. Some of the very smartest Hadoop users are easily smart enough to have built their own Hadoop query language, but they did something even smarter — they layered DSLs such as Scalding and Cascalog on top of Cascading. In a sense, Optiq-powered SQL is just another DSL for Cascading. I'm proud to be in such illustrious company.

Newbies always ask, "What is Hadoop?" and then a few moments later, "Is Hadoop a database?". (The answer to the second question is easy. Many people would love Hadoop to be an "open source Teradata", but wanting it doesn't make it so. No Virginia, Hadoop is not a database.)

A cottage industry has sprung up of bad analogies for Hadoop, so forgive me if I make another one: Hadoop is, in some sense, an operating system for the compute cluster. After mainframes, minicomputers, and PCs, the next generation of hardware is the compute cluster. Hadoop is the OS, and MapReduce is the assembly language for that hardware — all powerful, but difficult to write and debug. UNIX came about to serve the then-new minicomputers, and crucial to its success was the C programming language. C allowed developers to be productive while writing code almost as efficient as assembler, and it allowed UNIX to move beyond its original PDP-7 hardware.

Cascading is the C of the Hadoop ecosystem. Sparse, elegantly composable, low-level enough to get the job done, but it abstracts away the nasty stuff unless you really want to roll up your sleeves.

It makes a lot of sense to put SQL on top of Cascading. There has been a lot of buzz recently about SQL on Hadoop, but we're not getting caught up in the hype. We are not claiming that Lingual will give speed-of-thought response times (Hadoop isn't a database, remember?), nor that it will make advanced predictive analytics easy to write (Lingual is not magic). But Hadoop is really good at storing, processing, cleaning and exporting data at immense scale. Lingual brings that good stuff to a new audience.

A large part of that SQL-speaking audience is machines. I'd guess that 80% of the world's SQL statements are generated by tools. Machine-generated SQL is pretty dumb, so it is essential that you have an optimizer. (As author of a tool that speaks SQL — Mondrian — and several SQL engines — Broadbase, LucidDB, SQLstream — I have been on both sides of this problem.) Once you have an optimizer, you can start doing clever stuff like re-organizing your data to make the queries run faster. Maybe the optimizer will even help.

Lingual is not a "SQL-like language". Because it is based on Optiq, Lingual is a mature implementation of ANSI/ISO standard SQL. This is especially important for those SQL-generating tools, which cannot rephrase a query to work around a bug or missing feature. As part of our test suite, we ran Mondrian on PostgreSQL, and captured the SQL queries it issued and the results the database gave. Then we replayed those queries — over 6,200 of them — to Lingual and checked that Lingual gave the same results. (By the way, putting Optiq and Cascading together was surprisingly easy. The biggest challenge we had was removing the Postgres-isms from thousands of generated queries.)

Lingual is not the only thing I've been working on. (You can tell when I'm busy by the deafening silence on this blog.) I've also been working on Apache Drill, using Optiq to extend SQL for JSON-shaped data, and I'll blog about this shortly. Also, as Optiq is integrated with more data formats and engines, the number of possibilities increases. If you happen to be at Strata conference tomorrow (Wednesday), drop me a line on twitter and we can meet up and discuss. Probably in the bar.

Optiq latest


Optiq has been developing steadily over the past few months. Those of you who watch github will know most of this already, but I thought I'd summarize what's been going on.

(This post originally appeared as an email to the optiq-dev mailing list. Since I compose email messages a lot faster than blog posts, and the email message contained a lot of stuff that you'd all find interesting, it made sense to recycle it. Hope you don't mind.)

There are two exciting new projects using Optiq: Lingual (which I announced in my previous post) and Apache Drill, both of which come up again below.
This week I attended the Strata conference in Santa Clara, and met lots of people who are interested in Optiq for various reasons. There are at least 4 back-end platforms or front-end languages that people would like to see. I can't describe them all here, but watch this space. Some exciting stuff will play out in this forum over the next few months.

One of my personal favorite projects is to get Optiq running on compressed, in-memory tables managed by a JSR-107-compliant cache/data-grid such as ehCache or Infinispan. ArrayTable and CloneSchema are the beginnings of that project. The end result will be a high-performance, distributed, in-memory SQL database... how cool is that? (Certainly, my own Mondrian project will be able to make massive use of it.)

And, some people were asking for the Splunk adapter (the so-called "JDBC driver for Splunk") to be improved. Good to hear that it's proving useful.

Now regarding the code.

One person noted that "mvn clean install" should just work for any maven-based project, and it doesn't. He's right. It should. I fixed it. Now it does.

I made some breaking API changes last week, so I upped the version to 0.2.

Expect the version numbers to continue to move erratically, because in our current development mode, it doesn't seem to make sense to have specific milestones. We're basically working on general stability rather than a few big features. We are trying to maintain backwards compatibility, but if we need to change API, we'll do it. I'll help dependent projects such as Lingual and Drill migrate to the new API, and make it as easy as possible for the rest of you.

Over the last week I've been working on the code generation that powers SQL scalar expressions and built-in functions. This code generation is, obviously, used by the Java provider, but it can also be used by other providers. For instance, Lingual generates Java strings for filters that it passes to Janino. I've been working on OptiqSqlOperatorTest to get more and more of the built-in SQL functions to pass, and I've added OptiqAssert.planContains so that we can add tests to make sure that the minutiae of Java code generation are as efficient as possible.

I still need to tell you about the extensions I've been making to Optiq SQL to support Drill (but useful to any project that wants to use nested data or late-binding schemas), but that will have to wait for its own blog post. Watch this space.


Need help

I was amused by this note I just received via email.

Subject: Need help
To: jhyde@users.sourceforge.net
From: <retracted>

Respected Sir, we are doing a final year project  as Student Data warehouse for BE degree and we came to know to about olap4j at the end of our project  , we are presently in unknown way and we are seeking your help, since we are left with only 15 days for project submission, so if we could get any sample application which is built on olap4j ,will help us to understanding in usage of APIs for our project ,since i find it too difficult in usage of APIs and we are out of time , so any help from your side would greatly be appreciated and remembered
Thanks in advance
I get quite a few like these. (I suppose they are a fact of life for any open source developer.)

The spelling in this one is much better than most. And usually the subject line is more like 'Need help, please, please!!!!!'. But I always wonder how anyone who uses punctuation in such an arbitrary way could ever write code that works. Probably the author's supervisor is wondering the same thing.

Gathering requirements for olap4j 2.0

It's time to start thinking about olap4j version 2.0.

My initial goal for olap4j version 1.0 was to decouple application developers from Mondrian's legacy API. We've far surpassed that goal. Many applications are using olap4j to connect to OLAP servers like Microsoft SQL Server Analysis Services, Palo and SAP BW. And projects are leveraging the olap4j-xmlaserver sister project to provide an XMLA interface on their own OLAP server. The need is greater than ever to comply with the latest standards.

The difference between products and APIs is that you can't change APIs without pissing people off. Even if you improve the API, you force the developers of the drivers to implement the improvements, and the users of the API get upset because they don't have their new drivers yet. There are plenty of improvements to make to olap4j, so let's try to do it without pissing too many people off!

Since olap4j version 1.0, there has been a new release of Mondrian (well, 4.0 is not released officially yet, but the metamodel and API are very nearly fully baked) and a new release of SQL Server Analysis Services, the home of the de facto XMLA standard.

Also, the Mondrian team have spun out their XMLA server as a separate project (olap4j-xmlaserver) that can run against any olap4j driver. If this server is to implement the latest XMLA specification, it needs the underlying olap4j driver to give it all the metadata it needs.

Here's an example of the kind of issue that we'd like to fix. In olap4j 1.x, you can't tell whether a hierarchy is a parent-child hierarchy. People have asked for a method
boolean isParentChild();
Inspired by the STRUCTURE attribute of the MDSCHEMA_HIERARCHIES XMLA request, we instead propose to add
enum Structure {
  FULLYBALANCED,
  RAGGEDBALANCED,
  RAGGED,
  NETWORK
}
Structure getStructure();
We can't add this without requiring a new revision of all drivers, but let's be careful to gather all the requirements so we can do it just this once.

Here are my goals for olap4j 2.0:

  • Support Analysis Services 2012 metamodel and XMLA as of Analysis Services 2012.
  • Create an enum for each XMLA enum. (Structure, above, is an example.)
  • Support Mondrian 4.0 metamodel. Many of the new Mondrian features, such as measure groups and attributes, are already in SSAS and XMLA.
  • Allow user-specified metadata, such as those specified in Mondrian's schema as annotations, to be passed through the olap4j API and XMLA driver.
  • We'll know that we've done the right thing if we can remove MondrianOlap4jExtra.
I'd also like to maintain backwards compatibility. As I already said, drivers will need to be changed. But any application that worked against olap4j 1.1 should work against olap4j 2.0, and any driver for olap4j 2.0 should also function as an olap4j 1.x driver. That should simplify things for the users.

I'll be gathering a detailed list of API improvements in the olap4j 2.0 specification. If you have ideas for what should be in olap4j version 2.0, now is the time to get involved!

Efficient SQL queries on MongoDB

How do you integrate MongoDB with other data in your organization? MongoDB is great for building applications, and it has its own powerful query API, but it's difficult to mash up data between MongoDB and other tools, or to make tools that speak SQL, such as Pentaho Analysis (Mondrian), connect to MongoDB.

Building a SQL interface isn't easy, because MongoDB's data model is such a long way from SQL's model. Here are some of the challenges:
  • MongoDB doesn't have a schema. Each database has a number of named 'collections', which are the nearest thing to a SQL table, but each row in a collection can have a completely different set of columns.
  • In MongoDB, data can be nested. Each row consists of a number of fields, and each field can be a scalar value, null, a record, or an array of records.
  • MongoDB supports a number of relational operations, but doesn't use the same terminology as SQL: the find method supports the equivalent of SELECT and WHERE, while the aggregate method supports the equivalent of SELECT, WHERE, GROUP BY, HAVING and ORDER BY.
  • For efficiency, it's really important to push as much of the processing as possible down to MongoDB's query engine, without the user having to re-write their SQL.
  • But MongoDB doesn't support anything equivalent to JOIN.
  • MongoDB can't access external data.

I decided to tackle this using Optiq. Optiq already has a SQL parser and a powerful query optimizer that is powered by rewrite rules. Building on Optiq's core rules, I can add rules that map tables onto MongoDB collections, and relational operations onto MongoDB's find and aggregate operators.

What I produced is effectively a JDBC driver for MongoDB. Behind it is a hybrid query-processing engine that pushes as much of the query processing as possible down to MongoDB, and does whatever is left (such as joins) in the client.

Let's give it a try. First, install MongoDB, and import MongoDB's zipcode data set:
$ curl -o /tmp/zips.json http://media.mongodb.org/zips.json
$ mongoimport --db test --collection zips --file /tmp/zips.json
Tue Jun 4 16:24:14.190 check 9 29470
Tue Jun 4 16:24:14.469 imported 29470 objects
Log into MongoDB to check it's there:
$ mongo
MongoDB shell version: 2.4.3
connecting to: test
> db.zips.find().limit(3)
{"city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" : "AL", "_id" : "35004"}
{"city" : "ADAMSVILLE", "loc" : [ -86.959727, 33.588437 ], "pop" : 10616, "state" : "AL", "_id" : "35005"}
{"city" : "ADGER", "loc" : [ -87.167455, 33.434277 ], "pop" : 3205, "state" : "AL", "_id" : "35006"}
> exit
bye
Now let's see the same data via SQL. Download and install Optiq:
$ git clone https://github.com/julianhyde/optiq.git
$ mvn install
Optiq comes with a sample model in JSON format, and the sqlline SQL shell. Connect using the mongo-zips-model.json Optiq model, and use sqlline's !tables command to list the available tables.
$ ./sqlline
sqlline> !connect jdbc:optiq:model=target/test-classes/mongo-zips-model.json admin admin
Connecting to jdbc:optiq:model=target/test-classes/mongo-zips-model.json
Connected to: Optiq (version 0.4.2)
Driver: Optiq JDBC Driver (version 0.4.2)
Autocommit status: true
Transaction isolation: TRANSACTION_REPEATABLE_READ
sqlline> !tables
+------------+--------------+-----------------+---------------+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME      | TABLE_TYPE    |
+------------+--------------+-----------------+---------------+
| null       | mongo_raw    | zips            | TABLE         |
| null       | mongo_raw    | system.indexes  | TABLE         |
| null       | mongo        | ZIPS            | VIEW          |
| null       | metadata     | COLUMNS         | SYSTEM_TABLE  |
| null       | metadata     | TABLES          | SYSTEM_TABLE  |
+------------+--------------+-----------------+---------------+
Each collection in MongoDB appears here as a table. There are also the COLUMNS and TABLES system tables provided by Optiq, and a view called ZIPS defined in mongo-zips-model.json.

Let's try a simple query. How many zip codes in America?
sqlline> SELECT count(*) FROM zips;
+---------+
| EXPR$0  |
+---------+
| 29467   |
+---------+
1 row selected (0.746 seconds)
Now a more complex one. How many states have a city called Springfield?
sqlline> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';
+-----+
| C   |
+-----+
| 20  |
+-----+
1 row selected (0.549 seconds)
Let's use the SQL EXPLAIN command to see how the query is implemented.
sqlline> !set outputformat csv
sqlline> EXPLAIN PLAN FOR
. . . .> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';

'PLAN'
'EnumerableAggregateRel(group=[{}], C=[COUNT($0)])
  EnumerableAggregateRel(group=[{0}])
    EnumerableCalcRel(expr#0..4=[{inputs}], expr#5=['SPRINGFIELD'], expr#6=[=($t0, $t5)], STATE=[$t3], $condition=[$t6])
      MongoToEnumerableConverter
        MongoTableScan(table=[[mongo_raw, zips]], ops=[[<{city: 1, state: 1, _id: 1}, {$project ...}>]])
'
1 row selected (0.115 seconds)
The last line of the plan shows that Optiq calls MongoDB's find operator asking for the "city", "state" and "_id" fields. The first three lines of the plan show that the filter and aggregation are implemented using Optiq's built-in operators, but we're working on pushing them down to MongoDB.

Finally, quit sqlline.
sqlline> !quit
Closing: net.hydromatic.optiq.jdbc.FactoryJdbc41$OptiqConnectionJdbc41
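The same connection can also be made from plain Java via JDBC. Here is a minimal sketch (error handling omitted), reusing the Optiq driver class, the model file and the Springfield query from the session above:

Class.forName("net.hydromatic.optiq.jdbc.Driver");
Connection connection = DriverManager.getConnection(
    "jdbc:optiq:model=target/test-classes/mongo-zips-model.json");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD'");
while (resultSet.next()) {
  // prints "c=20", matching the sqlline result above
  System.out.println("c=" + resultSet.getInt(1));
}
resultSet.close();
connection.close();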

Optiq and its MongoDB adapter shown here are available on github. If you are interested in writing your own adapter, check out optiq-csv, a sample adapter for Optiq that makes CSV files appear as tables. It has its own tutorial on writing adapters.

Check back at this blog over the next few months, and I'll show how to write views and advanced queries using Optiq, and how to use Optiq's other adapters.

Improvements to Optiq's MongoDB adapter

It’s been a while since I posted to this blog, but I haven’t been idle. Quite the opposite; I’ve been so busy writing code that I haven’t had time to write blog posts. A few months ago I joined Hortonworks, and I’ve been improving Optiq on several fronts, including several releases, adding a cost-based optimizer to Hive and some other initiatives to make Hadoop faster and smarter.

More about those other initiatives shortly. But Optiq’s mission is to improve access to all data, so here I want to talk about improvements to how Optiq accesses data in MongoDB. Optiq can now translate SQL queries to extremely efficient operations inside MongoDB.

MongoDB 2.2 introduced the aggregation framework, which allows you to compose queries as pipelines of operations. They have basically implemented relational algebra, and we wanted to take advantage of this.

As the following table shows, most of those operations map onto Optiq’s relational operators. We can exploit that fact to push SQL query logic down into MongoDB.

MongoDB operator    Optiq operator
$project            ProjectRel
$match              FilterRel
$limit              SortRel.limit
$skip               SortRel.offset
$unwind             -
$group              AggregateRel
$sort               SortRel
$geoNear            -

In the previous iteration of Optiq’s MongoDB adapter, we could push down project, filter and sort operators as $project, $match and $sort. A bug report pointed out that it would be more efficient if we evaluated $match before $project. As I fixed that bug yesterday, I decided to push down limit and offset operations too. (In Optiq, these are just attributes of a SortRel; a SortRel sorting on 0 columns can be created if you wish to apply limit or offset without sorting.)

That went well, so I decided to go for the prize: pushing down aggregations. This is a big performance win, because the output of a GROUP BY query is often a lot smaller than its input. It is much more efficient for MongoDB to aggregate the data in memory and return a small result than to return a large amount of raw data to be aggregated by Optiq.

Now queries involving SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, OFFSET, FETCH (or LIMIT if you prefer the PostgreSQL-style syntax), not to mention sub-queries, can be evaluated in MongoDB. (JOIN, UNION, INTERSECT, MINUS cannot be pushed down because MongoDB does not support those relational operators; Optiq will still evaluate those queries, pushing down as much as it can.)

Let's see some examples of push-down in action.

Given the query:
SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state
Optiq evaluates:
db.zips.aggregate(
   {$project: {STATE: '$state'}},
   {$group: {_id: '$STATE', C: {$sum: 1}}},
   {$project: {STATE: '$_id', C: '$C'}})
and returns
STATE=WV; C=659
STATE=WA; C=484
...
Now let’s add a HAVING clause to find out which states have more than 1,500 zip codes:
SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state
HAVING COUNT(*) > 1500
Optiq adds a $match operator to the previous query's pipeline:
db.zips.aggregate(
   {$project: {STATE: '$state'}},
   {$group: {_id: '$STATE', C: {$sum: 1}}},
   {$project: {STATE: '$_id', C: '$C'}},
   {$match: {C: {$gt: 1500}}})
and returns
STATE=NY; C=1596
STATE=TX; C=1676
STATE=CA; C=1523
Now the pièce de résistance. The following query finds the top 5 states in terms of number of cities (and remember that each city can have many zip-codes).
SELECT state, COUNT(DISTINCT city) AS cdc
FROM zips
GROUP BY state
ORDER BY cdc DESC
LIMIT 5
COUNT(DISTINCT x) is difficult to implement because it requires the data to be aggregated twice — once to compute the set of distinct values, and once to count them within each group. For this reason, MongoDB doesn’t implement distinct aggregations. But Optiq translates the query into a pipeline with two $group operators. For good measure, we throw in ORDER BY and LIMIT clauses.

The result is an awe-inspiring pipeline that includes two $group operators (implementing the two phases of aggregation for distinct-count), and finishes with $sort and $limit.

db.zips.aggregate(
  {$project: {STATE: '$state', CITY: '$city'}},
  {$group: {_id: {STATE: '$STATE', CITY: '$CITY'}}},
  {$project: {_id: 0, STATE: '$_id.STATE', CITY: '$_id.CITY'}},
  {$group: {_id: '$STATE', CDC: {$sum: {$cond: [ {$eq: ['CITY', null]}, 0, 1]}}}},
  {$project: {STATE: '$_id', CDC: '$CDC'}},
  {$sort: {CDC: -1}}, {$limit: 5})
I had to jump through some hoops to get this far, because MongoDB’s expression language can be baroque. In one case I had to generate
{$ifNull: [null, 0]}
in order to include the constant 0 in a $project operator. And I was foiled by MongoDB bug SERVER-4589 when trying to access the values inside the zips table's loc column, which contains (latitude, longitude) pairs represented as an array.

In conclusion, Optiq on MongoDB now does a lot of really smart stuff. It can evaluate any SQL query, and push down a lot of that evaluation to be executed efficiently inside MongoDB.

I encourage you to download Optiq and try running some sophisticated SQL queries (including those generated by the OLAP engine I authored, Mondrian).

Table macros

Table macros are a new Optiq feature (since release 0.6) that combine the efficiency of tables with the flexibility of functions.

Optiq offers a convenient model for presenting data from multiple external sources via a single, efficient SQL interface. Using adapters, you create a schema for each external source, and a table for each data set within a source.

But sometimes the external data source does not consist of a fixed number of data sets, known ahead of time. Consider, for example, Optiq’s web adapter, optiq-web, which makes any HTML table in any web page appear as a SQL table. Today you can create an Optiq model and define within it several tables.

Optiq-web’s home page shows an example where you can create a schema with tables “Cities” and “States” (based on the Wikipedia pages List of states and territories of the United States and List of United States cities by population) and execute a query to find out the proportion of California’s population that lives in cities:

SELECT COUNT(*) "City Count",
  SUM(100 * c."Population" / s."Population") "Pct State Population"
FROM "Cities" c, "States" s
WHERE c."State" = s."State" AND s."State" = 'California';
But what if you want to query a URL that isn’t in the schema? A table macro will allow you to do this:
SELECT * FROM TABLE(
  web('http://en.wikipedia.org/wiki/List_of_countries_by_population'));
web is a function that returns a table. That is, a Table object, which is the definition of a table. In Optiq, a table definition doesn’t need to be assigned a name and put inside a schema, although most do; this is a free-floating table. A table just needs to be able to describe its columns, and to be able to convert itself to relational algebra. Optiq invokes it while the query is being planned.

Here is the WebTableMacro class:

public class WebTableMacro {
  public Table eval(String url) {
    Map<String, Object> operands = new HashMap<String, Object>();
    operands.put("url", url);
    return new WebTable(operands, null);
  }
}
And here is how you define a WEB function based upon it in your JSON model:
{
  version: '1.0',
  defaultSchema: 'ADHOC',
  schemas: [
    {
      name: 'ADHOC',
      functions: [
        {
          name: 'WEB',
          className: 'com.example.WebTableMacro'
        }
      ]
    }
  ]
}
Table macros are a special kind of table function. They are defined in the same way in the model, and invoked in the same way from a SQL statement. A table function can be used at prepare time if (a) its arguments are constants, and (b) the table it returns implements TranslatableTable. If it fails either of those tests, it will be invoked at runtime; it will still produce results, but will have missed out on the advantages of being part of the query optimization process.

What kind of advantages can the optimization process bring? Suppose a web page that produces a table supports URL parameters to filter on a particular column and sort on another. We could write planner rules that take a FilterRel or SortRel on top of a WebTableScan and convert them into a scan with extra URL parameters. A table that came from the web function would be able to participate in that process.

The name ‘table macros’ is inspired by Lisp macros — functions that are invoked at compile time rather than run time. Macros are an extremely powerful feature in Lisp and I hope they will prove to be a powerful addition to SQL. But to SQL users, a more familiar name might be ‘parameterized views’.

Views and table macros are both expanded to relational algebra before the query is optimized. Views are specified in SQL, whereas table macros invoke user code (it takes some logic to handle those parameters). Under the covers, Optiq’s views are implemented using table macros. (They always have been — we’ve only just got around to making table macros a public feature.)

To sum up. Table macros are a powerful new Optiq feature that extends the reach of Optiq to data sources that have not been pre-configured into an Optiq model. They are a generalization of SQL views, and share with views the efficiency of expanding relational expressions at query compilation time, where they can be optimized. Table macros will help bring a SQL interface to yet more forms of data.
