Open Source Business Intelligence Market Overview
September 1, 2009
Contributed by Steve Holub
|
| |
The following survey provides a list of open source software
tools used in Business Intelligence and Data Warehousing
systems. The tool selection criteria was based on the frequency
and currency of the releases (solutions with updates within two
years or less only are included) and on whether the product has
released a stable build which could be used in a production
environment. This study looked at BI tools in the following
categories: Databases, ETL tools (Extract, Transform, Load),
Master Data Management, BI Reporting tools, and Data Mining.
There are some open source software bundles that overlap
categories;. In this case, we divided the software bundle into
its separate parts for ease of categorization.
We define 'fully open source' to mean software solutions
that offer all of their source code to the public under an open
source license. This means that any software which offers
additional features or functionality (typically with a support
subscription), and does not offer the source code freely to the
public, was not considered fully open source. As well, any
features mentioned in the tables below refer to the open source
version of the software. Organizations that do not provide
commercial support for their product, but do point their
customers to third-party support solutions receive a
"Y" with a notation (*), to indicate that
commercial support is available. Our analysis was based on a
review of the OSS tools market as of July, 2009.
Database Management Systems (DBMS)
There are a wide variety of database management systems
(DBMS) available for data warehouse solution. Standard
relational databases, such as MySQL and PostgreSQL, have
dominated the open source database market for some time. More
recently, new open source database systems have emerged that
target the unique requirements of data warehousing.
Column-oriented databases (such as Infobright, LucidDB, and
MonetDB) increase data reading performance by storing data in
columns (rather than rows) which is important for a data
warehouse where read-optimization takes precedence over
write-optimization, and where typical queries look at the
attributes of a column for a set of records rather than the
components of a row.
There have been several developments in the area of
distributed-computing (i.e. the 'cloud'). The distributed
database solutions presented below are built on top of the open
source Java framework for distributed systems: Hadoop.
Hadoop-based databases are best suited for data warehouses
which number in the hundreds of millions of rows (or greater).
These distributed data warehouse systems have typically been
used for web analytics of high-traffic websites. Finally,
although there are several open source embeddable database
solutions, none of them are scalable to the extent required by
most data warehouse systems, and for that reason, were excluded
from this study.
DBMS Software,
Website,
Company/Sponsor |
DB Type |
Fully Open Source? |
Commercial Support? |
Latest Version Release |
Cloudbase cloudbase.sourceforge.net Business.com |
MapReduce, distributed |
 |
|
June 2009 |
Firebird www.firebirdsql.org Firebird Foundation/Borland |
Relational |
 |
* |
April 2009 |
HBase hadoop.apache.org/hbase
Apache |
Column-oriented, distributed |
 |
|
May 2009 |
Hive hadoop.apache.org/hive Apache |
MapReduce, distributed |
 |
* |
April 2009 |
Hypertable www.hypertable.org Doug Judd/Zvents |
Column-oriented, distributed |
 |
|
May 2009 |
Infobright www.infobright.com Infobright |
Column-oriented |
|
 |
June 2009 |
Ingres www.ingres.com Ingres |
Relational |
|
 |
November 2008 |
LucidDB www.luciddb.org LucidEra/Eigenbase |
Column-oriented |
 |
|
May 2009 |
MonetDB monetdb.cwi.nl Centrum Wiskunde & Informatica |
Column-oriented |
 |
 |
July 2009 |
MySQL www.mysql.com Sun/Oracle |
Relational |
|
 |
April 2009 |
Palo OLAP Server www.jedox.com Jedox |
Multidimensional |
|
 |
March 2009 |
PostgreSQL www.postgresql.org PostgreSQL |
Relational |
 |
* |
June 2009 |
Extract, Transform and Load (ETL) Tools
Data Extraction, Transformation and Loading (ETL) is the
most time consuming component of the data warehouse development
lifecycle. Typically for this reason alone, it is important to
have a good tool which allows the user to interact with as many
different source systems as possible, while maintaining an
acceptable level of usability and performance and providing
metadata regarding the transformations undertaken against the
data. There are several open source ETL tools available today
that provide good performance, good error handling, and
metadata management.
ETL Software,
Website,
Company/Sponsor |
Fully Open Source? |
Commercial Support? |
Latest Version Release |
Apatar www.apatar.com Apatar |
 |
 |
April 2009 |
Clover.ETL www.cloveretl.com OpenSYS |
|
 |
July 2009 |
KETL www.ketl.org Kinetic Networks |
 |
* |
October 2008 |
Open Data Integrator Project open-dm-di.dev.java.net Sun/Oracle |
 |
|
September 2008 |
Palo ETL Server www.jedox.com Jedox |
|
 |
December 2008 |
Pentaho Data Integration (Kettle) kettle.pentaho.org Pentaho |
|
 |
May 2009 |
Talend Open Studio www.talend.com Talend |
|
 |
May 2009 |
Master Data Management
Business Intelligence environment are often sourced from
many different source systems - from spreadsheets to ERP
applications to 3rd party data - which may have the same key
dimension data (eg. Customer, Product, Salesrep) repeated in
more than one source. Master data management (MDM) tools are
used to identify duplicate records and consolidate them into
one unified record which is then stored in the data
warehouse.
The same customer could be viewed as a potential lead by the
marketing department, while the billing department is treating
her as an existing client. Potentially, this customer could be
receiving both bills for a service and marketing material to
encourage her to purchase the very same service which she is
already consuming. Data warehouse systems consolidate these two
customer records into one through the use of MDM tools ensuring
that the customer information and status is consistent across
the business system.
MDM Software, Website, Company/Sponsor |
Fully Open Source? |
Commercial Support? |
Latest Version Release |
DataCleaner datacleaner.eobjects.org Eobjects |
 |
* |
April 2009 |
SQL Power DQguru www.sqlpower.ca/dqguru SQL Power Group |
 |
 |
April 2009 |
Mural mural.dev.java.net Sun/Oracle |
 |
|
September 2008 |
Open Source Data Quality and Profiling project sourceforge.net/projects/dataquality
Arrah |
 |
|
July 2008 |
Business Intelligence Reporting & Analytics Tools
Business intelligence (BI) reporting & analytics tools
are what business users typically use to access their data
warehouse data, so it is imperative that the tool be intuitive
and user-friendly. There are a variety of open source solutions
which deliver a variety of features that target different
groups of users (e.g. report developers, decision makers,
non-specialized users, etc.), so it important to explore as
many tools as possible before deciding on which software
package is right for your organization. In the table below, we
focus on four common features found in BI reporting software:
standard reporting, ad hoc reporting, OLAP reporting, and
dashboards.
Standard reports are those commonly found in business
which can address a variety of different business reporting
requirements (e.g. monthly sales by store by product).
Ad-hoc reporting allows non-technical business users
to develop 'on-the-fly' reports without having any knowledge of
SQL or the underlying database structure.
OLAP reporting refers to online analytical processing
which is a technique specifically designed to improve
performance of business intelligence queries by pre-aggregating
data (and other common analytical tasks) and by modeling the
data multi-dimensionally. The benefit of having OLAP
functionality is the ease with which users can drill-down and
roll-up into different levels of data hierarchies (e.g.
country, to province, to region, to store).
Dashboards are user-defined visual summaries of key
performance indicators for a business which allows the user to
interact with the charts and graphs (e.g. drilling-down into
specific business areas).
Reporting & Analytics Software,
Website,
Company/Sponsor |
Standard Reporting |
Ad-hoc Reporting |
OLAP |
Dash- boards |
Fully Open Source? |
Commercial Support? |
Latest Version Release |
BIRT www.eclipse.org/birt Eclipse Foundation/ Actuate |
 |
 |
 |
|
|
 |
June 2009 |
DataVision datavision.sourceforge.net Jim Menard |
 |
 |
|
|
 |
|
July 2008 |
Jaspersoft BI Suite www.jasperforge.org Jaspersoft |
 |
 |
 |
|
|
 |
June 2009 |
OpenI www.openi.org OpenI |
 |
 |
 |
 |
 |
 |
June 2009 |
OpenReports www.oreports.com Open Source Software Solutions |
 |
 |
 |
|
|
 |
May 2009 |
OpenRPT www.xtuple.com/openrpt xTuple |
 |
 |
|
|
|
 |
January 2009 |
Palo Worksheet www.jedox.com Jedox |
 |
 |
 |
 |
|
 |
March 2009 |
Pentaho BI Platform community.pentaho.com Pentaho |
 |
 |
 |
 |
|
 |
May 2009 |
SQL Power Wabit www.sqlpower.ca/wabit SQL Power Group |
 |
 |
 |
 |
 |
 |
July 2009 |
Data Mining Tools
Data mining techniques are used in business intelligence
systems to determine patterns and relationships amongst data
attributes and potential outcomes. Data warehouses are the most
common source of data for mining business information because
they are typically the most comprehensive and extensive data
source available within the company, and are well suited for
time-series analysis. Data mining has the potential of
discovering correlations that are hidden within the sea of
data. For example, by using data mining techniques, a company
with sufficient customer data could determine which contacts
(based on their attributes, such as age, location, marital
status, etc.) are more likely to respond to a direct mail
marketing campaign, which in turn could help the company reduce
costs by targeting only those customers who are the most likely
to respond to the advertisement.
Reporting & Analytics Software, Website, Company/Sponsor |
Fully Open Source? |
Commercial Support? |
Latest Version Release |
Knime www.knime.org University of Konstanz |
 |
 |
May 2009 |
Pentaho Data Mining (Weka) weka.pentaho.org Pentaho |
|
 |
June 2009 |
R Project www.r-project.org R Development Core Team |
 |
* |
June 2009 |
RapidMiner (YALE) www.rapidminer.com Rapid-i |
|
 |
March 2009 |