Thursday, September 29, 2016

Hadoop Format Speed Tests: Parquet, ORC, with and without compression

For Hadoop/HDFS, which format is faster?

ORC vs RCfile


According to a posting on the Hortonworks site, both the compression and the performance of ORC files are vastly superior to plain text Hive tables and RCFile tables. For compression, ORC files are listed as 78% smaller than plain text files. For performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, ORC files will keep gaining popularity with Hive.  (You can read the posting here: ORC File in HDP 2: Better Compression, Better Performance.)
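
For readers new to Hive, converting an existing text-format table to ORC is a one-time operation. Below is a minimal HiveQL sketch; the table and column names are ours, not from the Hortonworks posting:

    -- Hypothetical table and column names; ZLIB is the default ORC codec.
    CREATE TABLE sales_orc (
        sale_id    INT,
        sale_date  STRING,
        amount     DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "ZLIB");

    -- Populate it from the existing plain text table.
    INSERT INTO TABLE sales_orc SELECT * FROM sales_text;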


Parquet vs ORC


On Stack Overflow, contributor Rahul posted an extensive set of results comparing ORC and Parquet with different compression options.  You can find the full results here:  http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy.

Below are the results that were posted by Rahul:

       
  • Table A - Text File Format - 2.5 GB
  • Table B - ORC - 652 MB
  • Table C - ORC with Snappy - 802 MB
  • Table D - Parquet - 1.9 GB

Parquet was worst as far as compression for my table is concerned.

My tests with the above tables yielded following results.

Row count operation

  • Text Format Cumulative CPU - 123.33 sec
  • Parquet Format Cumulative CPU - 204.92 sec
  • ORC Format Cumulative CPU - 119.99 sec
  • ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

  • Text Format Cumulative CPU - 127.85 sec
  • Parquet Format Cumulative CPU - 255.2 sec
  • ORC Format Cumulative CPU - 120.48 sec
  • ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

  • Text Format Cumulative CPU - 128.79 sec
  • Parquet Format Cumulative CPU - 211.73 sec
  • ORC Format Cumulative CPU - 165.5 sec
  • ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause

  • Text Format Cumulative CPU - 72.48 sec
  • Parquet Format Cumulative CPU - 136.4 sec
  • ORC Format Cumulative CPU - 96.63 sec
  • ORC with SNAPPY Cumulative CPU - 82.05 sec
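
To put those operations in context, the queries behind them presumably looked something like the HiveQL below. These are our guesses at the shape of each test; the table and column names are hypothetical:

    -- Row count
    SELECT COUNT(*) FROM table_a;

    -- Sum of a column
    SELECT SUM(amount) FROM table_a;

    -- Average of a column
    SELECT AVG(amount) FROM table_a;

    -- Selecting 4 columns from a given range using a where clause
    SELECT col1, col2, col3, col4
    FROM table_a
    WHERE sale_date BETWEEN '2016-01-01' AND '2016-03-31';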

       
 

Friday, September 23, 2016

Columnstore Indexes - a reference

With the release of SQL Server 2016, which included a more complete codebase for columnstore indexes, I recently discovered a listing of 86 blog postings by Niko Neugebauer.  It is perhaps all you might want to know about columnstore indexes.

You can find it here: Columnstore Indexes Blog by Niko Neugebauer

Monday, September 5, 2016

Use a web browser to verify the Hadoop IPC port is accessible

A quick, simple way to verify that your Hadoop / HDFS / IPC port is accessible is to use a web browser.  When setting up PolyBase, you'll need to create an external data source, and this is where you point to the Hadoop IPC port.  PolyBase is not known for its robust error messages, so just because you created a data source successfully does not mean it will work.

Let's say your Hadoop HDFS port has this address:


                   hdfs://192.168.1.120:8020

Simply replace hdfs with http and try to connect with a browser. Like this:

                   http://192.168.1.120:8020/

You should get the message:

       It looks like you are making an HTTP request to a Hadoop 
       IPC port. This is not the correct port for the web 
       interface on this daemon.

If you don't get this message, you'll need to work out why your Hadoop system is not responding.



Command(s) completed successfully.


PolyBase does not validate your Hadoop location when you create a new external data source. For example, we created a new, but invalid, data source with the following command, using a totally invalid hdfs location value:

               CREATE EXTERNAL DATA SOURCE [invalid] 
               WITH ( 
                TYPE = HADOOP
                , LOCATION = N'hdfs://192.168.1.130:8020'
                , RESOURCE_MANAGER_LOCATION = N'192.168.1.130:8050'
                )
               GO

PolyBase simply returned this value:  Command(s) completed successfully. The point is that just because you can create a Hadoop external data source does not mean that it is valid.
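
The invalid location only surfaces later, when PolyBase actually has to reach the Hadoop cluster. A quick way to force that check is to build an external table over the data source and query it. The sketch below is hypothetical - the file format name, path, and columns are placeholders:

               CREATE EXTERNAL FILE FORMAT [TextDelimited]
               WITH (
                 FORMAT_TYPE = DELIMITEDTEXT
                 , FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
                )
               GO

               CREATE EXTERNAL TABLE dbo.SampleExternal (
                 id INT
                 , name VARCHAR(50)
                )
               WITH (
                 LOCATION = '/user/sample/'
                 , DATA_SOURCE = [invalid]
                 , FILE_FORMAT = [TextDelimited]
                )
               GO

               -- This SELECT is what finally reaches out to the hdfs location
               -- and will fail if the address is wrong.
               SELECT TOP 10 * FROM dbo.SampleExternal
               GO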

Tuesday, August 30, 2016

Free Power BI visuals

OKViz is out to enhance and extend custom visuals and tools for Power BI, and they are offering them free of charge under the listed MIT License. When we last checked, they had six (6) custom visuals listed:

  • synoptic panel
  • smart filter
  • sparkline
  • bullet chart
  • card with states
  • candlestick

If you are using Power BI, be sure to take a look. And give a thumbs up to the sponsor of OKViz - SQLBI (http://www.sqlbi.com/). You can find both the visuals and the tools here:  http://okviz.com/.

ROLAP - now a viable option

The time for ROLAP has arrived:


Most operational data stores (ODS) and data warehouses (DW) reside on a rowstore relational database. It's a great tool for organizing data, but rowstores and their related indexes don't mesh with the demanding needs of analytics, business intelligence, reporting and self-service cubes. So it was not uncommon to push selected data down from the data warehouse into customized MOLAP cubes. MOLAP cubes offered business analysts great self-service data mobility, usually through a connected Excel pivot table or more expensive tools such as Tableau.

Ideal for the analyst, but pushing data down from a data warehouse or operational data store to a MOLAP cube meant lots of added work, added risk and, most importantly, a loss of data timeliness. Once the data warehouse or ODS was updated, yet another process had to be run to update and process the MOLAP cube. Once-a-day updates were not uncommon for MOLAP cubes. Fast data it was not! And if you have reviewed the literature on large MOLAP cubes, you know that they can get unwieldy.

MDX was another roadblock. Using a BI tool such as Excel, analysts were shielded from the sometimes complex MDX commands. But problems arose when analysts wanted to run custom queries against the MOLAP cube - queries that would be far easier to write against a traditional data warehouse using SQL.

Columnar databases to the rescue.


With its release of SQL Server 2016, Microsoft has a full featured relational database that can be run as a columnar database. By simply upgrading to SQL Server 2016, and converting your existing rowstore indexes into a columnstore, your data warehouse or ODS can now support a ROLAP cube. Efficient and fast. And no more MOLAP processing.  Here at Realized Design, we have done some limited testing using ROLAP, and have been pleased with the results. For the relational backend, we have used both SQL Server 2016 and the specialized analytic/columnar database EXASOL. In both cases, ROLAP proved comparable to a MOLAP design. But without all the extra processing. And thus far more elegant. You can find a detailed analysis of our work so far with EXASOL here: EXASOL review at RealizedDesign.
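
The conversion itself is a single statement per table. Here is a minimal T-SQL sketch; dbo.FactSales is a hypothetical fact table name:

        -- If the table already has a rowstore clustered index, drop it first
        -- (or use the DROP_EXISTING option on the CREATE statement).
        CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
            ON dbo.FactSales;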

But don't just take our word for it. Here are two additional links where individuals have successfully used ROLAP cubes against very large databases, and have been pleased with the results.

In 2014, Karen Gulati did exploratory ROLAP work using SQL Server's new columnstore indexes. See: Harnessing the Power of both worlds.

Also in 2014, Hilmar Buchta did some work with SSAS ROLAP against a Microsoft Parallel Data Warehouse. See: Parallel Data Warehouse (PDW) and ROLAP - Hilmar Buchta.


So, just when you thought SSAS multidimensional was dead, it's back!

Next Steps:  Build an SSAS ROLAP Cube using EXASOL

Monday, August 29, 2016

PolyBase Error ...app.MRAppMaster not found

On the Hortonworks community site, we noticed an unusual error relating to the PolyBase pushdown capability:


Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
 

We have not seen this error, but wanted to include it in our listing of identified PolyBase errors. The user later noted that their solution required a change to the mapred-site.xml file, as follows:

       
 <property>
   <name>mapreduce.app-submission.cross-platform</name>
   <value>True</value>
 </property>
       

You can find the full user thread here:  Yarn ClassPath Value for Polybase Pushdown

Wednesday, August 17, 2016

EnableRolapDistinctCountOnDataSource

We are digging into SSAS ROLAP, and based on our research, the EnableRolapDistinctCountOnDataSource setting, which lets the relational database engine handle distinct counts, should be updated to 1. To adjust this setting, you have to modify the msmdsrv.ini file and then restart the SSAS service.
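
Our understanding is that with the setting at 1, SSAS can push a distinct count down to the relational engine as a query along these lines, rather than pulling the detail rows and counting them itself. The table and column names below are hypothetical:

       SELECT COUNT(DISTINCT CustomerKey)
       FROM dbo.FactSales;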

Why we made the change


Our research found two key postings, both from respected SQL Server professionals:

In 2014, Karen Gulati did exploratory ROLAP work using SQL Server's new columnstore indexes. One key finding was to update the EnableRolapDistinctCountOnDataSource setting to 1. See: Harnessing the Power of both worlds.

Also in 2014, Hilmar Buchta did some work with SSAS ROLAP against a Microsoft Parallel Data Warehouse and came to the same conclusion about the setting.
See:  Parallel Data Warehouse (PDW) and ROLAP - Hilmar Buchta.

Summary

  1. Locate msmdsrv.ini
  2. Make a backup of the original, unaltered ini file. (backups are our friend)
  3. Locate <EnableRolapDistinctCountOnDataSource>
  4. Change the setting to 1
  5. Save the file and restart SSAS

Details

msmdsrv.ini

Locate msmdsrv.ini and make a backup. Typically it is located in the folder: 
C:\Program Files\Microsoft SQL Server\MSASxx.MSSQLSERVER\OLAP\Config
(where xx is the version)

Make the Change

Locate the entry for EnableRolapDistinctCountOnDataSource, make the change and save the file.
With the change made, restart the SSAS service.

Before
       
    <EnableRolapDistinctCountOnDataSource>0</EnableRolapDistinctCountOnDataSource>

After
       
   <EnableRolapDistinctCountOnDataSource>1</EnableRolapDistinctCountOnDataSource>

FIX: SSAS 2012 to back-end SQL Server 2014

We are not using this configuration ourselves, but since we stumbled upon it, we wanted to make a note of it in case any readers are running SSAS 2012 against a back-end SQL Server 2014.