20131207

DV footprints on Disk and in Memory, Part 2

My previous blogpost, comparing the footprints of the DV Leaders (Tableau 8.1, Qlikview 11.2, Spotfire 6) on disk (in terms of the size of an application file with an embedded dataset of 1 million rows) and in Memory (calculated as the RAM difference between a freshly loaded application without data and the same application after it loads the appropriate application file - XLSX, DXP, QVW or TWBX), got a lot of feedback from DV Blog visitors. It even got a mention/reference/quote in Tableau Weekly #9 here:
http://us7.campaign-archive1.com/?u=f3dd94f15b41de877be6b0d4b&id=26fd537d2d&e=5943cb836b and the full list of Tableau Weekly issues is here: http://us7.campaign-archive1.com/home/?u=f3dd94f15b41de877be6b0d4b&id=d23712a896
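
As a reminder of the methodology: the in-memory footprint is simply the difference between the tool's RAM usage right after a clean start and its RAM usage once the application file is loaded. Below is a minimal sketch of how that measurement could be automated - my own illustration using the psutil Python library, not the exact procedure used for these benchmarks (the same numbers can be read off Task Manager); the process name is an assumption for the Tableau case.

```python
# Sketch of the RAM-difference measurement: record the resident memory of the
# tool right after a clean start (no data loaded), then again after the
# application file is loaded, and report the delta.
import psutil

def ram_mb(process_name):
    """Total resident memory (MB) of all processes matching the given name."""
    total = 0
    for p in psutil.process_iter(["name", "memory_info"]):
        if (p.info["name"] or "").lower() == process_name.lower():
            total += p.info["memory_info"].rss
    return total / (1024 * 1024)

baseline = ram_mb("tableau.exe")   # freshly started, no workbook loaded
input("Open the .TWBX / .QVW / .DXP file, wait for it to render, then press Enter...")
loaded = ram_mb("tableau.exe")     # same tool with the application file loaded
print(f"In-memory footprint of the loaded file: {loaded - baseline:.1f} MB")
```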

The majority of the feedback asked me to do a similar Benchmark - a footprint comparison for a larger dataset, say with 10 million rows. I did that, but it required more time and work, because the footprint in memory for all 3 DV Leaders depends on the number of visualized Datapoints (Spotfire has for years used the term Marks for Visible Datapoints, and Tableau adopted this terminology too, so I use it from time to time as well, but I think the more accurate term here is "Visible Datapoints").


Basically I used the same dataset as in the previous blogpost, with the main difference that I took a subset with 10 Million rows as opposed to the 1 Million rows in the previous Benchmarks. The diversity of the 10-Million-row Dataset is shown here (each row has 15 fields, as in the previous benchmark):
[googleapps domain="docs" dir="spreadsheet/pub" query="key=0AuP4OpeAlZ3PdGFyUUl6VmdSWWVubk5sbjZ3Z256Znc&single=true&gid=3&output=html&widget=true" width="250" height="350" /]


I removed Excel 2013 from the 10-million-row benchmarks (Excel cannot handle more than 1,048,576 rows per worksheet), as well as PowerPivot 2013 (it is less relevant for this Benchmark). Here are the DV Footprints on disk and in Memory for the Dataset with 10 Million rows and different numbers of Datapoints (or Marks: <16, 1000, around 10000, around 100000, around 800000):
[googleapps domain="docs" dir="spreadsheet/pub" query="key=0AuP4OpeAlZ3PdGFyUUl6VmdSWWVubk5sbjZ3Z256Znc&single=true&gid=4&output=html&widget=true" width="480" height="400" /]


The main observations and notes from benchmarking the footprints with 10 million rows are as follows:




  • Tableau 8.1 requires less disk space (almost half) for its application file (.TWBX) than Qlikview 11.2 does for its application file (.QVW) or Spotfire 6 for its application file (.DXP).

  • Tableau 8.1 uses RAM much more intelligently than Qlikview 11.2 and Spotfire 6, because it takes advantage of the number of Marks. For example, for 10000 Visible Datapoints Tableau uses 13 times less RAM than Qlikview and Spotfire, and for 100000 Visible Datapoints Tableau uses 8 times less RAM than Qlikview and Spotfire!

  • The usage of more than, say, 5000 Visible Datapoints (or even more than a few hundred Marks) in a particular Chart or Dashboard is often a sign of bad design or of a poor understanding of the task at hand; the human eye (of the end user) cannot comprehend too many Marks anyway, so what Tableau does (reducing its footprint in Memory when fewer Marks are used) is good design.

  • For Tableau, in the results above I reported the total RAM used by the 2 Tableau processes in memory: TABLEAU.EXE itself and the supplemental process TDSERVER64.EXE (this 2nd, 64-bit process almost always uses about 21MB of RAM); see the measurement sketch after this list. Note: Russell Christopher also suggested monitoring TABPROTOSRV.EXE, but I could not find any trace of it or of its RAM usage during the benchmarks.

  • Qlikview 11.2 and Spotfire 6 have similar footprints in Memory and on Disk.
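
Since the RAM numbers above are per-process, here is a small companion sketch (again my own, assuming the psutil Python library and the Windows process names) that prints the resident memory of the Tableau-related processes mentioned in the bullets: TABLEAU.EXE, the supplemental TDSERVER64.EXE and TABPROTOSRV.EXE.

```python
# Sketch: per-process and total resident memory for the Tableau-related
# processes named above. TABPROTOSRV.EXE may simply not be running when a
# packaged workbook with embedded data is used, which would explain why it
# left no traces during these benchmarks.
import psutil

TABLEAU_PROCESSES = {"tableau.exe", "tdserver64.exe", "tabprotosrv.exe"}

def tableau_ram_report():
    """Print resident memory (MB) per Tableau-related process and the total."""
    per_process = {}
    for p in psutil.process_iter(["name", "memory_info"]):
        name = (p.info["name"] or "").lower()
        if name in TABLEAU_PROCESSES:
            per_process[name] = per_process.get(name, 0) + p.info["memory_info"].rss
    for name, rss in sorted(per_process.items()):
        print(f"{name:18s} {rss / 2**20:8.1f} MB")
    print(f"{'total':18s} {sum(per_process.values()) / 2**20:8.1f} MB")

tableau_ram_report()
```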



4 comments:

  1. Hi Andre - You missed one process which is applicable to the Tableau scenario: tabprotosrv.exe. This process actually loads the database driver necessary to connect to a data source and executes queries on behalf of Tableau.exe. It is there to make sure that a "bad database driver" from a vendor doesn't crash Tableau.exe itself.

  2. Why is "PowerPivot 2013 less relevant for given benchmark"?

  3. This all looks great on paper, but what is missing is the most important aspect - true performance. With the cost of disk and memory nowadays being extremely low, saving a bit of space on disk is borderline irrelevant unless we're talking huge magnitudes of difference, especially considering the systems we're using are not bringing all of their data into analysis. I've used all the vendors mentioned, and where we are at in our evaluations, Spotfire and Qlik can handle situations where we need to find that needle in the haystack, sometimes using 100 billion rows. Yes, we are really using that much data in the CPG space, and so are the competitors I recently left. Tableau was ready to shut down and could barely do anything at that point. Spotfire and Qlik were able to handle this. We benchmarked this on the exact same (huge) machine, with the same data, with everything performance tested out. In the end, as a decision maker, I care about performance and how the systems scale. Tableau did not. This was with their pre-sales folks helping. Tableau did not want to have the meetings taken to I.T. so we could really get under the covers in an enterprise setting. They wanted business users who simply play with their charts all day to be the decision makers. We have not finalized our decision just yet, but again, I stress analyzing true performance, and not small pieces of the overall picture.

  4. Andrei Pandre, 16/1/14 07:34

    Hi Laura and thanks for your comment. I agree that performance is a huge factor, especially in the special projects you are describing. However, I suggest you share more specifics and details. For example, which versions of the tools did you use? If you used Tableau 7 or Tableau 8.0, then you used only 32-bit executables; Tableau 8.1 is 64-bit and improves its performance with large datasets. Another detail worth sharing is the configuration of your "HUGE" machine (how much RAM, how many CPUs and Cores, how much disk space involved, etc.).

    The size of your Dataset is very important too. Assume (as you said) you have 100 billion Rows and each Row (on average) occupies 1KB of RAM. That means we are talking about 100 Terabytes of uncompressed Data and about 10 Terabytes of compressed Data (assuming a Compression Ratio of 10:1). The largest Servers you can buy (say from Dell or HP) can handle around 1 Terabyte of RAM, which means that about 9 Terabytes of your hypothetical Dataset have to stay on Disk. That implies a lot of Disk swapping and thrashing, and Qlikview does not handle it properly: its relatively recent new functionality called Direct Discovery (enabling access to disk-residing data) is very slow and immature, and I doubt it can handle a 10TB (compressed) dataset. Spotfire is probably better at interacting with disk, but again I need to see details before jumping to conclusions (see also the quick back-of-the-envelope check at the end of this reply).

    In terms of 64-bit Tableau, I was able to handle a Dataset with 2 Billion Rows (using a large piece of hardware), but I never tried 100 Billion Rows with Tableau, so I cannot say for certain.

    In addition, I wish to say that in my experience projects with huge datasets are usually not well designed and not well defined, and some simplifications can easily be found in order to reduce the dataset.

    Additionally, in many cases the Clustering of Datasets is advised, in order to split the dataset into smaller separate datasets and find the Needle in the Haystack.
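
    For completeness, a quick back-of-the-envelope check of the estimate above (my own sketch; the 1KB per row, the 10:1 compression ratio and the ~1TB of server RAM are the assumptions already stated in this reply):

```python
# Rough size estimate for the hypothetical 100-billion-row dataset,
# using the assumptions from the reply above.
rows = 100e9                # 100 billion rows (assumed)
bytes_per_row = 1024        # ~1 KB of RAM per uncompressed row (assumed)
compression_ratio = 10      # 10:1 in-memory compression (assumed)
server_ram_tb = 1.0         # ~1 TB of RAM in a large server (assumed)

uncompressed_tb = rows * bytes_per_row / 2**40       # ~93 TB, on the order of 100 TB
compressed_tb = uncompressed_tb / compression_ratio  # ~9.3 TB, i.e. about 10 TB
on_disk_tb = max(compressed_tb - server_ram_tb, 0)   # ~8-9 TB that cannot fit in RAM

print(f"Uncompressed: ~{uncompressed_tb:.0f} TB")
print(f"Compressed:   ~{compressed_tb:.1f} TB")
print(f"Has to stay on disk: ~{on_disk_tb:.1f} TB")
```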
