19 March 2024
Key takeaways
Every Analysis Services (AS) tabular model has at its core an in-memory columnar database called VertiPaq. This optimised engine hosts and stores the model in RAM, transforming, organising and compressing the data into columns, each with its own structure and physically separated from the others.
Columnar databases make aggregations over a single column very quick. However, calculations involving many columns require more time (and CPU cycles) to reorganise the data and compute the final result.
There are many ways to optimise a tabular model; which one fits best depends essentially on the model's characteristics and main goal.
This article presents ten best practice tips that can be applied to boost the engine performance of any tabular data model.
1) Optimising the data model
Theoretically, the best way to improve your tabular model is to bring all the information together into one single table. In practice, however, this approach would lead to a very slow and expensive ETL process, which is not advisable. The best practice, especially for complex models with many relationships between big tables, is to build the model using the "Kimball" methodology (also known as "dimensional modelling" or "star schemas"). Organising the data into "fact" and "dimension" tables optimises the model for advanced analytical scenarios and simplifies data interpretation.
2) Removing unused columns
VertiPaq Analyzer is a very helpful, free tool for exploring and analysing the storage structures of an Analysis Services data model. By inspecting table and column sizes, it is easy to find the most expensive ones; when these are not required for analysis, they should always be removed. Columns with high cardinality (i.e. a high number of distinct values), such as a timestamp or the primary key of a fact table, are very expensive, and removing them frees a lot of memory and consequently improves your data model's performance considerably.
3) Reducing the number of values but not the information
Every column included in an AS tabular model is compressed and stored by the VertiPaq engine using one of two techniques: value encoding or hash encoding.
Value encoding is an algorithm that can only be applied to numeric columns. With this mathematical technique, VertiPaq tries to reduce the number of bits required to store the data, based on the range of values in the column. For example, a column whose values range from 2,000,000 to 2,001,000 can be stored as offsets from 2,000,000, which fit in far fewer bits than the original values.
Hash encoding (also known as dictionary encoding) can be applied efficiently to any data type. It works by building a dictionary containing the distinct values of a column and replacing each value with an integer index that points back into the dictionary.
Sometimes, when a column has many distinct values, it is possible and recommended to split its content into multiple columns (each with a smaller number of distinct values) that can be combined to recover the original one. This decreases the dictionary size and consequently improves compression. For example, if you have a datetime column containing a timestamp, it is much more efficient to split it into two columns: one for the date and one for the time, as sketched below.
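As a minimal sketch, assuming a hypothetical Sales[OrderTimestamp] column, the split could be expressed as two DAX calculated columns (in practice it is even better to do this upstream, in the ETL):
// Date part: at most a few thousand distinct values
Sales[Order Date] = DATE(YEAR(Sales[OrderTimestamp]), MONTH(Sales[OrderTimestamp]), DAY(Sales[OrderTimestamp]))
// Time part: at most 86,400 distinct values (one per second)
Sales[Order Time] = TIME(HOUR(Sales[OrderTimestamp]), MINUTE(Sales[OrderTimestamp]), SECOND(Sales[OrderTimestamp]))
The original timestamp could have one distinct value per row; the two split columns carry the same information with much smaller dictionaries.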
4) Reducing precision
There are cases where the precision of a specific column in your data model is not relevant. In these cases, rounding the column values (whether datetime or numeric) will reduce its cardinality and consequently save a lot of memory, as in the sketch below.
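As a sketch, assuming hypothetical Sales[Amount] and Sales[EventTimestamp] columns where extra decimal places and the seconds are irrelevant:
// Round a numeric column to two decimal places
Sales[Amount Rounded] = ROUND(Sales[Amount], 2)
// Keep only hour and minute, discarding seconds
Sales[Event Time] = TIME(HOUR(Sales[EventTimestamp]), MINUTE(Sales[EventTimestamp]), 0)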
5) Choosing measures over calculated columns
Calculated columns can be very expensive because they can generate big dictionaries. In some cases this can be avoided when the calculated column is just a simple expression over other columns of the same table; it can then be replaced by a measure. For example, in the table shown below, the calculated column "Sales Amount" was obtained by multiplying "Quantity" by "Unit Price".
Table 1: A typical Sales table with redundant data.
A possible way to optimise this is to store only "Quantity" and "Unit Price" and use the following DAX measure, which takes up no memory because it is evaluated at query time, to obtain "Sales Amount":
Sales[Sales Amount] = SUMX(Sales, Sales[Quantity] * Sales[Unit Price])
6) Choosing the sort order
Besides hash and value encoding, there is a third, complementary compression technique used by VertiPaq called run-length encoding (RLE). RLE reduces the size of a column by avoiding the repeated storage of equal values: each value is stored together with the number of contiguous rows that contain it. For example, a column in which the value "Red" fills 1,000 consecutive rows can be stored as a single pair (Red, 1000). The efficiency of this approach strongly depends on the repetition pattern of the column: equal values repeated over many contiguous rows (i.e. sorted columns) compress better. Therefore, when dealing with large tables, it is worth finding the best way to sort your data, thus improving RLE efficiency. Least-changing columns should come first in the sort order, as they are likely to produce long runs of repeated values.
7) Filter out unused dimension values
If your model contains dimension records that are never referenced by the fact tables then, to save space, the best practice is to filter them out. For example, if your date dimension has 3,000,000 records while just 92 of them are used in the fact table, you should filter the dimension to retain only those 92. In bigger and more complex models, this technique yields significant memory savings and performance gains.
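As a quick check, assuming a hypothetical 'Date' dimension related to a Sales fact table, a DAX query along these lines lists the dimension rows that no fact row references (the actual filtering is best done in the ETL):
EVALUATE
FILTER('Date', ISEMPTY(RELATEDTABLE(Sales)))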
8) Define the Encoding Hint
Value encoding offers better performance for columns used in aggregations, while hash encoding is preferred for group-by columns (usually columns from dimension tables).
When Analysis Services starts processing a table, it analyses a sample of each numeric column to decide which type of encoding to apply. It can happen that, after the column is partially processed, newly read values force the engine to change the encoding it had chosen and restart processing the column. To avoid this, given your prior knowledge of the data, it is good practice to state your preference up front by setting the column's "EncodingHint" property to the desired method.
9) Use less expensive data types
For SSAS Tabular, some data types are more expensive to store than others. For example, using Date instead of DateTime, or integer IDs instead of strings, can improve the performance of your data model. Since VertiPaq compresses by cardinality, the cost of a column depends essentially on the range and number of distinct values it holds, so narrower types lead to smaller dictionaries and better compression.
10) Optimise your DAX
Data Analysis Expressions (DAX) is the language that lets us derive new information from the data already imported into the tabular model. Many aspects influence query performance, and slow queries are usually caused by complex, inefficient DAX calculations (even in small models). The following good practices can be applied to optimise your DAX (short sketches illustrating several of them follow the list):
- Use variables instead of repeating measure references inside IF() branches; this decreases the number of times a given expression is evaluated (see the first sketch after this list).
- Use the DIVIDE() function instead of the "/" operator. DIVIDE() takes an optional third parameter that is returned whenever the denominator is 0. If you are certain your denominator will never be 0, then the "/" operator is the better choice.
- Do not replace BLANK values with zeros or placeholder strings unless it is a requirement. By default, rows with only BLANK values are automatically filtered out of the report; replacing those values defeats that filtering and affects performance negatively.
- Use SELECTEDVALUE() instead of HASONEVALUE(). The HASONEVALUE() function is often used to check whether only one distinct value of a column is visible in the current filter context, after which VALUES() is applied to retrieve it. SELECTEDVALUE() performs both steps by itself.
- Use SEARCH() and set its last parameter. The last parameter defines the value to return when the search string is not found, which is much more efficient than wrapping SEARCH() in error-handling functions.
- Use SELECTEDVALUE() instead of VALUES(). VALUES() returns an error when used as a scalar while more than one value is visible; to avoid this, many people resort to error-handling functions, which strongly affect performance. SELECTEDVALUE() is an optimised function that simply returns BLANK (or a specified default) whenever it finds multiple values.
- If you want to group some columns and return the resulting aggregation, always use SUMMARIZECOLUMNS() instead of SUMMARIZE(). SUMMARIZECOLUMNS() is the more recent and more efficient function.
- Use = 0 instead of ISBLANK() || = 0. A BLANK value is always associated with the base value of its data type, and for numeric columns BLANK coerces to 0, so the single comparison = 0 internally performs both checks (ISBLANK() and the comparison with 0). To compare only with 0, excluding BLANK, use the IN {} operator instead (see the last sketch after this list).
- Use the ISBLANK() function instead of the comparison = BLANK().
- Avoid nesting iterator functions inside measure expressions. A measure referenced inside an iterator is recalculated for every row of the iterated table, so iterator functions such as ADDCOLUMNS(), SUMX(), MAXX() or CONCATENATEX() that contain other measures or iterators generate nested iterations, which cause a real loss in query performance (iterators with simple, inline logic are typically not a problem). There are two ways to identify this type of function: aggregation iterators usually end with an X, and most iterators take a table as the first argument and an expression as the second.
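To illustrate the first two tips, here is a minimal sketch assuming hypothetical [Sales Amount] and [Total Cost] measures; each variable is evaluated once and reused, and DIVIDE() simply returns BLANK when the denominator is 0:
Margin % =
VAR SalesAmount = [Sales Amount]
VAR CostAmount = [Total Cost]
RETURN DIVIDE(SalesAmount - CostAmount, SalesAmount)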
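For the SELECTEDVALUE() tips, a sketch assuming a hypothetical Product[Colour] column; the second definition is the optimised equivalent of the first:
// Verbose pattern: check for a single value, then retrieve it
Selected Colour = IF(HASONEVALUE(Product[Colour]), VALUES(Product[Colour]), "Multiple colours")
// Optimised equivalent: one function performs both steps
Selected Colour = SELECTEDVALUE(Product[Colour], "Multiple colours")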
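For the SEARCH() tip, a sketch assuming a hypothetical Product[Name] column; the fourth parameter (0) is returned when the text is not found, so no error-handling wrapper is needed:
Product[Contains Blue] = SEARCH("blue", Product[Name], 1, 0) > 0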
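Finally, for the blank-handling tips, a few expression fragments over a hypothetical Sales[Amount] column:
// TRUE when Amount is 0 or BLANK ("=" coerces BLANK to 0)
Sales[Amount] = 0
// TRUE only when Amount is exactly 0, excluding BLANK
Sales[Amount] IN {0}
// Test for BLANK with the dedicated function rather than "= BLANK()"
ISBLANK(Sales[Amount])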