6 September 2024
Key takeaways
The volume of data we deal with has grown exponentially. Large volumes of data, so-called Large Datasets, which used to be the exception, are now the rule. Harnessing their full potential can be a decisive factor between success and failure for companies. It’s up to them to choose the right tools and apply the best practices to make the most of Large Datasets.
This article will delve into the characteristics of a Large Dataset, providing a comprehensive understanding of its main features. By understanding these characteristics, you will be better equipped to navigate the challenges and implement the best practices for effectively using Large Datasets.
1. What is a Large Dataset?
In Power BI, a Large Dataset is characterised mainly by two aspects:
1.1 Size:
A Large Dataset is exceptionally large, usually containing hundreds of millions, or even billions, of records, and thus easily exceeds the 1GB size limit that the Power BI Pro licence allows you to publish.
1.2 Complexity:
It contains a large number of tables and relationships, which are often intricate, together with equally complex metrics and calculations.
These two characteristics, combined with the need for immediate access to the data, make it difficult, or almost impossible, to stay within Power BI’s capacity and performance limits.
2. Main challenges of Large Datasets
Using Large Datasets can present several challenges if they are not well managed. Poor management can have consequences for performance, usability and even data quality. The main challenges in using Large Datasets are:
2.1 Report Performance
As the volume of data increases, so does the time it takes for Power BI to retrieve the data. This often results in slow and unresponsive reports, negatively impacting the user experience. It’s crucial to be aware of these potential issues and implement strategies to manage them effectively.
In addition, because of their complexity, the queries issued consume significant resources, namely memory and CPU, which can overload the system and further degrade report performance.
2.2 Limitations of the Data Model
As mentioned earlier, the Power BI Pro licence has a 1GB limit for data held in memory. Although other licences allow larger limits, managing datasets that exceed 1GB, which is often the case with Large Datasets, becomes more complex and requires meticulous planning to keep all the data accessible. Combining in-memory data with Direct Query mode can help with this limitation, since it allows you to access all the data without storing it in memory. However, Direct Query can also hurt report performance, because queries are sent directly to the source when the report is opened and therefore take longer to return results. In addition, Direct Query limits the number of queries that run in parallel, so reports with many visuals will be even slower.
2.3 Data quality and consistency
Ensuring good data quality and consistency becomes more complicated when dealing with a Large Dataset. Validating and cleaning the data, such as removing duplicate or blank records and reworking calculations, simply takes longer given the volume of data that has to be processed.
2.4 Data Refresh
Because of the sheer volume of data involved, each refresh of a Large Dataset can take several hours. Refreshes may therefore finish later than expected or even be blocked, and the report will not show the latest data, which can mislead those consulting it.
3. Best practices for dealing with Large Datasets
To meet the challenges mentioned above, work effectively with Large Datasets and get the best out of them, the following practices can be implemented:
3.1 Aggregated tables
The visuals in a report don’t always need to query facts at the lowest granularity; more often than not, they query a higher granularity. One way to take advantage of this and improve the performance of a report connected to a Large Dataset is to use so-called aggregated tables.
Aggregated tables summarise the fact data at a higher level of aggregation, which makes them smaller in terms of rows and of the resources and memory consumed, and therefore lighter for Power BI to process. In addition, aggregated tables hold pre-calculated values such as SUMs and COUNTs, e.g. purchase value, which are often used in reports. Having these values already pre-calculated makes presenting them in the report quicker and more efficient, since calculations that were previously performed at query time are now stored in the table and no longer require complex operations to obtain.
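To make this concrete, the sketch below shows, in DAX terms, the shape of data an aggregated table holds: the fact grouped at a higher grain, here by date and product, with the SUMs and COUNTs already computed. It is purely illustrative and uses hypothetical table and column names; as discussed next, the table itself would normally be built in Power Query or in the source database rather than as a DAX calculated table.

-- Conceptual sketch only; FactSales and its columns are hypothetical names.
Sales Agg =
SUMMARIZECOLUMNS (
    FactSales[DateKey],
    FactSales[ProductKey],
    "Purchase Value", SUM ( FactSales[PurchaseValue] ),  -- pre-calculated SUM
    "Purchase Count", COUNTROWS ( FactSales )            -- pre-calculated COUNT
)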
Aggregated tables can be created directly in Power BI, via Power Query, or in the database. In Power BI, simply use the ‘Manage Aggregations’ functionality and define the correspondence of columns and aggregations between the aggregated table and the fact table. Note that the fact table must be in Direct Query mode.
After this, contrary to what we might expect, the metrics should continue to query the fact table in Direct Query; that is, their formulas should reference the fact table and not the aggregated table. The Power BI engine is then responsible for deciding, based on the filters applied in the calculations and visuals, whether it can answer from the aggregated table or whether it has to query the fact table in Direct Query. Letting the engine make this choice also keeps the metrics simple, since you don’t have to encode the aggregate vs. fact table decision in the metric formula.
In short, aggregated tables are a great help in improving performance because they avoid querying the fact table for every calculation, which makes the report faster and the user experience much more pleasant.
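A minimal sketch, assuming a Direct Query fact table named FactSales with a PurchaseValue column (both hypothetical names) mapped to an import-mode aggregated table via ‘Manage Aggregations’:

-- The measure references only the fact table; table and column names are hypothetical.
Total Purchase Value =
SUM ( FactSales[PurchaseValue] )

-- When a visual or filter operates at a grain covered by the aggregated table, the
-- engine answers from the in-memory aggregation; otherwise it falls back to a
-- Direct Query against FactSales. The formula stays the same in both cases.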
3.2 Modelling
Whenever possible, a star or snowflake model should be implemented to remove as much complexity as possible from the data model. These types of models simplify the relationships between tables and reduce the complexity of queries, leading to shorter response times.
In addition, the relationships defined should have the cardinality and direction most suited to the model to ensure the most efficient data retrieval.
The creation of calculated columns and calculated tables should also be avoided. These operations are resource-intensive for Power BI, so they should be performed in the database whenever possible, and only the columns and tables that are strictly necessary should be brought into Power BI to avoid taking up memory unnecessarily.
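As an illustration, with hypothetical table and column names, the first definition below is the kind of DAX calculated column that is costly on a large fact table; the second shows the same logic expressed as a measure, which is evaluated only at query time instead of being stored in memory (or, better still, the column can be materialised directly in the source database or view):

-- A calculated column like this is evaluated for every row of the fact table and
-- stored in memory (hypothetical names).
Gross Margin = FactSales[PurchaseValue] - FactSales[CostValue]

-- The same logic as a measure consumes no extra memory and is computed only when
-- a visual requests it; alternatively, compute the column upstream in the database.
Total Gross Margin =
SUM ( FactSales[PurchaseValue] ) - SUM ( FactSales[CostValue] )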
3.3 Visualisations
In terms of visualisations, some suggestions can be implemented to optimise the experience of using the report:
- Stay under 10 visualisations, including slicers, per page.
- Maximise the number of visuals that, by default, query the aggregated table, i.e. show data at a higher granularity, and only drill down to a lower granularity, which queries the fact table in Direct Query, when necessary. This can be done with the drill-down functionality.
- Avoid using custom visualisations. Priority should be given to using Power BI’s native visualisations.
- Use slicers and filters to pre-filter the data in the report to reduce the amount of data processed at one time.
Large Datasets are a reality, and we already have features at our disposal, such as aggregated tables, that allow us to face this challenge and make the most of it.