APACHE SPARK – Best Practices

20 March 2024


Key takeaways

  • Apache Spark architecture
  • Tuning and Best Practices
  • Proper sizing and configuration

Introduction

Apache Spark is a Big Data tool that processes large datasets in a parallel and distributed way. It extends the MapReduce programming model known from Apache Hadoop and makes it easier to develop applications that process large data volumes. Spark delivers far superior performance to Hadoop, in some cases running up to 100x faster.

Another advantage is that all of its components, such as Spark Streaming, Spark SQL, and GraphX, are integrated within the same framework, unlike Hadoop, which relies on tools that integrate with it but are distributed separately, like Apache Hive. Another important aspect is that Spark can be programmed in four different languages: Java, Scala, Python, and R.

Spark has several components for different types of processing, all built on Spark Core, the component that provides the basic processing functions such as map, reduce, filter, and collect:

  • Spark Streaming, for real-time stream processing
  • GraphX, for processing graphs
  • Spark SQL, for running SQL queries and processing data in Spark
  • MLlib, a machine learning library with algorithms for tasks such as clustering

Fig 1. Apache Spark Components

 

Spark Architecture

This section explains the primary functionality of Spark Core. It first presents the application architecture and then the basic concepts of the programming model for dataset processing. A Spark application architecture consists of three major parts:

  • The Driver Program is the main application; it manages the creation of the Spark Context and executes the processing defined by the programmer;
  • The Cluster Manager is an optional component, needed only when Spark is executed in a distributed way. It is responsible for administering the machines used as workers;
  • The Workers execute the tasks sent by the Driver Program. If Spark is executed locally, a single machine plays both the Driver Program and Worker roles. Fig. 2 shows the Spark architecture and its major components.

Fig 2. Spark Architecture

 

Apart from the architecture, it is important to know the principal components of Spark's programming model. Three fundamental concepts are used in every application (a minimal code sketch follows the list):

  • Resilient Distributed Datasets (RDD): they abstract a dataset distributed across the cluster, usually kept in main memory. RDDs can be stored in traditional file systems such as HDFS (Hadoop Distributed File System) and in NoSQL databases such as Cassandra and HBase. They are the primary objects of Spark's programming model because they are where the data is processed.
  • Operations: they represent the transformations or actions that are applied to an RDD. A Spark program is normally defined as a sequence of transformations and actions performed on a dataset.
  • Spark Context: the context is the object that connects Spark to the program being developed. It can be accessed as a variable in the program in order to use its resources.
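
As a minimal sketch of how these three concepts fit together (assuming Spark is on the classpath and a hypothetical input file numbers.txt with one integer per line):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("BasicExample").setMaster("local[*]")
val sc = new SparkContext(conf)              // Spark Context: connects the program to Spark

val numbers = sc.textFile("numbers.txt")     // RDD: a distributed dataset read from storage
val doubled = numbers.map(_.trim.toInt * 2)  // Operation (transformation): lazily defines a new RDD
val total = doubled.reduce(_ + _)            // Operation (action): triggers the actual computation

println(s"Total: $total")
sc.stop()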

 

Tuning and Best Practices

Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked by any resource in the cluster: CPU, network bandwidth, or memory. Most often, if the data fits in memory, the bottleneck is network bandwidth, but some tuning is still required in certain cases. This section presents several techniques for tuning Apache Spark for optimal efficiency:

 

Do NOT collect large RDDs

A memory exception will be thrown if the dataset is too large to fit in memory when calling RDD.collect(). Use functions such as take or takeSample instead to retrieve only a bounded number of elements.
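
A minimal sketch, assuming a hypothetical largeRdd that does not fit in driver memory:

val firstTen = largeRdd.take(10)                                       // only the first 10 elements
val sample = largeRdd.takeSample(withReplacement = false, num = 100)   // random sample of 100 elements
// largeRdd.collect()                                                  // avoid: pulls the entire dataset into the driver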

 

Do NOT use count() when you do NOT need to return the exact number of rows

Rather than returning the exact number of rows in the RDD, you can check whether it is empty with a simple if (rdd.take(1).length == 0).
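
For example, with a hypothetical rdd:

if (rdd.take(1).length == 0) {
  println("RDD is empty")       // only looks at one element, unlike rdd.count() == 0
}
// Spark also offers rdd.isEmpty(), which performs an equivalent check.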

 

Avoid groupByKey on Large Datasets

There are two functions, reduceByKey and groupByKey, and both produce the same results. However, the latter transfers the entire dataset across the network, while the former computes local sums for each key in each partition and combines those local sums into larger sums after shuffling.
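
A small sketch of the two approaches for a word-count-style aggregation (wordPairs is an illustrative RDD built with the SparkContext sc):

val wordPairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// Preferred: combines values locally on each partition before the shuffle.
val countsReduce = wordPairs.reduceByKey(_ + _)

// Avoid on large datasets: ships every key-value pair across the network before aggregating.
val countsGroup = wordPairs.groupByKey().mapValues(_.sum)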

Below is a diagram showing what happens with reduceByKey: multiple pairs on the same machine with the same key are combined before the data is shuffled.

Fig 3. ReduceByKey

 

On the other hand, when groupByKey is called, all the key-value pairs are shuffled around, and a great deal of unnecessary data is transferred over the network.

To determine which machine to shuffle a pair to, Spark calls a partitioning function on the pair's key. Spark spills data to disk when more data is shuffled onto a single executor machine than can fit in memory. However, it flushes the data to disk one key at a time, so if a single key has more key-value pairs than fit in memory, an out-of-memory exception occurs.

Fig 4. GroupByKey

Shuffling can be a significant bottleneck, and having many big HashSets (depending on your dataset) can also be a problem. However, clusters usually have more spare RAM than spare network bandwidth, so local in-memory reads and writes are faster than transferring data between distributed machines.

Here are more functions to prefer over groupByKey (a short sketch follows the list):

  • combineByKey can be used when you are combining elements but your return type differs from your input value type.
  • foldByKey merges the values for each key using an associative function and a neutral “zero value”.
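
A rough sketch of both alternatives, using an illustrative pairs RDD created with the SparkContext sc:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// foldByKey: associative merge with a neutral "zero value".
val sums = pairs.foldByKey(0)(_ + _)

// combineByKey: input values are Int, but the result type is (sum, count) per key.
val sumAndCount = pairs.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners
)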

 

Avoid the flatMap-join-groupBy pattern

When two datasets are already grouped by key and you want to join them while keeping them grouped, use cogroup. It avoids all the overhead associated with unpacking and repacking the groups.
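
A minimal sketch with two illustrative pair RDDs:

val orders = sc.parallelize(Seq((1, "order-A"), (1, "order-B"), (2, "order-C")))
val payments = sc.parallelize(Seq((1, "pay-X"), (2, "pay-Y")))

// For each key, cogroup yields the values from both sides in a single pass,
// avoiding the flatMap-join-groupBy round trip.
val grouped = orders.cogroup(payments)   // RDD[(Int, (Iterable[String], Iterable[String]))]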

 

Use coalesce to decrease the number of partitions

Use the coalesce function instead of repartition when decreasing the number of partitions of an RDD. coalesce is useful because it avoids a full shuffle: it reuses existing partitions to minimise the amount of data that is moved.
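
For instance, with an illustrative RDD read from a hypothetical file:

val manyPartitions = sc.textFile("big-file.txt")   // may have hundreds of partitions
val fewer = manyPartitions.coalesce(10)            // merges existing partitions, no full shuffle
// manyPartitions.repartition(10)                  // would trigger a full shuffle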

 

When to use Broadcast variable

Spark computes each task's closure before running it on the available executors. The closure consists of the variables and methods that must be visible for the executor to perform its computations on the RDD. If a huge array, e.g. some reference data, is accessed from Spark closures, that array is shipped to every Spark node with the closure. For instance, in a 10-node cluster with 100 partitions (10 partitions per node), the array will be distributed at least 100 times (10 times to each node). If a broadcast variable is used instead, it is distributed once per node using an efficient peer-to-peer protocol. Once the value is broadcast to the nodes, it cannot be changed, which guarantees that each node has exactly the same copy of the data; otherwise, a modified value might later be sent to another node and produce unexpected results.
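
A minimal sketch, assuming an illustrative lookup table used as reference data:

val lookupTable = Map("PT" -> "Portugal", "DE" -> "Germany")
val broadcastLookup = sc.broadcast(lookupTable)    // shipped once per node, not once per task

val events = sc.parallelize(Seq("PT", "DE", "PT"))
val resolved = events.map(code => broadcastLookup.value.getOrElse(code, "Unknown"))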

 

Joining a large and a small RDD

If the small RDD is small enough to fit into the memory of each worker, we can turn it into a broadcast variable and turn the entire operation into a so-called map-side join for the larger RDD. This way, the larger RDD does not need to be shuffled at all. This situation arises frequently when the smaller RDD is a dimension table.
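
A sketch of the idea, with two illustrative pair RDDs (a large fact RDD and a small dimension RDD):

val largeRdd = sc.parallelize(Seq((1, "fact-A"), (2, "fact-B"), (1, "fact-C")))
val smallRdd = sc.parallelize(Seq((1, "dim-X"), (2, "dim-Y")))      // small dimension table

val smallAsMap = sc.broadcast(smallRdd.collectAsMap())              // fits in each worker's memory

// Map-side join: the large RDD is never shuffled.
val joined = largeRdd.flatMap { case (key, value) =>
  smallAsMap.value.get(key).map(dim => (key, (value, dim)))
}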

 

Joining a large and a medium RDD

If the medium-sized RDD does not fit fully into memory but its key set does, it is possible to exploit this. Because a join discards all elements of the larger RDD that have no matching partner in the medium-sized RDD, the medium RDD's key set can be used to filter the larger RDD before the shuffle. If a significant number of entries are discarded this way, the resulting shuffle will need to transfer a lot less data. It is important to note that the efficiency gain here depends on the filter operation actually reducing the size of the larger RDD.
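
A rough sketch, assuming illustrative largeRdd and mediumRdd pair RDDs whose key set fits in memory:

val largeRdd = sc.parallelize((1 to 1000000).map(i => (i % 1000, s"fact-$i")))
val mediumRdd = sc.parallelize((0 until 100).map(i => (i, s"dim-$i")))

val keySet = sc.broadcast(mediumRdd.keys.collect().toSet)           // only the keys, not the values

val reducedLarge = largeRdd.filter { case (key, _) => keySet.value.contains(key) }
val joined = reducedLarge.join(mediumRdd)   // the shuffle now moves far less data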

 

Use the right level of parallelism

Unless the level of parallelism for each operation is high enough, the cluster will not be fully utilised. Spark automatically sets the number of partitions of an input file according to its size, and does the same for distributed shuffles. Spark creates one partition for each block of the file in HDFS, which is 64 MB by default. When creating an RDD, it is possible to pass a second argument specifying the number of partitions, e.g.:

val rdd = sc.textFile("file.txt", 5)

The above statement creates an RDD from the text file with 5 partitions. Ideally, an RDD should be created with a number of partitions equal to the number of cores in the cluster, so that all partitions are processed in parallel and resources are used evenly.

For DataFrames, the number of partitions produced by a shuffle is controlled by the spark.sql.shuffle.partitions parameter, whose default value is 200.
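
A short sketch of both settings, assuming a SparkSession named spark alongside the rdd created above (the value 64 is only illustrative):

spark.conf.set("spark.sql.shuffle.partitions", "64")   // tune shuffle partitions used by DataFrame operations
println(rdd.getNumPartitions)                          // verify how many partitions an RDD actually has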

 

How to estimate the number of partitions and the executor and driver parameters (YARN Cluster Mode)

yarn.nodemanager.resource.memory-mb = ((Node's RAM in GB – 2 GB) * 1024) MB

Total Number of Node’s Core = yarn.nodemanager.resource.cpu-vcores

  • Executor’s parameters (Worker Node):
    • Executor (VM) x Node = ((total number of the Node's cores) / 5) – 1
      • 5 is the upper bound for cores per executor, because more than 5 cores per executor can degrade HDFS I/O throughput
      • If the total number of the Node's cores is less than or equal to 8, divide it by 2 instead
      • If the total number of the Node's cores is equal to 1, Executor (VM) x Node is equal to 1
    • numExecutors (number of executors to launch for this session) = number of Nodes * Executor (VM) x Node
      • The Driver is included in the executors.
    • executorCores (number of cores to use for each executor) = (total number of the Node's cores – 5) / Executor (VM) x Node
    • executorMemory (amount of memory to use per executor process) = (yarn.nodemanager.resource.memory-mb – 1024) / (Executor (VM) x Node + 1)

For executorMemory, the memory allocation is based on the following formula:

Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction, where memoryFraction = spark.storage.memoryFraction and safetyFraction = spark.storage.safetyFraction.

The default values of spark.storage.memoryFraction and spark.storage.safetyFraction are respectively 0.6 and 0.9 so the real executorMemory is:

executorMemory = ((yarn.nodemanager.resource.memory-mb – 1024) / (Executor (VM) x Node +1)) * memoryFraction * safetyFraction.

  • Driver’s parameters (Application Master Node):
    • driverCores = executorCores
    • driverMemory = executorMemory

 

Consider the following example:

3 Worker Nodes and one Application Master Node, each with 16 vCPUs and 52 GB of memory

yarn.nodemanager.resource.memory-mb = (52 – 2) * 1024 = 51200 MB

yarn.scheduler.maximum-allocation-mb = 20830 MB (must be greater than executorMemory)

  • Executor’s params (Worker Node):
    • Executor x Node = (16 / 5) – 1 = 2
    • numExecutors = 2 * 4 = 8
    • executorCores = (16 – 5) / 2 = 5
    • executorMemory = ((51200 – 1024) / 3) * 0.6 * 0.9 = 16725.33 MB * 0.6 * 0.9 = 9031.68 MB
  • Driver’s params (Application Master Node):
    • driverCores = 5
    • driverMemory = 16725.33 MB * 0.6 * 0.9 = 9031.68 MB
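
The arithmetic above can be reproduced with a small sketch (the node profile is the illustrative one from this example; integer division explains the small rounding difference):

val nodes = 4                 // 3 worker nodes + 1 application master node
val coresPerNode = 16
val ramPerNodeGb = 52

val yarnMemoryMb = (ramPerNodeGb - 2) * 1024                       // 51200
val executorsPerNode = (coresPerNode / 5) - 1                      // 2
val numExecutors = nodes * executorsPerNode                        // 8 (driver included)
val executorCores = (coresPerNode - 5) / executorsPerNode          // 5
val executorMemoryMb = (yarnMemoryMb - 1024) / (executorsPerNode + 1) * 0.6 * 0.9   // ≈ 9031 MB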

References

  • https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/chapter1.html

Author

João Gaspar


Senior Consultant
