



Horizon 2020 European Union funding for Research & Innovation



# On the implications of Heterogeneous Memory Tiering on Spark In-memory Analytics

Manolis Katsaragakis<sup>\*</sup>, **Dimosthenis Masouros**<sup>\*</sup>, Lazaros Papadopoulos<sup>\*</sup>, Francky Catthoor⁺<sup>α</sup>, Dimitrios Soudris<sup>\*</sup>

\*Microprocessors and Digital Systems Laboratory, ECE, National Technical University of Athens (NTUA), Greece ⁺Katholieke Universiteit Leuven (KUL), Belgium αIMEC, Leuven, Belgium

> {mkatsaragakis, dmasouros, lpapadop, dsoudris}@microlab.ntua.gr francky.catthoor@imec.be



### Introduction

 Over 2.5 quintillion bytes generated daily<sup>1</sup>

#### End Users

 Adoption of novel in-memory processing frameworks for large scale data analytics

#### Providers

- Integration of heterogeneous memory technologies and multi-tier memory architectures.
  - DRAM along with PMEM on the same server
  - Disaggregated DRAM



1. Data never sleeps, https://www.domo.com/solution/data-never-sleeps-6

M. Katsaragakis et al.

Provide an **exploration** and **performance analysis** of **Spark applications** over a **heterogeneous multi-tier** memory system

- Key questions w.r.t. the effect of memory tiering on Spark analytics
- Key takeaways in terms of:
  - Performance Implications
  - Performance Bottlenecks
  - Performance predictability

# Spark (quick) Background








Spark Driver

Orchestrator that determines the **tasks** to be performed based on a piece of code

- Extremely efficient
- Requires a huge amount of memory
- Perfect candidate for multi-tier/disaggregated systems!




#### **Spark Benchmarks**

- Benchmarks derived from HiBench<sup>1</sup> suite:
  - Diverse domains
    - micro-operations, ML, web search
  - Diverse set of input workloads:
    - tiny, small, large

- Pseudo-distributed, standalone mode:
  - Spark driver and executors on the same node
  - HDFS file system

| Application                 | Abbr.       | Data size (tiny / small / large)                                                                          |
|-----------------------------|-------------|-----------------------------------------------------------------------------------------------------------|
| Sorting of text input data  | sort        | 32KB / 320MB / 3.2GB                                                                                      |
| Performs shuffle operations | repartition | 3.2KB / 3.2MB / 32MB                                                                                      |
| Alternating Least Squares   | als         | 100 / 1,000 / 10,000 (users)<br>100 / 1,000 / 10,000 (products)<br>200 / 2,000 / 20,000 (ratings)         |
| Naive Bayes classification  | bayes       | 25,000 / 30,000 / 100,000 (pages)<br>10 / 100 / 100 (classes)                                             |
| Random forest               | rf          | 10 / 100 / 1,000 (examples)<br>100 / 500 / 1,000 (features)                                               |
| Latent Dirichlet Allocation | lda         | 2,000 / 5,000 / 10,000 (docs)<br>1,000 / 2,000 / 3,000 (vocabulary)<br>10 / 20 / 30 (topics)              |
| PageRank                    | pagerank    | 50 / 5,000 / 500,000 (pages)                                                                              |

1. HiBench, https://github.com/Intel-bigdata/HiBench

#### **Hardware Testbed**

- Dual-socket Intel Xeon 5218R
  - 40 threads/socket
- Symmetric DRAM topology
  - 2x32GB DDR4 DRAM DIMMs per socket
- Asymmetric Intel Optane DCPM topology
  - 2x256GB (socket 1) vs 4x256GB (socket 2)
  - App Direct mode
- 4 memory tiers with different latency and bandwidth
  - Tier binding through the `numactl` command
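Tier binding with `numactl` can be scripted. The following is a minimal sketch, not the authors' setup: the tier-to-NUMA-node mapping and the `./run_benchmark.sh` launcher are hypothetical placeholders, since the slides do not list node IDs or the launch command. Only the `--cpunodebind` and `--membind` options are standard `numactl` flags.

```python
import shlex

# Hypothetical tier-to-NUMA-node mapping; the slides do not give node IDs.
# Inspect `numactl --hardware` on the target machine before relying on it.
TIER_TO_MEM_NODE = {0: 0, 1: 1, 2: 2, 3: 3}


def numactl_command(tier, cpu_node, app_cmd):
    """Build a numactl invocation that pins CPUs and memory for one run."""
    mem_node = TIER_TO_MEM_NODE[tier]
    return [
        "numactl",
        f"--cpunodebind={cpu_node}",  # run threads on the cores of this node
        f"--membind={mem_node}",      # allocate memory only from this node
        *app_cmd,
    ]


# "./run_benchmark.sh sort" is a placeholder for the real benchmark launcher.
cmd = numactl_command(tier=2, cpu_node=0, app_cmd=["./run_benchmark.sh", "sort"])
print(shlex.join(cmd))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.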



| Tier   | Idle Latency (ns) | Bandwidth (GB/s) |
|--------|-------------------|------------------|
| Tier 0 | 77.8              | 39.3             |
| Tier 1 | 130.9             | 31.6             |
| Tier 2 | 172.1             | 10.7             |
| Tier 3 | 231.3             | 0.47             |
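To put the gap between tiers in perspective, the measurements in the table can be normalized against Tier 0. This is a small illustrative calculation using only the table values; the function name is chosen here for illustration.

```python
# (idle latency in ns, bandwidth in GB/s) per tier, copied from the table above.
TIERS = {
    "Tier 0": (77.8, 39.3),
    "Tier 1": (130.9, 31.6),
    "Tier 2": (172.1, 10.7),
    "Tier 3": (231.3, 0.47),
}


def relative_to_fastest(tiers, baseline="Tier 0"):
    """Return (latency increase factor, bandwidth reduction factor) per tier."""
    base_lat, base_bw = tiers[baseline]
    return {
        name: (lat / base_lat, base_bw / bw)
        for name, (lat, bw) in tiers.items()
    }


ratios = relative_to_fastest(TIERS)
for name, (lat_x, bw_x) in ratios.items():
    print(f"{name}: {lat_x:.2f}x latency, {bw_x:.2f}x lower bandwidth")
```

Tier 3 ends up with roughly 3x the idle latency of Tier 0 but more than 80x less bandwidth, so the two metrics degrade at very different rates across tiers.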

How do applications perform on different tiers?










**Takeaway:** Performance degradation depends on the nature of each application and input workload size

[Figure: execution time (sec) on Tier 0 to Tier 3 and number of memory accesses (memory reads vs. memory writes) for als, lda and pagerank with tiny, small and large inputs]

Performance drop is proportional to the number of **RD+WR** accesses

What is the core bottleneck of performance degradation?

**Takeaway:** Performance is highly affected by the number of RD and WR operations on PMEM, with the latter having even more impact by design.

#### **Energy Implications of Memory Tiering**



How about energy consumption?

#### **Bandwidth vs. Latency**

- Limit cores' available bandwidth to memory and execute on Tier 2
  - Intel's Memory Bandwidth Allocation (MBA) tool\*
  - 20, 40, 60, 80, 100%





- Average execution time and variance are **not** affected by available bandwidth
- Our applications do <u>not</u> saturate bandwidth

**Takeaway:** Performance is dominated by latency and **bandwidth is not saturated**

\*https://github.com/intel/intel-cmt-cat
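The bandwidth sweep can be sketched as a pair of `pqos` invocations (the command-line tool shipped with intel-cmt-cat) per throttling level. This is an assumption-laden sketch rather than the authors' script: the class-of-service ID, the core list, and the exact option strings (`-e "mba:..."` to define the class, `-a "llc:..."` to associate cores with it) should be checked against the installed `pqos` version.

```python
# Throttling levels used in the experiment, as listed on the slide.
MBA_LEVELS = (20, 40, 60, 80, 100)


def mba_commands(percent, cores, cos_id=1):
    """Build pqos commands that cap memory bandwidth for a set of cores.

    cos_id and cores are illustrative; the slides do not state which class
    of service or which cores were throttled.
    """
    if percent not in MBA_LEVELS:
        raise ValueError(f"unsupported MBA level: {percent}")
    return [
        ["pqos", "-e", f"mba:{cos_id}={percent}"],  # set the cap for the class
        ["pqos", "-a", f"llc:{cos_id}={cores}"],    # attach cores to the class
    ]


for level in MBA_LEVELS:
    for command in mba_commands(level, cores="0-19"):
        print(" ".join(command))
```

The commands are only printed here; running them requires root privileges and hardware with MBA support.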

### Spark "Sizing" vs. Performance

- Different number of executors and cores/executor
  - Executor colocation with concurrent access to memory
- Baseline (default execution) → single executor, 40 cores
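One way to enumerate sizing configurations is to keep the total core count fixed and vary how it is split among executors. The sketch below assumes a 40-core budget matching the baseline; the slides do not state which combinations were actually evaluated. In standalone mode, the number of executors follows from `spark.cores.max` divided by `spark.executor.cores`, both of which are standard Spark properties.

```python
def sizing_options(total_cores=40):
    """List (executors, cores per executor) pairs that use every core."""
    return [
        (n, total_cores // n)
        for n in range(1, total_cores + 1)
        if total_cores % n == 0
    ]


def spark_conf(executors, cores_per_executor):
    """Spark properties that yield the requested sizing in standalone mode."""
    return {
        "spark.executor.cores": str(cores_per_executor),
        "spark.cores.max": str(executors * cores_per_executor),
    }


for executors, cores in sizing_options(40):
    flags = " ".join(f"--conf {k}={v}" for k, v in spark_conf(executors, cores).items())
    print(f"{executors:>2} executor(s) x {cores:>2} cores: {flags}")
```

The first option, one executor with 40 cores, corresponds to the baseline described above.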





How do different deployment approaches affect performance?

**Takeaway:** Increased number of executors that compete over shared memory resources leads to further performance degradation, with persistent memory being even more susceptible to resource contention.

**Takeaway:** Certain benchmarks are not affected by altering deployment's sizing

**Takeaway:** A bigger workload size can lead to a performance boost, as parallel processing amortizes the degradation caused by interference


#### **Performance Predictability**

- Pearson Correlation
- How execution time correlates with:
  1. System-level events (e.g., IPC, LLC misses)?
     - No linear correlation for the majority of the benchmarks
       - → Complex ML models needed
  2. Hardware specs of each tier (Latency/Bandwidth)?
     - Very high linear correlation for all benchmarks
       - → Linear models can be utilized
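For reference, the Pearson coefficient can be computed in a few lines. The sketch below is illustrative only: since the slides do not include per-run execution times, it correlates the two hardware specs from the tier table (idle latency and bandwidth) with each other. Measured execution times would replace the second series in a real analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Tier 0 to Tier 3 values from the hardware testbed table.
latency_ns = [77.8, 130.9, 172.1, 231.3]
bandwidth_gbs = [39.3, 31.6, 10.7, 0.47]

r = pearson(latency_ns, bandwidth_gbs)
print(f"latency vs. bandwidth across tiers: r = {r:.3f}")
```

A coefficient close to +1 or -1 indicates a strong linear relationship, which is the case where a simple linear model is a reasonable predictor.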



#### **Conclusions**

- In-memory applications + Multi-tier memory architectures emerging
- Spark perfect candidate
  - In-memory computations
  - Vast amount of memory requirements

In this work:

- Performance analysis of Spark applications over heterogeneous multi-tier memory system
- Key takeaways
  - Spark applications highly affected by slower memory tiers (due to latency)
  - Slower memory tiers can be utilized without performance drop in certain cases
  - Promising signs for performance predictions using ML



{mkatsaragakis,dmasouros}@microlab.ntua.gr