# Improving Fairness in Memory Scheduling Using a Team of Learning Automata

Aditya Kajwe and Madhu Mutyam

Department of Computer Science & Engineering, Indian Institute of Tehcnology - Madras

June 14, 2014

## Outline



- **Related Work** 2
- Our Learning Automata-based Algorithm

### Experiments



-

#### DRAM scheduling

- The order in which memory access requests from the CPU are processed at DRAM.

. . . . . . .

#### DRAM scheduling

- The order in which memory access requests from the CPU are processed at DRAM.
- Impacts main memory fairness, throughput & power consumption.

#### DRAM scheduling

- The order in which memory access requests from the CPU are processed at DRAM.
- Impacts main memory fairness, throughput & power consumption.

#### Metrics for evaluating a scheduling algorithm

 harmonic speedup, execution time, sum-of-IPCs, maximum slowdown, weighted speedup

#### DRAM scheduling

- The order in which memory access requests from the CPU are processed at DRAM.
- Impacts main memory fairness, throughput & power consumption.

#### Metrics for evaluating a scheduling algorithm

 harmonic speedup, execution time, sum-of-IPCs, maximum slowdown, weighted speedup

- harmonic speedup = 
$$\frac{N}{\sum_{i} \frac{IPC_{i}^{alone}}{IPC_{i}^{shared}}}$$

#### DRAM scheduling

- The order in which memory access requests from the CPU are processed at DRAM.
- Impacts main memory fairness, throughput & power consumption.

#### Metrics for evaluating a scheduling algorithm

- harmonic speedup, execution time, sum-of-IPCs, maximum slowdown, weighted speedup

- harmonic speedup = 
$$\frac{N}{\sum_{i} \frac{IPC_{i}^{alone}}{IPC_{i}^{shared}}}$$

- Provides a good balance between fairness and system performance

- ATLAS [2]: prioritizes threads that have attained the least service

A = A = A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A

- ATLAS [2]: prioritizes threads that have attained the least service
- PAR-BS [5]: processes DRAM requests in batches, and uses the SJF principle within a batch

## Related Work

- ATLAS [2]: prioritizes threads that have attained the least service
- PAR-BS [5]: processes DRAM requests in batches, and uses the SJF principle within a batch
- MORSE [4]: extends lpek et.al's learning technique [1] to target arbitrary figures of merit.

## **Related Work**

- ATLAS [2]: prioritizes threads that have attained the least service
- PAR-BS [5]: processes DRAM requests in batches, and uses the SJF principle within a batch
- MORSE [4]: extends lpek et.al's learning technique [1] to target arbitrary figures of merit.
- MISE [6]: estimates slowdown of each application and accordingly redistributes bandwidth

## **Related Work**

- ATLAS [2]: prioritizes threads that have attained the least service
- PAR-BS [5]: processes DRAM requests in batches, and uses the SJF principle within a batch
- MORSE [4]: extends lpek et.al's learning technique [1] to target arbitrary figures of merit.
- MISE [6]: estimates slowdown of each application and accordingly redistributes bandwidth

### Thread Cluster Memory Scheduling (TCMS) [3]

- divides threads into two clusters

## Related Work

- ATLAS [2]: prioritizes threads that have attained the least service
- PAR-BS [5]: processes DRAM requests in batches, and uses the SJF principle within a batch
- MORSE [4]: extends lpek et.al's learning technique [1] to target arbitrary figures of merit.
- MISE [6]: estimates slowdown of each application and accordingly redistributes bandwidth

### Thread Cluster Memory Scheduling (TCMS) [3]

- divides threads into two clusters
- latency-sensitive cluster > bandwidth-sensitive cluster

## Related Work

- ATLAS [2]: prioritizes threads that have attained the least service
- PAR-BS [5]: processes DRAM requests in batches, and uses the SJF principle within a batch
- MORSE [4]: extends lpek et.al's learning technique [1] to target arbitrary figures of merit.
- MISE [6]: estimates slowdown of each application and accordingly redistributes bandwidth

### Thread Cluster Memory Scheduling (TCMS) [3]

- divides threads into two clusters
- latency-sensitive cluster > bandwidth-sensitive cluster
- periodically shuffles priority in the bandwidth cluster

A simple model for dynamic decision making in unknown environments.

A simple model for dynamic decision making in unknown environments.

Structure of FALA (Finite Action Learning Automaton) Formally, a FALA can be described by the quadruple  $(A, B, \tau, p(k))$ :

A simple model for dynamic decision making in unknown environments.

Structure of FALA (Finite Action Learning Automaton)

Formally, a FALA can be described by the quadruple  $(A, B, \tau, p(k))$  :

•  $A = \{\alpha_1, \alpha_2, ..., \alpha_r\}$  : finite set of actions.

A simple model for dynamic decision making in unknown environments.

### Structure of FALA (Finite Action Learning Automaton)

Formally, a FALA can be described by the quadruple  $(A, B, \tau, p(k))$  :

- $A = \{\alpha_1, \alpha_2, ..., \alpha_r\}$  : finite set of actions.
- B : set of all possible reinforcements

A simple model for dynamic decision making in unknown environments.

### Structure of FALA (Finite Action Learning Automaton)

Formally, a FALA can be described by the quadruple  $(A, B, \tau, p(k))$  :

- $A = \{\alpha_1, \alpha_2, ..., \alpha_r\}$  : finite set of actions.
- B : set of all possible reinforcements
- au : learning algorithm to update  $\mathbf{p}(k)$

A simple model for dynamic decision making in unknown environments.

### Structure of FALA (Finite Action Learning Automaton)

Formally, a FALA can be described by the quadruple  $(A, B, \tau, p(k))$  :

- $A = \{\alpha_1, \alpha_2, ..., \alpha_r\}$  : finite set of actions.
- B : set of all possible reinforcements
- au : learning algorithm to update  $\mathbf{p}(k)$
- $\mathbf{p}(k) = [p_1(k), p_2(k), ..., p_r(k)]^T$ : action probability vect at instant k

A simple model for dynamic decision making in unknown environments.

#### Structure of FALA (Finite Action Learning Automaton)

Formally, a FALA can be described by the quadruple  $(A, B, \tau, p(k))$  :

- $A = \{\alpha_1, \alpha_2, ..., \alpha_r\}$  : finite set of actions.
- B : set of all possible reinforcements
- au : learning algorithm to update  $\mathbf{p}(k)$
- $\mathbf{p}(k) = [p_1(k), p_2(k), ..., p_r(k)]^T$ : action probability vect at instant k

Higher the probability value for a thread, higher is its priority for DRAM scheduling.

< 回 ト < 三 ト < 三 ト



1. Choose action (schedule a memory request) based on action probability vector.



- 1. Choose action (schedule a memory request) based on action probability vector.
- 2. Get reinforcement (harmonic speedup) from the system.



- 1. Choose action (schedule a memory request) based on action probability vector.
- 2. Get reinforcement (harmonic speedup) from the system.
- Update the action probabilities (thread priorities) using equation 2.



- 1. Choose action (schedule a memory request) based on action probability vector.
- 2. Get reinforcement (harmonic speedup) from the system.
- Update the action probabilities (thread priorities) using equation 2.
  - This cycle repeats forever

# The Learning Algorithm $\boldsymbol{\tau}$

Linear Reward-Inaction  $(L_{R-I})$  [7] is one learning algorithm:

$$egin{aligned} p_i &= p_i + \lambda \cdot eta \cdot (1 - p_i) \ p_j &= p_j - \lambda \cdot eta \cdot p_j, \quad orall j 
eq i \end{aligned}$$

The above 2 equations can be combined using vector notation:

$$\mathbf{p}(k+1) = \mathbf{p}(k) + \lambda\beta(k)(\mathbf{e_i} - \mathbf{p}(k))$$

# The Learning Algorithm $\boldsymbol{\tau}$

Linear Reward-Inaction  $(L_{R-I})$  [7] is one learning algorithm:

$$egin{aligned} p_i &= p_i + \lambda \cdot eta \cdot (1 - p_i) \ p_j &= p_j - \lambda \cdot eta \cdot p_j, \quad orall j 
eq i \end{aligned}$$

The above 2 equations can be combined using vector notation:

$$\mathbf{p}(k+1) = \mathbf{p}(k) + \lambda \beta(k) (\mathbf{e_i} - \mathbf{p}(k))$$

Equation for a team of N FALA

$$\mathbf{p}_{i}(k+1) = \mathbf{p}_{i}(k) + \lambda\beta(k) \left[\mathbf{e}_{\alpha_{i}(k)} - \mathbf{p}_{i}(k)\right], 1 \le i \le N$$
(2)

Aditya Kajwe and Madhu Mutyam (IITM) Improving Fairness in Memory Scheduling

Image: A match a ma

# The Learning Algorithm $\boldsymbol{\tau}$

Linear Reward-Inaction  $(L_{R-I})$  [7] is one learning algorithm:

$$egin{aligned} p_i &= p_i + \lambda \cdot eta \cdot (1 - p_i) \ p_j &= p_j - \lambda \cdot eta \cdot p_j, \quad orall j 
eq i \end{aligned}$$

The above 2 equations can be combined using vector notation:

$$\mathbf{p}(k+1) = \mathbf{p}(k) + \lambda\beta(k)(\mathbf{e}_{\mathbf{i}} - \mathbf{p}(k))$$
(1)

Equation for a team of N FALA

$$\mathbf{p}_{i}(k+1) = \mathbf{p}_{i}(k) + \lambda\beta(k) \left[\mathbf{e}_{\alpha_{i}(k)} - \mathbf{p}_{i}(k)\right], 1 \le i \le N$$
(2)

The automata implicitly cooperate to perform a stochastic search over the space of rewards [7] : coordination among multiple memory controllers.

A B A B A
 A
 B
 A
 A
 B
 A
 A
 B
 A
 A
 B
 A
 A
 B
 A
 A
 B
 A
 A
 B
 A
 A
 B
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A
 A

# Scheduling

Algorithm 1 Request prioritization in each memory controller

- Sampled action first: Select a request according to the action probability vector.
- 2: Row hit first: Select a request which hits the row-buffer.
- 3: Oldest first: Select the oldest request.

#### Algorithm 2 Sampling an action

```
1: cum\_prob[0] = p[0]

2: for count \leftarrow 1, (numThreads - 1) do

3: if rnd < cum\_prob[count - 1] then

4: break

5: else

6: cum\_prob[count] = cum\_prob[count - 1] + p[count]

7: end if

8: end for

9: action \leftarrow count - 1
```

(日) (周) (三) (三)

- Storage cost per controller: 3.3 Kbits (TCMS = 2.6 Kbits)

- Storage cost per controller: 3.3 Kbits (TCMS = 2.6 Kbits)
- Additional logic is required for calculating the reward and updating  $\mathbf{p}(k)$

- Storage cost per controller: 3.3 Kbits (TCMS = 2.6 Kbits)
- Additional logic is required for calculating the reward and updating  $\mathbf{p}(k)$
- Calculating HS on-the-fly: Requires instantaneous  $IPC_i^{alone}$ . We use overall  $IPC_i^{alone}$ , obtained by running a benchmark alone on the same baseline system, to get a rough estimate of HS.

- Storage cost per controller: 3.3 Kbits (TCMS = 2.6 Kbits)
- Additional logic is required for calculating the reward and updating  $\mathbf{p}(k)$
- Calculating HS on-the-fly: Requires instantaneous  $IPC_i^{alone}$ . We use overall  $IPC_i^{alone}$ , obtained by running a benchmark alone on the same baseline system, to get a rough estimate of HS.
- Updating **p**(*k*) is not on critical path. Can be performed in many tens of CPU cycles.

- Storage cost per controller: 3.3 Kbits (TCMS = 2.6 Kbits)
- Additional logic is required for calculating the reward and updating  $\mathbf{p}(k)$
- Calculating HS on-the-fly: Requires instantaneous  $IPC_i^{alone}$ . We use overall  $IPC_i^{alone}$ , obtained by running a benchmark alone on the same baseline system, to get a rough estimate of HS.
- Updating **p**(*k*) is not on critical path. Can be performed in many tens of CPU cycles.
- As an approximation, we consider the latency for determining the reward for a scheduling decision to be 90 cycles.

くほと くほと くほと

- Storage cost per controller: 3.3 Kbits (TCMS = 2.6 Kbits)
- Additional logic is required for calculating the reward and updating  $\mathbf{p}(k)$
- Calculating HS on-the-fly: Requires instantaneous  $IPC_i^{alone}$ . We use overall  $IPC_i^{alone}$ , obtained by running a benchmark alone on the same baseline system, to get a rough estimate of HS.
- Updating **p**(*k*) is not on critical path. Can be performed in many tens of CPU cycles.
- As an approximation, we consider the latency for determining the reward for a scheduling decision to be 90 cycles.

くほと くほと くほと

## **Experimental Setup**

- Modified version gem5 simulator

Aditya Kajwe and Madhu Mutyam (IITM) Improving Fairness in Memory Scheduling

3

→ 3 → 4 3

- Modified version gem5 simulator
- 16 CPU cores and 4 memory controllers

- Modified version gem5 simulator
- 16 CPU cores and 4 memory controllers
- PARSEC: Eight multi-threaded benchmarks with *simmedium* input set.

- Modified version gem5 simulator
- 16 CPU cores and 4 memory controllers
- PARSEC: Eight multi-threaded benchmarks with *simmedium* input set.
- SPEC CPU2006: Eight multiprogrammed workloads of varying memory intensity run for 500mn instructions

- Modified version gem5 simulator
- 16 CPU cores and 4 memory controllers
- PARSEC: Eight multi-threaded benchmarks with *simmedium* input set.
- SPEC CPU2006: Eight multiprogrammed workloads of varying memory intensity run for 500mn instructions

Experiments

# Results





PARSEC

SPEC CPU2006

Aditya Kajwe and Madhu Mutyam (IITM) Improving Fairness in Memory Scheduling

æ

Experiments

# Scalability



Aditya Kajwe and Madhu Mutyam (IITM) Improving Fairness in Memory Scheduling

June 14, 2014

3

イロト イヨト イヨト イヨト

12 / 15

- Improve the reward mechanism

- Improve the reward mechanism
- Evaluate on a wider variety of workloads (SPLASH and NAS benchmarks)

- Improve the reward mechanism
- Evaluate on a wider variety of workloads (SPLASH and NAS benchmarks)
- Compare against more recent scheduling algorithms (MISE)

- Improve the reward mechanism
- Evaluate on a wider variety of workloads (SPLASH and NAS benchmarks)
- Compare against more recent scheduling algorithms (MISE)
- A more accurate hardware feasibility analysis

- Improve the reward mechanism
- Evaluate on a wider variety of workloads (SPLASH and NAS benchmarks)
- Compare against more recent scheduling algorithms (MISE)
- A more accurate hardware feasibility analysis
- Evaluate on a synthetic workload where the outcome should be predictable.

- Improve the reward mechanism
- Evaluate on a wider variety of workloads (SPLASH and NAS benchmarks)
- Compare against more recent scheduling algorithms (MISE)
- A more accurate hardware feasibility analysis
- Evaluate on a synthetic workload where the outcome should be predictable.

- A learning technique is exploited to give improvement in fairness without much additional hardware cost.

► < Ξ ►</p>

- A learning technique is exploited to give improvement in fairness without much additional hardware cost.
- Scalable and works on multiprogrammed as well as parallel workloads

- A learning technique is exploited to give improvement in fairness without much additional hardware cost.
- Scalable and works on multiprogrammed as well as parallel workloads

### Questions ?

・ロト ・四ト ・ヨト ・ヨト

References

### E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana.

Self-optimizing memory controllers: A reinforcement learning approach.

In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pages 39-50. Washington, DC, USA, 2008. IEEE Computer Society.



-

#### Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter.

Atlas: A scalable and high-performance scheduling algorithm for multiple memory controllers. In M. T. Jacob, C. R. Das, and P. Bose, editors, HPCA, pages 1-12, IEEE Computer Society, 2010.

#### Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter.

Thread cluster memory scheduling: Exploiting differences in memory access behavior. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 65-76, Washington, DC, USA, 2010, IEEE Computer Society,

#### J. Mukundan and J. Martinez.

#### Morse: Multi-objective reconfigurable self-optimizing memory scheduler.

In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pages 1-12, Feb

#### O Mutlu and T Moscibroda

Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pages 63-74. Washington, DC, USA, 2008, IEEE Computer Society,



### L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu.

Mise: Providing performance predictability and improving fairness in shared main memory systems. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA). HPCA '13, pages 639-650, Washington, DC, USA, 2013. IEEE Computer Society.



Networks of Learning Automata. Springer, 2004.

(日) (同) (三) (三)