A.2. Multi-core Computing

The goal of this module is to introduce you to multi-core computing (i.e., running a program on multiple cores within the same computer).

Go through the tabs below in sequence…

Learning Objectives

• Understand the need for and the basics of multi-core computers
• Understand and apply the concepts of parallel speedup and efficiency
• Understand and quantify the relationship between idle time, speedup, and efficiency

Basic Concept

A multi-core processor provides multiple processing units, or cores, that are capable of executing computer code independently of each other. Multi-core processors have become ubiquitous. This is because starting in the early 2000’s it became increasingly difficult, and eventually impossible, to increase the clock rate of processors (due to well-documented power/heat issues). As a solution to this problem, microprocessor manufacturers started producing multi-core processors. For a program to exploit the compute power of a multi-core processor, it must use multi-threading (see operating systems and concurrent programming courses/textbooks). Although there are a lot of fascinating (and often difficult) aspects of multi-threading, conceptually it just means that a program comprises a set of tasks, some of which can run at the same time on different cores. This is called parallelism and we call this kind of programs parallel programs.

Each task in a parallel program performs some computation on some input data, which can be in RAM or on disk, and which produces some output data. For instance, we could have a 5-task program where each task renders a different frame of a movie. Or we could have a program in which tasks do different things altogether. For instance, a 2-task program could have one task apply some analysis to a dataset and another task render a live visualization of that dataset.

As mentioned in the Single Core Computing module, we do not consider time sharing. That is, we will only consider executions in which at most one task runs on a core at a given time. Operating systems do allow time-sharing (as explained in the Time Sharing tab of the Single Core Computing module). But time sharing incurs performance penalties. The typical approach when aiming for high performance is to avoid time sharing altogether. Therefore, in all that follows, a task that begins executing on a core executes uninterrupted and by itself on that same core until completion.

Speedup and Efficiency

A common motivation for running the tasks of a program on multiple cores is speed. For example, if you have tasks that a single core can complete in one hour, it will take four hours to complete four tasks. If you have two cores in a computer, now you can complete the same four tasks in less time, ideally in two hours. With parallelism we can decrease program execution time.

Unfortunately, most real-world programs do not have ideal parallelism behavior. In other words, they do not run $p$ times faster when executed on $p$ cores. Instead, they execute less than $p$ times faster. This may seem surprising, but comes about due to many reasons. For instance, when two tasks execute concurrently on two different cores, they still compete for the memory hierarchy, e.g., the L3 cache and the memory bus. We refer you to computer architecture courses/textbooks for more details. In this module, we assume that two tasks running on two different cores do not compete for the memory hierarchy. But even so, there are other reasons why a program cannot achieve ideal parallelism. Before we get to these reasons, let us first define two crucial metrics: Parallel Speedup and Parallel Efficiency.

Parallel Speedup

Parallel speedup, or just speedup, is a metric used to quantify the reduction in execution time of a parallel program due to the use of multiple cores. It is calculated by dividing the execution time of the program when executed on a single core by the execution time of this same program when executed on multiple cores. Let $p$ be the number of cores used to execute a program. The speedup on $p$ cores is:

\begin{align} \text{Speedup}(p) & = \frac{\text{Execution Time with 1 core}}{\text{Execution Time with p cores}}\; \end{align}

For instance, if a program runs in 3 hours on 1 core but runs in 2 hours on 2 cores, then its speedup is:

\begin{align} \text{Speedup}(2) & = \frac{3}{2} = 1.5\; \end{align}

In this example, we would be somewhat “unhappy” because although we have 2 cores, we only go 1.5 times faster. We would likely be hoping to go twice as fast. Let’s quantify this “disappointment” formally using another metric!

Parallel Efficiency

Parallel efficiency, or just efficiency, is a metric that captures how much useful work the cores can do for a program, or how much “bang” do you get for your “buck”. The “bang” is the speedup, and the “buck” is the number of cores.

More formally, the efficiency of an execution on $p$ cores is:

\begin{align} \text{Efficiency}(p) & = \frac{\text{Speedup}(p)}{p}\ \end{align}

If the speedup on 2 cores is 1.5, then the efficiency on 2 cores is:

\begin{align} \text{Efficiency}(2) & = \frac{1.5}{2} = 0.75 = \text{75%}\ \end{align}

Ideally, the efficiency would be 100% (which corresponds to going $p$ times faster with $p$ cores). In the above example, it is only 75%. This means that we are “wasting” some of the available compute capacity of our computer during the program’s execution. We have 2 cores, but our performance is as if we had only 1.5 cores. In other terms, we are wasting half the compute power of a core.

Practice Questions

[A.2.p1.1] Consider a parallel program that runs in 1 hour on a single core of a computer. The program’s execution on 6 cores has 80% parallel efficiency. What is the program’s execution time when running on 6 cores?

Let $S$ be the speedup on 6 cores for this program. Since the efficiency is equal to $S/6$, we have $S/6 = 0.8$, which gives us $S = 4.8$. Therefore, the program runs in 60/4.8 = 12.5 minutes.

[A.2.p1.2] A parallel program has a speedup of 1.6 when running on 2 cores, and runs 10 minutes faster when running on 3 cores. Give a formula for $T(1)$ (the execution time on one core in minutes) as a function of $T(3)$ (the execution time on three cores in minutes).

Because the speedup on 2 cores is 1.6, we have: $T(2) = T(1) / 1.6$

And the 10-minute time reduction gives us: $T(3) = T(2) - 10$

Therefore,

$T(3) = T(1) / 1.6 - 10$

which we can rewrite as:

$T(1) = 1.6 \times (T(3) + 10)$

At this point, you may be wondering why, in practice, parallel efficiency can be less than 100%. One reason is idle time. This is the time during which one or more cores are not able to work while other cores are working.

Consider a parallel program that consists of $n$ tasks, each of them running in the same amount of time on a core. We run this program on a computer with $p$ cores. If $n$ is not divisible by $p$, then at least one core will be idle during program execution. For example, if we have 8 tasks, that each run for 1 hour; and 5 cores, all cores will be busy running the first 5 tasks in parallel. But once this phase of execution is finished, we have 3 tasks left and 5 available cores. So 2 cores will have nothing to do for 1 hour. In this situation, we say that the load is not well-balanced across cores. Some cores will run two tasks, while others will run only one task.

There is a direct relationship between idle time and parallel efficiency, assuming idle time is the only cause of loss in parallel efficiency. The parallel efficiency is the sum of the core non-idle times divided by the product of the number of cores by the overall execution time.

The above statement may sound complicated, but it is very intuitive on an example. Consider a 2-core computer that executes a multi-task program in 35 minutes. One core computes for the full 35 minutes, while the other core computes for 20 minutes and then sits idle for 15 minutes.
This execution is depicted in the figure below:

Figure 1: Example 35-minute execution on a 2-core computer. The white area is the core idle time, the yellow area is the core compute time.

What the above statement says is that the parallel efficiency is the yellow area divided by the area of the whole rectangle. The white area is the number of idle core minutes in the execution (in this case $1 \times 15$ minutes). The more white in the figure, the lower the parallel efficiency. In this example, the parallel efficiency is $(1 \times 35 + 1 \times 20) / (2 \times 35)$ = 78.5%. You can note that this is exactly the speedup (55/35) divided by the number of cores (2).

So that you can gain hands-on experience, use the simulation app below on your own and to answer the practice questions hereafter.

This app allows you to pick a number of cores and a number of tasks to run on these cores. Try first with a single core running 5 tasks (you can vary the per/task amount of work in Gflop, but this value does not impact the overall execution pattern). The “Host Utilization” graph displays the execution as in Figure 1 above. Now try running a number of tasks and cores where the number of tasks does not evenly divide the number of cores. Looking at the host utilization graph again, now you will be able to see idle time for some of the cores (in light pink). Whenever we can see idle time on the graph, parallel efficiency is below 100%.

(Open simulator here)

eduWRENCH Pedagogic Module Simulator

Practice Questions

[A.2.p1.3] You have a 4-core computer where each core computes at speed 1000 Gflop/sec. You are told that a 10-task parallel program has 30 idle core seconds in total when executed on that computer. All tasks have the same work. What is the task work in Gflop? (You can double-check your answer with the simulation app above.)

Since we have 10 tasks and 4 cores, in the last phase of execution 2 cores are idle while 2 cores compute. Let $w$ be the work of a task. The duration of this last phase is $w / 1000$ seconds (i.e., work divided by compute speed). So the total idle core seconds is $2 \times w / 1000$ (since 2 cores are idle in the last phase). We know this number to be 30 seconds, therefore we simply need to solve:

$\frac{2\times w}{1000} = 30$

which gives us $w =$ 15000 Gflop/sec.

We can use the simulation app to double-check our result. We just need to enter 1500 (instead of 15000) as the task work amount in Gflop since in the simulation the core computes 10 times slower than in this question. The simulation clearly shows that the number of idle seconds is $15 \times 2 = 30$.

[A.2.p1.4] You are told that a 10-task program runs in 1 hour with on a 3-core machine. All tasks execute in the same amount of time on one core. What is the execution time of one task? (you can double-check your answer with the simulation app above)

The execution proceeds in 4 phases. If each of the first three phases 3 tasks are executed in parallel. In the last phase a single task executes. Therefore, each phase takes 60/4 = 15 minutes, which is the execution time of a task.

You can double-check this in simulation by setting the task work to $15\times 60 \ times 100 = 90000$, so that each task runs in 15 minutes on a core. The simulation clearly shows a 3600-second execution time, i.e., 1 hour.

[A.2.p1.5] Assume you have 20 tasks to execute on a multi-core computer, where each task runs in 1 second on a core. By what factor is the overall execution time reduced when going from 4 to 8 cores? (You can double-check your answer in simulation).

The total execution time when using 4 cores will be 5 seconds, as each core executes 5 tasks. When increasing from 4 cores to 8 cores, now the total execution time is 3 seconds. This is because the best we can do is have 4 of the cores run 2 tasks and the other 4 run 3 tasks. The overall execution time is reduced by a factor 5/3 = 1.66.

This is seen easily in simulation (setting the task work to 100 Flop).

[A.2.p1.6] Assume you now have 3 tasks to compute, still each taking 1 second on a core. What is the parallel efficiency on a 4-core computer?

When using only a single core, the 3 tasks will take 3 seconds to complete. When increasing the number of cores to 4, the same tasks can now be done in 1 second. Since $p$ the number of cores is greater than $n$ the number of tasks, we know that it will not be 100% efficiency. More precisely, the parallel speedup is 3, and thus the parallel efficiency is 3/4 = 75%.

[A.2.p1.7] You are upgrading your (pre-historic?) single-core computer and you have two new multi-core computers to choose from, one with 5 cores and one with 10 cores. Your only concern is to maximize parallel efficiency. All of the cores are identical. You have 15 tasks to run, each taking 1 second to complete on a core. Which multi-core computer will provide the higher parallel efficiency?

When using only a single core, the 15 tasks will take 15 seconds to complete.

When increasing the number of cores to 5, the program runs in 3 seconds, and there is no idle time (since 5 divides 15). Therefore, parallel efficiency is 100%.

When increasing the number of cores to 10, the program runs in 2 seconds. In this scenario, for the last second, 5 out of the 10 cores are idle. Therefore, efficiency is less than 100% (it is 75%).

We conclude that we should go with the 5-core computer (even though the 10-core computer completes the program faster, our concern here is parallel efficiency).

In all the above, we’ve only considered “identical” tasks: all tasks run in the same amount of time. Therefore, the main question was how the number of cores divides the number of tasks (if it divides it perfectly then we can have 100% efficiency). But in many real-world programs tasks are not identical. Some can take longer than the other. This is another possible source of load imbalance, and thus of idle time, and thus of loss of parallel efficiency.

Consider a 5-task program that runs on a 2-core computer. The tasks take 10s, 16s, 4s, 5s, and 11s, respectively. How fast can we run this program? For instance, we could run the first 3 tasks (10s, 16s, and 4s) on one core, and the last 2 tasks (5s and 11s) tasks on the other core. The first core would thus work for 30s while the second core would work for only 16s. The program thus runs in 30 seconds, and the parallel efficiency is $46 / (30 \times 2)$ = 76%.

Can we do better? If you think about it, the problem is to split the set of numbers {10, 16, 4, 5, 11} into two parts, so that the sum of the numbers in each part are as close to each other as possible. In this case, because we only have 5 numbers, we can look at all options. It turns out that the best option is: {10, 11} and {16, 4, 5}. That is, we run the first and last tasks on one core, and all the other tasks on another core. In this case, one core computes for 21s and the other for 25s. The parallel efficiency is now 92%.

What if we now have 3 cores? Then we have to split our set of numbers into 3 parts that are as “equal” as possible. The best we can do is: {10, 5}, {16}, and {11, 4}. In this case, the program runs in 16 seconds and the parallel efficiency on 3 cores is almost 96%. It is not useful to use more cores, since no matter what the program cannot run faster than 16 seconds (since we have one task that takes 16 seconds).

It turns out that splitting sets of numbers into parts with sums as close to each other as possible is a difficult problem. We are able to do it for the small examples like above, but as soon as the number of tasks gets large, it is no longer humanly possible. And in fact, it is not computer-ly possible either (at least, not quickly). More formally, determining the best split is an NP-complete problem (see algorithm/theory textbooks/courses). We will encounter this kind of problem (i.e., how to allocate tasks to compute resources) again in upcoming modules.

Practice Questions

[A.2.p1.8] A 5-task program runs optimally (i.e., it’s the best it can possibly do) in 10 seconds on a 2-core computer. Tasks 1 to 4 run in 2s, 4s, 3s, and 5s, respectively. Is it possible that Task 5 runs in 7s?

Nope. If Task 5 runs in 7 seconds, then we’d have to split the set {2, 3, 4, 5, 7} into two parts that each sum up to 10. One of these parts must contain number 7. So we also put number 3 into that part since then it exactly sums to 10. We are left with numbers 2, 4, and 5, which sum up to 11.

[A.2.p1.9] Consider a 6-task program. The execution times of 5 of the tasks are: 6, 8, 7, 12, 9. What should the 6th task’s execution time so that this program can run with 100% parallel efficiency on 3 cores?

If we run the 6s and the 9s tasks on one core, and the 8s and the 7s tasks on another core, both these cores finish computing in 15s. On the third core we run the 12s task. If the 6th task takes 3s, then all 3 cores finish computing in 15s. So the answer is 3 seconds.

Questions

[A.2.q1.1] You are told that a parallel program runs in 1 hour on a 3-core machine, and that the parallel efficiency is 90%. How long, in minutes, would the program take if executed using a single core?

[A.2.q1.2] You are told that a program runs in 10 hours when using the 4 cores of some computer with parallel efficiency 80%. Using 8 cores, the program runs in 6 hours. What is the parallel efficiency of this 8-core execution?

[A.2.q1.3] What parallel speedup will you observe when running 10 identical (in terms of execution time) tasks on 3 cores?

[A.2.q1.4] The parallel efficiency of a parallel program with 12 identical tasks on a multi-core computer is more than 82% but less than 90%. You know this computer has no more than 8 cores. How many cores does it have?

[A.2.q1.5] You have a 20-task program where each task’s work is 10 Gflop. You currently have a 4-core computer where each core computes at speed 50 Gflop/sec. For the same amount of money you can either (1) increase the speeds of all 4 cores by 20%; or (2) add a 5th 50 Gflop/sec core. What should you do if you want to run your program as quickly as possible?

[A.2.q1.6] Consider a 6-task program to be executed on a 3-core computer. The task execution times on one core are: 2s, 4s, 8s, 3s, 9s, and 3s. What is the best possible (i.e., the optimal) program execution time on these 3 cores? Could we do better with 4 cores?

Learning Objectives

• Understand and be able to quantify the impact of RAM constraints on parallel performance
• Understand and be able to quantify the impact of I/O on parallel performance

RAM Constraints and Parallelism

As seen in the Memory tab of the Single Core Computing module, a task may have a sizable amount of data that needs to be loaded and/or generated into RAM so that it can execute. Recall from that module that we do not allow a program to use more memory than available in physical RAM. Doing so is possible and handled by the Operating Systems (by shuffling data back and forth between RAM and disk) but comes with unacceptable performance penalties. So, here again, we never exceed the physical memory capacity of a host. If insufficient RAM is available for a task, this task must wait for currently running tasks to complete and free up enough ram. This can cause cores to remain idle. The worst possible case would be running a single task that uses the entire RAM, thus leaving all remaining cores idle while it executes. Because RAM constraints can cause idle time, they can also cause loss of parallel efficiency.

Simulating RAM Constraints

So that you can gain hands-on experience, use the simulation app below. This app is similar to that in the previous tab, but now includes a field for specifying the “Ram Needed For Each Task”. So now, we can simulate the fact that tasks require RAM space to run. The host we are simulating has 32 GB of RAM available.

First try using 4 cores for 8 tasks, where each task uses 8 GB of RAM. As you will see, there is no idle time. The number of tasks we can run at a time is 4, given the number of cores and the amount of RAM available.

Now try again, but this time set the tasks’ RAM requirement to 16 GB. There will now be idle time, as only 2 cores can be utilized simultaneously due to RAM constraints.

(Open simulator here)

eduWRENCH Pedagogic Module Simulator

Practice Questions

[A.2.p2.1] You need to execute 5 tasks that each run in 1 second on one core. Your current single-core processor thus can run these tasks in 5 seconds. The processor is then upgraded to have 5 cores, each identical in processing power to the original single core. If the machine has 32 GB of RAM and each task requires 8 GB of RAM to execute, what is the execution time on the new 5-core processor? (You can double-check your answer in simulation.) What is the parallel efficiency?

On the single-core machine the RAM constraint was not a problem as tasks were executed sequentially (there was never a need for more than 8 GB of RAM). With 5 cores, running all tasks concurrently would require 5x8 = 40 GB of RAM, but only 32 GB is available. Therefore, we can only run 4 tasks at the same time. So the last task runs by itself, with 4 cores being idle. The overall execution time is 2 seconds. This is seen easily in simulation.

Therefore:

$\text{Speedup}(5) = \frac{5}{2} = 2.5$

and

$\text{Efficiency}(5) = \frac{2.5}{5} = \text{50%}$

We would have been better off with a 4-core computer (since, likely, it would cost less).

[A.2.p2.2] Assume you have a 2-core machine on which you need to run 6 tasks (in any order). Each task runs in 1 second on a core. However, the tasks have the following RAM requirements in GB: 6, 2, 4, 3, 1, 7. If your machine has a total of 8 GB of RAM, can you achieve 100% parallel efficiency?

The question really is: Can you always run two tasks at the same time so that the sum of their RAM requirements never exceeds 8 GB? The answer is “yes”:

• Run the 7 GB and the 1 GB task together
• Run the 6 GB and the 2 GB task together
• Run the 4 GB and the 3 GB task together

(the order of the three steps above does not matter).

I/O and Parallelism

Another common cause of idle time is I/O. While a task running on a core performs I/O, the core is (mostly) idle. We learned about this in the I/O tab of the Single Core Computing module. In a parallel program this can translate to loss of parallel efficiency.

Let’s consider a simple parallel program: 4 tasks that each read in 10 MB of input data and then performs 400Gflop of computation. The program’s tasks, showing input data files, is depicted below:

Figure 1: Example 4-task parallel program with I/IO.

For now, let’s consider an execution of this program on a 1-core computer with a core that computes at 400 Gflop/sec and a disk with read bandwidth 100 MB/sec (on which the input data files are located). What is the execution time? Based on what we learned about I/O, we should strive to overlap I/O and computation as much as possible. For instance, the execution could proceed as follows:

Figure 2: Execution on 1 core.

It takes 1 second to read an input file, and then a task computes for 4 seconds. Using overlap of I/O and computation, the execution time is thus 17 seconds (only the first file read is not overlapped with computation). This is a great utilization of a single core. But what can we gain by running on multiple cores?

Let’s say now that we have 4 cores. One option is for all 4 tasks to start at the same time, in which case they all read their input data at the same time from disk. They split the disk bandwidth evenly, and thus it takes 4 seconds for each task to read its input. Then each task computes for 4 seconds on its own core. So the program runs for 8 second on the 4 cores. This execution is depicted below:

Figure 3: Execution on 4 cores, with simultaneous I/O.

One may wonder whether it may be a better idea to stagger the task executions, so that only one file is read from disk at a time, and so that I/O is overlapped with computation. This alternative is depicted below:

Figure 4: Execution on 4 cores, with staggered I/O.

The execution time is still 8s, so, for this example, the two executions are equivalent.

Overall, we achieve a parallel speedup of 17/8 = 2.125 and a parallel efficiency of only about 53%. And this is in spite of having 4 identical tasks and 4 cores, which, without I/O, would be 100% efficient. Increasing the parallel efficiency would require, for instance, upgrading to a disk with higher read bandwidth.

Practice Questions

[A.2.p2.3] A parallel program consists of 2 tasks:

• Task #1 reads 20 MB of input, computes 500 Gflop, writes back 10 MB of output
• Task #2 reads 10 MB of input, computes for 200 Gflop, writes back 20 MB of output

We execute this program on a computer with cores that compute at 100 Gflop/sec and with a disk with 100 MB/sec read and write bandwidth.

What is the best parallel speedup that can be achieved when running on 2 cores?

The execution on 1 core is as follows:

and takes 11 seconds.

The execution on 2 cores is as follows:

and takes 8 seconds. So the speedup is only 11/8 = 1.375.

Note that there are other options for running this program. For instance, we could start with reading the input for Task #2. This is not a good idea, because it means that Task #1 (which is much longer than Task #2) will start, and thus finish, later. Here is the execution:

This execution takes 9s, that is, 1 more second!

You may be wondering what happens if one does not stagger the I/O, but instead starts reading input files of both tasks at once. In this case, due to disk bandwidth sharing, Task #2 starts at time 2 and Task #1 starts at time 3. So here also, the execution takes 9s. You can try to draw the execution timeline as an exercise.

Questions

[A.2.q2.1] We are using a computer with 32 GB of RAM. What is the parallel efficiency when running 2 tasks on 2 cores if they each require 16 GB of RAM? What if each task requires 20 GB of RAM?

[A.2.q2.2] You are given a 2-core computer with 15 GB of RAM. On this computer you need to execute 6 tasks. The tasks have different RAM requirements (in GB): 4, 5, 8, 10, 11, 14. Can you achieve 100% parallel efficiency?

[A.2.q2.3] A program consists of 3 tasks that each takes in 2 GB of input data and have 30,000 Gflop work. This program is executed on a 2-core computer with 1 Tflop/sec cores and equipped with a disk with 250 MB/sec read bandwidth. What is the parallel efficiency if the program can never overlap I/O and computation (but multiple I/O operations can happen at the same time)?

[A.2.q2.4] Same question as above but now the program always overlaps I/O and computation.

Learning Objectives

• Understand the concept of task dependencies
• Understand and quantify the impact of task dependencies on parallelism

Basic Concept

So far, we have only considered independent tasks in our parallel programs, i.e., tasks that can be executed in any order and concurrently. In other words, given a computer with as many cores as tasks and sufficient RAM capacity, all tasks can run at the same time. But in many real-world programs this is not the case. Instead, tasks exhibit dependencies. In other words, some tasks cannot execute before other tasks are done. This could be because the output of a task serves as input to another, or more generally because a specific ordering of some tasks is necessary for program correctness.

As an analogy, consider a chef cooking a meal. First, they need to select and procure the ingredients. Second, they need to cook these ingredients. Finally, the cooked ingredients must be plated. None of these tasks may be completed out of order. The “cook ingredients” task depends on the “procure ingredients” task, and the “plate meal” task depends on the “cook ingredients” task. A convenient way to represent such programs is a Directed Acyclic Graph (DAG), in which vertices are tasks and edges are dependencies. For the “cook a meal” program, the DAG representation is straightforward, and depicted in the figure below:

Figure 1: DAG for the "chef" example.

Here is a typical example of task dependencies in a parallel program. Consider a program that counts the number of car objects in a set of compressed street images. Each image needs to be uncompressed, pre-processed, (e.g., to remove noise), analyzed (to find and count cars). Once this has been done for each image, car count statistics need to be displayed. If we have 5 compressed pictures, the program’s DAG is:

Figure 2: DAG for the "car counting" example.

Note that each task above can involve both I/O and computation. For instance, an “uncompress” task must read in an image file from disk to uncompress it. Then, whether it writes back to disk the uncompressed image or keeps in RAM so that the “pre-process” task can do its job is up to the program’s implementation in software. Given that the DAG above does not show any output file for these tasks, the idea is to keep everything in RAM and/or I/O operations. Clearly keeping things in RAM can avoid costly I/O operation, but as we know RAM capacity is limited. So, based on what we learned in the previous tab, we could lose parallel efficiency due to RAM constraints.

For now, to keep things simple, let’s assume that tasks take zero RAM and that they perform no I/O. Let’s consider an example program that is used to analyze some dataset. It begins with a “start” task that does some pre-processing of the in-RAM dataset. Then, once the pre-processing is done, it needs to perform three things. Namely, it needs to produce some visualization, perform some analysis, and compute some statistics:

• The visualization consists of a sequence of two tasks: “viz” (computes what to visualize) and “plot” (generates a fancy 3-D plot)
• The analysis consists of a sequence of two tasks : “analyze” (performs data analysis) and “summarize” (generates summary analysis results)
• The statistics consists of a single task: “stats” (computes some statistics)

Once all the above is done, a “display” task displays all results. The “analyze” task has an amount of work that is user-defined. The more work, the more in-depth the analysis results.

The program’s DAG is shown below, with the work of each task (and just X for the analysis task):

Figure 3: DAG for the "data set analysis" example.

To gain hands-on experience with the task dependency concept, use the simulation app below to simulate the execution of the above program on a 3-core computer, where each core computes at speed 10 Gflop/sec. You can pick the amount of work for the “analyze” task. The execution strategy used for this execution is very simple: whenever a task can be executed (because all its parent tasks have been executed) and a core is (or becomes) idle, then execute that task on that core immediately. We call a task whose parents have all executed a ready task. The following practice questions are based on this simulation app.

(Open simulator here)

eduWRENCH Pedagogic Module Simulator

Practice Questions

[A.2.p3.1] Say we run the program with an “analyze” task that has 100 Gflop work. What is the parallel efficiency when running the program on the 3-core computer and when using a single analysis task? (feel free to use the simulation app to help you)

The sequential program’s execution on 1 core, T(1), is simply the sum of individual task execution times,

\begin{align} T(1) & = 5 + 20 + 10 + 10 + 10 + 40 + 1 = 96 \;\text{sec} \end{align}

The simulated execution time on our 3-core computer is:

\begin{align} T(3) & = 46 \;\text{sec} \end{align}

So the parallel efficiency is $E(3) = (96/46)/3 =$ 69.56%.

[A.2.p3.2] What is the number of idle core seconds when running the program when the “analyze” task has 300 Gflop work on our 3-core computer? You can double-check your answer in simulation.

This is a very similar question as the previous one. The sequential execution time is 126 seconds, and the execution time on 3 cores is still 46 seconds. Therefore, the number of core idle seconds is $46 \times 3 - 126 = 12$ seconds.

We can double check this answer by counting the number of idle seconds as shown in the Host Utilization graph of the simulation app.

[A.2.p3.3] For what amount of work of the “analyze” task is the parallel efficiency maximized? You could use the simulation app to “search” for the right answer, but that would be really tedious. Try using analysis and/or intuition first.

Let’s first do a purely analytical solution. Let $x$ be the work of the “analyze” task in Gflop. The sequential execution time is $x/10 + 86$ seconds.

The parallel execution time is a bit trickier.

The visualization path takes time $5 + 20 + 10 + 1 = 36$ seconds, which is shorter than the statistics path, which takes 46 seconds. The analysis path takes time $5 + x/10 + 10 + 1 = 16 + x/10$ seconds.

So, we have two cases: If $16 + x/10 \leq 46$, that is, if $x \leq 300$, the critical path is the analysis path, otherwise the critical path is the statistics path. So let’s examine both cases:

• $x \leq 300$: the parallel execution time is 46 seconds, and so the parallel efficiency is equal to $((x/10 + 86) / 46) / 3$. This is maximized for $x = 300$, and is then equal to 84.05%.

• $x \geq 300$: the parallel execution time is 16 + x/10, and so the parallel efficiency is equal to $((x/10 + 86) / (16 + x/10)) / 3$. This is a decreasing function on the [300, infinity] domain, and so on that domain it is maximized for $x = 300$.

The final answer is thus 300 Gflop.

The above is quite formal, but we could have given a purely common-sense answer. The parallel efficiency is maximized when all three paths take time as close as possible as the longest such path, so as have cores working as much as possible. This is the same load balancing idea that we have seen in the Parallelism tab for independent tasks! This is achieved when the analysis path and the statistics path are equal (nothing can be done about the visualization path), that is, when $x = 300$.

For $x = 300$ the efficiency is 84.05%, which is the best this program can ever achieve.

Levels, Width, Critical Path

In the previous section, and the practice questions, we touched upon some fundamental concepts without naming them explicitly. Let’s do so now.

A first concept is that of a DAG level. A task is on level $n$ of the DAG if the longest path from the entry task(s) to this task is of length $n$, where the path length is measured in number of vertices traversed before reaching this task. By this definition, an entry task is in level 0. Every child task of an entry task is in level 1, and so on. Formally, the level of a task is one plus the maximum of the levels of its parent tasks (this is a recursive definition).

For our example DAG in Figure 3 above, we can determine the level of each task:

start 0
viz 1
analyze 1
stats 1
plot 2
summarize 2
display 3

So we say that this DAG has four levels. Note that this does not mean that the DAG tasks must be executed level by level. For instance, we could execute task “plot” (level 2) before task “analyze” (level 1).

A second concept is that of a DAG width (or DAG parallelism): the maximum number of tasks in the workflow levels. For instance, for our example DAG, the parallelism is 3 because level 1 has 3 tasks (and all other levels have fewer tasks). This means that we cannot make any use of more than 3 cores when executing this graph, as a fourth core would never have anything to do.

A third concept is that of the critical path: the longest path in the dag from the entry task(s) to the exit task(s), where the path length is measured in task durations, including the entry and the exit task(s). No matter how many cores are used, the program cannot execute faster than the length of the critical path. For instance, consider our example DAG, assuming that the “analyze” task has work 250 Gflop. There are three paths from “start” to “display”. The length of the visualization path is 5+20+10+1 = 36 seconds. The length of the statistics path is 5+40+1=46 seconds. The length of the analysis path is 5+25+10+1=41 seconds. And so the critical path is {“start” -> “stats” -> “display”}, of length 46 seconds. No matter how many 10 Gflop/sec cores are used to execute this program, it can never run in less than 46 seconds!

Practice Questions

[A.2.p3.4] For the DAG below, give the number of levels, the width, and the length of the critical path in seconds (name and execution time are shown for each task).

• Number of levels: 4
• Width: 3 (level 3 has 3 tasks: G, E, and F)
• Length of the critical path: 30s (A 1s, D 20s, F 7s, and H 2s)

[A.2.p3.5] For the DAG below, would it be useful to use more than 3 cores? Can the execution time be ever shorter than 29 seconds? Could you modify one edge’s end point to increase the DAG width?

Here is the set of DAG levels:

0 A
1 B, C
2 D, E, F
3 G
4 H

It would never be useful to use more than 3 cores because the width of the DAG is 3 (level 2). The DAG’s critical path is {A->B->D->G->H}, which has length 28s. So yes, the execution (on 3 cores) could be lower than 29s.

Replacing the D->G edge by a D->H edge would make the DAG width 4 (i.e., level 2 would have 4 tasks in it).

Choosing which task to run next

In our example dataset analysis program, there was never a choice for deciding which task to run next. First, we have to run “start”. Then, we have three tasks that are ready, that is, whose parents have all executed. Since we have 3 cores, we run all three, each on one core. In other words, since we have 3 paths in the DAG and 3 cores, we just run each path on its own core.

In general however, we could have more ready tasks than idle cores, in which case we have to pick which ready tasks to run. This, turns out, can be a difficult problem known as “DAG scheduling”. We explore this advanced topic in later modules, but for now we can get a sense for it via our example.

Let’s say that we now must run the program on a 2-core computer. We have a choice after “start” completes: we have 3 ready tasks and only 2 cores. Say we run “analyze” and “stats”. If “analyze” completes before “stats”, then we have another choice: should we run “viz” or “summarize”? It turns out that some of these choices are better than others. In this small example the “bad” choices are not terrible, but for larger DAGs they could lead to a large performance loss.

There are some rules of thumb for selecting ready tasks. A good and popular one is: Whenever there is a choice pick the task that is on the critical path. After all it is critical. But this is not guaranteed to be always best. It just happens to work well for many DAGs.

Simulating Execution on a 2-core Computer

To see the impact of task selection decisions, the simulation app below allows you to simulate the execution of our dataset analysis program on 2 cores while prioritizing some execution paths. For instance, if you select “viz/analyze”, whenever there is a choice, we always pick a visualization or an analysis task over the “stats” task.

You can experiment yourself with different settings, and use the app to answer the practice questions thereafter.

(Open simulator here)

eduWRENCH Pedagogic Module Simulator

Practice Questions

[A.2.p3.6] Setting the “analyze” task’s work to 10 Gflop, does it matter which paths are prioritized when executing the program on 2 cores? If so, which ones should be prioritized? Can you venture an explanation?

Yes, it does matter! Not prioritizing the statistics path is a mistake. This is because the statistics path is the critical path. Not counting the “start” and “display” tasks, the visualization path runs in 30s, the analysis path in 11s, and the stats path in 40s. This is exactly the problem we looked at in the first tab: partition a set of numbers into two groups so that their sums are as close to each other as possible! The best choice for this grouping here is clearly {30, 11} and {40}. In other words, on one core we should run the visualization and the analysis path, and on the other we should run the statistics path.

So, if we prioritize both the visualization and analysis paths after task “start” completes, they will run on different cores, which is a bad choice (as the groupings will be {30} and {11, 40}). Conclusion: the “stats” path should be part of the two prioritized paths.

All this can be seen easily in the simulation app.

[A.2.p3.7] Say now we set the work of the “analyze” task to be 300 Gflop. What are the execution times with each of the three path prioritization options? Can you explain why the results are as they are?

All three prioritization schemes give a 76 second execution time. In other words, path prioritization does not matter. With a 300 Gflop work for the “analyze” task, the visualization path takes 30 seconds, and both the analysis and the statistics paths take 40 seconds. (Without counting the “start” and the “display” tasks). No matter what we do, running on two cores three tasks that take 30s, 40s, and 40s will take 70s.

If you really want to spell it out, we can just look at all possibilities. If both 40s paths start first, each on a core, then the 30s path starts after that, for 70s of execution. If the 30s path starts with a 40s path, each on a core, then the 2nd 40s path will start on the core that ran the 30s path, since it becomes idle first. This, again, is a 70s execution. So overall, the execution will always be 5 + 70 + 1 = 76s.

[A.2.p3.8] Is it possible that, for some amount of work of the “analyze” task, all three different prioritizing options lead to three different execution times (when executing the program on 2 cores)? Although you may have a rapid intuition of whether the answer is yes or no, deriving a convincing argument is not that easy…

This is perhaps not an easy question, as it requires to think about this abstractly (so as to avoid examining all possibilities). The answer is “no”. Let’s see why.

We can look at this question at a very abstract level: we have three “things” to run, let’s call them $A$, $B$, and $C$. (Each of them is one of our three paths, excluding the “start” and “display” tasks). Let $a$, $b$, and $c$ be their execution times. Say, without loss of generality, that $a \leq b \leq c$. Then, we can see what runs on each core for each option that prioritizes two of them:

prioritizing core #1 core #2
$A$ and $B$ $A$ then $C$ $B$
$A$ and $C$ $A$ then $B$ $C$
$B$ and $C$ $B$ then $A$ $C$

The two prioritized things start first. Then the third thing runs on the core that becomes idle first (i.e., the core that was running the shortest thing).

We note that in the table above, the 2nd and 3rd rows are identical. That is, the cores finish computing at the same time. The only thing that changes is the order in which things run on core #1 (“$A$ then $B$” or “$B$ then $A$”). Therefore, two of the prioritization options always produce the same outcome in terms of overall program execution time!

Questions

[A.2.q3.1] For the DAG below, where each task has an execution time in seconds on a core of some computer, give the number of levels, the width, and the length of the critical path in seconds.

[A.2.q3.2] For the DAG in the previous question, what would be the parallel efficiency on 3 cores?

[A.2.q3.3] We now execute this same DAG on 2 cores. Whenever there is a choice for picking a ready task for execution, we always pick the ready task with the largest work (this is a “I should do the most time-consuming chores first” approach). What is the execution time?

[A.2.q3.4] Still for that same DAG on 2 cores, we now pick the ready task with the smallest work first (this is a “I should do the easiest chores first” approach). What is the execution time? It is better than the previous approach?

[A.2.q3.5] For this new DAG below, executed on 2 cores, what are the execution times of the “pick the ready task with the largest work” and “pick the ready task with the smallest work” approaches? Which approach is better?

Learning Objectives

• Understand the concept of data-parallelism
• Understand and be able to apply Amdahl’s law
• Understand and be able to reason about the performance of data-parallel programs

Motivation

In all we have seen so far in this module, a parallel program consists of a predetermined set of tasks, each of them executing on a single core. Many real-world programs are structured in this way, and this is called task parallelism.

Let’s now consider one task, which performs some computation on a single core. Perhaps, one can rewrite the code of this task to use multiple cores to accelerate its computation. This is done by writing the task’s code so that it uses multiple threads (see concurrent programming textbooks/courses). In other terms, perhaps the task’s computation itself can be parallelized.

An Example

Consider a transformation of the pixels of an image that makes the image resemble an oil-painting. This can be done by updating each pixel’s color by some other color based on the color of neighboring pixels. The oil-painting transformation has a parameter called the radius, which is the radius of the brush stroke. The larger the radius, the more neighboring pixels are used to update the color or a pixel, and the more work is required. In fact, the amount of work is quadratic in the radius, meaning that it grows with the square of the radius. This is how “oil-painting filters” work in many open-source and commercial image processing programs.

Consider now a program that is a sequence of two tasks: An “oil” task applies an oil-painting filter to an image with a given radius $r$, followed by a “luminence” task that computes the luminence histogram for the image (i.e., the statistical distribution of the brightness of its pixels). We can draw the program’s DAG as follows:

Figure 1: Example image processing program.

If we were to run this program on a core that computes at speed 100 Gflop/sec, and using $r=3$ for the “oil” task, the program would take time:

\begin{align} \text{T} & = \frac{ 100 \times 3^{2} \;\text{Gflop}}{100\; \text{Gflop/sec}} + \frac{100\; \text{Gflop}}{100\; \text{Gflop/sec}}\\ & = 10\; \text{sec} \end{align}

Data-Parallelism

In the oil-painting transformation the same computation is used for each pixel of the image (with perhaps special cases for the pixels close to the borders of the image). You can think of the computation applied to each pixel as a “micro-task”. All these micro-tasks have the same work and do the same thing (i.e., they run the same code), but on different data (the neighboring pixels of different pixels). This is called data parallelism. It is a bit of a strange term because it is just like task parallelism, but with very fine granularity. Regardless, it should be straightforward to perform the transform on, say, 4 cores: just give each core a quarter of the pixels to process!

A simple general model is: if the total work of the “oil” task is $X$ and if we have $n$ cores, we could perform the work using $n$ tasks each with $X/n$ work. This assumes $X$ is divisible by $n$. This is likely not quite the case in practice, but a very good approximation if the number of pixels is much larger than the number of cores, which we will assume here.

Simulating Data-Parallelism

After exposing data-parallelism in our example program (i.e., by rewriting the code of the “oil” task), the program’s DAG is as follows:

Figure 2: Example image processing program with data-parallelism exposed.

The program can run faster using multiple cores! How fast? The simulation app below simulates the execution for particular values of the radius $r$ and a number of cores (using one “oil” task per core). You can use the simulation to explore data-parallelism on your own, but also to answer some of the practice questions below.

(Open simulator here)

eduWRENCH Pedagogic Module Simulator

Practice Questions

[A.2.p4.1] Analytically estimate the execution time of the oil-painting program with radius $r = 3$ when it runs on 6 cores. Then check your results with the simulation app.

The execution time on 6 cores is:

$T = \frac{100 \times 3^2 / 6}{100} + \frac{100}{100} = 2.50 \text{sec}$

[A.2.p4.2] Which execution has the best parallel efficiency: A) $r=2$ on 6 cores; or B) $r=3$ on 8 cores? Try to formulate an intuitive answer. Then check your intuition using analytics and/or the simulation?

Intuitively, when going from execution A to execution B the total work grows roughly by a factor 9/4 while the number of cores grows by a much smaller factor 8/6. So execution B should be more efficient.

The execution times for execution A on 1 and 6 cores are:

$T_A(1) = \frac{100 \times 2^2}{100} + \frac{100}{100} = 5 \text{sec}$

$T_A(6) = \frac{100 \times 2^2 / 6}{100} + \frac{100}{100} = 1.66 \text{sec}$

You can confirm the above numbers with the simulation. The parallel efficiency is $E_A = (10/2.5)/6$ = 52.08%.

Similarly for execution B on 1 and 8 cores: $T_A(1) = \frac{100 \times 3^2}{100} + \frac{100}{100} = 10 \text{sec}$

$T_A(6) = \frac{100 \times 3^2 / 8}{100} + \frac{100}{100} = 2.125 \text{sec}$

You can confirm the above numbers with the simulation. The parallel efficiency is $E_B = (10/2.125)/8$ = 58.82%. Our intuition is confirmed! Execution B has better efficiency!

[A.2.p4.3] A program consists of two tasks that run in sequence. The first runs in 10s and the second in 20 seconds, on one core of a 4-core computer. A developer has an idea to expose data-parallelism in the second task and rewrites it so that it is replaced by 4 independent tasks each with 1/4-th of the original task’s work. What is the parallel efficiency on 4 cores?

When running on 4 cores, the program runs in 10 + 20/4 = 15 seconds. So the speedup is 30/15 = 2. So, the parallel efficiency is 50%.

Amdahl’s Law

The simulation and practice questions above highlight a simple phenomenon known as Amdahl’s law. This law says that the overall parallel speedup that a program that has a sequential and a parallel part is limited by the amount of time spent in the sequential part. This is very intuitive, since in the extreme a program is purely sequential and the parallel speedup is always 1 regardless of the number of cores. But the (to some) surprising thing is how severe the limit is. Let’s derive Amdahl’s law in the abstract, and then apply it to our example oil painting program.

Consider a program that runs on 1 core in time $T$. This program consists of two main phases, one that is inherently sequential and one that can be parallelized. Let $\alpha$ be the fraction of the execution time spent in the parallelizable phase. We can thus write the execution time on 1 core, $T(1)$, as:

\begin{align} T(1) & = \alpha T + (1 - \alpha) T\\ \end{align}

Now, if we run the program on $p$ cores, assuming perfect parallelization of the parallelizable phase, we obtain the execution time on $p$ cores, $T(p)$, as:

\begin{align} T(p) & = \alpha T / p + (1 - \alpha) T\\ \end{align}

The above just says that the parallel part goes $n$ times faster, while the sequential part is unchanged.

The parallel speedup on $p$ cores, $S(p)$, is then:

\begin{align} S(p) & = \frac{\alpha T + (1 - \alpha) T}{\alpha T / p + (1 - \alpha) T}\\ & = \frac{1}{ \alpha/p + 1 - \alpha} \end{align}

As $p$, the number of cores, grows, $S(p)$ increases (as expected). Amdahl’s law is the observation that no matter how large $p$ gets, the speedup is limited by a constant:

\begin{align} S(p) < \frac{1}{1 - \alpha} \end{align}

So, for instance, if 90% of the sequential execution time can be parallelized, then the speedup will be at most 1/(1-0.9) = 10.

For instance, if running on 8 cores, the speedup would be 1/(0.9/8 + 1 - 0.9) = 4.7, for a parallel efficiency below 60%.

The “non-intuitiveness” of Amdahl’s law, for some people, is that having 10% of the execution sequential does not seem like a lot, but seeing only a 4.7 speedup with 8 cores seems really bad. The graph below shows speedup vs. number of cores for different values of $\alpha$:

Figure 3: Speedup vs. number of cores for different values of the fraction of the sequential execution time that is parallelizable.

The main message of Figure 3 is that even with seemingly small non-parallelizable portions, program speedup drops well below the number of cores quickly. For instance, the data point circled in red shows that if as little as 5% of the sequential execution time is non-parallelizable, running on 20 cores only affords a 10x speedup (i.e., parallel efficiency is only 50%).

This is bad news since almost every program has inherently sequential phases. In our example program the sequential phase is the “luminence” task. But even without this task, there are many parts of a program that are sequential. For instance, a program typically needs to write output using sequential I/O operations. Even if these parts are short, Amdahl’s law tells us that they severely limit speedup.

Bottom line: achieving high speedup on many cores is not easy. The ability of a program to do so is often called parallel scalability. If a program maintains relatively high parallel efficiency as the number of cores it uses increases, we say that the program “scales”.

Practice Questions

[A.2.p4.4] A program that consists of a sequential phase and a perfectly parallelizable phase runs on 1 core in 10 minutes and on 4 cores in 6 minutes. How long does the sequential phase run for?

Let $\alpha$ be the fraction of the sequential execution time that is parallelizable. Amdahl’s law gives us the speedup on 4 cores as:

$S(4) = \frac{1}{ \alpha/4 + 1 - \alpha}$

Since we know $S(4)$ to be 10/6, we can just solve for $\alpha$. This gives us $\alpha = ((6/10) - 1) / (1/4 - 1) = .53$.

Therefore, the sequential phase lasts for $10 \times (1 - .53)$ = 4.7 minutes.

[A.2.p4.5] A program consists of a sequential phase and a perfectly parallelizable phase. When executed on 1 core, the parallel phase accounts for 92% of the execution time. What fraction of the execution time on 6 cores does this phase account for?

Let $T(1)$ be the sequential execution time. The execution time on 6 cores, $T(6)$, is:

$T(6) = 0.08 \times T(1) + 0.92 \times T(1) / 6$

and the fraction of T(6) that corresponds to the parallel phase is:

\begin{align} T(6) & = \frac{0.92 \times T(1) / 6}{0.08 \times T(1) + 0.92 \times T(1) / 6} & = \frac{0.92 / 6} {0.08 + 0.92 / 6} & = .65 \end{align}

So only 65% of the 6-core execution is spent in the parallel phase.

[A.2.p4.6] 40% of the sequential execution time of a program is spent in a phase that could be perfectly parallelized. What is the maximum speedup one could achieve if any number of cores can be used?

This is a direct application of Amdahl’s law. The upper bound on the speedup is 1/(1 - 0.4) = 1.66. There is really no need to remember the formula by heart. The bound is simply what speedup we would achieved with an infinite number of cores, i.e., when the execution time of the parallel phase is zero.

Amdahl’s law and our example

For our example oil-painting program, we can of course compute the speedup analytically.
To apply Amdahl’s law to this program, we need to compute $\alpha$, the fraction of the sequential execution time that is parallelizable. Still for a 100 Gflop/sec core, for a given radius $r$ the time spent in the “oil” task is $r^2$ seconds. The time spent in the “luminence” task is 1 second. Therefore, $\alpha = (r^2) / (1 + r^2)$. So, the speedup when running on $p$ cores with radius $r$, $S(p,r)$, is:

\begin{align} S(p,r) & = \frac{1}{r^2/(1+r^2) / p + 1 - r^2/(1+r^2)} \end{align}

You can double-check that this formula matches what we observed in the simulation app. For instance, for $r=2$, the speedup using 4 cores would be:

\begin{align} S(4,2) & = \frac{1}{(4/5)/ 4 + 1 - 4/5 } & = 2.5 \end{align}

We could then ask questions like: what is the largest number of cores that can be used without the efficiency dropping below 50%? We just need to solve:

\begin{align} \frac{1}{((4/5)/ n + 1 - 4/5)\times n} \geq .50 \end{align}

which gives us $n \leq 5$. So as soon as we use 6 cores or more, parallel efficiency drops below 50%, meaning that we are “wasting” half the compute power of our computer. We could use more cores effectively for larger $r$ because the application would have more (parallelizable) work to do.

In what we have seen so far, the data-parallelization of a task was “perfect”. That is, the original work is $X$ and when using $p$ tasks on $p$ cores each task has work $X/p$.

This is not always the case, as there could be some overhead. This overhead could be a sequential portion that remains unparallelized. Or there could be more work to be done by the parallel tasks. We illustrate this in the two practice questions below.

Practice Questions

[A.2.p4.7] Consider a program that consists of a single task with work 10,000 Gflop. The developer of the program has an idea to expose data-parallelism. But it is not perfect: the single task is rewritten as a first task with work 500 Gflop, and then $n$ tasks with each work $10000/n$ Gflop. So the total work of the program is larger and there is still a sequential phase. What would the speedup be if executing the modified code on 4 cores (compared to the original 1-task program on 1 of these cores)?

Let $s$ be the core compute speed in Gflop/sec.

The sequential program runs in time $10000/s$.

The data-parallel program runs in time $500/s + (10000/4)/s$.

Therefore, the speedup is:

\begin{align} \text{speedup} & = \frac{10000/s}{500/s + (10000/4)/s} & = \frac{10000}{500 + 2500} & = 3.33 \end{align}

[A.2.p4.8] Consider a program that consists of a single task with work 10,000 Gflop. The developer of the program has an idea to expose data-parallelism where the code now consists of $n$ tasks, each of them with work $(10000+X)/n$ (i.e., there is some work overhead for exposing data-parallelism, but there is no sequential phase). What is the largest value of X for which the parallel efficiency would be above 90% when running on an 8-core computer?

Let $s$ be the core compute speed in Gflop/sec. The sequential program runs in time $10000/s$, and the data-parallel program runs in time $((10000+X)/8)/s$.

Therefore, the speedup is:

\begin{align} \text{speedup} & = \frac{10000/s}{((10000+X)/8)/s} & = 8 \times \frac{10000}{10000+X} \end{align}

The parallel efficiency is $\frac{10000}{10000+X}$, so we need to solve:

\begin{align} \frac{10000}{10000+X} \geq 0.9 \end{align}

which gives $X \leq 1111.11$ Gflop.

Questions

[A.2.q4.1] If the sequential execution of a program spends 30% of its time in a phase that could be parallelized perfectly, what would be the parallel efficiency of an execution of this program on 6 cores (assuming that phase has been parallelized)?

[A.2.q4.2] A program consists of a sequential phase and a perfectly parallelizable phase. The program runs on 1 core in 20 minutes and on 3 cores in 10 minutes. How long does the sequential phase run for?

[A.2.q4.3] If a parallel program achieves parallel efficiency of 99% when running on 64 cores, what fraction of its sequential execution time was non-parallelizable?

[A.2.q4.4] Consider a program that consists of a single task with work 10,000 Gflop. Developer $A$ proposes to replace this task with 5 tasks each with work 2,000 Gflop. Developer $B$ proposes to replace this task with 4 tasks each with work 3,000 Gflop, followed by a sequential task with work 500 Gflop. Which developer’s idea should you use when running this program on a 4-core machine?

[A.2.q4.5] A program currently consists of two tasks, $A$ and $B$, that are independent (i.e., they can be performed in parallel). Task $A$ has work 1000 Gflop, while task $B$ has work 2000 Gflop. You can either replace task $A$ with two independent tasks each with work 600 Gflop, or replace task $B$ with two independent tasks each with work 1900 Gflop. If running on a 3-core computer, which replacement would be best in terms of program execution time?

Learning Objectives

• Be able to apply (most of) the concepts in this module to a case-study

A Bioinformatics program

Below is the DAG for a program that implements bioinformatics computations on a large database of DNA sequences. A first task applies some simple cleanup process to the sequences. After that, three tasks need to be executed to compute different similarity metrics between the sequences in the database. Once all these metrics are obtained, a complicated machine learning classification process is applied to the metrics (Task 5). The work and RAM footprint of each task is shown in the figure below.

We have to run this program on a 2-core Virtual Machine (VM) with 20 GB of RAM, where each core computes with speed 400 Gflop/sec, and data is read from storage at bandwidth 100 MB/sec.

Questions

[A.2.q5.1] What is the execution time of this program on this VM?

Saving money?

You’ve found that the execution time is longer than 1 minute. (If not, re-check your work for Question #1!)

This VM is “leased” from some cloud infrastructure that charges 1c for each minute of usage. As a result, a program that runs in, say, 61 seconds, will be charged 2c. If we could run it in under 60 seconds, we could save your organization 1c for each program execution. This does not sound like a lot, but this program runs thousands times each day on hundreds of similar VM instances. So at the end of the year you could have saved a substantial amount of money.

Given a budget your organization has allocated to making the program run faster, you have the following options at your disposal:

• Option #2: Upgrade your VM so that it has 3 cores and 30 GB of RAM.

• Option #3: Upgrade your VM so that cores compute at 440 Gflop/sec.

• Option #4: Pay a software developer to re-implement Task 5 so that it exposes some data parallelism. This is done by replacing the current Task 5 by a 1000 Gflop task followed $n$ independent tasks, each with work 9000/$n$ Gflop.

Each option above costs money, and you can afford only one of them. But the money spent is worth it if it makes the program run in under 60s.

Questions

[A.2.q5.2] Which of the options above are worth it?

Running eduWRENCH 1.0-dev

eduWRENCH is funded by the National Science Foundation (NSF) under grants number 1923539, and 1923621.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.