eduWRENCH - Pedagogic Modules Parallel and Distributed Computing Courseware

A.3.4 Workflows

The goal of this module is to introduce you to the workflow model of computation that is used in many real-world scientific applications.

Go through the tabs below in sequence…

Learning Objectives

  • Understand the concept of a workflow
  • Be able to reason about the performance of a workflow on a multi-core computer

What is a workflow?

A workflow (a.k.a. “scientific workflow”) application comprises individual computational tasks that must all be executed in some particular order to produce a final desired output (e.g., all the steps necessary to perform some complex genomic analysis can be organized as a bioinformatics workflow). In practice, the tasks are stand-alone executable programs that read in input files and produce output files. A file produced as output by one task can be required as input for another task. Consequently, a workflow is typically represented as a Directed Acyclic Graph (DAG) of tasks, where edges are file dependencies (see the Dependencies tab of the Multi Core Computing module, in which we already discussed dependencies between tasks).

There are two typical “rules of execution” in practice:

  • A task cannot start before all its input files have been generated.
  • A task’s output files are available to other tasks only once all of that task’s output files have been generated (i.e., only once the task has completed).

In other words, a task is considered completed only once it has written all its output files. Until then, its output files are “invisible” to other tasks. This is because, unless we know the details of a task’s implementation, we can never be sure that an output file is finalized until the task’s program actually exits.
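To make these two rules concrete, here is a minimal Python sketch (not part of any eduWRENCH software; the task and file names are made up for illustration) that determines which tasks of a workflow are ready to execute, given the set of tasks that have completed:

```python
# Hypothetical 3-task workflow: each task has a set of input and output files
workflow = {
    "task1": {"inputs": {"in.dat"}, "outputs": {"a.dat"}},
    "task2": {"inputs": {"a.dat"}, "outputs": {"b.dat"}},
    "task3": {"inputs": {"a.dat", "b.dat"}, "outputs": {"c.dat"}},
}

def available_files(completed_tasks):
    # Rule 2: a task's output files become visible only once the task has
    # completed (i.e., written *all* of its output files)
    files = {"in.dat"}  # the workflow's input file, assumed present initially
    for t in completed_tasks:
        files |= workflow[t]["outputs"]
    return files

def ready_tasks(completed_tasks):
    # Rule 1: a task is ready only when all its input files are available
    files = available_files(completed_tasks)
    return [t for t in workflow
            if t not in completed_tasks and workflow[t]["inputs"] <= files]

print(ready_tasks(set()))                # ['task1']
print(ready_tasks({"task1"}))            # ['task2']
print(ready_tasks({"task1", "task2"}))   # ['task3']
```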

For now, we assume that a task can only run using a single core.

The figure below depicts an example workflow application:

Dag
Figure 1: Example workflow application. Some examples of real-world workflows for scientific applications, along with their DAG representations, can be found [here](https://pegasus.isi.edu/application-showcase/).

Simulating Multi-core Workflow Execution

This module relies heavily on concepts introduced in previous modules. To make sure you master these concepts, we provide you with a simulation app and accompanying practice questions thereafter. If you find this content too difficult or are missing key knowledge, you may want to review the previous modules. In particular, many concepts come from the Single Core Computing module and the Multi Core Computing module.

The app below simulates the execution of the example workflow in Figure 1 on a computer with 50 Gflop/sec cores and 16 GB of RAM. Attached to this computer is a disk. The app allows you to pick the number of cores and the disk read/write bandwidth.

As these pedagogic modules increase in complexity and sophistication, the number of execution options also increases. The example workflow above is designed to have an execution that is relatively constrained in terms of the number of execution options. But we still need to specify some aspects of the execution strategy simulated by the app:

  • A core never runs more than one task at a time (this is because, as in all previous modules, we disallow time-sharing of cores);
  • When there is not enough free RAM on the computer, tasks cannot be started;
  • When there are multiple ready tasks, they are started on cores in lexicographical order (i.e., “task2” would start before “task3”);
  • When multiple ready tasks are started, they immediately read their input files. For instance, if task2 and task3 are ready and can both run simultaneously (enough cores, enough RAM), they do start at the same time and read their input files simultaneously. Importantly, these tasks then split the disk bandwidth equally.
(Open simulator here)


Practice Questions

Answer these practice questions, using the simulation app and/or analysis (and then use the app to double-check your results):

[A.3.4.p1.1] How many MB of data are read and written by the workflow when executed on this computer?

(click to see answer)

This can easily be done analytically. The table below shows for each file the total amount of read/write it causes in MB:

| file     | size in MB | times read | times written | total MB read/written |
|----------|------------|------------|---------------|-----------------------|
| data     | 500        | 1          | 0             | 500                   |
| filtered | 400        | 3          | 1             | 1600                  |
| finalA   | 200        | 1          | 1             | 400                   |
| finalB   | 200        | 1          | 1             | 400                   |
| finalC   | 200        | 1          | 1             | 400                   |
| aggBC    | 200        | 1          | 1             | 400                   |

So the total amount of data read/written is $500 + 1600 + 4 \times 400 = 3700$ MB.
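This tally is easy to reproduce programmatically. Below is a quick Python sketch of the arithmetic (file sizes and read/write counts copied from the table above; this is not a simulation):

```python
# (file, size in MB, times read, times written) -- values from the table above
files = [("data", 500, 1, 0), ("filtered", 400, 3, 1), ("finalA", 200, 1, 1),
         ("finalB", 200, 1, 1), ("finalC", 200, 1, 1), ("aggBC", 200, 1, 1)]

# Each read and each write of a file moves its full size to/from the disk
total_mb = sum(size * (reads + writes) for _, size, reads, writes in files)
print(total_mb)  # 3700
```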

We can verify this in simulation. Running the app with 1 core and with the disk bandwidth set to 100 MB/sec, the total execution time is 231 seconds. The time to perform the computation is the sum of the task execution times: $10 + 20 + 100 + 20 + 40 + 4 = 194$ seconds.

So the time to perform the I/O is $231 - 194 = 37$ seconds. Since the disk bandwidth is 100 MB/sec, this means the total data size is: 3700 MB!

[A.3.4.p1.2] What is the parallel efficiency when executing the workflow on 3 cores and when the disk bandwidth is 150 MB/sec?

(click to see answer)

The simulation shows that the 1-core execution takes time 218.67 seconds, while the 3-core execution takes time 197.33 seconds. So the speedup on 3 cores is 218.67 / 197.33 = 1.108. Meaning that the parallel efficiency is 1.108/3 = 36.9%. This is pretty low.
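The same computation, written as a small Python helper (the execution times are those reported by the simulation):

```python
def parallel_efficiency(t_sequential, t_parallel, num_cores):
    # efficiency = speedup / number of cores
    return (t_sequential / t_parallel) / num_cores

print(parallel_efficiency(218.67, 197.33, 3))  # ~0.369, i.e., 36.9%
```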

[A.3.4.p1.3] Explain why there is never any improvement when going from a 2-core execution to a 3-core execution for this workflow.

(click to see answer)

The lack of improvement is easy to see in the simulation. In fact, executions look identical with 2 and 3 cores.

The width of the DAG is 3, so in principle using 3 cores could be useful. The only level of the DAG with 3 tasks is the “blue” level. Unfortunately, the 3 tasks in that level cannot run concurrently due to RAM constraints. At most 2 of them can run concurrently (task3 and task4) since together they use less than 16 GB of RAM.

[A.3.4.p1.4] Consider the execution of this workflow on 2 cores with disk bandwidth set to 50 MB/sec. Is the disk ever used concurrently by tasks? How can you tell based on the simulation output?

(click to see answer)

Tasks task3 and task4 use the disk concurrently. This is easily seen in the “Workflow Task Data” section of the simulation output. For instance, task3 spends 16 seconds reading its input file. Given that this file is 400 MB, this means that task3 experiences a read bandwidth of 400/16 = 25 MB/sec. This is half of the disk bandwidth, meaning that the disk is used concurrently by another task (task4), which also gets half of the disk bandwidth.

[A.3.4.p1.5] Considering a 1-core execution of the workflow, for which disk bandwidth would the execution be perfectly balanced between computation time and I/O time?

(click to see answer)

Let $B$ be the unknown bandwidth. The compute time is, as we saw in question A.3.4.p1.1 above, 194 seconds. The I/O time, again based on what we saw in that previous question, is 3700 / $B$ seconds. So we simply need to solve:

$ 3700 / B = 194 $

which gives $B$ = 19.07 MB/sec. We can verify this in simulation by setting $B$ to 19. The simulation shows a total execution time of 388.7 seconds, which is almost exactly twice 194.

[A.3.4.p1.6] Considering computation and I/O, what is the length of the workflow’s critical path (in seconds) if the disk bandwidth is 100 MB/sec?

(click to see answer)

In the Task Dependencies tab of the Multi Core Computing module we defined the critical path without any I/O. Extending this notion to I/O is straightforward (one can simply consider file reads and writes as extra tasks to perform).

We have 3 possible paths in the workflow, and for each one we can compute its length (i.e., duration in seconds), as follows (note that all intermediate files are both written and read, and hence are counted “twice”):

  • task1->task2->task6: 5 + 10 + 4 + 4 + 20 + 2 + 2 + 4 = 51 seconds
  • task1->task3->task5->task6: 5 + 10 + 4 + 4 + 100 + 2 + 2 + 40 + 2 + 2 + 4 = 175 seconds
  • task1->task4->task5->task6: 5 + 10 + 4 + 4 + 20 + 2 + 2 + 40 + 2 + 2 + 4 = 95 seconds

The critical path (the middle path) has length 175 seconds. No execution can proceed faster than 175 seconds no matter how many cores are used.
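Below is a short Python sketch of this path-length computation. The three paths are hard-coded as sequences of (input file size, task work, output file size) triples; the file sizes come from the earlier table, and the task works are inferred from the task execution times above and the 50 Gflop/sec core speed. This is just the arithmetic, not a general critical-path algorithm:

```python
DISK_BW = 100.0     # MB/sec
CORE_SPEED = 50.0   # Gflop/sec

def io(mb):      return mb / DISK_BW      # time to read OR write a file
def compute(gf): return gf / CORE_SPEED   # time to execute a task's work

# Each path is a list of (input MB read, task work in Gflop, output MB written);
# intermediate files are both written and read, hence counted twice overall
paths = {
    "task1->task2->task6":        [(500, 500, 400), (400, 1000, 200), (200, 200, 0)],
    "task1->task3->task5->task6": [(500, 500, 400), (400, 5000, 200), (200, 2000, 200), (200, 200, 0)],
    "task1->task4->task5->task6": [(500, 500, 400), (400, 1000, 200), (200, 2000, 200), (200, 200, 0)],
}

lengths = {name: sum(io(i) + compute(w) + io(o) for i, w, o in steps)
           for name, steps in paths.items()}
print(lengths)                # 51.0, 175.0, and 95.0 seconds
print(max(lengths.values()))  # 175.0 -- the critical path length
```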

[A.3.4.p1.7] Give your thoughts on why this workflow is poorly suited for parallel execution in general and on our 3-core computer in particular.

(click to see answer)

There are three clear problems here:

  • Problem #1: Only 1 level of the workflow has 3 tasks, and all other levels have 1 task. So this workflow is mostly sequential, and Amdahl’s law tells us this is bad news.

  • Problem #2: The only parallel level (the “blue” level) suffers from high load imbalance. One task runs in 100 seconds, while the other two run in 20 seconds. So, when running on 3 cores, assuming no I/O, the parallel efficiency is at most (140/100)/3 = 46.6%.

  • Problem #3: On our particular computer, the RAM constraints make things even worse as the workflow’s width becomes essentially 2 instead of 3. We can never run the 3 blue tasks in parallel.

To get a sense of how “bad” this workflow is, let’s assume infinite disk bandwidth and infinite RAM capacity (which removes Problem #3 above). In this case, on 3 cores, the workflow would run in time: 10 + 100 + 40 + 4 = 154 seconds. The sequential execution time would be 194 seconds. So the speedup would only be 1.26, for a parallel efficiency of only 42%. Amdahl’s law is never good news.


Questions

Given the workflow below, answer the following questions:

Dag

[A.3.4.q1.1] How many MB of data are read during an execution of this workflow? How many are written?

[A.3.4.q1.2] Say we run this workflow on a 1-core computer where the core speed is 100 Gflop/sec and the disk has a read/write bandwidth of 100 MB/sec. What is the workflow execution time?

[A.3.4.q1.3] Say now this computer has 2 cores, and the workflow execution strategy is, whenever there is a choice, to start the task with the highest work. What is the execution time? What is the parallel efficiency?

[A.3.4.q1.4] Would the result be different if we instead picked the tasks with the lowest work first?

[A.3.4.q1.5] Say we now have 4 cores. Explain why there is no way to get the parallel efficiency above 60% even if the disk can be upgraded at will.


Learning Objectives

  • Be able to reason about workflow execution performance on distributed multi-host/multi-core platforms

Executing Workflows on Distributed Platforms

Workflows are often comprised of many tasks that are computationally intensive and/or require large amounts of storage. As a result, one often does not have the necessary resources on one’s local computer to execute them in any reasonable amount of time. Instead, one needs to deploy workflow executions on compute/storage resources that are connected via some network, a.k.a., distributed computing platforms. You likely have heard of some of these platforms, such as cloud platforms or high performance computing (HPC) platforms.

The goal is to execute a workflow application on these platforms as quickly as possible, given the underlying network infrastructure (latencies, bandwidths, network topologies) that interconnects storage (disks) and compute (multi-core hosts with some RAM) resources. This is only possible if an appropriate software infrastructure is provided to use remote resources. In this module, we just assume that this is the case, and leave the discussion of the details of the software infrastructure for future modules.

Example Platform

We consider the following distributed platform with three sites on a wide-area network.

Distributed platform
Figure 1: Example distributed computing platform.

The site in the bottom-left corner is where the user who wishes to execute the workflow resides. That user has only some personal computing device, like a laptop computer. No workflow data is stored and no workflow computation is performed on this computer; it is only used to orchestrate the workflow execution remotely. All workflow data is stored on a remote storage site (top center), and all workflow computation is performed on a remote compute site (top right). So workflow data has to flow back and forth between the storage site and the compute site. This is because, for now, the compute site has no persistent storage.

The storage site simply hosts a disk with 500 MB/sec read/write bandwidth, and uses a 1 MB buffer when being accessed remotely (see the Pipelining tab of the Client-Server module). It is connected to the compute site via a wide-area network link (in fact it is likely a multi-link network path, but let’s keep this simple and assume a single link). This link has 100 MB/sec bandwidth and 10 millisecond latency.

Let’s now look deeper into the setup of the compute site. This site hosts several computers, each of them with some RAM capacity and multiple cores, and each of them connected to a switch via a high-speed network link. This setup is depicted in the figure below:

Distributed platform zoom
Figure 2: Compute resources at the compute site

Each compute host has 32 GB of RAM and up to 8 cores, each core computing at 100 Gflop/sec. All compute hosts are connected to the site’s switch via a 10 GB/sec network link with 10 micro-second latency. This switch is connected to the storage site via the wide-area link. Therefore, the network path from the storage resource to each compute host has two links: the 100 MB/sec wide-area link and the 10 GB/sec local-area link.

Say that a task needs to perform 1000 Gflop, requires 10 GB of RAM, reads in a 200 MB input file, and writes back a 10 MB output file. We can compute a rough estimate of this task’s execution time on one of the compute hosts, assuming that no other task is competing with it, as:

\[\begin{align} \text{Task execution time} & = \text{input read}\; + \;\text{compute}\; + \;\text{output write}\\ & = 200 / 100 + 1000 / 100 + 10 / 100\\ & = 12.1\; \text{sec} \end{align}\]

The above expression assumes that data is read/written from/to the disk at 100 MB/sec, the smallest of the disk bandwidth (500 MB/sec) and of the bottleneck link bandwidth (100 MB/sec). It is only a rough estimate because it does not account for pipelining and latency, and because, as we have seen several times already in these modules, the network’s data transfer rate is often not simply data size divided by bandwidth. This is especially true when network latencies are high, which is the case here with a 10ms latency on the wide-area link that connects the storage resource to the compute resources. We will see below how (in)accurate these estimates can be. But as a general note, as we progress through these pedagogic modules, platforms become increasingly complex. As a result, we will rely more and more on simulation results and less and less on back-of-the-envelope estimates.
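For reference, here is the rough estimate above written as a tiny Python function. Like the hand calculation, it deliberately ignores latency and pipelining, and it uses the bandwidth and core-speed values of our example platform:

```python
DISK_BW = 500.0     # MB/sec, disk at the storage site
WAN_BW = 100.0      # MB/sec, wide-area link (the bottleneck of the I/O path)
CORE_SPEED = 100.0  # Gflop/sec, one core of a compute host

def estimated_task_time(input_mb, work_gflop, output_mb):
    # Rough estimate: read input, compute, write output, with data flowing
    # at the bottleneck bandwidth (no latency, no pipelining effects)
    effective_bw = min(DISK_BW, WAN_BW)
    return input_mb / effective_bw + work_gflop / CORE_SPEED + output_mb / effective_bw

print(estimated_task_time(200, 1000, 10))  # 12.1 seconds
```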

Example Workflow

We consider a simple “in-tree” workflow, depicted in the figure below.

Distributed platform
Figure 3: Example workflow.

This workflow has only two levels, with the first level consisting of 20 tasks and the second level having only one task. The width of the workflow is thus 20, and the critical path is relatively short. So, unlike the example workflow in the previous tab, this workflow should benefit significantly from parallel execution.

Executing the workflow on the platform

We wish to execute our workflow on our distributed platform. The workflow execution strategy is straightforward because our workflow has a simple structure: whenever there are sufficient compute resources at a compute host (i.e., at least one idle core and 8 GB of RAM), start the next to-be-executed pre_* task on it. When all pre_* tasks have been executed, then the final task can be executed.

Whenever several pre_* tasks start simultaneously, then they also read their input files simultaneously, thus splitting disk and network bandwidth. And, as in the previous tab, a task does not free up its compute resources until its output files have all been fully written to disk.

Simulating Distributed Workflow Execution

The simulation app below simulates the execution of our workflow on our platform, and allows you to pick the number of hosts and of cores per host at the compute site. You can experiment yourself with this application, but you should then use it for the practice questions hereafter.

(Open simulator here)


Practice Questions

[A.3.4.p2.1] When executing the workflow with a single 1-core compute host, what fraction of the time is spent doing actual computation?

(click to see answer)

Running the simulation gives us a total execution time of 299.69 seconds. In total, the computation consists of 21,000 Gflop to be performed on a 100 Gflop/sec core. So that is 210 seconds of computation. Therefore, the execution spends 210/299.69 = 70% of its time doing computation. The rest of the execution is disk and network I/O.

[A.3.4.p2.2] Based on the answer to the previous question, how long would you expect the execution time to be if the (single) compute host had 2 cores? Double-check your expectation in simulation.

(click to see answer)

In the previous question, we found out that the computation in total takes 210 seconds. On 2 cores, this should become 110 seconds (since the final task runs by itself). Therefore we would expect the execution time to be 100 seconds shorter than in the previous question, that is, 199.69 seconds.

The simulation gives 189.77 seconds. This is faster than expected, which can be due to several reasons. When running tasks in parallel, there can be beneficial effects in terms of network bandwidth. In this case, this is happening on the wide-area link due to its high latency. This is now a recurring theme in these pedagogic modules: the network is complicated and its performance difficult to estimate precisely.

[A.3.4.p2.3] For running our workflow, is it better to have 5 4-core compute hosts or 4 5-core hosts? Check your answer in simulation.

(click to see answer)

It is better to use 5 4-core hosts because the RAM at each host is 32 GB. Therefore, no matter how many cores a host has, it cannot run more than 4 of our pre_* tasks in parallel (each such task requires 8 GB of RAM).

This is seen in simulation:

  • With 4 5-core hosts: 102.67 seconds
  • With 5 4-core hosts: 91.76 seconds

[A.3.4.p2.4] What is the parallel efficiency (in terms of cores) of the execution when using 5 4-core compute hosts?

(click to see answer)

The speedup is 299.69 / 91.76 = 3.26. Since we used 20 cores, our parallel efficiency is 3.26/20 = 16.33%. This is pretty low, but expected since we have so much I/O and a level of the workflow has no parallelism whatsoever.

[A.3.4.p2.5] What overall I/O bandwidth is achieved by the workflow execution when using a single core? What about when using 5 4-core hosts?

(click to see answer)

In total, the execution reads and writes $20 \times (50 + 100 + 100) + 1 = 5001$ MB of data (each pre_* task reads a 50 MB input file and writes a 100 MB output file, which the final task then reads; the final task writes a 1 MB output file). Using the same reasoning as in question A.3.4.p2.1 above, we can compute the I/O time for each execution and deduce the bandwidth. This is summarized in the table below:

| execution  | total time | compute time | I/O time | I/O bandwidth |
|------------|------------|--------------|----------|---------------|
| 1x1 core   | 299.69 s   | 210 s        | 89.69 s  | 55.75 MB/s    |
| 5x4 cores  | 91.76 s    | 20 s         | 71.76 s  | 69.69 MB/s    |

As earlier, we find that doing parallel I/O (over the network) brings some benefit. However, due to latency effects, we are pretty far from achieving the peak 100 MB/s bandwidth. It would be pretty difficult to estimate the I/O time of this workflow execution without the simulation.
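The I/O bandwidth figures in the table are computed as follows (total data volume and compute times as explained above; total execution times from the simulation):

```python
TOTAL_DATA_MB = 20 * (50 + 100 + 100) + 1   # 5001 MB read and written in total

def effective_io_bandwidth(total_time, compute_time):
    io_time = total_time - compute_time     # everything that is not computation
    return TOTAL_DATA_MB / io_time

print(effective_io_bandwidth(299.69, 210))  # ~55.75 MB/sec (1 host x 1 core)
print(effective_io_bandwidth(91.76, 20))    # ~69.69 MB/sec (5 hosts x 4 cores)
```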


Questions

Consider the following workflow (all green tasks have identical specs, and so do all blue tasks):

Distributed platform


[A.3.4.q2.1] You can lease three different platforms to execute this workflow:

  • Platform A: Two 4-core hosts, each with 8 GB of RAM, and 120 Gflop/sec core compute speed
  • Platform B: Three 6-core hosts, each with 12 GB of RAM, and 50 Gflop/sec core compute speed
  • Platform C: One 3-core host, with 16 GB of RAM, and 120 Gflop/sec core compute speed

Assuming the I/O and network times are zero, which of the three platforms above is the better choice?

[A.3.4.q2.2] Because all tasks in a given level are identical, at any given time either all running tasks compute, or all running tasks perform I/O. Assuming the total I/O time, regardless of the number of hosts/cores, is 20 seconds, what is the parallel efficiency for the three platforms in the previous question?


Learning Objectives

  • Understand the concept of data locality in distributed platforms
  • Be able to quantify the impact of data locality on workflow execution

The need for data locality

In the previous tab, all workflow tasks were reading/writing data at a remote (from their perspective) storage site. The wide-area link has lower bandwidth than the disk’s, and the data transfer rate it achieves is negatively impacted by high latency. As a result, the workflow execution spends a large fraction of time performing remote I/O, which hurts performance and parallel efficiency. This is especially damaging for the “intermediate” data files, that is, those that are the output of one task and the input to another. These files are written to the remote storage and then immediately read back from it. Keeping these files “close” to the compute hosts would be much more efficient.

Trying to keep/move data close to where the computation takes place is often called improving data locality. You may have encountered this term in computer architecture or operating systems courses/textbooks in the context of caches. Here, we use it in the context of network proximity.

Better data locality for our workflow

Going back to the setup in the previous tab, we want to be able to store data on the compute site. So let’s enhance that site with a bit more hardware!

Storage at the compute site
Figure 1: Added storage capability at the compute site.

Figure 1 above shows the compute site for the platform in the previous tab, but with a new host (shown in green). This host is not used for computation but provides access to a disk with 500 MB/sec read/write bandwidth.

Given the new storage capability, we can now refine the workflow execution strategy: unless a task is an exit task of the workflow, it stores its output on the disk at the compute site. In this way, whenever possible, tasks will read/write data to the local storage rather than the remote storage. The initial input files to the workflow are still stored at the remote storage site, and the output files must end up there as well.

Simulating Better Data Locality

The simulation app below simulates workflow execution when the compute site has storage capabilities. The app is similar to that in the previous tab, but allows you to pick the bandwidth of the wide-area network link between the storage site and the compute site. It also allows you to toggle the use of storage at the compute site (if not checked, the simulated execution behaves as in the previous tab, with poor data locality). You can use the app on your own, but then you should use it to answer the practice questions hereafter.

(Open simulator here)


Practice Questions

[A.3.4.p3.1] When executing the workflow with a 100 MB/sec wide-area link bandwidth and using a single core, how much time is saved when storing intermediate files at the compute site? If you do a back-of-the-envelope estimation of the time saved based on data sizes and bandwidths, do you get the same answer?

(click to see answer)

This can be answered by just running the simulation:

  • With only remote storage: 299.69 seconds
  • With local storage: 239.91 seconds

Thus, the saving is 59.78 seconds.

The only difference between the two executions is the I/O times for the intermediate files. In both cases, $2 \times 20 \times 100 = 4000$ MB of data are read/written from/to storage. To the remote storage, this should take 4000/100 = 40 seconds. To the local storage, this should take 4000/500 = 8 seconds. So we would expect a saving of $40 - 8 = 32$ seconds. In fact, the saving is almost twice that.

This is because the wide-area data transfer rate is not 100 MB/sec, due to the high network latency.

We saw this in the previous tab but can re-iterate. The application, when not using any local storage, reads/writes a total of $20 \times 50 + 2 \times 20 \times 100 + 1 = 5001$ MB of data. Since the application computes for 210 seconds, this means that it spends 299.69 - 210 = 89.69 seconds transferring the data. Thus, the actual data transfer rate is 5001/89.69 = 55.75 MB/sec, a far cry from the peak 100 MB/sec!

So if we re-compute our saving estimate above using this effective data transfer rate, we obtain: 4000/55.75 - 4000/500 = 63.75 seconds. This is much closer to what the simulation gives us. The remaining discrepancy is due to other effects/overheads captured by the simulation (which we will mention in upcoming modules).
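Both estimates (the naive one and the one based on the effective data transfer rate) boil down to the following arithmetic:

```python
INTERMEDIATE_MB = 2 * 20 * 100   # 4000 MB of intermediate data, written then read back
LOCAL_DISK_BW = 500.0            # MB/sec, disk at the compute site

def estimated_saving(remote_rate):
    # Time saved by keeping intermediate files on the local disk rather than
    # transferring them at 'remote_rate' MB/sec to/from the remote storage
    return INTERMEDIATE_MB / remote_rate - INTERMEDIATE_MB / LOCAL_DISK_BW

print(estimated_saving(100.0))   # 32.0  -- naive, using the link's nominal bandwidth
print(estimated_saving(55.75))   # ~63.75 -- using the effective data transfer rate
```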

[A.3.4.p3.2] Still using a 100 MB/sec wide-area link bandwidth, what parallel efficiency can we achieve when using 5 4-core hosts and local storage?

(click to see answer)

As we saw in the previous question, the sequential (1-core) execution time is 239.91 seconds when using local storage. Using the simulation to determine the parallel execution time we get: 41.86 seconds.

So the parallel efficiency is (239.91 / 41.86) / 20 = 28.6%. This is better than without using local storage, but still not great.

[A.3.4.p3.3] What is the parallel efficiency when doubling the wide-area link bandwidth?

(click to see answer)

Using the simulation again, we get: (239.91 / 36.61) / 20 = 32.7%.

[A.3.4.p3.4] Now, set the wide-area link bandwidth to a high 500 MB/sec. Do we see a big jump in efficiency? What is the effective wide-area data transfer rate? Is it anywhere close to 500 MB/sec?

(click to see answer)

Using the simulation again, we get: (239.91 / 33.45) / 20 = 35.8%. This is not a big jump at all.

From the simulation output, we see that it takes 4.49 seconds for all tasks to read their input from remote storage. That is for a total of $20\times 50 = 1000$ MB. So the data transfer rate is 1000/4.49 = 222.71 MB/sec. This is not even half of 500 MB/sec. The large latency is preventing us from achieving the peak data transfer rate.

[A.3.4.p3.5] Assuming the wide-area latency was not a problem, and that we would achieve a 500 MB/sec data transfer rate, what would the parallel efficiency be? How close is it to the efficiency when assuming that all I/O takes zero time?

(click to see answer)

Instead of 4.49 seconds, the tasks would take “only” 1000/500 = 2 seconds to read their input. So we would shave 2.49 seconds off the execution time. (In fact, we would also save a tiny bit on the transfer of the workflow’s 1 MB output file.) So the efficiency would be: (239.91 / (33.45 - 2.49)) / 20 = 38.7%.

If I/O took zero time, the sequential (1-core) execution time would be (20000 + 1000)/100 = 210 seconds, and the parallel execution time would be 20 seconds. So the efficiency would be (210/20) / 20 = 52.5%.

So, at 38.7%, we are still pretty far from the ideal parallel efficiency.


Questions

Consider the following workflow:

Distributed platform


[A.3.4.q3.1] Say we execute this workflow at a compute site that hosts a 2-core host, with cores computing at 100 Gflop/sec. All data is read/written from/to a remote storage site. How many MB are read/written in total?

[A.3.4.q3.2] Say that the read/write data rate for the remote storage site is 200 MB/sec (which, as we know from the simulation above, could be well below the actual bandwidth). What is the workflow execution time? Hint: be careful about how the two blue tasks split the bandwidth.

[A.3.4.q3.3] We now have local storage at the compute site, with data access rate 500 MB/sec. What is the workflow execution time now? What is the parallel efficiency?

Learning Objectives

  • Understand how task- and data-parallelism can be mixed
  • Be able to reason about the performance of programs that include both task- and data-parallelism

Basic concept

So far in this module we have only considered sequential tasks. In other words, each task can only use a single core. But in the Data-Parallelism tab of the Multi Core Computing module, we learned about Data Parallelism: the notion that a sequential task can be rewritten as a set of parallel tasks, with perhaps a remaining sequential portion of the execution. Then, in that same tab, we learned about Amdahl’s Law, which quantifies a data-parallel task’s execution time on a given number of cores, given what fraction of the task’s work has to remain sequential. You may want to review this content before proceeding.

Let’s consider workflows in which some tasks are data-parallel. For these tasks we need to decide how many cores they should use. So our workflow has both task-parallelism (like all workflows) and data-parallelism. This is often called “mixed” parallelism.

An example

Example workflow with task- and data-parallelism.
Figure 1: A simple workflow with some data-parallel tasks ($\alpha$ is the fraction of the work that is parallelizable)

Figure 1 above shows an example workflow with both task- and data-parallelism. For simplicity, we completely ignore files and I/O. The green and red tasks are not data-parallel, and can run only on a single core. The blue, yellow, and purple tasks are data-parallel. For each one of these tasks, in addition to its amount of work, we also indicate the value of $\alpha$, the fraction of the work that can be parallelized. Based on Amdahl’s law, a data-parallel task with work $w$ Gflop runs on a $p$-core computer, where core speed is $s$ Gflop/sec, in time:

\[T(p) = \frac{\alpha \times \frac{w}{s}}{p} + (1 - \alpha) \times \frac{w}{s}\]

The above equation just means that the parallelizable portion of the sequential execution time (the left term) is accelerated by a factor $p$ when executed in parallel on $p$ cores, while the sequential portion (the right term) remains sequential.
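Written as a small Python function, using the same notation as the equation above:

```python
def data_parallel_task_time(w, s, p, alpha):
    # Amdahl's law: the parallelizable fraction alpha of the sequential time w/s
    # is sped up by a factor p; the rest remains sequential
    sequential_time = w / s
    return alpha * sequential_time / p + (1 - alpha) * sequential_time

# Example: the blue task (1000 Gflop, alpha = 0.9) on four 100 Gflop/sec cores
print(data_parallel_task_time(1000, 100, 4, 0.9))  # 3.25 seconds
```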

Say we are running this workflow on a 4-core computer where cores compute at speed 100 Gflop/sec. We could run each of the data-parallel tasks using 4 cores. In this case, here is the execution time of each task:

  • Green: $1.00$ sec
  • Blue: $10 \times 0.9 / 4 + 10 \times 0.1 = 3.25$ sec
  • Yellow: $20 \times 0.85 / 4 + 20 \times 0.15 = 7.25$ sec
  • Purple: $12 \times 0.80 / 4 + 12 \times 0.20 = 4.80$ sec
  • Red: $1.00$ sec

Since each data-parallel task uses all 4 cores, no two tasks can run at the same time. So the total execution time is the sum of the task execution times, i.e., 17.30 seconds.

There are many other options! For instance, we could run the blue task using 2 cores, the yellow task using 2 cores, and the purple task using 4 cores, for the following task execution times:

  • Green: $1.00$ sec
  • Blue: $10 \times 0.9 / 2 + 10 \times 0.1 =$ 5.50 sec
  • Yellow: $20 \times 0.85 / 2 + 20 \times 0.15 =$ 11.5 sec
  • Purple: $12 \times 0.80 / 4 + 12 \times 0.20 =$ 4.80 sec
  • Red: $1.00$ sec

But now the blue and yellow tasks can run simultaneously! So the execution time is: $1 + 11.5 + 4.80 + 1 = $ 18.30 seconds. This option is not as good as the previous one.

How many options are there? Well, for each of the 3 tasks we have 4 options, so that is $4^3 = 64$ options!!! One (or more) of these options has to be the best one, and one (or more) has to be the worst one. For instance, running all tasks on a single core would waste 1 core of our 4-core computer, and is clearly not as good as running some of the tasks on 2 cores.
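One can enumerate all 64 options programmatically and pick the best one. The sketch below reuses the Amdahl-style task time from above (task works inferred from the sequential execution times: blue 1000 Gflop, yellow 2000 Gflop, purple 1200 Gflop) and assumes a simple greedy schedule on the single 4-core computer: the three data-parallel tasks are considered in a fixed order, and each one is started as soon as enough cores are idle. This is one plausible strategy for this small example, not necessarily the strategy an actual runtime would use:

```python
import heapq
from itertools import product

CORE_SPEED = 100.0  # Gflop/sec
# (work in Gflop, alpha) for the three data-parallel tasks of Figure 1
TASKS = {"blue": (1000, 0.90), "yellow": (2000, 0.85), "purple": (1200, 0.80)}

def task_time(work, alpha, cores):
    seq = work / CORE_SPEED
    return alpha * seq / cores + (1 - alpha) * seq

def level_makespan(alloc, total_cores=4):
    # Greedy list schedule of the three independent data-parallel tasks:
    # consider them in a fixed order, start each as soon as enough cores are idle
    pending = [(alloc[name], task_time(*TASKS[name], alloc[name]))
               for name in ("yellow", "blue", "purple")]
    running, idle, now, finish = [], total_cores, 0.0, 0.0
    while pending or running:
        while pending and pending[0][0] <= idle:      # start next task if it fits
            cores, duration = pending.pop(0)
            idle -= cores
            heapq.heappush(running, (now + duration, cores))
            finish = max(finish, now + duration)
        if running:                                   # jump to the next completion
            now, cores = heapq.heappop(running)
            idle += cores
    return finish

options = [dict(zip(("blue", "yellow", "purple"), p))
           for p in product(range(1, 5), repeat=3)]       # 4^3 = 64 options
best = min(options, key=level_makespan)
# Total time = green task (1 s) + data-parallel level + red task (1 s)
print(best, 1 + level_makespan(best) + 1)
```

For example, with all three tasks allocated 4 cores, level_makespan returns 15.30, giving a total of 1 + 15.30 + 1 = 17.30 seconds, which matches the hand calculation above.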

Simulating Task- and Data-Parallelism

The simulation app below makes it possible to simulate the execution of the above example workflow on a platform that comprises two 3-core hosts. Again, remember that in this tab we ignore all I/O. The app allows you to pick how many cores are used for the blue, yellow, and purple tasks. The execution strategy, when picking tasks to assign to idle cores, always picks tasks in the order yellow, blue, purple. But it turns out this does not matter in terms of application performance (because we have only 3 tasks to run on the 2 compute hosts). You can use this app on your own, but then you should use it to answer the following practice questions.

(Open simulator here)


Practice Questions

[A.3.4.p4.1] Estimate the execution time when all data-parallel tasks use 3 cores. Double-check your result in simulation.

(click to see answer)

With 3 cores, here are the data-parallel task execution times:

  • Blue task: $0.90 \times 10 / 3 + 0.10 \times 10 =$ 4.00 seconds
  • Yellow task: $0.85 \times 20 / 3 + 0.15 \times 20 =$ 8.66 seconds
  • Purple task: $0.80 \times 12 / 3 + 0.20 \times 12 =$ 5.60 seconds

The blue and purple tasks run one after the other on the same host (each task uses all 3 cores), for a total time of 9.60 seconds, while the yellow task runs on the other host.

The total execution time is thus 11.60 seconds, which is confirmed by the simulation.

[A.3.4.p4.2] Say that you must configure two of the data-parallel tasks to use 1 core, and the third one to use 3 cores. Which task should use 3 cores to achieve the shortest execution time? Come up with an answer based on your intuition, and then check your intuition in simulation.

(click to see answer)

The yellow task has 2000 Gflop work, so, even though its $\alpha$ is not as high as that of the blue task, we should give it the 3 cores!

The simulation gives us the following total execution times:

  • when giving 3 cores to the blue task: 22 seconds
  • when giving 3 cores to the yellow task: 14 seconds
  • when giving 3 cores to the purple task: 22 seconds

Our intuition is confirmed. The fact that the other two options have the same execution time is simply because the yellow task is the task that determines the execution time.

[A.3.4.p4.3] Say we configure each data-parallel task to run on 2 cores. Which of these tasks will run on the same host? Come up with an answer using intuition, or analytically, and then double-check it in simulation.

(click to see answer)

When using 2 cores, the yellow task will still be the longest task, so it will be placed by itself on a host. The blue and purple tasks will run on the same host. This is confirmed in the simulation output.

[A.3.4.p4.4] Because the yellow task is so expensive, we decide to always run it on 3 cores. Is it better to give 1 core to the blue task and 2 cores to the purple task, or the other way around?

(click to see answer)

All data-parallel tasks run simultaneously.

First, does this matter? That is, if the yellow task ran for, say, 13 seconds, it would not really matter what we do with the blue and purple tasks. It turns out that the yellow task runs in time $20 \times 0.85 / 3 + 20 \times 0.15 =$ 8.66 seconds. So the yellow task will not determine the execution time, and yes, the choice in the question matters.

If we give 1 core to the blue task, then it runs in 10 seconds, and determines the execution time. If instead we give 1 core to the purple task, it will run in 12 seconds, and determines the execution time. So we should give 2 cores to the purple task and 1 core to the blue task.


Questions

Considering the workflow below, answer the following questions.

Workflow for question.

[A.3.4.q4.1] We are given two hosts with 100 Gflop/sec cores, where host1 has 20 cores and host2 has 40 cores. Should we run the blue task on host1 or on host2 (if our objective is to run the workflow as quickly as possible)?

[A.3.4.q4.2] If, instead, we run the workflow on a single 4-core computer, what is the best approach?

[A.3.4.q4.3] Say now we are running our workflow on a single 40-core host. What is the best way to allocate cores to the blue and purple tasks? If you are really into it, you can do this completely analytically (it requires finding roots of degree-2 polynomials). More easily, you can simply write the execution time as a function of the number of cores allocated to the blue task, and plot this function to find where it attains its minimum visually. There are many websites on which you can do this (search for “graphing calculator”).


Learning Objectives

  • Be able to put together the concepts learned in the previous tabs

Scenario

Consider the scenario (i.e., a workflow to be executed on a distributed platform) depicted in this figure:

Capstone scenario

The 4-task workflow needs to be executed on a 2-host platform, with all workflow data hosted at a remote storage site. The first task of the workflow is a data-parallel task; 10% of its sequential execution time cannot be parallelized (i.e., $\alpha = 0.9$).

Note that in the platform above, we give you the actual data transfer rate achieved by the wide-area link (20 MB/sec). As we saw in previous tabs, due to high latencies, the achieved data transfer rate can be well below the link bandwidth. We give you the data transfer rate so that it is straightforward to estimate data transfer times accurately.

Possible platform upgrades

The compute resources in the platform are really virtual machines that you have leased from a cloud provider. With the current configuration the workflow executes in 74 seconds, but you want it to run as fast as possible, since you want to execute this workflow as many times as possible per day (with different input data).

After looking at the cloud provider’s website, you figure out you can afford one of the following upgrades:

  • Upgrade #1: Double the wide-area data transfer rate;
  • Upgrade #2: Add 2 cores to each host; or
  • Upgrade #3: Add 8 GB of RAM to each host.

Questions

[A.3.4.q5.1] Which upgrade should you pick?


 
 
