Performance: Best Practices

In this section, you will learn about the following:

  • Key concepts on how to make apps perform better

  • What to think about when optimizing for area or runtime

  • Common mistakes

Overview

Spatial is designed to be a DSL that is much higher level than Verilog/VHDL but much lower level than frameworks like TensorFlow and Halide. Because of this, some optimizations such as loop fusion and loop reordering are not part of the Spatial compiler. They are intended to be part of higher level compilers that target Spatial, or the responsibility of the user to do manually. Spatial provides a lot of support in tuning hyper-parameters (see DSE), but leaves structural decisions about the algorithm up to the user.

In this section, we enumerate some key concepts to think about when designing a Spatial app. These topics are roughly sorted from easiest to understand and implement to hardest or most subtle. Which points matter most depends heavily on the algorithm you are designing.

 

Avoid “Short” DRAM transfers

// Good
sram(0::2048) load dram(0::2048)

// Bad (consolidate into single DRAM request)
Foreach(2048 by 64){tile => 
  sram(tile::tile + 64) load dram(tile::tile + 64)
}

// Bad (consider transposing the data structure)
Foreach(4 by 1){col => 
  sram(0::16, col) load dram(0::16, col)
}

// Bad (if possible, flatten DRAM or transpose)
val dram = DRAM[I16](64,4)

// Bad (Use mux for range, rather than predicating the transfer)
if (cond) dram(0::32) store sram
else      dram(32::64) store sram
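
A sketch of the muxed-range alternative to the predicated store above (assuming cond is available as a Bit and the two ranges have equal length, as in the example):

// Good (one store; mux selects the base offset in hardware)
val base = mux(cond, 0, 32)
dram(base::base + 32) store sram

This issues a single store whose address range is resolved at runtime, so only one physical channel is instantiated in the Fringe wrapper.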

The best performance you can get from DRAM is to make requests as long as possible along the leading dimension (last dimension in Spatial syntax). You should amortize the hardware and latency cost of DRAM transactions by avoiding cases where the leading dimension is short.

The DRAM controller can service one long request of contiguous bursts far better than it can service many short requests, even if those short requests happen to be contiguous. If you are using reduced-rank transfers (i.e., loading a 1D slice into a 2D memory), the worst thing you can do for performance is to make the leading dimension the one that is dropped from the view.

Also, each load and store instantiates a physical channel in the parameterized Fringe wrapper for the Spatial app, so try to use them as sparingly as possible.

The snippets on the left demonstrate some of these do’s and don’ts. They may look like silly examples, but these kinds of mistakes creep into the Spatial code of even the most experienced developers as they focus their attention on hand-optimizing other portions of their application.

The section about Streaming controllers below discusses this technique in more detail.

 

Minimize hierarchy depth

 
// Inefficient Structure 
Foreach(N by 1){i => 
    reg := foo(i)
    Foreach(M by 1){j => 
        Foreach(K by 1){k => sram(i,j,k) = bar(i,j,k,reg)}
    }
}





// Better Structure
Foreach(N by 1, M by 1, K by 1){(i,j,k) => 
    if (j == 0 && k == 0) reg := foo(i)
    sram(i,j,k) = bar(i,j,k,reg)
}

In the latest version of Spatial, each controller has a small amount of overhead (~2 cycles) associated with it, due to the handshaking between a child’s enable/done signals and the acknowledgement from its parent controller. This means that as your controller hierarchy gets deeper, you accumulate inefficiency at each level.

Consider the snippet to the left, where we have a triple-nested controller structure. In the first example, we set the value of a register based on the result of some function foo of i. This register write is a sibling of the controller that loops over M, and because Spatial requires that the children of a controller be either all primitives (e.g., a register write) or all controllers (e.g., a Foreach), it will wrap the register write in its own unit pipe. The outermost controller that iterates over N becomes a two-stage controller with a pipelined schedule. Since the Foreach subtree most likely runs for many more cycles per iteration of i than the register write, this unit pipe only adds extra cycles when the loop is not in steady state.

In the innermost loop, we update the sram based on some function bar. If we assume there are 2 cycles of overhead per controller, this loop contributes an extra 2*M*N cycles of overhead to the total runtime, since it is entered M*N times. The Foreach that runs over M contributes another 2*N cycles of overhead (we are handwaving over the contribution of the injected unit pipe for the register write).
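
To make this concrete, here is a rough overhead model under the 2-cycles-per-controller assumption (the values of N and M below are purely illustrative):

overhead ≈ 2*M*N   (inner Foreach over K, entered M*N times)
         + 2*N     (middle Foreach over M, entered N times)

For N = 64 and M = 64, this is 2*4096 + 2*64 = 8320 extra cycles, almost all of which a single fused controller avoids.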

We now show a “better” way to express this structure. What we show is an optimization called “Loop Perfection” (Pu et al). For now, we require the user to do this explicitly, but this will be done automatically by the compiler in future releases of Spatial.

The first thing we should attempt is to combine as many levels of “perfectly” nested loops as possible into one controller, which generates a single counter chain in hardware bound to a single controller, rather than individual counter chains each tied to their own controller. Second, we must predicate the register write with the condition under which we actually want it to update. In this snippet, we are left with a single inner controller that contains primitives only, so as long as the functions do not contain any accumulations (reads and writes to the same memory in a single iteration), this controller will have an initiation interval of 1 and we will be able to issue new values of i, j, and k every cycle.

Note that a common mistake is to ignore loop carry dependencies when optimizing code like this. For example, if the register write happens after the Foreach controller on M, it is possible that the coarse-grain pipelining in the original structure will result in different behavior than the fully pipelined inner controller of the second structure.
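
For example, if the original code wrote the register after the inner loops instead of before them, the predicate in the fused version must move to the last iteration rather than the first (a sketch, assuming the same foo, bar, and bounds as above):

// reg is updated *after* the work for iteration i in the original,
// so the fused version predicates on the final (j,k) pair instead
Foreach(N by 1, M by 1, K by 1){(i,j,k) => 
    sram(i,j,k) = bar(i,j,k,reg)
    if (j == M-1 && k == K-1) reg := foo(i)
}

Even then, the fully pipelined inner controller may overlap iterations of i differently than the coarse-grain pipeline did, so reads of reg must still be checked for loop-carry hazards.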

 

Pipeline fission and parallelization

// Imbalanced Version
val s = SRAM[Int](3,32)
Foreach(N by 1){R => 
  'PREFETCH.Foreach(3 by 1){i => // Slow load
    if (foo(i,R)) s(i::i+1, 0::32) load d(i::i+1, 0::32 par 16)
  } 
  'COMPUTE.Pipe{ // Fast compute
    m(0) = s(0,bar(0,R))
    m(1) = s(1,bar(1,R))
    m(2) = s(2,bar(2,R))
  }
}
// "Thread" Parallelism
val s = SRAM[Int](3,32)
Foreach(N by 1){R => 
  'PREFETCH.Foreach(3 by 1 par 3){i => 
    if (foo(i,R)) s(i::i+1, 0::32) load d(i::i+1, 0::32 par 16)
  } 
  'COMPUTE.Pipe{
    m(0) = s(0,bar(0,R))
    m(1) = s(1,bar(1,R))
    m(2) = s(2,bar(2,R))
  }
}
// Pipeline Parallelism
val s = List.fill(3)(SRAM[Int](32)) // NOTE: Quirk in Spatial 1.0 may require you to make this a 2D SRAM with first dim set to 1
Foreach(N by 1){R => 
  if (foo(0,R)) s(0)(0::32) load d(0, 0::32 par 16)
  if (foo(1,R)) s(1)(0::32) load d(1, 0::32 par 16)
  if (foo(2,R)) s(2)(0::32) load d(2, 0::32 par 16)
  'COMPUTE.Pipe{
    m(0) = s(0)(bar(0,R))
    m(1) = s(1)(bar(1,R))
    m(2) = s(2)(bar(2,R))
  }
}

Very generally, there are two knobs available when tuning outer controllers: “thread” parallelism and pipeline parallelism. You may want to review the instrumentation guide to understand this section more deeply. To achieve “thread”-level parallelism, Spatial provides the par syntax in controller declarations, i.e. min until max by step par P. It can sometimes be tricky to tune these parameters by hand, to understand their unintended consequences for algorithm correctness and performance, and to visualize how they impact banking analysis. The other knob is less obvious, as it requires the user to carefully rewrite portions of code.

The snippet to the left shows an outer controller that is clearly imbalanced, as its “prefetch” stage runs for much longer than its “compute” stage. The ’PREFETCH. and ’COMPUTE. prefixes are syntax for tagging a controller with a name for debugging purposes and have no impact on the generated hardware. While this example may appear silly, similar cases appear very often in applications like convolution.

The two ways to help balance this pipeline are to use a par factor in the prefetch stage, or split the larger controller into a sequence of shorter controllers.

When we use par to parallelize this, we turn the Foreach controller that runs for 3 iterations into a Parallel controller that runs for a single iteration, and all 3 loads happen simultaneously (with some congestion at the DRAM controller). Regardless, this stage will run roughly 3 times faster than the unparallelized version, and assuming the outer controller runs in steady state for many iterations, we can expect roughly a 3x improvement in its overall runtime. However, parallelization makes the banking analyzer work harder and can result in significantly more complicated banking hardware (perhaps not in this particular example, but this may be true in more complicated cases).

The other option is to manually fracture the controller and make each iteration its own stage within the outer controller. Without introducing too much metaprogramming here, we unroll the loop by hand. Again, assuming the outer controller runs in steady state for a long time, the bottleneck becomes one of these unrolled stages rather than the sum of all three iterations running in a single controller. In this particular case, we also pay more in total memory, since the different SRAMs need to be buffered to different depths, but each one has a simpler banking scheme (here we assume that our implicit hierarchical banking by row is “simpler,” which again depends very much on the specific code you are working with). Also note that we avoid telescoping our controllers and generating a deeper hierarchy (as advised against in the previous tip). Depending on how deep in the hierarchy this controller sub-tree occurs, this may be an important consideration.

The example used in this section is also a good example of a pipeline that is hopelessly imbalanced. Regardless of parallelization, there is no way to truly balance the many-cycle loads from DRAM with the ~3 cycle “compute” stage. We discuss tips for this situation next.

Buffer reclamation (hopelessly imbalanced pipelines)

Here we will discuss two solutions to “hopelessly imbalanced pipelines,” which are sequences of adjacent controller stages that are heavily imbalanced and no amount of parallelization will cause them to become balanced.

First, it is important to understand what buffering is and why the nesting of controllers in Spatial creates multi-buffer memories. The animations on the right show what happens in a 3-stage pipelined controller where the first stage writes to memory and the last stage reads from that memory. If the memory is not buffered, the second iteration of the first controller will delete data that was not yet touched by the last stage. Even if the intermediate stages that do not access the memory are small or short stages, each one still requires its own buffer for the entire memory structure in order to protect data as it is passed along the pipeline.

Also note that the data in each buffer begins uninitialized, but after the controller finishes, data from “stale” iterations will remain in the SRAM unless explicitly cleared.
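
When stale data from a previous outer iteration would corrupt the computation, one option is to clear the buffer explicitly before it is rewritten. A sketch, assuming the 3 x 32 SRAM s declared in the snippet below:

// Zero out s before the next producer stage refills it
'CLEAR.Foreach(3 by 1, 32 by 1){(i,j) => s(i,j) = 0}

Keep in mind that this adds another child stage to the parent controller, with the buffering consequences discussed in this section.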

 
// "PREFETCH" always takes much longer than "COMPUTE", 
//   and there are no other stages in the parent
val s = SRAM[Int](3,32)
Foreach(N by 1){R => 
  'PREFETCH.Foreach(3 by 1){i => // Slow load
    s(i::i+1, 0::32) load d(i::i+1, 0::32 par 16)
  } 
  'COMPUTE.Pipe{ // Fast compute
    m(0) = s(0,bar(0,R))
    m(1) = s(1,bar(1,R))
    m(2) = s(2,bar(2,R))
  }
}
 
// Mostly balanced except for a few neighborhoods
val sram = SRAM[Int](256,256)
Foreach(N by 1){R => 
  'LONGSTAGE1.Foreach(...){i => sram(R,i) = ...} // 200 cycs/iter
  if (cond(R)) 'LONGSTAGE2.Foreach(...){i => ...} // Unit pipe inserted (3 cycs/iter) + 200 cycs/iter
  reg := foo(R) // Unit Pipe inserted, 3 cycs/iter
  'LONGSTAGE3.Foreach(reg.value by 1){i => ...} // 200 cycs/iter
  reg := bar(R) // Unit Pipe inserted, 3 cycs/iter
  'LONGSTAGE4.Foreach(...){i => ... = sram(R,i)} // 200 cycs/iter
}
 
// Mostly balanced with reclaimed buffer space
val sram = SRAM[Int](256,256)
Foreach(N by 1){R => 
  'LONGSTAGE1.Foreach(...){i => sram(R,i) = ...} // 200 cycs/iter
  'CONSOLIDATED1.Pipe{ // 3 + 200 cycs/iter
    if (cond(R)) 'LONGSTAGE2.Foreach(...){i => ...} // Unit pipe inserted (3 cycs/iter) + 200 cycs/iter
  }
  'CONSOLIDATED2.Pipe{ // 3 + 200 cycs/iter
    reg := foo(R) // Unit Pipe inserted, 3 cycs/iter
    'LONGSTAGE3.Foreach(reg.value by 1){i => ...} // 200 cycs/iter
  }
  'CONSOLIDATED3.Pipe{ // 3 + 200 cycs/iter
    reg := bar(R) // Unit Pipe inserted, 3 cycs/iter
    'LONGSTAGE4.Foreach(...){i => ... = sram(R,i)} // 200 cycs/iter
  }
}
 

The best way to resolve such imbalance is to rewrite the entire controller (potentially using meta-programming) and convert the parent controller into a fully-pipelined inner controller. In many cases, this will not be easy or possible. In those cases, there is a “trick” you can use to minimize the inefficiency.

The snippet to the left is an example of an outer controller whose child stages are so imbalanced that there is no way, whether through parallelization or pipelining, to balance them. The compute stage will only run for a few cycles. The important thing to note is that these two stages are the only children of the controller.

In such a case, you should consider tagging the parent controller as Sequential.Foreach. Consider whether the algorithm can tolerate the extra cycles and where this sub-tree sits in the hierarchy (i.e., the extra cycles from a sequential controller, relative to a pipelined one, are multiplicatively accumulated up to the root). This change saves area, since the memory s will no longer need to be double-buffered. Depending on the volume of the buffered memories, how wide the outer controller pipeline is (i.e., the buffer depth), and how much the extra cycles hurt performance (especially if this sub-tree is not a bottleneck), this “de-optimization” can be useful for reclaiming valuable resources.
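
A sketch of the sequentialized version of the snippet above (same s, d, m, and bar as before):

// Stages run back-to-back, so s no longer needs double-buffering
val s = SRAM[Int](3,32)
Sequential.Foreach(N by 1){R => 
  'PREFETCH.Foreach(3 by 1){i => 
    s(i::i+1, 0::32) load d(i::i+1, 0::32 par 16)
  } 
  'COMPUTE.Pipe{
    m(0) = s(0,bar(0,R))
    m(1) = s(1,bar(1,R))
    m(2) = s(2,bar(2,R))
  }
}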

The next snippet shows a case where you may have a wide parent controller with many children, and there are only certain neighborhoods where the pipeline is heavily imbalanced. Assume that we have used instrumentation to get the cycle counts displayed.

The parent controller in this case has 7 children. Four of these are obvious from the source code (LONGSTAGEs 1–4). Two more become clear when you recall from previous tutorials that a controller can contain either 1) only primitives or 2) other controllers plus transient primitives (those that take no resources or latency, such as bit slicing and register reads). The two register writes in this snippet contain whatever primitive nodes are inside foo and bar, and a register write itself is a node that takes resources and a cycle to complete, so the compiler wraps each of them in its own unit pipe to abide by this rule. Finally, the 7th controller is generated by the if condition. Because we must perform some computation on R to get cond(R) (even if this is just == or !=), and this result must be resolved before we can decide to run LONGSTAGE2, the compiler injects a unit pipe just before LONGSTAGE2. This unit pipe performs the primitive computations and writes the result to an internally generated register. That register can then be consumed as a primitive in the parent controller (since the read is a transient node) and used to mask the next controller.

Notice that the sram is written in the “first” stage and read in the “last” stage. It appears that it should be quadruple-buffered, until you account for all of the injected unit pipes: you will end up with a 7-buffered memory. Depending on how large the sram is, this can consume far more resources than it is worth, since some of these stages are simple unit pipes. From a runtime point of view, these tiny stages do not change the overall runtime as long as N is large enough to keep the parent controller in steady state for a while. From a resource point of view, however, they are very costly. The solution is to manually group each “long” controller with its neighboring “short” controllers, as shown to the left. This turns the parent into a 4-stage controller, saving 3 copies of our “large” sram while adding only a minimal penalty to runtime.

Manually unroll certain cases

// Inefficient Structure
Foreach(N by 1){i => 
    sram(i) = Reduce(Reg[T])(M by 1 par M){j => 
        foo(j)
    }{case (a,b) => bar(a,b)}
}
      
// Efficient Structure
Foreach(N by 1){i => 
    val fooList: List[T] = List.tabulate(M){j => foo(j)}
    sram(i) = fooList.reduceTree{case(a,b) => bar(a,b)}
}

In this example, we store the result of a reduction into an SRAM. We parallelize the iterator of the Reduce by a value equal to the number of iterations the controller runs for. In other words, this loop will be fully unrolled, and the compiler will replace the counter and controller with the primitive operations of the resulting reduction tree. It appears that the Foreach over N should become an inner loop, fully pipelined with an initiation interval of 1, but there is a caveat in Spatial 1.0. Because the store to the SRAM, which is a primitive, is a sibling of the Reduce controller, it is wrapped in a unit pipe before unrolling. When the code is unrolled, we end up with two unit pipes: the first is the reduction tree that writes to a register, and the second is the sram write. Since the outer Foreach is still an outer pipe, only one value of i is issued at a time, and the entire reduction tree for that iteration must drain before the next is issued.

Without spoiling too much from the meta-programming tutorial, here we show how to leverage Scala to manually unroll the reduction for us. This structure is equivalent to writing out foo(j) for each value of j, and then combining them pair-wise in a tree.

Note that you must import utils.implicits._ to gain access to functions such as reduceTree (many-to-one) and combinationTree (many-to-many).

Make accesses affine

// 1D Example
Foreach(N by 1 par 4){i => sram(i) = ...}
Foreach(N by 1 par 2){i => ... = sram(i)}
 1D Exhaustive Banking Search

// 2D Example
Foreach(M by 1 par 2, N by 1 par 4){(i,j) => ... = sram(i,j)}
 2D Banking Example

// 2D Random Row Example
Foreach(M by 1 par 2, N by 1 par 4){(i,j) => ... = sram(foo(i),j)}
// Not Affine
Foreach(-1 until N by 1){i => ... = sram(mux(i<0,0,i))}
// Affine
Foreach(-1 until N by 1){i => ... = mux(i<0, sram(0), sram(i))}

// Not Affine
Foreach(0 until N + 1 by 1){i => ... = sram(i%N)}
// Affine
Foreach(0 until N + 1 by 1){i => ... = mux(i >= N, sram(i-N), sram(i)) }

An affine access pattern is one whose n-dimensional address, f(x), is in the form

f(x) = Ax + b,

where x is a vector of m elements, A is an n x m matrix, and b is a vector of n elements. The compiler is able to deconstruct accesses to addressable memories into affine patterns and compute the best banking scheme for the memory. We will not discuss how this works in this tutorial, but for more details on the math and rules behind banking analysis, refer to Appendix A of the Spatial paper. For now, we will just discuss affine access patterns and review some specific strategies for using them in cases where a programmer may otherwise not.
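
As a concrete instance of this form (the access below is illustrative, not taken from the snippets), the 2D access sram(2*i + 1, j) over the iterator vector x = (i, j) is affine with

A = | 2  0 |      b = | 1 |
    | 0  1 |          | 0 |

whereas sram(i*i, j) and sram(foo(i), j) are not affine, since i*i and an arbitrary foo(i) cannot be expressed as a linear combination of the iterators plus a constant.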

The compiler must determine a fixed banking scheme such that all accesses (readers and writers) are able to connect all of their lanes to the appropriate banks simultaneously at any given cycle during execution.

The snippets and images to the left show some simple examples of 1D and 2D banking schemes where the accesses are affine. We build on the 2D example to show what happens when one dimension is not affine.

In the snippet to the left, we read an sram with an affine column index but a “random” row access. If foo is not an affine function, then we cannot guarantee that for a single iteration, the two rows will not have bank collisions. The animation demonstrates how duplicating the memory is a solution to this problem. Note that a writer to an SRAM must be bankable, but readers to an SRAM can be decomposed across multiple duplicates. A write to an SRAM is dispatched to all duplicates, but a read is dispatched to one duplicate in order to avoid collisions.

Wherever possible, you should force your accesses to be affine in order to get the best banking scheme. While this is not always possible, the snippet to the left shows a specific example where it is possible to convert an access that has a mux or a modulus operation into one that is purely affine. Note that the current release of Spatial does not include modulus as an affine operator, but this is among the things that will be added in future releases.

While these examples turn non-affine accesses into affine accesses to avoid duplication, it sometimes requires intuition to decide if the cost of extra muxes and access ports on a single SRAM is more or less expensive than duplicating the entire SRAM. If the SRAM is small, then it may make more sense to pay for the duplicates rather than the extra resolution logic.

Preserving lockstep

// Unrolling i
Foreach(N by 1 par 2){i => 
    Foreach(64 by 16){j => 
        Reduce(reg)(M by 1 par 4){k => sram1(j) * foo(i,j,k)}{_+_}
    }
}
// Unrolling i, variable runtime controller
Foreach(N by 1 par 2){i => 
    val sram2 = SRAM[T](16)
    Foreach(64 by 16){j => 
        sram2 load dram(i, j::j+16)
        Reduce(reg)(M by 1 par 4){k => sram1(j) * foo(i,j,k,sram2)}{_+_}
    }
}
// Unrolling i, preserving lockstep
val sram2 = SRAM[T](2,64)
Foreach(N by 1 par 2){i =>
    if (i % 2 == 0) sram2 load dram(i::i+2, 0::64)
    Foreach(64 by 16){j => 
        Reduce(reg)(M by 1 par 4){k => sram1(j) * foo(i,j,k,sram2)}{_+_}
    }
}

The previous section demonstrated how to “clean up” the access patterns to a memory to assist with banking. Even when those rules are followed, there is another concept to consider when trying to minimize resource usage. The banking analysis looks at the iterators in the access pattern to a memory, as well as the unroll ancestry of those iterators. For example, in the snippet on the left, we unroll i by 2 and create two instances of the loop over j and its innermost Reduce. Each instance receives the value of i that comes from its lane.

When we attempt to bank sram1, we will see that the access to sram1 is loop invariant with iterators i and k. This means that rather than creating 8 instances of the read (parallelizations of 2 and 4 in the controller ancestry), we can read sram1 once and broadcast this result across all lanes of i and k. The key that allows us to do this is the fact that the descendants of loop j run for a statically known, identical number of cycles.

Now, consider the addition to the snippet on the left. Here, we add a line that loads data from dram into sram2 before we enter the Reduce stage. The outer Foreach becomes a two-stage controller. Because a load turns into a Stream controller with variable latency, as it depends on responses from the DRAM controller, the different lanes of i will dephase from each other. We are no longer guaranteed that the two copies of the j loop will have their js incrementing in lockstep.

In this example, the declaration of sram2 will get unrolled with the outermost controller, so we will have a separate 1D copy of the memory for each unrolled body.

We can rewrite this snippet another way by making sram2 a 2D memory which would be the concatenation of the two unrolled 1D memories and move the load out of the middle controller. The animation shows how this change results in the innermost functions operating in lockstep once again.

In general, you should aim to consolidate your DRAM transfers and memories. This results in less hardware for managing independent DRAM streams, more efficient banking of a single memory (rather than unique banking for many smaller, unrolled memories), and preservation of lockstep execution.

Tricky cases of CSE

val x = SRAM[Int](8,2) // 8 x (p, 1-p)
// Populate x
val combinedList = List.tabulate(256){i => // 2^8 combined p = (p0p1p2p3, p0p1p2(1-p3), p0p1(1-p2)p3, ...) 
    val t0 = if (i % 2 == 0) x(0,0) else x(0,1)
    val t1 = if ((i/2) % 2 == 0) x(1,0) else x(1,1)
      ...
    val t7 = if ((i/128) % 2 == 0) x(7,0) else x(7,1)
    t0*t1*t2*t3*t4*t5*t6*t7
}
val combinedList = List.tabulate(8){i => 
    List(x(i,0), x(i,1))
}.combinationTree{_*_}

In certain cases, the current version of Spatial expects you to perform CSE (common sub-expression elimination) manually. Consider the example to the left, where we want to combine a set of independent marginal probabilities into a joint distribution. In the naive implementation, we enumerate each possible combination of marginal probabilities by representing each of the 256 elements with 8 binary bits. We use these bits to index into the structure containing the probabilities.

The problem with this implementation is that each of the 256 elements is some product of 8 primitive values. These elements can be decomposed into pairwise multiplications that are reused for other elements, but this information is not captured here. In the current version of Spatial, the user must lay out the computation with CSE done explicitly. This will be done automatically in a future release of Spatial.
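
As a sketch of what the explicit sharing looks like for just the first two bits (the name pairs01 is hypothetical; combinationTree generalizes this pattern to all 8 bits):

val pairs01 = List(x(0,0)*x(1,0), x(0,0)*x(1,1), x(0,1)*x(1,0), x(0,1)*x(1,1))
// Each of these 4 partial products is computed once and reused by all 64
// combinations of the remaining 6 bits, instead of being recomputed 64 times.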

For this particular example, there is a utility function, combinationTree{f: (T,T) => T}, built into the language utils. You can define your own functions for cases where CSE is clearly required, but for this tutorial we will just use the built-in one.