General Matrix Multiply (GEMM)
In this section, you will learn about the following components in Spatial:
MemReduce and MemFold
Multi-dimensional Banking
Note that a large collection of Spatial applications can be found here.
Overview
General Matrix Multiply (GEMM) is a common algorithm in linear algebra, machine learning, statistics, and many other domains. It provides a more interesting trade-off space than the previous tutorial, as there are many ways to break up the computation. This includes using blocking, inner products, outer products, and systolic array techniques. In this tutorial, we will demonstrate how to build a blocked GEMM app that uses outer products, and leave it to the user to try and build a GEMM version that uses inner products. Later tutorials will show how to use shift registers and systolic arrays in other applications, but the same techniques can be retroactively applied to this tutorial on GEMM as well.
The animation to the right shows the basic way to compute C = A x B using outer products.
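For reference, the outer-product view simply regroups the usual dot-product definition of matrix multiply:

C(i,j) = sum over k of A(i,k) * B(k,j)

Read column-by-row instead of row-by-column, each k pairs the column A(:,k) with the row B(k,:) to form a rank-1 (outer product) update that touches every element of C, and C is the sum of these updates over all k.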
Basic implementation
import spatial.dsl._

@spatial object MatMult_outer extends SpatialApp {
  type X = FixPt[TRUE,_16,_16]

  def main(args: Array[String]): Unit = {
    // Get sizes for matrices from command line
    val m = args(0).to[Int]
    val n = args(1).to[Int]
    val p = args(2).to[Int]

    // Generate data for input matrices A and B, and initialize C
    val a = (0::m, 0::p){(i,j) => ((i + j * p) % 8).to[X] }
    val b = (0::p, 0::n){(i,j) => ((i + j * n) % 8).to[X] }
    val c_init = (0::m, 0::n){(_,_) => 0.to[X] }

    // Communicate dimensions to FPGA with ArgIns
    val M = ArgIn[Int]
    val N = ArgIn[Int]
    val P = ArgIn[Int]
    setArg(M,m)
    setArg(N,n)
    setArg(P,p)

    // Create pointers to matrices
    val A = DRAM[X](M, P)
    val B = DRAM[X](P, N)
    val C = DRAM[X](M, N)

    // Set up parallelizations
    val op = 1
    val mp = 1
    val ip = 1

    val bm = 16
    val bn = 64
    val bp = 64

    setMem(A, a)
    setMem(B, b)
    setMem(C, c_init)

    Accel {
      // Tile by output regions in C
      Foreach(M by bm par op, N by bn par op) { (i,j) =>
        val tileC = SRAM[X](bm, bn)

        // Prefetch C tile
        tileC load C(i::i+bm, j::j+bn par ip)

        // Accumulate on top of C tile over all tiles in P dimension
        MemFold(tileC)(P by bp) { k =>
          val tileA = SRAM[X](bm, bp)
          val tileB = SRAM[X](bp, bn)
          val accum = SRAM[X](bm, bn)

          // Load A and B tiles
          Parallel {
            tileA load A(i::i+bm, k::k+bp)
            tileB load B(k::k+bp, j::j+bn)
          }

          // Perform matrix multiply on tile
          MemReduce(accum)(bp by 1 par mp){ kk =>
            val tileC_partial = SRAM[X](bm,bn)
            Foreach(bm by 1, bn by 1 par ip){ (ii,jj) =>
              tileC_partial(ii,jj) = tileA(ii,kk) * tileB(kk,jj)
            }
            tileC_partial
          }{_+_}
        }{_+_}

        // Store C tile to DRAM
        C(i::i+bm, j::j+bn par ip) store tileC
      }
    }

    // Fetch result on host
    val result = getMatrix(C)

    // Compute correct answer
    val gold = (0::m, 0::n){(i,j) =>
      Array.tabulate(p){k => a(i,k) * b(k,j)}.reduce{_+_}
    }

    // Show results
    println(r"expected cksum: ${gold.map(a => a).reduce{_+_}}")
    println(r"result cksum: ${result.map(a => a).reduce{_+_}}")
    printMatrix(gold, "Gold: ")
    printMatrix(result, "Result: ")
    assert(gold == result)
  }
}
The app on the left shows how to perform matrix multiplication using outer products and tiling. We will walk through the new constructs introduced in this code. The animation below shows the specific tiling scheme used in this example.
In order to generate the a, b, and c_init host data structures, we use the syntax that creates a Matrix. Specifically, this is (0::m, 0::p){(i,j) => … }. This syntax works for data structures of up to 5 dimensions. To create c_init, we mark the iterators with underscores because they are not explicitly used in the function, as is standard Scala practice for unused parameters.
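For example, the same constructor syntax extends to higher dimensions. The following is a minimal sketch in which the sizes d0, d1, d2 and the fill function are made up for illustration:

val t = (0::d0, 0::d1, 0::d2){(i,j,k) => (i + j + k).to[X] }   // hypothetical 3-D host tensor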
For now, we set all of our parallelization factors to 1. In the next section, we will discuss these numbers further and explain the implications they have on the generated hardware.
In this version of matrix multiply, we choose to step across different tiles of C in the outermost loop, scanning the memory horizontally. There are other orderings of loops that will result in valid matrix multiply apps, but this is one of the variations that minimizes transactions with DRAM.
Because it is a general matrix multiply, there is no guarantee in the real world that matrix C is initialized to 0. Therefore, the correct thing to do is to accumulate on top of what is already in C. The outermost loop in this app is a three-stage coarse-grain pipeline. This prefetch line is the first stage.
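To see this coarse-grain pipelining pattern in isolation, here is a minimal, hypothetical sketch that is unrelated to GEMM: the body of the Foreach below has the same load-compute-store shape over a 1-D DRAM. The memory V and its size are made up for illustration, and the snippet assumes the usual SpatialApp boilerplate around it.

val V = DRAM[X](1024)
Accel {
  Foreach(1024 by 64) { i =>
    val tile = SRAM[X](64)
    tile load V(i::i+64)                                  // Stage 1: fetch a tile
    Foreach(64 by 1){ j => tile(j) = tile(j) * 2.to[X] }  // Stage 2: compute on it
    V(i::i+64) store tile                                 // Stage 3: write it back
  }
}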
We choose to use the MemFold controller here because we do not want the first iteration (k = 0) to overwrite what we just prefetched into tileC. We also create a local accum SRAM that will hold the partial result of multiplying a tile of A with a tile of B. This entire MemFold composes the second stage of the outermost pipeline.
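To isolate the difference between the two controllers, here is a minimal, hypothetical sketch (not part of this app; the sizes and values are made up). MemReduce overwrites its accumulator with the reduction of its map results, while MemFold folds that reduction on top of whatever the accumulator already held, which is exactly the behavior tileC needs after the prefetch.

Accel {
  val acc = SRAM[X](16)
  Foreach(16 by 1){ j => acc(j) = 100.to[X] }   // pretend these are prefetched values
  // MemReduce: acc(j) becomes 0 + 1 + 2 + 3 = 6; the old 100 is discarded.
  MemReduce(acc)(4 by 1){ i =>
    val tmp = SRAM[X](16)
    Foreach(16 by 1){ j => tmp(j) = i.to[X] }
    tmp
  }{_+_}
  // MemFold: acc(j) becomes 6 + (0 + 1 + 2 + 3) = 12; the previous contents survive.
  MemFold(acc)(4 by 1){ i =>
    val tmp = SRAM[X](16)
    Foreach(16 by 1){ j => tmp(j) = i.to[X] }
    tmp
  }{_+_}
}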
Here we perform matrix multiply on the two tiles using outer products. The outermost loop here iterates across the inner, shared dimension of the two tiles. In other words, kk selects a column from tileA and a row from tileB. The MemReduce and MemFold controllers must return an SRAM from their map function, so we create tileC_partial to serve that purpose.
The innermost loop of this algorithm iterates over the rows of tileA and the columns of tileB. This is effectively an outer product (exhaustive pairwise multiplication) of the column and row vectors selected by kk.
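For a concrete picture, suppose bm = bn = 2 and kk selects the column (a0, a1) from tileA and the row (b0, b1) from tileB. One iteration of the map function then fills tileC_partial with the rank-1 tile

a0*b0   a0*b1
a1*b0   a1*b1

and the surrounding MemReduce sums one such tile per kk into accum.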
MemReduce returns its accumulator SRAM, so this particular MemReduce is the output of the map function for the outer MemFold. This is why there are two {_+_} lambdas that may at first appear redundant: the inner one combines the partial tiles inside the MemReduce, while the outer one folds the finished accum into tileC.
The final step, and the third stage of the outermost pipeline, is the store that transfers tileC back to DRAM. Because tileC is written by the prefetch, accumulated into by the MemFold, and read by the store, each of the three stages needs its own copy of the tile in flight. The animation below demonstrates how to think of tileC as a triple-buffered memory.
We use the getMatrix and printMatrix functions here to fetch the data from DRAM and display the contents of these data structures. Similarly, you can use getTensor3, getTensor4, and getTensor5, as well as their obvious print counterparts, for higher-dimensional structures.
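As a sketch of the higher-dimensional variants, the snippet below assumes a 3-D DRAM T whose name and sizes are made up, and assumes the print counterpart of getTensor3 is named printTensor3 as suggested above:

val T = DRAM[X](16, 16, 16)
val t = getTensor3(T)           // fetch the 3-D contents of T onto the host
printTensor3(t, "Result: ")     // assumed print counterpart of getTensor3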
Multi-dimensional banking
import spatial.dsl._

@spatial object MatMult_outer extends SpatialApp {
  type X = FixPt[TRUE,_16,_16]

  def main(args: Array[String]): Unit = {
    val m = args(0).to[Int]
    val n = args(1).to[Int]
    val p = args(2).to[Int]

    val a = (0::m, 0::p){(i,j) => ((i + j * p) % 8).to[X] }
    val b = (0::p, 0::n){(i,j) => ((i + j * n) % 8).to[X] }
    val c_init = (0::m, 0::n){(_,_) => 0.to[X] }

    val M = ArgIn[Int]
    val N = ArgIn[Int]
    val P = ArgIn[Int]
    setArg(M,m)
    setArg(N,n)
    setArg(P,p)

    val A = DRAM[X](M, P)
    val B = DRAM[X](P, N)
    val C = DRAM[X](M, N)

    // *** Set mp and ip > 1
    val op = 1
    val mp = 2
    val ip = 4

    val bm = 16
    val bn = 64
    val bp = 64

    setMem(A, a)
    setMem(B, b)
    setMem(C, c_init)

    Accel {
      Foreach(M by bm par op, N by bn par op) { (i,j) =>
        val tileC = SRAM[X](bm, bn)

        tileC load C(i::i+bm, j::j+bn par ip)

        MemFold(tileC)(P by bp) { k =>
          val tileA = SRAM[X](bm, bp)
          val tileB = SRAM[X](bp, bn)
          val accum = SRAM[X](bm, bn)

          Parallel {
            tileA load A(i::i+bm, k::k+bp)
            // *** Parallelize writer by 8
            tileB load B(k::k+bp, j::j+bn par 8)
          }

          MemReduce(accum)(bp by 1 par mp){ kk =>
            val tileC_partial = SRAM[X](bm,bn)
            Foreach(bm by 1, bn by 1 par ip){ (ii,jj) =>
              tileC_partial(ii,jj) = tileA(ii,kk) * tileB(kk,jj)
            }
            tileC_partial
          }{_+_}
        }{_+_}

        C(i::i+bm, j::j+bn par ip) store tileC
      }
    }

    val result = getMatrix(C)

    val gold = (0::m, 0::n){(i,j) =>
      Array.tabulate(p){k => a(i,k) * b(k,j)}.reduce{_+_}
    }

    println(r"expected cksum: ${gold.map(a => a).reduce{_+_}}")
    println(r"result cksum: ${result.map(a => a).reduce{_+_}}")
    printMatrix(gold, "Gold: ")
    printMatrix(result, "Result: ")
    assert(gold == result)
  }
}
Here we demonstrate how banking works in Spatial, using tileB as our example memory. We base our banking calculation on "Theory and algorithm for generalized memory partitioning in high-level synthesis" (Wang et al.), with minor modifications and improvements to the search patterns. Essentially, we attempt to bank a memory by looking at all of its accesses and finding a banking scheme that has no collisions. For more details, see Appendix A of the Spatial paper. We first attempt to bank an N-dimensional memory flatly, as if it were 1-dimensional, and only if that fails or proves unnecessarily costly in resource utilization do we attempt to bank it hierarchically, dimension by dimension.
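As a minimal, hypothetical illustration of why banking is needed at all (the memories s and out and the loops below are made up, not taken from the GEMM app): a read parallelized by 4 touches four consecutive addresses in the same cycle, so the compiler must split s into at least four banks, for example with a flat scheme bank = address mod 4.

Accel {
  val s   = SRAM[X](64)
  val out = SRAM[X](64)
  Foreach(64 by 1){ i => s(i) = i.to[X] }
  // The par 4 read of s touches addresses i, i+1, i+2, i+3 concurrently,
  // so s must be banked so that those four addresses land in different banks.
  Foreach(64 by 1 par 4){ i => out(i) = s(i) * 2.to[X] }
}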
Here we set mp = 2 and ip = 4. If you inspect the part of the app where these parallelizations are used, you will see that mp parallelizes an outer pipe and ip parallelizes an inner pipe. The animation below shows one possible banking scheme for tileB under these parallelizations (considering only this read). Note that it is not the only valid scheme; other hierarchical and flat schemes exist.
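To make one such scheme concrete (worked out by hand, not necessarily what the compiler picks): a hierarchical scheme could split the row dimension of tileB into 2 banks with bank_row = kk mod 2 (for mp = 2) and the column dimension into 4 banks with bank_col = jj mod 4 (for ip = 4), giving 2 x 4 = 8 banks in total. The eight concurrent reads tileB(kk + r, jj + c), with r in {0, 1} and c in {0, 1, 2, 3}, then all land in distinct banks.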
Here we choose to parallelize the writer to tileB by 8. This means the banking scheme above is no longer valid: we must statically bank the SRAM so that there are no bank collisions for any of the accesses to the memory. In this kind of situation, the compiler will likely choose a flat banking scheme, as demonstrated by the animation below. Note that this animation only shows the reader with mp = 2 and ip = 1, but a motivated reader could experiment with these parameters in Spatial and see what happens to the memory.
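To make the flat case concrete as well (again worked out by hand, not necessarily the compiler's choice): the writer touches tileB(row, c), ..., tileB(row, c+7) in one cycle, and with mp = 2 and ip = 4 the reader touches tileB(kk + r, jj + c) for r in {0, 1} and c in {0, 1, 2, 3}. One flat scheme that avoids collisions for both patterns is 8 banks with bank(row, col) = (4*row + col) mod 8: the eight write addresses map to eight consecutive residues, and so do the eight read addresses, so every concurrent access lands in its own bank.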