Convolutions

In this section, you will learn about the following components in Spatial:

LineBuffer
ShiftRegister
LUT
Spatial Functions

Note that a large collection of Spatial applications can be found here.

Overview

Convolution is a common algorithm in linear algebra, machine learning, statistics, and many other domains. The tutorials in this section will demonstrate how to use the building blocks that Spatial provides to do convolutions.

Specifically, we will build a basic differentiator for a time-series using sliding window averaging for an example 1D convolution. The animation to the right demonstrates this kind of 1D convolution with a square window kernel.

We will also build a 2D convolution application. The animation to the right demonstrates the 2D convolution with the padding that we will use (credit https://github.com/vdumoulin/conv_arithmetic). Alternatively, Spatial supports 2D convolutions as matrix multiplies. See (TODO: Link to “toeplitz” API) for more details.

Basic implementation (1D)

import spatial.dsl._

@spatial object Differentiator extends SpatialApp {

  def main(args: Array[String]): Unit = {
    type T = FixPt[TRUE,_16,_16]
    
    // Set tile size
    val coltile = 64
    
    // Load data
    val data = loadCSV1D[T](s"$DATA/slac/slacsample1d.csv", ",")
    
    // Set full size of input vector for use by FPGA
    val memcols = ArgIn[Int]
    setArg(memcols, data.length.to[Int])
    
    // Create input and output DRAMs
    val srcmem = DRAM[T](memcols)
    setMem(srcmem, data)
    val dstmem = DRAM[T](memcols)

    // Set low pass filter window size
    val window = 16

    Accel {
      // Create shift register window
      val sr = RegFile[T](window)
    
      // Allocate memories for input and output data
      val rawdata = SRAM[T](coltile)
      val results = SRAM[T](coltile)

      // Work tile by tile on input vector
      Foreach(memcols by coltile) { c =>
        
        // Fetch this tile
        rawdata load srcmem(c::c+coltile)
                                   
        // Scan through tile to get deriv
        Foreach(coltile by 1) { j =>
        
          // Shift next element into sliding window (new data lands at address 0)
          sr <<= rawdata(j)
          
          // Sum the points in the first half of the window (newest samples)
          val mean_right = Reduce(Reg[T](0.to[T]))(window/2 by 1) { k => sr(k) }{_+_}
                               
          // Sum the points in the second half of the window (oldest samples)
          val mean_left = Reduce(Reg[T](0.to[T]))(window/2 by 1) { k => sr(k+window/2) }{_+_}
                       
          // Subtract the half sums and divide by the half-window size to average
          val slope = (mean_right - mean_left) / (window/2).to[T]
          
          // Store result (if all data in window is valid)
          val idx = j + c
          results(j) = mux(idx < window, 0.to[T], slope)
        }
        dstmem(c::c+coltile) store results
      }
    }


    // Extract results from accelerator
    val results = getMem(dstmem)

    // Read answer
    val gold = loadCSV1D[T](s"$DATA/slac/deriv_gold.csv", ",")

    // Create validation checks and debug code
    printArray(results, "Results:")
    val margin = 0.5.to[T]

    val cksum = gold.zip(results){case (a,b) => abs(a-b) < margin}.reduce{_&&_}
    assert(cksum)
  }
}

In this example, we implement a 1D convolution whose kernel is one period of a square wave. The result is essentially a low-pass-filtered derivative of the input data. The rest of this section discusses the new syntax used in this app. To the right is a rough animation of what is happening in this app.

Spatial supports reading and writing CSV and binary files at runtime. Here, our data is stored in a 1D CSV file with a comma (,) as the delimiter. The $DATA variable is substituted when Spatial compiles the app, using the TEST_DATA_HOME environment variable on your system. You may point this to wherever your data lives, or clone our data repository if you are testing with our example code.

Here we create a shift register called sr. A RegFile in Spatial is an N-dimensional addressable memory similar to an SRAM. The only difference is that a RegFile is built from registers, while an SRAM will generally be placed in block RAM on an FPGA. Because a RegFile is an array of registers, it is inherently fully banked.

In order to support vectors of any size, we have tiled by coltile. We first fetch the tile we are working on, and then iterate element by element and shift the data into the RegFile. This shift operation pushes old data one address higher and puts the new data in address 0.
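The shift behavior described above can be modeled in a few lines of ordinary Scala. This is only a software sketch for building intuition, not Spatial code; the object and method names (ShiftWindowModel, shiftIn, shiftAll) are made up for illustration, and the window is assumed to be non-empty.

```scala
// Plain-Scala model of a 1D shift register (not Spatial hardware code).
object ShiftWindowModel {
  // Returns a new window with `x` at address 0 and every older value
  // moved one address higher; the value at the last address is dropped.
  // Assumes the window is non-empty.
  def shiftIn(window: Vector[Int], x: Int): Vector[Int] =
    x +: window.init

  // Shifts a sequence of samples in, one at a time, in the order given.
  def shiftAll(window: Vector[Int], samples: Seq[Int]): Vector[Int] =
    samples.foldLeft(window)(shiftIn)
}
```

For example, shifting the samples 1, 2, 3 into a four-entry window of zeros leaves 3 at address 0, 2 at address 1, and 1 at address 2.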

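Rather than storing a kernel explicitly, this app uses a square wave. In time order, samples in the older half of the window are weighted by -1 and samples in the newer half by +1. Because new data enters at address 0, the newer half sits at RegFile addresses 0 to window/2 - 1 and the older half at addresses window/2 and above. We can therefore sum the two halves separately, subtract the older sum from the newer one, and divide by window/2 to get the difference of the two half-window means, which serves as the derivative estimate.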
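To see why no stored kernel is needed, the following plain-Scala sketch compares an explicit dot product against a +1/-1 kernel with the half-sum shortcut. It follows the subtraction order in the app's code, where the half at the lower RegFile addresses is the positive term. The names (SquareKernelModel, kernel, dot, halfSums) are hypothetical, and an even window length is assumed.

```scala
// Plain-Scala check that a square-wave kernel reduces to two half sums.
object SquareKernelModel {
  // Explicit kernel over window addresses: +1 for addresses below w/2
  // (newer samples), -1 for the rest (older samples). Assumes w is even.
  def kernel(w: Int): Vector[Int] =
    Vector.tabulate(w)(k => if (k < w / 2) 1 else -1)

  // Output for one window position, computed as a dot product.
  def dot(window: Vector[Int]): Int =
    window.zip(kernel(window.length)).map { case (x, kv) => x * kv }.sum

  // The shortcut: sum each half, then subtract older from newer.
  def halfSums(window: Vector[Int]): Int = {
    val (newer, older) = window.splitAt(window.length / 2)
    newer.sum - older.sum
  }
}
```

Both functions return the same value for any even-length window, so the hardware only needs two reductions and one subtraction instead of a multiplier per tap.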

Until the first window elements have been shifted in, the RegFile still holds uninitialized data that could be any value, so we mask those outputs to 0 in the final result. As we step to new tiles, data from the previous tile still exists in the RegFile, so later tiles need no masking.
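The tiling, shifting, and masking steps can be put together in a small software reference model. This is a sketch in plain Scala using Double instead of fixed point; DifferentiatorModel and run are hypothetical names, zeros stand in for the uninitialized registers, and the window is assumed to be even and the tile size positive.

```scala
// Plain-Scala model of the tiled differentiator loop (not Spatial code).
object DifferentiatorModel {
  def run(data: Vector[Double], window: Int, tile: Int): Vector[Double] = {
    // The window is created once, outside the tile loop, so its contents
    // carry over from one tile to the next.
    var sr = Vector.fill(window)(0.0)
    val out = Vector.newBuilder[Double]
    for (c <- data.indices by tile; j <- 0 until math.min(tile, data.length - c)) {
      sr = data(c + j) +: sr.init                  // new sample at address 0
      val (newer, older) = sr.splitAt(window / 2)  // the two halves of the window
      val slope = (newer.sum - older.sum) / (window / 2)
      // Mask outputs whose global index is below the window size.
      out += (if (c + j < window) 0.0 else slope)
    }
    out.result()
  }
}
```

On a unit ramp (0, 1, 2, ...) with window = 4, every unmasked output is 2.0, because the centers of the two halves are window/2 samples apart. The result does not depend on the tile size, which is the point of keeping the window alive across tiles.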

The graph to the right shows a plot of the input and expected output for this app.

To compile the app for a particular target, see the Targets page.

Basic implementation (2D)

import spatial.dsl._

@spatial object Sobel extends SpatialApp {

  def main(args: Array[String]): Unit = {
    // Set up kernel height and width
    val Kh = 3
    val Kw = 3

    // Set max number of columns and pad size
    val Cmax = 160
    val pad = 3
    
    // Set up data for kernels
    val kh_data = List(List(1,2,1), List(0,0,0), List(-1,-2,-1))
    val kv_data = List(List(1,0,-1), List(2,0,-2), List(1,0,-1))

    val B = 16

    // Get image size from command line
    val r = args(0).to[Int]
    val c = args(1).to[Int]

    // Generate some input data
    val image = (0::r, 0::c){(i,j) => if (j > pad && j < c-pad && i > pad && i < r-pad) i*16 else 0}

    // Set args
    val R = ArgIn[Int]
    val C = ArgIn[Int]
    setArg(R, image.rows)
    setArg(C, image.cols)

    // Set up parallelization factors
    val lb_par = 16 (1 -> 1 -> 16)
    val par_store = 16
    val row_stride = 10 (100 -> 100 -> 500)
    val row_par = 2 (1 -> 1 -> 16)
    val par_Kh = 3 (1 -> 1 -> 3)
    val par_Kw = 3 (1 -> 1 -> 3)

    // Set up input and output images
    val img = DRAM[Int](R, C)
    val imgOut = DRAM[Int](R, C)

    // Transfer data to memory
    setMem(img, image)

    Accel {
      // Iterate over row tiles
      Foreach(R by row_stride par row_par){ rr =>

        // Handle edge case with number of rows to do in this tile
        val rows_todo = min(row_stride, R - rr)

        // Create line buffer, shift reg, and result
        val lb = LineBuffer[Int](Kh, Cmax)
        val sr = RegFile[Int](Kh, Kw)
        val lineOut = SRAM[Int](Cmax)

        // Use Scala lists defined above for populating LUT
        val kh = LUT[Int](3,3)(kh_data.flatten.map(_.to[Int]):_*)
        val kv = LUT[Int](3,3)(kv_data.flatten.map(_.to[Int]):_*)

        // Iterate over each row in tile
        Foreach(-2 until rows_todo) { r =>

          // Compute load address in larger image and load
          val ldaddr = if ((r+rr) < 0 || (r+rr) > R.value) 0 else {r+rr}
          lb load img(ldaddr, 0::C par lb_par)

          // Iterate over each column
          Foreach(0 until C) { c =>

            // Reset shift register
            Pipe{sr.reset(c == 0)}

            // Shift into 2D window
            Foreach(0 until Kh par Kh){i => sr(i, *) <<= lb(i, c) }

            val horz = Reduce(Reg[Int])(Kh by 1 par par_Kh, Kw by 1 par par_Kw){(i,j) =>
              sr(i,j) * kh(i,j)
            }{_+_}
            val vert = Reduce(Reg[Int])(Kh by 1 par par_Kh, Kw by 1 par par_Kw){(i,j) =>
              sr(i,j) * kv(i,j)
            }{_+_}

            // Store abs sum into answer memory
            lineOut(c) = mux(r + rr < 2 || r + rr >= R-2, 0.to[Int], abs(horz.value) + abs(vert.value))
          }

          // Only if current row is in-bounds, store result to output DRAM
          if (r+rr < R && r >= 0) imgOut(r+rr, 0::C par par_store) store lineOut
        }

      }
    }
    val output = getMatrix(imgOut)

    /*
      Filters:
      1   2   1
      0   0   0
     -1  -2  -1

      1   0  -1
      2   0  -2
      1   0  -1

    */
    // Compute gold check
    val gold = (0::R, 0::C){(i,j) =>
      if (i >= R-2) {
        0
      } else if (i >= 2 && j >= 2) {
        val px00 = image(i,j)
        val px01 = image(i,j-1)
        val px02 = image(i,j-2)
        val px10 = image(i-1,j)
        val px11 = image(i-1,j-1)
        val px12 = image(i-1,j-2)
        val px20 = image(i-2,j)
        val px21 = image(i-2,j-1)
        val px22 = image(i-2,j-2)
        abs(px00 * 1 + px01 * 2 + px02 * 1 - px20 * 1 - px21 * 2 - px22 * 1) + abs(px00 * 1 - px02 * 1 + px10 * 2 - px12 * 2 + px20 * 1 - px22 * 1)
      } else {
        0
      }
      // Shift result down by 2 and over by 2 because of the way accel is written
    }

    printMatrix(image, "Image")
    printMatrix(gold, "Gold")
    printMatrix(output, "Output")

    val cksum = gold == output
    println("PASS: " + cksum + " (Sobel)")
    assert(cksum)
  }
}

Here we show one of many ways to write a 2D convolution. Specifically, we run a Sobel filter, which acts as a simple edge detector for images.
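As a software reference for what the app computes at each pixel, the sketch below applies the two 3x3 kernels from the app to a single 3x3 window and adds the absolute responses. It is plain Scala rather than Spatial, and the names SobelModel, mac, and response are made up for illustration.

```scala
// Plain-Scala reference for the per-window Sobel computation.
object SobelModel {
  // Same kernel values as kh_data and kv_data in the app.
  val kh = Vector(Vector(1, 2, 1), Vector(0, 0, 0), Vector(-1, -2, -1))
  val kv = Vector(Vector(1, 0, -1), Vector(2, 0, -2), Vector(1, 0, -1))

  // Element-wise multiply-accumulate of a 3x3 window with a 3x3 kernel.
  def mac(win: Vector[Vector[Int]], k: Vector[Vector[Int]]): Int =
    (for (i <- 0 until 3; j <- 0 until 3) yield win(i)(j) * k(i)(j)).sum

  // Edge response: sum of the absolute values of the two kernel outputs.
  def response(win: Vector[Vector[Int]]): Int =
    math.abs(mac(win, kh)) + math.abs(mac(win, kv))
}
```

A window of constant intensity gives a response of 0, while a window that changes from one row to another, or from one column to another, gives a nonzero response.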

Here we set up the data for both kernels as Lists, which are not virtualized in Spatial. This means they are treated as Scala lists even at compile time of Spatial. This is one of many ways to use Scala to metaprogram Spatial.
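Because these are ordinary Scala lists, ordinary Scala collection methods apply to them. The sketch below shows what the flatten call in the app produces, assuming (as that call suggests) that the LUT takes its elements in row-major order. KernelFlattenModel and at are hypothetical names used only for this illustration.

```scala
// Plain-Scala view of flattening a nested kernel list into row-major order.
object KernelFlattenModel {
  val khData = List(List(1, 2, 1), List(0, 0, 0), List(-1, -2, -1))

  // Rows are concatenated one after another.
  val flat: List[Int] = khData.flatten

  // Row-major lookup of element (i, j) in the flattened list.
  def at(i: Int, j: Int): Int = flat(i * khData.head.length + j)
}
```

All of this runs in Scala while the app is being compiled; only the resulting nine constants end up in the generated design.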

Here we generate synthetic image data. The specific values are not important.

In this design, we tile by row, which allows us to use row_par to let the FPGA work on multiple chunks of input data in parallel. We are careful here to handle the edge case where fewer than a full tile of rows remain in the loop.
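The edge case can be checked with a short plain-Scala sketch that lists the starting row and row count of every tile, using the same min(row_stride, R - rr) expression as the app. RowTilingModel and tiles are hypothetical names, and a positive stride is assumed.

```scala
// Plain-Scala model of row tiling with a partial last tile.
object RowTilingModel {
  // Returns (start row, number of rows) for each tile.
  def tiles(rows: Int, stride: Int): Vector[(Int, Int)] =
    (0 until rows by stride)
      .map(rr => (rr, math.min(stride, rows - rr)))
      .toVector
}
```

With 25 rows and a stride of 10, the last tile covers only 5 rows, and the row counts across all tiles add up to 25.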

Here we create a LineBuffer and a RegFile. The animation below shows how LineBuffers work. It is important to use LineBuffers correctly in their parent control structures, or else unexpected behavior may occur.

A LineBuffer is a general case of a buffered memory, where we can consistently index into rows 0 (newest data), 1, and 2 (oldest data), but have new rows loading in the background that will be pushed to row index 0 at the next buffer swap. In order to understand the swapping procedure, you should look at the least-common ancestor (LCA) of all the reads and writes to the memory. In this case, the LCA of the write and read for this LineBuffer is the loop Foreach(-2 until rows_todo). The first stage of this loop is the load into the LineBuffer, and the second stage is the Foreach(0 until C) loop which contains the read to the LineBuffer deeper in its sub-tree. There is a third stage to this controller, but it is irrelevant with respect to the operation of this LineBuffer.

The LineBuffer will swap its data on the exact cycle when all stages active during a particular iteration have received their done signals. This is the main reason why it is important to understand the control logic around the LineBuffer, or else it is possible to have unintended swapping behavior.
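The row ordering seen by readers of a LineBuffer can be modeled in plain Scala as shown below. This sketch collapses the background load and the buffer swap into a single step, so it says nothing about timing or control logic; it only illustrates which row index holds which data after each swap. LineBufferModel and loadRow are hypothetical names, and zeros stand in for rows that have not been loaded yet.

```scala
// Plain-Scala model of LineBuffer row ordering (no timing, not Spatial code).
final class LineBufferModel(rows: Int, cols: Int) {
  private var lines: Vector[Vector[Int]] =
    Vector.fill(rows)(Vector.fill(cols)(0))

  // Models one load followed by a swap: the new row becomes row 0,
  // every existing row moves one index higher, and the oldest row is dropped.
  def loadRow(row: Vector[Int]): Unit = {
    require(row.length == cols, "row width must match the buffer width")
    lines = row +: lines.init
  }

  // Read element c of row r, where row 0 is the newest data.
  def apply(r: Int, c: Int): Int = lines(r)(c)
}
```

After loading rows filled with 1, 2, 3, and 4 into a three-row buffer, row 0 holds the 4s, row 1 the 3s, and row 2 the 2s; the row of 1s has been pushed out.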

Here we shift one new element into each of the 3 rows of the sliding window, with all 3 rows shifting in parallel. Note that it is entirely possible to perform this convolution without the intermediate RegFile by reading the LineBuffer directly. Using the intermediate RegFile just logically decouples the sliding window from the underlying memory.

Here we conditionally store our result back to DRAM as long as we are within valid bounds.

Using functions

import spatial.dsl._
@spatial object Differentiator extends SpatialApp {

  // Set low pass filter window size
  val window: scala.Int = 16

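  // Sums window/2 consecutive elements of sr starting at address start.
  // The caller divides the difference of the two half sums by window/2.
  def compute_kernel[T:Num](start: scala.Int, sr: RegFile1[T]): T = {
    Reduce(Reg[T](0.to[T]))(window/2 by 1) { k => sr(k+start) }{_+_}.value
  }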

  def main(args: Array[String]): Unit = {
    type T = FixPt[TRUE,_16,_16]
    
    // Set tile size
    val coltile = 64
    
    // Load data
    val data = loadCSV1D[T](s"$DATA/slac/slacsample1d.csv", ",")
    
    // Set full size of input vector for use by FPGA
    val memcols = ArgIn[Int]
    setArg(memcols, data.length.to[Int])
    
    // Create input and output DRAMs
    val srcmem = DRAM[T](memcols)
    setMem(srcmem, data)
    val dstmem = DRAM[T](memcols)

    Accel {
      // Create shift register window
      val sr = RegFile[T](window)
    
      // Allocate memories for input and output data
      val rawdata = SRAM[T](coltile)
      val results = SRAM[T](coltile)

      // Work tile by tile on input vector
      Foreach(memcols by coltile) { c =>
        
        // Fetch this tile
        rawdata load srcmem(c::c+coltile)
                                   
        // Scan through tile to get deriv
        Foreach(coltile by 1) { j =>
        
          // Shift next element into sliding window
          sr <<= rawdata(j)
          
          // Compute mean of points in first half of window
          val mean_right = compute_kernel[T](0, sr)
                               
          // Compute mean of points in second half of window
          val mean_left = compute_kernel[T](window/2, sr)
                       
          // Subtract and average
          val slope = (mean_right - mean_left) / (window/2).to[T]
          
          // Store result (if all data in window is valid)
          val idx = j + c
          results(j) = mux(idx < window, 0.to[T], slope)
        }
        dstmem(c::c+coltile) store results
      }
    }


    // Extract results from accelerator
    val results = getMem(dstmem)

    // Read answer
    val gold = loadCSV1D[T](s"$DATA/slac/deriv_gold.csv", ",")

    // Create validation checks and debug code
    printArray(results, "Results:")
    val margin = 0.5.to[T]

    val cksum = gold.zip(results){case (a,b) => abs(a-b) < margin}.reduce{_&&_}
    assert(cksum)
  }
}

It is possible to separate code into different files and functions in Spatial. The code to the left demonstrates how to do this. Note that Spatial currently inlines functions.
