On Real-Time Super-Resolution Using Deep Learning and Field Programmable Gate Arrays

Nicholas Christman
15 min read · Nov 29, 2020
Photo by Chang et al. (source)

There are many papers and blog posts that make a solid case for why super-resolution using deep learning is superior to conventional super-resolution methods (e.g., this survey of deep learning super-resolution provides a great overview). It would seem, however, that real-time super-resolution on embedded platforms (e.g., IoT devices) will remain an active area of research as long as display technologies continue to evolve. In this article, we will focus on one specific real-time super-resolution technology: deep learning super-resolution for real-time applications using FPGAs. Among other references, this article centers on the following two core papers:

  1. An Energy-Efficient FPGA-Based Deconvolutional Neural Networks Accelerator for Single Image Super-Resolution, by Jung-Woo Chang et al.
  2. Efficient Multiquality Super‐resolution Using a Deep Convolutional Neural Network for an FPGA Implementation, by Min Beom Kim et al.

Background

This article introduces the topic of real-time super-resolution for embedded systems [1], where super-resolution is the process of enhancing an image or video from one or more low-resolution (LR) observations into a high-resolution (HR) representation [5]. As one might imagine, super-resolution is a powerful technique in industries concerned with video streaming, television broadcasting, surveillance, and medical imaging, to name a few. Super-resolution using deep learning is well studied (refer to [2]), but most state-of-the-art methods form very complex neural networks that are not optimized for resource-constrained embedded systems (i.e., they require large memory footprints).

This article is not meant to provide a comprehensive overview of super-resolution using deep learning (I recommend this Medium article by Raj for such an overview). Instead, the motivation is simply to offer a perspective on two approaches to optimizing a super-resolution neural network for use with a field programmable gate array (FPGA). As such, the focus will be on how the authors of [3] and [4] optimized the Fast Super-Resolution Convolutional Neural Network (FSRCNN) and its hardware resources for an FPGA implementation.

Fast Super-Resolution Convolutional Neural Network (FSRCNN)

The FSRCNN is based on the well-known Super-Resolution Convolutional Neural Network (SRCNN), a convolutional neural network (CNN) architecture consisting of only three layers. The SRCNN is intended to be a simple, practical, fully feed-forward network that can simultaneously process three channels of color images to achieve optimal super-resolution performance [6]. Nevertheless, two key areas for improvement remain, and the subsequent sections cover both: investigating different filter sizes and generalizing the model to accommodate multi-quality input images.

The FSRCNN is a shallow network that optimizes memory and compute utilization. As illustrated in the figure below, FSRCNN contributes three improvements over the SRCNN network: (1) it does not require a preprocessed, bicubic-interpolated input but instead upsamples at the end using a deconvolution layer; (2) three steps are used for non-linear mapping (shrinking, mapping, expanding) instead of only one; and (3) smaller filters and a deeper network are used to improve performance and lower computational cost [7]. The shallow network (compared to other high-performing architectures) and the ability to use original, unprocessed input images make the FSRCNN model a natural choice for real-time embedded systems.

The network structure of SRCNN vs. FSRCNN (source: https://arxiv.org/pdf/1608.00367.pdf)

Field Programmable Gate Array (FPGA)

It is well understood that CNNs involve thousands, millions, or even billions of multiply-accumulate (MAC) operations, thus requiring significant memory and compute resources. Graphics processing units (GPUs) have become the staple of deep learning because of their ability to compute billions or even trillions of floating-point operations per second (FLOPS). Meanwhile, a field programmable gate array (FPGA) is a configurable integrated circuit that is ubiquitous in applications requiring real-time signal processing. From low-power embedded systems to high-performance computing systems, FPGAs provide an ultra-flexible platform capable of real-time digital signal processing.

[An FPGA] is a semiconductor IC where a large majority of the electrical functionality inside the device can be changed; changed by the design engineer, changed during the PCB assembly process, or even changed after the equipment has been shipped to customers out in the ‘field’. — Intel

So what sets FPGAs apart from GPUs? And why should the data science community be interested in FPGAs as a platform for deep learning applications?

An article posted on the Intel website provides an excellent description of what sets FPGAs apart from GPUs. It is important to point out that FPGAs provide real-time, deterministic performance for video and image processing applications, and they are known to provide better performance per watt than GPUs, making them a strong candidate for low-power embedded systems. FPGAs, however, are not without their limitations. First, they are limited in the processing blocks available on the device (e.g., DSP and RAM blocks); smaller, low-power FPGAs simply have fewer resources available. External resources (e.g., DRAM) are often coupled with FPGAs to overcome these resource limits, but such external resources add communication overhead that can make or break an embedded system's ability to perform deep learning in real time [3]. Second, it is widely accepted that FPGAs are poorly suited to floating-point arithmetic, which requires significantly more block RAM than fixed-point computation. We will see in this article that these disadvantages must be overcome to achieve real-time super-resolution using deep learning.

Implementations

Now that we understand the goal of super-resolution and the role FPGAs play in real-time embedded systems, we can walk through a couple of examples of how FPGAs can be used for real-time super-resolution using deep learning.

Energy-Efficient Super-Resolution on an FPGA [3]

As mentioned above, state-of-the-art super-resolution neural networks implement a deconvolution layer at the end of the network to upsample the result. This allows the original, low-resolution image to be fed directly into the network rather than being upsampled before training (e.g., using bicubic interpolation) [3]. This added deconvolution layer, however, introduces challenges for resource-constrained devices. First, it is estimated that a deconvolution layer requires almost 7 times more MAC operations than a convolution layer; on an FPGA, DSP block utilization is linearly proportional to the number of multiplication operations [3]. Second, the overlapping sum problem that deconvolution introduces (the output blocks, i.e., pixels, overlap as a result of the deconvolution) can cause significant communication overhead if the FPGA has to recall neighboring blocks from external memory, as alluded to in the Background section.
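To see the overlapping sum problem concretely, here is a minimal NumPy sketch (my own illustration, not code from [3]) of a naive stride-S transposed convolution. Whenever the kernel size K_D is larger than the stride S, the K_D × K_D patches written by neighboring input pixels overlap and must be accumulated — exactly the data an FPGA would otherwise have to fetch back from external memory:

```python
import numpy as np

def transposed_conv2d(x, kernel, stride):
    """Naive transposed convolution: each input pixel scatters a weighted
    copy of the kernel into the output; overlapping patches are summed."""
    h_in, w_in = x.shape
    k = kernel.shape[0]
    out = np.zeros((stride * (h_in - 1) + k, stride * (w_in - 1) + k))
    hits = np.zeros(out.shape, dtype=int)  # patches touching each output pixel
    for i in range(h_in):
        for j in range(w_in):
            out[i*stride:i*stride + k, j*stride:j*stride + k] += x[i, j] * kernel
            hits[i*stride:i*stride + k, j*stride:j*stride + k] += 1
    return out, hits

x = np.random.rand(4, 4)                           # a tiny LR feature map
kernel = np.random.rand(5, 5)                      # K_D = 5
_, hits = transposed_conv2d(x, kernel, stride=2)   # S = 2
print(hits.max())  # 9: interior output pixels accumulate sums from 9 patches
```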

As such, Chang et al. emphasize the importance of limiting all memory transactions to the FPGA itself (shown as on-chip memory in the figure below). Unfortunately, on-chip memory is typically much smaller than off-chip memory, so additional care is required when optimizing FPGA resource utilization. To eliminate the need for off-chip memory operations, Chang et al. propose the TDC method: transforming deconvolution layers into convolution layers to reduce the number of MAC operations and the overall size of each kernel [3].

An example of an FPGA utilizing external memory for deep learning (source)

To implement the TDC method, Chang et al. utilize an inverse coefficient mapping technique that effectively maps the weights in the deconvolutional filters ("W_D") to weights in the new convolutional filters ("W_C"), as shown in the equation below. It is left for the reader to explore the derivation of this equation in Section III.B of [3].

In this equation, S is the target output scale factor (stride); (m, n) are loop indices over the output and input feature maps (bounded by M×N kernels, 1 ≤ m ≤ M and 1 ≤ n ≤ N); (x_i, y_i) are indices of the input pixel; (x_d, y_d) are indices of the "W_D" weight coefficient; and (x_o, y_o) are indices of the output pixel.

Furthermore, the TDC method exposes the target output scale factor "S" (i.e., stride) and the deconvolutional filter size ("K_D") as hyperparameters that can be tuned to achieve the best performance given the FPGA's resources. The convolutional filter size ("K_C") is then set using these hyperparameters and the equations below, where the fractional value of "N_O" determines how the current output block overlaps with the most distant output block [3].

Note: the two equations above effectively transform the deconvolutional filter into “S² × (M × N)” convolutional filters of size “K_C × K_C” [3].
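As a small bookkeeping sketch (assumed from the relationships above; the ceiling rule K_C = ⌈K_D / S⌉ matches the K_D = 5, S = 2, K_C = 3 example shown later), the shapes after the TDC transform can be computed as follows:

```python
import math

def tdc_shapes(K_D: int, S: int, M: int, N: int):
    """Kernel size and filter count after the TDC transform (an assumed
    reconstruction of the relationships described in [3])."""
    K_C = math.ceil(K_D / S)        # e.g., K_D=5, S=2 -> K_C=3
    num_filters = S**2 * M * N      # S^2 x (M x N) convolutional filters
    return K_C, num_filters

print(tdc_shapes(K_D=7, S=2, M=1, N=25))  # the FSRCNN(25, 5, 1) case -> (4, 100)
```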

Chang et al. also implement a load-balancing process for re-distributing the filter weights, which ensures each processing element is used as fairly and efficiently as possible. For brevity, this process is left to the reader to investigate. As illustrated in the example below, both the TDC method and the load-balancing process are completed offline.

An example of TDC and load-balancing zero-valued weights. In this example, K_D = 5 and S = 2, and K_C is set to 3 through the TDC method (source)

Up to this point, the discussion has been fairly generic. As one might suspect from the Background section, Chang et al. implement FSRCNN(x, y, z) as the target architecture (shown below), where (x, y, z) is a combination of sensitive variables that can be tuned to change the network structure and improve overall performance (x is the low-resolution feature dimension, y is the number of shrinking filters, and z is the mapping layer depth). For example, in [7] the conventional FSRCNN(x, y, z) is set to FSRCNN(56, 12, 4) to achieve the maximum peak signal-to-noise ratio (PSNR). Finally, the figure below also provides an example of how transforming DeConv(K_D, M, N) into the convolution layers Conv(K_C, S²×M, N) might look for different output scale factors.

FSRCNN network with TDC method (source)
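For reference, here is a hedged PyTorch sketch of the baseline FSRCNN(x, y, z) structure from [7], before any TDC transform. The kernel sizes follow the paper; the padding and output_padding values are my assumptions, chosen so the output is exactly S times larger than the input:

```python
import torch.nn as nn

def fsrcnn(x=56, y=12, z=4, scale=3, channels=1):
    layers = [
        nn.Conv2d(channels, x, kernel_size=5, padding=2), nn.PReLU(x),  # feature extraction
        nn.Conv2d(x, y, kernel_size=1), nn.PReLU(y),                    # shrinking
    ]
    for _ in range(z):                                                  # non-linear mapping
        layers += [nn.Conv2d(y, y, kernel_size=3, padding=1), nn.PReLU(y)]
    layers += [nn.Conv2d(y, x, kernel_size=1), nn.PReLU(x)]             # expanding
    layers += [nn.ConvTranspose2d(x, channels, kernel_size=9, stride=scale,
                                  padding=4, output_padding=scale - 1)] # deconvolution upsampling
    return nn.Sequential(*layers)

net = fsrcnn(25, 5, 1, scale=2)  # the FSRCNN(25, 5, 1) variant chosen in [3]
```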

After the TDC-based FSRCNN network has been trained on a GPU using the Caffe framework, it is deployed to a Xilinx Virtex-7 485T FPGA for real-time performance evaluation [3]. The figure below illustrates the expected architecture when implemented on an FPGA. As indicated in [3], the convolutional layer processors (CLPs) for the first and second convolutional layers and for the third and fourth convolutional layers are fused into the combined CLP1 and CLP2 blocks, respectively. Note: this assumes a mapping depth of one (z = 1), as described below. By unrolling the 1×1 Conv(1, y, x) layers in the second and fourth positions and fusing them with the first and third layers, respectively, the need for extra buffers (block RAM) is eliminated. This optimization of implementing CLP1 and CLP2 reduces the number of line buffers by more than 80%.

Note: the RGB frame is converted into YCbCr before entering the CNN, and only the Y channel is processed by the network (the Cb and Cr channels are upscaled using conventional bicubic interpolation, which Chang et al. note is standard procedure for conventional super-resolution [3]); a sketch of this dataflow follows the figure below.

Expected architecture and dataflow for TDC based FSRCNN network (source)
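Here is a sketch of that luma-only dataflow in Python with OpenCV (my own illustration, not the authors' code; sr_model is a hypothetical callable that super-resolves a single-channel image by the given scale). Note that OpenCV orders the channels as YCrCb:

```python
import cv2
import numpy as np

def super_resolve(bgr: np.ndarray, sr_model, scale: int = 2) -> np.ndarray:
    """Super-resolve the Y channel with a CNN; upscale chroma with bicubic."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    y_hr = sr_model(y)  # the CNN handles only the luminance channel
    size = (bgr.shape[1] * scale, bgr.shape[0] * scale)  # (width, height)
    cr_hr = cv2.resize(cr, size, interpolation=cv2.INTER_CUBIC)  # chroma: bicubic
    cb_hr = cv2.resize(cb, size, interpolation=cv2.INTER_CUBIC)
    return cv2.cvtColor(cv2.merge([y_hr, cr_hr, cb_hr]), cv2.COLOR_YCrCb2BGR)
```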

Among the optimizations detailed in [3], Chang et al. conclude through experimentation that 13-bit fixed-point values maintain the performance achieved by the conventional FSRCNN with 32-bit floating-point values (this reduces block RAM usage). Also through a series of experiments, they conclude that the optimal network for the Xilinx Virtex-7 485T FPGA is FSRCNN(25, 5, 1) with a deconvolutional filter size ("K_D") of 7; this structure is both resource-efficient (lowest block RAM and DSP usage) and achieves the highest PSNR, as shown below. Note: it was observed that K_D = 9 was too restrictive regarding the FSRCNN(x, y, z) sensitive variables, limiting (x, y, z) to unreasonably small sizes [3].

Experiment results for different FSRCNN sensitive variables (source)
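Returning to the 13-bit fixed-point choice above, here is a minimal NumPy sketch of quantizing weights to such a format. [3] does not state its integer/fraction split, so the Q1.12 layout below (one sign bit, 12 fractional bits, suitable for weights in roughly (-1, 1)) is a guess:

```python
import numpy as np

def to_fixed_point(w: np.ndarray, total_bits: int = 13, frac_bits: int = 12) -> np.ndarray:
    """Round weights to a signed fixed-point grid and return them as floats."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(w * scale), lo, hi) / scale

w = np.random.uniform(-0.9, 0.9, 1000).astype(np.float32)
print(np.abs(w - to_fixed_point(w)).max())  # worst-case rounding error ~2^-13
```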

Finally, if we look at Table VI from [3] (shown below), we can compare the speed (clock cycles) and throughput of several conventional deconvolutional neural networks (DCNNs) (left) against the TDC method (right, "This Work"). From the text, this table appropriately accounts for the 7×7 deconvolutional filter size (K_D) and the FSRCNN(25, 5, 1) architecture; it indicates an estimated 31x speed improvement over the conventional DCGAN architecture (roughly 65,384 cycles versus 2,074 cycles, a ratio of about 31.5). Furthermore, the table shows a throughput improvement of more than 1000x for the TDC method (with a 4x scale factor) compared to the conventional DCGAN. Chang et al. attribute this throughput improvement to the TDC method solving the large loop dimension problem encountered with deconvolutional layers.

Our DCNN accelerator solved the large loop dimension problem of the output image by using the TDC method. This is because we could simultaneously generate HR images with S² channels of LR images using the TDC method, GOPS was higher in proportion to S — Chang et al. [3]

Efficient Multiquality Super-Resolution on an FPGA [4]

Where Chang et al. cover single-quality super-resolution, Kim et al. observe that in reality, at least in the television industry, input images may share the same pixel resolution but differ in image quality; an example used in [4] is a TV that displays 4K video (3840×2160 or 4096×2160 pixels) whose underlying image quality is only SD (720×480 pixels). The core paper covered in this section proposes the Efficient Multiquality Super-Resolution (EMQSR) method to not only address the issue of unknown input image quality but to do so in a way that is efficient enough for an FPGA implementation (i.e., low memory usage and optimized MAC operations). The EMQSR adopts the FSRCNN network (similar to Chang et al.) and is broken into two training phases, with an Average Percentage of Zeros (APoZ) network-trimming procedure before the second phase.

In the first phase, low-quality (LQ) and high-quality (HQ) low-resolution (LR) single-image super-resolution networks (LQ-SQSR and HQ-SQSR) are trained independently, albeit differently from the conventional FSRCNN. A global residual structure is used to add the super-resolved output image to the bicubic-interpolated (upscaled) input image; this allows the new selection layer to add multiple results without normalization [4] (a sketch of this residual structure follows the figure below). Also, the Rectified Linear Unit (ReLU) activation function is used instead of the Parametric ReLU; it seems the Parametric ReLU requires an extra multiplication that would otherwise hurt the efficiency of an FPGA implementation [4]. Finally, additional considerations are made to accommodate the FPGA's limited resources: (1) a deconvolutional filter of size S×S (where S = 2, the upscaling factor) is used to avoid the overlapping pixel issue, and (2) a mapping-layer filter of size 1×m (where m = 3) is used to avoid unnecessary line buffers [4]. Note: these are the same issues encountered in the previous section, addressed differently.

The first-phase SQSR networks, trained individually, where d1 and d2 are the filter depths for the expanding and shrinking layers, respectively. (source)
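A hedged PyTorch sketch of that global residual structure follows; the body module is a stand-in for an SQSR network whose output is already at HR resolution, so only the residual detail has to be learned:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualSQSR(nn.Module):
    """Wraps an SQSR-style body with the global residual described in [4]:
    the network output is added to the bicubic-upscaled input."""

    def __init__(self, body: nn.Module, scale: int = 2):
        super().__init__()
        self.body, self.scale = body, scale

    def forward(self, x):
        upscaled = F.interpolate(x, scale_factor=self.scale,
                                 mode="bicubic", align_corners=False)
        return self.body(x) + upscaled  # predict only the missing detail
```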

Before the second phase, Kim et al. apply the APoZ procedure to compress the network [4]. At a high level, the APoZ method trims the neural network by removing the filters whose activations are most often zero (i.e., the filters with the highest Average Percentage of Zeros); in turn, this lowers the FPGA resources required by the network. In practice, the trimming method ranks the filters by APoZ value and removes the T highest-APoZ filters from the subject layer (where T is a threshold hyperparameter, the maximum number of filters to remove from the subject layer) as well as the corresponding weights in subsequent layers [4]. Note: Kim et al. mention the model is re-trained and fine-tuned after each trimming step, and this is repeated for all layers through the fourth (expanding) layer.
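Here is a simplified NumPy sketch of APoZ-style ranking and pruning (my own reconstruction of the procedure described in [4]; the calibration details are assumptions):

```python
import numpy as np

def apoz_rank(activations: np.ndarray) -> np.ndarray:
    """activations: (batch, filters, H, W) post-ReLU feature maps from a
    calibration set. Returns filter indices sorted by APoZ, highest first."""
    apoz = (activations == 0).mean(axis=(0, 2, 3))  # fraction of zeros per filter
    return np.argsort(apoz)[::-1]

def prune(weights: np.ndarray, activations: np.ndarray, T: int) -> np.ndarray:
    """weights: (filters, in_channels, kH, kW). Drop the T highest-APoZ filters."""
    drop = set(apoz_rank(activations)[:T])
    keep = [f for f in range(weights.shape[0]) if f not in drop]
    # the matching input channels of the *next* layer must also be removed,
    # and the model re-trained, as described above
    return weights[keep]
```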

The second phase occurs once both the LQ- and HQ-SQSR networks have been independently trained and optimized through APoZ trimming. For this phase, a convolution layer is inserted just before the deconvolution layer; the authors call this the selection layer because it "chooses [the] proper CNN according to the input image's quality" to achieve multiquality functionality [4]. The concatenated EMQSR network structure is shown below, where the selection layer comes immediately after the concatenation of the LQ-SQSR and HQ-SQSR networks (light gray).

The EMQSR network with concatenated SQSR networks and the selection layer. (source)

Unlike the first phase, the EMQSR network is trained with both HQ and LQ LR images simultaneously, and this second training phase is itself split into two steps (see the sketch after the quote below). First, the selection layer is trained independently by freezing the LQ- and HQ-SQSR parameters; in this manner, the selection layer learns to emphasize (select) the correct HQ or LQ output based on the input image quality [4]. Then, the entire network is fine-tuned by training all layers. Again, the bicubic-interpolated, upscaled input image is added to the super-resolved output image to maintain the residual learning strategy mentioned above. It is important to point out the following claim regarding scalability:

Moreover, to reconstruct more than two input qualities, EMQSR adds more SQSR structures parallel to existing SQSR, and concatenate them in a single feature map — Kim et al. [4]
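A minimal PyTorch-style sketch of that two-step schedule follows (module names like lq_branch and hq_branch are my assumptions, and train_step stands in for one full training pass):

```python
import torch.nn as nn

def train_selection_then_finetune(emqsr: nn.Module, train_step) -> None:
    """Two-step second phase: freeze both SQSR branches while the selection
    layer learns to pick the right branch, then fine-tune everything."""
    # Step 1: only the selection (and later) layers receive gradients.
    for p in emqsr.lq_branch.parameters():
        p.requires_grad = False
    for p in emqsr.hq_branch.parameters():
        p.requires_grad = False
    train_step(emqsr)

    # Step 2: unfreeze and fine-tune the entire network end to end.
    for p in emqsr.parameters():
        p.requires_grad = True
    train_step(emqsr)
```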

Let's now work through an example of the EMQSR, focusing on the EMQSR-20 network Kim et al. use for their FPGA implementation (the reader is encouraged to explore the other EMQSR networks compared in [4]). For training the EMQSR-20 network, they propose a batch size of 128, where the first 64 inputs are LQ LR images and the second 64 are HQ LR images. Moreover, the initial learning rate is 1E-3 with a decay rate of 1E-4 after 500 epochs, and the L1 loss function is used because of its "better robustness" for super-resolution [4]. Hyperparameters specific to the EMQSR-20 network include filter depths of d1 = 20 and d2 = 12 (to fit the target Xilinx Kintex7-410T FPGA) and a fixed APoZ trimming rate (unfortunately, it is not clear what trimming rate was actually used) [4]. The training results for an upscaling factor (S) of 2 are provided below.
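Before looking at the results, here is a hedged sketch pulling those training hyperparameters together (the batch layout, learning rate, and loss follow the description above; the optimizer choice and the stand-in network are my assumptions):

```python
import torch

net = torch.nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the EMQSR-20 network

def make_batch(lq_images: torch.Tensor, hq_images: torch.Tensor) -> torch.Tensor:
    # batch of 128: the first 64 inputs are LQ LR images, the next 64 are HQ
    return torch.cat([lq_images[:64], hq_images[:64]], dim=0)

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # initial LR of 1E-3
criterion = torch.nn.L1Loss()  # L1 loss, chosen in [4] for its robustness
# per [4], a decay rate of 1E-4 is applied to the learning rate after 500 epochs
```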

It is interesting to compare the results of EMQSR-20 with the individual HQ-SQSR and LQ-SQSR networks (pre-concatenation). The advantage of the EMQSR-20, of course, is that it has less than half the parameters needed by each SQSR network, as well as by the highest-performing EMQSR-32 network; a clear indication that the EMQSR-20 is more resource-efficient for an FPGA implementation. What is perhaps more interesting, however, is that the EMQSR-20 network only marginally outperforms the base FSRCNN network while having nearly twice the parameters!

Regardless of the parameter count, the EMQSR-20's smaller expanding-layer (d1) and shrinking-layer (d2) filter depths, in addition to APoZ trimming, result in nearly a 60% reduction in parameters and better performance on multi-quality input images. This appears to be enough optimization for Kim et al. to implement the EMQSR-20 network on the Xilinx Kintex7-410T FPGA. There is only a brief discussion of the FPGA experiment, but Kim et al. conclude that, with a 2x scaling factor, the FPGA implementation has a delay of only 10 lines from input to output, which they claim qualifies as real-time performance.

Conclusion

In this article, we discussed two core papers that propose optimizations to the state-of-the-art FSRCNN for implementation on an FPGA. The first paper (i.e., [3]) was heavy on the hardware implementation and more concerned with throughput performance (GOPS), lacking important details on the development of the neural network; the second paper (i.e., [4]) provided significantly more detail on training and optimizing the neural network for an FPGA and focused more on image quality performance (PSNR), but lacked details regarding the implementation on real hardware. In both cases, we worked through two key issues that prevent the conventional FSRCNN network from being deployable on an FPGA, namely the overlapping sum problem and the naive network structure (i.e., the inclusion of zero-valued weights). Finally, we have seen how both networks (the TDC-based FSRCNN and the EMQSR-20) can be effectively deployed to and executed on Xilinx FPGAs (a Virtex-7 485T and a Kintex7-410T, respectively). In closing, the details gleaned from both papers give us a holistic perspective on the complexities of deploying a super-resolution neural network on an FPGA, opening the door for further exploration of real-time super-resolution using deep learning and FPGAs.

References

[1] https://en.wikipedia.org/wiki/Embedded_system

[2] https://blog.paperspace.com/image-super-resolution/

[3] Chang, Jung-Woo, et al. "An Energy-Efficient FPGA-Based Deconvolutional Neural Networks Accelerator for Single Image Super-Resolution." IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 1, Jan. 2020, pp. 281–95. DOI.org (Crossref), doi:10.1109/TCSVT.2018.2888898.

[4] Kim, Min Beom, et al. “Efficient Multiquality Super‐resolution Using a Deep Convolutional Neural Network for an FPGA Implementation.” Journal of the Society for Information Display, vol. 28, no. 5, May 2020, pp. 428–39. DOI.org (Crossref), doi:10.1002/jsid.902.

[5] Yang, Wenming, et al. “Deep Learning for Single Image Super-Resolution: A Brief Review.” IEEE Transactions on Multimedia, vol. 21, no. 12, Dec. 2019, pp. 3106–21. arXiv.org, doi:10.1109/TMM.2019.2919431.

[6] Dong, Chao, et al. "Image Super-Resolution Using Deep Convolutional Networks." ArXiv:1501.00092 [Cs], July 2015. arXiv.org, http://arxiv.org/abs/1501.00092.

[7] Dong, Chao, et al. “Accelerating the Super-Resolution Convolutional Neural Network.” ArXiv:1608.00367 [Cs], Aug. 2016. arXiv.org, http://arxiv.org/abs/1608.00367.

[8] Wang, Zhihao, et al. “Deep Learning for Image Super-Resolution: A Survey.” ArXiv:1902.06068 [Cs], Feb. 2020. arXiv.org, http://arxiv.org/abs/1902.06068.


Nicholas Christman

An electrical engineer with an M.S. from Columbia University, focused on Big Data Analytics and Cloud Computing.