Ph.D. Dissertation Defense

Title: Optimizing Data Intensive Window-based Image Processing on Reconfigurable Hardware Boards

Presenter: Haiqian Yu

Abstract

FPGA-based computing boards are frequently used as hardware accelerators for image processing algorithms with large amounts of computation and data access. The current design process requires that, for each specific image processing application, a detailed design be completed before a realistic estimate of the achievable speedup can be obtained. However, users need the speedup information to decide whether to use FPGA hardware to accelerate the application at all. Quickly providing an accurate speedup estimate is therefore increasingly important, allowing designers to make that decision without going through the lengthy design process.

We present an automated tool, Sliding Window Operation OPtimization (SWOOP), that estimates the speedup of a high-performance design before detailed implementation is complete. SWOOP targets Sliding Window Operations (SWOs) and produces a system block diagram of the final design as well as an optimal memory hierarchy.

One of the contributions of this research is the automatic design of the on-chip memory as a managed cache. The hardware setup we target can be viewed as a cached memory system, with the on-chip memory acting as an L1 cache and the on-board memory acting as an L2 cache. Unlike most processors, however, no hardware support for caching is provided: to minimize the number of off-chip data accesses, the memory must be carefully managed by the designer. Our approach automatically determines how the data should be accessed and buffered in the on-chip memory to minimize off-chip memory traffic. This approach is applicable to any hardware architecture that contains a hierarchy of memory outside the normal caching structure.

SWOOP takes both the application parameters and the FPGA board parameters as input. The achievable speedup is determined either by the area of the FPGA or, more often, by the memory bandwidth to the processing elements. The bandwidth to each processing element depends on the bandwidth to the FPGA and on the efficient use of on-chip RAM as a data cache. SWOOP uses analytic techniques to automatically determine the number of parallel processing elements to implement on the FPGA, the assignment of input and output data to on-board memory, and the organization of data in on-chip memory that most effectively keeps the processing elements busy. The result is a block layout of the final design, its memory architecture, the estimated usage of different resources, and a measure of the achievable speedup.

Several manually designed applications, including simple 2-D high-pass and low-pass filters, template matching, and 2-D cross-correlation, have been used to test the performance of SWOOP. Our experiments show that SWOOP can quickly (in less than a second) and accurately (within 10% of manual designs) estimate the maximum parallelism achievable for a given application and set of constraints. The block layouts of the final designs, together with their memory architectures, are near optimal with respect to performance. Moreover, since SWOOP identifies the tightest constraint on parallelism in a design, it can tell designers where to focus their efforts for further optimization.
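
The row-buffering idea behind the managed on-chip cache can be made concrete with a small software model. The Python sketch below is illustrative only and is not part of SWOOP: it assumes a K x K window over an H x W image and models the on-chip memory as a buffer holding K image rows, so each pixel crosses the off-chip interface once instead of up to K*K times.

    from collections import deque

    def offchip_reads_unbuffered(H, W, K):
        # Naive scheme: every output pixel re-fetches its whole K x K window.
        return (H - K + 1) * (W - K + 1) * K * K

    def offchip_reads_row_buffered(H, W, K):
        # With K image rows cached on-chip, each pixel is fetched exactly once.
        return H * W

    def sliding_window(image, K, op):
        # Apply op to every K x K window, using a K-row buffer that models
        # the on-chip storage a hardware design would allocate.
        H, W = len(image), len(image[0])
        rows = deque(maxlen=K)              # the "L1" row buffer
        out = []
        for r in range(H):
            rows.append(image[r])           # one off-chip read per pixel
            if len(rows) == K:
                out.append([op([rows[i][c + j]
                                for i in range(K) for j in range(K)])
                            for c in range(W - K + 1)])
        return out

    # Example: a 3 x 3 mean filter over an image held as a list of rows.
    # sliding_window(img, 3, lambda w: sum(w) // len(w))

For a 3 x 3 window over a 512 x 512 image, the buffered scheme performs 262,144 off-chip reads versus roughly 2.3 million for the unbuffered one; reductions of this kind are what the automatically managed cache is meant to capture.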
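
Similarly, the way FPGA area and memory bandwidth jointly bound parallelism can be sketched as a back-of-the-envelope model. All names and formulas below are assumptions made for illustration, not SWOOP's actual analytic equations.

    def estimate_parallelism(fpga_area, pe_area,
                             offchip_bw, bytes_per_output, pe_rate):
        # Parallelism is capped by whichever is tighter: the FPGA area or
        # the off-chip bandwidth remaining after on-chip buffering has
        # removed redundant reads.
        area_limited = fpga_area // pe_area
        bw_limited = int(offchip_bw / (bytes_per_output * pe_rate))
        return min(area_limited, bw_limited)

    def estimate_speedup(num_pes, sw_time_per_output, hw_time_per_output):
        # num_pes processing elements each produce one output every
        # hw_time_per_output seconds once the pipeline is full.
        return sw_time_per_output * num_pes / hw_time_per_output

The min() mirrors the role of bottleneck identification described in the abstract: whichever bound is smaller marks the constraint a designer should attack first for further optimization.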