Ph.D. Dissertation Defense

Title: Optimizing Data Intensive Window-based Image Processing on Reconfigurable Hardware Boards

Presenter: Haiqian Yu

Abstract

FPGA-based computing boards are frequently used as hardware accelerators for image processing algorithms with large amounts of computation and data access. The current design process requires that, for each specific image processing application, a detailed design be completed before a realistic estimate of the achievable speedup can be obtained. However, users need the speedup information to decide whether to use FPGA hardware to accelerate the application at all. Quickly providing an accurate speedup estimate is therefore increasingly important, allowing designers to make that decision without going through the lengthy design process.

We present an automated tool, Sliding Window Operation OPtimization (SWOOP), that estimates the speedup of a high-performance design before detailed implementation is complete. SWOOP targets Sliding Window Operations (SWOs) and produces a system block diagram of the final design as well as an optimal memory hierarchy.

One of the contributions of this research is the automatic design of the on-chip memory as a managed cache. The hardware setup we target can be viewed as a cached memory system, with the on-chip memory acting as an L1 cache and the on-board memory acting as an L2 cache. Unlike most processors, however, no hardware support for caching is provided: to minimize the number of off-chip data accesses, the memory must be carefully managed by the designer. Our approach automatically determines how the data should be accessed and buffered in the on-chip memory to minimize off-chip memory traffic. This approach is applicable to any hardware architecture that contains a hierarchy of memory outside the normal caching structure.

SWOOP takes both the application parameters and the FPGA board parameters as input. The achievable speedup is determined either by the area of the FPGA or, more often, by the memory bandwidth to the processing elements. The bandwidth to each processing element depends on the bandwidth to the FPGA and on the efficient use of on-chip RAM as a data cache. SWOOP uses analytic techniques to automatically determine the number of parallel processing elements to implement on the FPGA, the assignment of input and output data to on-board memory, and the organization of data in on-chip memory that most effectively keeps the processing elements busy. The result is a block layout of the final design, its memory architecture, the estimated usage of different resources, and a measure of the achievable speedup.

Several manually designed applications, including simple 2-D high-pass and low-pass filters, template matching, and 2-D cross-correlation, have been used to test the performance of SWOOP. Our experiments show that SWOOP can quickly (in less than a second) and accurately (within 10% of manual designs) estimate the maximum parallelism achievable for a given application and set of constraints. The block layouts of the final designs, together with their memory architectures, are near optimal with respect to performance. Moreover, since SWOOP identifies the tightest constraint on parallelism in a design, it can tell designers where to focus their efforts for further optimization.
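
The row-buffering idea behind the managed on-chip cache can be made concrete with a small software model. The Python sketch below is illustrative only and is not part of SWOOP: it assumes a K x K window over an H x W image and models the on-chip memory as a buffer holding K image rows, so each pixel crosses the off-chip interface once instead of up to K*K times.

    from collections import deque

    def offchip_reads_unbuffered(H, W, K):
        # Naive scheme: every output pixel re-fetches its whole K x K window.
        return (H - K + 1) * (W - K + 1) * K * K

    def offchip_reads_row_buffered(H, W, K):
        # With K image rows cached on-chip, each pixel is fetched exactly once.
        return H * W

    def sliding_window(image, K, op):
        # Apply op to every K x K window, using a K-row buffer that models
        # the on-chip storage a hardware design would allocate.
        H, W = len(image), len(image[0])
        rows = deque(maxlen=K)              # the "L1" row buffer
        out = []
        for r in range(H):
            rows.append(image[r])           # one off-chip read per pixel
            if len(rows) == K:
                out.append([op([rows[i][c + j]
                                for i in range(K) for j in range(K)])
                            for c in range(W - K + 1)])
        return out

    # Example: a 3 x 3 mean filter over an image held as a list of rows.
    # sliding_window(img, 3, lambda w: sum(w) // len(w))

For a 3 x 3 window over a 512 x 512 image, the buffered scheme performs 262,144 off-chip reads versus roughly 2.3 million for the unbuffered one; reductions of this kind are what the automatically managed cache is meant to capture.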
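
Similarly, the way FPGA area and memory bandwidth jointly bound parallelism can be sketched as a back-of-the-envelope model. All names and formulas below are assumptions made for illustration, not SWOOP's actual analytic equations.

    def estimate_parallelism(fpga_area, pe_area,
                             offchip_bw, bytes_per_output, pe_rate):
        # Parallelism is capped by whichever is tighter: the FPGA area or
        # the off-chip bandwidth remaining after on-chip buffering has
        # removed redundant reads.
        area_limited = fpga_area // pe_area
        bw_limited = int(offchip_bw / (bytes_per_output * pe_rate))
        return min(area_limited, bw_limited)

    def estimate_speedup(num_pes, sw_time_per_output, hw_time_per_output):
        # num_pes processing elements each produce one output every
        # hw_time_per_output seconds once the pipeline is full.
        return sw_time_per_output * num_pes / hw_time_per_output

The min() mirrors the role of bottleneck identification described in the abstract: whichever bound is smaller marks the constraint a designer should attack first for further optimization.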