## 行政院國家科學委員會專題研究計畫 成果報告

# 座標旋轉原理演算法應用於二維及三維特殊信號處理器之 晶片設計與製作(I)

研究成果報告(精簡版)

| 計 | 畫 | 類 | 別 | : | 個別型                    |
|---|---|---|---|---|------------------------|
| 計 | 畫 | 編 | 號 | : | NSC 98-2221-E-216-037- |
| 執 | 行 | 期 | 間 | : | 98年08月01日至99年07月31日    |
| 執 | 行 | 單 | 位 | : | 中華大學微電子工程學系            |
|   |   |   |   |   |                        |

- 計畫主持人: 宋志雲
- 共同主持人:辛錫進
- 計畫參與人員:碩士班研究生-兼任助理人員:郭志軒 博士班研究生-兼任助理人員:柯律廷
- 報告附件:出席國際會議研究心得報告及發表論文

處 理 方 式 : 本計畫涉及專利或其他智慧財產權,1年後可公開查詢

### 中華民國 99年09月03日

# 行政院國家科學委員會補助專題研究計畫 □ 期中進度報告

座標旋轉原理演算法應用於二維及三維特殊信號處理器之晶片設計 與製作(I)

計畫類別:■個別型計畫 □整合型計畫 計畫編號:NSC 98-2221-E-216-037 執行期間: 98 年 8 月 1 日至 99 年 7 月 31 日

執行機構及系所:中華大學微電子工程學系

計畫主持人:宋志雲

共同主持人:辛錫進

計畫參與人員:柯律廷 郭志軒

成果報告類型(依經費核定清單規定繳交):■精簡報告 □完整報告

本計畫除繳交成果報告外,另須繳交以下出國心得報告:

□赴國外出差或研習心得報告

- □赴大陸地區出差或研習心得報告
- ■出席國際學術會議心得報告

□國際合作研究計畫國外研究報告

處理方式:除列管計畫及下列情形者外,得立即公開查詢 ■涉及專利或其他智慧財產權,■一年□二年後可公開查詢

中華民國 99 年 8 月 25 日

#### 中文摘要

本研究使用混合座標旋轉原理設計及製作直接數位頻率合成器。此一設計之架構為無乘法器,包含小量之唯獨記憶體(16×4-位元)以及疊流資料路徑,所產生無寄生動態範圍超過84.4 dBc。系統晶片由台積電0.18µm 1P6M CMOS製程設計,並且在Xilinx陣列處理器上實體模擬。證明此一以混合座標旋轉原理為基礎之直接數位頻率合成器適合由超大型積體電路製作,在硬體成本,功率消耗以及無寄生動態範圍上都有具備優勢。

關鍵詞:直接數位頻率合成器,混合座標旋轉原理,系統晶片,陣列處理器,無寄生動態範圍。

### 英文摘要

This research presents a hybrid COordinate Rotation DIgital Computer (CORDIC) algorithm for designs and implementations of the direct digital frequency synthesizer (DDFS). The proposed multiplier-less architecture with small ROM (16×4-bit) and pipelined data path provides a spurious free dynamic range (SFDR) of more than 84.4 dBc. A SoC (System on Chip) has been designed by 0.18µm 1P6M CMOS, and then emulated on the Xilinx FPGA. It is shown that the hybrid CORDIC-based architecture is suitable for VLSI implementations of the DDFS in terms of hardware cost, power consumption, and SFDR. *Key-Words:* - DDFS, hybrid CORDIC, SoC, FPGA, SFDR.

### (二) 報告內容:

#### 1. Introduction

The direct digital frequency synthesizer (DDFS) plays a key role in many digital communication systems. Fig. 1 depicts the conventional DDFS, which consists mainly of phase accumulator, sine/cosine generator, digital-to-analog converter, and low-pass filter. The sine/cosine generator as the core of DDFS is usually implemented by using a ROM lookup table; with high spurious free dynamic ranges (SFDR) comes a large ROM lookup table [1]. In order to reduce the size of the lookup table, many techniques were proposed [1]-[4]. The quadrant compression technique can reduce the ROM size by 75% [2]. The Sunderland architecture is to split the ROM into two smaller ones [3], and its improved version known as the Nicholas architecture results in a higher ROM-compression ratio (32:1) [4]. In [5], the polynomial hyperfolding technique with high order polynomial approximation was used to design DDFS. In [6]-[10], the angle rotation algorithm was used to design quadrature direct digital frequency synthesizer/complex mixer (QDDSM).

COordinate Rotation DIgital Computer (CORDIC) is a well known arithmetic algorithm, which evaluates various elementary functions including sine and cosine functions by using simple adders and shifters only. Thus, CORDIC is suitable for the design of high-performance chips with VLSI technologies. Recently, the CORDIC algorithm has received a lot of attention to the design of high-performance DDFS [11]-[14], especially for the modern digital communication systems.

In this research, we propose a hybrid CORDIC algorithm for the VLSI implementation of DDFS. The remainder of this paper proceeds as follows. In section 2, the proposed hybrid CORDIC algorithm is presented. In section 3, the hardware implementation of DDFS is described. The performance analysis is given in section 4. Conclusion can be found in section 5

#### 2. The Hybrid CORDIC Algorithm

In this section, the hybrid CORDIC algorithm is proposed, and based on which, a low-power and high-SFDR DDFS can be developed.

### 2.1 The modified scaling-free CORDIC algorithm

In order to reduce the number of CORDIC iterations, the input angle can be divided into encoded angles by using the modified Booth encoding (MBE) method [15]. Specifically, let  $\psi$  denote the input angle represented by

$$\psi = f(0)2^{-p} + f(1)2^{-(p+1)} + \dots + f(w-1)2^{-(w-1)}$$
(1)

where  $f(i) \in \{0,1\}$ , w is the word length of operands, and  $\left\lceil \frac{(w-2.585)}{3} \right\rceil = p \le i \le w-1$ . The MBE

decomposition of  $\psi$  is as follows.

$$\psi = \sum_{i=p/2}^{(w-1)/2} \beta(i)$$
(2)

where the encoded angle:  $\beta(i) = \rho(i)2^{-i}$  with  $\rho(i) \in \{-2, -1, 0, 1, 2\}$ . As  $\sin \beta(i)$  and  $\cos \beta(i)$  can be approximated by

$$\sin \beta(i) \cong \beta(i) = \rho(i)2^{-i} \tag{3}$$

$$\cos\beta(i) \cong 1 - \frac{\beta^2(i)}{2} = 1 - \rho^2(i)2^{-(2i+1)}$$
(4)

we have

$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = \begin{bmatrix} 1-\rho^2(i)2^{-(2i+1)} & \rho(i)2^{-i} \\ -\rho(i)2^{-i} & 1-\rho^2(i)2^{-(2i+1)} \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$
(5)

 $z(i+1) = z(i) - \rho(i)2^{-i}$ (6)

Fig. 2 shows the proposed architecture for the modified scaling-free CORDIC arithmetic, in which, eight shifters, two CSAs, two CLAs, two latches, and four MUXs are used; the shifters and MUXs are to determine  $\rho(i)$ .

### 2.2 The modified scaling-free radix-8 CORDIC Algorithm

By using the modified angle recoding method [15]-[16], the input angle  $\psi$  can be divided as follows.

$$\psi = \sum_{i=p}^{w-1} \phi(i) \tan^{-1} 2^{-i}$$
(7)

where  $\phi(i) \in \{0,1\}$ , and w is the word length. The CORDIC iteration is therefore represented as

$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = \begin{bmatrix} 1 & \phi(i)2^{-i} \\ -\phi(i)2^{-i} & 1 \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$
(8)

$$z(i+1) = z(i) - \phi(i) \tan^{-1} 2^{-i}$$
(9)

Let i = 3n - c;  $c \in \{0,1,2\}$ . By using the Taylor series expansion, the absolute difference between  $\tan^{-1}(2^{-(3n-c)})$ and  $2^c \tan^{-1}(2^{-3n})$  is given by

$$\varsigma = \left| \tan^{-1} 2^{-(3n-c)} - 2^c \tan^{-1} 2^{-3n} \right| = \left| \frac{1}{3} \cdot 2^{-3(3n-c)} + \Lambda \right|$$
(10)

where  $\Lambda$  is the remaining terms of the difference between  $\tan^{-1}(2^{-(3n-c)})$  and  $2^c \tan^{-1}(2^{-3n})$ . Thus, we have

$$\varsigma \simeq \frac{2^{-3(3n-c)}}{3} = \frac{2^{-3i}}{3} \tag{11}$$

For w-bit operands,  $\varsigma$  can be ignored in the following sense

$$\zeta \le 2^{-w} \tag{12}$$

Based on equations (11) and (12), we have

$$\frac{2^{-3i}}{3} \le 2^{-w}$$
 (13)

(14)

 $3i + \log_2 3 \ge w$ 

$$i \ge \frac{w - \log_2 3}{3} = \left\lceil \frac{w - \log_2 3}{3} \right\rceil \cong \frac{w}{3}$$
(15)

As a result, when  $i > \frac{w}{3}$ , three consecutive terms of equation (7) can be integrated into a single term as

follows:

$$\phi(3n-2)\tan^{-1}(2^{-(3n-2)}) + \phi(3n-1)\tan^{-1}(2^{-(3n-1)}) + \phi(3n)\tan^{-1}(2^{-(3n)}) = (\phi(3n-2)\cdot 2^2 + \phi(3n-1)\cdot 2^1 + \phi(3n)\cdot 2^0)\tan^{-1}(2^{-3n})$$

$$= \phi(n)\tan^{-1}(2^{-3n})$$
(16)

where  $\phi(\cdot) \in \{0,1\}$ , and therefore  $\phi(n) \in \{0,1,2,\dots,7\}$ . It follows that the resulting radix-8 CORDIC algorithm is represented as

$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = K_8(i) \begin{bmatrix} 1 & -\varphi(i) \cdot 2^{-3i} \\ \varphi(i) \cdot 2^{-3i} & 1 \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$
(17)

$$z(i+1) = z(i) - \varphi(i) \tan^{-1} 2^{-3i}$$
(18)

$$K_8(i) = (1 + \varphi^2(i)2^{-6i})^{1/2}$$
(19)

The scaling factor  $K_8$  is given by

$$K_{8} = \prod_{i=p}^{i=\lceil w/3 \rceil - 1} K_{8}(i)$$
(20)

It can be shown that the scaling factor turns out to be equal to 1 when the input angle is less than  $2^{-w/2}$ , and moreover, if the input angle is less than  $2^{-w/3}$ , equation (18) can be rewritten as [19]  $z(i+1) = z(i) - \varphi(i)2^{-3i}$ (21)

Fig. 3 depicts the proposed architecture for the modified scaling-free radix-8 CORDIC arithmetic. In which, six shifters, two CSAs, two CLAs, and two latches are used; the shifters and switches are to determine the radices for computations. Note that the number of processors is reduced, and system throughput is increased at the cost of hardware complexity.

#### 2.3 The proposed hybrid CORDIC Algorithm

The input angle  $\Omega$  can be decomposed into a higher-angle  $\Omega_H$  and a lower-angle  $\Omega_L$  represented as

$$\Omega = \Omega_{H} + \Omega_{L} = \sum_{i=0}^{\lceil u/2 \rceil - 1} \rho(i) 2^{-2i} + \sum_{i=\lceil u/2 \rceil}^{\lceil u/2 \rceil + \lceil (w-u)/3 \rceil - 1} \sum_{i=\lceil u/2 \rceil} \rho(i) 2^{u-1+3(i-\lceil w/4 \rceil)}$$
(22)

where *w* is the word length with the first *u* bits being the most significant bits;  $\Omega_H$  and  $\Omega_L$  are computed by using the modified scaling-free CORDIC algorithm and the modified scaling-free radix-8 CORDIC algorithm, respectively. For computation efficiency, the determination of *u* is as follows: 1) *u* must be an odd

number to satisfy the MBE method, and 2)  $u \ge \frac{w}{2}$  under the scaling-free constraint. Thus, we have

$$u = \begin{cases} 2n+1, if \ w = 4n+0\\ 2n+1, if \ w = 4n+1\\ 2n+1, if \ w = 4n+2\\ 2n+3, if \ w = 4n+3 \end{cases}$$
(23)

Based on the above equation, the minimum iteration number of the proposed hybrid CORDIC algorithm can be obtained as shown in Fig. 4. The computations of x(i) and y(i) are therefore as follows.

For 
$$\frac{p}{2} \le i < \left| \frac{u}{2} \right|$$
,  
 $x(i+1) = x(i) - \rho^2(i)2^{-(4i+1)}x(i) - \rho(i)2^{-2i}y(i)$  (24)

$$y(i+1) = y(i) - \rho^{2}(i)2^{-(4i+1)}x(i) + \rho(i)2^{-2i}y(i)$$
(25)

For 
$$\left\lceil \frac{u}{2} \right\rceil \le i < \left\lceil \frac{w}{4} \right\rceil + \left\lceil \frac{w-u}{3} \right\rceil$$
,  
 $x(i+1) = x(i) - \varphi(i)2^{u-1+3(i-\lceil w/4\rceil+1)}y(i)$  (26)  
 $y(i+1) = y(i) + \varphi(i)2^{u-1+3(i-\lceil w/4\rceil+1)}x(i)$  (27)

### 3. Hardware Implementation of the Proposed DDFS

In this section, the DDFS implemented by using the hybrid CORDIC algorithm is presented. Fig. 5 shows

the 16-bit DDFS architecture consisting mainly of phase accumulator, phase calculator, and sine/cosine generator, which is different from the conventional architecture. It is noted that the accumulated error in the sine/cosine generator is to be corrected by using the  $4 \times 16$ -bit correction table. Take into account DAC technology, hardware cost and practical applications, the word length of the propose DDFS is set to 16-bit.

The hybrid CORDIC-based sine/cosine generator with recursively accumulated angle  $\mathcal{G}_{in}$  is given by

$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = \begin{bmatrix} \cos \theta_{in} & -\sin \theta_{in} \\ \sin \theta_{in} & \cos \theta_{in} \end{bmatrix} \begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$
(28)

where  $\theta_{in} = \Delta_{acc} \frac{2\pi}{2^{16}}$ , and  $\Delta_{acc}$  is an integer number. (29)

For convergence, the input angle of the scale-free CORDIC algorithm is restricted as follows:

$$\mathcal{G}_{in} < \sum_{4}^{w-1} 2^{-i} \cong \frac{1}{8}$$
 (30)

From the above two equations, we have

$$\Delta_{acc} = \frac{2^{16}}{2\pi} \cdot \mathcal{G}_{in} < 1304 \tag{31}$$

The architecture for the sine/cosine generator is shown in Fig. 5. In which three modified scaling-free CORDIC arithmetic units (MCORDIC-Type A) and two modified scaling-free radix-8 CORDIC arithmetic units (MCORDIC-Type B) are used.

The chip is synthesized by the TSMC 0.18  $\mu m$  1P6M CMOS cell libraries [17]. The layout view of the proposed DDFS is shown in Fig. 6. The core size obtained by the Synopsys<sup>®</sup> design analyzer is  $612 \times 612 \mu m^2$ . The power consumption obtained by the PrimePower<sup>®</sup> is 6.05 mW with a clock rate of 100MHz at 1.8V. The tuning latency is 8 clock cycles. All the control signals are internally generated on-chip. The chip provides both high throughput and low gate count.

### 4. Performance Analysis of the Proposed DDFS

The number of correcting points versus the SFDRs with different  $(F_s/F_o)$ 's in the proposed DDFS is shown in Fig. 7. Due to trade-off between hardware cost and system performance, the correcting circuit with 16 points is implemented in the proposed DDFS. In case of 16-bit word length, as shown in Fig. 8, the high-frequency SFDR is 169.7 dBc, respectively. The mid-frequency SFDR of sine and cosine is 122 dBc, as shown in Fig. 9, respectively. The low-frequency SFDR of sine and cosine is 85.06 dBc, as shown in Fig. 10, respectively. As a result, the SFDR is above 85 dBc. Table 1 shows various comparisons of the proposed DDFS with other methods in [11] and [13]. As one can see, the proposed DDFS is superior in terms of SFDR, hardware cost, and power consumption.

#### 5. Conclusion

The hybrid CORDIC-based multiplier-less DDFS architecture with small ROM and pipelined data path has been implemented. A SoC designed by 1P6M CMOS has been emulated on Xilinx XC2V6000 FPGA. For 16-bit DDFS, the SFDR of sine and cosine using the proposed architecture are more than 85.06 dBc. Simulation results show that the hybrid CORDIC-based approach is superior to the traditional approach to the design and implementation of DDFS, in terms of SFDR, power consumption, and hardware cost. The 16-bit DDFS is a reusable IP, which can be implemented in various processes with efficient uses of hardware resources for trade-offs of performance, area, and power consumption.

### **References:**

- [1] J. Vankka, "Methods of mapping from phase to sine amplitude in direct digital frequency synthesis," *IEEE Proceedings of the Frequency Control Symposium*, June 5-7 1996, pp.942-950.
- [2] S. C. Yi, K. T. Lee, J. J. Chen, C. H. Lin, "A low power efficient direct digital frequency synthesizer based on new two-level lookup table," *IEEE Canadian Conference on Electrical and Computer Engineering 2006*, May 2006, pp.963-966.
- [3] D. A. Sunderland, R. A. Srauch, S. S. Wh arfield, H. T. Peterson, C. R. Cole, "CMOS/SOS frequency synthesizer LSI circuit for spread spectrum communications," *IEEE Journal of Solid-State Circuits*, Vol.19, No.4, August 1984, pp.497-506.
- [4] H. T. Nicholas, H. Samueli, B. Kim, "The optimization of direct digital frequency synthesizer performance in the presence of finite word length effects," *IEEE 42nd Annual Frequency Control Symposium*, June 1-3 1988, pp.357-363.
- [5] D. D. Caro, E. Napoli, A. G. M. Strollo, "Direct digital frequency synthesizers with polynomial hyperfolding technique," *IEEE Transactions of Circuits and Systems-II: Express Briefs*, Vol.51, No.7, July 2004, pp.337-344.
- [6] D. Fu, A. N. Willson, Jr. "A high-speed processor for digital sine/cosine generation and angle rotation" in Proc. 32nd Asilomar Conf. Signals, Systems and Computers, Vol.1 1998, pp.177-181
- [7] A. Torosyan, D. Fu, A. N. Willson, Jr, "A 300-MHz quadrature direct digital synthesizer/mixer in 0.25-μm CMOS" *IEEE Journal of Solid-State Circuits*, Volume 38, Issue 6, June 2003 pp. 875 887
- [8] S. Cheng, K. Zhang, S. Cao, X. Zhou, D. Zhou, "A 2.4-GHz ISM Band Delta-Sigma Fractional-N Frequency Synthesizer with Automatic Calibration Technique," WSEAS Transactions on Circuits and Systems, Issue 10, Vol.7, pp.859-868, Oct. 2008
- [9] I.JIVET, B.DRAGOI, "Performance Analysis of Direct Digital Synthesizer Architecture with Amplitude Sequencing," *WSEAS Transactions on Circuits and Systems*, Issue 1, Vol.7, pp.1-6, Jan 2008.
- [10] J. W. Hong, C. L. Hou, C. M. Chang, S. T. Cheng, H. Y. Su, "Current or/And Voltage-Mode Quadrature Oscillators With Grounded Capacitors and Resistors Using FDCCIIs," WSEAS Transactions on Circuits and Systems, Issue 3, Vol.7, pp.129-138, Mar 2008.
- [11] A. Madisetti, A. Y. Kwentus, A. N. Willson Jr, "A 100-MHz, 16-b, direct digital frequency synthesizer with a 100-dBc spurious-free dynamic range," *IEEE Journal of Solid-State Circuits*, Vol.34, No.8, August 1999, pp.1034-1043.
- [12] E. Grayver, B. Daneshrad, "Direct digital synthesis using a modified CORDIC," *IEEE International Symposium on Circuits and Systems (ISCAS '98)*. Vol.5, May 1998, pp.241-244.
- [13]C. Y. Kang, E. E. Swartzlander Jr., "Digit-pipelined direct digital frequency synthesis based on differential CORDIC," *IEEE Transactions on Circuits and Systems I: Regular Papers*, Vol.53, No.5, May 2006, pp.1035-1044.
- [14] T. Y. Sung, H. C. Hsin, "Design and simulation of reusable IP CORDIC core for special-purpose processors," *IET Computers & Digital Techniques*, Vol.1, No.5, Sept. 2007, pp.581-589.
- [15] Y. H. Hu, S. Naganathan, "An angle recoding method for CORDIC algorithm implementation," *IEEE Transactions on Computers*, Vol.42, No.1, January 1993, pp.99-102.
- [16] T. B. Juang, S. F. Hsiao, M. Y. Tsai, "Para-CORDIC: parallel CORDIC rotation algorithm," *IEEE Transactions on Circuits and Systems-I: Regular Papers*, Vol.51, No.8, August 2004, pp.1515-1524.

[17] "TSMC 0.18 CMOS Design Libraries and Technical Data, v.3.2," Taiwan Semiconductor Manufacturing Company, Hsinchu, Taiwan, and National Chip Implementation Center (CIC), National Science Council, Hsinchu, Taiwan, R.O.C., 2006.

| CODDIC Decod DDES             | Madisetti [11] 1999 |         | Swartzlander [13] 2006 |       | This work         |  |
|-------------------------------|---------------------|---------|------------------------|-------|-------------------|--|
| CORDIC Based DDFS             |                     |         |                        |       | [Sung, Hsin & Ko] |  |
| Process (µm)                  | 1.0                 |         | 0.13                   |       | 0.18              |  |
| Core Area $(mm^2)$            | 0.306               | 0.176   | 0.35                   | 0.15  | 0.375             |  |
| Maximum Sampling Rate (MHz)   | 80.4                | 85.7    | 1018                   | 1052  | 100               |  |
| Power Consumption (mw)        | 40.602              | 20.8251 | 350                    | 143   | 6.056             |  |
| Power Consumption (mw/MHz)    | 0.505               | 0.243   | 0.343                  | 0.134 | 0.06              |  |
| SFDR (dB <sub>c</sub> )       | 81                  | 62.1    | 90                     | 60    | 85.06             |  |
| Output Resolution (bits)      | 16                  | 16      | 16                     | 11    | 16                |  |
| Tuning Latency (clock cycles) | 16                  | 16      |                        |       | 8                 |  |

#### Table I Comparison with previous works



Fig. 1 The conventional DDFS architecture



Fig. 2 The proposed architecture of modified scaling-free CORDIC arithmetic for computing  $\theta_H$  (MCORDIC-Type A)



Fig. 3 The architecture of modified scaling-free radix-8 CORDIC arithmetic for computing  $\theta_L$  (MCORDIC-Type B)



Fig. 4 The 16-bit DDFS architecture



Fig. 5 The architecture of sine/cosine generator (The  $\theta_{in}$  is an accumulated angle)



Fig. 6 The layout view of the proposed DDFS



Fig. 7 Plot of the number of correcting points versus SFDRs with different  $(F_s / F_o)$ 's



Fig.8 High-frequency SFDR using the proposed 16-bit DDFS (169.7 dBc)



Fig.9 Mid-frequency SFDR using the proposed 16-bit DDFS (122 dBc)



Fig.10 Low-frequency SFDR using the proposed 16-bit DDFS (85.06 dBc)

## 國科會補助專題研究計畫成果報告自評表

請就研究內容與原計畫相符程度、達成預期目標情況、研究成果之學術或應用價值(簡要敘述成果所代表之意義、價值、影響或進一步發展之可能性)、是否適 合在學術期刊發表或申請專利、主要發現或其他有關價值等,作一綜合評估。

| 1 | . 請就研究內容與原計畫相符程度、達成預期目標情況作一綜合評估                          |
|---|----------------------------------------------------------|
|   | 達成目標                                                     |
|   | 🗌 未達成目標(請說明,以 100 字為限)                                   |
|   | □ 實驗失敗                                                   |
|   | □ 因故實驗中斷                                                 |
|   | □ 其他原因                                                   |
|   | 說明:                                                      |
| 2 | . 研究成果在學術期刊發表或申請專利等情形:                                   |
|   | 論文:■已發表 □未發表之文稿 □撰寫中 □無                                  |
|   | 專利:□已獲得 ■申請中 □無                                          |
|   | 技轉:□已技轉 □洽談中 ■無                                          |
| 卢 | 其他:(以100字為限)                                             |
| 3 | . 請依學術成就、技術創新、社會影響等方面,評估研究成果之學術或應用價                      |
|   | 值(簡要敘述成果所代表之意義、價值、影響或進一步發展之可能性)(以                        |
|   | 500 字為限)                                                 |
|   | 本研究使用混合座標旋轉原理設計及製作直接數位頻率合成器。此一設計之架構為無乘法                  |
|   | 器,包含小量之唯獨記憶體(16×4-位元)以及疊流資料路徑,所產生無寄生動態範圍超過               |
|   | 84.4 dBc。系統晶片由台積電0.18µm 1P6M CMOS 製程設計,並且在 Xilinx 陣列處理器上 |
|   | 實體模擬。證明此一以混合座標旋轉原理為基礎之直接數位頻率合成器適合由超大型積體                  |
|   | 電路製作,在硬體成本,功率消耗以及無寄生動態範圍上都有具備優勢。非常適用於無線                  |
|   | 網路之晶片。                                                   |
|   | 未來將進一步研究硬體架構的精簡,高頻之無寄生動態範圍提升以及雜訊比(PSNR)的改                |
|   | 進。                                                       |
|   |                                                          |
|   |                                                          |
|   |                                                          |
|   |                                                          |
|   |                                                          |
|   |                                                          |
| 1 |                                                          |

# 國科會補助計畫衍生研發成果推廣資料表

日期: 99年8月25日

|                | 計畫名稱:座標旋轉                                                             | 原理演算法應用於二                        | 二維及三維特殊信號處理                        |  |
|----------------|-----------------------------------------------------------------------|----------------------------------|------------------------------------|--|
|                | 器之晶片設計與製作(I)                                                          |                                  |                                    |  |
| 國科會補助計畫        | 計畫主持人: 宋志:                                                            | 卖                                |                                    |  |
|                | 計畫編號:NSC 98-22                                                        | 221-E-216-037 領                  | 域:積體電路及系統設計                        |  |
|                | (<br>(<br>中文)<br>高無寄生動                                                | 熊範圍及無乘法器之                        | 2直接數位頻率合成器                         |  |
| <b>吅孫亡里夕</b> 稱 |                                                                       |                                  |                                    |  |
| 州贸成不石件         | (英文) High-SFDR a                                                      | and Multiplierless D             | irect Digital Frequency            |  |
|                | Synthesizer                                                           |                                  |                                    |  |
| 成果歸屬機構         | 中華大學                                                                  | 發明人                              | 宋志雲                                |  |
|                |                                                                       | (創作人)                            |                                    |  |
|                | (中文)使用混合座                                                             | 標旋轉原理設計及靠                        | 製作直接數位頻率合成                         |  |
|                | 器。此一設計之架構                                                             | 為無乘法器,包含,                        | 卜量之唯獨記憶體(16X4-                     |  |
|                | 位元)以及疊流資料路                                                            | 徑,所產生無寄生                         | 動態範圍超過 84.4 dBc。                   |  |
|                | 系統晶片由台積電 11                                                           | P6M CMOS 製程設                     | 計,並且在 Xilinx 陣列處                   |  |
|                | 理器上實體模擬。證明此一以混合座標旋轉原理為基礎之直接數                                          |                                  |                                    |  |
|                | 位頻率合成器適合由超大型積體電路製作,在硬體成本,功率消                                          |                                  |                                    |  |
|                | 耗以及無寄生動態範圍上都有具備優勢。本合成器於高頻條件下,                                         |                                  |                                    |  |
|                | 月史向之無可王勤悲!                                                            | 靶闺,连到 109./UC<br>岢妃妈血灾止禹能等       | C。比較現任的且接數位                        |  |
|                | 频平合成品, 共有升<br>(茁立) This research                                      | 市灯的無可生動怨車<br>presents a hybrid ( | 心国。<br>"Oordinate Rotation Digital |  |
| 技術說明           | $(\mathcal{P}\mathcal{X})$ This research<br>Computer (CORDIC)         | algorithm for design             | ns and implementations of          |  |
|                | the direct digital free                                               | (DDFS). The proposed             |                                    |  |
|                | multiplier-less architecture with small ROM (16X4 -bit) and pipelined |                                  |                                    |  |
|                | data path provides a sp                                               | urious free dynamic              | range (SFDR) of more than          |  |
|                | 84.4 dBc.A SoC (Sys                                                   | stem on Chip) has                | been designed by 1P6M              |  |
|                | hybrid CORDIC-bas                                                     | ed architecture                  | is suitable for VI SI              |  |
|                | implementations of th                                                 | ne DDFS in terms                 | of hardware cost, power            |  |
|                | consumption, and SI                                                   | FDR. In case of                  | 16-bit word length, the            |  |
|                | high-frequency SFDR                                                   | is 169.7 dBc.As c                | one can see, the proposed          |  |
|                | DDFS is superior in                                                   | terms of SFDR, h                 | hardware cost, and power           |  |
|                | Consumption.                                                          |                                  |                                    |  |
| <b>產業別</b>     | <b>庙</b> 斤 跤 <b>千</b>                                                 |                                  |                                    |  |
| <b>注</b> 水初    |                                                                       |                                  |                                    |  |
| 山仁/文口应田竺田      | 無線數位高頻寬網絡                                                             | 設備及晶片                            |                                    |  |
| <b></b>        |                                                                       |                                  |                                    |  |
| 计处地描示化业工工机     | 可轉移晶片設計原始。                                                            | 碼及相關實驗資料                         | ,改進相關設備之性能。                        |  |
| <b></b>        |                                                                       |                                  |                                    |  |
| 效益             |                                                                       |                                  |                                    |  |

註:本項研發成果若尚未申請專利,請勿揭露可申請專利之主要內容。

### 國科會補助專題研究計畫項下出席國際學術會議心得報告

| 日期・99年8月23日 |
|-------------|
|-------------|

| 計畫編號           | NSC 98-2221-E-216-037                                                                               |                       |                |  |  |
|----------------|-----------------------------------------------------------------------------------------------------|-----------------------|----------------|--|--|
| 計畫名稱           | 座標旋轉原理演算法應用於二維及三維特殊信號處理器之晶片設計與製作(I)                                                                 |                       |                |  |  |
| 出國人員           | 它士雨                                                                                                 | 服務機構                  | 中華大學微電子工程學系    |  |  |
| 姓名             | 木心芸                                                                                                 | 及職稱                   | 教授             |  |  |
| 會議時間           | 99年4月11日至<br>99年4月13日                                                                               | 會議地點                  | 中國 杭州市         |  |  |
| 會議名稱           | 9 <sup>th</sup> WSEAS International Conference on Instrumentation, Measurement, Circuit and Systems |                       |                |  |  |
| 旅丰龄士           | (1) Reconfigurable Architecture for VLSI 9/7-5/3 Wavelet Filter                                     |                       |                |  |  |
| 资 <b>公</b> 珊 入 | (2) Folded Reconfigurable Arch                                                                      | hitecture for VLSI W  | /avelet Filter |  |  |
| 7月 日           | (3) A Novel Linear Array for D                                                                      | Discrete Cosine Trans | sform          |  |  |

一、參加會議經過

發表論文三篇,9<sup>th</sup> WSEAS International Conference on Instrumentation, Measurement, Circuit and Systems 並 與與會人士討論,並擔任兩場討論會之 Session Chair。本研討會為 WSEAS 之重要研討會,歷史 悠久,至少有二十年以上之歷史。本人發表之三篇論文接獲選為最佳論文(Best Papers),並邀請增 加篇幅在其期刊 WSEAS Transactions on Circuits and Systems (EI)登出。本人發表之論文,引起大家 極大興趣,會中廣泛討論內容。

二、與會心得

此一研討會為 WSEAS 重要研討會,與會人士均為一時之選。會中可認識相關領域的領導人物,機會 難得。論文如獲最佳論文,延伸後將獲得該學會之期刊(EI)刊登。與相關人士討論在台灣主辦此一國際 研討會之可能性。

三、考察參觀活動(無是項活動者略)

四、建議

鼓勵在教學之學校教師投稿此一研討會,並組 Special Session,因為有指標性意義。同時爭取 WSEAS 類似權威性研討會之主辦權,以增加台灣之國際學術地位。

五、攜回資料名稱及內容

研討會之論文光碟及紙本論文集。

# 附錄一:本計畫發表之相關論文

### 國際期刊:

- Tze-Yun Sung, Hsi-Chin Hsin, Yi-Peng Cheng, "Low-power and high-speed CORDIC-based split-radix FFT processor for OFDM systems", Digital Signal Processing, vol. 20, no. 2, 2010, pp. 511-527. (SCI) NSC 98-2221-E-216 -037.
- Tze-Yun Sung, Yaw-Shih Shieh, Hsi-Chin Hsin, "An Efficient VLSI Linear Array for DCT/IDCT Using Subband Decomposition Algorithm", Mathematical Problems in Engineering, vol. 2010, Article ID 185398, 21 pages, doi:10.1155/2010/185398, 2010. (SCI), NSC 98-2221-E-216 -037.
- 3. Yaw-Shih Shieh, **Tze-Yun Sung** (通訊作者), Hsi-Chin Hsin, "A Novel Linear Array for Discrete Cosine Transform", WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS, Issue 5, Volume 9, May 2010, pp. 335-346. (EI), NSC 98-2221-E-216 -037.
- Tze-Yun Sung, Hsi-Chin Hsin, "Reconfigurable Architecture for VLSI 9/7-5/3 Wavelet Filter", WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS, Issue 5, Volume 9, May 2010, pp. 347-357. (EI), NSC 98-2221-E-216 -037.
- Tze-Yun Sung, Hsi-Chin Hsin, "Multiplierless, Reconfigurable Folded Architecture for VLSI Wavelet", WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS, Issue 5, Volume 9, May 2010, pp. 358-368. (EI), NSC 98-2221-E-216 -037.

### 國內期刊:

- 1. **Tze-Yun Sung**, Hsi-Chin Hsin, Sheng-Dong Chang, "Reconfigurable VLSI Architecture for 9/7-5/3 Discrete Wavelet Transform", Chung Hua Journal of Science and Engineering (中華理工學刊), vol. 7, no. 3, Sept. 2009, pp. 63-70. NSC 98-2221-E-216 -037.
- Tze-Yun Sung, Hsi-Chin Hsin, Tsuo-Fu Wang, "VLSI Reconfigurable Architecture for 9/7-5/3 Lifting-Based Discrete Wavelet Transform", Chung Hua Journal of Science and Engineering (中華理工 學刊), vol. 7, no. 4, Dec. 2009, pp. 63-70. NSC 98-2221-E-216 -037.

### 國際研討會:

- 1. Yaw-Shih Shieh, **Tze-Yun Sung**(通訊作者), Hsi-Chin Hsin, "A Novel Linear Array for Discrete Cosine Transform" 9th WSEAS Int. Conference on INSTRUMENTATION, MEASUREMENT, CIRCUITS and SYSTEMS, April 2010, pp. 27-32. (EI), NSC 98-2221-E-216-037.
- Tze-Yun Sung, Hsi-Chin Hsin, "Reconfigurable Architecture for VLSI 9/7-5/3 Wavelet Filter," 9th WSEAS Int. Conference on INSTRUMENTATION, MEASUREMENT, CIRCUITS and SYSTEMS, April 2010, pp. 15-20. (EI), NSC 98-2221-E-216 -037.
- Tze-Yun Sung, Hsi-Chin Hsin, "Reconfigurable Architecture for VLSI 9/7-5/3 Wavelet Filter," 9th WSEAS Int. Conference on INSTRUMENTATION, MEASUREMENT, CIRCUITS and SYSTEMS, April 2010, pp. 21-26. (EI), NSC 98-2221-E-216-037.

### 4.

### 國內研討會:

1. Yaw-Shih Shieh, Tze-Yun Sung(通訊作者), Hsin-Chin Hsin, "A Novel VLSI Linear Array for Discrete

Cosine Transform and Its Inverse," 8<sup>th</sup> Conference of Microelectronics Technology and Applications (2010-CMTA), Kaohsung City, Taiwan, May 2010, pp.241-250. NSC 98-2221-E-216 -037.

 Tze-Yun Sung, Hsin-Chin Hsin, "Multiplierless, Folded Reconfigurable Architecture for VLSI Wavelet Filter," 8<sup>th</sup> Conference of Microelectronics Technology and Applications (2010-CMTA), Kaohsung City, Taiwan, May 2010, pp. 251-258. NSC 98-2221-E-216 -037.

### 國科會補助專題研究計畫項下出席國際學術會議心得報告

| ロ 切 ・ ファ 平0 / 4./ 6 | 日期 | : | <b>9</b> 9年 | F-8 | 月 | 25 | E |
|---------------------|----|---|-------------|-----|---|----|---|
|---------------------|----|---|-------------|-----|---|----|---|

| 計畫編號           | NSC 98-2221-E-216-037                                                                               |                       |                |  |  |
|----------------|-----------------------------------------------------------------------------------------------------|-----------------------|----------------|--|--|
| 計畫名稱           | 座標旋轉原理演算法應用於二維及三維特殊信號處理器之晶片設計與製作(I)                                                                 |                       |                |  |  |
| 出國人員           | · · 士 · 中                                                                                           | 服務機構                  | 中華大學微電子工程學系    |  |  |
| 姓名             | 木心芸                                                                                                 | 及職稱                   | 教授             |  |  |
| 會議時間           | 99年4月11日至<br>99年4月13日                                                                               | 會議地點                  | 中國 杭州市         |  |  |
| 會議名稱           | 9 <sup>th</sup> WSEAS International Conference on Instrumentation, Measurement, Circuit and Systems |                       |                |  |  |
| 旅丰龄士           | (1) Reconfigurable Architecture for VLSI 9/7-5/3 Wavelet Filter                                     |                       |                |  |  |
| 资 <b>公</b> 珊 入 | (2) Folded Reconfigurable Arcl                                                                      | hitecture for VLSI W  | /avelet Filter |  |  |
| 728日           | (3) A Novel Linear Array for D                                                                      | Discrete Cosine Trans | sform          |  |  |

一、參加會議經過

發表論文三篇,9<sup>th</sup> WSEAS International Conference on Instrumentation, Measurement, Circuit and Systems 並與 與會人士討論,並擔任兩場討論會之 Session Chair。本研討會為 WSEAS 之重要研討會,歷史悠久, 至少有二十年以上之歷史。本人發表之三篇論文接獲選為最佳論文(Best Papers),並邀請增加篇幅在 其期刊 WSEAS Transactions on Circuits and Systems (EI)登出。本人發表之論文,引起大家極大興趣, 會中廣泛討論內容。

二、與會心得

此一研討會為 WSEAS 重要研討會,與會人士均為一時之選。會中可認識相關領域的領導人物,機會難 得。論文如獲最佳論文,延伸後將獲得該學會之期刊(EI)刊登。與相關人士討論在台灣主辦此一國際研討 會之可能性。

三、考察參觀活動(無是項活動者略)

四、建議

鼓勵在教學之學校教師投稿此一研討會,並組 Special Session,因為有指標性意義。同時爭取 WSEAS 類似權威性研討會之主辦權,以增加台灣之國際學術地位。

五、攜回資料名稱及內容

研討會之論文光碟及紙本論文集。

### **Reconfigurable Architecture for VLSI 9/7-5/3 Wavelet Filter**

Tze-Yun Sung<sup>1</sup> Hsi-Chin Hsin<sup>2</sup> <sup>1</sup>Department of Microelectronics Engineering Chung Hua University 707, Sec. 2, Wufu Road, Hsinchu City 300-12, Taiwan bobsung@chu.edu.tw <sup>2</sup>Department of Computer Science and Information Engineering National United University 1, Lien-Da, Miao-Li, 36003, Taiwan hsin@nuu.edu.tw

*Abstract:* - In this paper, the high-efficient and reconfigurable lined-based architectures for the 9/7-5/3 discrete wavelet transform (DWT) based on lifting scheme are proposed. The proposed parallel and pipelined architectures consist of a horizontal filter (HF) and a vertical filter (VF). The critical paths of the proposed architectures are reduced. Filter coefficients of the biorthogonal 9/7-5/3 wavelet low-pass filter are quantized before implementation in the high-speed computation hardware In the proposed architecture is 100% hardware utilization and ultra low-power. The proposed reconfigurable architectures have regular structure, simple control flow, high throughput and high scalability. Thus, they are very suitable for new-generation image compression systems, such as JPEG-2000.

*Key-Words:* - Reconfigurable architecture, 9/7-5/3 discrete wavelet transform (DWT), horizontal filter (HF), vertical filter (VF), lifting scheme.

### **1** Introduction

In the field of digital image processing, the JPEG-2000 standard uses the scalar wavelet transform for image compression [1]; hence, the two-dimensional (2-D) discrete wavelet transform (DWT) and IDWT has recently been used as a powerful tool for image coding/decoding systems. Two-dimensional DWT/IDWT demands massive computations, hence, it requires a parallel and pipelined architecture to perform real-time or on-line video and image coding and decoding, and to implement high-efficiency application-specific integrated circuits (ASIC) or field programmable gate array (FPGA). At the kernel of the compression stage of the system is the DWT.

Swelden proposed using the biorthogonal 9/7 wavelet based on lifting scheme for lossy compression [2]. The symmetry of the biorthogonal 9/7 filters and the fact that they are almost orthogonal [2] make them good candidates for image compression application. Gall and Tabatai proposed using the biorthogonal 5/3 wavelet based on lifting scheme for lossless compression [3]. The goal of the proposed architectures is to embed the 5/3 DWT computation into the 9/7 DWT computation. The coefficients of the filter are quantized before hardware implementation; hence, the multiplier can be replaced by limited quantity of shift registers and adders. Thus, the system hardware is saved, and the system throughput is improved significantly.

In this paper, we proposed a high-efficient architecture for the even and odd parts of 1-D DWT based on lifting scheme. The advantages of the proposed architectures are 100% hardwareutilization, multiplier-less, regular structure, simple control flow and high scalability.

The remainder of the paper is organized as follows. Section 2 presents the lifting-based 2-D discrete wavelet transform algorithm, and derives new mathematical formulas. In Section 3, the high-efficient and reconfigurable architecture for the lifting-based 2-D DWT are proposed. Finally, comparison of performance between the proposed reconfigurable architecture and previous works is made with conclusions given in section 4.

### 2 The Lifting-Based 2-D DWT Algorithm

Usually the Lifting-based DWT requires less computation compared to the convolution-based approach. However, the savings depend on the length of the filters. During the lifting implementation, no-extra memory buffer is required because of the in-place computation feature of lifting. This is particularly suitable for the hardware implementation with limited available on-chip memory. Many papers proposed the algorithms and architectures of DWT [3]-[11], but they require massive computation. In 1996, Sweldens proposed a new lifting-based DWT architecture, which requires half of hardware compared to the conventional approaches [2].

#### 2.1 The 9/7 2-D DWT Algorithm

The 9/7 discrete wavelet transform factoring into lifting scheme is represented as [12]:

$$\widetilde{P}_{9/7} = \begin{bmatrix} 1 & \alpha(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1+z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma(1+z^{-1}) \\ 0 & 1 \end{bmatrix}$$
(1)
$$\begin{bmatrix} 1 & 0 \\ \delta(1+z) & 1 \end{bmatrix} \begin{bmatrix} \zeta & 0 \\ 0 & 1/\zeta \end{bmatrix}$$

where  $\alpha, \beta, \gamma$  and  $\delta$  are the coefficients of lifting scheme, and  $\zeta$  and  $1/\zeta$  are scale normalization factors.

The architecture based on lifting scheme consists of splitting module, two lifting modules and scaling modules. The architecture of 9/7 1-D DWT based on lifting scheme is shown in Figure 1.

The 9/7 2-D DWT is a multilevel decomposition technique. According to the architecture of 9/7 1-D DWT based on lifting scheme, the architecture of modified 9/7 2-D DWT based on lifting scheme can be derived and shown in Figure 2.

**2.2** The modified 9/7 2-D DWT Algorithm According to equation (1), the transform matrix of the 9/7 DWT based on lifting scheme is modified as

$$\begin{split} \tilde{P}_{1}(z) &= \begin{bmatrix} 1 & \alpha(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1+z) & 1 \end{bmatrix} \\ \begin{bmatrix} 1 & \gamma(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta(1+z) & 1 \end{bmatrix} \begin{bmatrix} \zeta & 0 \\ 0 & 1/\zeta \end{bmatrix} \\ &= \begin{bmatrix} 1/\alpha & (1+z^{-1}) \\ 0 & 1/\alpha \end{bmatrix} \begin{bmatrix} 1/\beta & 0 \\ (1+z) & 1/\beta \end{bmatrix} \\ \begin{bmatrix} 1/\gamma & (1+z^{-1}) \\ 0 & 1/\gamma \end{bmatrix} \begin{bmatrix} 1/\delta & 0 \\ (1+z) & 1/\delta \end{bmatrix} \begin{bmatrix} \alpha\beta\gamma\delta\zeta & 0 \\ 0 & \alpha\beta\gamma\delta/\zeta \end{bmatrix} \\ &= \begin{bmatrix} 1 & 1+z^{-1} \\ 0 & 1/\alpha \end{bmatrix} \begin{bmatrix} 1/\alpha\beta & 0 \\ 1+z & 1 \end{bmatrix} \\ \begin{bmatrix} 1 & 1+z^{-1} \\ 0 & \alpha\beta\gamma\delta\zeta & 0 \\ 0 & 1/\beta\gamma \end{bmatrix} \begin{bmatrix} 1/\gamma\delta & 0 \\ 1+z & 1 \end{bmatrix} \begin{bmatrix} \alpha\beta\gamma\delta\zeta & 0 \\ 0 & \alpha\beta\gamma/\zeta \end{bmatrix} \end{split}$$

$$= \begin{bmatrix} 1 & 1+z^{-1} \\ 0 & A \end{bmatrix} \begin{bmatrix} B & 0 \\ 1+z & 1 \end{bmatrix}$$

$$\begin{bmatrix} 1 & 1+z^{-1} \\ 0 & C \end{bmatrix} \begin{bmatrix} D & 0 \\ 1+z & 1 \end{bmatrix} \begin{bmatrix} K_0 & 0 \\ 0 & K_1 \end{bmatrix}$$
(2)
$$A = 1/\pi B = 1/\pi B = C = 1/\pi B = 1/\pi B$$

where  $A = 1/\alpha$ ,  $B = 1/\alpha\beta$ ,  $C = 1/\beta\gamma$ ,  $D = 1/\gamma\delta$ ,  $K_0 = \alpha\beta\gamma\delta\zeta$ , and  $K_1 = \alpha\beta\gamma/\zeta$ .

#### 2.3 The 5/3 2-D DWT Algorithm

The data flow of 5/3 1-D DWT based on lifting scheme is shown in Figure 3.

The 5/3 2-D DWT is a multilevel decomposition technique; that decomposes into four subbands such as HH, HL, LH and LL. The data flow of 5/3 2-D DWT based on lifting scheme can be derived and shown in Figure 4. The 5/3 discrete wavelet transform factoring into lifting scheme is represented as [10]:

$$\widetilde{P}_{5/3} = \begin{bmatrix} 1 & \alpha (1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta (1+z) & 1 \end{bmatrix}$$

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
(3)

### 3 The High-Efficient and Reconfigurable Architectures for 9/7 and 5/3 Lifting-Based 2-D DWT

The proposed reconfigurable architecture for 5/3 and 9/7 lifting based 2-D DWT including horizontal filter (HF) and vertical filter (VF) is shown in Figure 5. In this reconfigurable architecture, the architecture of horizontal filter and the architecture of vertical filter are shown in Figure 6 and 7, respectively. The proposed reconfigurable architecture for modified horizontal filter (HF) consists of eleven delay units, seventeen multiplexers and two processing elements (PEs). The PE(A/B/E) performs O1 for data output 5/3(1) and PE(C/D/F) performs O2 for data output 9/7(1). The architecture of PE(A/B/E) and the architecture of PE(C/D/F) are shown in Figure 8 and 9, respectively. The architecture of scaling normalization (SN) is shown in Figure 10. Filter coefficients of the biorthogonal 9/7 and 5/3 wavelet low-pass filter are quantized before implementation in the high-speed computation hardware. In the proposed architecture, all multiplications are performed using shifts and additions after approximating the coefficients as a Booth binary recoded format (BBRF). The constant multiplier shown in Figure 11 consists of two carry-save-adders (CSA(4,2)), a Carry Lookahead Adder (CLA), and six hardwire shifters and replaces conventional multiplier ( $\otimes$ ) in *PE*(*A*/*B*/*E*), *PE*(*C*/*D*/*F*) and SN. Figure 12 shows architectures of line delays LD, LD1, LD2 and LD3 in vertical filter. We handle borders by the symmetric extension method [10]. Hence, the quality of reconstructed images can be improved.

The proposed reconfigurable architectures for 9/7 and 5/3 DWT reduce the critical path [13]-[19]. In  $N \times N$  2-D DWT, it requires  $9J + 10N(1 - \frac{1}{2^J}) + \frac{4}{3}N^2(1 - \frac{1}{4^J})$  computation

cycles (addition operations) with  $N^2/4+9N$ memories to perform 9/7 2-D DWT, where *J* is number of levels. It requires

$$4J + 2N(1 - \frac{1}{2^J}) + \frac{2}{3}N^2(1 - \frac{1}{4^J})$$
 computation

cycles (addition operations) with  $N^2/4 + 3.5N$ memories to perform 5/3 2-D DWT, where *J* is number of levels. Both of two architectures are 100% hardware utilization.

### **4** Conclusion

Filter coefficients are quantized before implementation using the biorthogonal 9/7 and 5/3 wavelet. The hardware is cost-effective and the system is high-speed. The proposed architecture in 9/7 DWT reduces power dissipation by m compared with conventional architectures in m-bit operand (low-power utilization).

The proposed architecture in 5/3 DWT with 24-bit fixed point operations had been applied to  $512 \times 512$  original images Lena is shown in Figures 13(a) and the reconstructed images Lena is shown in Figure 13(b), respectively. The PSNRs of the reconstructed images Lena is 32.554dB. Hence, the proposed reconfigurable architecture has been applied to image compression with great satisfaction.

In this paper, the high-efficient and low-power reconfigurable architecture for 2-D DWT has been proposed. The proposed reconfigurable architecture performs compression in

$$(9J+10N(1-\frac{1}{2^{J}})+\frac{4}{3}N^{2}(1-\frac{1}{4^{J}})) \cdot T_{a}$$
 computation  
time for 9/7 DWT and

time for 9/7 DWT and  $(\frac{3}{2}N(1-\frac{1}{2^J})+\frac{2}{3}N^2(1-\frac{1}{4^J}))\cdot T_a$  for 5/3 DWT,

where the time unit (Ta is time of addition operation). The critical paths are  $3T_a$  for 9/7 DWT and  $2T_a$  for 5/3 DWT, and the output latency time are  $49T_a$  for 9/7 DWT and  $11T_a$  for 5/3 DWT. Buffer sizes are  $N^2/4+9N$  for 9/7 DWT and  $N^2/4+3.5N$  for 5/3 DWT. The control complexity is very simple.

The comparisons between previous works [13] [18] and this work are shown in Table 1 for 9/7 DWT and Table 2 for 5/3 DWT.

The advantages of the proposed reconfigurable architecture are 100% hardware utilization and ultra low-power. The architecture has regular structure, simple control flow, high throughput and high scalability. Thus, it is very suitable for new-generation image compression systems, such as JPEG-2000. The proposed reconfigurable DWT is a reusable IP, which can be implemented in various processes and combined with an efficient use of hardware resources for the trade-offs of performance, area, and power consumption.

References:

- [1] ITU-T Recommendation T.800. JPEG2000 image coding system – Part 1, ITU Std., July 2002. http://www.itu.int/ITU-T/.
- [2] W. Sweldens, "The lifting scheme: A custom-design construction of biorthogonal wavelet," *Applied and Computational Harmonic Analysis*, vol. 3, 1996, pp. 186-200.
- [3] D. L. Gall, A. Tabiatai, "Sub-band coding of digital images using symmetric short kernel filters and arithmetic coding techniques," *Proc. IEEE Int. Conf. Acoust., Speech, signal Process.*, 1988, pp.761-764.
- [4] G. Beylkin, R. Coifman, V. Rokhlin, "Wavelet in numerical analysis in Wavelets and their applications," New York: Jones and Bartlett, 1992.
- [5] A. N. Akansu, R. A. Haddad, "Multiresolution signal decomposition: Transform, subbands and Wavelets," New York: Academic, 1992.
- [6] I. Sodagar, H.-J. Lee, P. Hatrack, Y.-Q. Zhang, "Scalable wavelet coding for synthetic/natural hybrid images," *IEEE Trans. Circuits and Systems for Video Technology*, Vol. 9, Mar. 1999, pp. 244-254.
- [7] D. Taubman, "High performance scalable image compression with EBCOT," *IEEE Trans. Image Processing*, Vol. 9, 2000. pp. 1158-1170.
- [8] R. Kronland-Martinet, J. Morlet, A. Grossman, "Analysis of sound patterns through wavelet transform," *Int. J. Pattern Recognit. Artif. Intell.*, Vol. 1, No. 2, 1987, pp. 273-302.
- [9] M. A. Stoksik, R. G. Lane, D. T. Nguyen, "Accurate synthesis of fractional Brownian motion using wavelets," *Electronic Letters*, Vol. 30, No. 5, Mar. 1994, pp. 384-284.
- [10] T. Y. Sung, "Memory-efficient and high-performance parallel-pipelined architectures

for 5/3 forward and inverse discrete Wavelet transform," *WSEAS TRANSACTIONS on ELECTRONICS*, Issue 2, Volume 4, February 2007, pp.32-41.

- [11] K. C. B. Tan, T. Arslan, "An embedded extension algorithm for the lifting based discrete wavelet transform in JPEG 2000," *IEEE International conference on acoustic, speech, and signal processing*, Vol.4, 2002, pp.3513-3516.
- [12] K. Parhi, T. Nishitani, "VLSI architectures for discrete wavelet transforms," *IEEE Trans. VLSI Systems*, Vol. 1, No. 2, 1993, pp. 191-202.
- [13] I. Daubechies, W. Sweldens., "Factoring wavelet transforms into lifting schemes," *J.Fourier Anal. Appl.*, Vol.4, 1998, pp. 247-269.
- [14] K. Andra, C. Chakrabarti, T. Acharya, "A VLSI architecture for lifting-based forward and inverse wavelet transform," *IEEE Trans. on Signal Processing*, Vol. 50, 2002, pp.966-977.
- [15] C. Xiong, S. Zheng, J. Tian, J. Liu, "The improved lifting scheme and novel reconfigurable VLSI architecture for the 5/3 and 9/7 wavelet filters," 2004 IEEE International Conference on Communications, Circuits and Systems, Vol. 2, 2004, pp. 728-732.
- [16] T.-Y. Sung, Y.-S. Shieh, "A high-speed / ultra low-power architecture for 2-D discrete Wavelet transform," 2005 IEEE International Conference on Systems and Signals (ICSS-2005), Kaohsiung, Taiwan, April 28-29, 2005, pp. 326-331.
- [17] T.-Y. Sung, C.-W. Yu, Y.-S. Shieh, H.-C. Hsin, "A high-efficient line-based architecture for 2-D lifting-based DWT using 9/7 Wavelet filters," 9th International Conference on Computer Science and Informatics (CSI-2006) in conjunction with 9th Joint Conference on Information Sciences (JCIS-2006), Kaohsiung, Taiwan, Oct. 8-11, 2006, pp.626-629.
- [18] B.-F. Wu, C.-F. Lin, "A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete Wavelet transform of JPEG-2000 Codec," *IEEE Trans. Circuits and Systems for Video Technology*, Vol. 15, No. 12, Dec. 2005, pp. 1615-1628.
- [19] J. S. Chiang, and C. H. Hsia, "An efficient VLSI architecture for 2-D DWT using lifting scheme," *IEEE International Conference on Systems and signals*, April, 2005, pp.528-531.

Table 1 Comparison between previous works and this work (9/7 DWT architecture with single input,  $T_m$ : multiplication time,  $T_a$ : addition time and  $T_d$ : latency delay )

| Archite-<br>cture | Performance                                                                   |
|-------------------|-------------------------------------------------------------------------------|
|                   | Memory Size(bytes):                                                           |
|                   | $N^2/4 + 5.5N$                                                                |
| Wn [18]           | Computation Cycle:                                                            |
| 2005              | $(22J + 6N(1 - \frac{1}{2^J}) + \frac{4}{3}N^2(1 - \frac{1}{4^J})) \cdot T_m$ |
|                   | Border Handling: NO                                                           |
|                   | Hardware Utilization: ≈100%                                                   |
|                   | Memory Size(bytes):                                                           |
|                   | $N^2$                                                                         |
| Andra             | Computation Cycle:                                                            |
| [13] 2002         | $T_d \cdot J + \frac{4}{3}N^2(1 - \frac{1}{4^J}) \cdot T_m$                   |
|                   | Border Handling: NO                                                           |
|                   | Hardware Utilization: ≈ 50%                                                   |
|                   | Memory Size(bytes):                                                           |
|                   | $N^2 / 4 + 9N$                                                                |
|                   | Computation Cycle:                                                            |
| This work         | $(9J+10N(1-\frac{1}{2^J})+\frac{4}{3}N^2(1-\frac{1}{4^J}))\cdot T_a$          |
|                   | Border Handling: YES                                                          |
|                   | Hardware Utilization: ≈100%                                                   |

Table 2 Comparison between previous works and this work (5/3 DWT architecture with dual input,  $T_m$ : multiplication time and  $T_a$ :addition time)

| Archite-<br>cture      | Performance                                                                                              |
|------------------------|----------------------------------------------------------------------------------------------------------|
|                        | Memory Size(bytes):<br>$N^2/4 + 5N$                                                                      |
| Chiang<br>[19]<br>2005 | Computation Cycle:<br>$N^2 \cdot T_m$                                                                    |
|                        | Border Handling: NO                                                                                      |
|                        | Hardware Utilization: $\approx 100\%$                                                                    |
|                        | Memory Size(bytes):<br>$N^2/2 + 4N$                                                                      |
| Andra<br>[20]<br>2002  | Computation Cycle:<br>$(\frac{N^2}{2} + 2N) \cdot T_m$                                                   |
|                        | Border Handling: NO                                                                                      |
|                        | Hardware Utilization: ≈100%                                                                              |
|                        | Memory Size(bytes):<br>$N^2/4+3.5N$                                                                      |
| This work              | Computation Cycle:<br>$(\frac{3}{2}N(1-\frac{1}{2^{J}})+\frac{2}{3}N^{2}(1-\frac{1}{4^{J}}))\cdot T_{a}$ |
|                        | Border Handling: YES                                                                                     |
|                        | Hardware Utilization: ≈100%                                                                              |



Fig. 1. The architecture of 9/7 1-D DWT based on lifting scheme( $P1=\alpha(1+z)$ ,  $U1=\beta(1+z^{-1})$ ,  $P2=\gamma(1+z)$  and  $U2=\delta(1+z^{-1})$ )



Fig. 2. The architecture of modified 9/7 2-D DWT based on lifting scheme



Fig. 3. The data flow of 5/3 1-D DWT based on lifting scheme



Fig. 4. The data flow of 5/3 2-D DWT based on lifting scheme



Fig. 5. The proposed reconfigurable architecture for 5/3 and 9/7 2-D DWT based on lifting scheme







Fig. 7 The architecture of vertical filter (VF) in the proposed reconfigurable architecture



PE(A/B/E)

Fig. 8 The architecture of PE(A/B/E) in the proposed reconfigurable architecture



Fig. 9 The architecture of PE(C/D/F) in the proposed reconfigurable architecture



Fig. 10 The architecture of scaling normalization (SN) in the proposed reconfigurable architecture



Fig. 11 The constant multiplier replaces conventional multiplier ( $\otimes$ ) in *PE*(*A*/*B*/*E*), *PE*(*C*/*D*/*F*) and SN



Fig. 12 The architectures of line delays *LD*, *LD1*, *LD2* and *LD3* in vertical filter



(a) (b) Fig. 13 512×512 Lena (a) Original image (b) Reconstructed image (5/3 DWT with 5-level)

### Folded Reconfigurable Architecture for VLSI Wavelet Filter

Tze-Yun Sung Department of Microelectronics Engineering Chung Hua University Hsinchu City 300-12, Taiwan bobsung@chu.edu.tw

Hsi-Chin Hsin Department of Computer Science and Information Engineering National United University Miaoli 36003, Taiwan hsin@nuu.edu.tw Sheng-Dong Zhang Department of Electrical Engineering National Central University Chungli City 320-01, Taiwan 985401004@cc.ncu.edu.tw

*Abstract:* - In this paper, the high-efficient and reconfigurable architectures for the 9/7-5/3 discrete wavelet transform (DWT) based on convolution scheme are proposed. The proposed parallel and pipelined architectures consist of a high-pass filter (HF) and a low-pass filter (LF). The critical paths of the proposed architectures are reduced. Filter coefficients of the biorthogonal 9/7-5/3 wavelet low-pass filter are quantized before implementation in the high-speed computation hardware. In the proposed architectures, all multiplications are performed using less shifts and additions. The proposed reconfigurable architecture is 100% hardware utilization and ultra low-power. The proposed reconfigurable architectures have regular structure, simple control flow, high throughput and high scalability. Thus, they are very suitable for new-generation image compression systems, such as JPEG-2000.

*Key-Words:* - Folded reconfigurable architecture, 9/7-5/3 discrete wavelet transform (DWT), high-pass filter (HF), low-pass filter (LF), convolution scheme.

### **1** Introduction

In the field of digital image processing, the JPEG-2000 standard uses the scalar wavelet transform for image compression [1]; hence, the two-dimensional (2-D) discrete wavelet transform (DWT) and IDWT has recently been used as a powerful tool for image coding/decoding systems. Two-dimensional DWT/IDWT demands massive computations, hence, it requires a parallel and pipelined architecture to perform real-time or on-line video and image coding and decoding, and to high-efficiency application-specific implement integrated circuits (ASIC) or field programmable gate array (FPGA). At the kernel of the compression stage of the system is the DWT.

Daubechies proposed using the JPEG2000 standard biorthogonal 9/7 wavelet based on convolution scheme for lossy compression [2]. The symmetry of the biorthogonal 9/7 filters and the fact that they are almost orthogonal [2] make them good candidates for image compression application. Le Gall proposed using the JPEG2000 standard biorthogonal 5/3 wavelet based on convolution scheme for lossless compression [2]. The goal of the proposed architectures is to embed the 5/3 DWT computation into the 9/7 DWT computation. The

The National Science Council of Taiwan, under Grant NSC98-2221-E-

coefficients of the filter are quantized before hardware implementation; hence, the multiplier can be replaced by limited quantity of shift registers and adders. Thus, the system hardware is saved, and the system throughput is improved significantly.

In this paper, we proposed a high-efficient architecture for the even and odd parts of 1-D DWT based on lifting scheme. The advantages of the proposed architectures are 100% hardwareutilization, multiplier-less, regular structure, simple control flow and high scalability.

The remainder of the paper is organized as follows. Section 2 presents the 2-D discrete wavelet transform algorithm, and derives new mathematical formulas. In Section 3, the high-efficient and reconfigurable architecture for the 2-D DWT are proposed. Finally, comparison of performance between the proposed reconfigurable architecture and previous works is made with conclusions given in section 4.

### 2 The 2-D DWT Algorithm

The 2-D DWT is a multilevel decomposition technique. The mathematical formulas of 2-D DWT are defined as follows [3]-[4]:

$$LL^{j}(m,n) = \sum_{i=0}^{K-1} \sum_{k=0}^{K-1} l(i) \cdot l(k) \cdot LL^{j-1}(2m-i,2n-k) \quad (1)$$

<sup>216-037</sup> and Chung Hua University, Taiwan under Grant NSC98-2221-E-2216-037, supported this work.

$$LH^{j}(m,n) = \sum_{i=0}^{K-1} \sum_{k=0}^{K-1} l(i) \cdot h(k) \cdot LL^{j-1}(2m-i,2n-k) \quad (2)$$

$$HL^{j}(m,n) = \sum_{i=0}^{K-1} \sum_{k=0}^{K-1} h(i) \cdot l(i) \cdot LL^{j-1}(2m-i,2n-k)$$
(3)

$$HH^{j}(m,n) = \sum_{i=0}^{K-1} \sum_{k=0}^{K-1} h(i) \cdot h(k) \cdot LL^{j-1}(2m-i,2n-k)$$
(4)

where  $0 \le n, m < N_j$ ,  $LL^0(m, n)$  is the input image, *K* denotes the length of filter, l(i) denote the impulse responses of the low-pass filter, and h(k) denote the impulse responses of the high-pass filter, which is developed from  $(K \times K)$  -tap filters, and  $LL^j(m,n)$ ,  $LH^j(m,n)$ ,  $HL^j(m,n)$ , and  $HH^j(m,n)$  denote respectively the coefficients of low-low, low-high, high-low and high-high subbands produced at the decomposition level *j* (also represented by  $LL^j$ ,  $LH^j$ ,  $HL^j$ , and  $HH^j$ ).  $N_i \times N_i$  denotes samples of  $LL^j$ .

According to the mathematical formulas (1), (2), (3) and (4), the decomposition is produced by four 2-D convolutions followed by the decimation both in the row and in the column dimension for each level. In the three-level analysis for 2-D DWT, the data set  $LL^{j-1}$  having  $N_{j-1} \times N_{j-1}$  samples is decomposed into four subbands  $LL^{j}$ ,  $LH^{j}$ ,  $HL^{j}$ , and  $HH^{j}$  each having  $N_{j} \times N_{j}$  (equals to  $(N_{j-1}/2) \times (N_{j-1}/2)$ ) samples.

Let  $LL_m^j(2n)$ , l(i)l(2k), l(i)h(2k), h(i)l(2k)and h(i)h(2k) be 1-D DWT consisting of the  $0 \le n \le N_i$ ; even-numbered samples, and  $0 \le k \le K/2$  .Moreover, let  $LL_m^j(2n+1)$ , l(i)l(2k+1), l(i)h(2k+1), h(i)l(2k+1) and h(i)h(2k+1) be 1-D DWT consisting of the odd-numbered samples, and  $0 \le k \le K / 2$  $0 \le n \le N_i$ ;

 $LL_{m,i}^{j}(n)$ ,  $LH_{m,i}^{j}(n)$ ,  $HL_{m,i}^{j}(n)$ , and  $HH_{m,i}^{j}(n)$  can be expressed as follows:

$$LL_{m,i}^{j}(n) = \sum_{k=0}^{\lceil K/2 \rceil - 1} l(i)l(2k) \cdot LL_{2m-i}^{j-1}(2n - 2k) + \sum_{k=0}^{K-\lceil K/2 \rceil - 1} l(i)l(2k+1) \cdot LL_{2m-i}^{j-1}(2n - 2k - 1)$$
(5)

$$LH_{m,i}^{j}(n) = \sum_{k=0}^{\lceil K/2 \rceil -1} l(i)h(2k) \cdot LL_{2m-i}^{j-1}(2n-2k)$$

$$+ \sum_{k=0}^{K - \lceil K/2 \rceil -1} l(i)h(2k+1) \cdot LL_{2m-i}^{j-1}(2n-2k-1)$$

$$HL_{m,i}^{j}(n) = \sum_{k=0}^{\lceil K/2 \rceil -1} h(i)l(2k) \cdot LL_{2m-i}^{j-1}(2n-2k)$$

$$+ \sum_{k=0}^{K - \lceil K/2 \rceil -1} h(i)l(2k+1) \cdot LL_{2m-i}^{j-1}(2n-2k-1)$$

$$HH_{m,i}^{j}(n) = \sum_{k=0}^{\lceil K/2 \rceil -1} h(i)h(2k) \cdot LL_{2m-i}^{j-1}(2n-2k)$$

$$+ \sum_{k=0}^{\lceil K/2 \rceil -1} h(i)h(2k+1) \cdot LL_{2m-i}^{j-1}(2n-2k-1)$$

$$(8)$$

The above equations imply that  $LL_{m,i}^{j}(n)$ ,  $LH_{m,i}^{j}(n)$ ,  $HL_{m,i}^{j}(n)$  and  $HH_{m,i}^{j}(n)$  can be computed as the sum of two 1-D convolutions performed independently on the even part  $LL_{2m-i}^{j-1}(2n-2k)$  and the odd part  $LL_{2m-i}^{j-1}(2n-2k-1)$ .

#### 2.1 The 2-D DWT Algorithm

The proposed architecture performs parallel and pipelined processing. Each analysis level involves two stages: stage 1 performs row filtering, and stage 2 performs column filtering. After stage 1, the input image of size is decomposed into two subbands (L and H) of size. And stage 1 performs result is stored in the memory. The L subbund inputted first of stage 2 performs. Second is H subbund. In a one-level filter bank for 2-D DWT computation. At the first level, the size of the input image is  $N \times N$ , and the size of the output of each of the three subbands LH, HL and *HH* is  $(N/2) \times (N/2)$ . At the second level, the input is the LL subband whose size is  $(N/2) \times (N/2)$ , and the size of the output of each of the three subbands LLLH, LLHL and LLHH is  $(N/4) \times (N/4)$ . At the third level, the input is the LLLL subband whose size is  $(N/4) \times (N/4)$ , and the size of the output of each of the four subbands LLLLLL, LLLLH, LLLLHL and LLLLHH is  $(N/8) \times (N/8)$ , and so on. Figure 1 shows 1-level 2-D DWT.

The coefficients of the low-pass filter and the high-pass filter have been derived in the biorthogonal 9/7 and 5/3 wavelet [2]. The 9/7 wavelet coefficients are quantized before hardware implementation. We assume that the low-pass filter has nine tapes:  $h_{9/7}(0)$ ,  $\pm h_{9/7}(1)$ ,  $\pm h_{9/7}(2)$ ,  $\pm h_{9/7}(3)$  and  $\pm h_{9/7}(4)$ , and the high-pass filter has seven tapes:

 $g_{9/7}(0)$ ,  $\pm g_{9/7}(1)$ ,  $\pm g_{9/7}(2)$  and  $\pm g_{9/7}(3)$ . The 5/3 wavelet coefficients are quantized before hardware implementation. We assume that the low-pass filter has five tapes:  $h_{5/3}(0)$ ,  $\pm h_{5/3}(1)$  and  $\pm h_{5/3}(2)$ , and the high-pass filter has three tapes:  $g_{5/3}(0)$ ,  $\pm g_{5/3}(0)$ .

### 3 The High-Efficient and Reconfigurable Architectures for 5/3 and 9/7 2-D DWT

The proposed reconfigurable architecture for 5/3 and 9/7 convolution based 2-D DWT including hing-pass filter (HF) and low-pass filter (LF) is shown in Figure 2[5][6]. In this reconfigurable architecture, the input architecture are show in Figure 3,4,5 and 6,the multiplexers architecture are show in Figure 7,the architecture of hing-pass filter (HF) and the architecture of low-pass filter (LF)are shown in Figure 8 and 9, respectively. The proposed reconfigurable architecture for hing-pass filter (HF) consists of one delay units, twenty-eight multiplexers, six adders (It doesn't include carrry save adder (CSA)) and four 9/7 wavelet coefficients processing (PEs).The elements proposed reconfigurable architecture for low-pass filter (LF) consists of 5+5N delay units, seventy-eight multiplexers, seven adders (It doesn't include carrry save adder (CSA)), five 9/7 wavelet coefficients and one 5/3 wavelet coefficients processing elements (PEs).Filter coefficients of the biorthogonal 9/7 and 5/3 wavelet low-pass filter are quantized before implementation in the high-speed computation hardware. In the proposed architecture, all multiplications are performed using shifts and additions after approximating the coefficients as a Booth binary recoded format (BBRF). The constant multiplier shown in Figure 10 consists of two carry-save-adders (CSA(4,2)), a Carry Lookahead Adder (CLA), and six hardwire shifters and replaces conventional multiplier ( $\otimes$ ) in processing elements (PEs). Figure 11 shows architectures of line delay (LD) for Vertical filter in 2-D DWT [7].

The proposed reconfigurable architectures for 9/7 and 5/3 DWT reduce the critical path [5][6]. In  $N \times N$  2-D DWT, it requires  $2J + \frac{19}{2}N(1-\frac{1}{2^{J}}) + \frac{4}{3}N^{2}(1-\frac{1}{4^{J}})$  computation

cycles (addition operations) with  $N^2 + (25/2)N + 14$ memories to perform 5/3 and 9/7 2-D DWT, where J is number of levels. Both of two architectures are 100% hardware utilization.

### 4 Conclusion

Filter coefficients are quantized before implementation using the biorthogonal 9/7 and 5/3 wavelet. The hardware is cost-effective and the system is high-speed. The proposed architecture in 9/7 DWT reduces power dissipation by m compared with conventional architectures in m-bit operand (low-power utilization).

Three standard images have been used for the test: "Lenna" 256×256 (I=1), "Barbara" "Boat" 512×512 (*I*=2)and  $512 \times 512$  (*I*=3). The number of wavelet decomposition levels (L) has been varied from 1 to 3 for 256×256 images and from 1 to 4 for 512×512 image. Table 1 shows the peak-signal-to-noise ratios (PSNRs) comparison among different 9/7 wavelet implementations for multiple images(I) and decomposition levels(L). A-Open-jpeg, B-Low-complexity, C-This work[5][6]. The proposed architecture in 9/7 and 5/3 DWT with 20-bit fixed point operations had been applied to  $512 \times 512$  original images Lena is shown in Figures 12(a) and 13(a) and the reconstructed images Lena is shown in Figure 12(b) and 13(b), respectively. The PSNRs of the reconstructed images Lena are 55.701dB and 42.625dB, respectively. Hence, the proposed reconfigurable architecture has been applied to image compression with great satisfaction. In this paper, the high-efficient and low-power reconfigurable architecture for 2-D DWT has been proposed. The proposed reconfigurable architecture compression performs in

$$(2J + \frac{19}{2}N(1 - \frac{1}{2^J}) + \frac{4}{3}N^2(1 - \frac{1}{4^J}))T_a$$
 computation

time for 9/7 DWT and 5/3 DWT, respectively. where the time unit (Ta is time of addition operation). The critical paths are  $2T_a$  for 9/7 and 5/3 DWT, and the output latency time are  $49T_a$  for 9/7 and 5/3 DWT. Buffer sizes are  $N^2 + (25/2)N + 14$  for 9/7 and 5/3 DWT, respectively. The control complexity is very simple. The comparisons between previous works [6] [8] and this work are shown in Table 2 for 9/7 DWT. The advantages of the proposed reconfigurable architecture are 100% hardware utilization and ultra low-power. The architecture has regular structure, simple control flow, high throughput and high scalability. Thus, it is very suitable for new-generation image compression systems, such as JPEG-2000. The proposed reconfigurable DWT is a reusable IP, which can be implemented in various processes and combined with an efficient use of hardware resources for the trade-offs of performance, area, and power consumption.

#### References:

- [1] ITU-T Recommendation T.800. JPEG2000 image coding system – Part 1, ITU Std., July 2002. http://www.itu.int/ITU-T/.
- [2] T. Acharya, P. S. Tsai, "JPEG2000 Standard for Image Compression," Wiley-Interscience, 2004.
- [3] F. Marino, "Two Fast Architectures for the Direct 2-D Discrete Wavelet Transform," *IEEE Transactions on Signal Processing*, Vol. 49, No. 6, June 2001, pp. 1248-1259.
- [4] M. Antonini, M. Barlaud, P. Mathieu, I. Daubechies, "Image Coding Using Wavelet Transform," IEEE Transactions on Image Processing, Vol.1, No.2, April 1992, pp. 205-220.
- [5] M. Martina and G. Masera, "Low-complexity, efficient 9/7 wavelet filters implementation," in Proceedings of the Intl. Conference on Image Processing (ICIP), Sept. 2005.
- [6] M. Martina and G. Masera, "Multiplierless, folded 9/7 5/3 wavelet VLSI architecture," IEEE Transactions on Circuits and Systems II, vol. 54, no. 9, pp. 770 – 774, Sept. 2007.
- [7] T. Y. Sung, Y. S. Shieh, C. W. Yu, H. C. Hsin, "Low-Power Multiplierless 2-D DWT and IDWT Architectures Using 4-tap Daubechies Filters,"Proceedings of the Seventh International Conference on Parallel and Distributed Computing, Applications and Technologies (2006-PDCAT), Taipei, Taiwan, 2006.
- [8] P. -C. Wu, C. -T. Liu, L. -G. Chen, "An Efficient Architecture for Two-Dimensional Inverse Discrete Wavelet Transform," *IEEE International Symposium on Circuits and Systems*, Vol. 2, May 2002, pp. II-312-II-315.

Table 1 Performance comparison among different 9/7 wavelet implementations for multiple images(*I*) and decomposition levels(L).

| Ι | L | A[dB] | <i>B</i> [dB] | <i>C</i> [dB] |
|---|---|-------|---------------|---------------|
| 1 | 1 | 49.41 | 49.35         | 65.27         |
|   | 2 | 49.16 | 48.79         | 59.27         |
|   | 3 | 49.26 | 48.48         | 55.77         |
| 2 | 1 | 49.57 | 49.48         | 66.01         |
|   | 2 | 49.25 | 49.07         | 60.01         |
|   | 3 | 49.21 | 48.80         | 56.50         |
|   | 4 | 49.21 | 48.53         | 56.50         |
| 3 | 1 | 49.42 | 49.45         | 64.42         |
|   | 2 | 49.11 | 48.92         | 58.41         |
| 5 | 3 | 49.06 | 48.72         | 54.90         |
|   | 4 | 49.06 | 48.58         | 54.90         |

Table 2 Comparison between previous works and this work (9/7 DWT architecture with two input,  $T_a$ :addition time and K : filter length)

| Architecture   | Performance                                                    |  |  |
|----------------|----------------------------------------------------------------|--|--|
|                | Multiplier : 4K                                                |  |  |
|                | adder : 4 <i>K</i>                                             |  |  |
| W/11 [8]       | Memory Size:                                                   |  |  |
| 2005           | $N^2/4 + KN + K$                                               |  |  |
| (Dual-input)   | Computation Cycle:                                             |  |  |
|                | $\approx (2N^2/3) \cdot T_m$                                   |  |  |
|                | Reconfigurable architecture: NO                                |  |  |
|                | adder : 19                                                     |  |  |
|                | Memory Size:                                                   |  |  |
| Martina [6]    |                                                                |  |  |
| 2007           | Computation Cycle:                                             |  |  |
| (Single-input) | $(16N(1-2^{-J})+2N^{2}(1-2^{-2J}))\cdot T_{m}$                 |  |  |
|                | Reconfigurable architecture: YES                               |  |  |
|                | adder : 33                                                     |  |  |
|                | Memory Size:                                                   |  |  |
| This work      | $N^2 + (25/2)N + 14$                                           |  |  |
| (Single-input) | Computation Cycle:                                             |  |  |
|                | $(2J + (19/2)N(1 - 2^{-J}) + (4/3)N^2(1 - 2^{-2J})) \cdot T_a$ |  |  |
|                | Reconfigurable architecture: YES                               |  |  |



Fig. 1. 1-level 2-D DWT



Fig. 2. The proposed reconfigurable architecture for 9/7 and 5/3 2-D DWT based on lifting scheme



Fig. 3. The inputs architecture of horizontal filter



Fig. 4. The inputs architecture of vertical filter



Fig. 5. The inputs architecture of high-pass filter (HF)



Fig. 6. The inputs architecture of low-pass filter (LF)



Fig. 8. The architecture of high-pass filter (HF)



Fig. 10 The constant multiplier replaces conventional multiplier ( $\otimes$ ) in processing elements (PEs)



Fig. 9. The inputs architecture of low-pass filter (LF)



Fig. 11. Line delay(LD) for Vertical filter in 2-D DWT



Fig. 7. The architecture of multiplexers (Mux) (a)~(f) for 1-D or 2-D DWT.



(a) (b) Fig. 12 512×512 Lena (a) Original image (b) Reconstructed image (9/7 DWT with 4-level)



(a) (b) Fig. 13 512×512 Lena (a) Original image (b) Reconstructed image (5/3 DWT with 4-level)

### A Novel Linear Array for Discrete Cosine Transform

Yaw-Shih Shieh Department of Microelectronics Engineering Chung Hua University Hsinchu City 300-12, Taiwan ysdaniel@chu.edu.tw

Tze-Yun Sung Department of Microelectronics Engineering Chung Hua University Hsinchu City 300-12, Taiwan bobsung@chu.edu.tw Hsi-Chin Hsin Department of Computer Science and Information Engineering National United University Miaoli 360-03, Taiwan hsin@nuu.edu.tw

Abstract: - Discrete cosine transform (DCT) and inverse DCT (IDCT) have been widely used in many image processing systems. In this paper, a novel linear-array of DCT and IDCT is derived from the data flow of subband decompositions representing the factorized coefficient matrices in the matrix formulation of the recursive algorithm. For increasing the throughput as well as decreasing the hardware cost, the input and output data are reordered. The proposed 8-point DCT/IDCT processor with four multipliers, simple adders, and less registers and ROM storing the immediate results and coefficients, respectively, has been implemented on FPGA. The linear-array DCT/IDCT processor with the computation complexity O(5N/8) and hardware complexity O(N/2) is fully pipelined and scalable for variable length DCT/IDCT computations.

Key-Words: - DCT/IDCT, subband decomposition, linear-array, pipelined, scalable.

### **1** Introduction

Discrete cosine transform (DCT) is one of the major operations in various image/video compression standards [1]. Though fast Fourier transform (FFT) can be used to implement DCT, it requires complex-valued computations; and moreover, *N*-point DCT by FFT contains  $O(\log 2N + 1)$  stages. The conventional DCT architectures using distributed arithmetic involve complex hardware with a great number of registers [2-6]. Other commonly used DCT architectures with matrix formulation and distributed memory [7-8] are however not suited for VLSI implementation because the hardware complex is proportional to the length of DCT, which leads to the scalability problem of variable length DCT computations. In this paper, we propose a novel linear-array architecture for scalable DCT/IDCT implementation.

The remainder of this paper proceeds as follows. In section 2, we propose the fast DCT/IDCT algorithm based on subband decomposition. In section 3, a programmable and reconfigurable FPGA-based implementation with low hardware cost is proposed for the fast DCT/IDCT computation. The performance comparison with conclusions can be found in section 4.

### 2 Fast DCT/IDCT Algorithm

For a *N*-point signal, x[n], the discrete cosine transform (DCT) [1] is defined as

$$C[k] = \alpha[k] \sum_{n=0}^{N-1} x[n] \cos\left[\frac{(2n+1)k\pi}{2N}\right]$$
(1)

where k = 0,..., N-1,  $\alpha[0] = 1/\sqrt{N}$ , and  $\alpha[k] = \sqrt{2/N}$  for k > 0. Let  $x_L[n]$  and  $x_H[n]$  denote the low-frequency and high-frequency subband signals of x[n], respectively, which are defined as

$$x_{L}[n] = \frac{1}{2} \{ x[2n] + x[2n+1] \}$$
(2)

$$x_{H}[n] = \frac{1}{2} \{x[2n] - x[2n+1]\}$$
(3)

where  $n = 0, 1, 2, \dots, (N/2) - 1$ .

As one can see, the DCT of x[n] can be rewritten as

 $S_H[k]$ 

$$C[k] = \sum_{n=0}^{(N/2)^{-1}} \alpha[k] x[2n] \cos(\frac{(4n+1)k\pi}{2N}) + \sum_{n=0}^{(N/2)^{-1}} \alpha[k] x[2n+1] \cos(\frac{(4n+3)k\pi}{2N}) = 2\cos(\frac{\pi k}{2N}) \sum_{n=0}^{(N/2)^{-1}} \alpha[k] x_L[n] \cos(\frac{(2n+1)k\pi}{N}) - \sum_{L[k]} \alpha[k] x_H[n] \sin(\frac{(2n+1)k\pi}{N})$$
(4)

The National Science Council of Taiwan, under Grant NSC98-2221-E-216-037, and the Chung Hua University, Taiwan, under Grant CHU- NSC98-2221-E-216-037 supported this work.

where  $C_L[k]$  and  $S_H[k]$  are the subband DCT and DST (discrete sine transform) of x[n], respectively.

# 2.1 Fast DCT Algorithm Based on Subband Decomposition

Without loss of generality, the 8-point fast DCT algorithm based on subband decomposition is proposed for the widely used JPEG and MPEG-1/2 standards, which can be easily extended to variable length DCT computations. The vector form of 8-point DCT can be written as

$$\boldsymbol{C}_{8} = [\boldsymbol{T}_{SB\_DCT,8} \quad \boldsymbol{T}_{SB\_DST,8}]_{8\times8} \cdot \begin{bmatrix} \boldsymbol{x}_{L} \\ \boldsymbol{x}_{H} \end{bmatrix}_{8\times1}$$
(5)

where  $C_8 = [C[0] \cdots C[7]]^T$ ,  $x_L = [x_L[0] \cdots x_L[3]]^T$ ,  $x_H = [x_H[0] \cdots x_H[3]]^T$ , and  $T_{SB_DCT,8}$  and  $T_{SB_DST,8}$ denote the 8×4 matrices of subband DCT and subband DST, respectively, which can form orthonormal bases for the two orthogonal subspaces of  $\mathbf{R}^8$ .

The data flow of computing the 2-point subband DCT:  $C_{LL,2}$  and subband DST:  $C_{LH,2}$  for the 8-point DCT is shown in Fig. 1. As one can see, data flow of computing  $C_{HL,2}$  and  $C_{HH,2}$  can be obtained in a similar way, and therefore is not shown in Fig. 1. All of the 2-point subband DCT and DST are given by

$$\boldsymbol{C}_{LL,2} = [\boldsymbol{T}_{SB\_DCT,2} \quad \boldsymbol{T}_{SB\_DST,2}]_{2\times 2} \cdot \begin{bmatrix} \boldsymbol{x}_{LLL} \\ \boldsymbol{x}_{LLH} \end{bmatrix}_{2\times 1}$$
(6)

$$\boldsymbol{C}_{LH,2} = [\boldsymbol{T}_{SB\_DCT,2} \quad \boldsymbol{T}_{SB\_DST,2}]_{2\times 2} \cdot \begin{bmatrix} \boldsymbol{x}_{LHL} \\ \boldsymbol{x}_{LHH} \end{bmatrix}_{2\times 1}$$
(7)

$$\boldsymbol{C}_{HL,2} = [\boldsymbol{T}_{SB\_DCT,2} \quad \boldsymbol{T}_{SB\_DST,2}]_{2\times 2} \cdot \begin{bmatrix} \boldsymbol{x}_{HLL} \\ \boldsymbol{x}_{HLH} \end{bmatrix}_{2\times 1}$$
(8)

$$\boldsymbol{C}_{HH,2} = [\boldsymbol{T}_{SB\_DCT,2} \quad \boldsymbol{T}_{SB\_DST,2}]_{2\times 2} \cdot \begin{bmatrix} \boldsymbol{x}_{HHL} \\ \boldsymbol{x}_{HHH} \end{bmatrix}_{2\times 1}$$
(9)

Thus, we have  $\begin{bmatrix} C \end{bmatrix}$ 

$$\begin{vmatrix} \mathbf{C}_{LL,2} \\ \mathbf{C}_{LH,2} \\ \mathbf{C}_{HL,2} \\ \mathbf{C}_{HH,2} \end{vmatrix} = \mathbf{R}_8 \cdot \mathbf{x}_8$$
(10)

where  $\boldsymbol{x}_8 = [x[0] \cdots x[7]]^T$  is the original signal, and

Similarly, we have the following

$$\boldsymbol{C}_{L,4} = [\boldsymbol{T}_{SB\_DCT,4} \quad \boldsymbol{T}_{SB\_DST,4}]_{4\times 4} \cdot \begin{bmatrix} \boldsymbol{x}_{LL,2} \\ \boldsymbol{x}_{LH,2} \end{bmatrix}_{4\times 1}$$
(12)

$$\boldsymbol{C}_{H,4} = [\boldsymbol{T}_{SB\_DCT,4} \quad \boldsymbol{T}_{SB\_DST,4}]_{4\times 4} \cdot \begin{bmatrix} \boldsymbol{x}_{HL,2} \\ \boldsymbol{x}_{HH,2} \end{bmatrix}_{4\times 1}$$
(13)

Fig. 2 depicts the relationship between  $\hat{C}_{LL,4}$  and  $C_{LL,2}$ , which can be obtained by the following:

$$\hat{\boldsymbol{C}}_{LL,4} = \boldsymbol{T}_{SB\_DCT,4} \cdot \boldsymbol{x}_{LL,2}$$
(14)

$$\boldsymbol{C}_{LL,2} = \boldsymbol{T}_2 \cdot \boldsymbol{x}_{LL,2} \tag{15}$$

where  $T_2$  is the 2×2 transform matrix of the conventional 2-point DCT. Hence, equation (14) can be rewritten as

$$\hat{\boldsymbol{C}}_{LL,4} = \boldsymbol{T}_{SB\_DCT,4} \cdot \boldsymbol{T}_2^{-1} \cdot \boldsymbol{C}_{LL,2}$$
(16)

The relationship between  $S_{LH,4}$  and  $C_{LH,2}$  shown in Fig. 3 is based on the following.

$$\hat{\boldsymbol{S}}_{LH,4} = \boldsymbol{T}_{SB\_DST,4} \cdot \boldsymbol{x}_{LH,2}$$
(17)

$$\boldsymbol{C}_{LH,2} = \boldsymbol{T}_2 \cdot \boldsymbol{x}_{LH,2} \tag{18}$$

Thus, we have

$$\hat{\boldsymbol{S}}_{LH,4} = \boldsymbol{T}_{SB_{-}DST,4} \cdot \boldsymbol{T}_{2}^{-1} \cdot \boldsymbol{C}_{LH,2}$$
(19)

Similarly, based on equation (5) and the following equations:

$$\boldsymbol{C}_{L,4} = \boldsymbol{T}_4 \cdot \boldsymbol{x}_{L,4} \tag{20}$$

$$\boldsymbol{C}_{H,4} = \boldsymbol{T}_4 \cdot \boldsymbol{x}_{H,4} \tag{21}$$

where  $T_4$  is the 4×4 transform matrix of the conventional 4-point DCT. We have

$$\hat{\boldsymbol{C}}_{L,8} = \boldsymbol{T}_{SB\_DCT,8} \cdot \boldsymbol{x}_{L,4}$$

$$= \boldsymbol{T}_{SB\_DCT,8} \cdot \boldsymbol{T}_{4}^{-1} \cdot \boldsymbol{C}_{L,4}$$
(22)

$$\hat{\boldsymbol{C}}_{H,8} = \boldsymbol{T}_{SB\_DST,8} \cdot \boldsymbol{x}_{H,4}$$

$$= \boldsymbol{T}_{SB\_DST,8} \cdot \boldsymbol{T}_{4}^{-1} \cdot \boldsymbol{C}_{H,4}$$
(23)

Fig.4 depicts data flow of computing  $C_{L,4}$  and  $C_{H,4}$  using 4-point subband DCT and DST. Fig. 5 depicts data flow of computing  $\hat{C}_{L,8}$  and  $C_{L,4}$  based on subband decomposition. Data flow of computing  $\hat{S}_{H,8}$  and  $C_{H,4}$  based on subband decomposition is shown in Fig. 6. Data flow of computing  $C_8$  using 8-point subband DCT and DST is shown in Fig. 7. In other words,  $C_8$  can be obtained by

$$\boldsymbol{C}_{8} = \hat{\boldsymbol{C}}_{L,8} + \hat{\boldsymbol{S}}_{H,8} \tag{24}$$

Base on eqs. (12), (13), (16), (19), (22) and (23), we have

$$C_{8} = F_{8} \cdot [C_{LL,2}^{T} \quad C_{LH,2}^{T} \quad C_{HL,2}^{T} \quad C_{HH,2}^{T}]^{T}$$
(25)  
where

$$\boldsymbol{F}_{8} = \begin{bmatrix} \boldsymbol{K}_{3} & \boldsymbol{K}_{4} \end{bmatrix}_{8\times8} \cdot \begin{bmatrix} \begin{bmatrix} \boldsymbol{K}_{1} & \boldsymbol{K}_{2} \end{bmatrix}_{4\times4} & \boldsymbol{0} \\ \boldsymbol{0} & \begin{bmatrix} \boldsymbol{K}_{1} & \boldsymbol{K}_{2} \end{bmatrix}_{4\times4} \end{bmatrix}_{8\times8}$$
(26)

$$\boldsymbol{K}_{1} = \begin{vmatrix} 1.4142 & 0 \\ 0 & 1.3066 \\ 0 & 0 \\ 0 & -0.5412 \end{vmatrix}$$
(27)

$$\boldsymbol{K}_{2} = \begin{bmatrix} 0 & 0 \\ 0.5412 & 0 \\ 0 & 1.4142 \\ 1.3066 & 0 \end{bmatrix}$$
(28)

~

$$\mathbf{K}_{4} = \begin{bmatrix} 1.3066 & 0 & 0 \\ 1.412 & 0 & 0 & 0 \\ 0 & 1.3870 & 0 & 0 \\ 0 & 0 & 1.3066 & 0 \\ 0 & 0 & 0 & 1.1759 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -0.7857 \\ 0 & 0 & -0.5412 & 0 \\ 0 & -0.2759 & 0 & 0 \end{bmatrix}$$
(29)  
$$\mathbf{K}_{4} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -0.1056 & 0 \\ 0 & 0.5 & 0 & -0.2071 \\ 0.3007 & 0 & 0.7259 & 0 \\ 0 & 0.5412 & 0 & 1.3066 \\ 0.4500 & 0 & 1.0864 & 0 \\ 0 & 1.2071 & 0 & -0.5 \end{bmatrix}$$
(30)

According to equations  $(27) \sim (30)$ , we have

0

1.2815

$$F_8 = \begin{bmatrix} 2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1.8123 & 0.7507 & 0 & 0.3605 & 0 & 0 & -0.1493 \\ 0 & 0 & 0 & 1.8478 & 0 & 0.7654 & 0 & 0 \\ 0 & -0.6364 & 1.5364 & 0 & 0.4252 & 0 & 0 & 1.0266 \\ 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 \\ 0 & 0.4252 & -1.0266 & 0 & 0.6364 & 0 & 0 & 1.5364 \\ 0 & 0 & 0 & -0.7654 & 0 & 1.8478 & 0 & 0 \\ 0 & -0.3605 & -0.1493 & 0 & 1.8123 & 0 & 0 & -0.7507 \end{bmatrix}$$
(31)

-0.5308

0

Finally, the proposed 8-point DCT computation based on subband decomposition is as follows:

$$\boldsymbol{C}_8 = \hat{\boldsymbol{F}}_8 \cdot \boldsymbol{R}_8 \cdot \boldsymbol{x}_8 \tag{32}$$
 where

1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 (33) 0 0 0.9239 0.3827 0 0 0 0 0 0 -0.3827 0.9239 0 0 0 0  $\hat{F}_8 = 2$ . 0 0 0 0 0.9062 0.3754 0.1802 -0.0746 0 0 0 0 -0.1802 -0.0746 0.9062 -0.3754 0 0 0 0 -0.3182 0.7682 0.2126 0.5133 0 0 0 0 0.2126 -0.5133 0.3182 0.7682

Fig. 8 shows block diagram of the proposed DCT

computation; one of the advantages is that  $R_8$  is orthogonal, and all of the sub-matrices of  $\hat{F}_8$  are orthonormal.

#### 2.2 Fast IDCT Algorithm Based on Subband Decomposition

According to eq. (31), IDCT can be obtained by

$$\boldsymbol{x}_{8} = \boldsymbol{R}_{8}^{-1} \cdot \hat{\boldsymbol{F}}_{8}^{-1} \cdot \boldsymbol{C}_{8}$$
 (34)  
where

As  $R_8$  is orthogonal and all of the sub-matrices of

 $\hat{F}_8$  are orthonormal, the inverse of  $R_8$  and  $\hat{F}_8$  can be obtained easily. In addition, it takes only twenty multiplication operations for both DCT and IDCT.

### **3** A Linear Array for DCT and IDCT

Based on the proposed approach to fast DCT computation shown in Fig. 8, an efficient architecture for implementing the fast DCT/IDCT processor is thus presented in this section. Recall that the DCT of a signal,  $x_8$ , can be efficiently obtained by  $C_8 = \hat{F}_8 \cdot R_8 \cdot x_8$ . Let  $y_8 = R_8 \cdot x_8$ , then we have  $C_8 = \hat{F}_8 \cdot y_8$ . The matrix-vector multiplication of  $\mathbf{R}_8 \cdot \mathbf{x}_8$ , in which six CSA(3,2)s (carry-save-adder (3,2)) and one CLA (carry-look-ahead-adder) [9-10] are utilized, and therefore four simple-addition time and one CLA computation time is required to compute each element of  $y_8$ . The multiplier-array (MA) consisted of four multipliers and the CLA-array (CA) consisted of eight CLAs, respectively, which are used to compute the matrix-vector computation of  $\hat{F}_8 \cdot y_8$ ; thus, only one multiplication time with one CLA computation time

is needed to compute each element of  $C_8$ , i.e. the DCT coefficient. Fig. 12 depicts data flow of the proposed fast DCT processor with pipelined linear-array architecture [11]. As a result, only five multiplication cycles with five addition cycles are needed to compute 8-point DCT. In general, for *N*-point DCT, the computation time and hardware complexity of the proposed fast DCT processor are O(5N/8) and O(N/2), respectively.

Figure 13 shows data flow of the proposed fast IDCT algorithm [11], where  $C_8$  is the DCT of an 8-point signal  $\boldsymbol{x}_8$ ;  $\boldsymbol{z}_8 = \hat{\boldsymbol{F}}_8^{-1} \cdot \boldsymbol{C}_8$ , and  $\boldsymbol{x}_8 = \boldsymbol{R}_8^{-1} \cdot \boldsymbol{z}_8$ . The so-called full-CSA(4,2) (FCSA(4,2)) consisted of two CSA(3,2) and one CLA for the computation of  $z_8$  [21-22]. It is noted that the CLA-array consisted of eight CLAs can also be used for the computation of  $x_8$ . As shown in Fig. 13, only five multiplication cycles with three addition cycles are needed to compute 8-point IDCT. As one can see, the computation time and hardware complexity of the proposed fast IDCT architecture are the same as that of the proposed fast DCT architecture. In addition, only 16-word RAM/registers and 10-word ROM are required to store the intermediate results and constants, respectively; and the latency time is only 5-multiplication-cycle.

Fig. 15 shows system block diagram of the proposed fast DCT/IDCT architecture. The platform for architecture development and verification has been designed as well as implemented in order to evaluate the development cost. It is noted that the throughput can be improved by using the proposed architecture while the computation accuracy is the same as that obtained by using the conventional one with the same word length. Thus, the proposed programmable DCT/IDCT architecture is able to improve the power consumption and computation speed significantly. The proposed DCT/IDCT processor used to compute 8/16/32/64 -point DCT/IDCT are composed mainly of the 8-point DCT/IDCT core; the computation complexity using a single 8-point DCT/IDCT core is O(5N/8) for extending N-point DCT/ IDCT computation. Moreover, the reusable intellectual property (IP) DCT/IDCT core has also been implemented in Matlab<sup>®</sup> for functional simulations. The hardware code written in Verilog<sup>®</sup> is running on a workstation with the ModelSim<sup>®</sup> simulation tool and Xilinx<sup>®</sup> ISE smart compiler.

### 4 Conclusion

By taking advantage of subband decomposition, a high-efficiency architecture with pipelined structures is proposed for fast DCT/IDCT computation. Specifically, the proposed DCT/IDCT architecture not only improves throughput by more than two times that of the conventional architectures [2-6], but also saves memory space significantly [1]. Table 1 comparisons between shows the proposed architecture and the conventional architectures [2-6]. Table 2 shows comparisons with other commonly used architectures [1], [7-8]. In addition, the proposed fast DCT/IDCT architecture is highly regular, scalable, and flexible. The DCT/IDCT processor designed by using the portable and reusable Verilog<sup>®</sup> is a reusable IP, which can be implemented in various processes; combined with efficient use of hardware resources for trade-offs of performance, area and power consumption; and therefore is much suited to the JPEG and MPEG-1/2 applications.

References:

- [1] T. Y. Sung, "Memory-Efficient and high-performance 2-D DCT and IDCT processors based on CORDIC rotation", WSEAS Trans. Electronics, Issue 12, Vol. 3, Dec. 2006, pp.565-574.
- [2] Y. H. Hu, Z. Wu, "An efficient CORDIC array structure for the implementation of discrete cosine transform, *IEEE Trans. Signal Processing*, No.1, Vol.43, Jan. 1995, pp.331-.336.
- [3] H. Jeong, J. Kim, W. K. Cho, "Low-power multiplierless DCT architecture using image data correlation", *IEEE Trans. Consumer Electronics*, No.1, Vol.50, Feb. 2004, pp.262-267.
- [4] D. Gong, Y. He, Z. Gao, "New cost-effective VLSI implementation of a 2-discrete cosine Transform and its inverse", *IEEE Trans. Circuits Syst. for Video Technology*, vol. 14, no. 4, April 2004, pp. 405-415.
- [5] V. Dimitrov, K. Wahid, G. Jullien, "Multiplication-free 8×8 2D DCT architecture using Algebraic integer encoding", *Electronics Letters*, No.20, Vol.40, Sept. 2004, pp.1310-1311.
- [6] M. Alam, W. Badawy, G. Jullien, "A new time distributed DCT architecture for MPEG-4 hardware reference model", *IEEE Trans. Circuits Syst. for Video Technology*, No.5, Vol.15, May 2005, pp.726-730.
- [7] S. F. Hsiao, J. M. Tseng, "New matrix formulation for two-dimensional DCT/IDCT computation and its distributed-memory VLSI

implementation", *IEE Proc.-Vis. Image Signal Process*, No. 2, Vol. 149, April 2002, pp.97-107.

- [8] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine transform", *IEEE Trans. Acoust., Speech, Signal Processing*, No.10, Vol. ASSP-35, Oct. 1987, pp.1455-1461.
- [9] I. Koren, Computer arithmetic algorithm, Second edition, A. K. Peters, Natick, MA, 2002, Chapter 5.
- [10] T. Y. Sung, H. C. Hsin, "Design and simulation of reusable IP CORDIC core for special-purpose processors," *IET Computers & Digital Techniques*, Vol.1, No.5, Sept. 2007, pp.581-589.
- [11] G. H. Golub, C. F. Van Loan, Matrix computations, The John Hopkins University Press, 1996, Chapter 6. Parallel matrix computations, pp.275-307.

| 8-point<br>DCT/IDCT    | The conventional architectures                                      | The proposed<br>high- efficient<br>architecture |
|------------------------|---------------------------------------------------------------------|-------------------------------------------------|
| 200,200                | The parallel<br>architectures with<br>single memory-bank<br>[2]-[6] | This work<br>[Shieh, Sung,<br>Hsin, 2010]       |
| Processors             | 8                                                                   |                                                 |
| Real-multipliers       | 16                                                                  | 4                                               |
| Real-Adders            | 18                                                                  | 26                                              |
| RAM (Registers)        | 64                                                                  | 16                                              |
| ROM                    | 6                                                                   | 10                                              |
| Hardware complexity    | $O(N - \log_2 N + 1)$                                               | O(N/2)                                          |
| Computation complexity | O(2N)                                                               | O(5N/8)                                         |
| Latency                | 16                                                                  | 5                                               |
| Pipelinability         | no                                                                  | yes                                             |
| Scalability            | poor                                                                | better                                          |
| Power consumption      | poor                                                                | better                                          |



Fig. 1 Data flow of computing the 2-point subband DCT

| Table 2 Comparisons of  | the proposed architecture |
|-------------------------|---------------------------|
| and other commonly used | 1 architectures           |

| 8-point<br>DCT/IDCT | Hsiao [7]     | Hsiao [8]    | Shieh, Sung,<br>Hsin, 2010 |
|---------------------|---------------|--------------|----------------------------|
| Real-multipliers    | -             | -            | 4                          |
| CORDIC              | -             | 3            | -                          |
| Real-adders         | 10            | 14           | 26                         |
| Complex-Multipli    | 3             |              | -                          |
| ers                 |               |              |                            |
| Delay elements      | 171           | -            | -                          |
| (Words)             |               |              |                            |
| Hardware            | $O(\log N)$   | $O(\log N)$  | O(N/2)                     |
| complexity          |               |              |                            |
| Computation         | $O(N \log N)$ | $O(N\log N)$ | O(5N/8)                    |
| complexity          |               |              |                            |
| Pipelinability      | no            | yes          | yes                        |
| Scalability         | good          | good         | better                     |



Fig. 2 Data flow of computing  $\overline{C_{LL,4}}$  and  $\overline{C_{LL,2}}$  based



Fig. 3 Data flow of computing  $C_{LH,2}$  and  $\hat{S}_{LH,4}$  based on subband decomposition.



Fig. 4 Data flow of computing  $C_{L,4}$  and  $C_{H,4}$  using 4-point subband DCT and DST



Fig. 5 Data flow of computing  $\hat{C}_{L,8}$  and  $\hat{C}_{L,4}$  based on subband decomposition



Fig. 6 Data flow of computing  $\hat{S}_{H,8}$  and  $C_{H,4}$  based on subband decomposition



Fig. 7 Data flow of computing  $C_8$  using 8-point subband DCT and DST



Fig. 8 Block diagram of the proposed (8-point) fast DCT algorithm based on subband decomposition

C[n]

| Cycles | FA   | МА                                              | CA                         |
|--------|------|-------------------------------------------------|----------------------------|
| Add1   | y[0] |                                                 | <i>C</i> [0]               |
| Add2   | y[1] |                                                 | C[1]                       |
| Add3   | y[2] |                                                 |                            |
| Mul1   | y[3] | y[2]·0.9239, y[2]·(-0.3827)                     |                            |
|        |      | $y[3] \cdot 0.3827$ , $y[3] \cdot (0.9239)$     |                            |
| Add4   | y[4] |                                                 | <i>C</i> [2], <i>C</i> [3] |
| Mul2   | y[5] | $y[4] \cdot 0.9062$ , $y[4] \cdot (-0.1802)$ ,  |                            |
|        |      | $y[4] \cdot (-0.3182)$ , $y[4] \cdot 0.2126$    |                            |
| Mul3   | y[6] | $y[5] \cdot 0.3754$ , $y[5] \cdot (-0.0746)$ ,  |                            |
|        |      | y[5]·0.7682, y[5]·0.5133                        |                            |
| Mul4   | y[7] | $y[6] \cdot 0.1802$ , $y[6] \cdot 0.9062$ ,     |                            |
| —      |      | y[6] · 0.2126 , y[6] · 0.3182                   |                            |
| Mul5   |      | $y[7] \cdot (-0.0746) , y[7] \cdot (-0.3754) ,$ |                            |
|        |      | y[7]·0.5133, y[7]·0.7682                        |                            |
| Add5   |      |                                                 | C[4], C[5],                |
|        |      |                                                 | C[6], C[7]                 |

Fig.9 Data flow of the proposed fast DCT processor with pipelined linear-array architecture (Add.\_: addition-cycle, Mul.\_: multiplication-cycle)

| Cycles | MA                                             | FCSA         | CA          |
|--------|------------------------------------------------|--------------|-------------|
| Mul1   | $C[2] \cdot 0.9239$ , $C[3] \cdot (-0.3827)$   | z[0], z[1]   |             |
|        | $C[2] \cdot 0.3827$ , $C[3] \cdot 0.92393$     |              |             |
| Mul2   | $C[4] \cdot 0.9062$ , $C[5] \cdot (-0.1802)$ , | z[2], z[3]   | C_0+C_1     |
|        | $C[6] \cdot (-0.3182), C[7] \cdot 0.2126$      |              | =C_01       |
| Mul3   | $C[4] \cdot 0.3754$ , $C[5] \cdot 0.3754$ ,    | <i>z</i> [4] | C_01+C_2    |
|        | $C[6] \cdot 0.7682$ , $C[7] \cdot (-0.5133)$   |              | =C_02       |
| Mul4   | $C[4] \cdot (-0.3182), C[5] \cdot 0.7682,$     | z[5]         | C_02+C_3    |
|        | $C[6] \cdot 0.2126$ , $C[7] \cdot 0.5144$      |              | =C_03       |
| Mul5   | $C[4] \cdot 0.2126$ , $C[5] \cdot (-0.5133)$ , | <i>z</i> [6] | C_03+C_4    |
|        | $C[6] \cdot 0.3182$ , $C[7] \cdot 0.7682$      |              | =C_04       |
| Add1   |                                                | <i>z</i> [7] | C_04+C_5    |
|        |                                                |              | =C_05       |
| Add2   |                                                |              | C_05+C_6    |
|        |                                                |              | =C_06       |
| Add3   |                                                |              | C_06+C_7    |
|        |                                                |              | =C_07       |
|        |                                                |              | x[0], x[1], |
|        |                                                |              | x[2], x[3], |
|        |                                                |              | x[4], x[5], |
|        |                                                |              | x[6], x[7]  |

Fig. 10 Data flow of the proposed fast IDCT processor with pipelined linear-array architecture



Fig. 11 System block diagram of the proposed DCT/IDCT architecture

國科會補助計畫衍生研發成果推廣資料表

日期 2010年09月02日

| 國科會補助計畫          | 計畫名稱:座標旋轉原理演算法應用於二維及三維特殊信號處理器之晶片設計與製作<br>(I)<br>計畫主持人:宋志雲                                                                                                                                                                                                                                                          |                                                                                                                                                                                                        |                                                                                                                                                                                                                                                                                                                                                         |  |
|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
|                  | 計畫編號:98 -2221-E -216 -037 - 學門領域:積體電路及系統設計                                                                                                                                                                                                                                                                         |                                                                                                                                                                                                        |                                                                                                                                                                                                                                                                                                                                                         |  |
|                  | (中文) 高無寄生動態範圍及無乘法                                                                                                                                                                                                                                                                                                  | 去器之直接數位頻                                                                                                                                                                                               | 率合成器                                                                                                                                                                                                                                                                                                                                                    |  |
| 研發成果名稱           | (英文)High-SFDR and Multiplie                                                                                                                                                                                                                                                                                        | erless Direct Di                                                                                                                                                                                       | igital Frequency Synthesizer                                                                                                                                                                                                                                                                                                                            |  |
| 成果歸屬機構           | 中華大學                                                                                                                                                                                                                                                                                                               | 發明人<br>(創作人)                                                                                                                                                                                           | 宋志雲                                                                                                                                                                                                                                                                                                                                                     |  |
| 技術説明             | (中文)使用混合座標旋轉原理設計<br>乘法器,包含小量之唯獨記<br>生動態範圍超過84.4 dBc。<br>Xilinx陣列處理器上實體模<br>位頻率合成器適合由超大型<br>動態範圍上都有具備優勢。<br>本合成器於高頻條件下,有<br>直接數位頻率合成器,其有                                                                                                                                                                            | 及製作直接數位對<br>「                                                                                                                                                                                          | 順率合成器。此一設計之架構為無<br>。)以及疊流資料路徑,所產生無寄<br>電 1P6M CMOS製程設計,並且在<br>昆合座標旋轉原理為基礎之直接數<br>在硬體成本,功率消耗以及無寄生<br>.範圍,達到169.7dBc。比較現存的<br>動態範圍。                                                                                                                                                                                                                       |  |
|                  | (英文)This research presents a<br>(CORDIC) algorithm for d<br>digital frequency synthe<br>architecture with small<br>provides a spurious free<br>SoC (System on Chip) has<br>emulated on the Xilinx F<br>architecture is suitable<br>terms of hardware cost,<br>bit word length, the hig<br>the proposed DDFS is sup | hybrid COordin<br>esigns and impl<br>sizer (DDFS). T<br>ROM (16X4 -bit)<br>dynamic range<br>been designed<br>PGA. It is show<br>for VLSI imple<br>power consumpti<br>h-frequency SFD<br>erior in terms | hate Rotation DIgital Computer<br>ementations of the direct<br>The proposed multiplier-less<br>and pipelined data path<br>(SFDR) of more than 84.4 dBc. A<br>by 1P6M CMOS, and then<br>on that the hybrid CORDIC-based<br>ementations of the DDFS in<br>on, and SFDR. In case of 16-<br>DR is 169.7 dBc. As one can see,<br>of SFDR, hardware cost, and |  |
| 產業別              | 電機及電子機械器材業                                                                                                                                                                                                                                                                                                         |                                                                                                                                                                                                        |                                                                                                                                                                                                                                                                                                                                                         |  |
| 技術/產品應用範圍        | 無線數位高頻寬網絡設備及晶片                                                                                                                                                                                                                                                                                                     |                                                                                                                                                                                                        |                                                                                                                                                                                                                                                                                                                                                         |  |
| 技術移轉可行性及<br>預期效益 | 可轉移晶片設計原始碼及相關實明                                                                                                                                                                                                                                                                                                    | 会資料,改進相關                                                                                                                                                                                               | 設備之性能。                                                                                                                                                                                                                                                                                                                                                  |  |

註:本項研發成果若尚未申請專利,請勿揭露可申請專利之主要內容。

# 98年度專題研究計畫研究成果彙整表

| 計畫主持人:宋志雲 計畫編號:98-2221-E-216-037- |                                       |                         |                               |                    |       |                                                      |      |
|-----------------------------------|---------------------------------------|-------------------------|-------------------------------|--------------------|-------|------------------------------------------------------|------|
| 計畫名                               | <b>稱:</b> 座標旋轉原                       | (理演算法應用於二)              | 維及三維特                         | 殊信號處理器             | 器之晶片設 | 计與製                                                  | 作(I) |
| 成果項目                              |                                       | 實際已達成<br>數(被接受<br>或已發表) | 量化<br>預期總達成<br>數(含實際已<br>達成數) | 本計畫實<br>際貢獻百<br>分比 | 單位    | 備註(質化說<br>明:如數個計畫<br>共同成果、成果<br>列為該期刊之<br>封面故事<br>等) |      |
|                                   |                                       | 期刊論文                    | 2                             | 2                  | 100%  |                                                      |      |
|                                   | 於古芝佐                                  | 研究報告/技術報告               | 0                             | 0                  | 100%  | 篇                                                    |      |
|                                   | ····································· | 研討會論文                   | 2                             | 2                  | 100%  |                                                      |      |
|                                   |                                       | 專書                      | 0                             | 0                  | 100%  |                                                      |      |
|                                   | <b>東</b> 毛川                           | 申請中件數                   | 0                             | 0                  | 100%  | 14                                                   |      |
|                                   | 守州                                    | 已獲得件數                   | 0                             | 0                  | 100%  | 17                                                   |      |
| 國內                                | 技術移轉                                  | 件數                      | 0                             | 0                  | 100%  | 件                                                    |      |
|                                   |                                       | 權利金                     | 0                             | 0                  | 100%  | 千元                                                   |      |
|                                   | 參與計畫人力<br>(本國籍)                       | 碩士生                     | 1                             | 0                  | 100%  | k -b                                                 |      |
|                                   |                                       | 博士生                     | 1                             | 0                  | 100%  |                                                      |      |
|                                   |                                       | 博士後研究員                  | 0                             | 0                  | 100%  | 八八                                                   |      |
|                                   |                                       | 專任助理                    | 0                             | 0                  | 100%  |                                                      |      |
|                                   | 論文著作                                  | 期刊論文                    | 4                             | 4                  | 100%  |                                                      |      |
|                                   |                                       | 研究報告/技術報告               | 0                             | 0                  | 100%  | 篇                                                    |      |
|                                   |                                       | 研討會論文                   | 4                             | 4                  | 100%  |                                                      |      |
|                                   |                                       | 專書                      | 0                             | 0                  | 100%  | 章/本                                                  |      |
|                                   | 專利                                    | 申請中件數                   | 0                             | 0                  | 100%  | 14                                                   |      |
|                                   |                                       | 已獲得件數                   | 0                             | 0                  | 100%  | 17                                                   |      |
| 國外                                | 技術移轉                                  | 件數                      | 0                             | 0                  | 100%  | 件                                                    |      |
|                                   |                                       | 權利金                     | 0                             | 0                  | 100%  | 千元                                                   |      |
|                                   |                                       | 碩士生                     | 1                             | 0                  | 100%  |                                                      |      |
|                                   | 參與計畫人力                                | 博士生                     | 1                             | 0                  | 100%  | 1-5                                                  |      |
|                                   | (外國籍)                                 | 博士後研究員                  | 0                             | 0                  | 100%  | ] 人次                                                 |      |
|                                   |                                       | 專任助理                    | 0                             | 0                  | 100%  |                                                      |      |

|            | 擔任研討會議程委員 | 及分組討論主席 |           |
|------------|-----------|---------|-----------|
| 其他成果       |           |         |           |
| (無法以量化表達之成 |           |         |           |
| 果如辦理學術活動、獲 |           |         |           |
| 得獎項、重要國際合  |           |         |           |
| 作、研究成果國際影響 |           |         |           |
| 力及其他協助產業技  |           |         |           |
| 術發展之具體效益事  |           |         |           |
| 項等,請以文字敘述填 |           |         |           |
| 列。)        |           |         |           |
|            |           |         |           |
| 成          | 果項目       | 量化      | 名稱或內容性質簡述 |

|         | 成果項目            | 量化 | 名稱或內容性質簡述 |
|---------|-----------------|----|-----------|
| 科       | 測驗工具(含質性與量性)    | 0  |           |
| 枚       | 課程/模組           | 0  |           |
|         | 電腦及網路系統或工具      | 0  |           |
| ;†<br>▶ | 教材              | 0  |           |
| E<br>M  | 舉辦之活動/競賽        | 0  |           |
| 眞       | 研討會/工作坊         | 0  |           |
| 頁       | 電子報、網站          | 0  |           |
| 3       | 計畫成果推廣之參與(閱聽)人數 | 0  |           |

### 國科會補助專題研究計畫成果報告自評表

請就研究內容與原計畫相符程度、達成預期目標情況、研究成果之學術或應用價值(簡要敘述成果所代表之意義、價值、影響或進一步發展之可能性)、是否適 合在學術期刊發表或申請專利、主要發現或其他有關價值等,作一綜合評估。

|   | 1. | 請就研究內容與原計畫相符程度、達成預期目標情況作一綜合評估                       |
|---|----|-----------------------------------------------------|
|   |    | 達成目標                                                |
|   |    | □未達成目標(請說明,以100字為限)                                 |
|   |    | □實驗失敗                                               |
|   |    | □因故實驗中斷                                             |
|   |    | □其他原因                                               |
|   |    | 說明:                                                 |
|   | 2. | 研究成果在學術期刊發表或申請專利等情形:                                |
|   |    | 論文:■已發表 □未發表之文稿 □撰寫中 □無                             |
|   |    | 專利:□已獲得 ■申請中 □無                                     |
|   |    | 技轉:□已技轉 □洽談中 ■無                                     |
|   |    | 其他:(以100字為限)                                        |
|   | 3. | 請依學術成就、技術創新、社會影響等方面,評估研究成果之學術或應用價                   |
|   |    | 值(簡要敘述成果所代表之意義、價值、影響或進一步發展之可能性)(以                   |
|   |    | 500 字為限)                                            |
|   |    | 本研究使用混合座標旋轉原理設計及製作直接數位頻率合成器。此一設計之架構為無乘法             |
|   |    | 器,包含小量之唯獨記憶體(-位元)以及疊流資料路徑,所產生無寄生動態範圍超過84.4          |
|   |    | dBc。系統晶片由台積電 1P6M CMOS 製程設計,並且在 Xilinx 陣列處理器上實體模擬。證 |
|   |    | 明此一以混合座標旋轉原理為基礎之直接數位頻率合成器適合由超大型積體電路製作,在             |
|   |    | 硬體成本,功率消耗以及無寄生動態範圍上都有具備優勢。非常適用於無線網路之晶片。             |
|   |    | 未來將進一步研究硬體架構的精簡,高頻之無寄生動態範圍提升以及雜訊比(PSNR)的改           |
|   |    | 進。                                                  |
| 1 |    |                                                     |