# DYNAMIC POWER ANALYSIS OF DATA PATH CIRCUITS IN MODIFIED DWT ARCHITECTURE AT 65 nm TECHNOLOGY

Sathiyabama B.<sup>1</sup>, Malarkkan S.<sup>2</sup>

<sup>1</sup>Research Scholar, Sathyabama University, Chennai-600119, India. <sup>2</sup>Manakula Vinayagar Institute of Technology, Puduchery, India.

### **ABSTRACT**

Power dissipation in CMOS circuits increases exponentially with technology scaling. Both dynamic and leakage power must therefore be reduced through analysis carried out at various levels of abstraction. In this paper, a DWT architecture based on the lifting scheme is considered, and dynamic power reduction is achieved by suitable modification of the architecture. The interdependency of scaling and dilation coefficients is simplified to a single hierarchy, which reduces latency and increases throughput. A Wallace tree multiplier and a carry select adder are used in realizing this 1D-DWT architecture, which operates at a maximum frequency of 280 MHz. Power consumption of the multiplier is optimized by the voltage scaling technique. The 1D-DWT architecture is modified and its performance is analyzed. Simulation results show that the dynamic power consumption is reduced by 37%. The proposed design is implemented using 65 nm TSMC low power library cells and is synthesized using Synopsys DC.

Key words: Dynamic power dissipation, DWT, Lifting Scheme, Adders, Multipliers

### I. INTRODUCTION

Mobile phones and other similar hand-held devices that support image/video applications demand high speed and low power architectures with reduced memory size for DWT processing. DWT is a standard recommended by JPEG 2000, as it supports features like progressive transmission, higher compression and region of interest encoding schemes. Convolution based DWT or FIR filter bank based DWT architectures occupy a large area, as they require a larger number of multipliers and adders, thus making the computations complex and time consuming.

The general approach for 2D-DWT is to apply the 1D-DWT row-wise, which produces L and H sub-bands, and then to process these sub-bands column-wise to get the LL, LH, HL and HH coefficients. To perform lifting based DWT, several architectures such as direct mapped [2], folded [3] and flipping [4] have been proposed for single level and multi-level DWT. Many architectures that implement the two-dimensional separable forward DWT (2D-DWT) and inverse DWT (2D-IDWT) for application to 2D signals have been presented in the past [5], [6], [7] and [8]. These architectures consist of filters for performing the 1D-DWT and memory units for storing the results of the transformation.
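The row-then-column procedure described above can be sketched as follows. This is a minimal illustration, not the architecture of this paper: `haar_1d` is a stand-in 1D transform (pairwise average and difference) chosen only to keep the example short, and the function names are hypothetical. Any 1D transform that returns a low-pass and a high-pass half, including a 9/7 lifting step, could be passed in its place.

```python
from typing import Callable, List, Tuple

Row = List[float]
Transform1D = Callable[[Row], Tuple[Row, Row]]


def haar_1d(samples: Row) -> Tuple[Row, Row]:
    """Stand-in 1D transform (NOT the 9/7 filter): pairwise average and difference."""
    pairs = range(0, len(samples) - 1, 2)
    low = [(samples[k] + samples[k + 1]) / 2 for k in pairs]
    high = [(samples[k] - samples[k + 1]) / 2 for k in pairs]
    return low, high


def transpose(matrix: List[Row]) -> List[Row]:
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*matrix)]


def dwt_2d(image: List[Row], dwt_1d: Transform1D = haar_1d):
    """One decomposition level: row pass gives L and H, column pass gives LL, LH, HL, HH.

    Naming used here: first letter = row filter, second letter = column filter.
    """
    # Row pass: every row splits into a low-pass half and a high-pass half.
    row_out = [dwt_1d(row) for row in image]
    l_band = [low for low, _ in row_out]
    h_band = [high for _, high in row_out]

    def column_pass(band: List[Row]) -> Tuple[List[Row], List[Row]]:
        # Transform each column, then restore row-major order.
        col_out = [dwt_1d(col) for col in transpose(band)]
        low = transpose([lo for lo, _ in col_out])
        high = transpose([hi for _, hi in col_out])
        return low, high

    ll, lh = column_pass(l_band)
    hl, hh = column_pass(h_band)
    return ll, lh, hl, hh


if __name__ == "__main__":
    ll, lh, hl, hh = dwt_2d([[1.0, 3.0], [5.0, 7.0]])
    print(ll, lh, hl, hh)
```

The column pass reuses the same 1D routine on transposed data, which mirrors the way a row processor and a column processor can share one filter design.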

DWT, which is also present in streaming multimedia applications that are characterized by high throughput requirements, imposes the need for optimizing the design of the filters in terms of speed. Moreover, portable multimedia devices require low power consumption to increase battery lifetime, and this can be achieved by minimizing the storage size and the number of memory accesses [9]. Low power DWT architectures based on pipelining and parallel processing have been discussed in [10] and [11]; in these works, low power is achieved by modifying the architecture to reduce the number of computations, and the designs were implemented on FPGA. Many of the low power techniques reported in the literature [12], [13], [14] and [15] for DWT propose modifications at the architecture level to reduce power dissipation. Power reduction can be accomplished at various levels of abstraction, from the architecture level to the circuit level. Power reduction at the sub-system level or at the circuit level can be accomplished when an ASIC design of the DWT architecture is performed. Much of the work reported in the literature has been restricted to FPGA implementation. In this paper, the DWT architecture is considered at various levels of abstraction to demonstrate dynamic power reduction techniques, using 65 nm TSMC libraries.

This paper is organized as follows: Section II discusses wavelet transforms and the DWT architecture. Section III discusses dynamic low power reduction techniques. Data path circuits of the DWT processor architecture sub-systems are given in Section IV. Section V presents the ASIC implementation of the modified DWT architecture based on low power adders. Section VI discusses implementation results and performance comparison, and Section VII presents the conclusion.

### II. DWT ARCHITECTURE

In wavelet analysis, signals are represented using a set of basis functions derived by shifting and scaling a single prototype function, referred to as the "mother wavelet", in time [16]. Wavelet transforms are closely related to tree structured digital filter banks and multi-resolution analysis. A set of wavelet basis functions can be generated by translating and dilating the mother wavelet. A number of architectures have been proposed for calculation of the DWT [2], [3], [4], [5] and [6]. The architectures are mostly folded and can be broadly classified into serial architectures (where the inputs are supplied to the filters in a serial manner) and parallel architectures (where the inputs are supplied to the filters in a parallel manner). A methodology for implementing lifting-based DWT that reduces the memory requirements and the communication between the processors when the input is broken up into blocks is presented in [17]. Figures 1 (a) and 1 (b) show the forward and inverse DWT based on the lifting scheme architecture.

The $z^{-1}$ blocks are delays, $\alpha$, $\beta$, $\gamma$, $\delta$, $\zeta$ are the lifting coefficients, and the shaded blocks are registers. The 9/7 filter has been used for implementation; it requires four lifting steps and one scaling step. The input signal $x_i$ is split into two parts, the even part $x_{2i}$ and the odd part $x_{2i+1}$. The first lifting step is then given by the equations [17]

$$d_i^1 = \alpha (x_{2i} + x_{2i+2}) + x_{2i+1} \tag{1}$$

$$a_i^1 = \beta (d_i^1 + d_{i-1}^1) + x_{2i} \tag{2}$$

Then the second lifting step is performed by

$$d_i^2 = \gamma (a_i^1 + a_{i+1}^1) + d_i^1 \tag{3}$$

$$a_i^2 = \delta (d_i^2 + d_{i-1}^2) + a_i^1 \tag{4}$$

Then scaling is performed and the following equations are obtained:

$$a_i = \zeta \, a_i^2 \tag{5}$$

$$d_i = d_i^2 / \zeta \tag{6}$$

The predict step exploits the correlation between the two sets of data and predicts the odd data samples from the even ones. These samples are used in the update step for updating the present phase. The update step constructs a new operator so that some of the properties of the original input data are also maintained in the reduced set. The lifting coefficients have constant values of -1.58613, -0.0529, 0.882911, 0.44350, -1.1496 for $\alpha, \beta, \gamma, \delta, \zeta$ respectively. $a_i$ and $d_i$ are the DWT outputs after level 1 decomposition.
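The two predict steps, two update steps and the final scaling can be sketched in Python as below. This is a minimal software model under stated assumptions, not the hardware described in this paper: the function name is hypothetical, boundary samples are taken by periodic wrap-around (the text does not specify a boundary rule), and $\zeta$ is used with a positive sign, which is an assumption of this sketch.

```python
from typing import List, Tuple

# Lifting constants as listed in the text. ZETA is taken as +1.1496 here;
# the sign is an assumption of this sketch.
ALPHA, BETA, GAMMA, DELTA, ZETA = -1.58613, -0.0529, 0.882911, 0.44350, 1.1496


def lifting_97_forward(x: List[float]) -> Tuple[List[float], List[float]]:
    """One level of 9/7 lifting on an even-length signal.

    Boundary samples wrap around periodically (assumption of this sketch).
    Returns (a, d): approximation and detail coefficients.
    """
    if len(x) < 2 or len(x) % 2 != 0:
        raise ValueError("signal length must be even and at least 2")
    n = len(x) // 2
    even = x[0::2]  # x_{2i}
    odd = x[1::2]   # x_{2i+1}

    # First predict: d1_i = alpha * (x_{2i} + x_{2i+2}) + x_{2i+1}
    d1 = [ALPHA * (even[i] + even[(i + 1) % n]) + odd[i] for i in range(n)]
    # First update: a1_i = beta * (d1_i + d1_{i-1}) + x_{2i}
    # (index -1 wraps to the last element, giving the periodic boundary)
    a1 = [BETA * (d1[i] + d1[i - 1]) + even[i] for i in range(n)]
    # Second predict: d2_i = gamma * (a1_i + a1_{i+1}) + d1_i
    d2 = [GAMMA * (a1[i] + a1[(i + 1) % n]) + d1[i] for i in range(n)]
    # Second update: a2_i = delta * (d2_i + d2_{i-1}) + a1_i
    a2 = [DELTA * (d2[i] + d2[i - 1]) + a1[i] for i in range(n)]
    # Scaling: a_i = zeta * a2_i, d_i = d2_i / zeta
    a = [ZETA * v for v in a2]
    d = [v / ZETA for v in d2]
    return a, d


if __name__ == "__main__":
    approx, detail = lifting_97_forward([1.0] * 8)
    print(approx)
    print(detail)
```

For a constant input the detail coefficients come out close to zero and the approximation coefficients close to $\sqrt{2}$ times the input, which is a quick sanity check on the coefficient values.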

### II. SOURCES OF POWER DISSIPATION IN CMOS VLSI CIRCUITS

Power consumption in CMOS digital circuits is divided into two major components, static and dynamic [18]. Dynamic power dissipation dominated at the 250 nm technology node; with technology scaling towards lower geometries (65 nm and below), leakage power has increased significantly, as per the roadmap. However, dynamic power has also increased exponentially, owing to the increase in switching current and frequency of operation of CMOS circuits. Various low power techniques are available [18]:

- reducing the voltage for lower performance blocks;
- cutting off power to blocks when they are not required;
- combining multiple voltages and/or power gating (shutdown);
- lowering the voltage when blocks are not needed, while leaving them powered enough to save state without extra retention;
- varying the voltage and/or frequency dynamically;
- varying the well voltage to adjust the threshold voltage ($V_{th}$), which in turn increases speed (forward bias) or reduces leakage (backward bias);
- reducing gate lengths in transistors along the non-critical paths;
- source biasing, which pushes the transistor to operate in the cut-off region by increasing the source-to-ground potential;
- isolation and level shifting;
- transistor sizing/gate sizing.

Power reduction techniques mentioned above are to be applied to the DWT architecture to optimize for low power. The major building blocks in DWT and IDWT (as shown in Figure 2) are the adders, multipliers, registers and control unit for data flow control. As the focus of this work is to reduce power dissipation at various levels of abstraction, adders and multipliers are designed with low power techniques.



Fig. 1. Lifting based architecture (a) Forward DWT (b) Inverse DWT [17]

### III. DATA PATH CIRCUITS OF DWT PROCESSOR

Adders and multipliers are the data path circuits of the DWT processor. An adder is the most commonly used arithmetic block in the Central Processing Unit (CPU) of a microprocessor, in a Digital Signal Processor (DSP), and even in a variety of ASICs. In a DWT processor, the adder is one of the important building blocks required to compute the DWT coefficients of the input signal. The multiplier used in a DWT processor also requires adders to add the partial products. Hence, the design and analysis of adders and multipliers are considered in this section. The speed and power optimization of an adder are significant for improving the overall performance of the system. However, an adder is also subject to the power-delay trade-off: its power dissipation increases as its delay is reduced, and vice versa. There are various architectures for adder design. 4-bit adders can be of different types, such as the Carry Look-Ahead Adder, Ripple Carry Adder, Carry Save Adder and Carry Select Adder [19]. These adders are designed with a low-power full adder module.

In many digital signal processing operations, such as correlation, convolution, filtering and frequency analysis, one needs to perform multiplication. Multiplication algorithms will be used to illustrate methods of designing different cells so that they fit into a larger structure. In order to introduce these designs, simple serial and parallel multipliers will be introduced. High-speed parallel multipliers are becoming one of the keys in RISCs (Reduced Instruction Set Computers), DSPs (Digital Signal Processors), graphics accelerators and so on. Parallel multipliers are used in data processors as well as in digital signal processors. Of the various multiplier architectures reported in the literature [20-21], the Wallace tree, Booth, BZ-FAD, shift-and-add and array multipliers are the most popular for DSP applications. These multipliers are built from adders, and their power consumption is optimized using transistor sizing.

### IV. MODIFIED LIFTING BASED DWT ARCHITECTURE FOR LOW POWER

The lifting equations presented in (1) – (6), when realized as an HDL model, form a sequential process, as the scaling and dilation coefficients are dependent on previous samples, thus introducing latency. In order to increase throughput and reduce latency, modified equations are derived. The modified lifting equations eliminate the dependency of the outputs on previous samples. We have obtained the equations for $a_i$ and $d_i$ by substituting (4) in (3), (3) in (2) and so on. The lifting coefficients were substituted and the results were scaled by multiplying by 256 to avoid fractional values and to round off the values. The modified lifting scheme equations are:

$$\begin{aligned}
a_i = 294\,\big(&8\,(6x_{2i} + 4x_{2i-2} + x_{2i+4} + x_{2i-4} + 4x_{2i+2}) - 5\,(3x_{2i+1} + x_{2i+3} + x_{2i-3} + 3x_{2i-1}) \\
&+ 100\,(2x_{2i} + x_{2i+2} + x_{2i-2}) - 180\,(2x_{2i} + x_{2i+2} + x_{2i-2}) + 113\,(x_{2i+1} + x_{2i-1}) \\
&+ 21\,(2x_{2i} + x_{2i+2} + x_{2i-2}) - 13\,(x_{2i+1} + x_{2i-1}) + x_{2i}\big)
\end{aligned} \tag{7}$$

$$\begin{aligned}
d_i = {}& 19\,(3x_{2i} + 3x_{2i+2} + x_{2i+4} + x_{2i-2}) - 12\,(2x_{2i+1} + x_{2i-1} + x_{2i+3}) \\
&+ 226\,(x_{2i} + x_{2i+2}) - 406\,(x_{2i} + x_{2i+2}) + x_{2i+1}
\end{aligned} \tag{8}$$

These equations are obtained by factoring out the common coefficients. The equations have an initial latency, as the input samples need to be stored before the DWT coefficients $a_i$ and $d_i$ are computed.
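The integer multipliers that appear in (7) and (8) can be cross-checked against the lifting constants by forming each coefficient product and scaling it by 256. The sketch below does only that arithmetic; the pairing of each product with a printed integer follows from substituting the lifting steps into one another, the function name is hypothetical, and $\zeta$ is again assumed positive.

```python
ALPHA, BETA, GAMMA, DELTA = -1.58613, -0.0529, 0.882911, 0.44350
ZETA = 1.1496  # magnitude from the text; positive sign assumed in this sketch
SCALE = 256


def scaled_coefficients():
    """Map each coefficient product to (256 * product, integer printed in the text)."""
    return {
        "alpha*beta*gamma*delta": (SCALE * ALPHA * BETA * GAMMA * DELTA, 8),
        "beta*gamma*delta": (SCALE * BETA * GAMMA * DELTA, -5),
        "gamma*delta": (SCALE * GAMMA * DELTA, 100),
        "alpha*delta": (SCALE * ALPHA * DELTA, -180),
        "delta": (SCALE * DELTA, 113),
        "alpha*beta": (SCALE * ALPHA * BETA, 21),
        "beta": (SCALE * BETA, -13),
        "alpha*beta*gamma": (SCALE * ALPHA * BETA * GAMMA, 19),
        "beta*gamma": (SCALE * BETA * GAMMA, -12),
        "gamma": (SCALE * GAMMA, 226),
        "alpha": (SCALE * ALPHA, -406),
        "zeta": (SCALE * ZETA, 294),
    }


if __name__ == "__main__":
    for name, (scaled, printed) in scaled_coefficients().items():
        print(f"{name:24s} {scaled:9.3f}   printed as {printed}")
```

Every scaled product lies within about 0.55 of the printed integer, although the printed values do not follow one single rounding rule (113.5 appears as 113, while 18.96 appears as 19).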

The design of the low power architecture to reduce dynamic power dissipation is based on equations (7) and (8). The following observations are made from these equations:

- a<sub>i</sub> and d<sub>i</sub> coefficients are computed based on input samples and lifting coefficients. Every output sample depends upon x<sub>0</sub> to x<sub>4</sub> input samples. Input samples are multiplied by coefficients as per the equations.
- Common factors are identified between the a<sub>i</sub> and d<sub>i</sub> equations, and these common functions are realized once and are reused to reduce the circuit complexity.
- Lifting coefficients are stored in memory and are retrieved only once and used for computation of a<sub>i</sub> and d<sub>i</sub> components.
- Pipelined architecture is developed to realize a<sub>i</sub> and d<sub>i</sub> equations.
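The reuse of common terms noted in the observations above can be illustrated in software. The sketch below evaluates (7) and (8) for one output index, forming each repeated sum only once. It is an illustration under assumptions, not the RTL of this work: the function name is hypothetical, the sample indices are taken symmetric about $x_{2i}$, the unscaled trailing terms are kept as printed, and folding the constants that multiply the same sum (for example 100 - 180 + 21) is an extra step taken here that the text does not describe.

```python
from typing import Sequence, Tuple


def direct_form_outputs(x: Sequence[int], i: int) -> Tuple[int, int]:
    """Evaluate (7) and (8) for output index i with shared sums computed once.

    Needs samples x[2i-4] .. x[2i+4]; no boundary extension is performed.
    """
    c = 2 * i  # index of the centre sample x_{2i}
    if c - 4 < 0 or c + 4 >= len(x):
        raise IndexError("need samples x[2i-4] .. x[2i+4]")

    # Sums that occur more than once are formed a single time and reused.
    s_even3 = 2 * x[c] + x[c + 2] + x[c - 2]  # multiplied by 100, -180 and 21 in (7)
    s_odd2 = x[c + 1] + x[c - 1]              # multiplied by 113 and -13 in (7)
    s_even2 = x[c] + x[c + 2]                 # multiplied by 226 and -406 in (8)
    # Sums that occur only once.
    s_even5 = 6 * x[c] + 4 * x[c - 2] + x[c + 4] + x[c - 4] + 4 * x[c + 2]
    s_odd4 = 3 * x[c + 1] + x[c + 3] + x[c - 3] + 3 * x[c - 1]

    a = 294 * (8 * s_even5 - 5 * s_odd4
               + (100 - 180 + 21) * s_even3
               + (113 - 13) * s_odd2
               + x[c])
    d = (19 * (3 * x[c] + 3 * x[c + 2] + x[c + 4] + x[c - 2])
         - 12 * (2 * x[c + 1] + x[c - 1] + x[c + 3])
         + (226 - 406) * s_even2
         + x[c + 1])
    return a, d


if __name__ == "__main__":
    print(direct_form_outputs(list(range(20)), 3))
```

With the shared sums factored this way, the three multiplications of the same even-sample sum collapse into one, which is the software analogue of realizing a common function once in hardware.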

The proposed architecture shown in Figure 3 takes two inputs and gives two outputs per cycle. Data 1 and Data 2 are the odd and even input samples given to the hardware in a single clock cycle for 100% hardware utilization. This architecture is a very simple design compared to the architectures suggested in [22], which have a complex control path to achieve 100% hardware utilization. The row processor and column processor shown in Figure 4 are realized using the modified lifting scheme based equations.

Fig. 3. Row processor and column processor for modified lifting DWT

Based on the architecture shown in Figure 3 and the equations presented in (7) and (8), the top level model for the architecture is shown in Figure 4. A detailed data flow for the proposed architecture is presented in Figure 4. The modified architecture consists of the following blocks: a parallel-input serial-output register, a serial-input parallel-output register, multipliers, adders and a control unit.

### V. IMPLEMENTATION RESULTS AND DISCUSSION

In this work, the adders and multipliers are modeled using HDL and are synthesized with Synopsys DC using TSMC 65 nm CMOS libraries. The synthesis reports provide information on area, delay and power dissipation. The results obtained are presented in Table 1. Carry select adders consume less power with moderate delay. Different multipliers are designed using carry select adders. The dynamic power and leakage are analyzed with the 65 nm CMOS libraries using Synopsys DC. In order to reduce the power dissipation of the adder and multiplier, the multi-VDD technique is adopted. Reducing the VDD supply voltage minimizes the power consumption without affecting the area occupied. From the results obtained, it is found that power consumption is a quadratic function of the supply voltage ($\text{Power} = f C V_{DD}^2$). A decrease in supply voltage increases the overall delay ($\text{Delay} = K V_{DD} / (V_{DD} - V_t)^{\alpha}$). The power and delay performance of the Wallace multiplier with the voltage scaling technique is given in Table 3.

Table 1
Full Adder Design Comparisons

| Type of adder (16-bit) | No. of transistors | Power (μW) | Delay (ps) |
|------------------------|--------------------|------------|------------|
| Ripple carry adder     | 286                | 40.5505    | 600        |
| Carry save adder       | 92                 | 18.9241    | 74         |
| Carry select adder     | 102                | 16.897     | 65         |
| Carry look-ahead adder | 621                | 55.1482    | 62         |
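One way to read the power-delay trade-off in Table 1 is through the power-delay product (PDP) of each adder. The sketch below only multiplies the tabulated values; the PDP metric is brought in here for illustration and is not reported in this work.

```python
# (power in uW, delay in ps) for the 16-bit adders, copied from Table 1.
ADDERS = {
    "Ripple carry": (40.5505, 600),
    "Carry save": (18.9241, 74),
    "Carry select": (16.897, 65),
    "Carry look-ahead": (55.1482, 62),
}


def power_delay_product(power_uw: float, delay_ps: float) -> float:
    """Return the power-delay product in attojoules (1 uW * 1 ps = 1e-18 J)."""
    return power_uw * delay_ps


pdp = {name: power_delay_product(p, t) for name, (p, t) in ADDERS.items()}
best = min(pdp, key=pdp.get)

if __name__ == "__main__":
    for name, value in sorted(pdp.items(), key=lambda item: item[1]):
        print(f"{name:18s} {value:10.1f} aJ")
    print("lowest PDP:", best)
```

On these figures the carry select adder has the lowest product, which is consistent with its selection for the DWT data path.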



Fig. 4. Modified lifting scheme architecture to reduce dynamic power

Table 2
Power comparison of multipliers

| Multiplier               | Total dynamic power (μW) | Cell leakage power (μW) |
|--------------------------|--------------------------|-------------------------|
| BZ-FAD multiplier        | 161.27                   | 5.32                    |
| Shift-and-add multiplier | 241.14                   | 4.71                    |
| Booth multiplier         | 468.02                   | 12.69                   |
| Array multiplier         | 298.83                   | 10.24                   |
| Wallace tree multiplier  | 341.62                   | 13.81                   |

Table 3
Reduction in dynamic power with voltage scaling

| Supply voltage (V) | Power (μW) | Delay (ns) |
|--------------------|------------|------------|
| 1.8                | 341.62     | 1.76       |
| 1.5                | 156.12     | 1.934      |
| 1.2                | 82.61      | 2.34       |
| 1                  | 41.29      | 3.99       |
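The voltage scaling trade-off in Table 3 can be summarized by expressing each row relative to the 1.8 V operating point. The sketch below is plain arithmetic on the tabulated values; the function name and the choice of the first row as reference are assumptions made for illustration.

```python
# (supply voltage in V, power in uW, delay in ns) copied from Table 3.
TABLE_3 = [
    (1.8, 341.62, 1.76),
    (1.5, 156.12, 1.934),
    (1.2, 82.61, 2.34),
    (1.0, 41.29, 3.99),
]


def scaling_summary(rows):
    """Return (vdd, power saving %, delay increase %, PDP in fJ) for each row.

    Savings and slowdowns are relative to the first (highest-voltage) row.
    """
    _, p_ref, t_ref = rows[0]
    summary = []
    for vdd, power, delay in rows:
        saving = 100.0 * (1.0 - power / p_ref)
        slowdown = 100.0 * (delay / t_ref - 1.0)
        pdp_fj = power * delay  # 1 uW * 1 ns = 1 fJ
        summary.append((vdd, saving, slowdown, pdp_fj))
    return summary


if __name__ == "__main__":
    for vdd, saving, slowdown, pdp_fj in scaling_summary(TABLE_3):
        print(f"{vdd:3.1f} V  power -{saving:5.1f}%  delay +{slowdown:6.1f}%  PDP {pdp_fj:7.2f} fJ")
```

Going from 1.8 V to 1 V cuts the tabulated power by roughly 88% while the delay grows by roughly 127%, so the lowest supply is attractive only where the slower multiplier still meets timing.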

The modified DWT row architecture is implemented using the carry select adder and the Wallace tree multiplier. The HDL model is developed for the architecture, and the design is verified for its functionality using a test bench in ModelSim. The functionally correct HDL code is synthesized using Synopsys DC targeting the TSMC 65 nm library and technology files. The reports obtained are compiled and presented in Table 4.

Table 4
ASIC synthesis results of modified lifting based DWT

| Parameters                    | DWT          | Modified lifting based DWT |
|-------------------------------|--------------|----------------------------|
| Area (sq.mm)                  | 21984.657462 | 29542.89061                |
| Power (μW)                    | 962.9536     | 604.712                    |
| Maximum operating frequency (MHz) | 212      | 278                        |

From the results tabulated in Table 4, it is found that the changes in architecture, which reduce the number of stages in the DWT computation, reduce the dynamic power dissipation by 37%. However, the area increases due to the additional registers and intermediate storage units. Further, the synthesized design has minimum delay and meets the zero slack requirement.
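The percentages behind this comparison follow directly from Table 4. The short sketch below recomputes them; the dictionary keys are illustrative names, and the area values are used only as a ratio, so their unit does not matter here.

```python
# Values copied from Table 4 (lifting based DWT vs. modified lifting based DWT).
BASELINE = {"area": 21984.657462, "power_uw": 962.9536, "fmax_mhz": 212}
MODIFIED = {"area": 29542.89061, "power_uw": 604.712, "fmax_mhz": 278}


def percent_change(before: float, after: float) -> float:
    """Signed change from before to after, in percent of before."""
    return 100.0 * (after - before) / before


changes = {key: percent_change(BASELINE[key], MODIFIED[key]) for key in BASELINE}

if __name__ == "__main__":
    for key, value in changes.items():
        print(f"{key:10s} {value:+6.2f} %")
```

The recomputed figures are a power reduction of about 37.2%, a maximum frequency gain of about 31.1% and an area increase of about 34.4%.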

### VI. CONCLUSION

In this work, a modified lifting based 1D-DWT architecture is proposed, designed and implemented using the 65 nm TSMC low power design library. The modified lifting based DWT is considered to illustrate the techniques that can be adopted to reduce dynamic power consumption. Adders and multipliers are important data path circuits of the DWT architecture. Initially, various adders are designed with a low power CMOS full adder using 65 nm library cells. Then different multipliers are developed and modeled with 65 nm library cells, and the voltage scaling technique is applied to them. Finally, the modified 1D-DWT architecture is constructed using the carry select adder and the Wallace tree multiplier and modeled in HDL code, which is synthesized using Synopsys DC targeting the TSMC 65 nm library and technology files. Due to the changes in architecture, the number of stages in the 1D-DWT computation is decreased and the dynamic power dissipation is reduced by 37%, at the cost of increased area. Modifications at the architecture level as well as at other abstraction levels are considered for power reduction. Low power library cells from Synopsys DesignWare are considered for synthesis. In order to further reduce power dissipation, various other dynamic low power techniques can be introduced for optimization.

### **REFERENCES**

- [1] I. Daubechies and W. Sweldens, "Factoring Wavelet transforms into Lifting Schemes," The J. of Fourier Analysis and Applications, vol. 4, 1, pp. 247–269, 1998.
- [2] C.C. Liu, Y.H. Shiau, and J.M. Jou, "Design and Implementation of a Progressive Image Coding Chip Based on the Lifted Wavelet Transform," in Proc. of the 11th VLSI Design/CAD Symposium, Taiwan, 2000.
- [3] C.J. Lian, K.F. Chen, H.H. Chen, and L.G. Chen, "Lifting Based Discrete Wavelet Transform Architecture for JPEG 2000," in IEEE International Symposium on Circuits and Systems, Sydney, Australia, pp. 445–448, 2001.
- [4] C.T. Huang, P.C. Tseng, and L.G. Chen, "Flipping Structure: An Efficient VLSI Architecture for Lifting-Based Discrete Wavelet Transform," in IEEE Transactions on Signal Processing, pp. 1080–1089, 2004.
- [5] C. Chakrabarti and M. Vishwanath, "Efficient realizations of the discrete and continuous wavelet transforms: from single chip implementations to SIMD parallel computers," IEEE Trans. Signal Processing, vol. 43, no.3, pp. 759-771, March 1995.
- [6] C. Chakrabarti and M. Vishwanath and R. M. Owens, "Architectures for wavelet transforms: A survey," Journal of VLSI Signal Processing, vol. 4, no. 2, pp 171-192, 1996.
- [7] Vishwanath, R. M. Owens, M. J. Irwin "VLSI architectures for the discrete wavelet transform", IEEE Trans. Circuits and Syst. II, vol. 42, no. 5, May 1995.
- [8] N.D. Zervas, G.P. Anagnostopoulos, V. Spiliotopoulos, Y. Andreopoulos and C.E. Goutis, "Evaluation of design alternatives for the 2-D-discrete wavelet transform", IEEE Trans. Circuits and Syst. Video Technol., vol. 11, no. 2, pp. 1246-1262, December 2001.
- [9] F. Catthoor, S. Wuytack, E. De Greff, F. Balasa, L. Nachtergale, A. Vandecappele, "Custom Memory Management Methodology -Exploration of Memory management Organization for Embedded Multimedia System Design", Kluwer Academic Publishers, 1998.
- [10] Bing-Fei Wu and Chung-Fu Lin, "A High-Performance and Memory-Efficient Pipeline Architecture for the 5/3 and 9/7 Discrete Wavelet Transform of JPEG2000 Codec," *IEEE Trans. on Circuits and Systems for Video Technology*, vol. 15, no. 12, pp. 1615–1627, December 2005.
- [11] Nagabushnam, Cyril Prasanna Raj P, Ramachandra, Design and FPGA Implementation of Modified Distributive Arithmetic Based DWT-IDWT Processor for Image Compression, European Journal of Scientific Research, Vol. 2, pp. 23-26, 2009.
- [12] Cyril Prasanna Raj P, Low power DWT for image compression, SASTech Journal, Vol.7, pp. 56-61, 2008.
- [13] F. Marino, "Efficient high-speed/low-power pipelined architecture for the direct 2-D discrete wavelet transform," *IEEE Trans. Circuits Systems II, Analog Digit. Process.*, vol.47, no.12, pp.1476-1491, 2000.
- [14] T. Park and S. Jung, "High speed lattice based VLSI architecture of 2D discrete wavelet transform for real-time video signal processing," IEEE Trans. Consumer Elect., vol.48, no.4, pp.1026-1032, 2002.
- [15] Yeong-Kang Lai, Lien-Fei Chen and Yui-Chih Shih, "A High-performance and Memory-Efficient VLSI Architecture with Parallel Scanning method for 2-D Lifting-Based Discrete Wavelet Transform," IEEE Transactions on Consumer Electronics, vol. 55, no. 2, May 2009.
- [16] P.P. Vaidyanathan, *Multirate Systems and Filter Banks*, Englewood Cliffs, Prentice-Hall, 1993.

- [17] Tinku Acharya and Chaitali Chakrabarti, "A Survey on Lifting-based Discrete Wavelet Transform Architectures", Journal of VLSI Signal Processing 42, 321–339, 2006.
- [18] Neil H.E. Weste and David Harris, *CMOS VLSI Design: A Circuit and System Perspective*, 3<sup>rd</sup> edition, Pearson Education, 2005.
- [19] C. Wey, C.H. Huang and H.C. Chow, "A New Low-Voltage CMOS 1-Bit Full Adder for High Performance Applications," *IEEE*, pp. 21-24, 2002.
- [20] F. Vasefi and Z. Abid, "Low power n-bit adders and multiplier using lowest-number-of-transistor 1-bit adders," in IEEE Conference Proceeding of CCECE/CCGEI, Saskatoon, May 2005, pp. 1731-1734.
- [21] Shanthala S., Cyril Prasanna Raj P. and S. Y. Kulkarni, "Design and VLSI implementation of Pipelined Multiply Accumulate Unit," in International Conference on Emerging Trends in Engineering and Technology (ICETET 09), G.H. Raisoni College of Engineering, Nagpur (Maharashtra), 16–18 December 2009.
- [22] A.D. Darji, A.N. Chandorkar, and S.N. Merchant, Memory Efficient and Low power VLSI architecture for 2-D Lifting based DWT with Dual data Scan Technique, Recent Researches in Circuits, Systems and Signal Processing, ISBN: 978-1-61804-017-6.