# Ultra Low Power 2-tier 3D Stacked Sub-threshold H.264 Intra Frame Encoder

Sandeep Kumar Samal<sup>1</sup>, Kiyoung Kim<sup>2</sup>, Youngchan Kim<sup>2</sup>, Taesung Kim<sup>2</sup>, Hyuk-Jae Lee<sup>2</sup>, Taewhan Kim<sup>2</sup> and Sung Kyu Lim<sup>1</sup>

<sup>1</sup>School of ECE, Georgia Institute of Technology, Atlanta, GA, USA

<sup>2</sup>School of ECE, Seoul National University, Seoul, Korea

Email: {sandeep.samal, limsk}@gatech.edu

Digital circuits used in sensor networks require long battery life and do not demand a high operating frequency, which makes sub-threshold circuits an attractive option for such applications. Three-dimensional ICs (3DICs), on the other hand, are an emerging technology that helps in miniaturization and reduces interconnects, resulting in power savings and performance improvement. Sub-threshold circuits and TSV-based 3DICs have each been studied independently, but no prior work has studied the impact of 3D stacking on sub-threshold circuits. We design and study an ultra-low power 2-tier 3D sub-threshold implementation of an H.264 intra frame encoder. The encoder consumes $0.73\mu W$ at a 16.13 kHz clock frequency for a typical application of encoding a Common Intermediate Format (CIF) frame. The motivation is to assess the feasibility of using extremely low power video encoders in image-sensor-based sensor networks. Low power operation is highly beneficial to such unattended sensor networks because it extends their battery life. Sub-threshold design helps in this respect, while 3D stacking minimizes footprint area, helps in moving off-chip memory on-chip, and improves timing performance.

## I. DESIGN FLOW

We used GlobalFoundries 130nm technology for the design. The nominal supply voltage is 1.5V. We sized the basic logic gates to minimize propagation delay mismatch at 0.4V, then designed a two-die 3D stacked H.264 encoder for 1.5V and 0.4V supplies and compared their performance. The energy per cycle vs. supply voltage curve [1] for our standard cells and the reliable functioning of the D flip-flop influenced the choice of 0.4V as the sub-threshold supply. Since process variations in sub-90nm technologies have a significant impact on sub-threshold operation, we chose the larger technology node. We exclude the external memory from the present implementation, and sub-threshold register files are used for internal memory. The standard cells were studied for process, thermal, and supply variations, but we present only the full-chip thermal and IR drop analysis of the 3D sub-threshold design.

We used the fast H.264 encoder architecture in [2]. The encoder was partitioned by keeping the prediction phase and the reconstruction phase in separate tiers. Fig. 1 shows the details of the architecture and its partitioning. Our RTL-to-GDSII tool chain is based on commercial tools, enhanced with in-house tools to handle TSVs and 3D stacking. The standard cells were sized with Cadence Virtuoso and their libraries were characterized using Encounter Library Characterizer. The entire 3D netlist was synthesized with Design Compiler. The layout of the individual dies was done with Encounter Digital Implementation System, and the 3D power and timing analysis was carried out using Synopsys PrimeTime. ModelSim was used to simulate the CIF encoding test bench and generate the activity file for power calculations.

To carry out the 3D IR drop analysis, we first generated a 3D technology file using TSVs with face-to-back bonding of dies. We then modified the library and layout files of each die and combined them to generate the 3D design files. Rings and stripes on the top metal layers supply power to the Metal1 core wires. The stripes are used only in the bottom tier, to supply power to cells between the distributed signal TSVs. Only four power TSVs, at the ring intersections, are used to make the analysis stringent. VoltageStorm was then used to analyze the static IR drop in this 3D design. The current sources for the static IR drop analysis were obtained from the cell powers calculated using PrimeTime.

For thermal simulation, we first built a 3D mesh for our chip and computed the thermal conductivity of each grid cell using layout and stacking information. Using the cell powers, we built a power density map. Ansys Fluent then solved the thermal differential equations using the power density map and thermal conductivity information to obtain the temperature map. In this simulation, we assumed adiabatic side walls, no heat sink, and an ambient temperature of 27°C.
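The power density map step can be illustrated with a minimal sketch. The cell list, grid resolution, and function name below are hypothetical and chosen only for illustration; in the flow described above, the cell powers come from PrimeTime and the cell locations from the layout.

```python
def power_density_map(cells, die_w_mm, die_h_mm, nx, ny):
    """Bin per-cell power (uW) into an nx-by-ny grid; return uW/mm^2 per tile.

    cells: iterable of (x_mm, y_mm, power_uw) giving cell centre coordinates.
    All inputs here are hypothetical illustration data.
    """
    tile_w = die_w_mm / nx
    tile_h = die_h_mm / ny
    tile_area = tile_w * tile_h
    grid = [[0.0 for _ in range(nx)] for _ in range(ny)]
    for x, y, p in cells:
        # Clamp so a cell exactly on the right/top die edge lands in the last tile.
        col = min(int(x / tile_w), nx - 1)
        row = min(int(y / tile_h), ny - 1)
        grid[row][col] += p
    return [[p / tile_area for p in row] for row in grid]


# Hypothetical cells on a 1.0 mm x 1.0 mm die, split into a 2 x 2 grid.
example_cells = [(0.10, 0.10, 0.02), (0.20, 0.30, 0.01), (0.90, 0.95, 0.03)]
density = power_density_map(example_cells, 1.0, 1.0, nx=2, ny=2)
```

A real thermal mesh would use a much finer grid and one such map per tier; the binning idea is the same.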

## II. FULL CHIP ANALYSIS

The super-threshold 2D design of the H.264 encoder is used as the baseline, and we compare the super-threshold 3D, sub-threshold 2D, and sub-threshold 3D designs with it. The layouts of the individual dies designed for sub-threshold operation are shown in Fig. 2. The TSVs are placed in a distributed fashion and the cells are placed accordingly. The comparison of the power and timing performance of the 2D and 3D designs at nominal 1.5V and sub-threshold 0.4V is summarized in Table I. Fig. 3 shows the temperature map for the CIF image frame encoding application. Table II summarizes the maximum internal temperature and the maximum static IR drop for power values based on the same application.



Fig. 1. 2-tier partitioning for 3D implementation of H.264 encoder in [2]



Fig. 2. Layouts of individual dies in 3D designs with 0.4V sub-threshold voltage



Fig. 3. Temperature map of 3D design with 0.4V sub-threshold voltage

We observe that moving into the sub-threshold regime gives a huge power benefit at the cost of operating frequency. The switching and internal power consumption reduce by more than 40000 times in sub-threshold operation. The leakage power is proportional only to the square of the supply voltage, so its reduction is less significant. However, the overall power reduction is still very high. 3D integration saves 56% of the footprint area and improves the timing performance of the sub-threshold design with a very minor power overhead over the 2D sub-threshold design. This overhead is due to the extra buffering in the individual dies, which are designed separately because standard 3D layout tools are not available. Combined 3D placement can help overcome this issue. The full-chip 3D IR drop analysis shows that the maximum IR drop is just $3\mu V$ with only 4 power bumps at the corners of the rings, because extremely low current is drawn from the source. Most of the drop occurs across the TSVs in going from one die to the other. This minor drop does not noticeably affect sub-threshold performance. In practice, more power TSVs would be used, reducing the resistance further. However, external supply noise may cause the standard cells, especially the D flip-flops, to fail. The thermal analysis also shows that 3D stacking has no detrimental effect on sub-threshold designs, and the heat sink can even be eliminated. The maximum temperature of the sub-threshold design remains at ambient temperature, while it increases to 59.9°C for the super-threshold design. The temperature variation across both 3D designs is extremely small due to the even power density distribution.

We also studied the general trend of power vs. minimum clock period for the design, using SPICE simulations of individual standard cells and interpolating them based on the actual design simulations at 1.5V and 0.4V supplies. The plot is shown in Fig. 4. As discussed earlier, the minimum energy per cycle is obtained at a supply close to the threshold voltage. We can select the frequency of the H.264 encoder based on the application requirement and try to minimize power for that frequency. For reliability, we may compromise and set a slightly higher supply voltage. The effect of external variations in temperature and voltage on performance and reliability also needs to be considered.
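As a rough illustration of the power vs. clock period trade-off, the energy per cycle of each design can be estimated as total power times clock period. The sketch below uses the total power and target clock period from Table I; it is derived arithmetic rather than a figure reported here, and it uses the target period, not the minimum achievable period.

```python
# (total power in uW, target clock period in ns), copied from Table I.
designs = {
    "super-threshold 2D": (26200.0, 20.0),
    "super-threshold 3D": (27900.0, 20.0),
    "sub-threshold 2D": (0.69, 62000.0),
    "sub-threshold 3D": (0.73, 62000.0),
}


def energy_per_cycle_pj(power_uw, period_ns):
    # uW * ns = 1e-15 J (femtojoules); divide by 1000 to express in picojoules.
    return power_uw * period_ns / 1000.0


energy = {name: energy_per_cycle_pj(p, t) for name, (p, t) in designs.items()}
for name, e in energy.items():
    print(f"{name}: {e:.2f} pJ/cycle")
```

With these values, the estimate is about 524 pJ per cycle for the super-threshold 2D baseline and about 45 pJ per cycle for the sub-threshold 3D design, which suggests that most of the power reduction comes from the lower clock frequency and roughly an order of magnitude from lower energy per cycle.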



Fig. 4. The trade-off between power and minimum clock period with change in supply voltage

## III. CONCLUSIONS

We designed an ultra-low power H.264 intra frame encoder for sensor applications using a sub-threshold supply and TSV-based 3D stacking, and analyzed its performance. We observed that carefully planned sub-threshold 3D stacking can be a promising approach for ultra-low power, miniaturized, unattended sensor networks in general, with the video encoder serving as the candidate for our study. While the footprint area was reduced by more than 50%, the internal thermal and IR drop issues were negligible. A correct choice of supply voltage based on the application requirement, the minimum energy condition, and the reliability and robustness of the design are the primary factors to be considered when designing miniaturized ultra-low power circuits.

## IV. ACKNOWLEDGMENTS

This work was supported by the CISS funded by the MEST Global Frontier Project, Govt. of South Korea (CISS 2011-0031863).

## REFERENCES

- [1] A. Wang and A. Chandrakasan, "A 180-mV Subthreshold FFT Processor Using a Minimum Energy Design Methodology," *IEEE Journal of Solid-State Circuits*, vol. 40, no. 1, pp. 310–319, 2005.
- [2] J. S. Jung, Y. J. Jo, and H. J. Lee, "A Fast H.264 Intra Frame Encoder with Serialized Execution of 4x4 and 16x16 Predictions and Early Termination," *Journal of Signal Processing Systems*, vol. 64, no. 1, pp. 161–175, 2011.

|                             | Super-threshold 2D | Super-threshold 3D | Sub-threshold 2D | Sub-threshold 3D | Ratio (3D sub / 2D super) |
|-----------------------------|--------------------|--------------------|------------------|------------------|---------------------------|
| Supply voltage (V)          | 1.5                | 1.5                | 0.4              | 0.4              |                           |
| Area (mm × mm)              | 1.5 × 1.5          | 1.0 × 1.0          | 1.5 × 1.5        | 1.0 × 1.0        | 0.44                      |
| Wirelength (m)              | 3.04               | 3.02               | 2.90             | 2.97             | 0.977                     |
| Target clock period (ns)    | 20                 | 20                 | 62000            | 62000            | 3100                      |
| Timing slack (ns)           | 3.83               | 3.83               | 862              | 4062             |                           |
| Internal power ($\mu$W)     | 18400              | 18400              | 0.4              | 0.4              | 1/46000                   |
| Switching power ($\mu$W)    | 7780               | 9490               | 0.15             | 0.19             | 1/40947                   |
| Leakage power ($\mu$W)      | 2.9                | 2.9                | 0.14             | 0.14             | 1/20.71                   |
| Total power ($\mu$W)        | 26200              | 27900              | 0.69             | 0.73             | 1/35890                   |

TABLE I. DESIGN AND PERFORMANCE COMPARISON OF H.264 ENCODER IN 2D AND 3D SUPER-THRESHOLD AND SUB-THRESHOLD OPERATION
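The comparison column of Table I can be reproduced directly from the tabulated values. The sketch below is only a consistency check on the reported numbers; the dictionary keys and variable names are illustrative and do not come from the design flow.

```python
# Values copied from Table I as (2D super-threshold, 3D sub-threshold).
rows = {
    "footprint_mm2": (1.5 * 1.5, 1.0 * 1.0),
    "wirelength_m": (3.04, 2.97),
    "clock_period_ns": (20.0, 62000.0),
    "internal_uw": (18400.0, 0.4),
    "switching_uw": (7780.0, 0.19),
    "leakage_uw": (2.9, 0.14),
    "total_uw": (26200.0, 0.73),
}

# Ratio of the 3D sub-threshold value to the 2D super-threshold baseline.
ratio = {name: sub3d / base2d for name, (base2d, sub3d) in rows.items()}
# Reduction factor N, matching the "1/N" form used for the power rows.
reduction = {name: 1.0 / r for name, r in ratio.items()}

# Footprint saved by two-tier stacking, in percent.
footprint_saving_pct = (1.0 - ratio["footprint_mm2"]) * 100.0
# Clock frequency in kHz for a period given in ns: f = 1e6 / period_ns.
clock_khz = 1e6 / rows["clock_period_ns"][1]

print(f"footprint ratio {ratio['footprint_mm2']:.2f}, saving {footprint_saving_pct:.0f}%")
print(f"sub-threshold clock {clock_khz:.2f} kHz")
print(f"total power reduction 1/{reduction['total_uw']:.0f}")
```

The footprint ratio of 0.44 corresponds to the 56% area saving quoted in the text, and the 62000 ns period corresponds to the 16.13 kHz clock frequency.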

TABLE II. MAX STATIC IR DROP AND MAX TEMPERATURE COMPARISON IN 3D DESIGNS. (THERMAL SIMULATION WITHOUT HEAT SINK AND AMBIENT TEMPERATURE = 27°C)

|                      | 3D super-threshold | 3D sub-threshold |
|----------------------|--------------------|------------------|
| Max static IR drop   | 41mV               | $3\mu V$         |
| Max temperature (°C) | 59.90              |                  |
| Avg temperature (°C) | 59.66              |                  |
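The observation that very little current is drawn in sub-threshold operation can be made concrete by dividing the total power in Table I by the supply voltage. The sketch below does this for the two 3D designs and expresses the maximum static IR drop from Table II as a fraction of the supply. It uses the average current only, so it is a rough estimate and not a substitute for the full-chip IR drop analysis.

```python
# (supply voltage in V, total power in W, max static IR drop in V),
# copied from Table I and Table II for the two 3D designs.
designs = {
    "3D super-threshold": (1.5, 27900e-6, 41e-3),
    "3D sub-threshold": (0.4, 0.73e-6, 3e-6),
}


def average_supply_current(vdd, power_w):
    # Average current drawn from the supply: I = P / V.
    return power_w / vdd


summary = {}
for name, (vdd, power_w, ir_drop) in designs.items():
    current = average_supply_current(vdd, power_w)
    summary[name] = {
        "current_a": current,            # average supply current in amperes
        "drop_fraction": ir_drop / vdd,  # max static IR drop relative to Vdd
    }
    print(f"{name}: I = {current * 1e6:.3f} uA, "
          f"drop = {100 * ir_drop / vdd:.5f}% of Vdd")
```

Under this estimate the sub-threshold 3D design draws about 1.8 µA on average, compared with about 18.6 mA for the super-threshold 3D design.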