# Automated Generation of All-Digital I/O Library Cells for System-in-Package Integration of Multiple Dies

M. Lee, A. Singh, H. M. Torun, J. Kim, S. Lim, M. Swaminathan, and S. Mukhopadhyay

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA

minah.lee@gatech.edu

Abstract—This paper presents an automated cell library generation flow for all-digital I/O circuits for SiP integration. Given parameterized models of SiP wire traces, our method automatically designs, optimizes, and generates layouts of I/O cells for delay/energy minimization. We demonstrate that automated I/O library cell generation can reduce the maximum die-to-die communication delay or energy for a given multi-die SiP design and its associated interposer wire traces. The proposed flow is demonstrated in 28nm CMOS technology with interposer-based SiP integration.

Keywords—System-in-package (SiP), 2.5D integration, interface circuits, I/O library, automated flow

#### I. INTRODUCTION

System-in-package (SiP) integration allows heterogeneous integration in a single module, which can reduce design cost and increase yield compared to a system-on-chip (SoC) while promising much higher performance than on-board integration [1-4]. When an SoC is partitioned into multiple dies and integrated as a SiP, the on-chip wires between IPs are replaced by die-to-die (D2D) interconnects in the interposer (Fig. 1). Unlike highly diffusive on-chip wires, the wires in silicon interposers show appreciable inductive properties, similar to off-chip wires. However, the traditional I/O cells used to drive off-chip wires are complex mixed-signal circuits that consume appreciable power and require custom design. The total number of I/O cells connecting on-interposer wires in a SiP will be much larger than the number of off-chip I/Os in the original SoC (Fig. 1). Hence, I/O cells for SiP should be simple and low-power while still driving inductive on-interposer wires. All-digital I/O cells with full-swing signaling, similar to those for on-chip wires, are desirable to achieve this goal. Moreover, a SiP integrating multiple dies is likely to contain wires of various lengths, so a single I/O cell may not provide the optimal propagation delay and energy across different trace lengths. Therefore, I/O circuits should be auto-generated and optimized for the target wires to minimize delay/energy.

This paper presents a tool for the automated design of all-digital I/O cells for 2.5D SiP (interposer) integration. Given an interposer technology, wire dimensions, and the wire-length distribution of the SiP design, our tool designs delay- or energy-minimized I/O cells and generates their layouts and timing/power libraries. The tool uses a circuit-package co-simulation environment that couples SPICE models of the I/O cells (drivers, receivers) with parametric models of the interposer wire traces. Our tool generates the cell library both as a soft macro (register transfer level, RTL) and as a hard macro (layout) in a target CMOS technology. We demonstrate auto-generation of delay/energy-optimized I/O cells in 28nm CMOS technology for different wire lengths and interposer parameters. With a case study on a SiP-based multi-core mesh NOC structure, we show that wire-distribution-dependent optimization of the I/O library cells can improve the delay/energy characteristics of die-to-die communication in a SiP, compared to conventional I/O cells designed for a target output impedance.



Fig. 1. (a) On-chip wires without any need for I/O circuits for SoC integration, (b) I/O circuits are required for SiP integration to drive long



Fig. 2. All-digital I/O and full-swing digital signaling.

#### II. CHIP-INTERPOSER CO-SIMULATION

We develop a chip-interposer co-simulation flow to accurately characterize delay and energy in the physical link (driver, wire, and receiver) of an interposer wire. Our design uses full-swing digital signaling and all-digital I/Os based on CMOS inverters as transceivers, as shown in Fig. 2. Receiver-side termination is eliminated for full-swing signaling. Interposer wires in a SiP show transmission-line behavior even at moderate frequencies (~1-2 GHz), so accurate modeling of the wire impedance is necessary for co-simulation. To obtain an accurate model of the interposer including micro-bumps, we use machine learning techniques as described in [5].

#### III. CELL LIBRARY GENERATION FLOW

Fig. 3 shows our proposed I/O cell library generation flow. The transceiver circuits are modeled as inverter chains. We define the sizes of the first and last inverters in the driver chain as 1 and D, respectively. Likewise, we define the sizes of the first and last inverters in the receiver chain as R and 1. The design of the I/O cell can then be defined as the design of the entire driver and receiver chain, i.e., selecting the final driver (D) and receiver (R) sizes, as well as the number of inverters in the driver (N<sub>driver</sub>) and receiver (N<sub>receiver</sub>) chains.

Fig. 3. Proposed I/O cell library generation flow.

For each driver (D) and receiver (R) pair, we find an optimal ratio (f) between successive stages of the driver and receiver inverter chains. Consider energy minimization as an example. For very large ratios (f), fewer stages are required to drive a fixed final stage, which reduces the switching power; however, the short-circuit power increases because of slow slew rates and dominates the total power. Conversely, for a large number of stages (smaller f), the total power is dominated by switching power. Therefore, there is an optimal number of stages for energy minimization with respect to the ratio f, and we obtain f=8 as the optimum ratio. On the other hand, for propagation delay minimization, the driver and receiver chains are sized based on the effective fanout ( $C_{drv}/C_{invx1}$ ), for which the optimum is found to be 4.
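The relation between the stage ratio f and the number of stages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function name `elaborate_chain` and the rule of rounding the stage count to the nearest integer are assumptions. Only the unit-sized first inverter, the final size D, and the ratios f=4 (delay) and f=8 (energy) come from the text.

```python
import math


def elaborate_chain(final_size: float, ratio: float) -> list[float]:
    """Return inverter sizes from a unit inverter (size 1) up to final_size.

    Assumption (not from the paper): the number of size steps is
    log(final_size) / log(ratio) rounded to the nearest integer (at least 1),
    and the sizes are then spaced geometrically so the chain ends at
    final_size, up to floating-point rounding.
    Signal polarity (odd vs. even stage count) is not handled here.
    """
    if final_size < 1 or ratio <= 1:
        raise ValueError("final_size must be >= 1 and ratio must be > 1")
    if final_size == 1:
        return [1.0]  # a single unit inverter
    steps = max(1, round(math.log(final_size) / math.log(ratio)))
    step_ratio = final_size ** (1.0 / steps)  # actual per-stage ratio
    return [step_ratio ** i for i in range(steps + 1)]
```

For a hypothetical final driver of size 64, `elaborate_chain(64, 4)` yields four stages (approximately 1, 4, 16, 64), while `elaborate_chain(64, 8)` yields three (approximately 1, 8, 64). The larger ratio uses fewer stages, at the cost of slower internal edges.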

The next step is to select the optimal driver/receiver pair for energy or delay minimization. Consider the example of delay minimization for a target wire length and interposer technology. The overall flow starts with a set of available driver and receiver sizes (i.e., a set of R and D values). For each pair in the set, we first elaborate the entire driver/receiver chain based on f=4. Next, for all the driver options, we perform co-simulation in which the wire model incorporates the interposer technology and wire length. We then select the subset of driver/receiver pairs whose output swing on the interposer wire is greater than 90% of the full swing and, from this subset, select the I/O cell with minimum delay. The same process can be performed for minimum energy by using f=8 for elaboration.
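The selection step described above can be sketched as a filter followed by a minimum search. This is an illustrative outline under stated assumptions, not the published implementation: `LinkResult`, `select_io_cell`, and the `simulate` callable are hypothetical names, and `simulate` stands in for the chip-interposer co-simulation, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class LinkResult:
    """Hypothetical container for one co-simulation result."""
    swing_fraction: float  # interposer output swing divided by full swing
    delay_ps: float        # propagation delay of the link
    energy_pj: float       # energy per bit


Pair = Tuple[int, int]  # (driver size D, receiver size R)


def select_io_cell(
    pairs: Iterable[Pair],
    simulate: Callable[[int, int], LinkResult],
    objective: str = "delay",
    min_swing: float = 0.90,
) -> Optional[Tuple[Pair, LinkResult]]:
    """Return the feasible (D, R) pair with minimum delay or energy.

    A pair is feasible when its swing exceeds min_swing (90% in the text).
    Returns None when no candidate meets the swing constraint.
    """
    if objective not in ("delay", "energy"):
        raise ValueError("objective must be 'delay' or 'energy'")
    feasible = []
    for d, r in pairs:
        result = simulate(d, r)  # placeholder for the SPICE/wire co-simulation
        if result.swing_fraction > min_swing:
            feasible.append(((d, r), result))
    if not feasible:
        return None
    if objective == "delay":
        return min(feasible, key=lambda item: item[1].delay_ps)
    return min(feasible, key=lambda item: item[1].energy_pj)
```

In this sketch the chain elaboration (f=4 for delay, f=8 for energy) is assumed to happen inside `simulate`, so the same loop serves both objectives.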

Once the driver/receiver chains are finalized, our flow generates the RTL for the selected driver and receiver. We automatically insert the modified RTL into a baseline template containing the rest of the functional logic of the I/O cell. Using a standard cell library, the RTL is synthesized, placed, and routed to generate the layout of the I/O cell. The final layout and extracted netlist can be passed to a cell library characterization tool, such as SiliconSmart, to generate the final timing and power library of the I/O cell.
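As a rough illustration of the RTL generation step, the sketch below emits a structural Verilog module for an inverter chain. Everything specific in it is an assumption made for the example: the function name `chain_to_verilog`, the cell naming scheme `INVX<size>`, the pin names `A` and `Y`, and the port names `din` and `dout` are hypothetical and would need to match the actual standard cell library and baseline template.

```python
def chain_to_verilog(module_name: str, sizes: list[int],
                     cell_prefix: str = "INVX") -> str:
    """Emit a structural Verilog module that chains one inverter per size.

    Hypothetical conventions: cells are named <cell_prefix><size> and have
    an input pin A and an output pin Y.
    """
    if not sizes:
        raise ValueError("sizes must contain at least one inverter")
    n = len(sizes)
    lines = [f"module {module_name} (input wire din, output wire dout);"]
    # One internal net between each pair of consecutive stages.
    for i in range(1, n):
        lines.append(f"  wire n{i};")
    for i, size in enumerate(sizes):
        src = "din" if i == 0 else f"n{i}"
        dst = "dout" if i == n - 1 else f"n{i + 1}"
        lines.append(f"  {cell_prefix}{size} u{i} (.A({src}), .Y({dst}));")
    lines.append("endmodule")
    return "\n".join(lines) + "\n"
```

Calling `chain_to_verilog("io_driver", [1, 4, 16, 64])` would produce a four-instance chain from `din` to `dout`, with the module name and the sizes chosen only for illustration.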

#### IV. EXPERIMENTAL RESULTS

We demonstrate the proposed flow for I/O cell generation in 28nm CMOS technology with a silicon interposer.

#### A. Cell Library for Different Interposer

We demonstrate the design flow for different interposer wire parasitics, i.e., wire dimensions and spacing/shielding between wires (Fig. 4). Case 1 represents a high-bandwidth SiP system with a fine wire pitch, which leads to more resistive wires and higher coupling capacitance, thereby limiting the maximum data rate. Case 2 represents a SiP system that can achieve higher data rates (larger width, spacing, and thickness) but has lower wiring density. Case 3 differs from the other two cases in bump size/pitch.



Fig. 4. (a) Transmission line [5], (b) C4 bump, and (c) physical dimensions of various package models.

 TABLE I

 I/O CELLS FOR DIFFERENT INTERPOSER (1MM WIRE)

|                         | Delay Minimization |         |         | Energy Minimization |        |        |
|-------------------------|--------------------|---------|---------|---------------------|--------|--------|
|                         | Case 1             | Case 2  | Case 3  | Case 1              | Case 2 | Case 3 |
| TX, RX sizes            | x72, x5            | x59, x4 | x80, x5 | x3, x3              | x3, x1 | x3, x1 |
| Propagation delay [ps]  | 45                 | 43      | 43      | 193                 | 164    | 193    |
| Energy per bit [pJ/bit] | 0.144              | 0.117   | 0.145   | 0.093               | 0.084  | 0.088  |

The delay- and energy-optimized I/O cells generated for these interposers with a 1mm wire are shown in Table I. As cases 1 and 3 are more resistive than case 2, they require larger driver/receiver sizes than case 2 for delay minimization. On the other hand, the driver sizes for minimum energy are the same in all cases, and the receiver sizes are nearly the same, because an x3 driver is the smallest size that satisfies the 90% voltage swing constraint for every interposer case. The delay of the energy-minimized I/O is considerably larger for cases 1 and 3 (193 ps) than for case 2 (164 ps).

#### B. Cell Library for Different Wire Lengths

It is essential to design I/O circuits optimized for different ranges of wire lengths in a SiP to achieve high data rates as well as to minimize energy consumption. Table II shows the driver/receiver sizes, delay, and energy for various wire lengths under delay or energy minimization, using the interposer technology of case 2 in Fig. 4. In general, the driver size increases with wire length for both energy and delay minimization. As expected, delay minimization requires larger driver/receiver sizes than energy minimization.

#### C. Case Study on an Illustrative SiP

We have applied the proposed flow to the illustrative SiP design shown in Fig. 5 (a, b). It consists of a CPU, GPU, baseband, and several other modules in a mesh structure. Fig. 5 (c) shows the wire-length distribution of the silicon interposer routing among all the modules. Traditionally, off-chip I/O cells are designed to match a target impedance (~50 $\Omega$ ) to minimize reflections in off-chip wires. Therefore, for comparison, we first designed I/O cells to match the target impedance. Table III (A) shows the worst-case delay and average energy of these I/O cells, referred to as the conventional I/O cells. Table III (B, C) summarizes the I/O cells created with the optimization methods discussed previously. In Table III (B), I/O cells are optimized (for delay or energy) individually for different wire lengths (referred to as 'Individually optimized I/O'). The worst-case delay and average energy (the total energy of all wires divided by the number of wires) are reported for analysis. Individually optimized I/Os for minimum delay show 13% lower worst-case delay and 33% lower average energy consumption than the conventional I/O. Likewise, individually optimized I/Os for minimum energy show 198% higher worst-case delay but 52% lower average energy consumption than the conventional I/O. Table III (C) shows the result when only one I/O cell is generated by the proposed flow, optimized for delay or energy at the maximum wire length, and used for wires of all lengths. We refer to this design as 'Optimized I/O for longest wire'. Using the optimized I/O for longest wire for minimum delay results in 7% lower worst-case delay and 36% lower average energy consumption than the conventional I/O cell. Likewise, when the I/O optimized to minimize energy for the longest wire is used, we observe 174% higher worst-case delay but 66% lower average energy.

In summary, we observe that I/O cells generated by the proposed flow can lower the worst-case delay as well as reduce the average energy dissipation compared to the conventional I/O.
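The percentages quoted in the case study can be cross-checked against the rounded entries of Table III with a short script. This is only a sanity check added for the reader: the helper names are hypothetical, and because the table values are rounded, the recomputed figures agree with the text to within about one percentage point rather than exactly.

```python
def percent_change(value: float, reference: float) -> float:
    """Relative change of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference


def average_energy(energies_pj) -> float:
    """Average energy: total energy of all wires divided by the wire count."""
    energies = list(energies_pj)
    if not energies:
        raise ValueError("at least one wire energy is required")
    return sum(energies) / len(energies)


# (worst delay [ps], average energy [pJ/bit]) copied from Table III.
CONVENTIONAL = (55, 0.187)
OPTIMIZED = {
    "individual, delay min.": (48, 0.125),
    "individual, energy min.": (164, 0.089),
    "longest wire, delay min.": (51, 0.119),
    "longest wire, energy min.": (151, 0.063),
}


def compare_to_conventional() -> dict:
    """Map each design to (delay change %, energy change %) vs. conventional."""
    return {
        name: (percent_change(delay, CONVENTIONAL[0]),
               percent_change(energy, CONVENTIONAL[1]))
        for name, (delay, energy) in OPTIMIZED.items()
    }


if __name__ == "__main__":
    for name, (d_delay, d_energy) in compare_to_conventional().items():
        print(f"{name}: delay {d_delay:+.1f}%, energy {d_energy:+.1f}%")
```

Negative values indicate a reduction relative to the conventional I/O, and positive values indicate an increase.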

TABLE II

I/O CELLS FOR DIFFERENT WIRE LENGTHS (INTERPOSER CASE2)

|                         | Delay Minimization |         |          | Energy Minimization |         |         |
|-------------------------|--------------------|---------|----------|---------------------|---------|---------|
|                         | 1mm                | 5mm     | 10mm     | 1mm                 | 5mm     | 10mm    |
| TX, RX sizes            | x59, x4            | x79, x5 | x151, x5 | x3, x1              | x12, x1 | x28, x3 |
| Propagation delay [ps]  | 43                 | 69      | 104      | 164                 | 192     | 162     |
| Energy per bit [pJ/bit] | 0.117              | 0.451   | 0.814    | 0.084               | 0.337   | 0.639   |

TABLE III

I/O CELLS FOR AN ILLUSTRATIVE SIP (INTERPOSER CASE2)

|                         | Conventional I/O (47Ω) (A) | Individually optimized I/O (B) |              | Optimized I/O for longest wire (C) |             |
|-------------------------|----------------------------|--------------------------------|--------------|------------------------------------|-------------|
|                         |                            | Delay Min.                     | Energy Min.  | Delay Min.                         | Energy Min. |
| TX, RX sizes            | x128, x4                   | x59-x64, x4                    | x3-x6, x1-x2 | x66, x4                            | x7, x3      |
| Worst delay [ps]        | 55                         | 48                             | 164          | 51                                 | 151         |
| Average Energy [pJ/bit] | 0.187                      | 0.125                          | 0.089        | 0.119                              | 0.063       |

#### D. Cell Library with ESD Protection

Transistor-based ESD protection shields the IC from sudden electrostatic discharge currents. The delay/energy-minimized I/O cells with and without ESD protection are shown in Table IV. As ESD protection increases the load capacitance, the I/O with ESD protection requires larger driver/receiver sizes for delay minimization. On the other hand, the driver/receiver sizes for minimum energy are the same, but the I/O with ESD protection consumes more energy.

#### V. CONCLUSIONS

This paper presents an automated flow for generating all-digital I/O library cells for large-scale 2.5D SiP integration. Given a 2.5D packaging (interposer) technology, our flow generates the I/O layout and timing/power library with the objective of minimizing delay or energy. Our flow includes chip-interposer co-simulation to account for the inductive properties of on-interposer wires while minimizing communication delay/energy, similar to buffer design/insertion for on-chip signaling. We demonstrate our flow for different wire lengths and package dimensions, and finally apply it to generate I/O cells for an illustrative SiP design of a multi-core processor. The proposed flow shows the feasibility of automatically generating I/O cells for different wires in a large-scale SiP design while minimizing delay/energy.

#### ACKNOWLEDGEMENT

This material is based on work supported by the DARPA CHIPS project (#N00014-17-1-2950).

#### REFERENCES

- [1] K. Saban, "Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity, Bandwidth, and Power Efficiency," Xilinx WP380.
- [2] L. England et al., "Advanced packaging saves the day! How TSV technology will enable continued scaling," IEDM, 2017.
- [3] I. Bolsens, "Pushing the boundaries of Moore's Law to transition from FPGA to All Programmable Platform," ISPD, 2017.
- [4] D. S. Green, "Common Heterogeneous Integration and Intellectual Property (IP) Reuse Strategies (CHIPS)," CHIPS Proposers Day, https://www.darpa.mil/attachments/CHIPSoverview%20Sept212016ProposerDay.pdf
- [5] H. M. Torun et al., "A Bayesian Framework for Optimizing Interconnects in High-Speed Channels," NEMO, 2018.

TABLE IV

I/O CELLS WITHOUT AND WITH ESD PROTECTION (INTERPOSER CASE2, 1MM)

|                         | Delay Minimization |         | Energy Minimization |        |
|-------------------------|--------------------|---------|---------------------|--------|
|                         | w/o ESD            | w/ ESD  | w/o ESD             | w/ ESD |
| TX, RX sizes            | x59, x4            | x68, x5 | x3, x1              | x3, x1 |
| Propagation delay [ps]  | 43                 | 44      | 164                 | 164    |
| Energy per bit [pJ/bit] | 0.117              | 0.125   | 0.084               | 0.087  |



Fig. 5. (a) Floor plan, (b) interposer routing layout, and (c) wire length distribution of a mesh NOC structure.