# Monolithic 3D Compute-in-Memory Accelerator with BEOL Transistor based Reconfigurable Interconnect

Yandong Luo<sup>1</sup>, Sourav Dutta<sup>2</sup>, Ankit Kaul<sup>1</sup>, Sung-kyu Lim<sup>1</sup>, Muhannad Bakir<sup>1</sup>, Suman Datta<sup>2</sup>, Shimeng Yu<sup>1</sup>

<sup>1</sup>Georgia Institute of Technology, Atlanta, GA, USA

<sup>2</sup>University of Notre Dame, Notre Dame, IN, USA

Email: shimeng.yu@ece.gatech.edu

Abstract: A compute-in-memory (CIM) inference engine based on FeFET is projected to exhibit excellent energy efficiency, but its area scaling is still limited by the availability of logic-voltage-compatible FeFETs at leading-edge nodes. This work performs system-technology co-design (STCO) of a monolithic 3D (M3D) CIM accelerator using back-end-of-line (BEOL) compatible oxide-channel MOSFET and FeFET, where W-doped In<sub>2</sub>O<sub>3</sub> (IWO) NMOS is used to design an area-efficient M3D write circuit and the IWO-based FeFET is adopted as the routing switch for reconfigurable interconnect. In a system-level evaluation, our M3D IWO FeFET design (using a hybrid 22nm/7nm M3D partition) shows 2.9× higher energy efficiency than a 7nm SRAM design with comparable chip area. The chip area can be further reduced by improving the electron mobility of the oxide channel.

#### I. INTRODUCTION

CIM is one of the promising paradigms for deep neural network (DNN) acceleration. A CIM array (Fig. 1(a)) consists of three parts: a memory array built from either 8T-SRAM or emerging non-volatile memories (eNVMs); a write circuit for weight programming, which requires I/O transistors to provide the high write voltage; and a mixed-signal peripheral circuit for partial-sum processing, including the analog-to-digital converter (ADC) and shift-add. Among eNVMs, the ferroelectric field-effect transistor (FeFET) stands out due to its high on-state resistance ( $R_{ON}$ >100k $\Omega$ ) and low write energy (<10fJ). According to DNN+NeuroSim [1], the energy per op of a 22nm FeFET CIM array with 2 bits per cell outperforms that of 8T-SRAM at the 7nm node (Fig. 1(b)). However, the area of the FeFET CIM array is 9× larger than that of 7nm SRAM, because FeFET is available today only at legacy nodes (e.g. 22nm). One solution is to harness M3D integration [2], where the memory cell and write circuit are fabricated on the top tier using design rules for a legacy node, while the mixed-signal and logic peripheral circuits are fabricated on the bottom tier at a leading-edge node. In such an M3D scheme, the chip area is still limited by the FeFET write circuit, as it is difficult to reduce the FeFET write voltage below 1V.

Besides the technological challenges mentioned above, an architectural limitation arises from the rigidity of today's CIM designs, which are usually customized for one specific DNN model. The hardware overhead to achieve reconfigurability between different DNN models is significant. For example, in an FPGA, the connection boxes and switch boxes consume up to 70%-80% of the total chip area [3], because a routing element based on an NMOS pass transistor plus an SRAM configuration bit cell is not area-efficient.

We address both technological and architectural challenges by using IWO NMOS [4] and IWO FeFET [5] to design an area-efficient M3D write circuit for the CIM array and a routing switch at BEOL for the reconfigurable CIM architecture, respectively.

#### II. IWO TRANSISTOR TECHNOLOGY

In the sequential M3D scheme, the processing temperature of the top-tier transistors should be kept below 400°C so as not to degrade the bottom silicon CMOS transistors. Technology options include silicon recrystallization by laser annealing [6] and amorphous oxide semiconductors, among which IWO is one of the promising channel materials [7]. In IWO, the tungsten acts as both electron donor and stabilizer by absorbing the oxygen vacancies. Fig. 2(a) shows the device structure of a double-gated IWO transistor on the top tier. The IWO channel is sandwiched between the top and back gate electrodes. Thanks to the large bandgap of IWO, the IWO transistor can endure high voltage (2.5V) even with a short gate length of 30nm~50nm. By integrating hafnium-zirconium oxide (HZO) into the gate stack, an IWO FeFET is obtained, which can serve concurrently as a memory device for storing the configuration bit and as a pass transistor. It should be noted that only NMOS is available for IWO. Hence, designing an M3D circuit requires transistor-level partitioning, with IWO transistors on the top tier and PMOS on the Si, interconnected by monolithic inter-tier vias (MIVs).

Fig. 2(b) shows the I-V curves of the IWO transistor, CMOS logic and the 2.5V I/O transistor. The performance of the IWO transistor is noticeably inferior to CMOS logic due to its relatively low electron mobility (~20cm<sup>2</sup>/V·s). Nevertheless, under high V<sub>gs</sub> (2.5V), the IWO transistor is competitive with the I/O transistor. Therefore, it can be regarded as a good replacement for the NMOS I/O transistor on the top tier. A comparison between the IWO transistor and the laser-recrystallized Si transistor is made in Table I. The IWO transistor shows lower leakage and can be processed at an even lower temperature. More importantly, it offers higher endurance to high-voltage operation, which benefits the write circuit design.

#### III. PERIPHERAL CIRCUIT DESIGN WITH IWO DEVICE

# A. Level Shifter and Output Driver Design

The high programming voltage (2~4V) of the FeFET is delivered by the write circuit, which consists of a level shifter and an output driver. The M3D write circuit is designed with IWO NMOS and Si PMOS, as shown in Fig. 3(a). The parasitic capacitance and resistance of the 3D interconnect are estimated to be 0.18fF and 91 $\Omega$ , respectively, assuming 6 metal layers at the bottom tier with M1~M4 pitch=40nm, M5~M6 pitch=64nm and an MIV diameter of 30nm [8].
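As a rough sanity check on these parasitics, the lumped RC time constant of the 3D interconnect can be estimated. The sketch below treats the interconnect as a single lumped RC, which is a simplifying assumption and not a model taken from this work; the function name is illustrative, and only the 91 Ω and 0.18 fF values come from the text.

```python
def rc_time_constant(resistance_ohm: float, capacitance_f: float) -> float:
    """Return the lumped RC time constant in seconds (single-pole assumption)."""
    return resistance_ohm * capacitance_f


R_3D_OHM = 91.0      # parasitic resistance of the 3D interconnect (from the text)
C_3D_F = 0.18e-15    # parasitic capacitance of the 3D interconnect (from the text)

tau_s = rc_time_constant(R_3D_OHM, C_3D_F)
tau_fs = tau_s * 1e15  # convert seconds to femtoseconds
print(f"tau = {tau_fs:.1f} fs")
```

Under this assumption the time constant is on the order of tens of femtoseconds, far below nanosecond-scale circuit delays.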

The timing diagram of the write circuit is shown in Fig. 3(b). Two M3D designs are considered: one with the same transistor size as CMOS (IWO-same) and a scaled-up design (IWO-sized\_up). VDDL and VDDH are assumed to be 1V (the I/O voltage at the 7nm node) and 2.5V, respectively. For IWO-same, the delay of the input stage of the driver is increased by about 0.52ns compared to the CMOS baseline, which is attributed to the degraded  $I_{ON}$  of the IWO transistor when operating at VDDL. When the IWO transistor width of the input stage is sized up to 720nm (in IWO-sized\_up), almost the same performance as the CMOS design is achieved.

From the layout in Fig. 3(c), it is observed that the width of the IWO-based level shifter is reduced by 41% compared with the CMOS design due to the reduced gate length (50nm vs. 270nm). The height of the IWO-based level shifter is slightly increased by 0.27 $\mu$ m due to the increased transistor width and the additional gate contact for the double-gate device structure. In the M3D write circuit, the layout is partitioned at half-height and then folded. MIVs are inserted at the cross-section, which induces a height overhead of half a metal pitch; the area of the M3D design is therefore about 51.2% of the 2D unfolded layout.
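The area figures for the level shifter can be cross-checked with a short calculation. The sketch below assumes that the overall area ratio is simply the product of the width ratio, the unfolded height ratio and the folding factor; this decomposition is an assumption made for illustration. The 41% width reduction and the 51.2% folding factor come from the text, and the 34% overall ratio is the value quoted in the Fig. 3 caption. The unfolded height ratio is not stated explicitly, so it is back-solved here rather than taken from the source.

```python
def folded_area_ratio(width_ratio: float, height_ratio: float,
                      fold_factor: float) -> float:
    """Area of the folded M3D layout relative to the 2D CMOS layout.

    Assumes the three contributions multiply (illustrative simplification).
    """
    return width_ratio * height_ratio * fold_factor


WIDTH_RATIO = 1.0 - 0.41      # width reduced by 41% (from the text)
FOLD_FACTOR = 0.512           # folded area vs. unfolded layout (from the text)
REPORTED_AREA_RATIO = 0.34    # M3D vs. 2D CMOS area (from the Fig. 3 caption)

# Back-solve the unfolded height ratio implied by the reported numbers.
implied_height_ratio = REPORTED_AREA_RATIO / (WIDTH_RATIO * FOLD_FACTOR)
print(f"implied unfolded height ratio: {implied_height_ratio:.3f}")
```

Under this multiplicative assumption, the reported numbers are mutually consistent with an unfolded layout that is moderately taller than the CMOS one, in line with the height increase described above.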

# B. Routing Switch Design with BEOL FeFET

The schematic of a 6T-routing switch is shown in Fig. 4(a). The typical NMOS+SRAM routing element is replaced by an IWO FeFET on the top tier, and the peripheral circuits are fabricated at the bottom tier. The circuit schematic of an input-output pair is shown at the bottom of Fig. 4(a). A minimum-sized PMOS level restorer is used to restore the signal, and the output drivers then deliver the signal across the wires to the next routing switch. The widths of the FeFET and the PMOS level restorer must be chosen carefully so that the PMOS is not so strong that it prevents the FeFET output from being pulled down. Fig. 4(b) shows the timing diagram of the IWO FeFET (22nm) and CMOS (7nm) routing switches. The periphery of the IWO FeFET switch is assumed to be at 7nm on the bottom tier. The width of the IWO FeFET is 400nm so that its output can be pulled down. The output delay of the IWO FeFET switch is increased by only 0.1ns compared to CMOS. To further scale down the FeFET width, the electron mobility in the oxide channel should be increased (Fig. 4(c)). Note that the routing switch requires a low  $R_{ON}$ ( $<10k\Omega$ ), in contrast to the weight memory cell, which favors a high  $R_{ON}$ . To use a W=100nm BEOL FeFET, a mobility increase of 4~5× is needed. The layouts of the 7nm CMOS 6T-routing switch and the 22nm BEOL FeFET routing switch with W=100nm are shown on the right of Fig. 5. Placing the routing switch at BEOL frees bottom-tier silicon area for other circuits. The unit area for each circuit module is summarized in Table II.
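The link between FeFET width and required mobility can be illustrated with a first-order estimate. The sketch below assumes that on-state resistance scales as 1/(mobility × width) at fixed gate length and bias, which is a textbook simplification and not the simulation model used in this work; the function name is hypothetical. Only the 400nm and 100nm widths come from the text.

```python
def required_mobility_gain(reference_width_nm: float,
                           target_width_nm: float) -> float:
    """Mobility multiplier needed to keep R_ON constant when the width shrinks.

    Assumes R_ON is proportional to 1 / (mobility * width), with gate length
    and bias unchanged (first-order assumption; contact resistance ignored).
    """
    if reference_width_nm <= 0 or target_width_nm <= 0:
        raise ValueError("widths must be positive")
    return reference_width_nm / target_width_nm


# Widths from the text: 400nm with the reported device, 100nm as the target.
gain = required_mobility_gain(400.0, 100.0)
print(f"required mobility gain: {gain:.1f}x")
```

This first-order estimate gives 4×, the lower end of the 4~5× range quoted above; effects the sketch ignores, such as contact resistance, may account for the remainder.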

#### IV. RECONFIGURABLE CIM ARCHITECTURE DESIGN

A design principle similar to that of an FPGA is adopted for the reconfigurable CIM architecture (Fig. 5). It consists of processing elements (PEs), switch boxes (S) and connection boxes (C). An  $8 \times 8$  PE array with BEOL FeFET weight memory is assumed in this work. Each PE contains  $4 \times 2$  CIM arrays, each of size  $144 \times 128$ . The input is delivered to the read word line (RWL) by the input driver. The partial-sum current is sensed by the peripheral ADC. The write operation is conducted through the write word line (WWL) and write bit line (WBL), which are driven by the WL and BL level shifters with M3D design, respectively. Each PE also contains an input buffer, accumulation units, and special function units for batch norm (BN), ReLU and pooling. All those modules are placed at the bottom tier in the 7nm process. The 256-bit bandwidth routing switch is placed at the top tier using IWO FeFET. Its level restorers, output drivers and programming circuit are placed at the bottom tier in the 7nm process. It is assumed that CMOS circuits can be fabricated underneath the routing switch.
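The array organization described above can be captured in a small configuration object to tally raw storage. This is a minimal sketch: the class and field names are hypothetical, and the 2-bit-per-cell value is the FeFET cell precision assumed elsewhere in this work. It counts raw cells and bits only; weight precision and mapping overhead are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CimFabric:
    """Hypothetical container for the fabric dimensions given in the text."""
    pe_rows: int = 8
    pe_cols: int = 8
    arrays_per_pe: int = 4 * 2   # 4 x 2 CIM arrays in each PE
    array_rows: int = 144
    array_cols: int = 128
    bits_per_cell: int = 2       # FeFET cell precision assumed in this work

    @property
    def cells_per_pe(self) -> int:
        return self.arrays_per_pe * self.array_rows * self.array_cols

    @property
    def total_cells(self) -> int:
        return self.pe_rows * self.pe_cols * self.cells_per_pe

    @property
    def total_bits(self) -> int:
        return self.total_cells * self.bits_per_cell


fabric = CimFabric()
print(fabric.cells_per_pe, fabric.total_cells, fabric.total_bits)
```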

Fig. 6 shows the design automation flow that achieves compile-time reconfigurability for DNN mapping. The input is a dataflow graph (DFG) of the DNN model, which consists of operations including vector-matrix multiplication, BN, ReLU and pooling. Next, the DFG is converted to a hardware DFG with architecture-specific information such as the PE size. In the hardware DFG, large weight blocks are partitioned according to the PE size and split across multiple PEs. Additional edges are introduced for the partial-sum reduction across different PEs. For each layer, the BN and ReLU are mapped to the PE that produces the final output. In this design automation flow, simulated annealing is used to minimize the total communication distance between PEs. Finally, the routes are set up using the PathFinder algorithm. If a link is shared by multiple PEs, their transfers are scheduled to avoid conflicts.
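The placement step can be sketched as a generic simulated annealing loop over an 8×8 PE grid. The text only states that simulated annealing minimizes the total communication distance between PEs; the Manhattan-distance cost, the swap move, the geometric cooling schedule and all function and parameter names below are assumptions chosen for illustration, not the authors' implementation.

```python
import math
import random


def manhattan(a, b):
    """Manhattan distance between two (row, col) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_distance(placement, edges):
    """Sum of PE-to-PE distances over all hardware-DFG edges."""
    return sum(manhattan(placement[u], placement[v]) for u, v in edges)


def anneal_placement(num_nodes, edges, grid=8, steps=20000,
                     t_start=5.0, t_end=0.01, seed=0):
    """Place nodes 0..num_nodes-1 on a grid x grid PE array (assumed setup)."""
    rng = random.Random(seed)
    slots = [(r, c) for r in range(grid) for c in range(grid)]
    if not 1 <= num_nodes <= len(slots):
        raise ValueError("node count must fit on the PE grid")
    rng.shuffle(slots)  # slots[i] is the PE of node i for i < num_nodes
    cost = total_distance(slots, edges)
    best, best_cost = list(slots), cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(num_nodes)     # a placed node
        j = rng.randrange(len(slots))    # another node or an empty PE
        if i == j:
            continue
        slots[i], slots[j] = slots[j], slots[i]
        new_cost = total_distance(slots, edges)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(slots), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]  # reject: undo the swap
    return best[:num_nodes], best_cost


# Toy hardware DFG: a chain of six nodes, e.g. six consecutive layers.
chain_edges = [(k, k + 1) for k in range(5)]
placement, cost = anneal_placement(6, chain_edges)
print(placement, cost)
```

Swapping a node with either another node or an empty PE keeps every placement legal, so no repair step is needed. A real flow would weight each edge by its traffic volume and then pass the placement to the router.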

#### V. SYSTEM LEVEL PERFORMANCE EVALUATION

To evaluate the system-level performance, we first update the technology file of NeuroSim with the device parameters of the IWO devices. New circuit modules such as the M3D level shifter and the IWO FeFET-based routing switch are also included. The M3D temperature profile is estimated using the integrated compact thermal model [2], assuming air cooling. Next, a customized simulator is built using the reconfiguration information from the compiler and the hardware performance from NeuroSim. Three designs are evaluated in this work: 7nm 2D SRAM, 22nm 2D FeFET and the hybrid M3D design (22nm top/7nm bottom). The device parameters and architecture configurations are listed in Table III. The 7nm 2D SRAM design uses an SRAM-CIM array with a CMOS routing switch. The cell precision of the FeFET and the SRAM is assumed to be 2-bit and 1-bit, respectively. To demonstrate the reconfigurability, the performance of ResNet-20 (with 0.27M weights) and ResNet-32 (with 0.46M weights) on the same hardware is evaluated for the CIFAR-10 dataset.

Fig. 7 shows that compared to 7nm SRAM design, hybrid M3D design with BEOL FeFET achieves  $2.9 \times$  higher energy efficiency due to a lower ADC energy consumption. With the reported IWO FeFET [9], the area is 11% higher than 7nm SRAM due to the sized-up routing switch at the top tier. With projected  $5 \times$  improved mobility, the area becomes 18% smaller than 7nm SRAM. Further device engineering is needed to enhance the transport in the oxide channel as well as to reduce the interface contact resistance for BEOL transistor.

### ACKNOWLEDGMENT

This work is supported by ASCENT, one of the SRC/DARPA JUMP centers, and by IMPACT, one of the SRC nCORE centers.

# References

[1] X. Peng et al., IEDM 2019.
[2] X. Peng et al., IEDM 2020.
[3] X. Chen et al., TCAS-I, 2019.
[4] H. Ye et al., IEDM 2020.
[5] S. Dutta et al., IEDM 2020.
[6] F.K. Hsueh et al., IEDM 2017.
[7] W. Chakraborty et al., VLSI 2020.
[8] Y.J. Lee et al., ICCAD 2012.
[9] S. Dutta et al., arXiv:2105.11078, 2021.




Fig. 3. [...] IWO NMOS on the top tier and PMOS on Si. Parasitic capacitance and resistance are introduced by the metal stacks and MIV. (b) The timing diagram during the level shifter operation. For IWO transistor, two schemes are considered: one with the same size as Si NMOS transistor (IWO-same) and the other with increased width to match the delay of CMOS design (IWO-sized\_up). (c) The layout of the unfolded M3D and 2D CMOS level shifter design with 2.5V I/O transistor design rule. The M3D design will be folded at the half-height of 2D design by inserting MIVs, which introduces height overhead. The area of the M3D design is 34% of 2D CMOS design, as listed in Table II.

Fig. 4. [...] circuit design for an input-output pair. (b) The timing diagram of the routing switch. The impact of R, C from M3D interconnect is negligible. The width ratio of the PMOS and pass gate should be properly chosen for correct circuit operation. Therefore, 400nm width FeFET is chosen with reported IWO FeFET device parameters [9]. (c) The output delay vs. electron mobility improvement. To use W=100nm BEOL FeFET, the mobility should be improved by  $4\sim5$  times. For each mobility value, the BEOL FeFET width [...]

