# An Energy-Recovering Reconfigurable Series Resonant Clocking Scheme for Wide Frequency Operation

Ignatius Bezzam, *Member, IEEE*, Chakravarthy Mathiazhagan, Tezaswi Raja, *Member, IEEE*, and Shoba Krishnan, *Member, IEEE*

*Abstract*: On-chip low skew clock distribution driving large load capacitances can consume as much as 70% of the total dynamic power that is lost as heat, resulting in high cooling costs. To mitigate this, an energy recovering reconfigurable series resonance solution with all the critical support circuitry is described. This LC resonant clock driver on a 22 nm process node saves about 50% driver power (>40% overall) and has 50% less skew than non-resonant driver at 2 GHz, while operating down to 0.2 GHz for dynamic voltage and frequency scaling. Reconfiguring for pulse mode operation enables further power saving, using latches instead of flip-flop banks, for double data rate applications. Tradeoffs in timing performance versus power, based on theoretical analysis, are compared and verified, to enable synthesis of an optimal topology for a given application.

*Index Terms*: Clocks, dynamic voltage and frequency scaling, high speed integrated circuits, low-power design, resonant drivers, systems-on-chip, timing.

#### I. INTRODUCTION

Power dissipation considerations continue to dictate the use of multi-core architecture in processors and systems-on-chip (SoCs) in technologies beyond 45 nm [1], [2]. A full chip clock distribution network (CDN), meeting stringent timing requirements, can alone take 25% of total power in processors and sometimes as much as 70% in SoCs [3]. Transistor scaling using "More of Moore's law" reduces area and gives faster devices. However, constant voltage scaling can result in significantly higher power densities [4]. Due to the cooling costs needed to contain this, there has been an abrupt halt in the clock frequency increase even though the transistors themselves can switch much faster [5].
The energy used in each period to charge the clock grid node capacitance (C) can be recovered and reused with an integrated inductor (L) in parallel, forming a continuous parallel resonant (CPR) tank network [3]. The recovered energy would have been otherwise dissipated as heat. LC resonant circuit operation for reducing power densities in high speed clocking applications has been extensively reported [6]–[11]. Such recovery techniques are currently used in nanometer commercial processors for global clocking [1], [8]. Even in multi-core processors, total power consumption can be further reduced by using inductors. To reduce power further, modern high performance mobile designs are also using an increasing number of voltage domains and regional clock trees [12]. The use of the dynamic voltage and frequency scaling (DVFS) technique for switching power reduction requires the clock distribution scheme to track as well. It is beneficial to extend the resonant solutions from global to regional clocking [13]–[16]. However, the smaller capacitance values from local trees will dictate larger values of inductances for the same LC resonance frequency [17]. This paper examines solutions to various limitations in using resonant clock drivers.

Manuscript received November 22, 2014; revised February 22, 2015; accepted March 29, 2015. Date of current version June 24, 2015. This work was supported by the School of Engineering SCU. This paper was recommended by Associate Editor B.-D. (Brian) Liu.

- I. Bezzam and S. Krishnan are with Santa Clara University, Santa Clara, CA 94024 USA (e-mail: ibezzam@scu.edu; skrishnan@scu.edu).
- C. Mathiazhagan is with Indian Institute of Technology Madras, Chennai, India (e-mail: mathi@ee.iitm.ac.in).
- T. Raja is with NVIDIA Corporation, Santa Clara, CA 94024 USA (e-mail: traja@nvidia.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSI.2015.2423797
Resonant clock solutions extending the operating frequency range have been reported [1], [17]–[21]. While [1], [20] use multiple inductors to switch into the parallel resonance structure, [17]–[19], [21] use a series resonance topology. Pulsed mode series resonance described in [17], [19] uses special latches to achieve best savings of power and area. The series resonance driver scheme in [18] generates flat-top outputs. However, the control signals necessary to support its robust and low power operation may need special circuits which are not detailed. In this work, a reconfigurable generalized series resonance (GSR) scheme combining the above, with support circuitry, is proposed. This can be dynamically reconfigured into various series or parallel resonance modes of operation as optimal for the application.

For real life clocking applications in high speed computing and communication, timing closure is of utmost importance for functionality, performance, and yield [11], [12]. Lowering power at the expense of timing parameters like insertion delay variation, slew rates and skew may not be acceptable [3], [12]. This work arrives at closed-form design equations determining the power consumption improvements in clock drivers and analyzes the timing performance at the clock sink points. It also does a comparative analysis of generalized series resonant (GSR) with continuous parallel resonant (CPR) and split-driver non-resonant (NR) topologies [3], [22].

This paper is organized as follows. In Section II, theoretical power and delays of series resonant topologies that can be configured from GSR are analyzed. Section III describes the design of critical support circuit elements that can be used for all the configurations. In Section IV, tradeoffs of various configurations are tabulated. Section V validates the design with simulation results from a 22 nm process technology. Section VI concludes the paper.

## II. SERIES RESONANCE POWER AND TIMING

In this section, the basic series resonance operation is analyzed and then generalized as the reconfigurable GSR. The power and performance relations for driving large capacitive loads are derived in order to select the optimum configuration for the given application.

## A. Pulsed Series Resonance (PSR) Driver

A way to save the energy stored on the large load capacitance, using an inductor in series with switch $S_r$, is shown in Fig. 1(a). The preliminary implementation was presented in [17] and the theoretical analysis with performance trade-off equations is detailed here. Controlled by an input pulse stream signal $PLS\_CLK$, $S_r$ closes when the output needs to go low. The series inductor allows the energy stored on the load capacitor to be transferred to the $V_{DD}/2$ node and then recovered back immediately to make the output go high. This creates a pulse of resonance period $T_{RES}$.

Fig. 1. (a) Pulsed Series Resonance (PSR) where the inductor is periodically connected to the load capacitance with controlled input pulse width $T_{PW}$. Output has a pulse of width $T_{RES}$ driving a higher capacitive load at resonance. For an ideal inductor $(Q_L\gg 10)$, both input and output are from 0 to $V_{DD}$. (b) Series RLC model for analysis with bottom switch $S_r$ closed and top switch $S_u$ opened, during the time 0 to $T_{PW}$. (c) Output pulse with non-ideal inductor $(Q_L<10)$ when cycling through one clock period. Input pulse width $T_{PW}$ must be larger than damped oscillation cycle $T_R$. Voltage $V_C$ on the capacitor $(Q_C>30)$ does not swing rail-to-rail. Extra power is needed to restore $V_C$ to $V_{DD}$ rail. Panels: (a) Switching model. (b) Series resonant tank. (c) Output at capacitor for finite inductor $Q_L$.
Energy can be recycled with the series LC resonant tank $(T_{RES}=1/f_{RES}=2\pi\sqrt{L_SC_L})$ formed, shown in Fig. 1(b), when $S_r$ is closed [17], [19]. Thus, the pull-up switch does not need to charge the output to $V_{DD}$ all the way from 0 V. When input signal $PLS\_CLK$ is high, the resonant tank is formed and when low, the driver is in non-resonant mode. The input stream $PLS\_CLK$ is required to have a certain minimum width $(T_{PW})$, as shown in Fig. 1(a), to generate a resonant pulse stream at the output with minimum power [19]. Analysis of Fig. 1(b) is first done for a step input from the closing of the $S_r$ (NMOS) switch.

In Fig. 1(b), the total resistance is the series combination $R_T = (R_r + R_W + r_S)$. $R_r$ is the active resistance of switch $S_r$, $R_W$ is the wiring parasitic resistance and the inductor resistance is $r_S = 2\pi f L_S/Q_L$ [23], [24]. Here, $Q_L$ is the component quality factor of the inductor at frequency $f$. $R_W$ is modeled as the lumped equivalent of the distributed clock tree wire impedance. The overall tank quality factor $Q = \sqrt{L_S/C_L}/R_T$ is degraded from $Q_L$, as $R_T$ is larger than $r_S$. The output impedance of the $V_{DD}/2$ supply is also included in $r_S$ if significant. The parasitic equivalent series resistance (ESR) of the load capacitance is ignored in this comparative analysis, but can be factored in as the component quality factor $Q_C$. The voltage loop in Fig. 1(b) yields,

$$R_T i_L(t) + \int \frac{i_L(t)}{C_L} dt + L_S \frac{di_L(t)}{dt} = \frac{V_{DD}}{2}. \tag{1}$$

This leads to a second order differential equation for inductor current $i_L(t)$, with initial value $i_L(0) = 0$ and, from (1), initial slope $di_L/dt = V_{DD}/2L_S$ at $t = 0$, as

$$\frac{d^2 i_L}{dt^2} + \frac{R_T}{L_S} \frac{di_L}{dt} + \frac{i_L}{L_S C_L} = 0. \tag{2}$$

For the underdamped case having complex conjugate roots, the inductance needs to have a minimum value given by the condition $L_S > R_T^2 C_L/4$ [19].
Solving (2) gives the inductor current as

$$i_L(t) = \frac{V_{DD}}{2\sqrt{L_S/C_L}\sqrt{1 - \frac{1}{4Q^2}}}e^{-tR_T/2L_S}\sin(2\pi f_R t) \tag{3}$$

where the damped oscillation frequency $f_R$ is given by,

$$f_R = \frac{1}{2\pi} \sqrt{\frac{1}{L_S C_L} - \frac{R_T^2}{4L_S^2}} = f_{RES} \sqrt{1 - \frac{1}{4Q^2}} = \frac{1}{T_R}. \tag{4}$$

The current peaks are limited between $\approx \pm V_{DD}/(2\sqrt{L_S/C_L})$. With $T_{RES} < T_R < T_{PW} < T_{CLK}$, the capacitor output voltage can be derived by integrating the current in $C_L$ to give,

$$V_C(t) = \frac{V_{DD}}{2} + \frac{V_{DD}}{2} e^{-tR_T/2L_S} \cos(2\pi f_R t) - \frac{1}{2Q} \frac{V_{DD}}{2} e^{-tR_T/2L_S} \sin(2\pi f_R t). \tag{5}$$

For large values of tank Q (>10), the two frequencies $f_R$ and $f_{RES}$ can be taken as equal and the last term in (5) can be neglected. However, on chip Q values can be quite small (<6) and the second order effects need to be considered for accurate analysis with $f_R < f_{RES}$. To meet the underdamped condition needed for PSR operation, we need Q > 0.5.

Fig. 1(c) shows the detailed output amplitude levels. The energy recovery process is done through the inductor current in resonant mode. The output voltage rises high by itself till a certain voltage recovery point, without drawing current from the $V_{DD}$ power supply. But the highest voltage recovery point from freewheeling resonance oscillation is less than $V_{DD}$. The first maximum is at $t=T_R$. Substituting from the RLC series resonance Q expression $R_T/L_S=2\pi f/Q$, the first maximum value at $t=T_R$ from (5) can be approximated as,

$$V_{OH} = \frac{1}{2} V_{DD} (1 + e^{-\pi/Q}). \tag{6}$$

To reach 90% of $V_{DD}$, as normally required, a $Q \geq 14$ is needed. As this is generally too high to realize on chip, the output is pulled up to rail using the $S_u$ (PMOS) pull up switch, forcing the final $V_{OH}$ to $V_{DD}$. The capacitor output will ring with minimum value at $t=T_R/2$.
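The relations above can be checked numerically. The following is a minimal sketch, not the authors' tooling: it evaluates the tank $Q$, the damped frequency from (4), and the recovery point from (6), then inverts (6) to find the $Q$ needed for 90% recovery. The function names and the component values ($L_S$ = 1 nH, $C_L$ = 1 pF, $R_T$ = 10 Ω) are assumptions chosen only for illustration.

```python
import math

def tank_q(l_s, c_l, r_t):
    """Tank quality factor Q = sqrt(L_S / C_L) / R_T."""
    return math.sqrt(l_s / c_l) / r_t

def damped_frequency(l_s, c_l, r_t):
    """Damped oscillation frequency f_R from (4); requires Q > 0.5."""
    q = tank_q(l_s, c_l, r_t)
    if q <= 0.5:
        raise ValueError("not underdamped: need L_S > R_T^2 * C_L / 4")
    f_res = 1.0 / (2.0 * math.pi * math.sqrt(l_s * c_l))
    return f_res * math.sqrt(1.0 - 1.0 / (4.0 * q * q))

def v_oh_fraction(q):
    """First recovery maximum V_OH / V_DD from (6)."""
    return 0.5 * (1.0 + math.exp(-math.pi / q))

def q_for_v_oh(fraction):
    """Invert (6): tank Q needed for V_OH to reach `fraction` of V_DD."""
    if not 0.5 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0.5 and 1")
    return -math.pi / math.log(2.0 * fraction - 1.0)

# Hypothetical component values, for illustration only.
L_S, C_L, R_T = 1e-9, 1e-12, 10.0
q = tank_q(L_S, C_L, R_T)
print(f"Q = {q:.2f}, f_R = {damped_frequency(L_S, C_L, R_T) / 1e9:.2f} GHz")
print(f"V_OH / V_DD = {v_oh_fraction(q):.3f}")
print(f"Q needed for 90% recovery = {q_for_v_oh(0.9):.1f}")
```

With these assumed values the tank $Q$ is about 3.2, well below the roughly 14 needed for 90% recovery, which is why the pull-up switch $S_u$ is required.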
The minimum voltage logic low $V_{OL}$ can be calculated from (5) at $T_R/2$ as,

$$V_{OL} = \frac{1}{2} V_{DD} (1 - e^{-\pi/2Q}). \tag{7}$$

To reach the standard 10% of $V_{DD}$, we need a $Q \geq 7$, which is less difficult to achieve than the 14 needed for the $V_{OH}$. For Q < 4, lower $V_{OL}$ can be obtained by using a smaller inductor bias like $V_{DD}/4$. This will also change (1) and (5) giving a lower $V_{OH}$ than (6), but that is already taken care of by pull up switch $S_u$.

If the width of the input pulse $(T_{PW})$ is sufficient to allow the inductor current waveform to go through a complete oscillation of $T_R=1/f_R$, all possible energy can be recovered. The power needed to drive $C_L$ rail-to-rail is the well-known expression $C_L V_{DD}^2 f_{CLK}$ [3]. In the case of PSR, the power needed to pull the output only from $V_{OH}$ to full $V_{DD}$ can be obtained as $(V_{DD}-V_{OH})V_{DD}C_L f_{CLK}$ [25], yielding the relation,

$$\begin{aligned} P_{PSR} &= V_{DD} \left[ V_{DD} - \frac{1}{2} V_{DD} (1 + e^{-\pi/Q}) \right] C_L f_{CLK} \\ &= \frac{1}{2} (1 - e^{-\pi/Q}) C_L V_{DD}^2 f_{CLK}. \end{aligned} \tag{8}$$

This is valid for all frequencies where $f_{CLK} < f_R$. At $Q = \pi$, PSR takes about 1/3 the power of NR.

In Fig. 1(c), the propagation delay to $V_{DD}/2$ of the falling edge is less than a quarter cycle $(T_R/4)$, and taken as $T_{RES}/4$ $(T_{RES}=2\pi\sqrt{L_SC_L})$ for large Q. Combining with the underdamped condition for inductance as $L_S>R_T^2C_L/4$, the delay at minimum inductance for PSR can be approximated as,

$$t_{PD} \ge \left(2\pi\sqrt{\frac{R_T^2 C_L}{4}C_L}\right)/4 = \frac{1}{4}\pi R_T C_L \tag{9}$$

Inductance values larger than the minimum $(R_T^2C_L/4)$ will give larger delays but gain higher tank Q, resulting in less power consumption. For large Q values, the fall time from 90% to 10% points can be approximated from the second term in (5) as $T_{\rm fall} < 0.29 T_R$ [26].
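As a quick check of the power and delay relations in (8) and (9), the sketch below computes the PSR power as a fraction of the non-resonant power $C_L V_{DD}^2 f_{CLK}$, the minimum inductance for underdamped operation, and the corresponding delay bound. The function names and the example values ($R_T$ = 10 Ω, $C_L$ = 1 pF) are illustrative assumptions, not values from the paper.

```python
import math

def psr_power_fraction(q):
    """P_PSR normalized to the NR power C_L * V_DD^2 * f_CLK, from (8)."""
    return 0.5 * (1.0 - math.exp(-math.pi / q))

def min_inductance(r_t, c_l):
    """Underdamped boundary: L_S must exceed R_T^2 * C_L / 4."""
    return r_t ** 2 * c_l / 4.0

def psr_min_delay(r_t, c_l):
    """Delay bound at minimum inductance, pi * R_T * C_L / 4, from (9)."""
    return math.pi * r_t * c_l / 4.0

# Hypothetical values, for illustration only.
R_T, C_L = 10.0, 1e-12
print(f"P_PSR / P_NR at Q = pi: {psr_power_fraction(math.pi):.3f}")
print(f"Minimum L_S: {min_inductance(R_T, C_L) * 1e12:.1f} pH")
print(f"Delay bound: {psr_min_delay(R_T, C_L) * 1e12:.2f} ps")
```

At $Q = \pi$ the normalized power evaluates to roughly 0.32, matching the "about 1/3" figure in the text; sweeping `q` shows how the saving shrinks as the tank becomes more lossy.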
$T_{\rm rise}$ is larger than $T_{\rm fall}$ for lower Q (<10) as it includes the RC based pull up time in Fig. 1(c). Again, for high Q and minimum inductance, we get $T_{\rm fall} \geq 0.91 R_T C_L$. Increasing inductance improves power savings but decreases slew rates. Thus there is a tradeoff between delay and power savings within the scheme while choosing component values. Larger inductor size also implies larger metal area. As keeping $T_R$ small implies using smaller inductors, it is attractive to use PSR. The resonance time itself is relatively flexible to choose with $T_R < T_{CLK}$. This inequality requirement between $C_L$, $L_S$ and $T_{CLK}$ values provides an extra degree of freedom as the resonance frequency can be kept much higher than clock frequency $(f_R > f_{CLK})$. When operating with narrow output pulses, $T_R$ is a fraction of the minimum period $T_{CLK}$ across DVFS. This gives the wide frequency operation feature of PSR, down to the lowest clocking frequency. It is optimal to use PSR with level-sensitive latches that only depend on the controlled falling edge [17]. The pulse mode of operation can also save power downstream by replacing master-slave flip-flops with the level-sensitive latches that take lower power [17], [19].

#### B. Generalized Series Resonance (GSR) Drivers

Fig. 2 shows a series resonance scheme, termed as generalized series resonance (GSR), as it is a generalized form of PSR [17]. As the values of Q for on-chip metal inductors are usually very low (<4), PSR output is pulled to rail from (6) by using a separate switch $S_u$ in Fig. 1(a). GSR has an additional pull down switch $S_d$ to improve the $V_{OL}$ from (7), giving rail-to-rail operation. The output of PSR is a narrow pulse stream rather than the near 50% duty cycle of standard clocks. With the switch control timing shown in Fig. 3, this can be overcome and 50% duty cycle outputs obtained.

Fig. 2. (a) Generalized Series Resonance (GSR) with pull up and pull down switches for rail-to-rail operation. (b) An equivalent series resonant circuit model for GSR with $S_r$ closed, $S_u$ open and $S_d$ open. Panels: (a) Switching model. (b) Series resonant tank.

Fig. 3. The required timing diagram for generating rail-to-rail (0 to $V_{DD}$) clock output pulses shown is crucial for controlling the switching operation in GSR. The equal pulse widths of $V_{SR}$ generated from rising and falling edges of the clock input can be used to logically derive the switch control signals $\overline{V_{UP}}$ and $V_{DN}$ to generate ideal 50% duty cycle output clock at $V_C$. All voltage signals swing $0-V_{DD}$. The $i_L$ current peaks are $\approx \pm V_{DD}/(2\sqrt{L_S/C_L})$.

The active high control signal $V_{SR}$ is derived (shown later in Section III) from both edges of the incoming 50% duty cycle clock shown in Fig. 3. The switch $S_r$ in series with the inductor is closed twice in a cycle, first to store the discharging energy and later to recover it. The input pulse stream $V_{SR}$ controlling the switch $S_r$ needs to have a specific width $(T_R/2)$ for resonance. After the resonant recovery during the $V_{SR}$ pulse, the active low $\overline{V_{UP}}$ pulls up the output to $V_{DD}$. The active high $V_{DN}$ signal pulls down the low going output signal all the way to ground, after the $V_{SR}$ pulse. Transferring the energy between the inductor and load capacitor during the resonance periods effectively conserves switching energy.

Compared to PSR, the inductor in GSR is switched at twice the rate $(2f_{CLK})$ of the incoming clock, but is on for half the duration $(T_{PW} \approx T_R/2)$. The governing equations during $S_r$ closure are the same as (1) and (2) derived for PSR.
The inductor current is then given by (3) and the capacitor voltage by (5). However, the waveforms last only for half the cycle. The energy recovery process can be seen from the inductor current $i_L$ into the $V_C$ node, where the current during discharge is recovered back for charging. When the $V_{SR}$ pulse closes $S_r$ for half the resonance period, $V_C$ is discharged to the lowest point $V_{OL} = 0.5V_{DD}(1-e^{-\pi/2Q})$ as per (7). The $S_r$ switch is ideally opened when the current is zero and all the charge is stored on the $V_{DD}/2$ node. The $V_{DN}$ signal then closes, connecting output to ground and forcing the $V_{OL}$ to 0 V rail. When the $V_{SR}$ pulse comes next in the charging phase, it will follow (5) again with a half cycle time shift starting from 0 V. It will not reach the PSR maximum recovery point $V_{OH}$ but will be shifted down by $V_{OL}$. This will give the maximum resonance recovery point rising from ground as,

$$\begin{aligned} V_C(T_R) &= V_{OH} - V_{OL} \\ &= \frac{1}{2} V_{DD} (1 + e^{-\pi/Q}) - \frac{1}{2} V_{DD} (1 - e^{-\pi/2Q}) \\ &= \frac{1}{2} V_{DD} (e^{-\pi/Q} + e^{-\pi/2Q}). \end{aligned} \tag{10}$$

When the $\overline{V_{UP}}$ signal becomes active, it will pull up from the above $V_C(T_R)$ value to $V_{DD}$. From (10), it can be seen that the voltage recovery point is lower than in PSR (6), requiring more energy to replenish, for the rail-to-rail operation. The power needed in GSR to pull $V_C$ from the value in (10) to $V_{DD}$ at frequency $f_{CLK}$ can be derived similar to (8) as,

$$\begin{aligned} P_{GSR} &= [V_{DD} - V_C(T_R)] V_{DD} C_L f_{CLK} \\ &= \left(V_{DD} - \frac{1}{2} V_{DD} e^{-\pi/2Q} - \frac{1}{2} V_{DD} e^{-\pi/Q}\right) V_{DD} C_L f_{CLK} \\ &= \left(1 - \frac{1}{2} e^{-\pi/2Q} - \frac{1}{2} e^{-\pi/Q}\right) C_L V_{DD}^2 f_{CLK}. \end{aligned} \tag{11}$$

This is less than the $C_L V_{DD}^2 f_{CLK}$ power taken by NR and, for $Q=\pi$, nearly 50% savings are predicted.
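To make the comparison between (8) and (11) concrete, the sketch below tabulates the normalized driver power of PSR and GSR against tank $Q$. It is a minimal illustration under the paper's own formulas; the function names and the chosen $Q$ values are assumptions.

```python
import math

def gsr_recovery_fraction(q):
    """Resonant recovery point V_C(T_R) / V_DD from (10)."""
    return 0.5 * (math.exp(-math.pi / q) + math.exp(-math.pi / (2.0 * q)))

def gsr_power_fraction(q):
    """P_GSR normalized to the NR power C_L * V_DD^2 * f_CLK, from (11)."""
    return 1.0 - gsr_recovery_fraction(q)

def psr_power_fraction(q):
    """P_PSR normalized to the NR power, from (8), for comparison."""
    return 0.5 * (1.0 - math.exp(-math.pi / q))

# Illustrative sweep over low on-chip Q values (assumed, not measured).
for q in (2.0, math.pi, 6.0):
    print(f"Q = {q:.2f}: PSR {psr_power_fraction(q):.3f}, "
          f"GSR {gsr_power_fraction(q):.3f} of NR power")
```

At $Q = \pi$ the GSR fraction evaluates to about 0.51, consistent with the nearly 50% saving stated above, and it stays above the PSR fraction for every $Q$ because the recovery starts from ground rather than from $V_{OL}$.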
The GSR power saving relative to NR is independent of $f_{CLK}$ as long as it is sufficiently lower than the resonant frequency $f_R$. Thus, the power savings are valid over the DVFS clock frequency range. The tank Q for GSR can be maximized as the inductor is free to be placed after the lumped interconnect $R_W$ and closer to the load $C_L$. By thus connecting the inductor branch closer to the load, the series resonance total resistance can be reduced to as low as $R_T=(R_r+r_S)$. This may require multiple inductors in parallel in the clock tree branches [3]. This will prevent significant Q degradations, improving the energy savings further. The same assumption for the underdamped condition $L_S>R_T^2C_L/4$ is made [18], as in PSR, implying a minimum value of inductance and Q>0.5. Equation (9) gives GSR delay as well.

Fig. 4. GSR control signals of Fig. 3 are generated from a regular clock. Matched delays create pulse widths that are a replica of resonance time $T_R$. Pulse series resonance (PSR) with PMOS driver is used as a voltage doubler. GSR inductor control output $V_{SR}$ is at double the supply voltage to reduce switch on-resistance. Dashed lines indicate adjustable capacitance values.

#### III. CONTROL SIGNAL CIRCUIT DESIGN

This section details the important support circuits for realization of the complete GSR solution in practice and how they can be used in other configurations as well. Low power implementation of one or more of the following functions is needed for resonant and non-resonant operation,

- 1) Pulse generators with controlled width.
- 2) Multiple non-overlapping pulse streams.
- 3) Voltage doublers.
- 4) Extra supply voltage $V_{DD}/2$.

Fig. 4 shows how 1, 2, and 3 may be realized. An optimum delay of $0.5T_R$ is generated from the RLC network and inverter in the input stage of Fig. 4. The series inductor $(L_D)$ is a replica of $L_S$, and the matching Miller capacitance $C_{M1}$ tracks the load $C_L$ from Figs. 1 and 2.
The pulse width, $0.5T_R \geq \pi \sqrt{L_S C_L}$ in Fig. 3, is determined by $\pi \sqrt{L_D C_{M1}}$. The pulse width $T_{PW} = 2\pi \sqrt{L_{PW}(C_{Mr} + C_{M2})}$ is set slightly larger than $0.5T_R$ by sizing the inductor $L_{PW}$ accordingly. Here $C_{Mr}$ is the non-negligible gate capacitance of the switching transistor corresponding to $S_r$ in Fig. 2. Capacitor $C_{M2}$ is also matched to $C_L$ like $C_{M1}$, and can in fact absorb the line capacitance of multiple drivers, or $S_r$ switches of distributed inductors paralleled. This replica timing eliminates the need for synchronization with conventional DLL/PLL circuitry that may require more area and power. Repeated low going pulses are generated from both the edges of the input CLOCKin using an XNOR gate and the replica delayed signal. The XNOR output can be inverted to obtain the $V_{SR}$ signal that controls the GSR inductor switch. The other two signals $\overline{V_{UP}}$ and $V_{DN}$ are readily obtained through logical operations of CLOCKin and the XNOR output. Thanks to the Miller gain around the $C_{M1}$ buffer [27], it is not necessary to have the entire load capacitance duplicated for replica delay. This saves power in charging and discharging this capacitor as well. During run-time, to account for inductor and load capacitance variations, the variable resistor $R_{opt}$ can be tuned to adjust the RLC delay and change $T_R$ appropriately. A poly-resistive ladder network with switches can be used to digitally tune for lowest power at run-time. Capacitors $C_{M1}$ and $C_{M2}$ can be varied to match the loads used, during die to die calibrations, to give lowest power.

TABLE I. PERFORMANCE POWER AREA TRADEOFFS

The CMOS GSR circuit (d) is reconfigured as (a) NR, (b) CPR, or (c) PSR. Dotted line devices are turned off with appropriate setting of gate control signals. Capacitor ESR is ignored for all cases.

| Metric | Non Resonance (NR) | Cont. Parallel Resonance (CPR) | Pulsed Series Resonance (PSR) | Generalized Series Resonance (GSR) |
|---|---|---|---|---|
| Configuration | (a) $V_{SR}=0$, $V_{LB}=0$ | (b) $V_{LB}=V_{DD}/2$, $V_{UP}=V_{DD}$ | (c) $V_{UP}=V_{SR}$, $V_{LB}=0.25V_{DD}$ to $0.5V_{DD}$, $V_{DN}=0$ | (d) Full configuration, $V_{LB}=V_{DD}/2$ |
| Voltage on load capacitor $V_C(t)$ | | | As in (5) | As in (5) |
| Resonance frequency | | $f_{RES}=1/2\pi\sqrt{L_PC_L}$ | $f_{RES}=1/2\pi\sqrt{L_SC_L}$ | $f_{RES}=1/2\pi\sqrt{L_SC_L}$ |
| Tank Q | | $Q_{CPR} = R_P/\sqrt{L_P/C_L} \approx Q_L = R_P/2\pi f L_P = 2\pi f L_S/r_S$, with $R_P = (Q_L^2+1)r_S$, $L_P = L_S(Q_L^2+1)/Q_L^2$ | $Q_{PSR}=2\pi f L_S/(R_r+R_W+r_S) < Q_L$ | $Q_{GSR}=2\pi f L_S/(R_r+r_S) < Q_L$ |
| Driver power $(P_D)$ | $C_L V_{DD}^2 f_{CLK}$ | | $0.5(1-e^{-\pi/Q})\,C_L V_{DD}^2 f_{CLK}$ | $(1-0.5e^{-\pi/2Q}-0.5e^{-\pi/Q})\,C_L V_{DD}^2 f_{CLK}$ |
| Driver delay | $0.69R_{NR}C_L$, $R_{NR}=(R_u+R_d)/2+R_W$ | $\pi\sqrt{L_PC_L}/2 \geq \pi R_PC_L$ | $\pi\sqrt{L_SC_L}/2 \geq \pi R_TC_L/4$, $R_T(PSR)=(R_r+R_W+r_S)$ | $\pi\sqrt{L_SC_L}/2 \geq \pi R_TC_L/4$, $R_T(GSR)=(R_r+r_S)$ |
| Rise/fall times | $2.2\times(R_{u/d}+R_W)C_L$ | $\approx 0.29T_{CLK}$ at $f_{CLK}=f_{RES}$ | $\approx 0.29T_R = 0.91R_TC_L$ | $\approx 0.29T_R = 0.91R_TC_L$ |
| Driver inductor sizing | None | $L_P = 1/4\pi^2 f_{CLK}^2 C_L$, $L_P < 4R_P^2C_L$ | $L_S = L_P(f_{CLK}/f_R)^2$, $L_P > L_S > R_T^2C_L/4$ | $L_S = L_P(f_{CLK}/f_R)^2$, $L_P > L_S > R_T^2C_L/4$ |
| Driver area and routing lengths | Active area proportional to $C_L$ | Active area < NR; large inductor metal area | Active area ≈ NR; inductor metal area < CPR | Active area ≈ 2×NR; inductor metal area < CPR |
| Predriver capacitor and inductor overhead | $\leq 0.5C_L$ [14]; inductor n/a | $< 0.05C_L$ [13], [32]; inductor n/a | $C_L$ and $L_S$, or $0.1\times C_L$ and $10\times L_S$ | $2C_L$ and $2L_S$, or $0.2\times C_L$ and $20\times L_S$ |
| Predriver power $(P_P)$ | $\leq 0.5C_L V_{DD}^2 f_{CLK}$ for $n$ stages, $n \geq 3$, for min. delay | $< 0.05C_L V_{DD}^2 f_{CLK}$ | $\approx 0.1\,C_L V_{DD}^2 f_{CLK}$, shared over $\geq 4$ drivers | $\approx 0.2\,C_L V_{DD}^2 f_{CLK}$, shared over $\geq 4$ drivers |
| Total power $(P_D+P_P)$ | $< 1.5C_L V_{DD}^2 f_{CLK}$ | | $P_D + 0.1\,C_L V_{DD}^2 f_{CLK}$ | $P_D + 0.2\,C_L V_{DD}^2 f_{CLK}$ for $Q > \pi$ |
| Predriver delay | $n \times 0.69R_{NR}C_L$ | $0.69R_{NR}C_L$ | $T_R + 0.69R_{NR}C_L$ | $T_R + 3\times 0.69R_{NR}C_L$ |
| Application | Low frequency testing | Fixed $f_{CLK}$ for global CDN | Pulse mode DDR latches | General purpose low power |
| Key feature | Smallest delays | Lowest power at high $f_{CLK}$ | Lowest power DVFS | Driving standard gates |

The resistances in Table I are ordered as $R_{NR} < R_T < R_P$, and the pull-up time constant satisfies $(R_u+R_W)C_L \ll T_R < T_{CLK}$.

In Fig. 2, the $S_r$ switch on-resistance $R_r$ in GSR, for the same device size as NR, will be higher due to the source bias voltage of $V_{LB} = V_{DD}/2$ in the NMOS. The drain source resistance is inversely proportional to the gate overdrive $(V_{GS}-V_t)$ in the quasi-static linear regime and given as $L/(\mu C_{ox}W(V_{GS}-V_t))$ [27], [28]. While $V_{GS}$ is the full gate voltage of $V_{DD}$ in the NR case, in GSR it is only half that, as the source is biased at $0.5V_{DD}$. Transistor width (W) can be increased to compensate for this, but will increase area and capacitance. The other alternative is to drive the gate with a higher voltage [19].
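The on-resistance expression can be evaluated directly. The sketch below is a hedged illustration that plugs in the 22 nm figures quoted in this section ($C_{ox}$ = 35 fF/µm², $\mu$ = 500 cm²/Vs, W/L = 600, overdrive of 1 V); the function name, the SI unit conversions, and the halved-overdrive case are assumptions added for illustration.

```python
def switch_on_resistance(w_over_l, mobility, c_ox, overdrive):
    """Linear-region on-resistance R = L / (mu * C_ox * W * (V_GS - V_t)).

    mobility in m^2/(V*s), c_ox in F/m^2, overdrive (V_GS - V_t) in volts.
    """
    return 1.0 / (mobility * c_ox * w_over_l * overdrive)

# Values quoted in the text, converted to SI units.
MOBILITY = 500e-4        # 500 cm^2/Vs expressed in m^2/Vs
C_OX = 35e-15 / 1e-12    # 35 fF/um^2 expressed in F/m^2
W_OVER_L = 600

r_full = switch_on_resistance(W_OVER_L, MOBILITY, C_OX, overdrive=1.0)
# Hypothetical case: overdrive halved, e.g. by a V_DD/2 source bias.
r_half = switch_on_resistance(W_OVER_L, MOBILITY, C_OX, overdrive=0.5)
print(f"R_r at 1.0 V overdrive: {r_full:.2f} ohm")
print(f"R_r at 0.5 V overdrive: {r_half:.2f} ohm")
```

Halving the overdrive doubles the resistance, which is the motivation for either widening the device or driving the gate from a doubled supply.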
To illustrate, for a 22 nm process with $C_{ox}=35$ fF/$\mu$m$^2$, $\mu=500$ cm$^2$/Vs and $(V_{GS}-V_t)>1$ V, a W/L of 600 is sufficient to give $R_r$ less than 1 $\Omega$ when driving 1 pF of load capacitance. The inductor switch ($M_r$ in Table I) gate capacitance $C_{Mr}$ is typically 10 fF. This is equivalent to a typical medium-sized standard cell inverter in DSM technologies, termed INV, having an NMOS W/L of 250 and a PMOS W/L of 350. The GSR doubler itself takes the equivalent of 5 INVs. The rest of the signal generation logic takes an equivalent of 16 INVs. The total driver gate switching power, even for the doubled voltage, is less than 6% of load power. A resonant technique [29] is used to drive the $V_{SR}$ line itself as shown in Fig. 4.

A low power voltage doubler scheme for $V_{SR}$ shown in Fig. 4 uses the pulsed resonance technique. The GSR inductor switch control output $(V_{SR})$ can swing at twice the supply voltage [30]. The circuit is actually a PMOS complement of the PSR driver discussed in Section II-A. When the PMOS switch is closed, the inductor series resonates with the capacitance $C_{M2}$ and the additional GSR driver gate capacitance $C_{Mr}$. The series inductor $(L_{PW})$ needs to be large enough to give the $0.5T_R$ timing needed at $V_{SR}$. For large load capacitances (>10 pF) the resonant inductance values are quite small (<0.1 nH), allowing the use of larger values of $L_{PW}$ to give lower area $C_{M2}$. Otherwise, each pF of load would require 100 INVs for $C_{M2}$. For load capacitors, a $Q_C$ > 30 is realizable at 5 GHz giving less than 1 $\Omega$ of series resistance per 1 pF.

With 1/3 power savings of PSR scheme and Miller gain >4, the equivalent capacitor overhead for $C_L V_{DD}^2 f_{CLK}$ power is less than $0.8C_L$ for the predriver. When shared by 4 or more drivers in the entire SoC, this overhead is less than $0.2C_L$. PSR takes roughly half the overhead of GSR. The signal generators of Fig. 4 can be shared among multiple GSRs with the same $T_R$ requirements to reduce power and area overhead. The use of additional inductors, even in the predrivers, further lowers the power needed for driving internal gate capacitances.

The bias voltages needed by CPR, PSR, and GSR are readily available in modern, multi-voltage domain SoCs, especially in mobile processors. The $V_{DD}/2$ bias line draws no effective power as more current is pushed into it than pulled out. The output impedance requirement of this, as a fraction of total resistance $R_T$, can be calculated so that Q is not degraded to adversely affect the condition for underdamped oscillation and energy savings. This source impedance is targeted to be less than 10% of the switch on-resistance $R_T$. It is also possible to use the CPR, PSR, and GSR drivers, replacing the inductor bias voltage $(V_{LB})$ at $V_{DD}/2$, with a large bias capacitor. Although these capacitors are very large $(8C_L \text{ to } 10C_L)$, this has become an acceptable tradeoff in low power processors [1]. GSR overhead of about $2C_L$ in Fig. 4 is thus feasible.

With CPR, it takes several cycles for the output clock to be stable after turn on, so clock gating is not possible with these schemes [9], [11]. PSR and GSR described here do not lose any cycles in settling to the final waveform and thus can be clock gated. Though three inductors are used in the predriver and driver together, the actual values of these inductors for large loads are small and the metal overhead is not a limiting factor in terms of area or routing blockage.

The PSR driver needs only a portion of the support circuits from GSR [17]. It is well suited to drive level sensitive latches. In the absence of the voltage doubler, inductor bias $V_{LB}$ as low as $V_{DD}/4$ may be used, to achieve lower $V_{OL}$ levels, when tank Q is very small (<3). The pulse widths are programmed to full $T_R$ rather than $0.5T_R$ in GSR.
The pulses are also available on both edges of the clock to support double data rate (DDR) by PSR. The PSR can create the controlled sharp falling edges needed to correctly trigger latches. The width $T_{PW}$ needs to be large enough to complete one cycle of LC resonance and meet the latch transparency window target [17].

#### IV. GENERALIZED SERIES RESONANCE CONFIGURATIONS

Table I shows a transistor level implementation of GSR topology and its reconfigurations for NR, CPR, or PSR operation. When all devices are used, GSR operation is enabled with correct control signals as in Fig. 3. The equations for CPR are derived similar to PSR and included here for comparison [14], [16]. Table I illustrates the following important pros and cons in choosing a scheme for a given application [17]:

- 1) Power and DVFS: While NR needs no inductors, the resonance schemes need a characterized inductor L that sets $f_{RES} = 1/2\pi\sqrt{LC_L}$. For CPR, $f_{RES} = f_{CLK}$, so different inductor $L_P$ values are needed to get minimum power at different clock rates. For a given $L_P$, the frequency range of power savings is only an octave or so. This is a severe limitation in DVFS systems that aggressively scale down frequencies and supply voltages to the minimum needed at run-time. With large variations in load capacitances over PVT corners, even the best choice of $L_P$ may not be optimal in actual operation without run-time tuning. Power savings in CPR over NR are not uniform, but frequency dependent, as shown in Table I. For GSR and PSR, the resonance time $T_{RES}$ need only be less than minimum $T_{CLK}$. This inequality requirement enables the DVFS support by PSR and GSR. It also has the benefit of providing an extra degree of freedom for handling variations in $C_L$ and $L_S$. The component $Q_L$ (before skin-effect [23]) is higher for PSR/GSR than for CPR, since resonance frequencies are higher.
- 2) Delays: NR gives the shortest insertion delay.
The propagation delay of CPR is larger than that of NR. This adversely affects skew due to the larger absolute variations and sensitivities. PSR and GSR resonate at much higher frequencies at the edges of the clock, rather than over the whole period as in CPR, giving less delay than CPR.
- 3) Rise/Fall Times: In resonant schemes, the rise/fall times depend on the resonance period $T_{RES}$ ($T_{\rm rise/fall}=0.29T_{RES}$). For CPR, this is nearly 30% of $T_{CLK}$, so the rise/fall times are long for lower frequencies, causing increased timing delays. This further leads to an increase in power of the receiving gates due to short circuit currents. In contrast, since $T_R$ in PSR and GSR is much smaller than the minimum $T_{CLK}$, the slew rates are fast, well controlled and fixed, resulting in low skew values.
- 4) Area of Driver: The inductor value for a given resonance frequency and capacitance is given by $L = 1/(4\pi^2 f_{RES}^2 C_L)$. For nominal load capacitance values of 1 pF, an $L_P$ of more than 25 nH is needed for CPR at a clock speed of 1 GHz. In GSR/PSR, giving some margin for pull up/down time, the oscillation time $(T_R = 1/f_R)$ is usually set at about one fifth of the nominal $T_{CLK}$, resulting in a resonance frequency $5\times$ the clock frequency [19]. The series inductor value is then smaller, given by $L_S = L_P (f_{CLK}/f_R)^2$. For the 1 pF load at 1 GHz clock rate, $T_R$ can be set to 0.2 ns using a 1 nH inductor, resulting in an $f_R$ of 5 GHz. Both PSR and GSR need less metal area for inductors in the driver compared to CPR. Inductor metal area for PSR and GSR can be on top of the driver active area and not encroach on other active areas. The inductor metal usage can sometimes affect critical performance due to routing blockages in the clock tree synthesis. PSR can also use bond wire inductors or off-chip inductors, especially for low frequency operation [19].
For comparison, NR needs 8 INVs to drive a load of 1 pF with optimal delays [31]; CPR takes fewer than 4 INVs; PSR takes 5 INVs and GSR 15.
- 5) Predriver Overhead: All the driver schemes shown need additional circuitry for input pulse stream generation. NR and GSR need non-overlapping pulses. CPR needs a minimum timing pulse width for a given driver size for proper operation [32]. Keeping the pulse widths minimum will minimize the static leakage in large driver devices. The predriver requirements are also important in determining total power and silicon area. When driving entire clock tree loads (>100 pF), the matched capacitors $C_{M1}$ and $C_{M2}$ in Fig. 4 can take excessive area. Making the inductors $L_D$ and $L_{PW}$ 10 times larger or more can scale the capacitance area down by $10\times$. The inductors' extra metal area is not considered, as they can be stacked on top of the active area of the predriver. The PSR predriver takes an equivalent of only 6 INVs compared to 16 for GSR. However, the NR driver does need predrivers (nearly 5 INVs) to reduce delays in driving the large gate capacitance of clock drivers, using tapered buffers [14], [15], [22], [25], [31]. In an NR H-tree clock distribution, the extra capacitance driven can be 50% of $C_L$ for optimal delays, leading to 50% more power [14]. CPR buffer sizes are small compared to other schemes [11], [13], [32].
- 6) Application: Power consumed in postprocessing of resonant clock waveforms may need to be considered for the given application. Due to the sinusoidal nature of the distributed clock signal in CPR, special flip-flops are often needed to capture data correctly [15], [26], [33]. The pulsed output of PSR can drive simpler latches, instead of full master-slave flip-flops, saving more power and even area [17].

As shown in Table I, in what is commonly termed PPA optimization, power, performance, and area are considered simultaneously.
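The sizing rules in items 3) and 4) can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes an ideal lumped LC tank, so it reproduces only the nominal figures quoted in the text (about 25 nH for CPR and about 1 nH for GSR/PSR with a 1 pF load at a 1 GHz clock).

```python
import math


def resonant_inductance(f_res_hz: float, c_load_f: float) -> float:
    """Inductance in henries that resonates with c_load_f at f_res_hz.

    Uses L = 1 / (4 * pi^2 * f_RES^2 * C_L) for an ideal LC tank.
    """
    return 1.0 / (4.0 * math.pi ** 2 * f_res_hz ** 2 * c_load_f)


def rise_fall_time(t_res_s: float) -> float:
    """Rise/fall time estimate quoted in the text: 0.29 * T_RES."""
    return 0.29 * t_res_s


C_LOAD = 1e-12        # 1 pF nominal load
F_CLK = 1e9           # 1 GHz clock
F_R = 5 * F_CLK       # series resonance set to 5x the clock (T_R = T_CLK / 5)

# CPR resonates at the clock frequency; GSR/PSR resonate at f_R.
l_parallel = resonant_inductance(F_CLK, C_LOAD)
l_series = resonant_inductance(F_R, C_LOAD)

# Edge rates: CPR follows the clock period, GSR/PSR follow the fixed T_R.
t_edge_cpr = rise_fall_time(1.0 / F_CLK)
t_edge_gsr = rise_fall_time(1.0 / F_R)

print(f"L_P (CPR)     = {l_parallel * 1e9:.2f} nH")
print(f"L_S (GSR/PSR) = {l_series * 1e9:.2f} nH")
print(f"CPR edge      = {t_edge_cpr * 1e12:.0f} ps")
print(f"GSR/PSR edge  = {t_edge_gsr * 1e12:.0f} ps")
```

Because the required inductance scales as $1/f_{RES}^2$, running the series tank at five times the clock rate reduces the inductance by $25\times$, which is the basis of the metal area advantage stated for PSR and GSR.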
An optimal configuration (indicated by bold text in the table) may be selected for best performance or lowest power, at the frequency of operation. As an example, for low frequency operation, NR may be used as timing is not critical. CPR needs minimum driver and buffer sizes and is ideal for single frequency operation, as in global clock distribution. For DVFS in regional clocks, PSR or GSR provides power savings at all clock rates. For DDR operation, PSR is the best, operating on both edges of the clock. PSR can have lower Q than GSR and give lower power savings. GSR, like NR, can drive standard gates without needing special buffers or latches and is thus preferred over PSR by current synthesis tools. GSR may also be used in the data path using dynamic logic for power savings [21], [34], [35].

#### V. RESULTS AND DISCUSSIONS

The series resonance concept, in the form of PSR, has been silicon proven, albeit at low frequencies [19]. The GSR solution currently does not have measured data, but the agreement between the theory presented here and the simulated benchmarks shown below validates its feasibility.

#### A. GSR Functionality and Performance

The functionality and robustness of the new GSR driver and predriver circuitry are verified by 22 nm SPICE simulations across 30% variation in LC component values and transistor model parameters [36], [37]. The results plotted in Fig. 5 show that, in spite of some outliers, the GSR output $V_C$ is still functional to drive a standard local buffer to generate the CLOCKout signal. This can drive flip-flops and other parts of the digital system. The pulse width of $V_{SR}$ varies to track the changes in the LC resonance time that come from variations in load capacitance. The pull up $\overline{V_{UP}}$ and pull down $V_{DN}$ signals are always non-overlapping. Operation at multiple voltages is shown in Fig. 6. Power drawn is plotted for driving a 20 pF load in the functional frequency range to show suitability for the DVFS technique. With the inductor connected close to the load as in Fig. 2, resistance $R_W$ does not degrade the tank Q (>3) and the output swings rail-to-rail. Lower supply voltages give lower maximum frequency but take less power at functional frequencies. The ability to scale voltage down to the minimum needed voltage and power at any given frequency enables DVFS. The quadratic relation of power to $V_{DD}$ explains the spacing between the curves in Fig. 6. The GSR power at 1 V and 1 GHz is only 50% of the NR power $C_L V_{DD}^2 f_{CLK}$ (20 mW), as per (11).

Fig. 5. Monte Carlo simulations (>100 runs) of GSR with predriver from Fig. 4 to check robustness over 30% variations in values of active devices and passive components. Temperature is swept from $-25\,^{\circ}\mathrm{C}$ to $125\,^{\circ}\mathrm{C}$. Signals correspond to Fig. 3 waveforms and an INV buffer giving CLOCKout.

Fig. 6. GSR voltage and frequency scaling operation for DVFS showing power for a 20 pF load ($L_s=50~\mathrm{pH}$ and $r_S<0.5~\Omega$ @ $f_R=5~\mathrm{GHz}$). Higher $V_{DD}$ supply gives large frequency sweep but takes more power. Power is saved by moving to an operating point of the lowest $V_{DD}$ for a given frequency. NR power at 1 V/1 GHz is about 20 mW for comparison.

#### B. Comparative Analysis

Power and timing performance for GSR are compared with NR and CPR. PSR results are similar to those of GSR and are not shown here, but are described in [17].

Fig. 7. Comparing CPR $(f_{RES}=1~{\rm GHz},L_P=1.25~{\rm nH}~{\rm and}~r_S<2.5~\Omega)$ output waveform $(V_C)$ for a 20 pF load with NR and GSR $(f_R=5~{\rm GHz},L_s=50~{\rm pH}~{\rm and}~r_S<0.5~\Omega)$. Propagation delays $(t_{PD})$ to mid-points at 50% marker are shown on the individual curves. The NR curve is the fastest with maximum pull up and down strengths. Using same device sizes, CPR launches a rising sinusoid, whose falling edge does not need a triggering input. Thus no $t_{PD}$ is shown for falling edge of CPR.
GSR has smaller delays than CPR.

1) Drivers: Fig. 7 shows detailed capacitor voltage waveforms as per (5) and corresponding equations for NR and CPR from Table I. The simulated delays and transition times of drivers for a 20 pF load and 3 $\Omega$ of series resistance $R_T$ (excluding interconnect $R_W$) are shown. The simulated delay values are within 10% of the theoretical calculations. The predriver delays are not factored in, for simplicity, as they do not affect slew rates appreciably. The delay expressions in Table I are based on simplified linear models, meant for comparative analysis. GSR includes additional RC delays from the pull up/down for the non-resonant portion of its rise/fall times, as seen in Fig. 7.

2) Clock Tree Sub-System: In order to verify the tradeoffs presented, the various clock drivers are tested under identical IC implementation parasitics from a symmetric H-tree benchmark [17], [38]. The resonance inductance values are derived from a standard metal spiral inductor [8] of 0.5 nH with $r_S < 10~\Omega$ and a $Q_L > 3$ at 5 GHz. The clock tree global interconnect is distributed on a metal layer with wires that typically have 0.1 $\Omega/\mu m$ resistance and 0.2 fF/$\mu m$ capacitance [38]. Clock distribution is done using 6 segments of 1.25 mm each with 8 wires in parallel to reduce the nominal interconnect resistance to less than 2 $\Omega$. A $\pm 30\%$ random variation in length is considered for determining the clock skew. By keeping the effective series resistance $R_T < 0.2~\Omega$, a tank Q > 1 is obtained, which is sufficient for successful GSR operation. The effect of finite component $Q_C$ (>30) of the load capacitance is also factored into the simulations, in terms of ESR. For a 1 V nominal operation, driving a distributed load totaling 160 pF, Fig. 8 compares NR, CPR, and GSR power consumptions calculated across frequencies using SPICE simulations. The predriver power is included in Fig.
8 in order to see a direct comparison between driver solution use cases. Multiple unit inductors of 0.5 nH are distributed in parallel along the tree to get the low 6 pH value required to resonate at 5 GHz. In Fig. 8, the GSR trend follows (11), and the NR and CPR trends track the theoretical equations for $P_D$ from Table I. NR takes the highest power ($P_D$), GSR less, and CPR the least. The global interconnect lines reduce the output swing at higher frequencies due to RC delays. This can result in lower power than calculated. NR predrivers can improve the attenuated swing and minimize delays using tapered buffers, but at the expense of 50% more power [14]. Table I shows GSR predriver power overhead $(P_P)$ of about $0.2C_LV_{DD}^2f_{CLK}$. The GSR driver takes about 50% of the NR driver power of $C_L V_{DD}^2 f_{CLK}$. At 2 GHz, as seen in Fig. 8, total GSR simulated power $(P_D + P_P)$ is about 57% of NR power, compared to 47% from Table I calculations. While the lumped model analysis is only accurate to 20%, it shows the comparative benefits of one topology over another. The actual power values from simulations are also different due to voltage dependent non-linear capacitances not accounted for in the theory. Short circuit currents in the NR predriver tapered buffers also cause deviation from the theory.

Fig. 8. Power consumption in a 160 pF load versus frequency for NR, GSR ($L_S=6$ pH and $r_S<0.1~\Omega$ @ 5 GHz) and CPR ($L_P=160$ pH and $r_S<0.3~\Omega$ @ $f_{RES}=1$ GHz) and for $V_{DD}=1$ V. Dotted lines show theoretical calculations. CPR is optimal at its resonance frequency $f_{RES}$ and is not operated below $0.8f_{RES}$. Inductor sizes are constant for CPR and GSR during the frequency sweep.

Fig. 9. Simulated skews in the 160 pF H-tree across operating frequencies and topologies are shown for 1 V operation. Skew is highest for CPR which has the largest power savings. NR has 10 ps more skew than GSR.

It can be seen from Fig. 7 and Fig.
8 that, as the propagation delays and rise/fall times get larger, less power is consumed by GSR and CPR, compared to NR, at higher frequencies. This is similar to the principle of adiabatic reversible logic, where slower transition times can give power savings [6], [7]. Receiving local buffers will have varying logic thresholds that will cause appreciable skew for large slew rates. These thresholds will also vary due to dynamic supply variations, causing jitter. For minimum skew, it is preferred to drive NR without distributed predrivers [12]. Similarly, GSR and CPR with all inductors at the source give minimum skew. However, due to Q degradation, this will consume more power than inductors distributed at sink points. Fig. 9 shows skews extracted from simulations over the DVFS frequency range. This is the true clock performance for a given power that needs to be considered. The GSR can give the lowest skew all the way to 2 GHz, using the well-controlled falling edge as trigger. CPR shows the highest skew and, like NR, cannot achieve functional swing at 2 GHz. With wider interconnects, target skew and functionality can be met in CPR, and NR as well, but at the expense of a significant increase in the load capacitance and power [3], [12]. This again illustrates the fundamental trade-off between energy and delay, as one has to be increased to decrease the other. GSR always needs to operate below the resonance frequency $f_R$. However, with run-time reconfiguration to CPR, using the same inductor, its operation can be extended to $f_R$.

#### VI. CONCLUSIONS

A standard DSM CMOS implementation of a reconfigurable on-chip LC resonant clock distribution solution is presented. This generalized series resonance (GSR) technique can achieve 50% driver power savings compared to non-resonant drivers, while reducing the skew by 50% (below 10 ps) to make it easier to achieve timing closure. GSR has the lowest skew and fast slew rates for a given power consumption.
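As a rough cross-check of the power figures quoted in Section V, the sketch below redoes the lumped-model arithmetic in Python. It assumes, as one reading of the text, that the NR reference includes the 50% predriver overhead cited from [14]; the constant and function names are hypothetical and are not taken from Table I.

```python
def switching_power(c_load_f: float, vdd_v: float, f_clk_hz: float) -> float:
    """Non-resonant dynamic power C_L * V_DD^2 * f_CLK, in watts."""
    return c_load_f * vdd_v ** 2 * f_clk_hz


# Power terms normalized to C_L * V_DD^2 * f_CLK.
NR_DRIVER = 1.0
NR_PREDRIVER = 0.5    # assumption: the "50% more power" tapered-buffer overhead
GSR_DRIVER = 0.5      # GSR driver takes about 50% of NR driver power
GSR_PREDRIVER = 0.2   # GSR predriver overhead P_P quoted from Table I

# 20 pF load at 1 V and 1 GHz, the operating point quoted for Fig. 6.
p_nr_20pf = switching_power(20e-12, 1.0, 1e9)
p_gsr_20pf = GSR_DRIVER * p_nr_20pf

# Driver-plus-predriver ratio of GSR to NR under the assumption above.
total_ratio = (GSR_DRIVER + GSR_PREDRIVER) / (NR_DRIVER + NR_PREDRIVER)

print(f"NR driver power  = {p_nr_20pf * 1e3:.0f} mW")
print(f"GSR driver power = {p_gsr_20pf * 1e3:.0f} mW")
print(f"GSR/NR total     = {total_ratio:.0%}")
```

Under this reading the lumped estimate lands at about 47%, matching the Table I figure quoted in Section V; the simulated 57% at 2 GHz sits above it, which Section V attributes to non-linear capacitances and short circuit currents.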
This series resonance scheme supports DVFS operation and has several advantages over parallel resonant drivers (CPR). The GSR driver gives rail-to-rail outputs that can directly interface to standard cell library flip-flops and logic, and also allows clock gating. It has digitally controlled pulse width tuning to account for inductor variations. GSR topology is flexible and can be easily reconfigured to give other schemes like PSR, CPR, and NR, allowing a minimum power solution for the application. All the important circuitry for realization of the drivers is described to enable the GSR driver's deployment. Design equations for delay and power based on theoretical analysis have been derived and tabulated. These are verified to be accurate with simulations on a 22 nm process node. The overheads in power consumption and delays, in implementing resonant and non-resonant schemes, are accounted for in the comparative analysis. The performance, power, and area (PPA) tradeoff for different schemes can be directly derived from Table I, to select the optimal solution for the given application. This work does not necessitate the use of high-Q custom inductors that need more active area or specialty processes.

This work advances the use of energy saving resonance in future SoCs and processors by providing a comprehensive trade-off analysis. Further work is now possible to develop automatic place and route (APR) solutions to automatically synthesize series resonance solutions, thus allowing their mainstream deployment. Future work will discuss optimal layout implementation of GSR with multiple inductors and distributed parasitics for power and delay optimizations in asymmetric trees.

#### REFERENCES

- [1] E. J. Fluhr, S. Baumgartner, D. Boerstler, J. F. Bulzacchelli, T. Diemoz, D. Dreps, G. English, J. Friedrich, A. Gattiker, T. Gloekler, C. Gonzalez, J. D. Hibbeler, K. A. Jenkins, Y. Kim, P. Muench, R. Nett, J. Paredes, J. Pille, D. Plass, P. Restle, R. Robertazzi, D. Shan, D.
Siljenberg, M. Sperling, K. Stawiasz, G. Still, Z. Toprak-Deniz, J. Warnock, G. Wiedemeier, and V. Zyuban, "The 12-core POWER8 processor with 7.6 Tb/s IO bandwidth, integrated voltage regulation, resonant clocking," *IEEE J. Solid-State Circuits*, vol. 50, pp. 10–23, 2015.
- [2] T. N. Theis and P. M. Solomon, "In quest of the next switch: Prospects for greatly reduced power dissipation in a successor to the silicon field-effect transistor," *Proc. IEEE*, vol. 98, no. 12, pp. 2005–2014, Dec. 2010.
- [3] X. Hu and M. Guthaus, "Distributed LC resonant clock grid synthesis," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 59, no. 11, pp. 2749–2760, Nov. 2012.
- [4] L. Chang, D. Frank, R. K. Montoye, S. J. Koester, B. L. Ji, P. W. Coteus, R. H. Dennard, and W. Haensch, "Practical strategies for power-efficient computing technologies," *Proc. IEEE*, vol. 98, pp. 215–236, Feb. 2010.
- [5] S.-Y. Wu, C. Y. Lin, S. H. Yang, J. J. Liaw, and J. Y. Cheng, "Advancing foundry technology with scaling and innovations," in *Proc. Int. Symp. VLSI-TSA*, 2014, pp. 1–3.
- [6] K. Suhwan, C. Ziesler, and M. Papaefthymiou, "Charge-recovery computing on silicon," *IEEE Trans. Comput.*, vol. 54, no. 6, pp. 651–659, Jun. 2005.
- [7] Y. Ye and K. Roy, "Energy recovery circuits using reversible and partially reversible logic," *IEEE Trans. Circuits Syst. I, Fundam. Theory Appl.*, vol. 43, no. 9, pp. 769–778, 1996.
- [8] V. Sathe, V. Arekapudi, C. Ouyang, M. Papaefthymiou, A. Ishii, and S. Naffziger, "Resonant-clock design for a power-efficient, high-volume x86-64 microprocessor," *IEEE J. Solid-State Circuits*, vol. 48, no. 1, pp. 140–149, Jan. 2013.
- [9] S. C. Chan, K. L. Shepard, and P. J. Restle, "Uniform-phase, uniform amplitude, resonant-load global clock distributions," *IEEE J. Solid-State Circuits*, vol. 40, no. 1, pp. 102–109, Jan. 2005.
- [10] J. Rosenfeld and E. Friedman, "Design methodology for global resonant H-tree clock distribution networks," *IEEE Trans.
Very Large Scale Integr. (VLSI) Syst.*, vol. 15, no. 2, pp. 135–148, Feb. 2007.
- [11] S. C. Chan, P. J. Restle, T. J. Bucelot, J. S. Liberty, S. Weitzel, J. M. Keaty, B. Flachs, R. Volant, P. Kapusta, and J. S. Zimmerman, "A resonant global clock distribution for the cell broadband engine processor," *IEEE J. Solid-State Circuits*, vol. 44, no. 1, pp. 64–72, Jan. 2009.
- [12] M. Guthaus, G. W. Silke, and R. Reis, "Revisiting automated physical synthesis of high-performance clock networks," *ACM Trans. Design Autom. Electron. Syst.*, vol. 18, no. 2, pp. 31:1–31:2, Mar. 2013, Art. 31.
- [13] A. Ishii, J. Kao, V. Sathe, and M. Papaefthymiou, "A resonant-clock 200 MHz ARM926EJ-S microcontroller," in *Proc. IEEE Eur. Solid-State Circuits Conf.*, Sep. 2009, pp. 356–359.
- [14] A. J. Drake, K. J. Nowka, T. Y. Nguyen, J. L. Burns, and R. B. Brown, "Resonant clocking using distributed parasitic capacitance," *IEEE J. Solid-State Circuits*, vol. 39, no. 9, pp. 1520–1528, Sep. 2004.
- [15] H. Mahmoodi, V. Tirumalashetty, M. Cooke, and K. Roy, "Ultra low-power clocking scheme using energy recovery and clock gating," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 17, no. 1, pp. 33–44, Jan. 2009.
- [16] V. Sathe, "Hybrid resonant-clocked digital design," Ph.D. dissertation, Dept. Elec. Eng. Comp. Sci., Univ. Michigan, Ann Arbor, MI, USA, May 2007.
- [17] I. Bezzam and S. Krishnan, "A pulsed resonance clocking for energy recovery," in *Proc. IEEE Int. Symp. Circuits Syst.*, Melbourne, Australia, 2014, pp. 2760–2763.
- [18] R. K. Jana, G. L. Snider, and D. Jena, "Energy-efficient clocking based on resonant switching for low-power computation," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 61, no. 5, pp. 1400–1408, May 2014.
- [19] H. Fuketa, M. Nomura, M. Takamiya, and T. Sakurai, "Intermittent resonant clocking enabling power reduction at any clock frequency for near/sub-threshold logic circuits," *IEEE J. Solid-State Circuits*, vol. 49, no. 2, pp. 536–544, Feb. 2014.
- [20] K. Ikeuchi, K. Sakaida, K. Ishida, T. Sakurai, and M. Takamiya, "Switched resonant clocking (SRC) scheme enabling dynamic frequency scaling and low-speed test," in *Proc. Custom Integr. Circuits Conf.*, 2009, pp. 33–36.
- [21] I. Bezzam, S. Krishnan, T. Raja, and C. Mathiazhagan, "Low power low voltage wide frequency resonant clock and data circuits for SoC power reductions," in *Proc. IEEE Latin Amer. Symp. Circuits Syst.*, Peru, Feb. 2013, pp. 1–4.
- [22] C. Yoo, "A CMOS buffer without short-circuit power consumption," *IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process.*, vol. 4, no. 9, pp. 935–937, Sep. 2000.
- [23] A. Zolfaghari, A. Chan, and B. Razavi, "Stacked inductors and transformers in CMOS technology," *IEEE J. Solid-State Circuits*, vol. 36, no. 4, pp. 620–628, Apr. 2001.
- [24] C. Yue and S. Wong, "On-chip spiral inductors with patterned ground shields for Si-based RF ICs," *IEEE J. Solid-State Circuits*, vol. 33, no. 5, pp. 743–752, May 1998.
- [25] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, *Digital Integrated Circuits: A Design Perspective*, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2003, pp. 349–361.
- [26] S. Esmaeili, A. Al-Khalili, and G. Cowan, "Dual-edge triggered sense amplifier flip-flop for resonant clock distribution," *IET Comput. Digit. Tech.*, vol. 4, no. 6, pp. 499–514, 2010.
- [27] B. Razavi, Chapter 2 in *Design of Analog CMOS Integrated Circuits*, 1st ed. New York: McGraw-Hill, 2000.
- [28] W. J. Dally and J. W. Poulton, *Digital Systems Engineering*, 1st ed. New York: Cambridge Univ. Press, 2008.
- [29] D. Campolo, M. Sitti, and R. S. Fearing, "Efficient charge recovery method for driving piezoelectric actuators with quasi-square waves," *IEEE Trans. Ultrason., Ferroelectr., Freq. Control*, vol. 50, no. 1, pp. 1–9, Jan. 2003.
- [30] T. Lee, "Passive RLC networks," in *The Design of CMOS RF Integrated Circuits*. New York: Springer.
- [31] J.
Rabaey, "Optimizing power@design time-circuit level techniques," in *Low Power Design Essentials*, 1st ed. New York: Springer, 2009, pp. 86–88.
- [32] S. E. Esmaeili, A. J. Al-Khalili, and G. E. R. Cowan, "Estimating required driver strength in the resonant clock generator," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 8, pp. 927–930, Aug. 2012.
- [33] S. E. Esmaeili, A. J. Al-Khalili, and G. E. R. Cowan, "Low-swing differential conditional capturing flip-flop for LC resonant clock distribution networks," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 8, pp. 1547–1551, Aug. 2012.
- [34] I. Bezzam, C. Mathiazhagan, and S. Krishnan, "Low power SoCs with resonant dynamic logic using inductors for energy recovery," in *Proc. IEEE/IFIP 20th Int. Conf. VLSI Syst.-on-Chip (VLSI-SoC)*, Santa Cruz, CA, USA, 2012, pp. 307–310.
- [35] R. K. Jana, G. L. Snider, and D. Jena, "Resonant clocking circuits for reversible computation," in *Proc. 12th IEEE Conf. Nanotechnol. (IEEE-NANO)*, Aug. 2012, vol. 1, no. 6, pp. 20–23.
- [36] Arizona State University Predictive technology models (PTM) [Online]. Available: http://ptm.asu.edu/
- [37] W. Zhao and Y. Cao, "New generation of predictive technology model for sub-45 nm early design exploration," *IEEE Trans. Electron Devices*, vol. 53, no. 11, pp. 2816–2823, Nov. 2006.
- [38] C. N. Sze, P. Restle, G.-J. Nam, and C. J. Alpert, "Clocking and the ISPD'09 clock synthesis contest," in *Proc. ISPD*, 2009, pp. 149–150.

**Ignatius Bezzam** (M'91) received the B.Tech. degree from Indian Institute of Technology Madras, Chennai, in 1983 and the M.S. degree in electrical engineering from San Jose State University, San Jose, CA, USA, in 1995. He is currently pursuing the Ph.D. degree in electrical engineering at Santa Clara University, Santa Clara, CA, USA. He holds several patents in AMS and PLL IC design and publications in ISSCC, ISCAS, LASCAS and ESSCIRC. He reviews papers for ISCAS and LASCAS.
He has had a successful career record at National Semiconductor, Maxim Integrated Products, Toshiba, Raytheon (Fairchild) and Arasan Chip Systems.

**Chakravarthy Mathiazhagan** received the B.Tech. degree from Indian Institute of Technology Madras, Chennai, in 1983 and the Ph.D. degree in physics from University of California, Irvine, CA, USA, in 1992. Since 1993, he has been with the faculty of the Department of Electrical Engineering, Indian Institute of Technology, Madras. His current research interests include high speed circuits and instrumentation.

**Tezaswi Raja** received his M.S. and Ph.D. degrees in computer engineering from Rutgers University, New Brunswick, NJ, USA, in 2002 and 2004, respectively. He is currently working as an Engineering Design Manager at NVIDIA Corporation, Santa Clara, CA, USA. He is also an adjunct faculty at Santa Clara University, Santa Clara, CA, USA, teaching graduate level courses in low power design. His research interests include low-power design, circuit testing, adaptive clocking techniques, power delivery, ultra-low voltage design and emerging nanotechnology. Dr. Raja has served on the Program committee of International VLSI Design Conference and has been an author and reviewer for *Journal of Electronic Testing: Theory and Applications* (JETTA), IEEE TRANSACTIONS ON VLSI SYSTEMS, Design Automation Conference (DAC), International Conference on Computer Aided Design (ICCAD) and others. He is the co-inventor on 7 issued patents and has 18 pending.

**Shoba Krishnan** (M'99) received her B.Tech. degree from Jawaharlal Nehru Technological University, India, in 1987 and M.S. and Ph.D. degrees from Michigan State University, East Lansing, MI, USA, in 1990 and 1993, respectively. She is an Associate Professor in the Department of Electrical Engineering at Santa Clara University, Santa Clara, CA, USA, and has been part of the Electrical Engineering faculty since 1999.
From 1995 to 1999, she was with the Mixed-Signal Design Group at LSI Logic Corporation, Milpitas, CA, USA, where she worked on high-speed data communication IC design and testing. Dr. Krishnan's expertise and research interests include analog and mixed-signal integrated circuit design and testing with special emphasis on power, clock and data I/O circuits. Her research group focuses on challenges in integration of these blocks into standard digital IC environment.