April 16, 2002: Intel's Prescott Prospects
(by Hans de Vries)
The Prescott is Intel's 90 nm version of the Pentium 4 architecture, with enhancements like simultaneous multithreading for the desktop, a large L2 cache (probably 1 Megabyte) and, according to a number of rumors, a 64 bit extension codenamed Yamhill. The advance program of the VLSI 2002 symposia, to be held in Honolulu this summer, shows about 18 presentations related to the Pentium 4 architecture. We've compiled a list of the relevant abstracts here. Even from the 75-word abstracts alone one can see that considerable advances have been made in a number of areas: very high speed L1 caches of 32 kbyte and 16 kbyte, larger than the current 8 kbyte and much faster as well, and a larger register file of 256 words of 32 bit, compared to 128 words now. There will also be presentations on a new Integer Execution Unit and a new Address Generator. Some of these units were already disclosed earlier: the 5 GHz ALU at the fall 2001 IDF and the register file at last year's VLSI symposia. The speed of the building blocks is impressive:
In a 130 nm process at a low 1.2 Volt:
- 4.0 GHz 32 bit Address Generator
- 5.0 GHz Integer Execution Core
- 4.5 GHz 32 kbyte Cache

In a 100 nm process:
- 6.0 GHz 16 kbyte Cache
These speeds are especially impressive if you compare them with the current Northwood in 130 nm with 60 nm gate lengths. This processor runs at 'only' 1.5 GHz at the relatively low voltage of 1.2 V (see the shmoo plot here). The Northwood Address Generators and Integer Execution Units run at double that speed (3 GHz) but are basically 16 bit, while the units presented here are supposed to be truly 32 bit. The register file was reported as running at 6 GHz last year (at what voltage?), and AnandTech has some pictures of a 10 GHz 32 bit ALU at a high 1.8 V here.
One can truly appreciate the speed of these caches when they are compared with the new "one step forward, two steps backward" JEDEC DDR II standard. The access frequency of the much smaller 2 kB...4 kB row buffers in these chips is only 83 MHz...100 MHz for DDRII-333 and DDRII-400 respectively: an astonishing 100 times slower than the larger caches presented here, once implemented in Prescott's 90 nm process and at a decent voltage. The fact that this standard is designed by a JEDEC committee seems to result in something that can be supported by all members, including the weakest of the weakest DRAM companies.
A 64 bit Yamhill implementation may cost less than 2% extra die space.
The Yamhill rumor
To start with the rumor first: the 64 bit Yamhill extension is supposed to be Intel's answer to AMD's Hammer family. One can imagine that the Pentium architects are thankful to AMD for extending the x86 architecture to 64 bit, an architecture that was supposed to reach its end-of-life stage with the introduction of Itanium and its descendants. Hammer, however, is up and running now, outperforming the fastest members of the Itanium family, at least in integer performance. Intel management has to make U-turns once in a while, as with its Rambus-only policy lately.
Basic 64 bit integer operations
A 64 bit extension by itself does not imply that the Integer Execution Unit and the Integer Register File have to be extended to 64 bit. A minimal implementation would simply use the 32 bit integer pipeline for 64 bit integer operations. The Floating Point/MMX/SSE pipelines are already 64 bit. No need for changes here.
The dual 'Rapid Execution' units and the 32 bit register file run at twice the core frequency and are together able to handle two 64 bit operations per cycle. (The Hammer is able to do three per cycle, but its 64 bit additions might have twice the latency.) The mechanisms to decode an operation into two sub-operations are already available in the pipeline: the 128 bit XMM/SSE operations, for example, are handled in two 64 bit pieces.
It would be advantageous if the basic functional timing of the Rapid Execution engines could remain the same. The current ones handle a 32 bit addition as two skewed 16 bit ones: the second addition starts half a cycle after the first, when the carry bit is available. The newer integer ALUs seem to be fully 32 bit, so the same trick may be used to handle a 64 bit addition as two skewed 32 bit ones. Hardware for a full 32 bit addition takes about 15-20% longer than that for a 16 bit addition, but it seems that Intel's circuit designers have closed this gap with novel design techniques like 'forward body biasing' et cetera.
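The skewed-addition trick is easy to sketch in software. The following is a behavioral model only, with illustrative names, not Intel's circuit: the low half produces a carry that feeds the high half, which starts "half a cycle" later.

```python
# Behavioral sketch of a 64 bit add performed as two skewed 32 bit
# additions, mirroring how the Rapid Execution Engine splits a 32 bit
# add into two 16 bit halves. Illustrative only.

MASK32 = (1 << 32) - 1

def skewed_add64(a: int, b: int) -> int:
    # First half-cycle: add the low 32 bits, producing a carry-out.
    low = (a & MASK32) + (b & MASK32)
    carry = low >> 32
    # Second half-cycle, once the carry is available:
    # add the high 32 bits plus the carry from the low half.
    high = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + carry
    return ((high & MASK32) << 32) | (low & MASK32)
```

The only cross-half dependency is the single carry bit, which is why the second addition can start just half a cycle behind the first.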
More 64 bit operations
The Rapid Execution Engine handles the 32 bit logic and additive operations, which are easy to extend to 64 bit. Other integer operations are more complicated. 32 bit integer multiplies are currently handled by the floating point multiplier. This unit can already handle 64 bit multiplications as a result of the 80 bit floating point format, which uses a 64 bit mantissa. The shifts and divides probably remain in the lower-priority legacy area. This hardware is designed without the extreme efforts in circuit design and layout used for the ultra high frequency integer ALUs. The integer divide typically uses only a fraction of the transistors used for the multiply.
A simple shift-and-subtract state machine generates two result bits per cycle; it has to be adapted for 64 bit operation. The remaining legacy integer operations, invented long ago, are probably left untouched and not extended to 64 bit.
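A divider that retires two quotient bits per cycle amounts to radix-4 restoring division. The toy model below is an assumption about how such a state machine behaves, not Intel's actual design; each loop iteration stands for one cycle.

```python
# Radix-4 restoring division: two quotient bits per iteration ("cycle").
# Illustrative sketch only; widths and digit selection are assumptions.

def shift_subtract_divide(dividend: int, divisor: int, width: int = 64):
    assert divisor > 0 and 0 <= dividend < (1 << width)
    remainder, quotient = 0, 0
    for i in range(width - 2, -2, -2):
        # shift two more dividend bits into the partial remainder
        remainder = (remainder << 2) | ((dividend >> i) & 0b11)
        # trial-subtract 3x, 2x, 1x the divisor to pick the 2-bit digit
        digit = 0
        for d in (3, 2, 1):
            if remainder >= d * divisor:
                digit = d
                break
        remainder -= digit * divisor
        quotient = (quotient << 2) | digit
    return quotient, remainder   # matches divmod(dividend, divisor)
```

Adapting it for 64 bit operation simply doubles the number of iterations, which is why a slow legacy divider is cheap to extend.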
64 bit general purpose registers
A minimal implementation would add eight 32 bit registers to the basic set of eight to extend them to 64 bit. This is in fact similar to adding the architectural (register) state for an extra thread. A processor supporting two 64 bit threads would need 32 general purpose registers of 32 bit. The integer register file is likely to support 256 words. The architectural state would need 32 of them, leaving 2 x 112 for the renamed registers, which means that each of the two threads can use almost all the resources of the pipeline when the other is waiting or restarting after a misprediction.
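The register-file budget above can be made explicit. The numbers below simply restate the text's assumptions about Prescott:

```python
# Back-of-the-envelope register-file budget (assumptions from the text):
REGFILE_WORDS = 256   # rumored Prescott integer register file size
THREADS       = 2
ARCH_REGS     = 16    # per thread: 8 legacy + 8 added 32 bit halves

arch_state  = THREADS * ARCH_REGS          # 32 words of architectural state
rename_pool = REGFILE_WORDS - arch_state   # 224 words left for renaming
per_thread  = rename_pool // THREADS       # 112 renamed registers each
```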
The first level caches.
The 2nd and 3rd level caches, as well as the memory interface, do not need any changes specific to Yamhill: the virtual address, which is expanded to anywhere between 32 and 64 bit (Hammer uses 48 bit), has already been translated to a physical address at that point, and physical addresses broke the 32 bit barrier quite a while ago. The Level 1 data cache, however, and to a lesser extent the Instruction Trace Cache, do need modifications for Yamhill. A lot of the extra die area would come from the enlarged TLBs (Translation Lookaside Buffers) that translate the higher address bits of the virtual address into a physical address. The current SMT capable Xeon L1 data cache has a combined TLB for both threads and individual instruction TLBs for each of the two threads. The current data cache address generators first calculate the lower 16 bits to index into the cache, followed half a cycle later by the higher 16 bits to access the TLB. A similar solution as in the Integer Execution Unit is needed here: the 32 lower bits are needed in the first half cycle, with the remaining higher bits coming in the next half cycle to address the TLB. The fact that the address generator now works on 32 bits might be an indication of Yamhill.
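The half-cycle address split can be modeled in a few lines. Page and line sizes below are illustrative assumptions, not disclosed Prescott parameters:

```python
# Toy model of the split address flow: the lower bits index the cache in
# the first half-cycle; the upper bits arrive half a cycle later for the
# TLB lookup. Field widths are assumptions (4 KB pages, 64-byte lines).

PAGE_BITS = 12
LINE_BITS = 6

def split_address(va: int, low_width: int = 32):
    low  = va & ((1 << low_width) - 1)   # first half-cycle: cache indexing
    high = va >> low_width               # next half-cycle: TLB access
    set_index = (low >> LINE_BITS) & ((1 << (PAGE_BITS - LINE_BITS)) - 1)
    virtual_page = va >> PAGE_BITS       # the part the TLB must translate
    return low, high, set_index, virtual_page
```

The point of the split is that everything needed to start the cache array access lives entirely in the low bits, so the lookup can begin before the full address exists.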
Improved Simultaneous Multithreading.
Doubling the L1 Data Cache Frequency
The most important contribution to improved Hyper-Threading comes from doubling the L1 data cache frequency. In the first SMT Xeon the two threads have to compete for a single read port that runs at half the speed of the rest of the circuits. The 32 kB cache abstract mentions dual ports, but I presume that dual means one read and one write port, just like in the current 8 kbyte L1 cache.
Doubling the Integer Register File
Expanding the register file from 128 to 256 words means that both threads have about enough renamed registers to cover all operations in flight in the entire pipeline. It is not so likely that the desktop version of Prescott would support more than two threads; the PC market is basically an upgrade market.
Steps that are too big first make people wait longer to buy until the systems become available, and then make them wait longer before buying their next system, because that next system needs to be so much better again. The server 'Xeon' version, however, might well support four threads, a feature which would be disabled on the desktop version. The increased SMT support provided by new building blocks like the ones to be presented at the VLSI 2002 symposia would make that a worthwhile step.
Further and Future Improvements
Having SMT firmly on the tracks opens a whole range of further improvements that become interesting.
An obvious next step would be to bring the rest of the pipelines into the same clock domain as the Integer Execution Units: the Floating Point units and the MMX/XMM/SSE units. Optimal support for four threads would also need a full speed pipeline from the Instruction Trace Cache to the execution units.
The consequence of doubling the number of pipeline stages in all these units is that the number of instructions in flight in these units also doubles, resulting in the need for more renamed registers, and thus a larger register file.
Splitting Cycles to stay on the Cutting Edge
SMT, once mastered, really is the Great Enabler. This is probably the reason why Intel bought Compaq's Alpha patents in a (non-exclusive) deal, maybe more for x86 than for the Itanium. The latter's big architectural register file gives it a disadvantage for SMT, where a copy of the entire register file is needed for each additional thread. SMT makes hyper-pipelining the name of the game. Intel may bring the whole pipeline to the double frequency in a number of steps. We would not be surprised if, after achieving this goal, the architects and circuit designers set their minds on trying to double the pipeline frequency again.
Multi Threading: A way to Speed Up Single Thread Applications
Another good reason for multithreading: if you want to have the fastest single thread processor, use multithreading tricks! A processor with optimal support for its threads, meaning that each of them can run close to maximal speed, can be used to implement a number of methods to overcome the bottlenecks that cannot be solved by hyper-pipelining alone:
(1) "Thread based Speculative Pre-Computation" for Memory Latency as a result of Cache Misses
Presented by John P. Shen, Director of Intel's Micro Architecture Lab at the MPF 2001:
A cache pre-fetching method with extra threads added to the original binary of programs. Able to
pre-fetch unpredictable memory access patterns (like pointer intensive code). The extra parallel
SP threads (Speculative Pre-computation) may look like stripped down versions of the original
binary code with only those instructions left that are needed for the memory address calculations.
These threads would progress faster then the original code and the loads would pre-fetch the
cache-lines from memory into the caches thereby reducing access latency time for the real program.
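The idea can be illustrated with a toy model. Everything here is an assumption for illustration: a dict stands in for the cache, `id(node)` for a memory address, and the helper thread for the stripped-down SP thread; real SP threads are generated from the program binary.

```python
# Toy model of thread-based speculative pre-computation: a helper thread
# reduced to just the pointer-chasing loads runs ahead of the real loop,
# warming a "cache" so the main loop finds its data ready. Illustrative.
import threading

class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def sp_thread(head, cache):
    # stripped-down copy: only the address chain and the loads remain
    node = head
    while node is not None:
        cache[id(node)] = node.value   # the pre-fetching load
        node = node.next

def main_loop(head, cache):
    total, node = 0, head
    while node is not None:
        total += cache.get(id(node), node.value)  # would now hit in L1/L2
        node = node.next
    return total

# build a 5-node list: 5 -> 4 -> 3 -> 2 -> 1
head = None
for v in range(1, 6):
    head = Node(v, head)

cache = {}
helper = threading.Thread(target=sp_thread, args=(head, cache))
helper.start()
helper.join()   # in hardware both run concurrently; joined here for determinism
result = main_loop(head, cache)
```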
(2) "Branch Threading" for Branch Miss Prediction.
If the branch prediction hardware concludes that a certain condition branch is very hard to predict then
it can decide to spawn a second thread. The original thread follows one path and the second thread
follows the alternative path. If the condition is finally known at the end of the pipeline then the wrong
path is discarded
(3) "Load Threading" for Data Load Miss Predictions.
Some architectures like the EV6 and the Pentium 4 make predictions if a data load from memory can
be scheduled before preceding stores before knowing any of the addresses. If the store overwrites
the load data later then the pipeline has to be re-started in much the same way like after a branch miss
prediction. Multi threading can allow both choices to be executed simultaneously until the right choice is
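The dependence check behind such a prediction is simple once both addresses are finally known. This minimal sketch (an illustration, not any processor's actual memory-order logic) shows when a speculatively hoisted load would have to be replayed:

```python
# A load hoisted above an older store must be replayed only if its bytes
# overlap the store's bytes, once both addresses are known. Illustrative.

def must_replay(load_addr, load_size, store_addr, store_size):
    return not (load_addr + load_size <= store_addr or
                store_addr + store_size <= load_addr)
```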
The latter two methods need a fine-grained form of multithreading that is probably not available in the Hyper-Threading Pentiums, at least not in the current implementation, which seems to need a pipeline flush while forking. The use of these kinds of tricks really requires a lot of multithreading capability in a processor, especially if more than one is used at the same time. Wide scale use of these tricks is still a bit beyond Prescott's capabilities.
4 GHz Pentium 4 or an 8 GHz Pentium 5 ?
Mega Hertz... what Mega Hertz ?
The 90 nm Prescott is expected to reach speeds of 4 GHz and beyond. The Integer Execution Units, however, run at 8 GHz, and so do the integer register file, the address generators and now, as we may presume, also the L1 data cache. So why call it a 4 GHz processor? Technically speaking it is not a 4 GHz processor but an 8 GHz processor...
A Chance for a Change
Such a sudden jump in gigahertz needs to be accompanied by a significant increase in performance to make it marketable to the average customer. The 50%...60% extra performance brought by improved Simultaneous Multithreading offers exactly that, as a one-time-only opportunity. If Intel ever wants to use the real frequency of the integer pipeline, then it has to make the transition with the introduction of the Prescott.
A name change to Pentium 5 would be appropriate to signal a major architecture change.
Marketing and Metaphors
To marketing falls the task of explaining the term simultaneous multithreading to the general public, most likely ending up with a number of metaphors that give people the illusion that they understand something while in fact totally confusing reality. We've heard a few nice ones from AMD when it had to explain that megahertz is not the same as performance, something like: "Animals with little legs having to run like crazy just to keep up with the larger (Athlon) species..." The classical combustion engine may help here: "A four cylinder engine with twice the RPM produces the same amount of horsepower as an eight cylinder does..." The extra complication is to explain how the second logical processor is a result of the much higher frequency. "The processor is so incredibly fast that it can work like two", something like the energetic modern women who have a job and take care of their children at the same time... I would not be surprised if, as a side effect of such a campaign, we see some psychiatric researchers proposing that raised brain wave frequencies can induce schizophrenia...
Well, so much for the hope that marketing can produce some decent consumer education...
An 8 GHz processor in a 90 nm process would be consistent with Intel's statements that its 70 nm processors will run at more than 10 GHz, predictions that were already made one and a half years ago.
Intel versus AMD.
A single thread 90 nm Prescott is likely to be on par with a 90 nm version of the Sledgehammer: a new larger L1 cache with double the access rate, a similarly sized L2 cache (1 Megabyte) and a similar memory bandwidth of 6.4 GB/s (Prescott: 800 MHz x 64 bit, Sledgehammer: 400 MHz x 128 bit).
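The quoted 6.4 GB/s figures follow directly from the bus parameters:

```python
# Memory bandwidth from bus clock x bus width (bits / 8 = bytes):
prescott_bw = 800e6 * 64 / 8      # 800 MHz x 64 bit  = 6.4e9 bytes/s
sledge_bw   = 400e6 * 128 / 8     # 400 MHz x 128 bit = 6.4e9 bytes/s
```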
Applications and compilers are becoming better optimized for the hyper-pipelined Pentium, and this without any of the new tricks we discussed above: speculative pre-computation, branch threading and load threading. The application of speculative pre-computation may give Sledgehammer a hard time.
The Die Size Advantage
Prescott may mark the end of an era in which AMD could erode Intel's market share as a result of the Athlon's much smaller die size compared to the Pentium 4 core. Fab capacity is the second hurdle (after processor performance) on the road to a bigger piece of the x86 processor market, a market worth a few dozen billion dollars. The good news for AMD is that its architects did a dual-processor-on-a-chip (CMP) version of the Sledgehammer. This Hammer version may well turn out to be AMD's mainstream processor at the 90 nm process node. Two Hammer cores together with 1 Megabyte of L2 cache would consume something like 95 square millimeters at the 90 nm node: smaller than AMD's current smallest processor, the Duron, at something like 106 mm2, but larger than the projections for Prescott, which are in the order of 80 mm2.
Multiplying Model Numbers
It would be justifiable for AMD to multiply its model numbers by two for the "dual-processor-on-a-chip" version of the Sledgehammer, to get into the 8000+ range, something that may well become the only option from a marketing viewpoint. The performance of a two-processor-on-a-chip Sledgehammer is likely to be higher again than that of a two-thread Prescott, making multiprocessing very important for AMD, even in the desktop segment. Microsoft's licensing model will play a crucial role here. The current Intel-brokered distinction between logical and physical processors does not benefit AMD: AMD users would have to pay significantly more for their version of Windows than Intel users. A more reasonable definition is needed to make a distinction between a desktop PC and the various forms of server PCs. A definition where a desktop has only one chip containing processors and a server has two or more such chips may be a solution.
Abstracts from the Intel presentations.
4GHz 130nm Address Generation Unit with 32-bit Sparse-tree Adder Core
Sanu Mathew, Mark Anders, Ram K. Krishnamurthy and Shekhar Borkar
Circuits Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA
This paper describes a 32-bit Address Generation Unit (AGU) designed for 4GHz operation in 1.2V, 130nm technology. The AGU utilizes a 152ps dual-Vt adder core to achieve 20% delay reduction, 80% lower interconnect density and a low (1%) active energy leakage component. The semi-dynamic implementation enables an average energy profile similar to static CMOS, with good sub-130nm...
Dual Supply Voltage Clocking for 5GHz 130nm Integer Execution Core
Ram K. Krishnamurthy, Steven Hsu, Mark Anders, Brad Bloechel, Bhaskar Chatterjee*, Manoj Sachdev*, Shekhar Borkar
Circuits Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA
This paper describes dual-Vcc clocking on a 1.2V, 5GHz integer execution core fabricated in 130nm CMOS, achieving up to 71% measured clock power reduction (including 15% active leakage). A write-port style pass-transistor latch and split-output level-converting local clock buffer are described for robust, DC power free low-Vcc...
4.5GHz 130nm 32KB L0 Cache with a Self Reverse Bias Scheme
Steven K. Hsu, Atila Alvandpour, Sanu Mathew, Shih-Lien Lu, Ram K. Krishnamurthy
Circuits Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA
This paper describes a 32KB dual-ported L0 cache for 4.5GHz operation in 1.2V, 130nm CMOS. The local bitline uses a Self Reverse Bias scheme to achieve 220mV access transistor underdrive without external bias voltage or gate-oxide overstress. 11% faster read delay and 104% higher DC robustness (including 7x measured active leakage reduction) are achieved over an optimized high-performance dual-Vt scheme.
A 3GHz, 130nm, Intel® Pentium® 4 Processor
Deleganes, Jonathon Douglas, Badari Kommandur, Marek Patyra
Intel Architecture Group, 2501 NW 229th Ave., MS RA2-401, Hillsboro, OR 97124, USA
The design of an IA32 processor fabricated on a state-of-the-art 130nm CMOS process with improved six layers of dual-damascene copper metallization is described. Engineering an IA32 processor for server, desktop and mobile platforms, particularly meeting diverse power and thermal constraints, poses numerous challenges. This presentation focuses on methods applied to achieve high frequency and low power on the same chip, particularly the use of a dual-Vt process, clock skew design and thermal management techniques.
Body Bias for Microprocessors in 130nm Technology Generation and Beyond
Ali Keshavarzi, Siva Narendra, Bradley Bloechel, Shekhar Borkar and Vivek De
Microprocessor Research, Intel Labs, Hillsboro, OR, USA
Testchip measurements show that forward body bias (FBB) can be used effectively to improve performance and reduce complexity of a 130nm dual-VT technology, reduce leakage power during burn-in and standby, improve circuit delay and robustness, and reduce active power. FBB allows the performance advantages of low temperature operation to be realized fully without requiring transistor redesign, and also improves VT variations, mismatch, and gm x ro...
A 6GHz, 16Kbytes L1 Cache in a 100nm Dual-VT Technology Using a Bitline Leakage Reduction (BLR) Technique
Yibin Ye, Muhammad Khellah, Dinesh Somasekhar, Ali Farhang and Vivek De
Microprocessor Research, Intel Labs, Hillsboro, OR, USA
An L1 cache testchip with a dual-VT cell and a bitline leakage reduction (BLR) technique has been implemented in a 100nm dual-VT technology. The area of a 2KByte array is 263µm x 204µm, which is virtually the same as the best conventional design with a high-VT cell. BLR eliminates the impact of bitline leakage on performance and noise margin with minimal area overhead. Bitline delay improves by 23%, thus enabling 6GHz operation. Energy consumption per cycle is 15% higher.
Leakage-Tolerant Dynamic Register File Using Leakage Bypass with Stack Forcing (LBSF) and Source Follower NMOS (SFN) Techniques
Tang, Steven Hsu, Yibin Ye, James Tschanz, Dinesh Somasekhar, Siva Narendra, Shih-Lien Lu, Ram Krishnamurthy and Vivek De
Microprocessor Research, Intel Labs, Hillsboro, OR, USA
LBSF and SFN leakage-tolerant techniques improve robustness of leakage-sensitive and performance-critical wide dynamic circuits in the local and global bitlines of a 256x32b register file in a 100nm dual-VT technology. The full LBSF design improves clock frequency by 50% or reduces energy by 37%, compared to the best dual-VT (DVT) design. Performance advantages of LBSF and SFN become more significant as...
Processor 800 MT/s Front Side Bus with Ground Referenced Voltage Source I/O
P. Thomas, Ian A. Young
Intel Corporation, Portland Technology Development, RA1-309, 5200 NE Elam Young Parkway, Hillsboro, OR 97124, USA
A 40cm multi-drop bus shared by 5 test chips, emulating 4 processors and a chipset, runs error free at 800MT/s with 130mV margin using a Ground Referenced Voltage Source (GRVS) I/O scheme. For comparison, when the same test chip is programmed to use Gunning Transceiver Logic (GTL), the bus speed is 500 MT/s for the same 130mV margin under identical conditions.
Pulsed Bus for On-Chip Interconnects
Muhammad Khellah, James Tschanz, Yibin Ye, Siva Narendra and Vivek De
Circuit Research, Intel Labs, Hillsboro, OR, USA
The Static Pulsed Bus (SPB) technique offers significant advantages over a conventional static bus (SB) in delay, energy, total device width and peak VCC current for 1500µm to 4500µm long M4 buses in a 100nm technology. These improvements are due to reduction in effective coupling capacitance and repeater skewing enabled by monotonic signal transition. Unlike dynamic schemes, energy savings of SPB are maintained across all activity factors without any clock power or routing...
Transition-Encoded Dynamic Bus Technique for High-Performance Interconnects
Mark Anders, Nivruti Rai*, Ram Krishnamurthy, Shekhar Borkar
Circuit Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA; *Desktop Products Group, Intel Corporation, Hillsboro, OR 97124, USA
A transition-encoded dynamic bus technique enables interconnect delay reduction while maintaining the robustness and switching energy behavior of a static bus. Efficient circuits, designed for a drop-in replacement, enable significant delay and peak-current reduction even for short buses, while obtaining energy savings at aggressive delay targets. In a 180nm 32-bit microprocessor, 79% of all global buses exhibit 10%-35% performance...
Accurate and Efficient Analysis Method for Multi-Gb/s Chip-to-chip Signaling
Bryan K. Casper, Matthew Haycock, Randy Mooney
Circuit Research, Intel Labs
This paper introduces an accurate method of modeling the performance of high-speed chip-to-chip signaling systems. Implemented in a simulation tool, it precisely accounts for intersymbol interference and echoes, as well as circuit related effects such as thermal noise, power supply noise and jitter. We correlated the simulation tool to actual measurements of a high-speed signaling system and then used this tool to make tradeoffs between different methods of chip-to-chip signaling with and...
We present a technique to enable the integration of sensitive analog circuits with a high performance microprocessor (Pentium® 4) on a lossy substrate. We show that by exploiting the spectral content of substrate noise, and the use of appropriately tuned analog amplification, it is possible to limit the isolation requirements to 70dB. By using a combination of measurement and field solver results, we show that a minimal process enhancement (i.e. a deep nwell) will yield 50dB of isolation, and the remainder can be achieved by layout and differential circuit techniques.
Node Engineering for Chip-Level Soft Error Rate Improvement
Tanay Karnik, Sriram Vangal, V. Veeramachaneni, Peter Hazucha, Vasantha Erraguntla, Shekhar Borkar
Circuit Research, Intel Labs, Hillsboro, OR, U.S.A.
This paper presents a technique to selectively engineer sequential or domino nodes in high performance circuits to improve the soft error rate (SER) induced by cosmic rays or alpha particles. In a 0.18µm process, the SER improvement is as much as 3X at the cell level, 1.8X at the block level and 1.3X at the chip level, without any penalty in performance or area and <3% power penalty. The node selection, hardening and SER quantification steps are fully...
Optimizations of a High Performance Microprocessor Using Combinations of Dual-VT and Transistor Sizing
James Tschanz, Yibin Ye, Liqiong Wei(1), Venkatesh Govindarajulu, Nitin Borkar, Steven Burns(2), Tanay Karnik, Shekhar Borkar and Vivek De
Microprocessor Research, (1) Mobile Architecture, (2) CAD, Intel Labs, Hillsboro, OR, USA
Optimizations of dual-VT allocation and transistor sizing for a high performance microprocessor reduce low-VT usage by 36%-64%, compared to a design where only dual-VT allocation is optimized. Designs optimized for minimum power (DVT+S) and minimum area (L-SDVT) reduce leakage power by 20%, with minimal impact on total power and die area. An enhancement of the optimum DVT+S design allows processor frequency to be increased efficiently during manufacturing through low-VT device leakage push only.
& Validation of the Pentium® III and Pentium® 4 Processors
Rahal-Arabi, Greg Taylor, Matthew Ma, and Clair Webb
Intel Corporation / Logic Technology Development, 5200 NE Elam Young Parkway, Hillsboro, Oregon 97124
In this paper, we present an empirical approach for the validation of the power supply impedance model, which is widely used to design the power delivery for high performance systems. For this purpose, several silicon wafers of the Pentium® III and Pentium® 4 processors were built with various amounts of decoupling. The measured data showed significant discrepancies with the model predictions and provided useful insights in investigating the model's regions of...
Effectiveness of Adaptive Supply Voltage and Body Bias for Reducing Impact of Parameter Variations in Low Power and High Performance Microprocessors
James Tschanz, James Kao(1), Siva Narendra, Raj Nair and Vivek De
Microprocessor Research, Intel Labs, Hillsboro, OR, USA; (1) Massachusetts Institute of Technology
Testchip measurements show that adaptive VCC is useful for reducing the impact of die-to-die and WID parameter variations on the frequency, active power and leakage power distributions of both low power and high performance microprocessors. Using adaptive VCC together with adaptive VBS or WID-VBS is much more effective than using any of them individually. Adaptive VCC+WID-VBS increases the number of dies accepted in the highest two frequency bins to 80%.