update version 0.83

June 2, 2000:  Ashok Kumar mentions 170 mm2 as Willamette's die size.

C|NET:  http://yahoo.cnet.com/news/0-1003-200-2001874.html?pt.yfin.cat_fin.txt.ne

U.S. Bancorp Piper Jaffray analyst Ashok Kumar published his report on Intel today, in which he apparently mentions 170 mm2 as Willamette's die size. It is well known that Ashok has a very friendly relationship with Intel, so we take this very seriously.

A die size of 170 mm2 means that the actual core size (excluding the 256 kbyte L2 cache) would be on the order of 140 mm2. This is twice the size of the Coppermine core (70 mm2) and larger than most estimates. We previously estimated a 50-60% larger core (110 mm2), based on the amount by which the individual units on the P6 die photograph were supposed to grow. We estimate that this large die size would be equivalent to that of a Mustang with 1 Megabyte of on-chip L2 cache.
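The arithmetic behind these figures can be checked with a short sketch. It uses only the areas quoted above; the L2 cache area is not a published number but simply what remains when the 140 mm2 core estimate is subtracted from the 170 mm2 die, and the function name is ours.

```python
# Die-area figures quoted in the text, in mm2.
WILLAMETTE_DIE = 170.0     # reported total die size
WILLAMETTE_CORE = 140.0    # core estimate, excluding the 256 kbyte L2 cache
COPPERMINE_CORE = 70.0     # Coppermine core, excluding cache
PREVIOUS_ESTIMATE = 110.0  # our earlier estimate of the Willamette core


def growth_percent(new_area, old_area):
    """Return how much larger new_area is than old_area, in percent."""
    return (new_area / old_area - 1.0) * 100.0


# Assumption: whatever is not core is attributed to the L2 cache.
implied_l2_area = WILLAMETTE_DIE - WILLAMETTE_CORE
core_ratio = WILLAMETTE_CORE / COPPERMINE_CORE
previous_growth = growth_percent(PREVIOUS_ESTIMATE, COPPERMINE_CORE)

print(f"implied L2 area:            {implied_l2_area:.0f} mm2")
print(f"core vs. Coppermine core:   {core_ratio:.1f}x")
print(f"earlier estimate vs. P6:    {previous_growth:.0f}% larger")
```

The earlier 110 mm2 estimate lands inside the 50-60% range mentioned above, while the new figure corresponds to a 100% larger core.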

Why should the Willamette die be this big?  Here are just a few possibilities:

A much larger Trace Cache:
The Trace Cache needs significantly more bits to store the (decoded) instructions: about six times as many as for the un-decoded instructions. Willamette would need about 400 kbyte of SRAM to store the same number of instructions as the Athlon stores in its 64 kbyte L1 instruction cache. This amount of storage, however, doesn't seem to be necessary with a closely coupled on-chip L2 cache.
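As a rough check of the 400 kbyte figure, the sketch below multiplies the Athlon's instruction cache size by the expansion factor quoted above. Both inputs come from the text; the factor of six is itself only an approximation, so the result should be read as an order-of-magnitude figure.

```python
# Inputs taken from the text above.
ATHLON_L1I_KBYTE = 64   # Athlon L1 instruction cache
DECODED_EXPANSION = 6   # "about six times" more bits for decoded instructions

# Storage needed to hold the same number of instructions in decoded form.
trace_cache_kbyte = ATHLON_L1I_KBYTE * DECODED_EXPANSION

print(f"equivalent decoded storage: {trace_cache_kbyte} kbyte")
```

The product, 384 kbyte, is what the text rounds to "about 400 kbyte".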

A larger part of the pipeline is running at 750 MHz:
A number of "twin" pipeline stages such as "rename-rename", "schedule-schedule" and "dispatch-dispatch", which can be difficult to "hyper-pipeline", may actually be single stages with twice the amount of logic running at half the frequency. The Trace Cache, for instance, is known to emit six micro-ops at a rate of 750 MHz into the three-way superscalar pipeline, which is supposed to run at 1.5 GHz. A three-way superscalar pipeline running at 1.5 GHz is largely equivalent in functionality and performance to a six-way superscalar pipeline running at 750 MHz. Stages of the pipeline running at 750 MHz may be easier to design but need about twice the logic.
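The equivalence claimed here is a matter of peak throughput: width times clock rate. The sketch below compares the two configurations using the widths and frequencies from the text. It only models peak micro-op bandwidth and ignores latency, which does differ between the two designs; the function name is ours.

```python
def micro_ops_per_second(width, clock_hz):
    """Peak micro-op throughput of a pipeline stage: width x clock rate."""
    return width * clock_hz


# Three-way pipeline at 1.5 GHz versus six-way stage at 750 MHz.
fast_narrow = micro_ops_per_second(3, 1_500_000_000)
slow_wide = micro_ops_per_second(6, 750_000_000)

print(f"3-way @ 1.5 GHz: {fast_narrow / 1e9:.1f} G micro-ops/s")
print(f"6-way @ 750 MHz: {slow_wide / 1e9:.1f} G micro-ops/s")
```

Both come out at 4.5 billion micro-ops per second, which is why a half-speed stage needs roughly twice the logic to keep up.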

On chip L3 Cache Tags for Foster:
Willamette will not have external L3 cache memory but Foster will. A commonly used practice to avoid the difficult problem of how to split production between processors of type A and type B is simply to produce one processor which can be sold as either type A or type B. A good example is the 180 nm Pentium III, which is sold as the Celeron II with half of its 256 kbyte cache disabled. If Intel can justify this for a processor at the value end, then it can certainly justify it for a high-end processor. It is thus not unlikely that Willamette and Foster will have the same die. Foster should be able to handle 2 Megabyte of external, full-speed L3 cache SRAM. The cache tags needed are estimated to occupy circa 10 mm2 with 128 byte cache lines.
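The number of tag entries follows directly from the cache size and the line size. The sketch below counts the lines for the 2 Megabyte, 128 byte configuration mentioned above, and for comparison the 32 byte line size used by the P6. It does not attempt to reproduce the 10 mm2 area estimate, since that depends on tag width and SRAM cell size, which are not given here; the function name is ours.

```python
L3_BYTES = 2 * 1024 * 1024  # 2 Megabyte external L3 cache


def line_count(cache_bytes, line_bytes):
    """Number of cache lines, and so of tag entries, for a given line size."""
    return cache_bytes // line_bytes


lines_128 = line_count(L3_BYTES, 128)  # line size quoted in the text
lines_32 = line_count(L3_BYTES, 32)    # P6 line size, for comparison only

print(f"128 byte lines: {lines_128} tag entries")
print(f" 32 byte lines: {lines_32} tag entries")
```

Longer lines cut the number of tag entries proportionally, which keeps the on-chip tag array small relative to the external data SRAM.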

Single cycle 128 bit SSE2:
The P6 executes the 128 bit SSE operations by issuing two 64 bit micro-ops. We believe that the hardware in the Willamette also handles 128 bit operations in two 64 bit cycles. Willamette's software guide mentions that "A few units accept an instruction once every cycle". Intel has also completely refrained from using those nice Giga-flops numbers which should result from handling four 32 bit or two 64 bit floating point multiplications or additions per cycle at a speed of 1.5 GHz. A single cycle 128 bit throughput would require extra die real estate. It is doubtful, however, whether this extra hardware would translate into much higher performance. The small number of eight 128 bit XMM registers, in combination with the long latency of the floating point operations, would limit the number of instructions which can be executed in parallel. The single load port would throttle the performance of data-independent code which needs to load operands from memory.
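The Giga-flops numbers referred to above are easy to work out. The sketch below computes the peak rate for one 128 bit operation (four 32 bit or two 64 bit elements) issued either every cycle or every second cycle at 1.5 GHz. It counts a single stream of multiplications or additions, as the text does; whether a multiply and an add could be issued together is not considered. These are hypothetical peak figures, not Intel numbers.

```python
CLOCK_GHZ = 1.5  # clock speed used in the text


def peak_gflops(elements_per_op, cycles_per_op, clock_ghz=CLOCK_GHZ):
    """Peak floating point rate for one 128 bit operation stream."""
    return elements_per_op / cycles_per_op * clock_ghz


# Single cycle 128 bit throughput.
single_cycle_sp = peak_gflops(4, 1)  # four 32 bit elements
single_cycle_dp = peak_gflops(2, 1)  # two 64 bit elements

# 128 bit operation split into two 64 bit cycles, as we believe Willamette does.
two_cycle_sp = peak_gflops(4, 2)
two_cycle_dp = peak_gflops(2, 2)

print(f"single cycle: {single_cycle_sp} / {single_cycle_dp} GFLOPS (32 / 64 bit)")
print(f"two cycles:   {two_cycle_sp} / {two_cycle_dp} GFLOPS (32 / 64 bit)")
```

A two-cycle implementation halves the peak figures, which would be consistent with Intel not advertising them.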

We'll use the 170 mm2 figure in our report, see version 0.83