September 21, 2003

Understanding the detailed Architecture of AMD's 64 bit Core

 by Hans de Vries 

 


      


   


 

 The Arrival of the Hammers

 

 

 


 

 


 

 

For those who really want to know what it takes to build a world class 64 bit speculative out of order processor core for a multi-processing environment.

 

 

Index


 Chapter 1,    The Integer Core:    The Integer Super Highway

 

           1.1      The Integer Super Highway

           1.2      Three way Super Scalar CISC architecture

           1.3      A Third class of  Instructions:  Double Dispatch Operations

           1.4      128 bit SSE(2) instructions are split into Doubles

           1.5      Using Doubles for 128 bit SSE(2) instructions avoids a 25% latency penalty

           1.6      Doubles used for some Integer and x87 instructions as well

           1.7      Doubles handle 128 bit memory accesses

           1.8      Address Additions before Scheduling

           1.9      Register Renaming and Out-Of-Order Processing

           1.10    Renaming the Integer Registers

           1.11    The  IFFRF:   Integer Future File and Register File

           1.12    The "Future File" section of the IFFRF

           1.13    Exception or Branch Misprediction:  Overwrite Speculative values with Retired values

           1.14    The Reorder Buffer

           1.15    Retirement and Exception Processing

           1.16    Exception Processing is always delayed until Retirement

           1.17    Retirement  of Vector Path and Double Dispatch Instructions

           1.18    Out-Of-Order processing:  Instruction Dispatch

           1.19    The Schedulers / Reservation Stations

           1.20    Each x86 instruction can launch both an ALU and an AGU operation

           1.21    The Scheduling of an ALU operation

           1.22    The Scheduling of an AGU operation for memory access

           1.23    Micro Architectural Advantages of Opteron's Integer Core

 

 Chapter 2,    Opteron's Floating Point Units

 

           2.1      The Floating Point Renamed Register File

           2.2      Floating Point  rename stage 1:    x87 stack to absolute FP register mapping

           2.3      Floating Point  rename stage 2:    Regular Register Renaming

           2.4      Floating Point  instruction scheduler

           2.5      The 5 read and 5 write ports of the floating point renamed register file

           2.6      The Floating Point processing units

           2.7      The Convert and Classify units

           2.8      X87 Status handling:  FCOMI / FCMOV  and FCOM / FSTSW pairs

 

 Chapter 3,    Opteron's Data Cache and Load / Store units

 

           3.1      Data Cache: 64 kByte with three cycle data load latency

           3.2      Two accesses per cycle, read or write:   8 way bank interleaved, two way set associative

           3.3      The Data Cache Hit / Miss Detection:  The cache tags and the primary TLBs

           3.4      The 512 entry second level TLB

           3.5      Error Coding and Correction

           3.6      The  Load / Store Unit,   LS1 and LS2

           3.7      The "Pre Cache" Load / Store unit:  LS1

           3.8      Entering LS2:  The Cache Probe Response

           3.9      The "Post Cache" Load Store unit:  LS2

           3.10    Retiring instructions in the Load Store unit and Exception Handling

           3.11    Store to Load forwarding,  The Dependency Link File

           3.12    Self Modifying Code checks:  Mutual exclusive L1 DCache and L1 I Cache

           3.13    Handling multi processing deadlocks:  exponential back-off 

           3.14    Improvements for multi processing and multi threading

           3.15    Address Space Number (ASN) and Global flag

           3.16    The TLB Flush Filter CAM

           3.17    Data Cache Snoop Interface

           3.18    Snooping the Data Cache for Cache Coherency,  The MOESI protocol

           3.19    Snooping the Data Cache for Cache Coherency,  The Snoop Tag RAM

           3.20    Snooping the L1 Data Cache and outstanding stores in LS2

           3.21    Snooping LS2 for loads to recover strict memory ordering in shared memory

           3.22    Snooping the TLB Flush Filter CAM

 

 Chapter 4,    Opteron's Instruction Cache and Decoding

 

           4.1      Instruction Cache:  More than instructions alone

           4.2      The General Instruction Format

           4.3      The Pre-decode bits

           4.4      Massively Parallel Pre-decoding

           4.5      Large Workload Branch Prediction

           4.6      Improved Branch Prediction

           4.7      The Branch Selectors

           4.8      The Branch Target Buffer

           4.9      The Global History Bimodal Counters

           4.10    Combined Local and Global Branch Prediction with three branches per line

           4.11    The Branch Target Address Calculator,  Backup for the Branch Target Buffer

           4.12    Instruction Cache Hit / Miss detection, The Current Page and the BTAC

           4.13    Instruction Cache Snooping

   

 

Chapter 1,   The Integer Core:   The Integer Super Highway

 

 

 

     1.1   The Integer Super Highway

 

The die photo of the Integer core is dominated by the 64 bit data busses running North-South. At some points the density may reach up to twenty busses. The busses carry all the source and result operand data between the Integer units. The layout of the busses is bit-interleaved, meaning that equal bit numbers are grouped together: bit 0 of every bus lies side by side at one edge of this Integer super highway, while all the bits 63 can be found at the other edge. The separation into individual bytes is also clearly visible.

 

     1.2   Three way Super Scalar CISC architecture

 

The Opteron is a 3 way super-scalar processor. It can decode, execute and retire three x86 instructions per cycle. These instructions can be quite complex (CISC) operations involving more than two source operands. The Pentium 4 handles three so-called uOps per cycle, and several of these uOps may be needed to implement a single x86 instruction. Prescott, the follow-up of the Pentium 4, may be able to handle four uOps per cycle, as we revealed earlier.

 

In general an x86 instruction can be expressed as:

F( reg,reg ), F( reg,mem ) or F( mem,reg ), where the first operand is both source and destination. The first two forms are general for Integer, MMX and SSE(2). The latter form is found basically in Integer instructions: one source operand is loaded from memory and the result is written back to the same location. The Integer Pipeline handles Loads and Stores for all operations, including those for Floating Point and Multimedia instructions.

   

 

 

 

   

Overview of Opteron's Processor Core

 

     1.3   A third class of  Instructions:  Double Dispatch Operations

 

The original Athlon (which we'll refer to as the Athlon 32) classifies instructions as either Direct Path or Vector Path.

To the first class belong all the less complex instructions that the hardware can handle as a single operation. The more complex instructions (Vector Path) invoke the micro sequencer, which executes a micro code program. Instructions are read from the micro code ROM and inserted in the 3-way pipeline.

 

The Opteron introduces a third instruction class: the Double Dispatch instructions, or simply "Doubles". The Doubles are generated near the end of the decoding pipeline. These instructions, whether they followed the "Direct Path" or were generated by the Micro Code sequencer, are split into two independent instructions. The 3-way pipeline can thus generate up to six instructions per cycle. The instructions are "re-packed" back to three again in the PACK-stage. This extra pipeline stage has often been the subject of speculation since Opteron's introduction at the 2001 Microprocessor Forum. The six-fold symmetry of the "doubling stage" is clearly visible on the die plot above.

 

     1.4   128 bit SSE(2) instructions are split into Doubles

 

Appendix C of Opteron's Optimization Guide specifies to which class each and every instruction belongs. Most 128 bit SSE and SSE2 instructions are implemented as double dispatch instructions. Only those that can not be split into two independent 64 bit operations are handled as Vector Path (Micro Code) instructions. Those SSE2 instructions that operate on only one half of a 128 bit register are implemented as a single (Direct Path) instruction.

 

There are both advantages and disadvantages performance-wise here. A disadvantage may be that the decode rate of 128 bit SSE2 instructions is limited to 1.5 per cycle. In general, however, this is not a performance limiter, because the FP units and the retirement hardware limit the maximum throughput to a single 128 bit SSE instruction per cycle. More important is that the extra cycle of latency that a Pentium 4 style implementation would bring is avoided.

 

     1.5   Using Doubles for 128 bit SSE(2) instructions avoids a 25% latency penalty

 

In the Pentium 4 an SSE2 instruction is split at a later stage, in the Floating Point unit itself. The Floating Point unit accepts 128 bit source data at its first stage. It then splits the operation in two and combines the two results at the end into one single 128 bit result. This effectively adds one extra cycle to the total latency. For instance: the x87 FADD and FMUL take 5 and 7 cycles, while the 128 bit (2x64) SSE2 equivalents need 6 and 8 cycles.

 

The Opteron, like the Athlon 32, handles both FADD and FMUL in 4 cycles. The SSE2 equivalents are handled with the same 4 cycle latency. An extra cycle would mean a latency increase of 25%, a serious performance limiter, so the correct decision has been made here. If you looked at a highly pipelined FP unit in action you would see mostly bubbles and few instructions: instructions are waiting for the results of others that have yet to finish. Latency is more important here than bandwidth.

 

The next Pentium, code-named Prescott, has an extra Floating Point Multiplier and Adder, as we revealed earlier. We now think that the extra FP units are used for single port but full 128 bit operation. This would bring the SSE2 latencies for Add and Multiply back to 5 and 7 cycles, which is beneficial for single thread programs. It would also double the Floating Point bandwidth, which is mainly interesting for Hyper Threading performance.

   

     1.6   Doubles used for some Integer and x87 instructions as well

 

The Double Dispatch instructions are not only used for SSE and SSE2 instructions. Appendix C of Opteron's Optimization Guide also lists classic x86 instructions like POP and PUSH, some of the integer multiplications and the LEAVE instruction. All these instructions are handled by micro code on the Athlon 32, which is a lot slower. A number of classical x87 instructions are now handled by Doubles as well, for instance those FP instructions that have an integer as one of the source operands, which first needs to be converted to floating point.

 

     1.7   Doubles handle 128 bit memory accesses

 

The 128 bit memory references used for SSE and SSE2 are likewise split up into two independent 64 bit accesses which are handled by the integer core. The results are snooped from the Load Data busses of the Data Cache by the Floating Point Core.

The decision to extend the Integer Registers from 32 bit to 64 bit and to split the 128 bit SSE(2) instructions into two 64 bit ones results in an elegant all 64 bit Micro Architecture.

 

There is a significant advantage in having an L1 Data Cache that can handle 128 bit loads or stores as two independent 64 bit loads or stores per cycle. Two 64 bit loads from different addresses into a single 128 bit SSE2 register with two moves are just as fast as loading a single 128 bit word from memory. Apple had a decent argument for introducing a 128 bit data type containing four 32 bit floating point values, which is usable as such for high quality ARGB color image data (given its customer base). Two 64 bit floating point numbers in a 128 bit word don't seem to serve any practical commercial application (other than making life miserable for compiler builders...). Providing separate 64 bit loads and stores at a rate of two per cycle gives a compiler a better chance to combine unrelated 64 bit operations into a single 128 bit one.

 

     1.8   Address Additions before Scheduling   (US Patent 6,457,115)

 

A single x86 instruction may need many source operands when memory is involved:

 

address  =  base + (index << scale) + displacement + segment

 

Up to four arguments are needed to calculate the address (ignoring the 2 bit "scale-field" hard-coded in the instruction). This means that a typical x86 instruction of the format F(reg,mem) needs no fewer than 5 input operands! One of the parameters (displacement) is a constant given by the instruction itself. Another parameter (segment) is a "semi-constant" and is typically zero in modern code with a non-segmented flat memory space.

 

The Athlon 32 adds the segment to the address only when needed after one of the three AGU's  (Address Generator Units) has calculated the linear address.  It does so during the Data Cache Access which causes an extra cycle of cache load latency. The Opteron has a different implementation. The displacement and segment are summed together before the actual address calculation. 

 

The segment value is considered a constant and is thus, just like the displacement, known during decoding. The addition is made during decoding/dispatch and the result is passed on, together with the rest of the instruction bits, as a new "displacement field" of the instruction.

 

An exception is generated whenever the segment value does change. The results of operations depending on it are cancelled and the pipeline is restarted from the right point.
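
The decode-time addition described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a model of the actual hardware: the function names are hypothetical, and the segment is represented simply by its base address.

```python
MASK64 = (1 << 64) - 1  # keep all results within 64 bits


def full_address(base, index, scale, displacement, segment_base):
    """Reference calculation using all four address terms at once."""
    return (base + (index << scale) + displacement + segment_base) & MASK64


def fold_at_decode(displacement, segment_base):
    """Decode-time step: merge the two constants into one new displacement field."""
    return (displacement + segment_base) & MASK64


def agu_address(base, index, scale, folded_displacement):
    """Execution-time step: the AGU only adds two registers and one constant."""
    return (base + (index << scale) + folded_displacement) & MASK64


# Hypothetical operand values; scale is the 2 bit shift count (0..3).
base, index, scale = 0x1000, 5, 3
displacement, segment_base = 0x20, 0x100

folded = fold_at_decode(displacement, segment_base)
assert agu_address(base, index, scale, folded) == full_address(
    base, index, scale, displacement, segment_base
)
```

Folding the two constants early is what removes one operand from the later address calculation. The case where the segment value changes after decoding, which the text says is handled by an exception and a pipeline restart, is not modelled here.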

 

The "Decode-Time" address adder might be used for other address additions as well.  (The 64 bit mode gets rid of most of the segmentation)  

   

 

 

One example is the new Relative Address mode, which adds the 64 bit Instruction Pointer (RIP) and the 32 bit displacement from the instruction together. By reducing the number of input parameters as much as possible during decoding we end up with a maximum of four input parameters for each instruction. Three of them are register variables and the fourth one is a constant.

 

 

 

 

     1.9   Register Renaming and Out-Of-Order Processing

 

The Athlon (and Opteron) use some clever tricks to handle Register Renaming and Out-Of-Order (OOO) processing, which allows them to shave some 25% off the integer pipeline. The design allows for a simple and fast scheduler that doesn't need special hardware to handle mis-scheduling caused by cache misses.

 

Register renaming is used to eliminate "False Dependencies" which limit the number of Instructions Per Cycle (IPC) that a processor can execute.  False Dependencies are the result of a limited number of registers. A register that holds an intermediate result needs to be re-used soon for another, maybe unrelated, calculation. Its value is then overwritten and not available anymore. The instruction that overwrites it must always wait for the instruction that needs the previous result.

 

This serializes the execution of the instructions and limits the IPC. This is especially true for an architecture like x86, which has a very small number of registers. The example below shows how register renaming can eliminate false data dependencies. Register rC is overwritten by the 3rd instruction, so the 3rd instruction has to wait for the 2nd instruction, which still needs the old value: a False Dependency. With register renaming we can use an "arbitrarily" large register file. There is no need to re-use rC (r3). We can simply use another available register instead, register r7 in this case. The basic rule is that all of the instructions that are "in-flight" are given a different destination register (single assignment).

 

Non Renamed:   rC=rA+rB; rF=rC&rD; rC=rA-rB;

Renamed:       r3=r1+r2; r6=r3&r4; r7=r1-r2;
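
The single assignment rule can be illustrated with a small generic renamer in Python. This is a textbook-style sketch and not the Athlon/Opteron mechanism, which uses Re-Order Buffer Tags rather than a separate rename table. The function name and the `p0, p1, ...` register names are made up for the illustration, so the numbering differs from the r3/r6/r7 example above.

```python
def rename(program):
    """Give every destination a fresh register; sources read the latest mapping.

    program: list of (dest, op, src1, src2) using architectural register names.
    Returns the same instructions expressed with renamed registers.
    """
    mapping = {}      # architectural name -> current renamed register
    next_free = 0

    def fresh():
        nonlocal next_free
        name = f"p{next_free}"
        next_free += 1
        return name

    def lookup(reg):
        # A register that is read before it is written gets an initial mapping.
        if reg not in mapping:
            mapping[reg] = fresh()
        return mapping[reg]

    renamed = []
    for dest, op, src1, src2 in program:
        a, b = lookup(src1), lookup(src2)   # read sources first
        mapping[dest] = fresh()             # then claim a new destination
        renamed.append((mapping[dest], op, a, b))
    return renamed


program = [
    ("rC", "+", "rA", "rB"),
    ("rF", "&", "rC", "rD"),
    ("rC", "-", "rA", "rB"),   # re-uses rC: a false dependency before renaming
]
renamed = rename(program)
for dest, op, a, b in renamed:
    print(f"{dest} = {a} {op} {b}")
```

After renaming, the third instruction writes a register that the second instruction never reads, so the two can execute in either order.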

 

     1.10   Renaming the Integer Registers

 

Opteron has sixteen 64 bit architectural integer registers. Not visible to the programmer are eight more 64 bit scratch registers, used to store intermediate results for the micro code routines that handle more complex x86 instructions. The Athlon family of processors handles Register Renaming in the simplest possible way. That is a compliment, because it often takes a lot of smart thinking to figure out how to do things in the simplest way! People only rarely succeed in this ...

 

As we said, each instruction in flight needs a different destination register. The total number of renamed registers must be equal to or larger than the number of instructions in flight plus the number of architectural registers. The maximum number of instructions in flight is 72; add the 16 architectural and 8 scratch registers and you need 96 "renamed registers". Two different structures are used to maintain these registers. The results of the instructions in flight are maintained by the result fields of the 72 entry Re-Order Buffer (ROB), and the architectural registers are maintained by the "Integer Future File and Register File" (IFFRF).

   

 

Re-Order-Buffer Tag definition

   bit 7          wrap bit
   bits 6..2      re-order buffer index  0..23    ( Instruction In Flight Number )
   bits 1..0      sub-index  0..2                 ( Instruction In Flight Number )

 

 

This configuration allows for a very simple renaming scheme which takes -zero- cycles...  Each instruction dispatched from one of the three decode lanes gets a "Re-Order Buffer Tag" or "Instruction In Flight Tag" consisting of:

 

1)   A sub-index 0,1 or 2 which identifies from which of the three lanes the instruction was dispatched.

2)   A value 0..23 that identifies the "cycle" in which the instruction was dispatched. The "cycle counter" wraps to 0 after reaching 23.

3)   A wrap bit. When two instructions have different wrap bits then the cycle counter has wrapped between the dispatches. 
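
The three tag fields can be packed and compared as in the Python sketch below. The bit positions (bit 7 for the wrap bit, bits 6..2 for the index, bits 1..0 for the sub-index) are read from the tag definition above. The function names, and the assumption that lane 0 holds the oldest instruction within one dispatch cycle, are illustrative only.

```python
ROB_LINES = 24   # values of the "cycle counter", 0..23
LANES = 3        # sub-index 0..2

# 72 instructions in flight + 16 architectural + 8 scratch registers
assert ROB_LINES * LANES + 16 + 8 == 96


def make_tag(wrap, index, lane):
    """Pack wrap bit, reorder buffer index and sub-index into one 8 bit tag."""
    assert wrap in (0, 1) and 0 <= index < ROB_LINES and 0 <= lane < LANES
    return (wrap << 7) | (index << 2) | lane


def split_tag(tag):
    """Unpack a tag into (wrap, index, lane)."""
    return (tag >> 7) & 1, (tag >> 2) & 0x1F, tag & 0x3


def is_older(tag_a, tag_b):
    """True if a was dispatched before b; both must still be in flight."""
    wrap_a, index_a, lane_a = split_tag(tag_a)
    wrap_b, index_b, lane_b = split_tag(tag_b)
    if wrap_a == wrap_b:
        return (index_a, lane_a) < (index_b, lane_b)
    # Different wrap bits: the counter wrapped between the two dispatches,
    # so the instruction with the larger index is the older one.
    return (index_a, lane_a) > (index_b, lane_b)


old = make_tag(0, 23, 0)   # dispatched just before the counter wrapped
new = make_tag(1, 0, 0)    # dispatched just after the wrap
assert is_older(old, new) and not is_older(new, old)
```

Because the tag is derived directly from the dispatch cycle and the lane, no lookup or allocation step is needed, which is why the renaming costs no extra cycles.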

 

     1.11   The  IFFRF:  Integer Future File and Register File

 

 

This register file is used to maintain the 16 architectural registers and the 8 temporary scratch registers. It has two entries for each of the 16 architectural registers. One of the two can be viewed as the actual register as seen by the programmer. It gets its value when the instruction that produced it has "retired". An instruction is retired when it is certain that no exception or branch misprediction has occurred and all preceding instructions have been retired as well. The value of the register is then said to be "non-speculative".

 

 

40 entry Integer Future File and Register File:   IFFRF

   16 entries    Retired Architectural Register Values
   16 entries    Speculative Register Values:  "Future File"
    8 entries    Temporary Registers

 

 

Instructions-In-Flight and their results may be cancelled and discarded as long as they have not been retired. Cancellation can be the result of a preceding instruction that caused an exception, or of a branch misprediction. Instructions-In-Flight are in principle always speculative. The results stay speculative even after the instruction has finished. They only become non-speculative at retirement, when the retirement logic determines that no exception has occurred.

 

     1.12   The Future File section of the IFFRF

 

The second entry for each Architectural Register holds the so-called "Future" value. The 16 of them together constitute the so-called Future File. These entries contain the most recent value produced for a certain architectural register by any instruction (retired or not retired). The content of a Future File register is speculative as long as the producing instruction has not yet retired. The value becomes non-speculative once the producing instruction successfully retires.

 

The Future File origins go back to 1985

 

Instructions write into the Future File as soon as their result is produced. The Future File however does not accept the result if it's not the very latest result for a certain register. If a later instruction has managed to finish earlier and has written its result already into the Future File then it will not accept results anymore for that register from older instructions. Finished Instructions address the Future File with the instruction code register number, a number from 0 to 15 for the 16 architectural registers.  The "Re-Order Buffer Tag" is used to determine if a result may be overwritten.  Each Future File entry has a corresponding Tag. We will see that an instruction may only write into the Future File entry if it still owns the entry: If the Tags match.

 

     1.13   Exception or Branch Misprediction:  Overwrite Speculative values with Retired values

 

All speculative results are cancelled by copying the retired values of the IFFRF over the Future File values of the IFFRF.

 

The speculative results must be cancelled whenever the retirement logic detects that an exception occurred when the instruction or an earlier one was executed. There are many types of exceptions. Memory accesses can encounter a Page Miss, or they can erroneously access a memory area that they are not entitled to access. The divide by zero is another well known exception. (It shouldn't be for Floating Point numbers, because +/- infinity are perfectly valid IEEE Floating Point values.)

 

When we say Speculative Results here, we mean more specifically the results that may need to be cancelled because of erroneous Control Flow Speculation: the program flow went in a different direction than predicted. Now:

 

-  A branch misprediction is basically the same as any other exception, but...

-  All exceptions are also branch mispredictions.

 

An exception causes a change in the program flow much like a conditional call. All instructions that can cause exceptions are thus in fact conditional control flow instructions. Exceptions are however always predicted as not taken and ignored by the branch prediction hardware.  


     1.14   The Reorder Buffer

 

We have mentioned retirement a number of times now. Retirement is handled with the aid of the reorder buffer. This unit does what its name suggests: it re-orders the instructions, putting them back into the original program order. The Schedulers are responsible for Out-Of-Order execution. They launch instructions to the execution units whenever all their source operands are available and the needed execution unit is free. It's the reorder buffer that brings the instructions back into order again.

 

 

Operation of the Reorder Buffer

[Figure: reorder buffer entries at index 1 to 12 for lane 0, lane 1 and lane 2. The legend distinguishes three kinds of entries: Out Of Order finished Instructions whose results are still speculative, Instructions being retired now, and Retired Instructions that are not speculative anymore.]

 

 

The reorder buffer itself is split into three identical lanes. Each lane has 24 entries. The lanes and entries correspond to the reorder buffer Tag assigned to each instruction. Each Instruction that finishes writes its result into the reorder buffer using the reorder buffer Tag as address.  The instructions also store any events that happened during execution that will require an exception.

In particular conditional branch instructions may report that the branch address they calculated does not correspond with the address that was predicted.

 

The instruction also leaves further information that the reorder buffer needs to do its work. It leaves some info on what kind of instruction it is. It leaves the architectural register address (0..15) that corresponds with its destination register. It also leaves the address where it is located in 'instruction' memory. Some of this info may already have been left in the reorder buffer earlier, when the instruction received its reorder buffer Tag.

 

The reorder buffer is shared by all instructions. It's also used to reorder Floating Point, SSE(2)  and MMX instructions. These instructions however do not write their result data in the reorder buffer. They use the 120 entry renamed floating point register file for that purpose. The reorder buffer however is still used to track the instruction code info and address, exception flags, ready status and retirement status. All instructions are retired with the aid of the reorder buffer. 

 

     1.15   Retirement and Exception Processing

 

The image shows how Instructions can be retired at the moment when all previous instructions are retired. Three instructions can be retired per cycle. The Instruction Control Unit (ICU) accesses the reorder buffer contents for the three instructions. The instructions are retired if there are no exception flags set. The result data is written to the Retired Entries of the IFFRF. The latter is basically a write-only process: these values are only read in case of an exception, when they are used to overwrite the speculative values of the Future File.
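
In-order retirement and the recovery step can be sketched as follows. This is a simplified Python model under stated assumptions: the reorder buffer is a plain list ordered oldest first, entries retire one by one rather than as a line of three, and all names (`RobEntry`, `retire_cycle`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    dest: int                 # architectural destination register, 0..15
    result: int = 0
    done: bool = False        # execution finished, result available
    exception: bool = False   # exception flag left by the instruction


def retire_cycle(rob, retired, future, width=3):
    """Retire up to `width` of the oldest entries in program order.

    Returns (number retired, exception seen). On an exception the retired
    values are copied over the Future File and all instructions in flight
    are discarded.
    """
    count = 0
    while rob and count < width:
        head = rob[0]
        if not head.done:
            break                      # nothing younger may retire either
        if head.exception:
            future[:] = retired        # cancel all speculative values
            rob.clear()
            return count, True
        retired[head.dest] = head.result
        rob.pop(0)
        count += 1
    return count, False


retired = [0] * 16
future = [0] * 16
rob = [RobEntry(1, 11, done=True), RobEntry(2, 22), RobEntry(3, 33, done=True)]
future[1], future[3] = 11, 33          # finished results already in the Future File

print(retire_cycle(rob, retired, future))   # only the oldest entry can retire
```

The entry for register 3 has finished out of order but cannot retire, because the older entry for register 2 is still executing.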

 

The ICU handles a branch misprediction by forwarding the instruction's address to the Instruction Fetch Unit at the beginning of the pipeline. The branch is then re-executed, now with the right prediction. Other exceptions require an exception routine call. The ICU can for instance save some relevant data in the temporary registers of the IFFRF and invoke the exception call or a micro code function.

 

     1.16   Exception Processing is always delayed until Retirement

 

Exception processing must always be delayed until the instruction that caused the exception is not speculative anymore.

A memory access exception, for instance, may be caused by accessing an Array with an Index that is out of bounds. The program may have a test for out-of-bounds indices and code to handle them. The branch prediction hardware, however, will most likely predict that the out-of-bounds test is Not True, because the Index is OK most of the time. The processor will thus access the array with the out-of-bounds Index anyway and may well cause a memory access exception. Exception handling is delayed until retirement, where the instruction plus its exception flags is discarded because of the branch misprediction.

 

It's for the same reason that speculative Stores to memory are delayed and held in the LSU (Load Store Unit) until retirement.

 

     1.17   Retirement  of Vector Path and Double Dispatch Instructions

 

A single Vector Path instruction may produce many instructions. The Micro sequencer inserts these instructions in the 3 way pipeline, three per cycle. They do not mix with Direct Path instructions during decoding and retirement. The actual Retirement takes place when the last line of 3 instructions is ready. The retirement hardware scans from the first line of 3 micro code generated instructions to the last line, accumulating all exceptions that occurred. Retirement follows if no exception has occurred; otherwise the appropriate exception call is made.

 

Instructions generated by Doubles can mix with other (Direct Path) instructions during decoding and retirement. The two instructions generated by a Double must however retire simultaneously: imagine a PUSH that retires the memory store but doesn't retire the Stack Pointer update. This leads to the limitation that both instructions generated by a Double must be in the same 3 instruction line during retirement.

 

     1.18   Out-Of-Order processing:  Instruction Dispatch

 

We are now ready to describe in greater detail how Out Of Order processing is handled. We go back to the Instruction Dispatch Stage. Instruction Dispatch means here that Instructions are sent to the Schedulers. They are not sent to the execution units yet. The Instructions do access the Register File though. Three Instructions can look up a total of nine register values from the IFFRF each cycle. The Future File entries are accessed. The Future File contains the latest speculative results for each of the 16 architectural registers. The three Instructions are then placed together with these register values into the three Schedulers.

 

Now it's highly likely that not all previous instructions were finished, so many of the register values are stale values from earlier instructions. Each Instruction that is dispatched clears the valid flag for the architectural register it will modify. It also leaves its Tag. Succeeding instructions now know that the Future File entry is not valid anymore, but they also know the Tag of the instruction that will provide the data they need. They will use this Tag later to pick up the result directly from the result busses. The instruction that invalidated the register will later finish, write its result to the Future File (if it still owns the entry) and then set the valid bit again.

 

An instruction has ownership of an entry in the Future File if the Tags match. It acquires ownership when it is dispatched and it loses ownership when another instruction with the same destination register is dispatched. An instruction writes its result to the Future File only if it is still the owner. Subsequent instructions can pick up the result from there. If the instruction is not the owner anymore then it won't modify the Future File entry. Any other instruction that needed this result was already dispatched and picks up the result directly from the result busses.
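
The valid flag and Tag ownership rules can be captured in a small Python model. It is a sketch of the bookkeeping only: the class and method names are hypothetical, and tags are plain integers instead of 8 bit Re-Order Buffer Tags.

```python
class FutureFile:
    """Speculative register values with one valid flag and one owner tag each."""

    def __init__(self, num_regs=16):
        self.value = [0] * num_regs
        self.valid = [True] * num_regs
        self.tag = [None] * num_regs

    def read(self, reg):
        """Source lookup at dispatch: the value if valid, else the producer's tag."""
        if self.valid[reg]:
            return ("value", self.value[reg])
        return ("tag", self.tag[reg])

    def dispatch(self, dest, tag):
        """A newly dispatched instruction takes ownership of its destination."""
        self.valid[dest] = False
        self.tag[dest] = tag

    def writeback(self, dest, tag, result):
        """A finished instruction writes its result only if it is still the owner."""
        if self.tag[dest] != tag:
            return False               # a younger instruction owns the entry
        self.value[dest] = result
        self.valid[dest] = True
        return True


ff = FutureFile()
ff.dispatch(3, tag=10)                 # older instruction writing register 3
ff.dispatch(3, tag=11)                 # younger instruction, same destination
assert ff.read(3) == ("tag", 11)       # later readers wait for the younger one

assert ff.writeback(3, 11, 222)        # the younger one finishes first
assert not ff.writeback(3, 10, 111)    # the older result is not accepted
assert ff.read(3) == ("value", 222)
```

The rejected result is not lost: any instruction that needed it was dispatched before the younger writer and picks it up from the result busses by Tag.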

 

     1.19   The Schedulers / Reservation Stations

 

Each of the (up to) three Instructions that are Dispatched gets assigned to a Reservation Station within the Scheduler it is sent to. Each scheduler has eight Reservation Stations. That's up from six in the Athlon 32 and up from five in the first Athlon prototypes. The Reservation Station gathers all remaining input data needed for the instruction from the result busses. It monitors the Tags of these busses to see if the instructions from which data is needed are about to produce their results. (Register File Bypass)

 

The Tag busses run one cycle in advance of the result data busses. The Reservation Station does not need to look at all the busses. The Tag's sub-index identifies which of the three ALU's will produce the result. The Reservation Station also knows whether the data will come from one of the two cache read ports. It can select the Tag bus in advance rather than having to test all the Tags.

 

 

 

The Scheduler's Reservation Station Entries

   Instruction Data    "CONST"   64 bit   Displacement + Segment or Instruction Pointer

   Input Data          VALUE:    64 bit   register 'A'                               TAG:   reg. 'A'
                       VALUE:    64 bit   register 'B'  or  64 bit Index register    TAG:   reg. 'B' / Index
                       VALUE:    64 bit   Base register                              TAG:   Base reg.

   Input Status        VALUE:     4 bit   ZAPS flags:  Zero, Aux, Parity, Sign       TAG:   ZAPS flags
                       VALUE:     1 bit   OF/C flag:   either OverFlow or Carry      TAG:   OF/C flag

 

 

 

Until now we've neglected the x86 status flags. Many x86 instructions use one or more of the six x86 status flags as an input. An x86 instruction may change some, all or none of the status flags. This means that different flags may be produced by different instructions. Luckily there are two rules that help to simplify the scheduler.

  

rule 1:  An instruction that modifies any of the ZAPS flags ( Zero, Aux, Parity, Sign ) modifies all of them. This means that these can be handled by a single 4 bit entry in the reservation station.

rule 2:  An instruction that uses the OverFlow flag (signed integer) does not use the Carry flag (unsigned integer). A single 1-bit reservation station entry can be used for the one which is needed.
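The two rules can be turned into a few lines of Python that work out which status entries a reservation station has to wait for. The function name and the flag spellings (ZF, AF, PF, SF, OF, CF) are chosen for this example only.

```python
from typing import Iterable, Optional, Tuple

ZAPS = frozenset({"ZF", "AF", "PF", "SF"})


def status_inputs(flags_read: Iterable[str]) -> Tuple[bool, Optional[str]]:
    """Return (needs the 4 bit ZAPS entry, which flag the 1 bit entry holds)."""
    flags = set(flags_read)
    needs_zaps = bool(flags & ZAPS)       # rule 1: the four flags travel together
    uses_of = "OF" in flags
    uses_cf = "CF" in flags
    if uses_of and uses_cf:
        # Rule 2 says this case does not occur for these instructions.
        raise ValueError("instruction reads both OverFlow and Carry")
    ofc = "OF" if uses_of else ("CF" if uses_cf else None)
    return needs_zaps, ofc


jle = status_inputs({"ZF", "SF", "OF"})   # signed "less or equal" branch
jbe = status_inputs({"CF", "ZF"})         # unsigned "below or equal" branch
adc = status_inputs({"CF"})               # add with carry
```

A signed conditional branch such as JLE waits for the ZAPS entry plus OverFlow, its unsigned counterpart JBE waits for the ZAPS entry plus Carry, and ADC needs the 1-bit entry only.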

 

     1.20   Each x86 instruction can launch both an ALU and an AGU operation

 

A single x86 instruction waiting in a Reservation Station of one of the Schedulers can launch up to two operations. It can launch an integer operation to its associated ALU and it can launch a memory operation to its AGU (Address Generation Unit). The simplest integer instructions of type F( reg,reg ) do not access memory and launch an ALU operation only. Integer instructions of type F( reg,mem ) launch a memory load first and subsequently launch an ALU operation when the load data arrives.

 

Integer instructions of type F( mem,reg ) are implemented in the same way, except that the Load is now a Load/Store operation. The Load/Store stays in the LSU (Load Store Unit), where it waits for the result of the ALU operation to be stored in memory. Non-integer instructions, such as Floating Point and Multi Media instructions, that specify a memory access launch an AGU operation only. The Floating Point / MMX operation itself is then handled by the Floating Point Unit.

 

Each Scheduler can launch one ALU and one AGU operation per cycle. The ALU operation may come from one x86 instruction while the AGU operation may come from another.
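The cases above can be summarised as a small lookup table. The Python below is only a restatement of the text; the form names used as keys are labels invented for this example.

```python
from typing import Dict, List

# Operations launched from the integer scheduler, in launch order.
OPERATIONS: Dict[str, List[str]] = {
    "int F(reg,reg)": ["ALU"],                    # no memory access
    "int F(reg,mem)": ["AGU load", "ALU"],        # ALU waits for the load data
    "int F(mem,reg)": ["AGU load/store", "ALU"],  # store waits in the LSU for the ALU result
    "fp/mmx mem":     ["AGU"],                    # arithmetic is done by the Floating Point Unit
}


def operations_for(form: str) -> List[str]:
    """Return the operations one x86 instruction of this form launches."""
    return OPERATIONS[form]


max_ops = max(len(ops) for ops in OPERATIONS.values())
```

No form launches more than two operations, which matches the limit of one ALU and one AGU launch per Scheduler per cycle.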

 

     1.21   The Scheduling of an ALU operation                                                                        US Patent 6,535,972.

 

An ALU operation generally needs two register operands and optionally some status bits. An x86 instruction that accesses memory will leave the Load value in register 'B'. The Reservation Station waits until it has all needed input operands (data and status). The Scheduler observes all eight Reservation Stations and will launch the ALU operation if it belongs to the oldest instruction that is ready to launch. The Scheduler sends all operands plus instruction information to the ALU that is associated with it.
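The "oldest ready instruction first" selection can be sketched in a few lines of Python. The field names, and the use of a plain integer age to express dispatch order, are assumptions of this example and say nothing about how the hardware tracks age.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Station:
    age: int                   # dispatch order, smaller means older (assumed field)
    a_ready: bool              # register 'A' operand captured
    b_ready: bool              # register 'B' operand (or load value) captured
    flags_ready: bool = True   # status inputs captured, if any are needed

    def ready(self) -> bool:
        return self.a_ready and self.b_ready and self.flags_ready


def pick_alu_op(stations: List[Optional[Station]]) -> Optional[Station]:
    """Pick the oldest station that has all of its inputs; None if there is none."""
    ready = [s for s in stations if s is not None and s.ready()]
    return min(ready, key=lambda s: s.age, default=None)


stations = [
    Station(age=5, a_ready=True, b_ready=True),
    Station(age=2, a_ready=True, b_ready=False),   # oldest, but still waiting for 'B'
    Station(age=3, a_ready=True, b_ready=True),
    None,                                          # an empty Reservation Station
]
picked = pick_alu_op(stations)
```

The instruction with age 2 is the oldest but cannot launch yet, so the one with age 3 goes first. This is the out-of-order behaviour the Scheduler is there to provide.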

 

Reservation Station entries typically involved in an ALU operation:

 

  Instruction Data    "CONST"   64 bit   Displacement + Segment or Instruction Pointer

  Input Data          VALUE:    64 bit   register 'A'                                 TAG:   reg. 'A'
                      VALUE:    64 bit   register 'B'  or  64 bit Index register      TAG:   reg. 'B' / Index
                      VALUE:    64 bit   Base register                                TAG:   Base reg.

  Input Status        VALUE:     4 bit   ZAPS flags:  Zero, Aux, Parity, Sign         TAG:   ZAPS flags
                      VALUE:     1 bit   OF/C flag:   either OverFlow or Carry        TAG:   OF/C flag

 

 

The Reservation Station does not actually need to catch the last operand(s) itself. The Reservation Station can be bypassed. The ALU may receive the bus number which will carry the last result so it can catch the operand itself. If you take a look at the Die photo then you see that all three ALUs are next to each other, even though each receives operations only from its own scheduler. The bypass mechanism lets them exchange data directly without the need to go back and forth to the schedulers.

 

     1.22   The Scheduling of an AGU operation for memory access                                  US Patent 6,457,115.

 

We saw how a single x86 instruction may need up to four arguments to calculate the memory address ( ignoring the 2 bit scale field hard-coded in the instruction ). This includes up to two register variables (base and index). We also saw how displacement and segment could already be added together during instruction decoding. Segment is considered a semi-constant; a restore mechanism is provided for the rare case that it is changed.
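As a worked example, the address calculation can be written out in Python. This is a sketch under stated assumptions: the function names are invented here, the arithmetic is simply wrapped to 64 bits, and the segment is treated as a plain base value that was added to the displacement during decoding.

```python
MASK64 = (1 << 64) - 1


def precompute_const(displacement: int, segment_base: int) -> int:
    # Done once at decode time: the "CONST" field of the Reservation Station.
    return (displacement + segment_base) & MASK64


def agu_address(base: int, index: int, scale: int, const: int) -> int:
    # The AGU adds the two register values to the precomputed constant.
    if not 0 <= scale <= 3:
        raise ValueError("scale is a 2 bit field")
    return (base + (index << scale) + const) & MASK64


const = precompute_const(displacement=0x20, segment_base=0)
addr = agu_address(base=0x1000, index=4, scale=3, const=const)

# A negative displacement wraps around correctly in 64 bit arithmetic.
below = agu_address(base=0x1000, index=0, scale=0,
                    const=precompute_const(-8, 0))
```

With these inputs `addr` is 0x1040 and `below` is 0xFF8. The AGU itself only has to add three values, because the fourth one was already folded into the constant.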

 

Reservation Station entries typically involved in an AGU operation:

( address = base + ( index << scale ) + displacement + segment )

 

Instruction