Understanding
the detailed Architecture
of AMD's 64 bit Core
For
those who really want to know what it takes to build a world
class 64 bit
speculative
out of order processor core for a multi-processing
environment.
|
|
Index
|
|
Index
Chapter 1, The
Integer Core: The
Integer Super Highway
|
|
1.1 The
Integer Super Highway
1.2 Three way Super Scalar CISC architecture
1.3 A Third class of Instructions: Double Dispatch
Operations
1.4 128
bit SSE(2) instructions are split into Doubles
1.5 Using Doubles for 128
bit SSE(2) instructions avoids a 25% latency penalty
1.6 Doubles used for some Integer and x87 instructions as well
1.7 Doubles handle 128 bit memory accesses
1.8 Address
Additions before Scheduling
1.9 Register Renaming and Out-Of-Order Processing
1.10 Renaming
the Integer Registers
1.11 The IFFRF:
Integer Future
File and Register File
1.12 The "Future
File" section of the IFFRF
1.13 Exception
or Branch Miss-prediction: Overwrite Speculative values with Retired values
1.14 The Reorder Buffer
1.15 Retirement and Exception Processing
1.16 Exception Processing
is always delayed until Retirement
1.17 Retirement of Vector Path and Double Dispatch Instructions
1.18 Out-Of-Order
processing: Instruction
Dispatch
1.19 The Schedulers / Reservation
Stations
1.20 Each x86 instruction can launch both an ALU and an AGU operation
1.21 The Scheduling of an ALU operation
1.22 The
Scheduling of an AGU operation for memory access
1.23 Micro Architectural Advantages of Opteron's Integer
Core
|
|
Index
Chapter 2, Opteron's
Floating Point Units
|
|
2.1 The Floating Point
Renamed Register File
2.2 Floating Point rename stage 1: x87
stack to absolute FP register mapping
2.3 Floating Point
rename stage 2: Regular Register Renaming
2.4 Floating Point
instruction scheduler
2.5 The 5 read and 5 write ports of the floating point renamed register file
2.6 The Floating Point processing units
2.7 The Convert and Classify units
2.8 X87 Status handling:
FCOMI /
FCMOV and FCOM / FSTSW pairs
|
|
Index
Chapter 3, Opteron's
Data Cache and Load / Store units
|
|
3.1 Data Cache: 64
kByte with three cycle data load latency
3.2 Two accesses per cycle, read or write: 8 way bank
interleaved, two way set associative
3.3 The
Data Cache Hit / Miss Detection: The cache tags and the
primary TLBs
3.4 The 512 entry second level TLB
3.5 Error
Coding and Correction
3.6 The Load / Store Unit, LS1 and LS2
3.7 The
"Pre Cache" Load / Store unit: LS1
3.8 Entering LS2: The Cache Probe Response
3.9 The
"Post Cache" Load Store unit: LS2
3.10 Retiring instructions in the Load Store unit and Exception Handling
3.11 Store to Load forwarding, The Dependency Link File
3.12 Self Modifying Code checks: Mutual exclusive L1 DCache and L1 I Cache
3.13 Handling multi processing deadlocks: exponential back-off
3.14 Improvements for multi processing and multi threading
3.15 Address Space Number (ASN) and Global flag
3.16 The
TLB Flush Filter CAM
3.17 Data Cache Snoop
Interface
3.18 Snooping
the Data Cache for Cache Coherency, The MOESI protocol
3.19 Snooping
the Data Cache for Cache Coherency, The Snoop Tag RAM
3.20 Snooping
the L1 Data Cache and outstanding stores in LS2
3.21 Snooping
LS2 for loads to recover strict memory ordering in shared memory
3.22 Snooping
the TLB Flush Filter CAM
|
|
Index
Chapter 4,
Opteron's
Instruction Cache and Decoding
|
|
4.1 Instruction
Cache: More than instructions alone
4.2 The General Instruction Format
4.3 The
Pre-decode bits
4.4 Massively Parallel
Pre-decoding
4.5 Large
Workload Branch Prediction
4.6 Improved Branch Prediction
4.7 The Branch Selectors
4.8 The
Branch Target Buffer
4.9 The Global
History Bimodal Counters
4.10 Combined Local and Global Branch Prediction with
three branches per line
4.11 The
Branch Target Address Calculator, Backup for the Branch Target
Buffer
4.12 Instruction Cache Hit / Miss detection,
The Current Page and the BTAC
4.13 Instruction
Cache Snooping
|
|
Chapter
1, The
Integer Core: The
Integer Super Highway
|
|
|
|
|
|
1.1 The
Integer Super Highway
|
The die photo of the Integer core is dominated by the 64 bit data busses running North-South. At some points the density may reach up to twenty busses. The busses carry all the source and result operand data between the Integer units. The layout of the busses is bit-interleaved, meaning that equal bit numbers are grouped together: bit 0 of every bus lies at one side of this Integer super highway, while bit 63 of every bus lies at the other side. The separation into individual bytes is also clearly visible.
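The bit-interleaved ordering can be pictured as a simple index calculation. The sketch below is only an illustration: the names `track_position` and `byte_group`, the bus count of 20 and the choice to number tracks from the bit-0 side are assumptions, not measurements taken from the die.

```python
N_BUSSES = 20   # assumption: the densest case, "up to twenty busses"
BUS_WIDTH = 64  # each bus carries 64 bits

def track_position(bus: int, bit: int, n_busses: int = N_BUSSES) -> int:
    """Wire track of one bus bit, counted from the bit-0 side of the highway.

    Bit-interleaved: equal bit numbers of all busses sit next to each other.
    """
    if not (0 <= bus < n_busses and 0 <= bit < BUS_WIDTH):
        raise ValueError("bus or bit out of range")
    return bit * n_busses + bus

def byte_group(bit: int) -> int:
    """Byte lane (0..7) that a bit belongs to; these form the visible bands."""
    return bit // 8
```

With these assumptions all twenty bit-0 wires occupy tracks 0..19, and the bit-63 wires occupy the last twenty tracks at the opposite side.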
|
|
1.2
Three way Super Scalar CISC architecture
|
|
The Opteron is a 3 way super-scalar processor. It can decode, execute and retire three x86 instructions per cycle. These instructions can be quite complex (CISC) operations involving more than two source operands. The Pentium 4 handles 3 so-called uOps per cycle, where several of these uOps may be needed to implement a single x86 instruction. It may be that Prescott, the follow-up of the Pentium 4, can handle four uOps per cycle, as we revealed earlier.
In general an x86 instruction can be expressed as F( reg,reg ), F( reg,mem ) or F( mem,reg ), where the first operand is both source and destination. The first two forms are general for Integer, MMX and SSE(2). The last form is found basically in Integer instructions: one source operand is loaded from memory and the result is written back to the same location. The Integer Pipeline handles Loads and Stores for all operations, including those for Floating Point and Multimedia instructions.
|
|
Overview
of Opteron's Processor
Core
|
|
|
1.3
A third class of Instructions: Double Dispatch Operations
|
The original Athlon (which we'll refer to as the Athlon 32) classifies instructions as either Direct Path or Vector Path. The first class contains all the less complex instructions that the hardware can handle as a single operation. The more complex instructions (Vector Path) invoke the micro sequencer, which executes a microcode program: instructions are read from the microcode ROM and inserted into the 3-way pipeline.
The Opteron introduces a third instruction class: the Double Dispatch instructions, or simply "Doubles". The Doubles are generated near the end of the decoding pipeline. Instructions that either followed the "Direct Path" or were generated by the microcode sequencer are split into two independent instructions. The 3-way pipeline can thus generate up to six instructions per cycle. The instructions are "re-packed" back to three again in the PACK stage. This extra pipeline stage has often been the subject of speculation since Opteron's introduction at the 2001 Microprocessor Forum. The six-fold symmetry of the "doubling stage" is clearly visible on the die plot above.
|
1.4
128
bit SSE(2) instructions are split into Doubles
|
|
Appendix C of Opteron's Optimization Guide specifies to which class each instruction belongs. Most 128 bit SSE and SSE2 instructions are implemented as double dispatch instructions. Only those that cannot be split into two independent 64 bit operations are handled as Vector Path (microcode) instructions. Those SSE2 instructions that operate on only one half of a 128 bit register are implemented as a single (Direct Path) instruction. There are both advantages and disadvantages performance-wise here. A disadvantage may be that the decode rate of 128 bit SSE2 instructions is limited to 1.5 per cycle, since each one occupies two of the three pipeline slots. In general however this is not a performance limiter, because the maximum throughput is limited by the FP units and the retirement hardware to a single 128 bit SSE instruction per cycle. More important is that the extra cycle of latency that a Pentium 4 style implementation would bring is avoided.
|
|
1.5
Using Doubles for 128
bit SSE(2) instructions avoids a 25% latency penalty
|
|
In the Pentium 4 an SSE2 instruction is split at a later stage, in the Floating Point unit itself. The Floating Point unit accepts 128 bit source data at its first stage. It then splits the operation in two and combines the two results at the end into one single 128 bit result. This effectively adds one extra cycle to the total latency. For instance: the x87 FADD and FMUL take 5 and 7 cycles, while the 128 bit (2x64) SSE2 equivalents need 6 and 8 cycles. The Opteron, like the Athlon 32, handles both FADD and FMUL in 4 cycles. The SSE2 equivalents are handled with the same 4 cycle latency. An extra cycle would mean a latency increase of 25%, a serious performance limiter, so the correct decision has been made here. If you were to look at a highly pipelined FP unit in action you would see mostly bubbles and few instructions: instructions are waiting for the results of others that have yet to finish. Latency is more important here than bandwidth.
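The percentages behind this argument are easy to check. The short sketch below only restates the cycle counts quoted above; the table names and the helper `relative_increase` are made up for the illustration.

```python
# (x87 latency, 128 bit SSE2 latency) in cycles, as quoted in the text
PENTIUM4 = {"FADD": (5, 6), "FMUL": (7, 8)}
OPTERON = {"FADD": (4, 4), "FMUL": (4, 4)}

def relative_increase(base_cycles: int, new_cycles: int) -> float:
    """Fractional latency increase when going from base_cycles to new_cycles."""
    return (new_cycles - base_cycles) / base_cycles

# One hypothetical extra cycle on top of Opteron's 4 cycle FADD / FMUL
opteron_penalty = relative_increase(4, 4 + 1)

for name, table in (("Pentium 4", PENTIUM4), ("Opteron", OPTERON)):
    for op, (x87, sse2) in table.items():
        print(f"{name} {op}: {relative_increase(x87, sse2):.0%} extra latency as SSE2")
print(f"Opteron with one extra cycle: {opteron_penalty:.0%}")
```

The shorter the base latency, the more a single extra cycle costs in relative terms, which is why the penalty would weigh heavier on a 4 cycle unit than on the 5 and 7 cycle units of the Pentium 4.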
The next Pentium, code-named Prescott, has an extra Floating Point Multiplier and Adder, as we revealed earlier. We now think that the extra FP units are used for single port but full 128 bit operation. This would bring the SSE2 latencies for Add and Multiply back to 5 and 7 cycles, which is beneficial for single thread programs. It would also double the Floating Point bandwidth, which is mainly interesting for Hyper Threading performance.
|
|
1.6
Doubles used for some Integer and x87 instructions as well
|
|
The Double Dispatch instructions are not only used for SSE and SSE2 instructions. Appendix C of Opteron's Optimization Guide also lists classic x86 instructions like POP and PUSH, some of the integer multiplications and the LEAVE instruction. All these instructions are handled by microcode on the Athlon 32, which is a lot slower. A number of classical x87 instructions are now handled by Doubles as well, for instance those FP instructions that have an integer source operand which first needs to be converted to floating point.
|
|
1.7 Doubles handle 128 bit memory accesses
|
|
The 128 bit memory references used for SSE and SSE2 are likewise split up into two independent 64 bit accesses, which are handled by the integer core. The results are snooped from the Load Data busses of the Data Cache by the Floating Point Core.
The decision to extend the Integer Registers from 32 bit to 64 bit and to split the 128 bit SSE(2) instructions into two 64 bit ones results in an elegant all 64 bit Micro Architecture.
There is a significant advantage in having an L1 Data Cache that can handle 128 bit loads or stores as two independent 64 bit loads or stores per cycle. Two 64 bit loads from different addresses into a single 128 bit SSE2 register with two moves are just as fast as loading a single 128 bit word from memory. Apple had a decent argument for introducing a 128 bit data type containing four 32 bit floating point values, which as such is usable for high quality ARGB color image data (given its customer base). Two 64 bit floating point numbers in a 128 bit word don't seem to serve any practical commercial application (other than making life miserable for compiler builders...). Providing separate 64 bit loads and stores at a rate of two per cycle gives a compiler a better chance to combine unrelated 64 bit operations into a single 128 bit one.
|
|
1.8 Address
Additions before Scheduling
US
Patent 6,457,115.
|
|
A single x86 instruction may need many source operands when memory is involved:
address = base + index << scale + displacement + segment
Up to four arguments are needed to calculate the address (ignoring the 2 bit "scale field" hard-coded in the instruction). This means that a typical x86 instruction of the format F(reg,mem) needs no fewer than 5 input operands! Now one of the parameters is a constant given by the instruction itself (displacement). Another parameter (segment) is a "semi-constant" and is typically zero in modern code with a non-segmented flat memory space. The Athlon 32 adds the segment to the address only when needed, after one of the three AGUs (Address Generator Units) has calculated the linear address. It does so during the Data Cache access, which causes an extra cycle of cache load latency. The Opteron has a different implementation. The displacement and segment are summed together before the actual address calculation.
The segment value is considered a constant and thus, just like the displacement, known during decoding. The addition is made during decoding/dispatch and the result is passed on together with the rest of the instruction bits as a new "displacement field" of the instruction.
An exception is generated whenever the segment value does change. The results of operations depending on it are cancelled and the pipeline is restarted from the right point.
The "Decode-Time" address adder might be used for other address additions as well. (The 64 bit mode gets rid of most of the segmentation.)
|
|

|
One example is the new Relative Address mode, which adds the 64 bit Instruction Pointer (RIP) and the 32 bit displacement from the instruction together. By reducing the number of input parameters as much as possible during decoding we end up with a maximum of four input parameters for each instruction. Three of them are register variables and the fourth one is a constant.
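The effect of folding the constants at decode time can be sketched in a few lines. This is a behavioral illustration only, assuming 64 bit wrap-around arithmetic; the function names `fold_at_decode`, `agu_address` and `rip_relative_const` are hypothetical and do not describe the actual adder hardware.

```python
MASK64 = (1 << 64) - 1

def fold_at_decode(displacement: int, segment_base: int) -> int:
    """Decode-time adder: merge the two (semi-)constants into one CONST field."""
    return (displacement + segment_base) & MASK64

def agu_address(base: int, index: int, scale: int, const: int) -> int:
    """AGU add with only three inputs left: base + (index << scale) + CONST."""
    return (base + (index << scale) + const) & MASK64

def rip_relative_const(rip: int, disp32: int) -> int:
    """64 bit mode: fold RIP and the sign-extended 32 bit displacement."""
    if disp32 & 0x8000_0000:
        disp32 -= 1 << 32
    return (rip + disp32) & MASK64

const = fold_at_decode(displacement=0x10, segment_base=0x2000)
address = agu_address(base=0x1000, index=3, scale=2, const=const)
```

The AGU result equals the full four-term sum, but the segment addition no longer sits in the cache access path.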
|
|
|
|
|
|
|
|
|
1.9
Register Renaming and Out-Of-Order Processing
|
|
The Athlon (and Opteron) use some clever tricks to handle Register Renaming and Out-Of-Order (OOO) processing, which allow them to shave some 25% off the integer pipeline. The design allows for a simple and fast scheduler that doesn't need special hardware to handle mis-scheduling caused by cache misses.
Register renaming is used to eliminate "False Dependencies", which limit the number of Instructions Per Cycle (IPC) that a processor can execute. False Dependencies are the result of a limited number of registers. A register that holds an intermediate result soon needs to be re-used for another, maybe unrelated, calculation. Its value is then overwritten and not available anymore. The instruction that overwrites it must always wait for the instruction that needs the previous result. This serializes the execution of the instructions and limits the IPC. This is especially true for an architecture like x86, which has a very small number of registers. The example below shows how register renaming can eliminate false data dependencies. Register rC is overwritten by the 3rd instruction, so the 3rd instruction has to wait for the 2nd instruction: a False Dependency. With register renaming we can use an "arbitrarily" large register file. There is no need to re-use rC (r3). We can simply use another available register instead, register r7 in this case. The basic rule is that all of the instructions that are "in-flight" are given a different destination register (single assignment).
Non Renamed: rC=rA+rB; rF=rC&rD; rC=rA-rB;
Renamed: r3=r1+r2; r6=r3&r4; r7=r1-r2;
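The single assignment rule can be demonstrated with a generic textbook renamer. The sketch below is not the Opteron mechanism (that one is described in the following sections); it hands out a fresh physical register for every destination, so its numbering differs from the r3/r6/r7 example above. The names `rename`, `program` and the `p0..` registers are made up for the illustration.

```python
from itertools import count

def rename(program, arch_regs):
    """Give every destination a fresh physical register (single assignment).

    program: list of (dst, src_a, op, src_b) using architectural register names.
    Returns the same instructions expressed in physical register names.
    """
    mapping = {name: f"p{i}" for i, name in enumerate(arch_regs)}
    fresh = count(len(arch_regs))
    renamed = []
    for dst, src_a, op, src_b in program:
        pa, pb = mapping[src_a], mapping[src_b]  # read sources before remapping dst
        pd = f"p{next(fresh)}"
        mapping[dst] = pd
        renamed.append((pd, pa, op, pb))
    return renamed

program = [("rC", "rA", "+", "rB"),
           ("rF", "rC", "&", "rD"),
           ("rC", "rA", "-", "rB")]
renamed = rename(program, ["rA", "rB", "rC", "rD", "rE", "rF"])
```

After renaming, the third instruction writes a different register than the first, so it no longer has to wait for the second instruction to read the old value.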
|
|
1.10 Renaming
the Integer Registers
|
|
Opteron has sixteen 64 bit architectural integer registers. Not visible to the programmer are eight more 64 bit scratch registers, used to store intermediate results for microcode routines that handle more complex x86 instructions. The Athlon family of processors handles Register Renaming in the simplest possible way. That is a compliment, because it often takes a lot of smart thinking to figure out how to do things in the simplest way! People only rarely succeed in this ...
As we said, each instruction in flight needs a different destination register. The total number of renamed registers must be equal to or larger than the number of instructions in flight plus the number of architectural registers. The maximum number of instructions in flight is 72; add the 16 architectural and 8 scratch registers and you need 96 "renamed registers". Two different structures are used to maintain these registers. The results of the instructions in flight are maintained by the result fields of the 72 entry Re-Order Buffer (ROB), and the architectural registers are maintained by the "Integer Future File and Register File" (IFFRF).
|
|
Re-Order-Buffer Tag definition
| wrap bit | Instruction In Flight Number: re-order buffer index 0..23 | Instruction In Flight Number: sub-index 0..2 |
| bit 7 | bits 6..2 | bits 1..0 |
|
This configuration allows for a very simple renaming scheme which takes -zero- cycles... Each instruction dispatched from one of the three decode lanes gets a "Re-Order Buffer Tag" or "Instruction In Flight Tag" consisting of:
1) A sub-index 0, 1 or 2, which identifies from which of the three lanes the instruction was dispatched.
2) A value 0..23 that identifies the "cycle" in which the instruction was dispatched. The "cycle counter" wraps to 0 after reaching 23.
3) A wrap bit. When two instructions have different wrap bits, the cycle counter has wrapped between the dispatches.
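The tag can be sketched as an 8 bit value. The bit positions used below (bit 7 for the wrap bit, five bits for the index, two bits for the sub-index) are read off the tag definition table; the function names, and the assumption that lane 0 holds the oldest instruction of a line, are ours.

```python
ROB_LINES = 24   # "cycle" values 0..23
LANES = 3        # sub-index 0..2, so 24 * 3 = 72 instructions in flight

def pack_tag(wrap: int, index: int, lane: int) -> int:
    """Build the 8 bit tag: wrap bit | 5 bit line index | 2 bit lane."""
    assert wrap in (0, 1) and 0 <= index < ROB_LINES and 0 <= lane < LANES
    return (wrap << 7) | (index << 2) | lane

def unpack_tag(tag: int):
    """Return (wrap, index, lane)."""
    return (tag >> 7) & 1, (tag >> 2) & 0x1F, tag & 0x3

def next_line(wrap: int, index: int):
    """Advance the dispatch 'cycle counter'; flip the wrap bit after line 23."""
    if index + 1 == ROB_LINES:
        return wrap ^ 1, 0
    return wrap, index + 1

def is_older(tag_a: int, tag_b: int) -> bool:
    """True if a was dispatched before b; both must still be in flight."""
    wrap_a, index_a, lane_a = unpack_tag(tag_a)
    wrap_b, index_b, lane_b = unpack_tag(tag_b)
    if wrap_a == wrap_b:
        return (index_a, lane_a) < (index_b, lane_b)
    # Different wrap bits: the counter wrapped in between, so the
    # instruction with the larger index is the older one.
    return (index_a, lane_a) > (index_b, lane_b)
```

Because the tag follows directly from the dispatch lane and the cycle counter, no allocation step is needed, which is one way to read the "zero cycles" remark.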
|
|
1.11
The IFFRF: Integer Future
File and Register File
|
|
This register file is used to maintain the 16 architectural registers and the 8 temporary scratch registers. It has two entries for each of the 16 architectural registers. One of the two can be viewed as the actual register as seen by the programmer. It gets its value when the instruction that produced it has "retired". An instruction is retired when it is certain that no exception or branch misprediction has occurred and all preceding instructions have been retired as well. The value of the register is then said to be "non-speculative".
|
|
40 entry Integer Future File and Register File: IFFRF
| 16 entries | Retired Architectural Register Values |
| 16 entries | Speculative Register Values: "Future File" |
| 8 entries | Temporary Registers |
|
Instructions-In-Flight and their results may be cancelled and discarded as long as they have not been retired. Cancellation can be the result of a preceding instruction that caused an exception, or of a branch misprediction. Instructions-In-Flight are in principle always speculative. The results stay speculative even if the instruction has finished. The results only become non-speculative at retirement, when the retirement logic determines that no exception has occurred.
|
|
1.12 The
Future
File section of the IFFRF
|
|
The second entry for each Architectural Register holds the so-called "Future" value. The 16 of them together constitute the so-called Future File. These entries contain the most recent value produced for a certain architectural register by any instruction (retired or not yet retired). The contents of a Future File register are speculative as long as the producing instruction has not yet retired. The value becomes non-speculative once the producing instruction successfully retires.

The
Future File origins go back to 1985
Instructions write into the Future File as soon as their result is produced. The Future File however does not accept the result if it is not the very latest result for a certain register. If a later instruction has managed to finish earlier and has already written its result into the Future File, then the Future File will not accept results for that register from older instructions anymore. Finished instructions address the Future File with the register number from the instruction code, a number from 0 to 15 for the 16 architectural registers. The "Re-Order Buffer Tag" is used to determine if a result may be overwritten. Each Future File entry has a corresponding Tag. We will see that an instruction may only write into the Future File entry if it still owns the entry, that is, if the Tags match.
|
|
1.13
Exception
or Branch Miss-prediction: Overwrite Speculative values with Retired values
|
|
All speculative results are cancelled by copying the retired values of the IFFRF over the Future File values of the IFFRF.
The speculative results must be cancelled whenever the retirement logic detects that an exception occurred when the instruction or an earlier one was executed. There are many types of exceptions. Memory accesses can encounter a Page Miss, or they can erroneously access a memory area which they are not entitled to access. The divide by zero is another well known exception. (It shouldn't be for Floating Point numbers, because +/- infinity are perfectly valid IEEE Floating Point values.)
When we say Speculative Results here, we mean more specifically the results that may need to be cancelled because of erroneous Control Flow Speculation: the program flow went in a different direction than predicted. Now:
- A branch misprediction is basically the same as any other exception, but...
- All exceptions are also branch mispredictions.
An exception causes a change in the program flow much like a conditional call. All instructions that can cause exceptions are thus in fact conditional control flow instructions. Exceptions are however always predicted as not taken and are ignored by the branch prediction hardware.
|
|
1.14 The Reorder Buffer
|
|
We have mentioned retirement a number of times now. Retirement is handled with the aid of the reorder buffer. This unit does what its name suggests: it re-orders the instructions, putting them back into the original program order. The Schedulers are responsible for Out-Of-Order execution. They launch instructions to the execution units whenever all their source operands are available and the needed execution unit is free. It is the reorder buffer that brings the instructions back into order again.
|
|
Figure: Operation of the Reorder Buffer. The grid has reorder buffer indices 1..12 as columns and lanes 0..2 as rows. Its cells were color-coded with three states: Out-Of-Order finished instructions whose results are still speculative, instructions being retired now, and retired instructions that are not speculative anymore.
|
|
The reorder buffer itself is split into three identical lanes. Each lane has 24 entries. The lanes and entries correspond to the reorder buffer Tag assigned to each instruction. Each instruction that finishes writes its result into the reorder buffer, using the reorder buffer Tag as address. The instruction also stores any events that happened during execution and that will require an exception. In particular, conditional branch instructions may report that the branch address they calculated does not correspond with the address that was predicted.
The instruction leaves further information that the reorder buffer needs to do its work. It leaves some info on what kind of instruction it is. It leaves the architectural register address (0..15) that corresponds with its destination register. It also leaves the address where it is located in instruction memory. Some of this info may already have been left in the reorder buffer earlier, when the instruction received its reorder buffer Tag.
The reorder buffer is shared by all instructions. It is also used to reorder Floating Point, SSE(2) and MMX instructions. These instructions however do not write their result data into the reorder buffer. They use the 120 entry renamed floating point register file for that purpose. The reorder buffer is still used to track the instruction code info and address, exception flags, ready status and retirement status. All instructions are retired with the aid of the reorder buffer.
|
|
1.15 Retirement and Exception Processing
|
|
The image shows how instructions can be retired at the moment when all previous instructions are retired. Three instructions can be retired per cycle. The Instruction Control Unit (ICU) accesses the reorder buffer contents for the three instructions. The instructions are retired if no exception flags are set. The result data is written to the Retired Entries of the IFFRF. The latter is basically a write-only process: these values are only used in case of an exception, when they are used to overwrite the speculative values of the Future File. The ICU handles a branch misprediction by forwarding the instruction's address to the Instruction Fetch Unit at the beginning of the pipeline. The branch will then be re-executed, now with the right prediction. Other exceptions require an exception routine call. The ICU can for instance save some relevant data in the temporary registers of the IFFRF and invoke the exception call or a microcode function.
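In-order retirement can be sketched as a loop over the oldest reorder buffer entries. This is a simplified model under stated assumptions: the reorder buffer is a plain list with the oldest entry first, and `RobEntry` and `retire_cycle` are hypothetical names. It does not model the three lanes or the Double Dispatch pairing rule.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RobEntry:
    dst: Optional[int]               # architectural destination register, if any
    value: int = 0                   # result written when the instruction finished
    done: bool = False               # finished executing (possibly out of order)
    exception: Optional[str] = None  # event recorded during execution

def retire_cycle(rob: List[RobEntry], retired_regs: List[int],
                 width: int = 3) -> Optional[str]:
    """Retire up to `width` of the oldest entries, strictly in program order.

    Returns the exception of the oldest faulting instruction, or None.
    """
    for _ in range(width):
        if not rob or not rob[0].done:
            break                    # oldest instruction not finished: stop here
        entry = rob[0]
        if entry.exception is not None:
            rob.clear()              # discard it and everything younger
            return entry.exception
        rob.pop(0)
        if entry.dst is not None:
            retired_regs[entry.dst] = entry.value
    return None
```

A finished instruction behind an unfinished one stays in the buffer, which is exactly the re-ordering effect: results become architectural only in program order.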
|
|
1.16
Exception Processing
is always delayed until Retirement
|
|
Exception processing must always be delayed until the instruction that caused the exception is not speculative anymore. A memory access exception, for instance, may be caused by accessing an array with an index that is out of bounds. The program may have a test for out-of-bounds indices and code to handle them. The branch prediction hardware however will most likely predict that the out-of-bounds test is Not True, because the index is OK most of the time. The processor will thus access the array with an out-of-bounds index anyway and may well cause a memory access exception. Exception handling is delayed until retirement, where the instruction plus its exception flags is discarded because of the branch misprediction. It is for the same reason that speculative Stores to memory are delayed and held in the LSU (Load Store Unit) until retirement.
|
|
1.17
Retirement of Vector Path and Double Dispatch Instructions
|
|
A single Vector Path instruction may produce many instructions. The micro sequencer inserts these instructions into the 3 way pipeline, three per cycle. They do not mix with Direct Path instructions during decoding and retirement. The actual retirement takes place when the last line of 3 instructions is ready. The retirement hardware scans from the first line of 3 microcode generated instructions to the last line, accumulating all exceptions that occurred. Retirement follows if no exception has occurred; otherwise the appropriate exception call is made.
Instructions generated by Doubles can mix with other (Direct Path) instructions during decoding and retirement. The two instructions generated by a Double must however retire simultaneously: imagine a PUSH that does retire the memory store but doesn't retire the Stack Pointer update. This leads to the limitation that both instructions generated by a Double must be in the same 3 instruction line during retirement.
|
|
1.18
Out-Of-Order
processing: Instruction
Dispatch
|
|
We are now ready to describe in greater detail how Out-Of-Order processing is handled. We go back to the Instruction Dispatch stage. Instruction Dispatch here means that instructions are sent to the Schedulers. They are not sent to the execution units yet. The instructions do access the Register File though. Three instructions can look up a total of nine register values from the IFFRF each cycle. The Future File entries are accessed. The Future File contains the latest speculative results for each of the 16 architectural registers. The three instructions are then placed, together with these register values, into the three Schedulers.
Now it is highly likely that not all previous instructions have finished, so many of the register values are stale values from earlier instructions. Each instruction that is dispatched clears the valid flag for the architectural register it will modify. It also leaves its Tag. Succeeding instructions now know that the Future File entry is not valid anymore, but they also know the Tag of the instruction that will provide the data they need. They will use this Tag later to pick up the result directly from the result busses. The instruction that invalidated the register will later finish, write its result to the Future File (if it still owns the entry) and then set the valid bit again. The instruction has ownership over an entry in the Future File if the Tags match. It acquires ownership when it is dispatched, and it loses ownership if another instruction is dispatched that has the same destination register. An instruction writes its result into the Future File only if it is still the owner. Subsequent instructions can pick up the result from there. If the instruction is not the owner anymore, then it won't modify the Future File entry. Any other instruction that needed this result was already dispatched and picks up the result directly from the result busses.
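The ownership rule can be captured in a small behavioral model. This is a sketch under assumptions, not the hardware design: the class `FutureFile` and its method names are invented for the illustration, tags are plain integers, and the 8 temporary registers are left out.

```python
class FutureFile:
    """Toy model of the IFFRF entries for the 16 architectural registers."""

    def __init__(self, n_regs: int = 16):
        self.retired = [0] * n_regs    # non-speculative values
        self.future = [0] * n_regs     # latest (speculative) values
        self.valid = [True] * n_regs   # is future[r] usable as a source?
        self.owner = [None] * n_regs   # tag of the last dispatched writer

    def dispatch(self, dst: int, tag: int) -> None:
        """A newly dispatched instruction takes ownership of its destination."""
        self.valid[dst] = False
        self.owner[dst] = tag

    def read(self, reg: int):
        """Source lookup at dispatch: a value, or the tag to wait for."""
        if self.valid[reg]:
            return ("value", self.future[reg])
        return ("tag", self.owner[reg])

    def finish(self, dst: int, tag: int, value: int) -> bool:
        """Write a result only if the finishing instruction still owns the entry."""
        if self.owner[dst] != tag:
            return False
        self.future[dst] = value
        self.valid[dst] = True
        return True

    def retire(self, dst: int, value: int) -> None:
        """Retirement writes the non-speculative copy."""
        self.retired[dst] = value

    def squash(self) -> None:
        """Exception or misprediction: retired values overwrite the Future File."""
        self.future = list(self.retired)
        self.valid = [True] * len(self.retired)
        self.owner = [None] * len(self.retired)
```

If two instructions with the same destination are dispatched back to back, only the second one may write the entry, no matter which of the two finishes first.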
|
|
1.19 The Schedulers / Reservation
Stations
|
|
Each of the (up to) three instructions that are dispatched gets assigned to a Reservation Station within the Scheduler it is sent to. Each Scheduler has eight Reservation Stations. That's up from six in the Athlon 32 and up from five in the first Athlon prototypes. The Reservation Station gathers all remaining input data needed for the instruction from the result busses. It monitors the Tags of these busses to see if the instructions from which data is needed are about to produce their results (Register File Bypass). The Tag busses run one cycle in advance of the result data busses. The Reservation Station does not need to look at all the busses. The Tag's sub-index identifies which of the three ALUs will produce the result. It also knows if the data will come from one of the two cache read ports. It can select the Tag bus in advance rather than having to test all the Tags.
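The tag matching done by a Reservation Station entry can be sketched as follows. The sketch is deliberately simplified: the class name `Operand` is hypothetical, and the one cycle lead of the Tag busses over the data busses and the pre-selection of a single bus are not modeled, so tag and value arrive together here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Operand:
    """One source slot: either a captured value or the tag it is waiting for."""
    value: Optional[int] = None
    tag: Optional[int] = None   # tag of the producing instruction while pending

    @property
    def ready(self) -> bool:
        return self.tag is None

    def snoop(self, bus_tag: int, bus_value: int) -> None:
        """Capture the result from a result bus if it carries the awaited tag."""
        if self.tag is not None and self.tag == bus_tag:
            self.value = bus_value
            self.tag = None

pending = Operand(tag=7)
pending.snoop(bus_tag=6, bus_value=1)    # someone else's result: ignored
still_waiting = not pending.ready
pending.snoop(bus_tag=7, bus_value=42)   # the awaited producer: captured
```

An instruction becomes a launch candidate once every one of its operand and flag slots reports ready.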
|
|
The Scheduler's Reservation Station Entries
| Instruction Data | "CONST": 64 bit Displacement + Segment or Instruction Pointer | |
| Input Data | VALUE: 64 bit register 'A' | TAG: reg. 'A' |
| Input Data | VALUE: 64 bit register 'B' or 64 bit Index register | TAG: reg. 'B' / Index |
| Input Data | VALUE: 64 bit Base register | TAG: Base reg. |
| Input Status | VALUE: 4 bit ZAPS flags: Zero, Aux, Parity, Sign | TAG: ZAPS flags |
| Input Status | VALUE: 1 bit OF/C flag: either OverFlow or Carry | TAG: OF/C flag |
|
Until now we've neglected the x86 status flags. Many x86 instructions use one or more of the six x86 status flags as an input. An x86 instruction may change some, all or none of the status flags. This means that different flags may be produced by different instructions. Luckily there are two rules that help to simplify the scheduler.
Rule 1: An instruction that modifies any of the ZAPS flags (Zero, Aux, Parity, Sign) modifies all of them. This means that these can be handled by a single 4 bit entry in the reservation station.
Rule 2: An instruction that uses the OverFlow flag (signed integer) does not use the Carry flag (unsigned integer). A single 1-bit reservation station entry can be used for the one which is needed.
|
|
1.20
Each x86 instruction can launch both an ALU and an AGU operation
|
|
A single x86 instruction waiting in a Reservation Station of one of the Schedulers can launch up to two operations. It can launch an integer operation to its associated ALU and it can launch a memory operation to its AGU (Address Generator Unit). The simplest integer instructions, of type F( reg,reg ), do not access memory and launch an ALU operation only. Integer instructions of type F( reg,mem ) launch a memory load first and subsequently launch an ALU operation when the load data arrives. Integer instructions of type F( mem,reg ) are implemented in the same way, but the Load is now a Load/Store operation. The Load/Store stays in the LSU (Load Store Unit), where it waits for the result of the ALU operation to be stored in memory. Non-integer instructions, such as Floating Point and Multi Media instructions, that specify a memory access launch an AGU operation only. The Floating Point / MMX operation itself is then handled by the Floating Point Unit. Each Scheduler can launch one ALU and one AGU operation per cycle. The ALU operation may come from one x86 instruction while the AGU operation may come from another.
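The cases listed above can be summarized as a small lookup. The function name `operations_for` and the string labels are invented for this sketch; combinations that the text does not describe raise an error rather than guess.

```python
def operations_for(kind: str, form: str) -> list:
    """Operations launched from a Reservation Station, in launch order.

    kind: "int" or "fp" (Floating Point / Multi Media)
    form: "reg,reg", "reg,mem" or "mem,reg"
    """
    if kind == "int":
        if form == "reg,reg":
            return ["ALU"]                    # no memory access
        if form == "reg,mem":
            return ["AGU load", "ALU"]        # ALU waits for the load data
        if form == "mem,reg":
            return ["AGU load/store", "ALU"]  # store waits in the LSU for the ALU result
    if kind == "fp" and "mem" in form:
        return ["AGU"]                        # the FP unit does the arithmetic itself
    raise ValueError(f"case not covered by the text: {kind} {form}")
```

For example, `operations_for("int", "mem,reg")` lists the Load/Store first and the ALU operation second, matching the order in which the two are launched.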
|
|
1.21 The Scheduling of an ALU operation
US
Patent 6,535,972.
|
|
An ALU operation generally needs two register operands and optionally some status bits. An x86 instruction that accesses memory will leave the Load value in register 'B'. The Reservation Station waits until it has all needed input operands (data and status). The Scheduler observes all eight Reservation Stations and will launch the ALU operation if it is the oldest instruction that is ready to launch. The Scheduler sends all operands plus instruction information to the ALU that is associated with it.
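The "oldest ready first" selection can be sketched in a few lines. The sketch assumes each occupied Reservation Station carries an `age` number (smaller means dispatched earlier) and a `ready` flag; both field names and the function `pick_oldest_ready` are hypothetical, and real hardware would derive the age from the reorder buffer Tags.

```python
def pick_oldest_ready(stations):
    """Return the oldest station whose operands are all available, or None.

    stations: list of dicts with 'age' and 'ready' keys; None marks an empty slot.
    """
    ready = [s for s in stations if s is not None and s["ready"]]
    return min(ready, key=lambda s: s["age"], default=None)

scheduler = [
    {"age": 12, "ready": False},  # oldest, but still waiting for an operand
    {"age": 15, "ready": True},
    None,                         # empty Reservation Station
    {"age": 13, "ready": True},
] + [None] * 4                    # eight stations in total

chosen = pick_oldest_ready(scheduler)
```

Here the instruction with age 13 is launched: the older one with age 12 is skipped because it is not ready, which is precisely what makes the execution out of order.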
Reservation Station entries typically involved in an ALU operation:
| Instruction Data | "CONST": 64 bit Displacement + Segment or Instruction Pointer | |
| Input Data | VALUE: 64 bit register 'A' | TAG: reg. 'A' |
| Input Data | VALUE: 64 bit register 'B' or 64 bit Index register | TAG: reg. 'B' / Index |
| Input Data | VALUE: 64 bit Base register | TAG: Base reg. |
| Input Status | VALUE: 4 bit ZAPS flags: Zero, Aux, Parity, Sign | TAG: ZAPS flags |
| Input Status | VALUE: 1 bit OF/C flag: either OverFlow or Carry | TAG: OF/C flag |
The Reservation Station does not actually need to catch the last operand(s) itself. The Reservation Station can be bypassed. The ALU may receive the number of the bus which will carry the last result, so it can catch the operand itself. If you take a look at the die photo you see that all three ALUs are next to each other, even though each receives operations only from its own scheduler. The bypass mechanism lets them exchange data directly without the need to go back and forth to the schedulers.
|
|
1.22
The
Scheduling of an AGU operation for memory access
US
Patent 6,457,115.
|
|
We saw how a single x86 instruction may need up to four arguments to calculate the memory address (ignoring the 2 bit scale field hard-coded in the instruction). This includes up to two register variables (base and index). We also saw how displacement and segment can be added together already during instruction decoding. Segment is considered a semi-constant; a restore mechanism is provided for the rare case that it is changed.
|
Reservation
Station entries typically involved in an AGU operation:
(
address =
base +
index
<< scale
+ displacement +
segment )
| |