PowerPC Virtual Environment Architecture
Book II
Version 2.01
December 2003

Manager: Joe Wetzel/Poughkeepsie/IBM
Technical Content: Ed Silha/Austin/IBM, Cathy May/Watson/IBM, Brad Frey/Austin/IBM

The following paragraph does not apply to the United Kingdom or any country or state where such provisions are inconsistent with local law.

The specifications in this manual are subject to change without notice. This manual is provided “AS IS”. International Business Machines Corp. makes no warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. International Business Machines Corp. does not warrant that the contents of this publication or the accompanying source code examples, whether individually or as one or more groups, will meet your requirements or that the publication or the accompanying source code examples are error-free.

This publication could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication.

Address comments to IBM Corporation, Internal Zip 9630, 11400 Burnett Road, Austin, Texas 78758-3493. IBM may use or distribute whatever information you supply in any way it believes appropriate without incurring any obligation to you.
The following terms are trademarks of the International Business Machines Corporation in the United States and/or other countries:

IBM
PowerPC
RISC/System 6000
POWER
POWER2
POWER4
POWER4+
IBM System/370

Notice to U.S. Government Users: Documentation Related to Restricted Rights. Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corporation.

© Copyright International Business Machines Corporation, 1994, 2003. All rights reserved.

Preface

This document defines the additional instructions and facilities, beyond those of the PowerPC User Instruction Set Architecture, that are provided by the PowerPC Virtual Environment Architecture. It covers the storage model and related instructions and facilities available to the application programmer, and the Time Base as seen by the application programmer.

Other related documents define the PowerPC User Instruction Set Architecture, the PowerPC Operating Environment Architecture, and PowerPC Implementation Features. Book I, PowerPC User Instruction Set Architecture defines the base instruction set and related facilities available to the application programmer. Book III, PowerPC Operating Environment Architecture defines the system (privileged) instructions and related facilities. Book IV, PowerPC Implementation Features defines the implementation-dependent aspects of a particular implementation.

As used in this document, the term “PowerPC Architecture” refers to the instructions and facilities described in Books I, II, and III. The description of the instantiation of the PowerPC Architecture in a given implementation includes also the material in Book IV for that implementation.

Table of Contents

Chapter 1. Storage Model
  1.1 Definitions and Notation
  1.2 Introduction
  1.3 Virtual Storage
  1.4 Single-Copy Atomicity
  1.5 Cache Model
  1.6 Storage Control Attributes
    1.6.1 Write Through Required
    1.6.2 Caching Inhibited
    1.6.3 Memory Coherence Required
    1.6.4 Guarded
  1.7 Shared Storage
    1.7.1 Storage Access Ordering
    1.7.2 Storage Ordering of I/O Accesses
    1.7.3 Atomic Update
  1.8 Instruction Storage
    1.8.1 Concurrent Modification and Execution of Instructions

Chapter 2. Effect of Operand Placement on Performance
  2.1 Instruction Restart

Chapter 3. Storage Control Instructions
  3.1 Parameters Useful to Application Programs
  3.2 Cache Management Instructions
    3.2.1 Instruction Cache Instruction
    3.2.2 Data Cache Instructions
  3.3 Synchronization Instructions
    3.3.1 Instruction Synchronize Instruction
    3.3.2 Load And Reserve and Store Conditional Instructions
    3.3.3 Memory Barrier Instructions

Chapter 4. Time Base
  4.1 Time Base Instructions
  4.2 Reading the Time Base
  4.3 Computing Time of Day from the Time Base

Chapter 5. Optional Facilities and Instructions
  5.1 External Control
    5.1.1 External Access Instructions
  5.2 Storage Control Instructions
    5.2.1 Cache Management Instructions
  5.3 Little-Endian

Appendix A. Assembler Extended Mnemonics
  A.1 Synchronize Mnemonics

Appendix B. Programming Examples for Sharing Storage
  B.1 Atomic Update Primitives
  B.2 Lock Acquisition and Release, and Related Techniques
    B.2.1 Lock Acquisition and Import Barriers
    B.2.2 Lock Release and Export Barriers
    B.2.3 Safe Fetch
  B.3 List Insertion
  B.4 Notes

Appendix C. Cross-Reference for Changed POWER Mnemonics

Appendix D. New Instructions

Appendix E. PowerPC Virtual Environment Instruction Set

Index

Last Page - End of Document

Figures

1. Performance effects of storage operand placement
2. Time Base
3. Performance effects of storage operand placement, Little-Endian mode

Chapter 1. Storage Model

  1.1 Definitions and Notation
  1.2 Introduction
  1.3 Virtual Storage
  1.4 Single-Copy Atomicity
  1.5 Cache Model
  1.6 Storage Control Attributes
    1.6.1 Write Through Required
    1.6.2 Caching Inhibited
    1.6.3 Memory Coherence Required
    1.6.4 Guarded
  1.7 Shared Storage
    1.7.1 Storage Access Ordering
    1.7.2 Storage Ordering of I/O Accesses
    1.7.3 Atomic Update
      1.7.3.1 Reservations
      1.7.3.2 Forward Progress
  1.8 Instruction Storage
    1.8.1 Concurrent Modification and Execution of Instructions
1.1 Definitions and Notation

The following definitions, in addition to those specified in Book I, are used in this Book. In these definitions, “Load instruction” includes the Cache Management and other instructions that are stated in the instruction descriptions to be “treated as a Load”, and similarly for “Store instruction”.

■ processor
  A hardware component that executes the instructions specified in a program.
■ system
  A combination of processors, storage, and associated mechanisms that is capable of executing programs. Sometimes the reference to system includes services provided by the operating system.
■ main storage
  The level of the storage hierarchy in which all storage state is visible to all processors and mechanisms in the system.
■ instruction storage
  The view of storage as seen by the mechanism that fetches instructions.
■ data storage
  The view of storage as seen by a Load or Store instruction.
■ program order
  The execution of instructions in the order required by the sequential execution model. (See the section entitled “Instruction Execution Order” in Book I. A dcbz instruction that modifies storage which contains instructions has the same effect with respect to the sequential execution model as a Store instruction as described there.)
■ storage location
  A contiguous sequence of bytes in storage. When used in association with a specific instruction or the instruction fetching mechanism, the length of the sequence of bytes is typically implied by the operation. In other uses, it may refer more abstractly to a group of bytes which share common storage attributes.
■ storage access
  An access to a storage location. There are three (mutually exclusive) kinds of storage access.
  — data access
    An access to the storage location specified by a Load or Store instruction, or, if the access is performed “out-of-order” (see Book III), an access to a storage location as if it were the storage location specified by a Load or Store instruction.
  — instruction fetch
    An access for the purpose of fetching an instruction.
  — implicit access
    An access by the processor for the purpose of address translation or reference and change recording (see Book III).
■ caused by, associated with
  — caused by
    A storage access is said to be caused by an instruction if the instruction is a Load or Store and the access (data access) is to the storage location specified by the instruction.
  — associated with
    A storage access is said to be associated with an instruction if the access is for the purpose of fetching the instruction (instruction fetch), or is a data access caused by the instruction, or is an implicit access that occurs as a side effect of fetching or executing the instruction.
■ prefetched instructions
  Instructions for which a copy of the instruction has been fetched from instruction storage, but the instruction has not yet been executed.
■ uniprocessor
  A system that contains one processor.
■ multiprocessor
  A system that contains two or more processors.
■ shared storage multiprocessor
  A multiprocessor that contains some common storage, which all the processors in the system can access.
■ performed
  A load or instruction fetch by a processor or mechanism (P1) is performed with respect to any processor or mechanism (P2) when the value to be returned by the load or instruction fetch can no longer be changed by a store by P2. A store by P1 is performed with respect to P2 when a load by P2 from the location accessed by the store will return the value stored (or a value stored subsequently). An instruction cache block invalidation by P1 is performed with respect to P2 when an instruction fetch by P2 will not be satisfied from the copy of the block that existed in its instruction cache when the instruction causing the invalidation was executed, and similarly for a data cache block invalidation. The preceding definitions apply regardless of whether P1 and P2 are the same entity.
■ page
  An aligned unit of storage for which protection and control attributes are independently specifiable and for which reference and change status are independently recorded. Two virtual page sizes are supported simultaneously, 4 KB and a larger size. The larger size is an implementation-dependent power of 2 (bytes). Real pages are always 4 KB.
■ block
  The aligned unit of storage operated on by each Cache Management instruction. The size of a block can vary by instruction and by implementation. The maximum block size is 4 KB.
■ aligned storage access
  A load or store is aligned if the address of the target storage location is a multiple of the size of the transfer effected by the instruction.

1.2 Introduction

The PowerPC User Instruction Set Architecture, discussed in Book I, defines storage as a linear array of bytes indexed from 0 to a maximum of 2^64 − 1. Each byte is identified by its index, called its address, and each byte contains a value. This information is sufficient to allow the programming of applications that require no special features of any particular system environment. The PowerPC Virtual Environment Architecture, described herein, expands this simple storage model to include caches, virtual storage, and shared storage multiprocessors. The PowerPC Virtual Environment Architecture, in conjunction with services based on the PowerPC Operating Environment Architecture (see Book III) and provided by the operating system, permits explicit control of this expanded storage model.

A simple model for sequential execution allows at most one storage access to be performed at a time and requires that all storage accesses appear to be performed in program order. In contrast to this simple model, the PowerPC Architecture specifies a relaxed model of storage consistency. In a multiprocessor system that allows multiple copies of a storage location, aggressive implementations of the architecture can permit intervals of time during which different copies of a storage location have different values. This chapter describes features of the PowerPC Architecture that enable programmers to write correct programs for this storage model.

1.3 Virtual Storage

The PowerPC system implements a virtual storage model for applications. This means that a combination of hardware and software can present a storage model that allows applications to exist within a “virtual” address space larger than either the effective address space or the real address space.

Each program can access 2^64 bytes of “effective address” (EA) space, subject to limitations imposed by the operating system. In a typical PowerPC system, each program's EA space is a subset of a larger “virtual address” (VA) space managed by the operating system.

Each effective address is translated to a real address (i.e., to an address of a byte in real storage or on an I/O device) before being used to access storage. The hardware accomplishes this, using the address translation mechanism described in Book III. The operating system manages the real (physical) storage resources of the system, by setting up the tables and other information used by the hardware address translation mechanism.

Book II deals primarily with effective addresses that are in “segments” translated by the “address translation mechanism” (see Book III). Each such effective address lies in a “virtual page”, which is mapped to a “real page” (4 KB virtual page) or to a contiguous sequence of real pages (large virtual page) before data or instructions in the virtual page are accessed.

In general, real storage may not be large enough to map all the virtual pages used by the currently active applications. With support provided by hardware, the operating system can attempt to use the available real pages to map a sufficient set of virtual pages of the applications. If a sufficient set is maintained, “paging” activity is minimized. If not, performance degradation is likely.

The operating system can support restricted access to virtual pages (including read/write, read only, and no access; see Book III), based on system standards (e.g., program code might be read only) and application requests.

1.4 Single-Copy Atomicity

An access is single-copy atomic, or simply atomic, if it is always performed in its entirety with no visible fragmentation. Atomic accesses are thus serialized: each happens in its entirety in some order, even when that order is not specified in the program or enforced between processors.

In PowerPC the following single-register accesses are always atomic:

■ byte accesses (all bytes are aligned on byte boundaries)
■ halfword accesses aligned on halfword boundaries
■ word accesses aligned on word boundaries
■ doubleword accesses aligned on doubleword boundaries

No other accesses are guaranteed to be atomic. For example, the access caused by the following instructions is not guaranteed to be atomic.

■ any Load or Store instruction for which the operand is unaligned
■ lmw, stmw, lswi, lswx, stswi, stswx
■ any Cache Management instruction

An access that is not atomic is performed as a set of smaller disjoint atomic accesses. The number and alignment of these accesses are implementation-dependent, as is the relative order in which they are performed.

The results for several combinations of loads and stores to the same or overlapping locations are described below.

1. When two processors execute atomic stores to locations that do not overlap, and no other stores are performed to those locations, the contents of those locations are the same as if the two stores were performed by a single processor.
2. When two processors execute atomic stores to the same storage location, and no other store is performed to that location, the contents of that location are the result stored by one of the processors.
3. When two processors execute stores that have the same target location and are not guaranteed to be atomic, and no other store is performed to that location, the result is some combination of the bytes stored by both processors.
4. When two processors execute stores to overlapping locations, and no other store is performed to those locations, the result is some combination of the bytes stored by the processors to the overlapping bytes. The portions of the locations that do not overlap contain the bytes stored by the processor storing to the location.
5. When a processor executes an atomic store to a location, a second processor executes an atomic load from that location, and no other store is performed to that location, the value returned by the load is the contents of the location before the store or the contents of the location after the store.
6. When a load and a store with the same target location can be executed simultaneously, and no other store is performed to that location, the value returned by the load is some combination of the contents of the location before the store and the contents of the location after the store.

1.5 Cache Model

A cache model in which there is one cache for instructions and another cache for data is called a “Harvard-style” cache. This is the model assumed by the PowerPC Architecture, e.g., in the descriptions of the Cache Management instructions in Section 3.2, “Cache Management Instructions”. Alternative cache models may be implemented (e.g., a “combined cache” model, in which a single cache is used for both instructions and data, or a model in which there are several levels of caches), but they support the programming model implied by a Harvard-style cache.

The processor is not required to maintain copies of storage locations in the instruction cache consistent with modifications to those storage locations (e.g., modifications caused by Store instructions).

A location in the data cache is considered to be modified in that cache if the location has been modified (e.g., by a Store instruction) and the modified data have not been written to main storage.

Cache Management instructions are provided so that programs can manage the caches when needed. For example, program management of the caches is needed when a program generates or modifies code that will be executed (i.e., when the program modifies data in storage and then attempts to execute the modified data as instructions). The Cache Management instructions are also useful in optimizing the use of memory bandwidth in such applications as graphics and numerically intensive computing. The functions performed by these instructions depend on the storage control attributes associated with the specified storage location (see Section 1.6, “Storage Control Attributes”).

The Cache Management instructions allow the program to do the following.

■ invalidate the copy of storage in an instruction cache block (icbi)
■ provide a hint that the program will probably soon access a specified data cache block (dcbt, dcbtst)
■ set the contents of a data cache block to zeros (dcbz)
■ copy the contents of a modified data cache block to main storage (dcbst)
■ copy the contents of a modified data cache block to main storage and make the copy of the block in the data cache invalid (dcbf)

1.6 Storage Control Attributes

Some operating systems may provide a means to allow programs to specify the storage control attributes described in this section. Because the support provided for these attributes by the operating system may vary between systems, the details of the specific system being used must be known before these attributes can be used.

Storage control attributes are associated with units of storage that are multiples of the page size. Each storage access is performed according to the storage control attributes of the specified storage location, as described below. The storage control attributes are the following.

■ Write Through Required
■ Caching Inhibited
■ Memory Coherence Required
■ Guarded

These attributes have meaning only when an effective address is translated by the processor performing the storage access. All combinations of these attributes are supported except Write Through Required with Caching Inhibited.

Programming Note
    The Write Through Required and Caching Inhibited attributes are mutually exclusive because, as described below, the Write Through Required attribute permits the storage location to be in the data cache while the Caching Inhibited attribute does not.

    Storage that is Write Through Required or Caching Inhibited is not intended to be used for general-purpose programming. For example, the lwarx, ldarx, stwcx., and stdcx. instructions may cause the system data storage error handler to be invoked if they specify a location in storage having either of these attributes.

In the remainder of this section, “Load instruction” includes the Cache Management and other instructions that are stated in the instruction descriptions to be “treated as a Load”, and similarly for “Store instruction”.

1.6.1 Write Through Required

A store to a Write Through Required storage location is performed in main storage. A Store instruction that specifies a location in Write Through Required storage may cause additional locations in main storage to be accessed. If a copy of the block containing the specified location is retained in the data cache, the store is also performed in the data cache. The store does not cause the block to be considered to be modified in the data cache.

In general, accesses caused by separate Store instructions that specify locations in Write Through Required storage may be combined into one access. Such combining does not occur if the Store instructions are separated by a sync instruction or by an eieio instruction.

1.6.2 Caching Inhibited

An access to a Caching Inhibited storage location is performed in main storage. A Load instruction that specifies a location in Caching Inhibited storage may cause additional locations in main storage to be accessed unless the specified location is also Guarded. An instruction fetch from Caching Inhibited storage may cause additional words in main storage to be accessed. No copy of the accessed locations is placed into the caches.

In general, non-overlapping accesses caused by separate Load instructions that specify locations in Caching Inhibited storage may be combined into one access, as may non-overlapping accesses caused by separate Store instructions that specify locations in Caching Inhibited storage. Such combining does not occur if the Load or Store instructions are separated by a sync instruction, or by an eieio instruction if the storage is also Guarded.

1.6.3 Memory Coherence Required

An access to a Memory Coherence Required storage location is performed coherently, as follows.

Memory coherence refers to the ordering of stores to a single location. Atomic stores to a given location are coherent if they are serialized in some order, and no processor or mechanism is able to observe any subset of those stores as occurring in a conflicting order. This serialization order is an abstract sequence of values; the physical storage location need not assume each of the values written to it. For example, a processor may update a location several times before the value is written to physical storage. The result of a store operation is not available to every processor or mechanism at the same instant, and it may be that a processor or mechanism observes only some of the values that are written to a location. However, when a location is accessed atomically and coherently by all processors and mechanisms, the sequence of values loaded from the location by any processor or mechanism during any interval of time forms a subsequence of the sequence of values that the location logically held during that interval. That is, a processor or mechanism can never load a “newer” value first and then, later, load an “older” value.

Memory coherence is managed in blocks called coherence blocks. Their size is implementation-dependent (see the Book IV, PowerPC Implementation Features document for the implementation), but is usually larger than a word and often the size of a cache block.

For storage that is not Memory Coherence Required, software must explicitly manage memory coherence to the extent required by program correctness. The operations required to do this may be system-dependent.

Because the Memory Coherence Required attribute for a given storage location is of little use unless all processors that access the location do so coherently, in statements about Memory Coherence Required storage elsewhere in Books I − III it is generally assumed that the storage has the Memory Coherence Required attribute for all processors that access it.

Programming Note
    Operating systems that allow programs to request that storage not be Memory Coherence Required should provide services to assist in managing memory coherence for such storage, including all system-dependent aspects thereof.

    In most systems the default is that all storage is Memory Coherence Required. For some applications in some systems, software management of coherence may yield better performance. In such cases, a program can request that a given unit of storage not be Memory Coherence Required, and can manage the coherence of that storage by using the sync instruction, the Cache Management instructions, and services provided by the operating system.

1.6.4 Guarded

A data access to a Guarded storage location is performed only if either (a) the access is caused by an instruction that is known to be required by the sequential execution model, or (b) the access is a load and the storage location is already in a cache. If the storage is also Caching Inhibited, only the storage location specified by the instruction is accessed; otherwise any storage location in the cache block containing the specified storage location may be accessed.

Instructions are not fetched from virtual storage that is Guarded. If the effective address of the current instruction is in such storage, the system instruction storage error handler is invoked.

Programming Note
    In some implementations, instructions may be executed before they are known to be required by the sequential execution model. Because the results of instructions executed in this manner are discarded if it is later determined that those instructions would not have been executed in the sequential execution model, this behavior does not affect most programs.

    This behavior does affect programs that access storage locations that are not “well-behaved” (e.g., a storage location that represents a control register on an I/O device that, when accessed, causes the device to perform an operation). To avoid unintended results, programs that access such storage locations should request that the storage be Guarded, and should prevent such storage locations from being in a cache (e.g., by requesting that the storage also be Caching Inhibited).

1.7 Shared Storage

This architecture supports the sharing of storage between programs, between different instances of the same program, and between processors and other mechanisms. It also supports access to a storage location by one or more programs using different effective addresses. All these cases are considered storage sharing. Storage is shared in blocks that are an integral number of pages.

When the same storage location has different effective addresses, the addresses are said to be aliases. Each application can be granted separate access privileges to aliased pages.

1.7.1 Storage Access Ordering

The storage model for the ordering of storage accesses is weakly consistent. This model provides an opportunity for improved performance over a model that has stronger consistency rules, but places the responsibility on the program to ensure that ordering or synchronization instructions are properly placed when storage is shared by two or more programs.

The order in which the processor performs storage accesses, the order in which those accesses are performed with respect to another processor or mechanism, and the order in which those accesses are performed in main storage may all be different. Several means of enforcing an ordering of storage accesses are provided to allow programs to share storage with other programs, or with mechanisms such as I/O devices. These means are listed below. The phrase “to the extent required by the associated Memory Coherence Required attributes” refers to the Memory Coherence Required attribute, if any, associated with each access.

■ If two Store instructions specify storage locations that are both Caching Inhibited and Guarded, the corresponding storage accesses are performed in program order with respect to any processor or mechanism.
■ If a Load instruction depends on the value returned by a preceding Load instruction (because the value is used to compute the effective address specified by the second Load), the corresponding storage accesses are performed in program order with respect to any processor or mechanism to the extent required by the associated Memory Coherence Required attributes. This applies even if the dependency has no effect on program logic (e.g., the value returned by the first Load is ANDed with zero and then added to the effective address specified by the second Load).
■ When a processor (P1) executes a Synchronize or eieio instruction a memory barrier is created, which orders applicable storage accesses pairwise, as follows. Let A be a set of storage accesses that includes all storage accesses associated with instructions preceding the barrier-creating instruction, and let B be a set of storage accesses that includes all storage accesses associated with instructions following the barrier-creating instruction. For each applicable pair a_i, b_j of storage accesses such that a_i is in A and b_j is in B, the memory barrier ensures that a_i will be performed with respect to any processor or mechanism, to the extent required by the associated Memory Coherence Required attributes, before b_j is performed with respect to that processor or mechanism.
  The ordering done by a memory barrier is said to be “cumulative” if it also orders storage accesses that are performed by processors and mechanisms other than P1, as follows.
  — A includes all applicable storage accesses by any such processor or mechanism that have been performed with respect to P1 before the memory barrier is created.
  — B includes all applicable storage accesses by any such processor or mechanism that are performed after a Load instruction executed by that processor or mechanism has returned the value stored by a store that is in B.

No ordering should be assumed among the storage accesses caused by a single instruction (i.e., by an instruction for which the access is not atomic), and no means are provided for controlling that order.

Programming Note
    Because stores cannot be performed “out-of-order” (see Book III, PowerPC Operating Environment Architecture), if a Store instruction depends on the value returned by a preceding Load instruction (because the value returned by the Load is used to compute either the effective address specified by the Store or the value to be stored), the corresponding storage accesses are performed in program order. The same applies if whether the Store instruction is executed depends on a conditional Branch instruction that in turn depends on the value returned by a preceding Load instruction.

    Because an isync instruction prevents the execution of instructions following the isync until instructions preceding the isync have completed, if an isync follows a conditional Branch instruction that depends on the value returned by a preceding Load instruction, the load on which the Branch depends is performed before any loads caused by instructions following the isync. This applies even if the effects of the “dependency” are independent of the value loaded (e.g., the value is compared to itself and the Branch tests the EQ bit in the selected CR field), and even if the branch target is the sequentially next instruction.

    With the exception of the cases described above and earlier in this section, data dependencies and control dependencies do not order storage accesses. Examples include the following.

    ■ If a Load instruction specifies the same storage location as a preceding Store instruction and the location is in storage that is not Caching Inhibited, the load may be satisfied from a “store queue” (a buffer into which the processor places stored values before presenting them to the storage subsystem), and not be visible to other processors and mechanisms. A consequence is that if a subsequent Store depends on the value returned by the Load, the two stores need not be performed in program order with respect to other processors and mechanisms.
    ■ Because a Store Conditional instruction may complete before its store has been performed, a conditional Branch instruction that depends on the CR0 value set by a Store Conditional instruction does not order the Store Conditional's store with respect to storage accesses caused by instructions that follow the Branch.
    ■ Because processors may predict branch target addresses and branch condition resolution, control dependencies (e.g., branches) do not order storage accesses except as described above. For example, when a subroutine returns to its caller the return address may be predicted, with the result that loads caused by instructions at or after the return address may be performed before the load that obtains the return address is performed.

    Because processors may implement nonarchitected duplicates of architected resources (e.g., GPRs, CR fields, and the Link Register), resource dependencies (e.g., specification of the same target register for two Load instructions) do not order storage accesses.

    Examples of correct uses of dependencies, sync, lwsync, and eieio to order storage accesses can be found in Appendix B, “Programming Examples for Sharing Storage”.

    Because the storage model is weakly consistent, the sequential execution model as applied to instructions that cause storage accesses guarantees only that those accesses appear to be performed in program order with respect to the processor executing the instructions. For example, an instruction may complete, and subsequent instructions may be executed, before storage accesses caused by the first instruction have been performed. However, for a sequence of atomic accesses to the same storage location, if the location is in storage that is Memory Coherence Required the definition of coherence guarantees that the accesses are performed in program order with respect to any processor or mechanism that accesses the location coherently, and similarly if the location is in storage that is Caching Inhibited.

    Because accesses to storage that is Caching Inhibited are performed in main storage, memory barriers and dependencies on Load instructions order such accesses with respect to any processor or mechanism even if the storage is not Memory Coherence Required.

Programming Note
    The first example below illustrates cumulative ordering of storage accesses preceding a memory barrier, and the second illustrates cumulative ordering of storage accesses following a memory barrier.

1. A reservation for a subsequent stwcx. instruction is created.
2. The storage coherence mechanism is notified that a reservation exists for the storage location specified by the lwarx.
Assume that locations X, Y, and Z initially contain the value 0. The stwcx. instruction is a store to a word-aligned location that is conditioned on the existence of the Example 1: reservation created by the lwarx and on whether the same storage location is specified by both Processor A: stores the value 1 to location X instructions. To emulate an atomic operation with Processor B: loads from location X obtaining the these instructions, it is necessary that both the lwarx value 1, executes a sync instruc- and the stwcx. specify the same storage location. tion, then stores the value 2 to location Y A stwcx. performs a store to the target storage location only if the storage location specified by the Processor C: loads from location Y obtaining the lwarx that established the reservation has not been value 2, executes a sync instruc- stored into by another processor or mechanism since tion, then loads from location X the reservation was created. If the storage locations specified by the two instructions differ, the store is Example 2: not necessarily performed. Processor A: stores the value 1 to location X, executes a sync instruction, then A stwcx. that performs its store is said to “succeed”. stores the value 2 to location Y Examples of the use of lwarx and stwcx. are given in Processor B: loops loading from location Y until Appendix B, “Programming Examples for Sharing the value 2 is obtained, then stores Storage” on page 39. the value 3 to location Z Processor C: loads from location Z obtaining the A successful stwcx. to a given location may complete value 3, executes a sync instruc- before its store has been performed with respect to tion, then loads from location X other processors and mechanisms. As a result, a subsequent load or lwarx from the given location by In both cases, cumulative ordering dictates that another processor may return a “stale” value. the value loaded from location X by processor C However, a subsequent lwarx from the given location is 1. 
by the other processor followed by a successful stwcx. by that processor is guaranteed to have returned the value stored by the first processor's stwcx. (in the absence of other stores to the given 1.7.2 Storage Ordering of I/O Accesses location). A “coherence domain” consists of all processors and Programming Note all interfaces to main storage. Memory reads and The store caused by a successful stwcx. is writes initiated by mechanisms outside the coherence ordered, by a dependence on the reservation, domain are performed within the coherence domain in with respect to the load caused by the lwarx that the order in which they enter the coherence domain established the reservation, such that the two and are performed as coherent accesses. storage accesses are performed in program order with respect to any processor or mechanism. 1.7.3 Atomic Update 1.7.3.1 Reservations The Load And Reserve and Store Conditional instructions together permit atomic update of a The ability to emulate an atomic operation using storage location. There are word and doubleword lwarx and stwcx. is based on the conditional behavior forms of each of these instructions. Described here is of stwcx., the reservation created by lwarx, and the the operation of the word forms lwarx and stwcx.; clearing of that reservation if the target location is operation of the doubleword forms ldarx and stdcx. is modified by another processor or mechanism before the same except for obvious substitutions. the stwcx. performs its store. The lwarx instruction is a load from a word-aligned A reservation is held on an aligned unit of real location that has two side effects. Both of these side storage called a reservation granule. The size of the effects occur at the same time that the load is per- reservation granule is 2n bytes, where n is implemen- formed. 
8 PowerPC Virtual Environment Architecture Version 2.01 tation-dependent but is always at least 4 (thus the Programming Note minimum reservation granule size is a quadword). The reservation granule associated with effective Warning: The architecture is likely to be changed address EA contains the real address to which EA in the future to permit the reservation to be lost if maps. (“real_addr(EA)” in the RTL for the Load And a dcbf instruction is executed on the processor Reserve and Store Conditional instructions stands for holding the reservation. Therefore dcbf “real address to which EA maps”.) instructions should not be placed between a Load And Reserve instruction and the subsequent Store A processor has at most one reservation at any time. Conditional instruction. A reservation is established by executing a lwarx or ldarx instruction, and is lost (or may be lost, in the case of the fourth bullet) if any of the following occur. Programming Note ■ The processor holding the reservation executes In general, programming conventions must ensure another lwarx or ldarx: this clears the first reser- that lwarx and stwcx. specify addresses that vation and establishes a new one. match; a stwcx. should be paired with a specific ■ The processor holding the reservation executes lwarx to the same storage location. Situations in any stwcx. or stdcx., regardless of whether the which a stwcx. may erroneously be issued after specified address matches the address specified some lwarx other than that with which it is by the lwarx or ldarx that established the reser- intended to be paired must be scrupulously vation. avoided. 
For example, there must not be a ■ Some other processor executes a Store or dcbz context switch in which the processor holds a res- to the same reservation granule, or modifies a ervation in behalf of the old context, and the new Reference or Change bit (see Book III, PowerPC context resumes after a lwarx and before the Operating Environment Architecture) in the same paired stwcx.. The stwcx. in the new context reservation granule. might succeed, which is not what was intended by ■ Some other processor executes a dcbtst, dcbst, the programmer. Such a situation must be pre- or dcbf to the same reservation granule: whether vented by executing a stwcx. or stdcx. that speci- the reservation is lost is undefined. fies a dummy writable aligned location as part of ■ Some other mechanism modifies a storage the context switch; see the section entitled “Inter- location in the same reservation granule. rupt Processing” in Book III. Interrupts (see Book III, PowerPC Operating Environ- ment Architecture) do not clear reservations Programming Note (however, system software invoked by interrupts may clear reservations). Because the reservation is lost if another processor stores anywhere in the reservation Programming Note granule, lock words (or doublewords) should be allocated such that few such stores occur, other One use of lwarx and stwcx. is to emulate a than perhaps to the lock word itself. (Stores by “Compare and Swap” primitive like that provided other processors to the lock word result from con- by the IBM System/370 Compare and Swap tention for the lock, and are an expected conse- instruction; see Section B.1, “Atomic Update quence of using locks to control access to shared Primitives” on page 39. A System/370-style storage; stores to other locations in the reserva- Compare and Swap checks only that the old and tion granule can cause needless reservation loss.) 
current values of the word being tested are equal, Such allocation can most easily be accomplished with the result that programs that use such a by allocating an entire reservation granule for the Compare and Swap to control a shared resource lock and wasting all but one word. Because res- can err if the word has been modified and the old ervation granule size is implementa- value subsequently restored. The combination of tion-dependent, portable code must do such lwarx and stwcx. improves on such a Compare allocation dynamically. and Swap, because the reservation reliably binds the lwarx and stwcx. together. The reservation is Similar considerations apply to other data that are always lost if the word is modified by another shared directly using lwarx and stwcx. (e.g., processor or mechanism between the lwarx and pointers in certain linked lists; see Section B.3, stwcx., so the stwcx. never succeeds unless the “List Insertion” on page 43). word has not been stored into (by another processor or mechanism) since the lwarx. Chapter 1. Storage Model 9 Version 2.01 1.7.3.2 Forward Progress 1.8 Instruction Storage Forward progress in loops that use lwarx and stwcx. is achieved by a cooperative effort among hardware, The instruction execution properties and requirements system software, and application software. described in this section, including its subsections, apply only to instruction execution that is required by The architecture guarantees that when a processor the sequential execution model. executes a lwarx to obtain a reservation for location X and then a stwcx. to store a value to location X, In this section, including its subsections, it is assumed either that all instructions for which execution is attempted are in storage that is not Caching Inhibited and 1. the stwcx. succeeds and the value is written to (unless instruction address translation is disabled; see location X, or Book III) is not Guarded, and from which instruction 2. the stwcx. 
fails because some other processor or fetching does not cause the system error handler to mechanism modified location X, or be invoked (e.g., from which instruction fetching is not 3. the stwcx. fails because the processor's reserva- prohibited by the “address translation mechanism” or tion was lost for some other reason. the “storage protection mechanism”; see Book III). In Cases 1 and 2, the system as a whole makes progress in the sense that some processor success- Programming Note fully modifies location X. Case 3 covers reservation The results of attempting to execute instructions loss required for correct operation of the rest of the from storage that does not satisfy this assumption system. This includes cancellation caused by some are described in Sections 1.6.2 and 1.6.4 of this other processor writing elsewhere in the reservation Book and in Book III. granule for X, as well as cancellation caused by the operating system in managing certain limited resources such as real storage. It may also include For each instance of executing an instruction from implementation-dependent causes of reservation loss. location X, the instruction may be fetched multiple times. An implementation may make a forward progress guarantee, defining the conditions under which the The instruction cache is not necessarily kept con- system as a whole makes progress. Such a guar- sistent with the data cache or with main storage. It is antee must specify the possible causes of reservation the responsibility of software to ensure that instruc- loss in Case 3. While the architecture alone cannot tion storage is consistent with data storage when provide such a guarantee, the characteristics listed in such consistency is required for program correctness. Cases 1 and 2 are necessary conditions for any After one or more bytes of a storage location have forward progress guarantee. 
An implementation and been modified and before an instruction located in operating system can build on them to provide such a that storage location is executed, software must guarantee. execute the appropriate sequence of instructions to make instruction storage consistent with data storage. Programming Note Otherwise the result of attempting to execute the The architecture does not include a “fairness instruction is boundedly undefined except as guarantee”. In competing for a reservation, two described in Section 1.8.1, “Concurrent Modification processors can indefinitely lock out a third. and Execution of Instructions” on page 12. 10 PowerPC Virtual Environment Architecture Version 2.01 Programming Note Following are examples of how to make instruction dcbst X #copy the block in main storage storage consistent with data storage. Because the sync #order copy before invalidation optimal instruction sequence to make instruction icbi X #invalidate copy in instr cache storage consistent with data storage may vary sync #order invalidation before store between systems, many operating systems will # to flag provide a system service to perform this function. stw r0,flag(3) #set flag indicating instruction # storage is now consistent Case 1: The program has a single thread. The following instruction sequence, executed by the waiting threads, will prevent the waiting threads Assume that location X previously contained the from executing the instruction at location X until instruction A0; the program modified one of more location X in instruction storage is consistent with bytes of that location such that, in data storage, the data storage, and then will cause any prefetched location contains the instruction A1; and location X instructions to be discarded. is wholly contained in a single cache block. 
The fol- lowing instruction sequence will make instruction lwz r0,flag(3) #loop until flag = 1 (when 1 cmpwi r0,1 # is loaded, location X in storage consistent with data storage such that if the bne $-8 # instruction storage is isync was in location X− 4, the instruction A1 in # consistent with location X location X would be executed immediately after the # in data storage) isync. isync #discard any prefetched inst'ns dcbst X #copy the block to main storage In the preceding instruction sequence any context sync #order copy before invalidation synchronizing instruction (e.g., rfid) can be used icbi X #invalidate copy in instr cache instead of isync. (For Case 1 only isync can be isync #discard prefetched instructions used.) Case 2: The program has two or more threads. For both cases, if two or more instructions in sepa- rate data cache blocks have been modified, the Assume thread A has modified the instruction at dcbst instruction in the examples must be replaced location X and other threads are waiting for thread A by a sequence of dcbst instructions such that each to signal that the new instruction is ready to block containing the modified instructions is copied execute. The following instruction sequence will back to main storage. Similarly, for icbi the make instruction storage consistent with data sequence must invalidate each instruction cache storage and then set a flag to indicate to the waiting block containing a location of an instruction that was threads that the new instruction can be executed. modified. The sync instruction that appears above between “ dcbst X” and “ icbi X” would be placed between the sequence of dcbst instructions and the sequence of icbi instructions. Chapter 1. Storage Model 11 Version 2.01 include causing inconsistent information to be pre- 1.8.1 Concurrent Modification and sented to the system error handler. 
Execution of Instructions Programming Note The phrase “concurrent modification and execution of An example of how failure to satisfy the require- instructions” (CMODX) refers to the case in which a ments given above can cause inconsistent infor- processor fetches and executes an instruction from mation to be presented to the system error instruction storage which is not consistent with data handler is as follows. If the value X0 (an illegal storage or which becomes inconsistent with data instruction) is executed, causing the system illegal storage prior to the completion of its processing. This instruction handler to be invoked, and before the section describes the only case in which executing error handler can load X0 into a register, X0 is this instruction under these conditions produces replaced with X1, an Add Immediate instruction, it defined results. will appear that a legal instruction caused an illegal instruction exception. In the remainder of this section the following termi- nology is used. ■ Location X is an arbitrary word-aligned storage Programming Note location. It is possible to apply a patch or to instrument a ■ X0 is the value of the contents of location X for given program without the need to suspend or which software has made the location X in halt the program. This can be accomplished by instruction storage consistent with data storage. modifying the example shown above where one thread is creating instructions to be executed by ■ X1, X2, ..., Xn are the sequence of the first n one or more other threads. values occupying location X after X0. ■ Xn is the first value of X subsequent to X0 for In place of the Store to a flag to indicate to the which software has again made instruction other threads that the code is ready to be exe- storage consistent with data storage. 
cuted, the program that is applying the patch would replace a patch class instruction in the ori- ■ The “patch class” of instructions consists of the ginal program with a Branch instruction that I-form Branch instruction (b[ l] [ a]) and the pre- would cause any thread executing the Branch to ferred no-op instruction (ori 0,0,0). branch to the newly created code. The first If the instruction from location X is executed after the instruction in the newly created code must be an copy of location X in instruction storage is made con- isync, which will cause any prefetched instructions sistent for the value X0 and before it is made con- to be discarded, ensuring that the execution is sistent for the value Xn, the results of executing the consistent with the newly created code. The instruction are defined if and only if the following con- instruction storage location containing the isync ditions are satisfied. instruction in the patch area must be consistent with data storage with respect to the processor 1. The stores that place the values X1, ..., Xn into that will execute the patched code before the location X are atomic stores that modify all four Store which stores the new Branch instruction is bytes of location X. performed. 2. Each Xi , 0 ≤ i ≤ n, is a patch class instruction. 3. Location X is in storage that is Memory Coher- Programming Note ence Required. It is believed that all processors that comply with If these conditions are satisfied, the result of each versions of the architecture that precede Version execution of an instruction from location X will be the 2.01 support concurrent modification and exe- execution of some Xi , 0 ≤ i ≤ n. 
The value of the cution of instructions as described in this section ordinate i associated with each value executed may if the requirements given above are satisfied, and be different and the sequence of ordinates i associ- that most such processors yield boundedly unde- ated with a sequence of values executed is not con- fined results if the requirements given above are strained, (e.g., a valid sequence of executions of the not satisfied. However, in general such support instruction at location X could be the sequence Xi , has not been verified by processor testing. Also, Xi + 2 , then Xi− 1). If these conditions are not satisfied, one such processor is known to yield undefined the results of each such execution of an instruction results in certain cases if the requirements given from location X are boundedly undefined, and may above are not satisfied. 12 PowerPC Virtual Environment Architecture Version 2.01 Chapter 2. Effect of Operand Placement on Performance 2.1 Instruction Restart . . . . . . . . . . 14 The placement (location and alignment) of operands Operand Boundary Crossing in storage affects relative performance of storage accesses, and may affect it significantly. The best Byte Cache Virtual performance is guaranteed if storage operands are Size Align. None Block Page2 Seg. aligned. In order to obtain the best performance Integer across the widest range of implementations, the pro- grammer should assume the performance model 8 Byte 8 optimal − − − described in Figure 1 with respect to the placement of 4 good good good poor storage operands. Performance of accesses varies <4 good good good poor depending on the following: 4 Byte 4 optimal − − − 1. Operand Size <4 good good good poor 2. Operand Alignment 2 Byte 2 optimal − − − 3. Crossing no boundary <2 good good good poor 4. Crossing a cache block boundary 5. Crossing a virtual page boundary 1 Byte 1 optimal − − − 6. 
Crossing a segment boundary (see Book III, lmw, 4 good good good poor PowerPC Operating Environment Architecture for stmw <4 poor poor poor poor a description of storage segments) string good good good poor Float The Move Assist instructions have no alignment requirements. 8 Byte 8 optimal − − − 4 good good poor poor <4 poor poor poor poor 4 Byte 4 optimal − − − <4 poor poor poor poor 1 If an instruction causes an access that is not atomic and any portion of the operand is in storage that is Write Through Required or Caching Inhibited, performance is likely to be poor. 2 If the storage operand spans two virtual pages that have different storage control attributes, performance is likely to be poor. Figure 1. Performance effects of storage operand placement Chapter 2. Effect of Operand Placement on Performance 13 Version 2.01 Programming Note 2.1 Instruction Restart There are many events that might cause a Load or Store instruction to be restarted. For example, In this section, “ Load instruction” includes the Cache a hardware error may cause execution of the Management and other instructions that are stated in instruction to be aborted after part of the access the instruction descriptions to be “treated as a Load”, has been performed, and the recovery operation and similarly for “ Store instruction”. could then cause the aborted instruction to be re- executed. The following instructions are never restarted after having accessed any portion of the storage operand When an instruction is aborted after being par- (unless the instruction causes a “Data Address tially executed, the contents of the instruction Compare match” or a “Data Address Breakpoint pointer indicate that the instruction has not been match”, for which the corresponding rules are given executed, however, the contents of some registers in Book III). may have been altered and some bytes within the 1. A Store instruction that causes an atomic access storage operand may have been accessed. 
The following are examples of an instruction being 2. A Load instruction that causes an atomic access partially executed and altering the program state to storage that is both Caching Inhibited and even though it appears that the instruction has Guarded not been executed. Any other Load or Store instruction may be partially 1. Load Multiple, Load String: Some registers in executed and then aborted after having accessed a the range of registers to be loaded may have portion of the storage operand, and then re-executed been altered. (i.e., restarted, by the processor or the operating 2. Any Store instruction, dcbz: Some bytes of system). If an instruction is partially executed, the the storage operand may have been altered. contents of registers are preserved to the extent that the correct result will be produced when the instruc- 3. Any floating-point Load instruction: The tion is re-executed. target register (FRT) may have been altered. 14 PowerPC Virtual Environment Architecture Version 2.01 Chapter 3. Storage Control Instructions 3.1 Parameters Useful to Application 3.3 Synchronization Instructions . . . . 21 Programs . . . . . . . . . . . . . . . . . 15 3.3.1 Instruction Synchronize Instruction 21 3.2 Cache Management Instructions . 16 3.3.2 Load And Reserve and Store 3.2.1 Instruction Cache Instruction . . . 17 Conditional Instructions . . . . . . . . . 22 3.2.2 Data Cache Instructions . . . . . . 18 3.3.3 Memory Barrier Instructions . . . 25 3.1 Parameters Useful to Application Programs It is suggested that the operating system provide a If the caches are combined, the same value should be service that allows an application program to obtain given for an instruction cache attribute and the corre- the following information. sponding data cache attribute. 1. The two virtual page sizes 2. Coherence block size 3. Granule sizes for reservations 4. An indication of the cache model implemented (e.g., Harvard-style cache, combined cache) 5. Instruction cache size 6. 
Data cache size 7. Instruction cache line size (see Book IV, PowerPC Implementation Features) 8. Data cache line size (see Book IV) 9. Block size for icbi 10. Block size for dcbt and dcbtst 11. Block size for dcbz, dcbst, and dcbf 12. Instruction cache associativity 13. Data cache associativity 14. Factors for converting the Time Base to seconds Chapter 3. Storage Control Instructions 15 Version 2.01 3.2 Cache Management Instructions The Cache Management instructions obey the sequen- treated as a Load (Store) from (to) the addressed byte tial execution model except as described in Section with respect to address translation, the definition of 3.2.1, “Instruction Cache Instruction” on page 17. program order on page 1, storage protection, refer- ence and change recording, and the storage access In the instruction descriptions the statements “this ordering described in Section 1.7.1, “Storage Access instruction is treated as a Load” and “this instruction Ordering” on page 6. is treated as a Store” mean that the instruction is 16 PowerPC Virtual Environment Architecture Version 2.01 3.2.1 Instruction Cache Instruction Instruction Cache Block Invalidate X-form Programming Note As stated above, the effective address is trans- icbi RA,RB lated using translation resources used for data accesses, even though the block being invalidated 31 /// RA RB 982 / was copied into the instruction cache based on translation resources used for instruction fetches 0 6 11 16 21 31 (see Book III, PowerPC Operating Environment Architecture). Let the effective address (EA) be the sum (RA|0)+(RB). 
Programming Note If the block containing the byte addressed by EA is in The invalidation of the specified instruction cache storage that is Memory Coherence Required and a block cannot be assumed to have been performed block containing the byte addressed by EA is in the with respect to the processor executing the instruction cache of any processors, the block is inval- instruction until a subsequent isync instruction idated in those instruction caches. has been executed by the processor. No other instruction or event has the corresponding effect. If the block containing the byte addressed by EA is in storage that is not Memory Coherence Required and a block containing the byte addressed by EA is in the instruction cache of this processor, the block is invali- dated in that instruction cache. The function of this instruction is independent of whether the block containing the byte addressed by EA is in storage that is Write Through Required or Caching Inhibited. This instruction is treated as a Load (see Section 3.2), except that reference and change recording need not be done. Special Registers Altered: None Chapter 3. Storage Control Instructions 17 Version 2.01 3.2.2 Data Cache Instructions Data Cache Block Touch X-form Data Cache Block Touch for Store X-form dcbt RA,RB dcbtst RA,RB 31 /// RA RB 278 / 31 /// RA RB 246 / 0 6 11 16 21 31 0 6 11 16 21 31 Let the effective address (EA) be the sum Let the effective address (EA) be the sum (RA|0)+(RB). (RA|0)+(RB). The dcbt instruction provides a hint that the program The dcbtst instruction provides a hint that the will probably soon load from the block containing the program will probably soon store to the block con- byte addressed by EA. The hint is ignored if the block taining the byte addressed by EA. The hint is ignored is Caching Inhibited or Guarded. if the block is Caching Inhibited or Guarded. 
The actions (if any) taken by the processor in The actions (if any) taken by the processor in response to the hint are not considered to be “caused response to the hint are not considered to be “caused by” or “associated with” the dcbt instruction (e.g., by” or “associated with” the dcbtst instruction (e.g., dcbt is considered not to cause any data accesses). dcbtst is considered not to cause any data accesses). No means are provided by which software can syn- No means are provided by which software can syn- chronize these actions with the execution of the chronize these actions with the execution of the instruction stream. For example, these actions are instruction stream. For example, these actions are not ordered by memory barriers. not ordered by memory barriers. This instruction is treated as a Load (see Section 3.2), This instruction is treated as a Load (see Section 3.2), except that the system data storage error handler is except that the system data storage error handler is not invoked, and reference and change recording not invoked, and reference and change recording need not be done. need not be done. Special Registers Altered: Special Registers Altered: None None Programming Note In response to the hint provided by dcbt and dcbtst, the processor may prefetch the specified block into the data cache, or take other actions that reduce the latency of subsequent Load or Store instructions that refer to the block. Earlier implementations do not necessarily ignore the hint provided by dcbt and dcbtst if the speci- fied block is in storage that is Guarded and not Caching Inhibited. Therefore a dcbt or dcbtst instruction should not specify an EA in such storage if the program is to be run on such imple- mentations. 
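The load and store hints described for dcbt and dcbtst can be illustrated from C. What follows is only a sketch, assuming a GCC- or Clang-compatible compiler: `__builtin_prefetch` is a compiler builtin, not part of this architecture, and although such compilers typically lower it to dcbt (read hint) or dcbtst (write hint) on PowerPC targets, that mapping is a compiler detail and is not guaranteed here. The function names and the `PREFETCH_DISTANCE` constant are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning constant: how many elements ahead to hint.
 * A real value would depend on the cache block size and memory latency. */
#define PREFETCH_DISTANCE 16

/* Sum an array, hinting upcoming loads (in the spirit of dcbt). */
uint64_t sum_with_load_hint(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0: the program expects to load from here soon. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0);
        }
        total += data[i];
    }
    return total;
}

/* Fill an array, hinting upcoming stores (in the spirit of dcbtst). */
void fill_with_store_hint(uint32_t *data, size_t n, uint32_t value)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 1: the program expects to store here soon. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 1);
        }
        data[i] = value;
    }
}
```

Both functions compute the same results with or without the hints, which mirrors the statement above that the actions taken in response to a hint are not considered to be caused by the instruction. Following the Programming Note, such hints should not be pointed at Guarded storage if the program may run on earlier implementations.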
Data Cache Block set to Zero X-form

    dcbz RA,RB
    [POWER mnemonic: dclz]

    |  31  | ///  |  RA  |  RB  |  1014   | / |
    0      6      11     16     21        31

    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← b + (RB)
    n ← block size (bytes)
    m ← log2(n)
    ea ← EA[0:63−m] || ᵐ0
    MEM(ea, n) ← ⁿ0x00

Let the effective address (EA) be the sum (RA|0)+(RB).

All bytes in the block containing the byte addressed by EA are set to zero.

This instruction is treated as a Store (see Section 3.2).

Special Registers Altered:
    None

Programming Note
dcbz does not cause the block to exist in the data cache if the block is in storage that is Caching Inhibited.

For storage that is neither Write Through Required nor Caching Inhibited, dcbz provides an efficient means of setting blocks of storage to zero. It can be used to initialize large areas of such storage, in a manner that is likely to consume less memory bandwidth than an equivalent sequence of Store instructions.

For storage that is either Write Through Required or Caching Inhibited, dcbz is likely to take significantly longer to execute than an equivalent sequence of Store instructions.

See the section entitled “Cache Management Instructions” in Book III, PowerPC Operating Environment Architecture for additional information about dcbz.

Data Cache Block Store X-form

    dcbst RA,RB

    |  31  | ///  |  RA  |  RB  |   54    | / |
    0      6      11     16     21        31

Let the effective address (EA) be the sum (RA|0)+(RB).

If the block containing the byte addressed by EA is in storage that is Memory Coherence Required and a block containing the byte addressed by EA is in the data cache of any processor and any locations in the block are considered to be modified there, those locations are written to main storage, additional locations in the block may be written to main storage, and the block ceases to be considered to be modified in that data cache.

If the block containing the byte addressed by EA is in storage that is not Memory Coherence Required and a block containing the byte addressed by EA is in the data cache of this processor and any locations in the block are considered to be modified there, those locations are written to main storage, additional locations in the block may be written to main storage, and the block ceases to be considered to be modified in that data cache.

The function of this instruction is independent of whether the block containing the byte addressed by EA is in storage that is Write Through Required or Caching Inhibited.

This instruction is treated as a Load (see Section 3.2), except that reference and change recording need not be done.

Special Registers Altered:
    None

Data Cache Block Flush X-form

    dcbf RA,RB

    |  31  | ///  |  RA  |  RB  |   86    | / |
    0      6      11     16     21        31

Let the effective address (EA) be the sum (RA|0)+(RB).

If the block containing the byte addressed by EA is in storage that is Memory Coherence Required and a block containing the byte addressed by EA is in the data cache of any processor and any locations in the block are considered to be modified there, those locations are written to main storage and additional locations in the block may be written to main storage. The block is invalidated in the data caches of all processors.

If the block containing the byte addressed by EA is in storage that is not Memory Coherence Required and a block containing the byte addressed by EA is in the data cache of this processor and any locations in the block are considered to be modified there, those locations are written to main storage and additional locations in the block may be written to main storage. The block is invalidated in the data cache of this processor.

The function of this instruction is independent of whether the block containing the byte addressed by EA is in storage that is Write Through Required or Caching Inhibited.

This instruction is treated as a Load (see Section 3.2), except that reference and change recording need not be done.

Special Registers Altered:
    None

Programming Note
The requirements of the sequential execution model combine with the treatment of dcbf as a Load to ensure that the operation caused by a dcbf instruction is performed (i.e., complete) with respect to a subsequent Load instruction (in the same execution thread) that specifies a storage location in the cache block specified by the dcbf instruction.

3.3 Synchronization Instructions

3.3.1 Instruction Synchronize Instruction

Instruction Synchronize XL-form

    isync
    [POWER mnemonic: ics]

    |  19  | ///  | ///  | ///  |   150   | / |
    0      6      11     16     21        31

Executing an isync instruction ensures that all instructions preceding the isync instruction have completed before the isync instruction completes, and that no subsequent instructions are initiated until after the isync instruction completes. It also ensures that all instruction cache block invalidations caused by icbi instructions preceding the isync instruction have been performed with respect to the processor executing the isync instruction, and then causes any prefetched instructions to be discarded.

Except as described in the preceding sentence, the isync instruction may complete before storage accesses associated with instructions preceding the isync instruction have been performed.

This instruction is context synchronizing (see Book III, PowerPC Operating Environment Architecture).

Special Registers Altered:
    None
3.3.2 Load And Reserve and Store Conditional Instructions

The Load And Reserve and Store Conditional instructions can be used to construct a sequence of instructions that appears to perform an atomic update operation on an aligned storage location. See Section 1.7.3, “Atomic Update” on page 8 for additional information about these instructions.

The Load And Reserve and Store Conditional instructions are fixed-point Storage Access instructions; see the section entitled “Fixed-Point Storage Access Instructions” in Book I, PowerPC User Instruction Set Architecture.

The storage location specified by the Load And Reserve and Store Conditional instructions must be in storage that is Memory Coherence Required if the location may be modified by other processors or mechanisms. If the specified location is in storage that is Write Through Required or Caching Inhibited, the system data storage error handler or the system alignment error handler is invoked.

Programming Note
The Memory Coherence Required attribute on other processors and mechanisms ensures that their stores to the reservation granule will cause the reservation created by the Load And Reserve instruction to be lost.

Programming Note
Because the Load And Reserve and Store Conditional instructions have implementation dependencies (e.g., the granularity at which reservations are managed), they must be used with care. The operating system should provide system library programs that use these instructions to implement the high-level synchronization functions (Test and Set, Compare and Swap, locking, etc.; see Appendix B) that are needed by application programs. Application programs should use these library programs, rather than use the Load And Reserve and Store Conditional instructions directly.
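The atomic-update pattern described above can be sketched with a toy model of the reservation. The following Python code is illustrative only and is not part of the architecture. It follows the RESERVE and RESERVE_ADDR variables used in the RTL of the lwarx and stwcx. descriptions in this section, under several simplifying assumptions: the class and function names are hypothetical, the reservation granule is taken to be the exact word address (the real granularity is implementation-dependent), and the architecturally undefined case (a reservation exists for a different address) is modelled as "store not performed".

```python
WORD_MASK = 0xFFFFFFFF


class Machine:
    """Shared word-addressed storage plus the processors attached to it."""

    def __init__(self):
        self.mem = {}    # address -> 32-bit word
        self.cpus = []


class Cpu:
    """One processor's reservation state (RESERVE and RESERVE_ADDR in the RTL)."""

    def __init__(self, machine):
        self.machine = machine
        self.reserve = False
        self.reserve_addr = None
        machine.cpus.append(self)

    def store(self, addr, value):
        """Ordinary store; other processors lose a reservation on addr."""
        self.machine.mem[addr] = value & WORD_MASK
        for other in self.machine.cpus:
            if other is not self and other.reserve_addr == addr:
                other.reserve = False

    def lwarx(self, addr):
        """Load the word and establish a reservation for addr."""
        self.reserve = True
        self.reserve_addr = addr
        return self.machine.mem[addr]

    def stwcx(self, addr, value):
        """Return True if the store was performed (the EQ bit of CR0)."""
        performed = self.reserve and self.reserve_addr == addr
        self.reserve = False  # the reservation is cleared in every case
        if performed:
            self.store(addr, value)
        return performed


def fetch_and_add(cpu, addr, delta, between=None):
    """Retry lwarx/stwcx. until the update appears atomic; return the old value.

    `between` is a hypothetical hook that lets a test simulate another
    processor acting between the load and the conditional store.
    """
    while True:
        old = cpu.lwarx(addr)
        if between is not None:
            between()
        if cpu.stwcx(addr, old + delta):
            return old
```

If another processor stores to the reserved location between the lwarx and the stwcx., the conditional store fails and the loop retries with the newly loaded value, which is the behaviour the library routines recommended in the Programming Note rely on.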
Load Word And Reserve Indexed X-form

    lwarx RT,RA,RB

    |  31  |  RT  |  RA  |  RB  |   20    | / |
    0      6      11     16     21        31

    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← b + (RB)
    RESERVE ← 1
    RESERVE_ADDR ← real_addr(EA)
    RT ← ³²0 || MEM(EA, 4)

Let the effective address (EA) be the sum (RA|0)+(RB). The word in storage addressed by EA is loaded into RT[32:63]. RT[0:31] are set to 0.

This instruction creates a reservation for use by a Store Word Conditional instruction. An address computed from the EA as described in Section 1.7.3.1 is associated with the reservation, and replaces any address previously associated with the reservation.

EA must be a multiple of 4. If it is not, either the system alignment error handler is invoked or the results are boundedly undefined.

Special Registers Altered:
    None

Load Doubleword And Reserve Indexed X-form

    ldarx RT,RA,RB

    |  31  |  RT  |  RA  |  RB  |   84    | / |
    0      6      11     16     21        31

    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← b + (RB)
    RESERVE ← 1
    RESERVE_ADDR ← real_addr(EA)
    RT ← MEM(EA, 8)

Let the effective address (EA) be the sum (RA|0)+(RB). The doubleword in storage addressed by EA is loaded into RT.

This instruction creates a reservation for use by a Store Doubleword Conditional instruction. An address computed from the EA as described in Section 1.7.3.1 is associated with the reservation, and replaces any address previously associated with the reservation.

EA must be a multiple of 8. If it is not, either the system alignment error handler is invoked or the results are boundedly undefined.

Special Registers Altered:
    None

Store Word Conditional Indexed X-form

    stwcx. RS,RA,RB

    |  31  |  RS  |  RA  |  RB  |   150   | 1 |
    0      6      11     16     21        31

    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← b + (RB)
    if RESERVE then
        if RESERVE_ADDR = real_addr(EA) then
            MEM(EA, 4) ← (RS)[32:63]
            CR0 ← 0b00 || 0b1 || XER[SO]
        else
            u ← undefined 1-bit value
            if u then
                MEM(EA, 4) ← (RS)[32:63]
            CR0 ← 0b00 || u || XER[SO]
        RESERVE ← 0
    else
        CR0 ← 0b00 || 0b0 || XER[SO]

Let the effective address (EA) be the sum (RA|0)+(RB).

If a reservation exists and the storage location specified by the stwcx. is the same as that specified by the Load And Reserve instruction that established the reservation, (RS)[32:63] are stored into the word in storage addressed by EA and the reservation is cleared.

If a reservation exists but the storage location specified by the stwcx. is not the same as that specified by the Load And Reserve instruction that established the reservation, the reservation is cleared, and it is undefined whether (RS)[32:63] are stored into the word in storage addressed by EA.

If a reservation does not exist, the instruction completes without altering storage.

Store Doubleword Conditional Indexed X-form

    stdcx. RS,RA,RB

    |  31  |  RS  |  RA  |  RB  |   214   | 1 |
    0      6      11     16     21        31

    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← b + (RB)
    if RESERVE then
        if RESERVE_ADDR = real_addr(EA) then
            MEM(EA, 8) ← (RS)
            CR0 ← 0b00 || 0b1 || XER[SO]
        else
            u ← undefined 1-bit value
            if u then
                MEM(EA, 8) ← (RS)
            CR0 ← 0b00 || u || XER[SO]
        RESERVE ← 0
    else
        CR0 ← 0b00 || 0b0 || XER[SO]

Let the effective address (EA) be the sum (RA|0)+(RB).

If a reservation exists and the storage location specified by the stdcx. is the same as that specified by the Load And Reserve instruction that established the reservation, (RS) is stored into the doubleword in storage addressed by EA and the reservation is cleared.

If a reservation exists but the storage location specified by the stdcx. is not the same as that specified by the Load And Reserve instruction that established the reservation, the reservation is cleared, and it is undefined whether (RS) is stored into the doubleword in storage addressed by EA.
If a reservation does not exist, the instruction completes without altering storage.

The following applies to both stwcx. and stdcx.

CR Field 0 is set to reflect whether the store operation was performed, as follows.

    CR0[LT GT EQ SO] = 0b00 || store_performed || XER[SO]

For stwcx., EA must be a multiple of 4; for stdcx., EA must be a multiple of 8. If it is not, either the system alignment error handler is invoked or the results are boundedly undefined.

Special Registers Altered:
    CR0

3.3.3 Memory Barrier Instructions

The Memory Barrier instructions can be used to control the order in which storage accesses are performed. Additional information about these instructions and about related aspects of storage management can be found in Book III, PowerPC Operating Environment Architecture.

Extended mnemonics for Synchronize

Extended mnemonics are provided for the Synchronize instruction so that it can be coded with the L value as part of the mnemonic rather than as a numeric operand. These are shown as examples with the instruction. See Appendix A, “Assembler Extended Mnemonics” on page 37.

Synchronize X-form

    sync L
    [POWER mnemonic: dcs]

    |  31  | /// | L  | ///  | ///  |   598   | / |
    0      6     9    11     16     21        31

The sync instruction creates a memory barrier (see Section 1.7.1). The set of storage accesses that is ordered by the memory barrier depends on the value of the L field.

L = 0 (“heavyweight sync”)
    The memory barrier provides an ordering function for the storage accesses associated with all instructions that are executed by the processor executing the sync instruction. The applicable pairs are all pairs $a_i$,$b_j$ in which $b_j$ is a data access, except that if $a_i$ is the storage access caused by an icbi instruction then $b_j$ may be performed with respect to the processor executing the sync instruction before $a_i$ is performed with respect to that processor.

L = 1 (“lightweight sync”)
    The memory barrier provides an ordering function for the storage accesses caused by Load, Store, and dcbz instructions that are executed by the processor executing the sync instruction and for which the specified storage location is in storage that is Memory Coherence Required and is neither Write Through Required nor Caching Inhibited. The applicable pairs are all pairs $a_i$,$b_j$ of such accesses except those in which $a_i$ is an access caused by a Store or dcbz instruction and $b_j$ is an access caused by a Load instruction.

L = 2
    The set of storage accesses that is ordered by the memory barrier is described in the section entitled “Synchronize Instruction” in Book III, as are additional properties of the sync instruction with L = 2.

The ordering done by the memory barrier is cumulative.

The sync instruction may complete before storage accesses associated with instructions preceding the sync instruction have been performed.

If L = 0, the sync instruction has the following additional properties.

■ Executing the sync instruction ensures that all instructions preceding the sync instruction have completed before the sync instruction completes, and that no subsequent instructions are initiated until after the sync instruction completes.
■ The sync instruction is execution synchronizing (see Book III, PowerPC Operating Environment Architecture). However, address translation and reference and change recording (see Book III) associated with subsequent instructions may be performed before the sync instruction completes.
■ The memory barrier provides the additional ordering function such that if a given instruction that is the result of a Store in set B is executed, all applicable storage accesses in set A have been performed with respect to the processor executing the instruction to the extent required by the associated memory coherence properties. The single exception is that any storage access in set A that is caused by an icbi instruction executed by the processor executing the sync instruction (P1) may not have been performed with respect to P1 (see the description of the icbi instruction on page 17).
  The cumulative properties of the barrier apply to the execution of the given instruction as they would to a Load that returned a value that was the result of a Store in set B.

The value L = 3 is reserved, and the results of executing a sync instruction with L = 3 are boundedly undefined.

Special Registers Altered:
    None

Extended Mnemonics:

Extended mnemonics for Synchronize:

    Extended:    Equivalent to:
    sync         sync 0
    lwsync       sync 1
    ptesync      sync 2

Except in the sync instruction description in this section, references to “sync” in Books I − III imply L = 0 unless otherwise stated or obvious from context (the appropriate extended mnemonics are used when the other L values are intended).

Programming Note
Section 1.8 on page 10 contains a detailed description of how to modify instructions such that a well-defined result is obtained.

Programming Note
sync serves as both a basic and an extended mnemonic. The Assembler will recognize a sync mnemonic with one operand as the basic form, and a sync mnemonic with no operand as the extended form. In the extended form the L operand is omitted and assumed to be 0.

Programming Note
The sync instruction can be used to ensure that all stores into a data structure, caused by Store instructions executed in a “critical section” of a program, will be performed with respect to another processor before the store that releases the lock is performed with respect to that processor; see Section B.2, “Lock Acquisition and Release, and Related Techniques” on page 41.

The memory barrier created by a sync instruction with L = 0 or L = 1 does not order implicit storage accesses. The memory barrier created by a sync instruction with any L value does not order instruction fetches.

(The memory barrier created by a sync instruction with L = 0, or with L = 2 (see Book III), appears to order instruction fetches for instructions preceding the sync instruction with respect to data accesses caused by instructions following the sync instruction. However, this ordering is a consequence of the first “additional property” of sync with L = 0, not a property of the memory barrier.)

In order to obtain the best performance across the widest range of implementations, the programmer should use either the sync instruction with L = 1 or the eieio instruction if either of these is sufficient for the program's needs; otherwise sync with L = 0 should be used. sync with L = 2 should not be used by application programs.

Programming Note
The functions provided by sync with L = 1 are a strict subset of those provided by sync with L = 0. (The functions provided by sync with L = 2 are a strict superset of those provided by sync with L = 0; see Book III.)

Enforce In-order Execution of I/O X-form

    eieio

    |  31  | ///  | ///  | ///  |   854   | / |
    0      6      11     16     21        31

The eieio instruction creates a memory barrier (see Section 1.7.1), which provides an ordering function for the storage accesses caused by Load, Store, dcbz, eciwx, and ecowx instructions executed by the processor executing the eieio instruction. These storage accesses are divided into two sets, which are ordered separately. The storage access caused by an eciwx instruction is ordered as a load, and the storage access caused by a dcbz or ecowx instruction is ordered as a store.

1. Loads and stores to storage that is both Caching Inhibited and Guarded, and stores to main storage caused by stores to storage that is Write Through Required
   The applicable pairs are all pairs $a_i$,$b_j$ of such accesses.
   The ordering done by the memory barrier for accesses in this set is not cumulative.
2. Stores to storage that is Memory Coherence Required and is neither Write Through Required nor Caching Inhibited
   The applicable pairs are all pairs $a_i$,$b_j$ of such accesses.
   The ordering done by the memory barrier for accesses in this set is cumulative.

The eieio instruction may complete before storage accesses associated with instructions preceding the eieio instruction have been performed.

Special Registers Altered:
    None

Programming Note
The eieio instruction is intended for use in managing shared data structures (see Appendix B, “Programming Examples for Sharing Storage” on page 39), in doing memory-mapped I/O, and in preventing load/store combining operations in main storage (see Section 1.6, “Storage Control Attributes” on page 4).

Because stores to storage that is both Caching Inhibited and Guarded are performed in program order (see Section 1.7.1, “Storage Access Ordering” on page 6), eieio is needed for such storage only when loads must be ordered with respect to stores or with respect to other loads, or when load/store combining operations must be prevented.

For accesses in set 1, $a_i$ and $b_j$ need not be the same kind of access or be to storage having the same storage control attributes. For example, $a_i$ can be a load to Caching Inhibited, Guarded storage, and $b_j$ a store to Write Through Required storage.

If stronger ordering is desired than that provided by eieio, the sync instruction must be used, with the appropriate value in the L field.

Programming Note
The functions provided by eieio are a strict subset of those provided by sync with L = 0. The functions provided by eieio for its second set are a strict subset of those provided by sync with L = 1.
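The instruction layouts given above for sync and eieio can be checked by assembling the 32-bit instruction words by hand. The following Python sketch is illustrative only: the helper names are hypothetical, and the only facts used are the field positions shown in the instruction descriptions (primary opcode 31 in bits 0:5, L in bits 9:10, extended opcode in bits 21:30, bit 0 being the most significant bit).

```python
def field(value, first_bit, last_bit):
    """Place value into bits first_bit..last_bit of a 32-bit word.

    Bits are numbered as in the architecture books: bit 0 is the most
    significant bit and bit 31 the least significant.
    """
    width = last_bit - first_bit + 1
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the field")
    return value << (31 - last_bit)


def encode_sync(l=0):
    """Encode 'sync L': opcode 31, L in bits 9:10, extended opcode 598."""
    if l not in (0, 1, 2):
        raise ValueError("L = 3 is reserved")
    return field(31, 0, 5) | field(l, 9, 10) | field(598, 21, 30)


def encode_eieio():
    """Encode 'eieio': opcode 31, extended opcode 854."""
    return field(31, 0, 5) | field(854, 21, 30)


# Extended mnemonics for Synchronize and the L value each one implies.
SYNC_MNEMONICS = {"sync": 0, "lwsync": 1, "ptesync": 2}

if __name__ == "__main__":
    for name, l in SYNC_MNEMONICS.items():
        print(f"{name:8s} 0x{encode_sync(l):08X}")
    print(f"{'eieio':8s} 0x{encode_eieio():08X}")
```

The three sync variants differ only in the two L bits, which is why the extended mnemonics can be expressed as a single basic instruction with a numeric operand.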
Chapter 4. Time Base

    4.1 Time Base Instructions
    4.2 Reading the Time Base
    4.3 Computing Time of Day from the Time Base

The Time Base (TB) is a 64-bit register (see Figure 2) containing a 64-bit unsigned integer that is incremented periodically. Each increment adds 1 to the low-order bit (bit 63). The frequency at which the integer is updated is implementation-dependent.

    |        TBU        |        TBL        |
    0                   32                  63

    Field   Description
    TBU     Upper 32 bits of Time Base
    TBL     Lower 32 bits of Time Base

Figure 2. Time Base

The Time Base increments until its value becomes 0xFFFF_FFFF_FFFF_FFFF ($2^{64} - 1$). At the next increment, its value becomes 0x0000_0000_0000_0000. There is no explicit indication (such as an interrupt; see Book III, PowerPC Operating Environment Architecture) that this has occurred.

The period of the Time Base depends on the driving frequency. As an order of magnitude example, suppose that the CPU clock is 1 GHz and that the Time Base is driven by this frequency divided by 32. Then the period of the Time Base would be

$$T_{TB} = \frac{2^{64} \times 32}{1\ \text{GHz}} = 5.90 \times 10^{11}\ \text{seconds}$$

which is approximately 18,700 years.

The PowerPC Architecture does not specify a relationship between the frequency at which the Time Base is updated and other frequencies, such as the CPU clock or bus clock, in a PowerPC system. The Time Base update frequency is not required to be constant. What is required, so that system software can keep time of day and operate interval timers, is one of the following.

■ The system provides an (implementation-dependent) interrupt to software whenever the update frequency of the Time Base changes, and a means to determine what the current update frequency is.
■ The update frequency of the Time Base is under the control of the system software.

Programming Note
If the operating system initializes the Time Base on power-on to some reasonable value and the update frequency of the Time Base is constant, the Time Base can be used as a source of values that increase at a constant rate, such as for time stamps in trace entries.

Even if the update frequency is not constant, values read from the Time Base are monotonically increasing (except when the Time Base wraps from $2^{64} - 1$ to 0). If a trace entry is recorded each time the update frequency changes, the sequence of Time Base values can be post-processed to become actual time values.

Successive readings of the Time Base may return identical values.

4.1 Time Base Instructions

Extended mnemonics

Extended mnemonics are provided for the Move From Time Base instruction so that it can be coded with the TBR name as part of the mnemonic rather than as a numeric operand. See the appendix entitled “Assembler Extended Mnemonics” in Book III, PowerPC Operating Environment Architecture.

Move From Time Base XFX-form

    mftb RT,TBR

    |  31  |  RT  |     tbr     |   371   | / |
    0      6      11            21        31

    n ← tbr[5:9] || tbr[0:4]
    if n = 268 then
        RT ← TB
    else if n = 269 then
        RT ← ³²0 || TB[0:31]

The TBR field denotes either the Time Base or Time Base Upper, encoded as shown in the table below. The contents of the designated register are placed into register RT. When reading Time Base Upper, the high-order 32 bits of register RT are set to zero.

    TBR* (decimal)   tbr[5:9]   tbr[0:4]   Register Name
    268              01000      01100      TB
    269              01000      01101      TBU

    * Note that the order of the two 5-bit halves of the TBR number is reversed.

If the TBR field contains any value other than one of the values shown above then one of the following occurs.

■ The system illegal instruction error handler is invoked.
■ The system privileged instruction error handler is invoked.
■ The results are boundedly undefined.

Special Registers Altered:
    None

Extended Mnemonics:

Extended mnemonics for Move From Time Base:

    Extended:    Equivalent to:
    mftb Rx      mftb Rx,268
    mftbu Rx     mftb Rx,269

Programming Note
mftb serves as both a basic and an extended mnemonic. The Assembler will recognize an mftb mnemonic with two operands as the basic form, and an mftb mnemonic with one operand as the extended form. In the extended form the TBR operand is omitted and assumed to be 268 (the value that corresponds to TB).

Compiler and Assembler Note
The TBR number coded in assembler language does not appear directly as a 10-bit binary number in the instruction. The number coded is split into two 5-bit halves that are reversed in the instruction, with the high-order 5 bits appearing in bits 16:20 of the instruction and the low-order 5 bits in bits 11:15.

4.2 Reading the Time Base

The contents of the Time Base can be read into a GPR by the mftb extended mnemonic. To read the contents of the Time Base into register Rx, execute:

    mftb Rx

Reading the Time Base has no effect on the value it contains or on the periodic incrementing of that value.

4.3 Computing Time of Day from the Time Base

Since the update frequency of the Time Base is implementation-dependent, the algorithm for converting the current value in the Time Base to time of day is also implementation-dependent.

As an example, assume that the Time Base is incremented at a constant rate of once for every 32 cycles of a 1 GHz CPU instruction clock. What is wanted is the pair of 32-bit values comprising a POSIX standard clock:[1] the number of whole seconds that have passed since 00:00:00 January 1, 1970, and the remaining fraction of a second expressed as a number of nanoseconds.

[1] Described in POSIX Draft Standard P1003.4/D12, Draft Standard for Information Technology -- Portable Operating System Interface (POSIX) -- Part 1: System Application Program Interface (API) - Amendment 1: Realtime Extension [C Language]. Institute of Electrical and Electronics Engineers, Inc., Feb. 1992.

Assume that:

■ The value 0 in the Time Base represents the start time of the POSIX clock (if this is not true, a simple 64-bit subtraction will make it so).
■ The integer constant ticks_per_sec contains the value
  $$\frac{1\ \text{GHz}}{32} = 31{,}250{,}000$$
  which is the number of times the Time Base is updated each second.
■ The integer constant ns_adj contains the value
  $$\frac{1{,}000{,}000{,}000}{31{,}250{,}000} = 32$$
  which is the number of nanoseconds per tick of the Time Base.

The POSIX clock can be computed with an instruction sequence such as this:

    mftb   Ry                 # Ry = Time Base
    lwz    Rx,ticks_per_sec
    divdu  Rz,Ry,Rx           # Rz = whole seconds (unsigned divide: the Time Base is an unsigned integer)
    stw    Rz,posix_sec
    mulld  Rz,Rz,Rx           # Rz = quotient * divisor
    sub    Rz,Ry,Rz           # Rz = excess ticks
    lwz    Rx,ns_adj
    mulld  Rz,Rz,Rx           # Rz = excess nanoseconds
    stw    Rz,posix_ns

Non-constant update frequency

In a system in which the update frequency of the Time Base may change over time, it is not possible to convert an isolated Time Base value into time of day. Instead, a Time Base value has meaning only with respect to the current update frequency and the time of day that the update frequency was last changed. Each time the update frequency changes, either the system software is notified of the change via an interrupt (see Book III, PowerPC Operating Environment Architecture), or the change was instigated by the system software itself. At each such change, the system software must compute the current time of day using the old update frequency, compute a new value of ticks_per_sec for the new frequency, and save the time of day, Time Base value, and tick rate. Subsequent calls to compute time of day use the current Time Base value and the saved data.
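The arithmetic of this chapter can be restated in a few lines of Python. The sketch below is illustrative only and is not part of the architecture: the function and class names are hypothetical, the Time Base is assumed to read 0 at the start of the POSIX clock, and ticks_per_sec is assumed to divide 1,000,000,000 exactly (as it does for the 31,250,000 ticks per second of the example). It covers the constant-rate conversion performed by the instruction sequence above, the period calculation from the start of the chapter, and the bookkeeping described for a non-constant update frequency.

```python
NS_PER_SEC = 1_000_000_000
MASK64 = (1 << 64) - 1


def tb_to_posix(tb, ticks_per_sec=31_250_000):
    """Split a Time Base value into (whole seconds, nanoseconds)."""
    ns_adj = NS_PER_SEC // ticks_per_sec         # nanoseconds per tick (32 here)
    seconds = tb // ticks_per_sec                # whole seconds
    excess_ticks = tb - seconds * ticks_per_sec  # quotient * divisor, then subtract
    return seconds, excess_ticks * ns_adj        # excess nanoseconds


def tb_period_seconds(ticks_per_sec=31_250_000):
    """Seconds taken by the 64-bit Time Base to wrap at a constant rate."""
    return 2 ** 64 / ticks_per_sec


class VariableRateClock:
    """Saved time of day, Time Base value, and tick rate, as described above."""

    def __init__(self, tb, ticks_per_sec, now_ns=0):
        self.saved_tb = tb
        self.saved_ns = now_ns
        self.ticks_per_sec = ticks_per_sec

    def time_ns(self, tb):
        """Time of day in nanoseconds for the Time Base reading tb."""
        elapsed = (tb - self.saved_tb) & MASK64  # tolerate a wrap of the Time Base
        return self.saved_ns + elapsed * NS_PER_SEC // self.ticks_per_sec

    def frequency_changed(self, tb, new_ticks_per_sec):
        """Call at each change: settle up at the old rate, then save the new one."""
        self.saved_ns = self.time_ns(tb)
        self.saved_tb = tb
        self.ticks_per_sec = new_ticks_per_sec
```

With the default rate, tb_to_posix(31_250_001) returns one whole second and 32 nanoseconds, and tb_period_seconds() is about 5.90 × 10^11 seconds, in agreement with the figures given in the text.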
Chapter 5. Optional Facilities and Instructions

    5.1 External Control
    5.1.1 External Access Instructions
    5.2 Storage Control Instructions
    5.2.1 Cache Management Instructions
    5.2.1.1 Data Cache Instruction
    5.3 Little-Endian

The facilities and instructions described in this chapter are optional. An implementation may provide all, some, or none of them, except as described below.

5.1 External Control

The External Control facility permits a program to communicate with a special-purpose device. Two instructions are provided, both of which must be implemented if the facility is provided.

■ External Control In Word Indexed (eciwx), which does the following:
  - Computes an effective address (EA) as for any X-form instruction
  - Validates the EA as would be done for a load from that address
  - Translates the EA to a real address
  - Transmits the real address to the device
  - Accepts a word of data from the device and places it into a General Purpose Register
■ External Control Out Word Indexed (ecowx), which does the following:
  - Computes an effective address (EA) as for any X-form instruction
  - Validates the EA as would be done for a store to that address
  - Translates the EA to a real address
  - Transmits the real address and a word of data from a General Purpose Register to the device

Permission to execute these instructions and identification of the target device are controlled by two fields, called the E bit and the RID field respectively. If an attempt is made to execute either of these instructions when E = 0, the system data storage error handler is invoked. The location of these fields is described in Book III, PowerPC Operating Environment Architecture.

The storage access caused by eciwx and ecowx is performed as though the specified storage location is Caching Inhibited and Guarded, and is neither Write Through Required nor Memory Coherence Required.

Interpretation of the real address transmitted by eciwx and ecowx and of the 32-bit value transmitted by ecowx is up to the target device, and is not specified by the PowerPC Architecture. See the System Architecture documentation for a given PowerPC system for details on how the External Control facility can be used with devices on that system.

Example

An example of a device designed to be used with the External Control facility might be a graphics adapter. The ecowx instruction might be used to send the device the translated real address of a buffer containing graphics data, and the word transmitted from the General Purpose Register might be control information that tells the adapter what operation to perform on the data in the buffer. The eciwx instruction might be used to load status information from the adapter.

A device designed to be used with the External Control facility may also recognize events that indicate that the address translation being used by the processor has changed. In this case the operating system need not “pin” the area of storage identified by an eciwx or ecowx instruction (i.e., need not protect it from being paged out).

5.1.1 External Access Instructions

In the instruction descriptions the statements “this instruction is treated as a Load” and “this instruction is treated as a Store” have the same meanings as for the Cache Management instructions; see Section 3.2, “Cache Management Instructions” on page 16.

External Control In Word Indexed X-form

    eciwx RT,RA,RB

    |  31  |  RT  |  RA  |  RB  |   310   | / |
    0      6      11     16     21        31

    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← b + (RB)
    raddr ← address translation of EA
    send load word request for raddr to device identified by RID
    RT ← ³²0 || word from device

Let the effective address (EA) be the sum (RA|0)+(RB).

A load word request for the real address corresponding to EA is sent to the device identified by RID, bypassing the cache. The word returned by the device is placed into RT[32:63]. RT[0:31] are set to 0.

The E bit must be 1. If it is not, the data storage error handler is invoked.

EA must be a multiple of 4. If it is not, either the system alignment error handler is invoked or the results are boundedly undefined.

This instruction is treated as a Load.

See Book III, PowerPC Operating Environment Architecture for additional information about this instruction.

Special Registers Altered:
    None

Programming Note
The eieio instruction can be used to ensure that the storage accesses caused by eciwx and ecowx are performed in program order with respect to other Caching Inhibited and Guarded storage accesses.

External Control Out Word Indexed X-form

    ecowx RS,RA,RB

    |  31  |  RS  |  RA  |  RB  |   438   | / |
    0      6      11     16     21        31

    if RA = 0 then b ← 0
    else           b ← (RA)
    EA ← b + (RB)
    raddr ← address translation of EA
    send store word request for raddr to device identified by RID
    send (RS)[32:63] to device

Let the effective address (EA) be the sum (RA|0)+(RB).

A store word request for the real address corresponding to EA and the contents of RS[32:63] are sent to the device identified by RID, bypassing the cache.

The E bit must be 1. If it is not, the data storage error handler is invoked.

EA must be a multiple of 4. If it is not, either the system alignment error handler is invoked or the results are boundedly undefined.

This instruction is treated as a Store, except that its storage access is not performed in program order with respect to accesses to other Caching Inhibited and Guarded storage locations unless software explicitly imposes that order.

See Book III, PowerPC Operating Environment Architecture for additional information about this instruction.

Special Registers Altered:
    None

5.2 Storage Control Instructions

5.2.1 Cache Management Instructions

5.2.1.1 Data Cache Instruction

The optional version of the Data Cache Block Touch instruction includes a TH (Touch Hint) field, which permits a program to provide a hint that a sequence of data cache blocks is likely to be needed soon. The sequence is called a “data stream”.

Data Cache Block Touch X-form

    dcbt RA,RB,TH

    |  31  | /// | TH |  RA  |  RB  |   278   | / |
    0      6     9    11     16     21        31

Let the effective address (EA) be the sum (RA|0)+(RB).

The dcbt instruction provides a hint that the program will probably soon load from the storage locations specified by EA and the TH field. The hint is ignored for storage locations that are Caching Inhibited or Guarded.

The encodings of the TH field are as follows.

    TH   Description
    00   The storage location is the block containing the byte addressed by EA.
    01   The storage locations are the block containing the byte addressed by EA and sequentially following blocks (i.e., the blocks containing the bytes addressed by EA + n × block_size, where n = 0, 1, 2, ...).
    10   Reserved
    11   The storage locations are the block containing the byte addressed by EA and sequentially preceding blocks (i.e., the blocks containing the bytes addressed by EA − n × block_size, where n = 0, 1, 2, ...).

The actions (if any) taken by the processor in response to the hint are not considered to be “caused by” or “associated with” the dcbt instruction (e.g., dcbt is considered not to cause any data accesses). No means are provided by which software can synchronize these actions with the execution of the instruction stream. For example, these actions are not ordered by memory barriers.

This instruction is treated as a Load (see Section 3.2), except that the system data storage error handler is not invoked, and reference and change recording need not be done.

Special Registers Altered:
    None

Programming Note
In response to the hint provided by dcbt, the processor may prefetch the specified storage locations into the data cache, or take other actions that reduce the latency of subsequent Load instructions that refer to the locations.

Programming Note
dcbt serves as both a basic and an extended mnemonic. The Assembler will recognize a dcbt mnemonic with three operands as the basic form, and a dcbt mnemonic with two operands as the extended form. In the extended form the TH operand is omitted and assumed to be 0b00.

Programming Note
If the TH field is set to 0b00, the instruction operates as described in Section 3.2.2, “Data Cache Instructions” on page 18.

The TH field should not be set to 0b10, because that value may be assigned a meaning in some future version of the architecture.

Earlier implementations that do not support the optional version of dcbt ignore the TH field (i.e., treat it as if it were set to 0b00), and do not necessarily ignore the hint provided by dcbt if the specified block is in storage that is Guarded and not Caching Inhibited. Therefore a dcbt instruction with TH[1] = 1 should not specify an EA in such storage if the program is to be run on such implementations.

Programming Note
Although optimal use of the data stream variant of dcbt (TH[1] = 1) depends on the characteristics of the prefetch mechanism and of the storage hierarchy (see Book IV), the programmer should assume that the following programming model is supported.

5.3 Little-Endian

If the optional Little-Endian facility is implemented (see the section entitled “Little-Endian” in Book I, PowerPC User Instruction Set Architecture), the pro-
grammer should assume the performance model described in Figure 3 with respect to the placement of ■ Data stream resources are allocated in round- storage operands that are accessed in Little-Endian robin fashion. Therefore dcbt instructions mode. (with TH1= 1 ) should be executed for the least important stream first and the most important stream last. If this technique is used and Operand Boundary Crossing dcbt instructions are executed for more Byte Cache Virtual streams than the processor supports, the Size Align. None Block Page2 Seg. most important streams will be prefetched. Integer ■ The prefetch mechanism paces prefetching of a data stream with consumption of the pre- 8 Byte 8 optimal − − − fetched data, prefetching only a limited 4 good good poor poor number of blocks ahead of the block that is <4 poor poor poor poor currently being loaded from by the program. 4 Byte 4 optimal − − − As a consequence, when the program ceases <4 good good poor poor to load from successive blocks of the stream, prefetching of the stream ceases. 2 Byte 2 optimal − − − <2 good good poor poor ■ Certain conditions may cause prefetching to be terminated for a data stream that the 1 Byte 1 optimal − − − program is still using. However, the prefetch Float mechanism will subsequently detect that the stream is still being loaded from and will 8 Byte 8 optimal − − − resume prefetching of the stream. Therefore 4 good good poor poor there is no need to code more than one dcbt <4 poor poor poor poor instruction (with TH1= 1 ) for the stream. 
4 Byte 4 optimal − − − Although the dcbt instruction described in Section <4 poor poor poor poor 3.2.2 (equivalently, dcbt with TH=0b00) can be 1 If an instruction causes an access that is not used to provide the same function as the data atomic and any portion of the operand is in stream variant, the data stream variant may be storage that is Write Through Required or easier to use because only one instance of the Caching Inhibited, performance is likely to be dcbt instruction is needed per stream, instead of poor. one per cache block, and because the perform- 2 If the storage operand spans two virtual pages ance of processing the stream is less sensitive to that have different storage control attributes, how far ahead of the Load instructions the dcbt performance is likely to be poor. instruction is placed. Figure 3. Performance effects of storage operand placement, Little-Endian mode 36 PowerPC Virtual Environment Architecture Version 2.01 Appendix A. Assembler Extended Mnemonics In order to make assembler language programs mnemonics and symbols related to instructions simpler to write and easier to understand, a set of defined in Book II. extended mnemonics and symbols is provided for certain instructions. This appendix defines extended Assemblers should provide the extended mnemonics and symbols listed here, and may provide others. A.1 Synchronize Mnemonics The L field in the Synchronize instruction controls the scope of the synchronization function performed by the instruction. Extended mnemonics are provided that represent the L value in the mnemonic rather than requiring it to be coded as a numeric operand. Note: sync serves as both a basic and an extended mnemonic. The Assembler will recognize a sync mnemonic with one operand as the basic form, and a sync mnemonic with no operand as the extended form. In the extended form the L operand is omitted and assumed to be 0. 
    sync       (equivalent to: sync 0)
    lwsync     (equivalent to: sync 1)
    ptesync    (equivalent to: sync 2)

Appendix B. Programming Examples for Sharing Storage

This appendix gives examples of how dependencies and the Synchronization instructions can be used to control storage access ordering when storage is shared between programs.

Many of the examples use extended mnemonics (e.g., bne, bne-, cmpw) that are defined in the Appendix entitled “Assembler Extended Mnemonics” in Book I, PowerPC User Instruction Set Architecture.

Many of the examples use the Load And Reserve and Store Conditional instructions, in a sequence that begins with a Load And Reserve instruction and ends with a Store Conditional instruction (specifying the same storage location as the Load And Reserve) followed by a Branch Conditional instruction that tests whether the Store Conditional instruction succeeded.

In these examples it is assumed that contention for the shared resource is low; the conditional branches are optimized for this case by using “+” and “-” suffixes appropriately.

The examples deal with words; they can be used for doublewords by changing all word-specific mnemonics to the corresponding doubleword-specific mnemonics (e.g., lwarx to ldarx, cmpw to cmpd).

In this appendix it is assumed that all shared storage locations are in storage that is Memory Coherence Required, and that the storage locations specified by Load And Reserve and Store Conditional instructions are in storage that is neither Write Through Required nor Caching Inhibited.

B.1 Atomic Update Primitives

This section gives examples of how the Load And Reserve and Store Conditional instructions can be used to emulate atomic read/modify/write operations.

An atomic read/modify/write operation reads a storage location and writes its next value, which may be a function of its current value, all as a single atomic operation. The examples shown provide the effect of an atomic read/modify/write operation, but use several instructions rather than a single atomic instruction.

Fetch and No-op

The “Fetch and No-op” primitive atomically loads the current value in a word in storage.

In this example it is assumed that the address of the word to be loaded is in GPR 3 and the data loaded are returned in GPR 4.

    loop:   lwarx   r4,0,r3     #load and reserve
            stwcx.  r4,0,r3     #store old value if
                                # still reserved
            bne-    loop        #loop if lost reservation

Note:

1. The stwcx., if it succeeds, stores to the target location the same value that was loaded by the preceding lwarx. While the store is redundant with respect to the value in the location, its success ensures that the value loaded by the lwarx is still the current value at the time the stwcx. is executed.

Fetch and Store

The “Fetch and Store” primitive atomically loads and replaces a word in storage.

In this example it is assumed that the address of the word to be loaded and replaced is in GPR 3, the new value is in GPR 4, and the old value is returned in GPR 5.

    loop:   lwarx   r5,0,r3     #load and reserve
            stwcx.  r4,0,r3     #store new value if
                                # still reserved
            bne-    loop        #loop if lost reservation

Fetch and Add

The “Fetch and Add” primitive atomically increments a word in storage.

In this example it is assumed that the address of the word to be incremented is in GPR 3, the increment is in GPR 4, and the old value is returned in GPR 5.

    loop:   lwarx   r5,0,r3     #load and reserve
            add     r0,r4,r5    #increment word
            stwcx.  r0,0,r3     #store new value if
                                # still reserved
            bne-    loop        #loop if lost reservation

Fetch and AND

The “Fetch and AND” primitive atomically ANDs a value into a word in storage.

In this example it is assumed that the address of the word to be ANDed is in GPR 3, the value to AND into it is in GPR 4, and the old value is returned in GPR 5.

    loop:   lwarx   r5,0,r3     #load and reserve
            and     r0,r4,r5    #AND word
            stwcx.  r0,0,r3     #store new value if
                                # still reserved
            bne-    loop        #loop if lost reservation

Note:

1. The sequence given above can be changed to perform another Boolean operation atomically on a word in storage, simply by changing the and instruction to the desired Boolean instruction (or, xor, etc.).

Test and Set

This version of the “Test and Set” primitive atomically loads a word from storage, sets the word in storage to a nonzero value if the value loaded is zero, and sets the EQ bit of CR Field 0 to indicate whether the value loaded is zero.

In this example it is assumed that the address of the word to be tested is in GPR 3, the new value (nonzero) is in GPR 4, and the old value is returned in GPR 5.

    loop:   lwarx   r5,0,r3     #load and reserve
            cmpwi   r5,0        #done if word
            bne-    $+12        # not equal to 0
            stwcx.  r4,0,r3     #try to store non-0
            bne-    loop        #loop if lost reservation

Compare and Swap

The “Compare and Swap” primitive atomically compares a value in a register with a word in storage, if they are equal stores the value from a second register into the word in storage, if they are unequal loads the word from storage into the first register, and sets the EQ bit of CR Field 0 to indicate the result of the comparison.

In this example it is assumed that the address of the word to be tested is in GPR 3, the comparand is in GPR 4 and the old value is returned there, and the new value is in GPR 5.

    loop:   lwarx   r6,0,r3     #load and reserve
            cmpw    r4,r6       #1st 2 operands equal?
            bne-    exit        #skip if not
            stwcx.  r5,0,r3     #store new value if
                                # still reserved
            bne-    loop        #loop if lost reservation
    exit:   mr      r4,r6       #return value from storage

Notes:

1. The semantics given for “Compare and Swap” above are based on those of the IBM System/370 Compare and Swap instruction. Other architectures may define a Compare and Swap instruction differently.

2. “Compare and Swap” is shown primarily for pedagogical reasons. It is useful on machines that lack the better synchronization facilities provided by lwarx and stwcx.. A major weakness of a System/370-style Compare and Swap instruction is that, although the instruction itself is atomic, it checks only that the old and current values of the word being tested are equal, with the result that programs that use such a Compare and Swap to control a shared resource can err if the word has been modified and the old value subsequently restored. The sequence shown above has the same weakness.

3. In some applications the second bne- instruction and/or the mr instruction can be omitted. The bne- is needed only if the application requires that if the EQ bit of CR Field 0 on exit indicates “not equal” then (r4) and (r6) are in fact not equal. The mr is needed only if the application requires that if the comparands are not equal then the word from storage is loaded into the register with which it was compared (rather than into a third register). If either or both of these instructions is omitted, the resulting Compare and Swap does not obey System/370 semantics.

B.2 Lock Acquisition and Release, and Related Techniques

This section gives examples of how dependencies and the Synchronization instructions can be used to implement locks, import and export barriers, and similar constructs.
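For readers who work in C rather than assembler, the atomic update primitives of Section B.1 can be approximated with C11 atomics. The following is a minimal sketch under stated assumptions, not part of the architecture: all function names are hypothetical, and a compare-exchange retry loop stands in for the lwarx/stwcx. loop. Relaxed ordering is used because the B.1 sequences by themselves impose no ordering on other storage accesses. Which instructions a compiler emits for these calls depends on the toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper names; none of them come from the architecture books. */

/* Fetch and Add: add 'increment' to *word and return the old value. */
static uint32_t fetch_and_add(_Atomic uint32_t *word, uint32_t increment)
{
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    /* A failed compare-exchange refreshes 'old' with the current value,
       so the loop retries much like "bne- loop" after a failed stwcx. */
    while (!atomic_compare_exchange_weak_explicit(word, &old, old + increment,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry with the refreshed value */
    }
    return old;
}

/* Fetch and AND: AND 'mask' into *word and return the old value. */
static uint32_t fetch_and_and(_Atomic uint32_t *word, uint32_t mask)
{
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(word, &old, old & mask,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry with the refreshed value */
    }
    return old;
}

/* Test and Set: store 'nonzero' only if *word is zero; return the old value
   (a return value of 0 means the store was performed). */
static uint32_t test_and_set(_Atomic uint32_t *word, uint32_t nonzero)
{
    uint32_t old = 0;
    atomic_compare_exchange_strong_explicit(word, &old, nonzero,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
    return old;
}

/* Compare and Swap: if *word equals *comparand, store new_value and return
   true; otherwise copy the current word into *comparand and return false. */
static bool compare_and_swap(_Atomic uint32_t *word, uint32_t *comparand,
                             uint32_t new_value)
{
    return atomic_compare_exchange_strong_explicit(word, comparand, new_value,
                                                   memory_order_relaxed,
                                                   memory_order_relaxed);
}
```

Because every helper here is built on compare-exchange, each one shares the weakness described in Note 2 for “Compare and Swap”: a word that is modified and then restored to its old value is not detected.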
B.2.1 Lock Acquisition and Import Barriers

An “import barrier” is an instruction or sequence of instructions that prevents storage accesses caused by instructions following the barrier from being performed before storage accesses that acquire a lock have been performed. An import barrier can be used to ensure that a shared data structure protected by a lock is not accessed until the lock has been acquired. A sync instruction can be used as an import barrier, but the approaches shown below will generally yield better performance because they order only the relevant storage accesses.

B.2.1.1 Acquire Lock and Import Shared Storage

If lwarx and stwcx. instructions are used to obtain the lock, an import barrier can be constructed by placing an isync instruction immediately following the loop containing the lwarx and stwcx.. The following example uses the “Compare and Swap” primitive to acquire the lock.

In this example it is assumed that the address of the lock is in GPR 3, the value indicating that the lock is free is in GPR 4, the value to which the lock should be set is in GPR 5, the old value of the lock is returned in GPR 6, and the address of the shared data structure is in GPR 9.

    loop:   lwarx   r6,0,r3         #load lock and reserve
            cmpw    r4,r6           #skip ahead if
            bne-    wait            # lock not free
            stwcx.  r5,0,r3         #try to set lock
            bne-    loop            #loop if lost reservation
            isync                   #import barrier
            lwz     r7,data1(r9)    #load shared data
            .
            .
    wait:   ...                     #wait for lock to free

The second bne- does not complete until CR0 has been set by the stwcx.. The stwcx. does not set CR0 until it has completed (successfully or unsuccessfully). The lock is acquired when the stwcx. completes successfully. Together, the second bne- and the subsequent isync create an import barrier that prevents the load from “data1” from being performed until the branch has been resolved not to be taken.

If the shared data structure is in storage that is neither Write Through Required nor Caching Inhibited, an lwsync instruction can be used instead of the isync instruction. If lwsync is used, the load from “data1” may be performed before the stwcx.. But if the stwcx. fails, the second branch is taken and the lwarx is reexecuted. If the stwcx. succeeds, the value returned by the load from “data1” is valid even if the load is performed before the stwcx., because the lwsync ensures that the load is performed after the instance of the lwarx that created the reservation used by the successful stwcx..

B.2.1.2 Obtain Pointer and Import Shared Storage

If lwarx and stwcx. instructions are used to obtain a pointer into a shared data structure, an import barrier is not needed if all the accesses to the shared data structure depend on the value obtained for the pointer. The following example uses the “Fetch and Add” primitive to obtain and increment the pointer.

In this example it is assumed that the address of the pointer is in GPR 3, the value to be added to the pointer is in GPR 4, and the old value of the pointer is returned in GPR 5.

    loop:   lwarx   r5,0,r3         #load pointer and reserve
            add     r0,r4,r5        #increment the pointer
            stwcx.  r0,0,r3         #try to store new value
            bne-    loop            #loop if lost reservation
            lwz     r7,data1(r5)    #load shared data

The load from “data1” cannot be performed until the pointer value has been loaded into GPR 5 by the lwarx. The load from “data1” may be performed before the stwcx.. But if the stwcx. fails, the branch is taken and the value returned by the load from “data1” is discarded. If the stwcx. succeeds, the value returned by the load from “data1” is valid even if the load is performed before the stwcx., because the load uses the pointer value returned by the instance of the lwarx that created the reservation used by the successful stwcx..

An isync instruction could be placed between the bne- and the subsequent lwz, but no isync is needed if all accesses to the shared data structure depend on the value returned by the lwarx.
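The lock acquisition pattern of Section B.2.1.1 can also be sketched in C11. This is an illustration under stated assumptions rather than the architecture's own method: the struct layout, the lock values, and the function names are hypothetical, a compare-exchange stands in for the lwarx/stwcx. loop, and atomic_thread_fence(memory_order_acquire) plays the role of the import barrier. Which instruction the compiler emits for that fence depends on the toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock values and layout, chosen only for this sketch. */
enum { LOCK_IS_FREE = 0, LOCK_IS_HELD = 1 };

struct shared {
    _Atomic uint32_t lock;   /* the lock word */
    uint32_t data1;          /* shared data protected by the lock */
};

/* Try once to acquire the lock. Returns false if the lock is not free,
   which corresponds to taking the branch to "wait" in the example. */
static bool try_acquire(struct shared *s)
{
    uint32_t expected = LOCK_IS_FREE;
    if (!atomic_compare_exchange_strong_explicit(&s->lock, &expected,
                                                 LOCK_IS_HELD,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed)) {
        return false;
    }
    /* Import barrier: later loads from the shared data are not performed
       before the access that acquired the lock. */
    atomic_thread_fence(memory_order_acquire);
    return true;
}

/* Acquire the lock and read data1. On success the caller holds the lock
   and is responsible for releasing it. */
static bool read_data1_locked(struct shared *s, uint32_t *out)
{
    if (!try_acquire(s)) {
        return false;
    }
    *out = s->data1;
    return true;
}
```

The fence is placed only on the success path, mirroring the assembler sequence, where the barrier instruction is reached only after the stwcx. has succeeded.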
B.2.2 Lock Release and Export Barriers

An “export barrier” is an instruction or sequence of instructions that prevents the store that releases a lock from being performed before stores caused by instructions preceding the barrier have been performed. An export barrier can be used to ensure that all stores to a shared data structure protected by a lock will be performed with respect to any other processor before the store that releases the lock is performed with respect to that processor.

B.2.2.1 Export Shared Storage and Release Lock

A sync instruction can be used as an export barrier independent of the storage control attributes (e.g., presence or absence of the Caching Inhibited attribute) of the storage containing the shared data structure. Because the lock must be in storage that is neither Write Through Required nor Caching Inhibited, if the shared data structure is in storage that is Write Through Required or Caching Inhibited a sync instruction must be used as the export barrier.

In this example it is assumed that the shared data structure is in storage that is Caching Inhibited, the address of the lock is in GPR 3, the value indicating that the lock is free is in GPR 4, and the address of the shared data structure is in GPR 9.

    stw     r7,data1(r9)    #store shared data (last)
    sync                    #export barrier
    stw     r4,lock(r3)     #release lock

The sync ensures that the store that releases the lock will not be performed with respect to any other processor until all stores caused by instructions preceding the sync have been performed with respect to that processor.

B.2.2.2 Export Shared Storage and Release Lock using eieio or lwsync

If the shared data structure is in storage that is neither Write Through Required nor Caching Inhibited, an eieio instruction can be used as the export barrier. Using eieio rather than sync will yield better performance in most systems.

In this example it is assumed that the shared data structure is in storage that is neither Write Through Required nor Caching Inhibited, the address of the lock is in GPR 3, the value indicating that the lock is free is in GPR 4, and the address of the shared data structure is in GPR 9.

    stw     r7,data1(r9)    #store shared data (last)
    eieio                   #export barrier
    stw     r4,lock(r3)     #release lock

The eieio ensures that the store that releases the lock will not be performed with respect to any other processor until all stores caused by instructions preceding the eieio have been performed with respect to that processor.

However, for storage that is neither Write Through Required nor Caching Inhibited, eieio orders only stores and has no effect on loads. If the portion of the program preceding the eieio contains loads from the shared data structure and the stores to the shared data structure do not depend on the values returned by those loads, the store that releases the lock could be performed before those loads. If it is necessary to ensure that those loads are performed before the store that releases the lock, lwsync should be used instead of eieio. Alternatively, the technique described in Section B.2.3 can be used.

B.2.3 Safe Fetch

If a load must be performed before a subsequent store (e.g., the store that releases a lock protecting a shared data structure), a technique similar to the following can be used.

In this example it is assumed that the address of the storage operand to be loaded is in GPR 3, the contents of the storage operand are returned in GPR 4, and the address of the storage operand to be stored is in GPR 5.

    lwz     r4,0(r3)    #load shared data
    cmpw    r4,r4       #set CR0 to "equal"
    bne-    $-8         #branch never taken
    stw     r7,0(r5)    #store other shared data

An alternative is to use a technique similar to that described in Section B.2.1.2, by causing the stw to depend on the value returned by the lwz and omitting the cmpw and bne-. The dependency could be created by ANDing the value returned by the lwz with zero and then adding the result to the value to be stored by the stw. If both storage operands are in storage that is neither Write Through Required nor Caching Inhibited, another alternative is to replace the cmpw and bne- with an lwsync instruction.

B.3 List Insertion

This section shows how the lwarx and stwcx. instructions can be used to implement simple insertion into a singly linked list. (Complicated list insertion, in which multiple values must be changed atomically, or in which the correct order of insertion depends on the contents of the elements, cannot be implemented in the manner shown below and requires a more complicated strategy such as using locks.)

The “next element pointer” from the list element after which the new element is to be inserted, here called the “parent element”, is stored into the new element, so that the new element points to the next element in the list; this store is performed unconditionally. Then the address of the new element is conditionally stored into the parent element, thereby adding the new element to the list.

In this example it is assumed that the address of the parent element is in GPR 3, the address of the new element is in GPR 4, and the next element pointer is at offset 0 from the start of the element. It is also assumed that the next element pointer of each list element is in a reservation granule separate from that of the next element pointer of all other list elements.

    loop:   lwarx   r2,0,r3     #get next pointer
            stw     r2,0(r4)    #store in new element
            eieio               #order stw before stwcx.
            stwcx.  r4,0,r3     #add new element to list
            bne-    loop        #loop if stwcx. failed

In the preceding example, lwsync can be used instead of eieio.

In the preceding example, if two list elements have next element pointers in the same reservation granule then, in a multiprocessor, “livelock” can occur. (Livelock is a state in which processors interact in a way such that no processor makes forward progress.)

If it is not possible to allocate list elements such that each element's next element pointer is in a different reservation granule, then livelock can be avoided by using the following, more complicated, sequence.

            lwz     r2,0(r3)    #get next pointer
    loop1:  mr      r5,r2       #keep a copy
            stw     r2,0(r4)    #store in new element
            sync                #order stw before stwcx.
                                # and before lwarx
    loop2:  lwarx   r2,0,r3     #get it again
            cmpw    r2,r5       #loop if changed (someone
            bne-    loop1       # else progressed)
            stwcx.  r4,0,r3     #add new element to list
            bne-    loop2       #loop if failed

In the preceding example, livelock is avoided by the fact that each processor reexecutes the stw only if some other processor has made forward progress.

B.4 Notes

1. To increase the likelihood that forward progress is made, it is important that looping on lwarx/stwcx. pairs be minimized. For example, in the “Test and Set” sequence shown in Section B.1, this is achieved by testing the old value before attempting the store; were the order reversed, more stwcx. instructions might be executed, and reservations might more often be lost between the lwarx and the stwcx..

2. The manner in which lwarx and stwcx. are communicated to other processors and mechanisms, and between levels of the storage hierarchy within a given processor, is implementation-dependent. In some implementations performance may be improved by minimizing looping on a lwarx instruction that fails to return a desired value. For example, in the “Test and Set” sequence shown in Section B.1, if the programmer wishes to stay in the loop until the word loaded is zero, he could change the “bne- $+12” to “bne- loop”. However, in some implementations better performance may be obtained by using an ordinary Load instruction to do the initial checking of the value, as follows.

    loop:   lwz     r5,0(r3)    #load the word
            cmpwi   r5,0        #loop back if word
            bne-    loop        # not equal to 0
            lwarx   r5,0,r3     #try again, reserving
            cmpwi   r5,0        # (likely to succeed)
            bne-    loop
            stwcx.  r4,0,r3     #try to store non-0
            bne-    loop        #loop if lost reserv'n

3. In a multiprocessor, livelock is possible if there is a Store instruction (or any other instruction that can clear another processor's reservation; see Section 1.7.3.1) between the lwarx and the stwcx. of a lwarx/stwcx. loop and any byte of the storage location specified by the Store is in the reservation granule. For example, the first code sequence shown in Section B.3 can cause livelock if two list elements have next element pointers in the same reservation granule.

Appendix C. Cross-Reference for Changed POWER Mnemonics

The following table lists the POWER instruction mnemonics that have been changed in the PowerPC Virtual Environment Architecture, sorted by POWER mnemonic.

To determine the PowerPC mnemonic for one of these POWER mnemonics, find the POWER mnemonic in the second column of the table: the remainder of the line gives the PowerPC mnemonic and the page on which the instruction is described, as well as the instruction names.

POWER mnemonics that have not changed are not listed.

            POWER                                    PowerPC
    Page    Mnemonic  Instruction                    Mnemonic  Instruction
    19      dclz      Data Cache Line Set to Zero    dcbz      Data Cache Block set to Zero
    25      dcs       Data Cache Synchronize         sync      Synchronize
    21      ics       Instruction Cache Synchronize  isync     Instruction Synchronize

Appendix D. New Instructions

The following instructions in the PowerPC Virtual Environment Architecture are new: they are not in the POWER Architecture.
The eciwx and ecowx instructions are optional.

    dcbf      Data Cache Block Flush
    dcbst     Data Cache Block Store
    dcbt      Data Cache Block Touch
    dcbtst    Data Cache Block Touch for Store
    eciwx     External Control In Word Indexed
    ecowx     External Control Out Word Indexed
    eieio     Enforce In-order Execution of I/O
    icbi      Instruction Cache Block Invalidate
    ldarx     Load Doubleword And Reserve Indexed
    lwarx     Load Word And Reserve Indexed
    mftb      Move From Time Base
    stdcx.    Store Doubleword Conditional Indexed
    stwcx.    Store Word Conditional Indexed

Appendix E. PowerPC Virtual Environment Instruction Set

            Opcode            Mode
    Form    Primary  Extend   Dep.¹   Page   Mnemonic  Instruction
    X       31       86               20     dcbf      Data Cache Block Flush
    X       31       54               19     dcbst     Data Cache Block Store
    X       31       278              18     dcbt      Data Cache Block Touch
    X       31       246              18     dcbtst    Data Cache Block Touch for Store
    X       31       1014             19     dcbz      Data Cache Block set to Zero
    X       31       310              34     eciwx     External Control In Word Indexed
    X       31       438              34     ecowx     External Control Out Word Indexed
    X       31       854              27     eieio     Enforce In-order Execution of I/O
    X       31       982              17     icbi      Instruction Cache Block Invalidate
    XL      19       150              21     isync     Instruction Synchronize
    X       31       84               23     ldarx     Load Doubleword And Reserve Indexed
    X       31       20               23     lwarx     Load Word And Reserve Indexed
    XFX     31       371              30     mftb      Move From Time Base
    X       31       214              24     stdcx.    Store Doubleword Conditional Indexed
    X       31       150              24     stwcx.    Store Word Conditional Indexed
    X       31       598              25     sync      Synchronize

¹ Key to Mode Dependency Column

Except as described in the section entitled “Effective Address Calculation” in Book I, all instructions in the PowerPC Virtual Environment Architecture are independent of whether the processor is in 32-bit or 64-bit mode.
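As a worked illustration of the primary and extended opcode columns, the following C sketch assembles the 32-bit instruction word for a few entries of the table. The helper name encode_x is hypothetical. The field positions follow the instruction layouts shown with the instruction descriptions: bit 0 is the most significant bit, the primary opcode occupies bits 0:5, three 5-bit fields start at bits 6, 11 and 16, and the extended opcode occupies bits 21:30. The sketch also assumes that the L field of sync occupies bits 9:10, so that it can be passed as the value of the first 5-bit field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build an X-form or XL-form instruction word.
   f6, f11 and f16 are the 5-bit fields that start at bits 6, 11 and 16
   (for example RT/RS or TH, then RA, then RB); bit 31 is left as 0. */
static uint32_t encode_x(uint32_t primary, uint32_t f6, uint32_t f11,
                         uint32_t f16, uint32_t extended)
{
    /* Reject values that do not fit their fields. */
    assert(primary < 64 && f6 < 32 && f11 < 32 && f16 < 32 && extended < 1024);

    /* Bit 0 is the most significant bit, so a field that ends at bit k
       is shifted left by (31 - k). */
    return (primary << 26) | (f6 << 21) | (f11 << 16) | (f16 << 11)
         | (extended << 1);
}
```

For example, encode_x(31, 0, 0, 0, 598) gives 0x7C0004AC for sync, encode_x(31, 1, 0, 0, 598) gives 0x7C2004AC for lwsync, and encode_x(31, 0, 0, 3, 278) gives 0x7C001A2C for a dcbt with TH=0b00, RA=0 and RB=3.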