High Performance Computing 2nd Edition

ISBN13: 9781565923126
ISBN10: 156592312X
Condition: Standard

Excerpt

Symbols / Numbers
32- and 64-bit numbers, 69
100x100 program (Linpack benchmark), 333-334
21164 processor (DEC), 388
21264 processor (DEC), 390

A
access time, 33
addresses (see pointers)
addressing memory, modes for, 22
advanced optimization, 90
ALLCACHE architecture, 264
allocatable arrays (FORTRAN 90), 282
Alpha 21164 processor (DEC), 388
Alpha 21264 processor (DEC), 390
ambiguous references, 187-190
Amdahl, Gene, 246
Amdahl's Law, 109, 246-248
antidependencies, 175, 184
architecture (see ISAs)
arrays
   element storage, 161
   FORTRAN 90 and, 279-280, 282
   handling elements in loops, 143
   memory access patterns, 157-160
assembly language, 9, 395-406
assertions (parallel-processing comments), 229-232
assignment primitive (FORTRAN 90), 281
"assume no dependencies" flag, 228
automatic arrays (FORTRAN 90), 282
automatic parallelization, 222-228
   compiler considerations, 226-241
autopar flag, 223
average memory utilization, 103

B
backward dependencies (see flow dependencies)
bandwidth, 47
barriers, 215
base-10, converting to floating-point, 70
baseline timing, 107
basic block profilers, 120-122
basic blocks, 88
Basic optimizations (optimization level), 90
BCD (binary coded decimal), 59
benchmark kits, 359
benchmarks for performance, 100, 329-346
   creating your own, 347-364
      checklist for, 360-363
      coding, 358-363
      types of benchmarks, 353-358
      what to benchmark, 348-353
   industry-defined, 339-345
      SPEC benchmarks, 340-344
      for transaction processing, 344-345
   memory performance, 425-434
   Mflops, 331
   MIPS, 330
   program timing, 101-107
   subroutine profiles, 108-120
   user-defined, 332-339
      HINT benchmark, 339
      Linpack benchmark, 332-334
      Linpack Highly Parallel Computing benchmark, 334-335
      NAS benchmark, 336
      SPLASH benchmark, 335
      STREAM benchmark, 337
binary coded decimal (BCD), 59
blocked I/O operations, 104
blocking, 161-165
branch delay slots, 19
branch instructions
   delayed, 19-21
   detecting, 16
   predicated instructions, 410-411
branches, 134-139
   transferring control with, 138
broadcast/gather computing model (MPI), 302, 320-324
buses, 194, 250
   monitoring traffic on, 198, 263
bypassing caches, 50

C
C language on MC68020, 398
cache coherency protocol, 198
cache-coherent nonuniform memory access (ccNUMA), 263-264
cached DRAM, 55
caches, 35-39
   bypassing, 50
   coherency between, 37, 198, 263-264
      shared memory without, 264
   large, 47
   line states, 199
   mapping schemes, 39-43
   prefetching, 52
   shared multiprocessing and, 196-200
call graphs, 113
ccNUMA (cache-coherent nonuniform memory access), 263-264
character operations, 140
CISC processors, 8-11, 367-375
   instruction length, 18
classical optimizations, list of, 91-98
clock speed, 14
clutter, eliminating, 128-145
   common subexpression elimination, 95, 141
   conditional branches, 134-139
   data type conversions, 139
   macros, 131-133
   procedure inlining, 133
   reorganizing loops, 142-144
   subroutine calls, 129-131
code (see programs)
coherency between caches, 37, 198, 263-264
   shared memory without, 264
common subexpression elimination, 95, 141
common variables
   memory storage for, 104
   subroutine calls and, 130
communication modes (MPI), 313
communicators (MPI), 313
company, benchmarking, 353
compilation process, 85
compilers, 81-99
   automatic parallelization and, 226-241
   benchmarking, 353
   choosing language for, 83-85
   classical optimizations, list of, 91-98
   floating-point operations and, 75
   history of, 82-83
   how they work, 85-89
   macros, 131-133
   optimization levels, 89-90
complex instruction sets, 9-11, 369-371
computations with floating-point numbers (see floating-point operations)
conditional branches, 134-139
   loops with, 134-139, 152
   transferring control with, 138
conditional execution, 20
constant folding, 92
control dependencies, 173-175
control features, FORTRAN 90, 281
control section (CPU), 372
control structures (HPF), 294
control-transfer conditionals, 138
Convex C-240 processor, 404
copy propagation, 91
copyback caching, 199
cpp macros, 131-133
CPU registers, 34
CPU time, 102
   elapsed time and, 103
   etime function for (FORTRAN), 105
CPUs
   control section, 372
   dividing work among, 287-289
Cray-1S computer, 375
critical sections, 214, 234
crossbars, 194, 251
custom benchmarks, 347-364
   checklist for, 360-363
   coding, 358-363
   types of benchmarks, 353-358
   what to benchmark, 348-353
cycle time, 33

D
DAGs (directed acyclic graphs), 177-178
Dally, William, 253
data decomposition, 287
data dependencies, 173, 175-176
   loops and, 181-187
data flow analysis, 90, 178
data flow processors, 384
data layout directives (HPF), 291-294
data-parallel problems, 273-277, 288
   in FORTRAN 90, 283-285
   in HPF, 296
   in MPI, 314-324
   in PVM, 305-311
data type conversions, 139
dead code removal, 93
DEC Alpha 21164 processor, 388
DEC Alpha 21264 processor, 390
decomposing work for CPU distribution, 287-289
#define construct, 131
delayed branch instructions, 19-21
demand paging, 46
denormalized numbers, 74
dependencies, 173-180
   ambiguous references, 187-190
   "assume no dependencies" flag, 228
   loops and, 181-187
   "no/ignore dependencies" assertions, 229
dependency distance, 187
dependent loop conditionals, 137
development environment, benchmarking, 353
direct mapped caches, 39-40
directed acyclic graphs (DAGs), 177-178
directory-based caching, 263
distributed-memory MIMD architecture, 265-267
DOALL loops, 226
Dongarra, Jack, 332
DOUBLE data type, 69
double-precision calculations, 139
DRAM (dynamic RAM), 32
   access and cycle times, 33
   technology trends, 54
   (see also memory)
dual mode (i860 processor), 382
dynamic iteration scheduling, 239
dynamic memory allocation, 282
dynamic random access memory (see DRAM)

E
EDO RAM, 55
elapsed time, 102
   gettimeofday routine (example), 106
   page faults and swaps, 105
EPIC (explicitly parallel instruction computing), 29, 407-413
etime function (FORTRAN), 105
exceptions for floating-point operations, 75
exclusive cache lines, 199
executing instructions (see instructions)
execution profiles (see subroutine profiles)
execution rate of system, 330
explicitly parallel instruction computing (EPIC), 29, 407-413
exponent/mantissa representation, 61-63
extended data out RAM, 55
external variables
   memory storage for, 104
   subroutine calls and, 130
extrinsics, HPF, 295

F
families of architecture, 373
fast instruction set computer (FISC), 25
fast page mode DRAM, 55
fat loops, 151
fences, 128
Firstprivate option, 236
FISC (fast instruction set computer), 25
fixed-point representation, 61
fixed thread scheduling, 414
FLASH project, 335
flat profiles, 109
   gprof's timing profile, 117
flexibility, subroutine calls and, 129
floating-point operations, 58-78
   CISC processors and, 9, 13
   compiler issues, 75
   consequences of inexact representation, 63-67
   exceptions and traps, 75
   guard digits, 67, 73
   IEEE 754 standard, 68-76
   Mflops, 331
   MIPS R8000 processor and, 387
   number representation schemes, 59-63
   number storage format, 69
   optimizations, 90
   parallelization and, 228
   pipelining, 18
   RISC processors and, 12
   special values, 73
   subroutine-based, 130
flow dependencies, 175, 183
flow of control, 173
Flynn, Michael, 257
FORALL statement (HPF), 294
fork(), 202
fork-join programming, 213
FORTRAN language, 83-85
   cpp macros with, 132
   FORTRAN 77, 84, 278, 285
   FORTRAN 90, 278-287
      FORTRAN 77 vs., 84, 285
      heat flow problem, 283-285
      HPF and, 290
   FORTRAN 95, 290
   HPF (High Performance FORTRAN), 289-297, 324
   thread management, 414-424
fractions to represent real numbers, 60
fully associative caches, 40
function calls, 129-131
   loops with, 152

G
gather (memory access pattern), 39
gettimeofday routine (example), 106
gprof profiling tool, 113-119
   accumulating several result sets, 118
   flat profile of, 117
gradual underflow, 74
granularity, 171
greatest common divisor (GCD), 61
guard digits, 67, 73
Gustafson, John, 339
Gustafson's Law, 247

H
HAL SPARC64 processor, 390
hardware performance benchmarks, 329-346
   creating your own, 347-364
      checklist for, 360-363
      coding, 358-363
      types of benchmarks, 353-358
      what to benchmark, 348-353
   industry-defined, 339-345
      SPEC benchmarks, 340-344
      for transaction processing, 344-345
   Mflops, 331
   MIPS, 330
   user-defined, 332-339
      HINT benchmark, 339
      Linpack benchmark, 332-334
      Linpack Highly Parallel Computing benchmark, 334-335
      NAS benchmark, 336
      SPLASH benchmark, 335
      STREAM benchmark, 337
hardware, vector processors and, 379
Harvard Memory Architecture, 42
heat flow problem, 273-277
   in FORTRAN 90, 283-285
   in HPF, 296
   in MPI, 314-324
   in PVM, 305-311
high performance (see performance)
High Performance FORTRAN (HPF), 289-297, 324
   control structures, 294
   data layout directives, 291-294
   heat flow problem, 296
HINT benchmark, 339
hinv command, 124
hoisting/sinking loop operations, 142
HP PA-8000 processor, 392
HPF (High Performance FORTRAN), 289-297, 324
   control structures, 294
   data layout directives, 291-294
   heat flow problem, 296
hypercube interconnect topology, 255

I
IA-64 processor, 29, 407-413
IBM PowerPC 64 processor, 391
IBM RS-6000 processor, 405
IEEE 754 floating-point standard, 68-76
if statement (see conditional branches)
"ignore dependencies" assertion, 229
IL (intermediate language) representation, 86
independent loop conditionals, 137
index, cache, 36
induction variable simplification, 96
industry performance benchmarks, 339-345
   SPEC benchmarks, 340-344
   for transaction processing, 344-345
infinity (value), 74
inline substitution assertions, 232
inlining subroutines and functions, 133
inner loops, 153
instruction caches, 42
instruction-level parallelism, 171
Instruction Reorder Buffer, 392
instruction set architectures (see ISAs)
instructions
   architectures for (see ISAs)
   branches (see branch instructions), 16
   conditional execution, 20
   for data flow processors, 385
   load/store architecture, 21
   microcode during execution, 374
   microinstructions, 373
   out-of-order (speculative) execution, 25-28
      RS (Reservation Station), 393
   pipelining, 14-18
      floating-point operations, 18
      memory pipelines, 377
      RISC vs. post-RISC, 27
      superpipelined processors, 23, 387
   predicated, 410-411
   reduced sets (see RISC processors)
   simultaneous execution of, 14
      superscalar processors, 23, 382, 389
   uniform instruction length, 18-19
   for vector processors, 378-379
insufficient memory for programs, 166
   determining in advance, 123
Intel 8008 processor, 375
Intel 8088 processor, 396
Intel i860 processor, 382
Intel IA-64 processor, 29, 407-413
Intel Pentium Pro processor, 393
interactive benchmarks, 357
interchanging loops, 156
   memory access patterns and, 159-161
interconnect technology, 249-257
interleaved memory systems, 51
intermediate language (IL) representation, 86
interprocedural analysis, 90, 232
intrinsics, FORTRAN 90, 280
intrinsics, HPF, 295
invariant conditionals, 135
I/O operations, blocked, 104
IRB (Instruction Reorder Buffer), 392
irregular interconnect topologies, 256
ISAs (instruction set architectures), 367-394
   architecture families, 373
   CISC (see CISC processors)
   experiments on, 381-386
   FISC, 25
   future trends, 28-29
   RISC (see RISC processors)
   VLIW (very long instruction word), 383
iteration scheduling, 236-241

K
kernel benchmarks, 335, 349-351
kernel mode, 101

L
language, compiler, 83-85
large-scale parallelism, 245-271
   architecture taxonomy, 257-259
   distributed-memory MIMD, 265-267
   interconnect technology, 249-257
   shared nonuniform MIMD, 261
   shared uniform MIMD, 260
   SIMD machines, 267-270
Lastprivate option, 236
latency, 47
line states, caches, 199
lines, cache, 36
   (see also caches, mapping schemes)
Linpack benchmark, 332-334
Linpack Highly Parallel Computing benchmark, 334-335
load/store architecture, 21
locality of reference, 37, 196
loop index dependent conditionals, 136
loop-invariant code motion, 96
loop-invariant conditionals, 135
loop nests, 153
   parallelization of, 226
   rearranging, 156, 159-161
loopinfo flag, 223
loops
   array elements in, 143
   conditional branches in, 134-139, 152
   dependencies and, 181-187
   hoisting/sinking operations, 142
   loop interchange, 156, 159-161
   memory access patterns and, 157-160
   nested, 153-156
   optimizing, 146-168
      loop unrolling, 149-153
      operation counting, 147-149
   parallel loops, 235
   parallelism and, 180-181
   preconditioning loops, 150
   procedure calls in, 152
lscfg command, 124

M
machine balance, 338
macros, 131-133
magnetic core memory, 32
mantissa/exponent representation, 61-63
manual parallelism, 233-241
mapping caches, 39-43
master/slave computing model, 301
matrix multiplication, 159
max function, 138
MC68000 processors, 368
measuring performance, 6, 329-346
   creating your own benchmarks, 347-364
      checklist for, 360-363
      coding, 358-363
      types of benchmarks, 353-358
      what to benchmark, 348-353
   industry-defined benchmarks, 339-345
      SPEC benchmarks, 340-344
      for transaction processing, 344-345
   memory performance, 425-434
   Mflops, 331
   MIPS, 330
   user-defined benchmarks, 332-339
      HINT benchmark, 339
      Linpack benchmark, 332-334
      Linpack Highly Parallel Computing benchmark, 334-335
      NAS benchmark, 336
      SPLASH benchmark, 335
      STREAM benchmark, 337
mem1d benchmark, 425-429
mem2d benchmark, 430-434
memory, 31-57
   access and cycle times, 33
   access patterns, 37, 157-160
      blocking and, 161-165
   addressing modes, 22
   average memory utilization, 103
   bandwidth benchmark (STREAM), 337
   benchmarking, 349
   caches (see caches)
   data flow processor semantics, 384
   dynamic memory allocation (FORTRAN 90), 282
   floating-point storage format, 69
   HINT benchmark, 339
   improving performance of, 47-55
   insufficient for programs, 123, 166
   interleaved and pipelined systems, 51
   load/store architecture and, 21
   mem1d and mem2d benchmarks, 425-434
   page faults, 105, 124
   post-RISC architecture and, 53
   registers, 34
   shared-memory multiprocessors, 193-221
      caches and, 196-200
      data placement/movement, 200
      example of, 216-219
      multithreading techniques, 212-216
      programming, 222-242
      SMP hardware, 194-200
      software concepts, 200-212
      (see also parallelism)
   speculative loads, 411-412
   technology of, 32-34, 54
   vector processor memory systems, 377
   virtual memory, 43-47, 123-125
memory pipelines, 377
memsize command, 124
Merced (see Intel IA-64 processor)
mesh connections, 253
MESI protocol, 199
message-passing environments, 4, 299-325
   MPI (message-passing interface), 312-324
   PVM (parallel virtual machine), 300-312
message-passing interface (MPI), 312-324
   heat flow problem, 314-324
   PVM vs., 313
Mflops, 331
microcoding, 371-375
microinstructions, 373
microprocessors (see processors)
   EPIC and Intel IA-64, 28-29, 407-413
microprograms, 373
MIMD architecture
   distributed-memory MIMD, 265-267
   shared nonuniform memory MIMD, 261
   shared uniform memory MIMD, 260
min function, 138
MINs (multistage interconnection networks), 252
MIPS architecture
   R10000 processor, 391
   R2000, R3000 processors, 380
   R4000, R4400 processors, 387
   R4300i processor, 388
   R8000 processor, 387
MIPS (millions of instructions per second), 330
mixed stride, 165
mixed typing, 139
Motorola MC68000 processors, 368
Motorola MC68020 processor, 397-399
MPI (message-passing interface), 312-324
   heat flow problem, 314-324
   PVM vs., 313
MQUIPS, 339
Multiflow's Trace systems, 383
multiple instruction issue processors, 23
multiprocessors, 37
   operating-system-supported, 201-205
   shared-memory, 193-221
      caches and, 196-200
      data placement/movement, 200
      example of, 216-219
      multithreading techniques, 212-216
      programming, 222-242
      SMP hardware, 194-200
      software concepts, 200-212
      (see also parallelism)
   (see also multithreading)
multistage interconnection networks (MINs), 252
multithreading, 200
   barriers, 215
   example of, 216-219
   FORTRAN thread management, 414-424
   operating-system-supported, 210-212
   parallel regions, 233-235
   synchronization, 213-215
   thread-level parallelism, 171
   thread private area, 205
   user space, 205-210
mutexes, 214

N
NaN (not a number), 74
NAS parallel benchmarks, 336
nested loops, 153-156
network of workstations (NOW), 4, 265
Nintendo-64 video game, 388
Ni's Law, 247
"no dependencies" assertion, 229
"no equivalences" assertion, 231
No optimization (optimization level), 90
"no side effects" assertions, 232
nonuniform memory access (NUMA) systems, 5, 261
   ccNUMA, 263-264
non-unit stride, 37, 159-161
not a number (NaN), 74
NOW (network of workstations), 4, 265
NPB benchmarks, 336
NUMA (nonuniform memory access) systems, 5, 261
   ccNUMA, 263-264

O
object code generation, 97
OpenMP standard, 233
operating system, benchmarking, 353
operating system-supported multiprocessors, 201-205
operating system-supported multithreading, 210-212
operation counting, 147-149
optimizing compilers, 81-99
   choosing language for, 83-85
   classical optimizations, list of, 91-98
   how they work, 85-89
   optimization levels, 89-90
optimizing performance (see performance)
orthogonality of benchmarks, 348
"Out of memory?" message, 125
outer loops, 153
out-of-core solutions, 123, 166
out-of-order (speculative) execution
   RS (Reservation Station), 393
out-of-order instruction execution, 25-28
output dependencies, 175, 185
overflow errors, 70
overlapping instruction execution, 14
   superscalar processors, 23, 382, 389
   (see also pipelining instructions)

P
PA-8000 processor (HP), 392
page faults, 105, 124
pages of memory, 43
   page faults, 46
   page tables, 43
parallel regions, 233-235
parallel virtual machine (PVM), 300-312, 324
   heat flow problem, 305-311
   MPI vs., 313
parallelism, 171-192
   ambiguous references, 187-190
   Amdahl's Law, 109, 246-248
   architecture taxonomy, 257-259
   assertions, 229-232
   automatic parallelization, 222-228
   broadcast/gather computing model (MPI), 302, 320-324
   DAGs (directed acyclic graphs), 177-178
   dependencies, 173-180
      loops and, 181-187
   distributed-memory MIMD, 265-267
   heat flow problem, 273-277
      in FORTRAN 90, 283-285
      in HPF, 296
      in MPI, 314-324
      in PVM, 305-311
   interconnect technology, 249-257
   large-scale, 245-271
   loops and, 180-181, 235
   manual, 233-241
   message-passing environments, 299-325
      MPI (message-passing interface), 312-324
      PVM (parallel virtual machine), 300-312
   parallel languages, 277-278
   shared nonuniform MIMD, 261
   shared uniform MIMD, 260
   SIMD machines, 267-270
partial differential equations, 276
Pentium Pro processor (Intel), 393
percent utilization, 103
performance
   Amdahl's Law, 109, 246-248
   benchmarks, 329-346
      creating your own, 347-364
      industry-defined, 339-345
      memory performance, 425-434
      Mflops, 331
      MIPS, 330
      user-defined, 332-339
   dividing work among CPUs, 287-289
   eliminating clutter, 128-145
      common subexpression elimination, 95, 141
      conditional branches, 134-139
      data type conversions, 139
      macros, 131-133
      procedure inlining, 133
      reorganizing loops, 142-144
      subroutine calls, 129-131
   importance of, 3
   loops (see loops)
   measuring, 6
   memory, 425-434
   parallelism (see parallelism)
   program timing, 101-107
   programming languages and, 272-298
   scope of, 4
   studying, 5
   subroutine profiles, 108-120
   virtual memory and, 123-125
permutation assertions, 231
permutations, 188
pipelined memory systems, 51
pipelined multistep communications, 251
pipelining instructions, 14-18
   floating-point operations, 18
   memory pipelines, 377
   RISC vs. post-RISC, 27
   superpipelined processors, 23, 387
pixie basic block profiler, 122
pointer ambiguity, 188-190
pointer chasing, 38
pointers (addresses), 84
   common subexpression elimination, 95
portability, benchmark, 358
post-RISC architecture, 25-28, 390-393
   loop unrolling, 150
   memory references, 53
PowerPC architecture (IBM), 391
precision of floating-point operations
   guard digits, 67, 73
   inexact representation and, 63-67
preconditioning loops, 150
predicated instructions, 410-411
prefetching, 52
problem decomposition, 248, 287-289
procedure calls, 129-133
   inlining, 133
   loops with, 152
   macros, 131-133
processors, 8-30
   CISC (see CISC processors)
   compiler design and, 83
   data flow processors, 384
   RISC (see RISC processors)
   superscalar, 23
      Intel i860 processor, 382
      Sun UltraSPARC processor, 389
   vector processors, 375-380
   VLIW processors, 383
prof profiling tool, 110-112
profiling, 108-120
   Amdahl's Law, 109, 246-248
   basic block profilers, 120-122
   quantization errors, 119
   runtime profile analysis, 90
   sharp vs. flat, 108
   subroutine profilers, 110-119
   virtual memory, 123-125
programming languages, 272-298
   assembly language, 395-406
programs
   eliminating clutter, 128-145
      common subexpression elimination, 95, 141
      conditional branches, 134-139
      data type conversions, 139
      macros, 131-133
      procedure inlining, 133
      reorganizing loops, 142-144
      subroutine calls, 129-131
   insufficient memory for, 123, 166
   subroutine profiles, 108-120
   timing, 101-107
      Amdahl's Law, 109, 246-248
      getting time information, 105
ps aux command, 124
pthread_create(), 207, 213
pthread_join(), 208
pthread_mutex_init(), 215
pthreads library, 209
published benchmarks for performance, 329-346
   industry-defined, 339-345
      SPEC benchmarks, 340-344
      for transaction processing, 344-345
   Mflops, 331
   MIPS, 330
   user-defined, 332-339
      Linpack benchmark, 332-334
      Linpack Highly Parallel Computing benchmark, 334-335
      NAS benchmark, 336
      SPLASH benchmark, 335
PURE functions (HPF), 294
PVM (parallel virtual machine), 300-312, 324
   heat flow problem, 305-311
   MPI vs., 313

Q
quadruples, 87
quantization errors in profiling, 119
queue of tasks, 302
QUIPS (quality improvement per second), 339

R
R10000 processor (MIPS), 391
R2000, R3000 processors (MIPS), 380
R4000, R4400 processors (MIPS), 387
R4300i processor (MIPS), 388
R8000 processor (MIPS), 387
RAM, dynamic vs. static, 32
RAMBUS technology, 55
rational number representation of reals, 60
REAL data type, 69
real numbers, 58
red-black technique, 274
   (see also heat flow problem)
reduced instruction set (see RISC processors)
reductions, 138, 186, 236
   automatic parallelization of, 228
   FORTRAN 90, 281
   SIMD machines and, 269
registers, 34
   IA-64 processor and, 413
   rename registers, 27
   vector registers, 376
relation assertions, 230
remote terminal emulation benchmark, 344
rename registers, 27
renaming variables, 94
Reservation Station (RS), 393
RISC processors, 4, 8, 11-25, 50, 83, 380-381
   design philosophy of, 13-14
   instruction length, 18
   post-RISC architecture, 25-28, 390-393
      loop unrolling, 150
      memory references, 53
   second-generation, 22-24, 387-389
   SIMD machines and, 269
   SPARC architecture, 399-403
RS (Reservation Station), 393
RS-6000 processor, 405
runtime (see benchmarks for performance; performance; profiling)

S
S3MP project, 256
sanitized benchmarks, 349-351
sar command, 125
scalable coherent interface (SCI), 264
scalable computing, 246
scalable NUMA systems, 5, 261
   ccNUMA, 263-264
scalable processors, 259
scatter (memory access pattern), 39
sched_yield(), 210
scheduling iterations, 236-241
SCI (scalable coherent interface), 264
SDRAM (synchronous DRAM), 55
second-generation RISC processors, 22-24, 387-389
semaphores, 214
set associative caches, 41
SGI Challenge, 416, 419-423
shared cache lines, 200
shared nonuniform memory MIMD systems, 261
shared uniform memory MIMD systems, 260
shared-memory multiprocessors, 193-221
   caches and, 196-200
   data placement/movement, 200
   example of, 216-219
   multithreading techniques, 212-216
   programming, 222-242
   SMP hardware, 194-200
   software concepts, 200-212
   (see also parallelism)
shared-memory space, 103
sharp profiles, 108
side effects, 232
sign bits, 135
SIMD machines, 267-270
simulations, 245
simultaneous instruction execution, 14
   superscalar processors, 23, 382, 389
single-call collective operations (MPI), 314
single-chip processors, 11
single instruction mode (i860 processor), 382
single instruction, multiple data (SIMD) machines, 267-270
single-precision calculations, 139
single stream benchmarks, 354-356
sinking/hoisting loop operations, 142
SISAL language, 278
SISD architecture, 257
size command, 123
SLALOM benchmark, 339
SMP (symmetric multiprocessing), 4
   hardware for, 194-200
   software concepts, 200-212
   (see also shared-memory multiprocessors)
snooping bus traffic, 198, 263
software for shared-memory multiprocessors, 200-212
software-managed caches, 52
software-managed out-of-core solutions, 123, 166
SPARC architecture, 399-403
SPARC-1 processor, 380
SPARC64 processor (HAL), 390
spatial locality of reference, 37
SPEC benchmarks, 340-344
speculative instruction execution, 25-28
   RS (Reservation Station), 393
speculative loads, 411-412
speedup, 224
   Amdahl's Law and, 246
SPLASH benchmark, 335
SPMD/data decomposition computing model, 302
SRAM (static random access memory), 32
   caches, 35-43
   integrated with DRAM, 55
   (see also memory)
statement function, 131
static iteration scheduling, 237
static random access memory (see SRAM)
sticky bits, 72
STREAM benchmark, 337
strength reduction, 94
subexpression elimination, 95, 141
subnormal numbers, 70
subroutine calls, 129-131, 152
subroutine profiles, 108-120
   Amdahl's Law, 109, 246-248
   basic block profilers, 120-122
   profilers, 110-119
      quantization errors, 119
   sharp vs. flat, 108
Sun UltraSPARC processor, 389
superpipelined processors, 23, 387
superscalar processors, 23
   Intel i860 processor, 382
   Sun UltraSPARC processor, 389
swap area, 123
swaps, 105
symmetric multiprocessing (SMP), 4
   hardware for, 194-200
   software concepts, 200-212
   (see also shared-memory multiprocessors)
synchronization of threads, 213-215
synchronous DRAM (SDRAM), 55
system time, 102

T
tags (cache memory addresses), 39
task decomposition, 287
task queue computing model, 302
tcov basic block profiler, 120
temporal locality of reference, 37
third-party code, benchmarking, 351-352
thrashing, 39
thread-level parallelism, 171
thread private area, 205
threads (see multithreading)
throughput benchmarks, 356
tightly coupled distributed-memory MIMD systems, 266-267
time-based simulations, 215
time command, 101, 103
timing, 101-107
   Amdahl's Law, 109, 246-248
   getting time information, 105
   percent utilization, 103
   portions of programs, 105
   subroutine profiles, 108-120
TLB (translation lookaside buffer), 44
Top 500 Report, 260
topologies (MPI), 313
toroid connections, 254
TPC (transaction processing) benchmarks, 344-345
Trace systems (Multiflow), 383
transaction processing benchmarks, 344-345
translation lookaside buffer (TLB), 44
traps for floating-point operations, 75
trip count assertions, 231
trip counts, loop, 151

U
UltraSPARC processor (Sun), 389
UMA (uniform memory access), 193
uniform instruction length, 18-19
uniform memory access (UMA), 193
uniform-memory multiprocessors, 259
   shared nonuniform memory MIMD systems, 261
   shared uniform memory MIMD systems, 260
unit stride, 37, 158-160
unlimit command, 125
unrolling loops, 149-153
   nested loops, 153
   qualifying candidates for, 150-153
unsafe optimization flag, 228
unshared-memory space, 104
uptime command, 102
user-defined performance benchmarks, 332-339
   HINT benchmark, 339
   Linpack benchmark, 332-334
   Linpack Highly Parallel Computing benchmark, 334-335
   NAS benchmark, 336
   SPLASH benchmark, 335
   STREAM benchmark, 337
user mode, 101
user space multithreading, 205-210
user-space thread context switch, 208

V
variable length instructions, 18
variable renaming, 94
variable uses and definitions, 178
variables, subroutine calls and, 130
vector processors, 375-380
vector registers, 376
virtual machine, parallel (PVM), 300-312
virtual memory, 43-47, 123-125
VLIW architecture, 383
vmstat command, 124

W
wait(), 205
wait states, 33
wall clock time, 102
   gettimeofday routine (example), 106
   page faults and swaps, 105
widening the memory system, 48
workstations, 13
   network of (see NOW)
wormhole routing, 253
writeback caching, 199

Product Details

ISBN: 9781565923126
Binding: Trade Paperback
Publication date: 07/01/1998
Publisher: OREILLY & ASSOCIATES INC
Series info: RISC Architectures, Optimization & Benchmarks
Edition: 2ED
Pages: 464
Height: 9.19 in.
Width: 7 in.
Thickness: 1.03 in.
Series: RISC Architectures, Optimization & Benchmarks
Number of Units: 1
Copyright Year: 1998
UPC Code: 2801565923128
Author: Charles Severance
Author: Kevin Dowd
Subject: MULTIPROCESSING
Subject: Computer Books: General
Subject: Electronic digital computers
Subject: DIGITAL COMPUTERS
Subject: General Computers
Subject: Technology
Subject: Multi-Tasking (Data Processing)
Subject: Supercomputers
Subject: Computers
Subject: Programming (electronic computers)
Subject: High performance computing
Subject: Architecture
