S
S3MP project, 256
sanitized benchmarks, 349-351
sar command, 125
scalable coherent interface (SCI), 264
scalable computing, 246
scalable NUMA systems, 5, 261
ccNUMA, 263-264
scalable processors, 259
scatter (memory access pattern), 39
sched_yield(), 210
scheduling iterations, 236-241
SCI (scalable coherent interface), 264
SDRAM (synchronous DRAM), 55
second-generation RISC processors, 22-24, 387-389
semaphores, 214
set associative caches, 41
SGI Challenge, 416, 419-423
shared cache lines, 200
shared nonuniform memory MIMD systems, 261
shared uniform memory MIMD systems, 260
shared-memory multiprocessors, 193-221
caches and, 196-200
data placement/movement, 200
example of, 216-219
multithreading techniques, 212-216
programming, 222-242
SMP hardware, 194-200
software concepts, 200-212
(see also parallelism)
shared-memory space, 103
sharp profiles, 108
side effects, 232
sign bits, 135
SIMD machines, 267-270
simulations, 245
simultaneous instruction execution, 14
superscalar processors, 23, 382, 389
single-call collective operations (MPI), 314
single-chip processors, 11
single instruction mode (i860 processor), 382
single instruction, multiple data (SIMD) machines, 267-270
single-precision calculations, 139
single stream benchmarks, 354-356
sinking/hoisting loop operations, 142
SISAL language, 278
SISD architecture, 257
size command, 123
SLALOM benchmark, 339
SMP (symmetric multiprocessing), 4
hardware for, 194-200
software concepts, 200-212
(see also shared-memory multiprocessors)
snooping bus traffic, 198, 263
software for shared-memory multiprocessors, 200-212
software-managed caches, 52
software-managed out-of-core solutions, 123, 166
SPARC architecture, 399-403
SPARC-1 processor, 380
SPARC64 processor (HAL), 390
spatial locality of reference, 37
SPEC benchmarks, 340-344
speculative instruction execution, 25-28
RS (Reservation Station), 393
speculative loads, 411-412
speedup, 224
Amdahl's Law and, 246
SPLASH benchmark, 335
SPMD/data decomposition computing model, 302
SRAM (static random access memory), 32
caches, 35-43
integrated with DRAM, 55
(see also memory)
statement function, 131
static iteration scheduling, 237
static random access memory (see SRAM)
sticky bits, 72
STREAM benchmark, 337
strength reduction, 94
subexpression elimination, 95, 141
subnormal numbers, 70
subroutine calls, 129-131, 152
subroutine profiles, 108-120
Amdahl's Law, 109, 246-248
basic block profilers, 120-122
profilers, 110-119
quantization errors, 119
sharp vs. flat, 108
Sun UltraSPARC processor, 389
superpipelined processors, 23, 387
superscalar processors, 23
Intel i860 processor, 382
Sun UltraSPARC processor, 389
swap area, 123
swaps, 105
symmetric multiprocessing (SMP), 4
hardware for, 194-200
software concepts, 200-212
(see also shared-memory multiprocessors)
synchronization of threads, 213-215
synchronous DRAM (SDRAM), 55
system time, 102
T
tags (cache memory addresses), 39
task decomposition, 287
task queue computing model, 302
tcov basic block profiler, 120
temporal locality of reference, 37
third-party code, benchmarking, 351-352
thrashing, 39
thread-level parallelism, 171
thread private area, 205
threads (see multithreading)
throughput benchmarks, 356
tightly coupled distributed-memory MIMD systems, 266-267
time-based simulations, 215
time command, 101, 103
timing, 101-107
Amdahl's Law, 109, 246-248
getting time information, 105
percent utilization, 103
portions of programs, 105
subroutine profiles, 108-120
TLB (translation lookaside buffer), 44
Top 500 Report, 260
topologies (MPI), 313
toroid connections, 254
TPC (transaction processing) benchmarks, 344-345
Trace systems (Multiflow), 383
transaction processing benchmarks, 344-345
translation lookaside buffer (TLB), 44
traps for floating-point operations, 75
trip count assertions, 231
trip counts, loop, 151
U
UltraSPARC processor (Sun), 389
UMA (uniform memory access), 193
uniform instruction length, 18-19
uniform memory access (UMA), 193
uniform-memory multiprocessors, 259
shared nonuniform memory MIMD systems, 261
shared uniform memory MIMD systems, 260
unit stride, 37, 158-160
unlimit command, 125
unrolling loops, 149-153
nested loops, 153
qualifying candidates for, 150-153
unsafe optimization flag, 228
unshared-memory space, 104
uptime command, 102
user-defined performance benchmarks, 332-339
HINT benchmark, 339
Linpack benchmark, 332-334
Linpack Highly Parallel Computing benchmark, 334-335
NAS benchmark, 336
SPLASH benchmark, 335
STREAM benchmark, 337
user mode, 101
user space multithreading, 205-210
user-space thread context switch, 208
V
variable length instructions, 18
variable renaming, 94
variable uses and definitions, 178
variables, subroutine calls and, 130
vector processors, 375-380
vector registers, 376
virtual machine, parallel (PVM), 300-312
virtual memory, 43-47, 123-125
VLIW architecture, 383
vmstat command, 124
W
wait(), 205
wait states, 33
wall clock time, 102
gettimeofday routine (example), 106
page faults and swaps, 105
widening the memory system, 48
workstations, 13
network of (see NOW)
wormhole routing, 253
writeback caching, 199