A paper on power optimization that points out the various sources of architectural power waste and the possibilities for optimization.
Dean M. Tullsen and John S. Seng
The following is a summary of the paper:
An idealized study of an aggressive, wide-issue superscalar processor (not a
comparison with statically scheduled processor architectures).
The categories of power waste:
1. Program waste: instructions that are fetched and executed but are not
necessary for correct execution, i.e. instructions that produce dead or
redundant values (e.g. silent stores).
2. Architectural waste: static sizing of cache and memory structures.
3. Speculation waste: instructions that are speculatively fetched and executed
but never committed.
Simulations were done on the SMTSIM simulator in single-thread mode.
The processor model is 8-fetch, 8-stage out-of-order, with 6 integer FUs, 3
FP units and 4 load/store units.
Program waste: run the trace, identify the redundant instructions and mark
them. Then rerun the trace without charging the processor the power cost of
these instructions (how is this cost quantified?). (But do these instructions
still affect the issue width or the scheduling of the useful instructions?)
The power required for additional resources such as the reuse buffer is not
accounted for.
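To make the two-pass idea concrete for myself, here is a minimal sketch in Python. It is not the paper's methodology: the `Instr` record, the per-op `ENERGY` table (arbitrary units) and the two waste rules (a store that rewrites the value already in memory, a register write that is overwritten before being read) are my own simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    op: str                      # "alu", "load" or "store"
    dest: Optional[str]          # destination register, None for stores
    srcs: Tuple[str, ...]        # source registers
    addr: Optional[int] = None   # memory address (stores only here)
    value: Optional[int] = None  # value written by a store

# Hypothetical per-instruction energy costs, arbitrary units.
ENERGY = {"alu": 1.0, "load": 2.0, "store": 2.0}

def mark_waste(trace):
    """Pass 1: flag silent stores and dead register writes."""
    wasted = [False] * len(trace)
    memory = {}
    unread_write = {}  # register -> index of its latest write, not yet read
    for i, ins in enumerate(trace):
        for reg in ins.srcs:
            unread_write.pop(reg, None)          # value was read, so it is live
        if ins.op == "store":
            if memory.get(ins.addr) == ins.value:
                wasted[i] = True                 # silent store
            memory[ins.addr] = ins.value
        if ins.dest is not None:
            if ins.dest in unread_write:
                wasted[unread_write[ins.dest]] = True  # overwritten unread: dead
            unread_write[ins.dest] = i
    return wasted

def energy(trace, wasted=None):
    """Pass 2: sum the energy, skipping instructions flagged as waste."""
    return sum(ENERGY[ins.op] for i, ins in enumerate(trace)
               if not (wasted and wasted[i]))
```

The difference `energy(trace) - energy(trace, mark_waste(trace))` is then the idealized saving. Like the paper's accounting, this sketch ignores the cost of whatever hardware would be needed to detect the waste.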
When only the dead instructions are considered, and not the instructions
that act as producers for them, little energy is wasted in the benchmark
runs (both FP and Int benchmarks) on conditional MOVE instructions.
When the producer instructions are considered as well, the power saving
increases significantly. For the Int SPEC benchmarks, the producers leading to
silent integer operations and silent loads contribute the most to the waste.
For FP, silent FP operations and silent loads are the largest sources of waste.
Producers of silent store instructions also contribute heavily to the waste.
=> In FP there are long sequences of instructions executed before the
value is stored, and if that store is silent the loss is greater.
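A rough sketch of how waste could be propagated back to producers, assuming a trace of `(dest, srcs)` pairs and an initial list of waste flags. The function name and the rule "a producer is wasted only if every consumer of its value is wasted" are my assumptions, not the paper's definition.

```python
def propagate_to_producers(trace, wasted):
    """trace: list of (dest, srcs) pairs in program order.
    wasted: initial per-instruction flags (e.g. silent stores).
    Returns new flags where a producer is also wasted when all of the
    instructions that consume its result are wasted."""
    wasted = list(wasted)
    consumers = [[] for _ in trace]
    last_writer = {}
    for i, (dest, srcs) in enumerate(trace):
        for reg in srcs:
            if reg in last_writer:
                consumers[last_writer[reg]].append(i)
        if dest is not None:
            last_writer[dest] = i
    # Walk backwards so consumers are final before their producers are judged.
    for i in range(len(trace) - 1, -1, -1):
        if consumers[i] and all(wasted[c] for c in consumers[i]):
            wasted[i] = True
    return wasted
```

Values with no consumer inside the trace are treated as live, which is the conservative choice. A long FP chain feeding a single silent store gets marked all the way back, which matches the observation above that such chains make the loss larger.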
Predictable instruction redundancy (value prediction): this includes the
class of instructions that operate on the same input values and potentially
produce the same output (mostly??). They are not exactly redundant because
they may still change the architectural state (the executions of these
instructions are separated by an instruction that updates the same
destination, so they are not redundant by the previous definition).
This set of instructions may therefore overlap with the register-silent ones.
Speculation waste: the integer benchmarks waste more on speculative
execution because they have more difficult-to-predict branches.
Eliminating speculation degrades performance, so not much can be done
here, but controlling the level of speculative execution provides some
opportunity (pipeline gating and SMT).
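For reference, pipeline gating roughly means: stop fetching when too many low-confidence branches are still unresolved. A toy sketch follows; the class name, the threshold value and the exact comparison are my assumptions, not taken from the paper.

```python
class PipelineGate:
    """Toy fetch gate driven by a count of unresolved low-confidence branches."""

    def __init__(self, threshold=2):
        self.threshold = threshold      # assumed gating threshold
        self.low_conf_in_flight = 0

    def on_branch_fetched(self, low_confidence):
        if low_confidence:
            self.low_conf_in_flight += 1

    def on_branch_resolved(self, low_confidence):
        if low_confidence:
            self.low_conf_in_flight -= 1

    def fetch_enabled(self):
        # Fetch stalls once the count reaches the threshold.
        return self.low_conf_in_flight < self.threshold
```

A real design also needs a confidence estimator to decide which predictions count as low-confidence. That part is left out here.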
Architectural waste:
Structures are sized suboptimally, larger than the performance of the
particular application requires. The structures studied are the data cache
and the instruction queues.
Most of the benchmarks put little pressure on the IC, and the large size
of the IC is mostly not warranted.
Instruction queues: the instructions executed mostly come from the top
of the queue (or at least from a small portion of the instruction queue) =>
a large queue size is mostly not required.
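One way to quantify the instruction queue observation, assuming a simulator can log the queue position (0 = head) of every instruction at issue time. The function and the coverage parameter are hypothetical names of mine.

```python
import math

def queue_depth_for_coverage(issue_positions, coverage=0.95):
    """Smallest queue depth that would have contained the given fraction
    of issued instructions, from 0-based positions measured at issue time."""
    ordered = sorted(issue_positions)
    if not ordered:
        return 0
    k = max(1, math.ceil(coverage * len(ordered)))
    return ordered[k - 1] + 1
```

If the depth needed for, say, 95% coverage is far below the physical queue size, the rest of the queue is mostly burning power. This ignores the performance lost on the instructions that would no longer fit.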
Removing waste:
The total energy waste for the Integer and FP benchmarks is not very
different, but the sources that contribute the most differ.
Conclusion:
The paper points out the sources of waste and the possibilities for power saving.
The questions that are important:
1. How to measure the energy cost of an instruction.
2. The energy impact of the extra resources needed for redundant and dead code elimination.
3. Value prediction looks increasingly complex with increasing data
widths (but the savings could potentially also be larger with these wide
computational structures).
Wednesday, July 27, 2005
Tuesday, July 26, 2005
Smart Memories
-->Smart Memories
Runtime: configuring the caches on the basis of working set and other program behavior.
Related questions:
How to determine the dynamic memory requirements of a program (statically or dynamically)?
How to characterize the ILP, TLP and DLP of a particular application [for a particular arch]?
Using the above feedback to gain better locality (spatial and temporal), changing the
cache config at runtime, and other uses for architecture/compiler/system software.
Mapping various CMP architectures to Smart Memories, e.g. WaveScalar, DataScalar.
Developing runtime system support for Smart Memories that exploits its strengths.
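A small sketch of the "measure the working set, then pick a cache config" loop, done offline on an address trace. The window length, block size and list of available configurations are made-up parameters, and none of this is a Smart Memories API.

```python
from collections import Counter, deque

def peak_working_set(addresses, window=1000, block_size=64):
    """Largest number of distinct cache blocks touched within any
    window of consecutive accesses."""
    counts = Counter()
    recent = deque()
    peak = 0
    for addr in addresses:
        block = addr // block_size
        recent.append(block)
        counts[block] += 1
        if len(recent) > window:
            old = recent.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        peak = max(peak, len(counts))
    return peak

def pick_cache_lines(addresses, configs=(64, 128, 256, 512), window=1000, block_size=64):
    """Smallest assumed configuration (in lines) that holds the peak working set."""
    need = peak_working_set(addresses, window, block_size)
    for lines in configs:
        if lines >= need:
            return lines
    return configs[-1]
```

A runtime system would do the same thing online with sampled addresses or hardware counters. How to do that cheaply is exactly the open question above.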
cache-conscious malloc
-->Intelligent cache-conscious malloc that allocates dynamic memory to maximise locality
*Chilimbi has tried this out; a summary of his work is below:
Making Pointer-based Data Structures Cache Conscious:
The root of the memory latency problem is poor reference locality: changing a program's data access
pattern or its data organization and layout can improve locality. Two semi-automatic
tools, ccmorph and ccmalloc, do cache-conscious reorganization and cache-conscious allocation respectively.
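To illustrate the allocation side, here is a toy allocator in the spirit of ccmalloc: the caller passes the address of an existing element that is likely to be accessed together with the new one, and the allocator tries to put both in the same cache block. The class, the 64-byte block size and the bump-pointer scheme are my assumptions for illustration, not the actual ccmalloc implementation.

```python
BLOCK_SIZE = 64  # assumed cache block size in bytes

class ToyCCAllocator:
    """Bump allocator over cache-block-sized chunks with a co-location hint."""

    def __init__(self):
        self.used = []  # bytes already handed out in each block

    def ccmalloc(self, size, near=None):
        """Return an address for `size` bytes, in the same cache block as
        the address `near` when that block still has room."""
        if size > BLOCK_SIZE:
            raise ValueError("toy allocator only handles sub-block objects")
        if near is not None:
            block = near // BLOCK_SIZE
            if block < len(self.used) and self.used[block] + size <= BLOCK_SIZE:
                return self._take(block, size)
        self.used.append(0)                 # otherwise start a fresh block
        return self._take(len(self.used) - 1, size)

    def _take(self, block, size):
        addr = block * BLOCK_SIZE + self.used[block]
        self.used[block] += size
        return addr
```

For a linked list, passing the previous node as `near` packs neighbouring nodes into the same block, so a traversal touches fewer cache lines.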
Another Idea:
-->Cache Conscious Compiler
- Instr re-ordering (keeping in mind locality)
- Cache re-arranging
- Virtual/physical memory: choosing addresses that might lead to better
locality.
* Don't know how fruitful this is or what has been done in this area.
large transistor chip study
A study of Single-Chip Processor/Cache organisations for Large Number
of transistors.
a) Having a large transistor budget does not necessarily lead to improved performance.
b) It is very important to have a single-cycle access data cache. As long as L1 is single-cycle, L2 may
have an access time of 4 cycles without sacrificing a great deal of performance.
c) 4-8 million transistors: a single-processor config is best.
8-16 million: a dual-processor config is best. In fact, with 16 million transistors a dual-processor config can achieve twice
the throughput of a single-processor system with the same transistor budget.
The 2-4 processor systems also have improved throughput, but the point of diminishing returns is
quickly reached.
vlsi architecture survey
1. VLSI Architectures: Past, Present and Future
For fine-grain machines to become a reality, obstacles in 3 areas have to be overcome:
a. Managing locality (for real programs with irregular, data-dependent,
time-varying data structures).
b. Reducing overhead (communication, synchronization etc.)
c. Smooth transition (see below)
"One approach to migration is to use Smart Memories chip as memory for a conventional
processor" -William J Dally.
HW/SW mechanisms are required to manage the placement and migration of data, to minimize
the use of scarce off-chip bandwidth and to avoid long on-chip latencies.
cache-aware compiling
-->Cache Conscious Compiler
- Instr re-ordering (keeping in mind locality)
- Cache re-arranging
- Virtual/physical memory: choosing addresses that might lead to better
locality.
* Don't know how fruitful this is or what has been done in this area.
Speculative Pre-computation
-->Extending the traditional von Neumann model with memory helper threads (for prefetching), added either
at compile time or at runtime. Runtime is more helpful.
A cache-conscious compiler produces a binary with attached memory helper threads.
Binary code augmentation (adapting existing binaries by introducing explicit threads in
the exe)
* Haven't dug into this much, but it seems to have been tried.
* The Intel guys (Speculative Precomputation) seem to have created a post-pass
compilation tool that: a) analyzes an existing single-threaded binary to generate prefetch threads,
b) identifies and embeds triggering points in the original binary code,
c) produces a new binary that has the prefetch threads attached, which can be spawned at
run-time.
When I found this, I was really frustrated at these guys for not leaving this untapped, but
then that's life ;). I guess that was one of the few specific areas I had identified for future work.
Syntax:
--> is a thought/idea/direction.
* is what has already been done in this direction.
While reading CA basics:
-->A slave processor runs ahead of the main program, looking only at dynamic
memory operations, so that the main program avoids cache misses (perfect cache).
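A toy model of the run-ahead idea: a helper touches the cache line the main thread will need `lookahead` accesses later, and we count only the main thread's misses. The LRU cache model, the parameter names and the assumption that a prefetch completes instantly are all mine; real helper threads have to fight latency and cache pollution.

```python
from collections import OrderedDict

def main_thread_misses(trace, cache_lines=8, lookahead=0):
    """Count main-thread misses on a fully associative LRU cache when a
    helper prefetches `lookahead` accesses ahead (0 disables the helper)."""
    cache = OrderedDict()

    def touch(line):
        hit = line in cache
        if hit:
            cache.move_to_end(line)          # refresh LRU position
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)    # evict the least recently used line
        return hit

    misses = 0
    for i, line in enumerate(trace):
        if lookahead and i + lookahead < len(trace):
            touch(trace[i + lookahead])      # helper runs ahead of the main thread
        if not touch(line):
            misses += 1
    return misses
```

On a streaming trace of distinct lines the main thread misses on every access without the helper, and only on the first `lookahead` accesses with it, provided the cache is large enough that prefetched lines are not evicted before use.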
cache stat tools
-->Metrics for software apps should include cache misses/page faults/disk accesses (cache performance).
* Already done; lots of tools are also available that produce such metrics automatically.
Subset: such metrics giving specific, directed info that helps the programmer redesign
'parts' of the code to improve cache performance. Don't know if this has been done.
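As an example of the kind of directed metric I mean, here is a tiny direct-mapped cache model that reports the miss rate of an address trace. The cache geometry (64 sets of 64-byte blocks) and the function name are arbitrary assumptions; real tools work from hardware counters or instrumented binaries.

```python
def miss_rate(addresses, num_sets=64, block_size=64):
    """Miss rate of a byte-address trace on a direct-mapped cache."""
    tags = [None] * num_sets
    misses = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_sets
        if tags[index] != block:
            tags[index] = block      # fill or replace the line
            misses += 1
    return misses / len(addresses)

# A 64x64 matrix of 8-byte elements in row-major layout, walked two ways.
N, ELEM = 64, 8
row_major = [(i * N + j) * ELEM for i in range(N) for j in range(N)]
col_major = [(i * N + j) * ELEM for j in range(N) for i in range(N)]
print(miss_rate(row_major), miss_rate(col_major))
```

With this geometry the row-wise walk misses once per 8 elements, while the column-wise walk misses on every access because rows that are 8 apart collide in the same set. That per-loop difference is the sort of hint that tells a programmer which part of the code to restructure.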
smartass IDE
-->Programmer productivity (optional feature)
Optimisations visible to the programmer: some guidelines while writing apps (a clever IDE) to
exploit the arch, ISA, memory design etc. Normally the trend is to make them invisible. This
shifts the problem more towards the source (the developer).
expose memory hierarchy
-->The memory hierarchy having a say in the design/implementation of programming models/system software/compilers.
E.g. in Smart Memories, if the cache is reconfigured from 2-way to direct mapped, pass
runtime options to the compiler so that it can optimise the program more efficiently.
Basically, expose some key arch features that have a significant impact on performance.
Okay, we sacrifice a little on abstraction, but if the arch is already implemented, we might as well exploit it.
* OS and compilers are normally highly optimised for their specific architectures.
on async
Asynchronous System-on-Chip Interconnect
A major overhead in MP design is communication and synchronization. Can this
be addressed by an async comm infrastructure? Globally asynchronous, locally synchronous (GALS) cores.
Have to read this if one needs to follow up on the GALS approach
to CMPs:
bainbridge thesis
Another issue:
MPI is inherently asynchronous. Can the MPI model of inter-chip (inter-quad) core
communication be leveraged by async solutions?