Tuesday, July 03, 2007

"stream programming on general purpose processors" : Paper summary

The study maps stream programs (written in a streaming style, using the StreamIt language that targets stream processor architectures such as Imagine and RAW) onto general-purpose processors.
The basic characteristic of this programming language is that it explicitly exposes data parallelism and decouples computation from memory access. I will later send another update on what efforts are taking place on parallel programming languages and compilers; StreamIt is just the tip of the iceberg.
StreamIt exposes the parallelism by mapping the program onto computation kernels, local memories, and asynchronous bulk memory loads/stores. The Stream Virtual Machine (SVM) consists of a control processor, compute/kernel processors, DMA engines, and memories: global memory, local memory (the Stream Register File, SRF), and local registers tied to the functional units. The asynchronous bulk loads/stores matter because they try to turn a memory latency problem into a memory bandwidth problem.
The stream processor architecture is built to support this programming paradigm: it has a Stream Register File plus local register files for the execution units, exploiting the locality of data.
So the mapping fundamentally involves mapping the processors and memory components of the Stream Virtual Machine (SVM):
1. SVM local registers map to general-purpose registers.
2. SVM global memory maps to main memory.
3. Mapping the SRF to the cache, and the three processors (which can be visualized as different threads of the program), is the challenging part:
The assumption when compiling for the SVM, and for stream architectures in general (and the need of most stream programs), is a large compiler-controlled local memory (either a register file or a low-latency/high-bandwidth memory) between the functional units (where the actual work happens) and global memory, because of the large working sets.
The caches in a GPP are transparent to the compiler, so we need to align the data accesses and use the cache organization carefully. To map the SRF onto the cache, we can effectively pin a cache-sized range of memory addresses in the cache and carefully manage the memory accesses in the mapped program:
=> Size the SRF to comfortably fit in the cache, and control non-SRF memory references so that they do not trigger replacements.
=> It helps to have high associativity and bigger caches close to the functional units.
=> It helps to have a good prefetcher.
Since in the stream paradigm memory accesses are decoupled from computation, the loads/stores are explicit, via streamGather and streamScatter respectively (a minimal sketch of this structure is given below).
Interfering memory accesses, such as instruction fetches, stack accesses and OS code, can disrupt this and potentially induce conflicts/replacements in the cache.
=> The study suggests that this interference is minimal in scientific programs, which have a large number of local variables.
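To make the decoupling concrete, here is a minimal sketch of how a mapped program might stage data through an SRF-sized buffer that is meant to stay resident in the cache. All names here are mine, not the paper's API; this is a toy, assuming a ~256 KB cache and a simple strided stream:

    #include <stdlib.h>

    #define SRF_BYTES (256 * 1024)                 /* sized below the cache so it stays resident */
    #define BLOCK     (SRF_BYTES / (2 * sizeof(float)))

    /* "stream gather": bulk-load a strided stream from global memory into the SRF buffer */
    static void stream_gather(float *srf, const float *global, size_t n, size_t stride)
    {
        for (size_t i = 0; i < n; i++)
            srf[i] = global[i * stride];
    }

    /* "stream scatter": bulk-store results back to global memory */
    static void stream_scatter(float *global, const float *srf, size_t n, size_t stride)
    {
        for (size_t i = 0; i < n; i++)
            global[i * stride] = srf[i];
    }

    /* kernel: pure computation, touches only the SRF buffers (locality stays in the cache) */
    static void kernel_scale(float *out, const float *in, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] * k;
    }

    void run_stream(float *global_in, float *global_out, size_t total, size_t stride)
    {
        float *srf_in  = malloc(BLOCK * sizeof(float));   /* the pinned, cache-sized "SRF" */
        float *srf_out = malloc(BLOCK * sizeof(float));

        for (size_t base = 0; base < total; base += BLOCK) {
            size_t n = (total - base < BLOCK) ? total - base : BLOCK;
            stream_gather(srf_in, global_in + base * stride, n, stride);   /* memory phase */
            kernel_scale(srf_out, srf_in, n, 2.0f);                        /* compute phase */
            stream_scatter(global_out + base * stride, srf_out, n, stride);
        }
        free(srf_in);
        free(srf_out);
    }

The point is only that the kernel never touches global memory directly; everything it reads and writes has already been pulled into (or will be pushed out of) the cache-resident buffers by the bulk operations.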
Evaluation of the SRF mapping on a Pentium 4:
Positives: it is good at streaming sequentially through memory, whereas random memory accesses are not optimized. The behavior is dominated by the effects of cache-line loading and the processor's hardware prefetcher. If the record size is increased while the field size stays constant, so that the cache-line fetches are not fully utilized, the effective bandwidth drops.
For random accesses, the major bandwidth loss comes not from the absence of data locality but from TLB misses.
Also, for stores to lines residing in the cache, the operation is a read-modify-write, which halves the effective bandwidth.
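The record-size effect is easy to picture: if each record grows but the kernel only reads one field per record, every cache line fetched carries mostly unused bytes. A rough illustration (my own toy example, not the paper's benchmark):

    #include <stddef.h>

    /* Only 'key' is read by the kernel; 'payload' just pads the record.
     * With 64-byte cache lines and RECORD_BYTES = 64, each line fetched delivers
     * only 8 useful bytes (the key), so effective bandwidth is roughly 8x lower
     * than streaming through a densely packed array of doubles. */
    #define RECORD_BYTES 64

    struct record {
        double key;
        char   payload[RECORD_BYTES - sizeof(double)];
    };

    double sum_keys(const struct record *r, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += r[i].key;        /* one field per record: wastes the rest of the line */
        return s;
    }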
Take-away for memory accesses:
Bulk-mode memory operations need to be optimized.
4. Mapping the computation kernels and scheduling:
The study shows that scheduling a computational kernel interleaved with memory accesses works better than trying to overlap two memory accesses or two computational kernels, for the obvious reason:
the benefit comes from using mutually exclusive processor resources (ALUs vs. the memory system).
Scheduling effectively is one of the most challenging aspects.
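A rough sketch of what such an interleaved schedule can look like on a GPP, where prefetch hints stand in for the asynchronous bulk load (there is no user-visible DMA engine); again my own toy code, not the paper's:

    /* Interleave memory work with ALU work: while the kernel chews on block i,
     * prefetch hints pull block i+1 toward the cache. */
    void run_pipelined(const float *in, float *out, size_t total, float k)
    {
        enum { BLK = 4096, LINE = 16 };   /* 16 floats = one 64-byte cache line */

        for (size_t base = 0; base < total; base += BLK) {
            size_t n = (total - base < BLK) ? total - base : BLK;

            /* "memory phase" for the next block, issued before the compute phase */
            if (base + BLK < total)
                for (size_t i = 0; i < BLK && base + BLK + i < total; i += LINE)
                    __builtin_prefetch(&in[base + BLK + i]);

            /* "compute phase" for the current block */
            for (size_t i = 0; i < n; i++)
                out[base + i] = in[base + i] * k;
        }
    }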

Mapping to multiple hardware contexts (multiple cores / multiple threads in a core):
It helps to have a cache at some level shared among the processors (the nearer the better for latency, but not strictly necessary if we can get the working set into the nearest cache after loading it from global memory).
It is effective to have one context for memory accesses (to utilize the shared cache) and one for the computation and control.

Inter-thread communication overhead: fast communication is essential for good performance, especially if the memory-access and computation threads are on different hardware contexts.
The thread abstraction of a hardware context provided by the OS is not very optimal. It would be much more effective to support event-based communication directly rather than going through the shared-memory abstraction.
SSE extensions like MONITOR/MWAIT provide this: they effectively put the hardware context into a sleep state, freeing its resources, instead of spinning. The latency of the wake-up operation must also be kept low, or it can come to dominate execution time.
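For concreteness, here is the two-context split with the cheapest portable handoff I can write, a flag the compute context spins on (with pause). The point of MONITOR/MWAIT is precisely that the waiting context could sleep instead of burning issue slots in a loop like this. Names and structure are mine:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <immintrin.h>           /* _mm_pause() */

    #define BLK 4096
    static float buf[BLK];
    static atomic_int buf_ready;     /* 0 = empty, 1 = filled by the memory context */

    /* Memory context: performs the bulk load into the shared (cached) buffer. */
    static void *memory_thread(void *arg)
    {
        const float *global = arg;
        for (int i = 0; i < BLK; i++)
            buf[i] = global[i];
        atomic_store_explicit(&buf_ready, 1, memory_order_release);  /* signal compute */
        return 0;
    }

    /* Compute context: waits for the buffer, then runs the kernel on resident data. */
    static float compute_wait_and_sum(void)
    {
        while (!atomic_load_explicit(&buf_ready, memory_order_acquire))
            _mm_pause();             /* spin; MWAIT would let this context sleep instead */
        float s = 0.0f;
        for (int i = 0; i < BLK; i++)
            s += buf[i];
        return s;
    }

    float run_two_contexts(float *global)
    {
        pthread_t t;
        pthread_create(&t, 0, memory_thread, global);
        float s = compute_wait_and_sum();
        pthread_join(t, 0);
        return s;
    }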

The paper then presents the speedups from using the streaming programming paradigm over normal sequential programming.
Examples which would not benefit from this: SPECint codes that use data structures such as trees/linked lists,
and matrix multiply using sparse row storage.

Conclusion:
A large part of a general-purpose processor's chip area goes into extracting fine-grained parallelism rather than into the functional units that perform the actual computation.
TLB mappings interfere with asynchronous bulk memory transfers (I do not understand this).
So increasing the number of functional units and TLB mappings would greatly help streaming performance.

Wednesday, July 27, 2005

Paper Summary: Architecture-Level Power Optimization - What Are the Limits?

A paper on power optimization that points out the various sources of architectural power waste and the possibilities for optimization!

Dean M. Tullsen and John S. Seng

Following is the summary of the paper:

An idealized study of an aggressive wide-issue superscalar processor (not a comparison with statically scheduled processor architectures).

The categories of power waste:
1. Program waste: instructions that are fetched and executed but are not necessary for correct execution, and instructions that produce dead or redundant values (e.g., silent stores).
2. Architectural waste: static sizing of cache and memory structures.
3. Speculation waste: instructions that are speculatively fetched and executed but never committed.

Simulations were done on the SMTSIM simulator in single-thread mode.
The processor model is 8-fetch, 8-stage, out-of-order, with 6 integer FUs, 3 FP and 4 load/store units.

Program waste: run the trace, identify the redundant instructions, and mark them. Then rerun the trace without charging the processor with the power cost of these instructions (how is this quantified?). (But do these instructions affect the issue width or the scheduling of the real instructions?) The power required for additional resources such as the reuse buffer is not accounted for.
First consider just the dead instructions, and not the instructions that act as producers for them.
Little energy is wasted on conditional MOVE instructions in the benchmark runs (both FP and Int benchmarks).
When the producer instructions are considered as well, the power saving increases significantly. For the integer SPECs, the producers leading to silent integer operations and silent loads contribute the most to the waste.

For FP, silent FP operations and silent loads are the biggest power waste;
producers of silent store instructions also contribute heavily to the waste.
=> In FP there are long sequences of instructions executed before a value is stored, so if that store is silent the loss is greater.
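For concreteness, a silent store is simply a store whose value already sits at the target location; a toy example of the pattern (and the kind of squashing check one could apply):

    /* A "silent" store: the written value equals what is already in memory, so the
     * store (and the chain of producers that only feed it) did no useful work. */
    void clamp_to_zero(float *a, int n)
    {
        for (int i = 0; i < n; i++) {
            float v = a[i] < 0.0f ? 0.0f : a[i];
            if (v != a[i])        /* squash: skip the store when it would be silent */
                a[i] = v;
            /* without the check, every non-negative element causes a silent store */
        }
    }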

Predictable instruction redundancy (value prediction): this includes the class of instructions that operate on the same input values and (mostly?) produce the same output. They are not exactly redundant, because they may still change the architectural state (the executions of such an instruction are separated by instructions that update the same destination, so they are not redundant by the previous definition); therefore this instruction set may overlap with the register-silent ones.
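A crude picture of what exploiting this looks like: a one-entry reuse table per operation, keyed on its inputs. This is entirely my own sketch of the idea in software, not the paper's mechanism:

    #include <math.h>

    /* One-entry "reuse buffer" for an expensive operation: if the inputs match the
     * last execution, reuse the cached result instead of recomputing it.  Hardware
     * value prediction/reuse does this per static instruction; with wider data the
     * comparators and tables get correspondingly more expensive. */
    double cached_pow(double x, double y)
    {
        static double last_x, last_y, last_r;
        static int valid;

        if (valid && x == last_x && y == last_y)
            return last_r;       /* redundant execution: same inputs, same output */

        last_x = x;
        last_y = y;
        last_r = pow(x, y);
        valid = 1;
        return last_r;
    }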

Speculation waste: the integer benchmarks see more waste from speculative execution because they have more hard-to-predict branches.
Eliminating speculation degrades performance, so not much can be done here, but controlling the level of speculation provides some opportunity (pipeline gating and SMT).

Architectural Waste:
Structures sized larger than the performance of the particular application requires. The structures studied are the data cache and the instruction queues.

Most of the benchmarks put little pressure on the IC, and its large size is mostly not warranted.
Instruction queues: the instructions executed come mostly from the top of the queue (or at least from a small portion of the instruction queue), so a large queue size is mostly not required.

Removing Waste:
The total energy waste for the integer and FP benchmarks is not very different, but the sources that contribute most differ.

Conclusion:
The paper points out the sources of, and possibilities for, power saving.
The questions that matter:
1. How to measure the energy cost of an instruction.
2. The energy impact of the extra resources needed for redundant and dead code elimination.
3. Value prediction looks increasingly complex with increasing data widths (but the savings could potentially also be larger with these wide computational structures).

Tuesday, July 26, 2005

Smart Memories



-->Smart Memories
Runtime: configuring the caches on the basis of the working set and other program behavior.
Related questions:
How to determine the dynamic memory requirements of a program (statically or dynamically)?
Characterizing the ILP, TLP, DLP of a particular application [for a particular arch]?
Using the above feedback to gain better locality (spatial and temporal), to change the cache config at runtime, and for other uses by the architecture/compiler/system software.


Mapping various CMP architectures to Smart Memories, e.g. WaveScalar, DataScalar.
Developing runtime system support for Smart Memories, exploiting its strengths.

cache-conscious malloc

-->An intelligent cache-conscious malloc, allocating dynamic memory so as to maximize locality.
* Chilimbi has tried this out; his work in summary:
Making Pointer-Based Data Structures Cache Conscious:
The root of the memory latency problem is poor reference locality: changing a program's data access pattern, or its data organization and layout, can improve locality. Created two semi-automatic tools, ccmorph and ccmalloc, that do cache-conscious reorganization and cache-conscious allocation respectively.
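From what I remember of ccmalloc, the programmer passes an extra "allocate this near that" hint. A highly simplified sketch of the idea, my own toy bump arena rather than Chilimbi's actual implementation:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define ARENA_BYTES (1 << 20)
    #define BLOCK_BYTES 64                 /* co-location granularity: one cache line */

    static unsigned char arena[ARENA_BYTES];
    static size_t arena_top;

    /* cc_malloc(size, hint): try to place the new object in the same cache block as
     * 'hint' (an object expected to be accessed at the same time); otherwise fall
     * back to bumping the arena pointer.  Real ccmalloc is far more sophisticated. */
    void *cc_malloc(size_t size, void *hint)
    {
        if (hint) {
            uintptr_t h         = (uintptr_t)hint;
            uintptr_t block_end = (h | (BLOCK_BYTES - 1)) + 1;   /* end of hint's cache block */
            uintptr_t top       = (uintptr_t)arena + arena_top;  /* next free byte */
            /* if the free pointer is still inside the hint's block and the object fits,
               the new allocation lands on the same cache line as the hint */
            if (top > h && top < block_end && block_end - top >= size) {
                arena_top += size;
                return (void *)top;
            }
        }
        /* fall back: bump-allocate from the next cache-block boundary */
        arena_top = (arena_top + BLOCK_BYTES - 1) & ~(size_t)(BLOCK_BYTES - 1);
        if (arena_top + size > ARENA_BYTES)
            return malloc(size);           /* arena exhausted: give up on placement */
        void *p = &arena[arena_top];
        arena_top += size;
        return p;
    }

Usage would look like n->next = cc_malloc(sizeof *n, n); asking for a list node's successor to land next to its predecessor.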



large transistor chip study

A study of single-chip processor/cache organizations for large numbers of transistors.
a) Having a large transistor budget does not necessarily lead to improved performance.
b) It is very important to have a single-cycle-access data cache. As long as L1 is single-cycle, L2 may have an access time of 4 cycles without sacrificing a great deal of performance.
c) 4-8 million transistors: a single-processor config is best.
8-16 million: dual processor is best; in fact with 16 million transistors a dual-proc config can achieve twice the throughput of a single-proc system with the same transistor budget.
The 2-4 processor systems also have improved throughput, but the point of diminishing returns is quickly reached.

vlsi architecture survey


1. VLSI Architectures: Past, Present and Future
For fine-grain machines to become reality one has to overcome obstacles in 3 areas:
a. Managing locality (for real programs with irregular, data-dependent, time-varying data structures).
b. Reducing overhead (communication, synchronization etc.).
c. Smooth transition (see below).
"One approach to migration is to use Smart Memories chip as memory for a conventional processor" - William J. Dally.
HW/SW mechanisms are required to manage the placement and migration of data, to minimize the use of scarce off-chip bandwidth and avoid long on-chip latencies.

cache-aware compiling

-->Cache Conscious Compiler
- Instruction re-ordering (keeping locality in mind)
- Cache re-arranging
- Virtual/physical memory: choosing addresses that might lead to better locality.
* Don't know how fruitful this is or what has been done in this area.
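The instruction/loop re-ordering point is the textbook one; e.g. walking a row-major array in the cache-friendly order versus the wrong order (nothing novel here, just to make the idea concrete):

    #define N 1024
    /* Row-major C array: the j-inner loop walks memory contiguously, so each cache
     * line fetched is fully used; swapping the loops would stride by N*sizeof(double)
     * per access and touch a new line almost every iteration. */
    double sum_rows(double a[N][N])
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)     /* cache-friendly: unit stride */
                s += a[i][j];
        return s;
    }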

Speculative Pre-computation

-->Extending the traditional von Neumann model with memory helper threads (for prefetching), added either at compile time or at runtime. Runtime is more helpful.
A cache-conscious compiler produces a binary with memory helper threads attached.
Binary code augmentation (adapting existing binaries by introducing explicit threads into the exe).
* Haven't dug into this much, but it seems to have been tried.
* The Intel guys (Speculative Precomputation) seem to have created a post-pass compilation tool that: a) analyzes an existing single-thread binary to generate pre-fetch threads; b) identifies and embeds triggering points in the original binary code; c) produces a new binary with the pre-fetch threads attached, which can be spawned at run-time.
When I found this I was really frustrated at these guys for not leaving this untapped, but then that's life ;). I guess that was one of the few specific areas where I had identified future work.
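To make the idea concrete, a hand-written version of the kind of prefetch thread such a post-pass tool generates might look roughly like this (my own sketch using pthreads and compiler prefetch hints, not Intel's actual tool output):

    #include <pthread.h>

    struct node { struct node *next; double payload[7]; };

    /* Helper ("p-slice"): runs down the same list doing only the address chasing,
     * touching each node so it is (hopefully) in cache when the main loop needs it.
     * A real implementation would also throttle it so it neither races too far
     * ahead nor falls behind the main thread. */
    static void *prefetch_thread(void *arg)
    {
        for (struct node *n = arg; n; n = n->next)
            __builtin_prefetch(n->next);
        return 0;
    }

    double sum_list(struct node *head)
    {
        pthread_t helper;
        pthread_create(&helper, 0, prefetch_thread, head);   /* spawn at the trigger point */

        double s = 0.0;
        for (struct node *n = head; n; n = n->next)          /* main (non-speculative) loop */
            s += n->payload[0];

        pthread_join(helper, 0);
        return s;
    }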


Syntax:
--> is a thought/idea/direction.
* is what has already been done in this direction.

while reading CA basics:
-->A slave processor runs ahead of the main program, looking only at dynamic memory operations, avoiding cache misses (a perfect cache).

cache stat tools

-->Metrics of software apps should include cache misses / page faults / disk accesses (cache performance).
* Already done; lots of tools are available that produce such metrics automatically.
Subset: such metrics giving specific, directed info helping the programmer redesign 'parts' of the code to improve cache performance. Don't know if this has been done.

smartass IDE

-->Programmer productivity (optional feature):
Make optimizations visible to the programmer: some guidelines while writing apps (a clever IDE) to exploit the arch, ISA, memory design etc. Normally the trend is to make them invisible; this shifts the problem more toward the source (the developer).


expose memory hierarchy

->The memory hierarchy having a say in the design/implementation of programming models, system software and compilers.
For example, in Smart Memories, if the cache is reconfigured from 2-way to direct-mapped, pass runtime options to the compiler so it can optimize the program more efficiently.
Basically, expose some key arch features which have a significant impact on performance. Okay, we sacrifice a little on abstraction, but if the arch is already implemented, we might as well exploit it.
* OS and compilers are normally highly optimized for their specific architectures.


on async

Asynchronous System-on-Chip Interconnect
A major overhead in MP design is communication and synchronization; can this be addressed by an async communication infrastructure? Globally async, locally sync (GALS) cores.
Have to read this if one needs to follow up on the GALS approach to CMPs:
Bainbridge's thesis.


Another issue:
MPI is inherently asynchronous; can the MPI model for inter-chip (inter-quad) core communication be leveraged by async solutions?

ole amigos

welcome to our technical blog playground.

Harsh
Mitesh
Ram