Tuesday, July 03, 2007

"stream programming on general purpose processors" : Paper summary

The study maps stream programs onto general-purpose processors. These programs are written in a streaming style, for example in StreamIt (developed at MIT, targeting the RAW architecture) or in the Stanford stream languages used for the Imagine stream processor.
The basic characteristic of such languages is to explicitly expose data parallelism and to decouple computation from memory access. I will later send you another update on the efforts taking place in parallel programming languages and compilers; StreamIt is just the tip of the iceberg.
StreamIt exposes the parallelism by mapping the program onto the components of the Stream Virtual Machine (SVM): a control processor, compute/kernel processors, DMA engines, and a memory hierarchy of global memory, local memory (the Stream Register File, or SRF), and local registers tied to the functional units. Computation happens in kernels operating on local memories, and memory is accessed through asynchronous bulk loads/stores, which is important because it converts a memory-latency problem into a memory-bandwidth problem.
The stream processor architecture is built to support this programming paradigm: it has a Stream Register File and then local register files for the execution units, exploiting the locality of data.
So the mapping fundamentally involves mapping the processor and memory components of the Stream Virtual Machine (SVM):
1. SVM local registers map to general-purpose registers.
2. SVM global memory maps to main memory.
3. Mapping the SRF to the cache, and mapping the three processors (which can be visualized as different threads of the program), is the challenging part, as follows:
The assumption when compiling for the SVM, and for stream architectures in general (it is also what most stream programs need), is large compiler-controlled local memories, either register files or low-latency/high-bandwidth memories, sitting between the functional units (where the actual work happens) and global memory, because of the large working sets.
The caches in a general-purpose processor are transparent to the compiler, so we need to align the data accesses with the cache organization. To map the SRF onto the cache, we can effectively pin a cache-sized range of memory addresses in the cache and carefully manage the memory accesses in the mapped program:
=> Size the SRF to comfortably fit in the cache, and control non-SRF memory references to ensure they do not trigger replacement of SRF lines.
=> High cache associativity and bigger caches close to the functional units help.
=> A good hardware prefetcher helps.
Since in the stream paradigm memory accesses are decoupled from computation, the loads/stores are explicit, via streamGather and streamScatter respectively; a sketch of the whole mapping follows.
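To make this concrete, here is a minimal C sketch of the mapping, assuming a cache-aligned buffer can stand in for a pinned SRF. The names streamGather/streamScatter come from the SVM interface described above, but the signatures and the synchronous memcpy bodies here are my stand-ins, not the real API (which issues asynchronous bulk transfers).

    #include <stdlib.h>
    #include <string.h>

    #define SRF_BYTES (256 * 1024)    /* sized to fit comfortably in L2 */

    /* Stand-ins for the SVM's bulk transfers: plain synchronous copies
     * here, asynchronous DMA-style transfers on a real SVM. */
    static void streamGather(float *srf, const float *global, size_t n) {
        memcpy(srf, global, n * sizeof(float));    /* bulk load into SRF */
    }
    static void streamScatter(float *global, const float *srf, size_t n) {
        memcpy(global, srf, n * sizeof(float));    /* bulk store from SRF */
    }

    /* A kernel touches only SRF-resident data, never global memory. */
    static void kernel_scale(float *srf, size_t n, float k) {
        for (size_t i = 0; i < n; i++)
            srf[i] *= k;
    }

    void process(float *global, size_t total) {
        /* Cache-aligned, cache-sized buffer acting as the pinned SRF. */
        float *srf;
        if (posix_memalign((void **)&srf, 64, SRF_BYTES))
            return;
        size_t chunk = SRF_BYTES / sizeof(float);

        for (size_t off = 0; off < total; off += chunk) {
            size_t n = (total - off < chunk) ? (total - off) : chunk;
            streamGather(srf, global + off, n);    /* load working set */
            kernel_scale(srf, n, 2.0f);            /* compute on SRF only */
            streamScatter(global + off, srf, n);   /* write results back */
        }
        free(srf);
    }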
Interfering memory accesses such as instruction fetches, stack accesses, and OS code can disrupt this and potentially induce conflicts/replacements in the cache.
=> The study suggests that this interference is minimal in scientific programs, which have a large number of local variables.
Evaluation of the SRF mapping on a Pentium 4:
Positives: the processor is good at streaming sequentially through memory, whereas random memory accesses are not optimized. The behavior is dominated by the effects of cache-line loading and the processor's hardware prefetcher. If the record size is increased while the field size stays constant, so that the fetched cache lines are mostly unused, the effective bandwidth drops (a sketch of this effect follows below).
For random accesses, the major bandwidth loss comes not from the absence of data locality but from TLB misses.
Also, stores are effectively read-modify-write operations on cache lines (the line is brought into the cache before being written), which halves the effective store bandwidth.
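A minimal micro-benchmark sketch of the record-size effect (the 64-byte record and line sizes are illustrative assumptions, not numbers from the paper). With a plain float array every byte of a fetched line is useful; with the padded record below, a whole 64-byte line is fetched for 4 useful bytes, so effective bandwidth drops by roughly the ratio of line size to field size.

    #include <stddef.h>

    #define RECORD_BYTES 64    /* grow this with the field size fixed */

    struct record {
        float field;                            /* the 4 bytes we actually use */
        char pad[RECORD_BYTES - sizeof(float)]; /* rest of the record, untouched */
    };

    float sum_fields(const struct record *r, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += r[i].field;    /* one useful field per fetched cache line */
        return s;
    }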
Takeaway for memory access:
Bulk-mode memory operations need to be optimized.
4. Mapping the computation kernels, and scheduling:
The study shows that scheduling a computation kernel overlapped with a memory access performs better than trying to overlap two memory accesses or two computation kernels, for the obvious reason: the two use mutually exclusive processor resources (ALUs vs. the memory system).
Scheduling effectively is one of the most challenging aspects; a double-buffering sketch follows below.
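Here is a minimal double-buffering sketch of that schedule: while the kernel runs on one buffer, the next chunk is loaded into the other. On a real SVM the load would be an asynchronous DMA overlapping the kernel; the synchronous memcpy here (and the chunk size) are stand-ins that just show the ping-pong structure. Assumes total is a multiple of CHUNK.

    #include <string.h>

    #define CHUNK 4096    /* floats per buffer; illustrative size */

    static void kernel(float *data, size_t n) {
        for (size_t i = 0; i < n; i++)
            data[i] = data[i] * data[i];    /* arbitrary ALU-bound work */
    }

    void pipeline(float *global, size_t total) {
        static float buf[2][CHUNK];
        size_t nchunks = total / CHUNK;
        int cur = 0;

        memcpy(buf[cur], global, sizeof buf[0]);    /* prologue load */
        for (size_t c = 0; c < nchunks; c++) {
            int nxt = 1 - cur;
            if (c + 1 < nchunks)    /* issue next load; done asynchronously,
                                       this would overlap the kernel below */
                memcpy(buf[nxt], global + (c + 1) * CHUNK, sizeof buf[0]);
            kernel(buf[cur], CHUNK);                             /* compute chunk c */
            memcpy(global + c * CHUNK, buf[cur], sizeof buf[0]); /* store it back */
            cur = nxt;
        }
    }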

Mapping to multiple hardware contexts (multiple cores / multiple threads in a core):
It helps to have a cache at some level shared among the processors (the nearer the better for latency, though not strictly necessary if we can get the working set into the nearest cache after loading it from global memory).
It is effective to have one context for memory accesses (to utilize the shared cache) and one for the computation and control context, as sketched below.
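A minimal pthreads sketch of that split, assuming the two threads land on hardware contexts that share a cache; the buffer size and the condition-variable handshake are my choices for illustration.

    #include <pthread.h>
    #include <string.h>

    #define CHUNK 4096
    #define TOTAL (1 << 20)

    static float global_data[TOTAL];    /* "global memory" */
    static float srf[CHUNK];            /* shared buffer; should live in the
                                           cache level shared by both contexts */
    static int ready = 0, done = 0;
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    /* Memory context: does nothing but bulk loads into the shared buffer. */
    static void *memory_thread(void *arg) {
        (void)arg;
        for (size_t off = 0; off + CHUNK <= TOTAL; off += CHUNK) {
            pthread_mutex_lock(&m);
            while (ready)                       /* wait until consumed */
                pthread_cond_wait(&cv, &m);
            memcpy(srf, global_data + off, sizeof srf);  /* bulk load */
            ready = 1;
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&m);
        }
        pthread_mutex_lock(&m);
        done = 1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    /* Compute context: runs the kernel on whatever the memory thread loaded. */
    static void *compute_thread(void *arg) {
        float *sum = arg;
        for (;;) {
            pthread_mutex_lock(&m);
            while (!ready && !done)
                pthread_cond_wait(&cv, &m);
            if (!ready && done) {               /* no more chunks coming */
                pthread_mutex_unlock(&m);
                break;
            }
            for (size_t i = 0; i < CHUNK; i++)  /* kernel work */
                *sum += srf[i];
            ready = 0;
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

With a single shared buffer the two threads serialize; combining this with the double-buffering sketch above is what would let the bulk load actually overlap the kernel.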

Inter-thread communication overhead: fast communication is essential for good performance, especially if the memory-access and computation threads are on different hardware contexts.
The thread abstraction of a hardware context provided by the OS is not very optimal here. It would be much more effective to support event-based communication directly, rather than polling through the shared-memory abstraction.
The SSE3 instructions MONITOR/MWAIT provide this: they put the hardware context into an effective sleep state, freeing its resources, rather than spin-looping. The latency overhead of the wake operation must also be kept low, or it could dominate the execution time. A conceptual sketch follows below.
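A conceptual sketch of that pattern using the SSE3 intrinsics. Note that MONITOR/MWAIT are privileged on most x86 processors, so as written this only executes at ring 0; it is meant to show the idea, and user code would normally fall back to a spin loop or an OS primitive.

    #include <pmmintrin.h>    /* SSE3 _mm_monitor / _mm_mwait intrinsics */

    volatile int flag;    /* the "event": producer writes it, consumer waits */

    void wait_for_work(void) {
        while (!flag) {
            _mm_monitor((const void *)&flag, 0, 0); /* arm the address monitor */
            if (!flag)                      /* re-check to avoid a lost wakeup */
                _mm_mwait(0, 0);    /* sleep the context until the line is
                                       written, freeing its resources */
        }
    }

    void signal_work(void) {
        flag = 1;    /* store to the monitored cache line wakes the waiter */
    }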

The paper then presents the speedups obtained by using the stream programming paradigm over normal sequential programming.
Examples that would not benefit from this: SPECint-style codes that use pointer-based data structures such as trees and linked lists, and matrix multiply using compressed sparse row storage.

Conclusion:
A large part of a general-purpose processor's chip area goes to extracting fine-grained parallelism rather than to the functional units that perform the actual computation.
Limited TLB mappings hinder asynchronous bulk memory transfers: a bulk transfer sweeps across many pages, so with limited TLB reach it keeps incurring TLB misses (consistent with the random-access result above).
So increasing the number of functional units and of TLB mappings would greatly increase streaming performance.