Compilers for Parallel Computers (CPC2006)

Program (List of Talks)

Funded by the Vicerreitoría de Investigación of the University of A Coruña and the Ministry of Education and Science of Spain

ESODYP: An Entirely Software and Dynamic Data Prefetcher based on a Markov Model

Jean Christophe Beyler and Philippe Clauss
Université de Strasbourg, LSIIT/ICPS, Illkirch-Graffenstaden, France

Many works have shown that data prefetching can be an efficient answer to the well-known memory bottleneck. Although several approaches ranging from dynamic hardware to static software mechanisms have been proposed [5, 6, 2, 3], to our knowledge no automatic, pure and standalone dynamic software data prefetching solution has yet been proposed. However such a portable approach can be quite worthy regarding the number of data-intensive applications having a dynamically changing memory behavior and regarding the ever growing variety of processor and memory architectures.

In this work, we propose an Entirely SOftware and DYnamic data Prefetcher (ESODYP) based on a Markov model. It runs in two main phases: a short training phase where a graph coding sequences of occurring memory accesses is constructed, and an optimizing phase where predicted addresses are prefetched while some informations in the graph are updated by continuously monitoring the program. The different schemes that are studied are patterns that can be found using the logical addresses or directly memory strides.

In some programs, the memory usage can be intensive but on a small amount of data. Knowing which accesses will come next will enable an optimizer to prefetch correctly, thus reducing the number of cache-misses. Since the number of different addresses accessed is relatively low, we can use them directly as keys for the pattern creation and monitoring.

However, in other programs, the accessed data is too large to be able to model the patterns this way. Some methods are based on hash tables and LRU systems to keep the memory usage of the optimizer low. Our technique consists in studying memory strides instead of directly using the data addresses. This gives us the possibility to handle data traversals that are done by link-lists or certain array traversals in for loops. Large link-lists are an interesting target for this optimization since the dynamic allocation makes static optimizations complicated and compilers rarely optimize anything because of the limited knowledge of the link-lists allocation and traversal behaviour. They leave such programs unoptimized since they are unsure at compile time whether they will speedup or slowdown the program.

A program memory behaviour can be described as a succession of phases. Each phase having its own behaviour. Our model continuously monitors the data behaviour to check the current graph's relevancy. This is done by checking if the current memory behaviour is still relatively similar to the one created during the training phase. If it is not, the graph is destroyed and a new one is created. This can be seen as a flush of the model to allow a new model to be created. Of course, if there are too many phase changes then our model will continuously be destroying the graphs created. Thus a threshold is implemented to allow small phase changes or irregularities to go unnoticed as long as we return to a phase corresponding to the one of the created model.

Significant speed-ups are obtained on an Itanium-2 processor [1, 4]using ESODYP for several benchmark programs that could not have been optimized statically. It is particularly shown that the induced software overhead can represent a minor execution time regarding performance improvements, due to a careful lowering of the optimizer computations and memory accesses.

This work also opens some interesting perspectives in the development of pure software dynamic optimizers.

References

[1] Reference Manual for Software Development and Optimization, chapter Optimal use of lfetch, pages 71Ð72. Intel Corporation, May 2004.

[2] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. ACM SIGPLAN Notices, 35(5):1Ð12, 2000.

[3] G. Desoli, N. Mateev, E. Duesterwald, P. Faraboschi, and J. A. Fisher. Deli: a new run-time control point. In 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 257Ð268, December 2002.

[4] Intel. Hyper-threading technology. http://www.intel.com/technology/hyperthread/.

[5] D. Joseph and D. Grunwald. Prefetching using markov predictors. In IEEE Transactions on Computers, Vol. 48, NO. 2, pages 121Ð 133, Febuary 1999.

[6] J. Lu, H. Chen, R. Fu, W. Hsu, B. Othmer, P. Yew, and D. Chen. The performance of runtime data cache prefetching in a dynamic optimization system. In 36th Annual IEEE/ACM International Symposium on Microarchitecture, December 2003.

Back to the Workshop Program

Please contact our webadmin with any comments or changes.