Saturday, September 24, 2011

locality of reference

In computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related storage locations being frequently accessed. There are two basic types of reference locality. Temporal locality refers to the reuse of specific data and/or resources within relatively small time durations. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, e.g., traversing the elements in a one-dimensional array.
Locality is merely one type of predictable behavior that occurs in computer systems. Systems which exhibit strong locality of reference are good candidates for performance optimization through techniques such as caching and instruction prefetching for memory, or advanced branch prediction in a processor's pipeline.


Locality of reference

The locality of reference, also known as the locality principle, is the phenomenon that the collection of data locations referenced in a short period of time in a running computer often forms relatively predictable clusters. Important special cases of locality are temporal, spatial, equidistant and branch locality.
  • Temporal locality: if at one point in time a particular memory location is referenced, then it is likely that the same location will be referenced again in the near future. There is temporal proximity between adjacent references to the same memory location. In this case it is common to make efforts to store a copy of the referenced data in special memory storage which can be accessed faster. Temporal locality is a special case of spatial locality, namely when the prospective location is identical to the present location.
  • Spatial locality: if a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future. In this case it is common to attempt to guess the size and shape of the area around the current reference for which it is worthwhile to prepare faster access.
  • Equidistant locality: halfway between spatial locality and branch locality. Consider a loop accessing locations in an equidistant pattern, i.e., the path through the spatial-temporal coordinate space forms a dotted line. In this case, a simple linear function can predict which location will be accessed in the near future.
  • Branch locality: occurs when there are only a few possible alternatives for the prospective part of the path through the spatial-temporal coordinate space. This is the case when an instruction loop has a simple structure, or when the possible outcomes of a small system of conditional branching instructions are restricted to a small set of possibilities. Branch locality is typically not spatial locality, since the few possibilities can be located far away from each other.
In order to benefit from the very frequently occurring temporal and spatial kinds of locality, most information storage systems are hierarchical; see below. Equidistant locality is usually supported by the diverse nontrivial increment instructions of processors. For branch locality, contemporary processors have sophisticated branch predictors, and on the basis of this prediction the memory manager of the processor tries to collect and preprocess the data of the plausible alternatives.

Reasons for locality

There are several reasons for locality. These reasons are either goals to achieve or circumstances to accept, depending on the aspect. The reasons below are not disjoint; in fact, the list below goes from the most general case to special cases.
  • Predictability: In fact, locality is merely one type of predictable behavior in computer systems. Luckily, many practical problems are decidable, and hence the corresponding program can behave predictably if it is well written.
  • Structure of the program: Locality often occurs because of the way in which computer programs are created to handle decidable problems. Generally, related data is stored in nearby locations in storage. One common pattern in computing involves processing several items one at a time. This means that if much processing is done on each item, that single item will be accessed more than once, leading to temporal locality of reference. Furthermore, moving to the next item implies that the next item will be read; hence spatial locality of reference, since memory locations are typically read in batches.
  • Linear data structures: Locality often occurs because code contains loops that tend to reference arrays or other data structures by indices. Sequential locality, a special case of spatial locality, occurs when relevant data elements are arranged and accessed linearly. For example, the simple traversal of elements in a one-dimensional array, from the base address to the highest element, would exploit the sequential locality of the array in memory.[1] The more general equidistant locality occurs when the linear traversal is over a longer area of adjacent data structures of identical structure and size, and when not the whole structures but only the mutually corresponding elements of the structures are accessed. This is the case when a matrix is represented as a sequential matrix of rows and the requirement is to access a single column of the matrix, as sketched below.
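
As a sketch of that last case, summing one column of an n-by-m row-major matrix touches elements a fixed stride apart (the function below is illustrative, not from the article):

/* Successive accesses are exactly m elements apart -- an
   equidistant (constant-stride) reference pattern. */
double column_sum(const double *matrix, size_t n, size_t m, size_t col)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += matrix[i * m + col];  /* stride of m between accesses */
    return sum;
}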

Use of locality in general

If, most of the time, a substantial portion of the references aggregates into clusters, and if the shape of this system of clusters can be well predicted, then it can be used for speed optimization. There are several ways to benefit from locality. The common techniques for optimization are:
  • to increase the locality of references. This is usually achieved on the software side.
  • to exploit the locality of references. This is usually achieved on the hardware side. Temporal and spatial locality can be capitalized on by hierarchical storage hardware. Equidistant locality can be exploited by the appropriately specialized instructions of processors; this possibility is not only the responsibility of the hardware, but also of the software, whose structure must be suitable for compiling into a binary program that calls the specialized instructions in question. Branch locality is a more elaborate possibility, hence more development effort is needed, but there is a much larger reserve for future exploration in this kind of locality than in all the remaining ones.

Use of spatial and temporal locality: hierarchical memory

Hierarchical memory is a hardware optimization that takes advantage of spatial and temporal locality and can be used on several levels of the memory hierarchy. Paging obviously benefits from temporal and spatial locality. A cache is a simple example of exploiting temporal locality: it is a specially designed, faster but smaller memory area, generally used to keep recently referenced data and data near recently referenced data, which can lead to potential performance increases. Data in a cache does not necessarily correspond to data that is spatially close in main memory; however, data elements are brought into cache one cache line at a time. This means that spatial locality is again important: if one element is referenced, a few neighboring elements will also be brought into cache. Finally, temporal locality plays a role at the lowest level, since results that are referenced very closely together can be kept in machine registers. Programming languages such as C allow the programmer to suggest that certain variables be kept in registers.
Data locality is a typical memory reference feature of regular programs (though many irregular memory access patterns exist). It makes the hierarchical memory layout profitable. In computers, memory is divided up into a hierarchy in order to speed up data accesses. The lower levels of the memory hierarchy tend to be slower, but larger. Thus, a program will achieve greater performance if it uses memory while it is cached in the upper levels of the memory hierarchy and avoids bringing other data into the upper levels of the hierarchy that will displace data that will be used shortly in the future. This is an ideal, and sometimes cannot be achieved.
Typical memory hierarchy (access times and cache sizes are approximations of typical values used as of 2006 for the purpose of discussion; actual values and actual numbers of levels in the hierarchy vary):
  • CPU registers (8-32 registers) – immediate access
  • L1 CPU caches (32 KiB to 128 KiB) – fast access
  • L2 CPU caches (128 KiB to 12 MiB) – slightly slower access
  • Main physical memory (RAM) (256 MiB to 24 GiB) – slow access
  • Disk (file system) (100 GiB to 2 TiB) – very slow
  • Remote Memory (such as other computers or the Internet) (Practically unlimited) – speed varies
Modern machines tend to read blocks of lower memory into the next level of the memory hierarchy. If this displaces used memory, the operating system tries to predict which data will be accessed least (or latest) and move it down the memory hierarchy. Prediction algorithms tend to be simple to reduce hardware complexity, though they are becoming somewhat more complicated.

Spatial and temporal locality example: matrix multiplication

A common example is matrix multiplication:
for i in 0..n
  for j in 0..m
    for k in 0..p
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
When dealing with large matrices, this algorithm tends to shuffle data around too much. Since memory is pulled up the hierarchy in consecutive address blocks, in the C programming language it is advantageous to refer to several memory addresses that share the same row (spatial locality). By keeping the row index fixed while the second (column) index changes rapidly, memory addresses in C and C++ are used more consecutively. One can see that since j affects the column reference of both matrices C and B, it should be iterated in the innermost loop (this fixes the row iterators, i and k, while j moves across each column in the row). This does not change the mathematical result, but it improves efficiency. By switching the looping order for j and k, the speedup in large matrix multiplications becomes dramatic. (In this case, 'large' means, approximately, more than 100,000 elements in each matrix, or enough addressable memory that the matrices will not fit in the L1 and L2 caches.)
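
In C-like notation (the same style as the blocked version below), the reordered loops look like this:

for (i = 0; i < n; i++)
  for (k = 0; k < p; k++)
    for (j = 0; j < m; j++)
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
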
Temporal locality can also be improved in the above example by using a technique called blocking. The larger matrix can be divided into evenly-sized sub-matrices, so that the smaller blocks can be referenced (multiplied) several times while in memory.
for (ii = 0; ii < SIZE; ii += BLOCK_SIZE)
  for (kk = 0; kk < SIZE; kk += BLOCK_SIZE)
    for (jj = 0; jj < SIZE; jj += BLOCK_SIZE)
      for (i = ii; i < ii + BLOCK_SIZE && i < SIZE; i++)
        for (k = kk; k < kk + BLOCK_SIZE && k < SIZE; k++)
          for (j = jj; j < jj + BLOCK_SIZE && j < SIZE; j++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
The temporal locality of the above solution is provided because a block can be used several times before moving on, so that it is moved in and out of memory less often. Spatial locality is improved because elements with consecutive memory addresses tend to be pulled up the memory hierarchy together.

Friday, September 23, 2011

MemTrack: Tracking Memory Allocations in C++

Introduction

This article describes MemTrack, a library which replaces the standard operator new with a special version that records information about each memory allocation. This information can be analyzed later for debugging purposes or for performance measurement. The article begins by discussing the use of macros to instrument memory allocation in C and the use of macros combined with "placement operator new" to accomplish the same thing in C++. Then it describes the alternative approach used in the MemTrack library. This alternative approach differs from the more conventional approach in two important ways: First, it does not rely on the implicit use of placement operator new syntax and, second, it can capture the type of each memory allocation.

Instrumenting Memory Allocations in C

The use of macros to instrument memory allocation in C is a well-known technique, and it's worth reviewing. Two standard C preprocessor macros are the key to this technique: the __FILE__ and __LINE__ macros. __FILE__ expands to a string literal containing the filename of the source file currently being processed, and __LINE__ always expands to the line number of the line that the __LINE__ macro appears on.
Typically, __FILE__ and __LINE__ will be embedded inside another macro, so that they can be applied transparently. The typical targets for this approach are malloc() and its sibling functions. For example:
#define malloc(size) custom_malloc(size, __FILE__, __LINE__)
Ideally, the instrumented memory allocation functions can be transparently substituted for the regular ones throughout an entire project by making changes to a single header file.
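
For illustration, one plausible shape for the hypothetical custom_malloc() is a thin logging wrapper around the real allocator (this sketch is mine, not part of any particular library):

#include <stdio.h>
#include <stdlib.h>

/* Log the call site, then forward to the real malloc(). Define this
 * before the malloc macro (or #undef malloc first) so the call below
 * is not itself rewritten by the macro. */
void *custom_malloc(size_t size, char const *filename, int lineNum)
{
    void *p = malloc(size);
    fprintf(stderr, "%s:%d allocated %lu bytes at %p\n",
            filename, lineNum, (unsigned long)size, p);
    return p;
}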

Instrumenting Memory Allocations in C++

A similar approach can be used in C++, but to understand it, it's first necessary to discuss "placement operator new". C++ makes it possible to define custom versions of operator new, and these custom versions may take extra arguments. The simplest example is the classic placement operator new:
void *operator new(size_t size, void *p) { return p; }
(Is placement new expensive as well? Since there is no malloc or similar call in the function body, I would assume it is not.)
 
This version of operator new can be used to construct a C++ object at an arbitrary place in memory, hence the name "placement operator new". This capability was thought to be so useful that a stock implementation of placement operator new with these semantics is provided by the standard C++ library. This operator new can be called like so:
Object *object = new(buffer) Object;
The first argument to any operator new function is the size of the memory block to allocate. This argument is always supplied implicitly by the compiler. If there are any extra arguments, then they are listed in parentheses immediately after the new keyword, using normal function call syntax.
There's no restriction on the number of parameters a custom implementation of operator new may take, nor is there any restriction on how those parameters may be used. Nevertheless, the moniker "placement operator new" is used broadly (and confusingly) to refer to all forms of operator new which take extra arguments, even those forms which do not match the classic placement semantics.
We can use the placement syntax to define a version of operator new equivalent to our hypothetical custom_malloc() function above:
void *operator new(size_t size, char const *filename, int lineNum)
{
    ...
}
Now that we know how to pass arguments to operator new, and since the preprocessor will not recursively expand macros, we can define a new macro equivalent to the malloc() macro above:
#define new    new(__FILE__, __LINE__)
As in the case for C, we'd like to be able to transparently switch between the normal memory allocation function and the instrumented one by changing a single include file. This goal may be more difficult to achieve in C++ than in C, however. One problem is that the new macro will conflict with any explicit use of placement syntax. For example, a use of classic placement new syntax will conflict with the new macro:
Object *object = new(buffer) Object;
will expand to something like:
Object *object = new("object.cpp", 131)(buffer) Object;
which is clearly in error. Another problem is that any declaration of a custom operator new will also conflict with the macro.

Storing the Extra Information

The notion that we can have a custom memory allocator raises the question of just how such an allocator should be implemented. If we are only interested in recording extra information about each memory allocation, then the easiest technique is to simply piggyback on top of the standard allocator. When the custom allocator is called with a request for memory, it simply asks the standard allocator for the memory, which the custom allocator then returns to the original caller.
It's easy to store the extra information in the piggyback scheme. Assume we need to store m bytes of extra data for every allocation. If the original caller requests n bytes of storage, then the custom allocator will simply request n + m bytes of storage from the standard allocator. The custom allocator will then retain the first m bytes of storage for its own use and return a pointer to the original caller that points m bytes into the new memory block. Needless to say, this kind of allocator needs to be paired with a de-allocator that is aware of this offset. The piggyback scheme is the approach used by MemTrack, and it is sketched below.
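
A minimal sketch of the piggyback scheme (the names are mine; real code, including MemTrack, must also keep the user pointer properly aligned):

#include <stdlib.h>

/* Hypothetical per-allocation bookkeeping: the "m bytes" of extra data. */
struct Extra { char const *filename; int lineNum; };

void *piggyback_alloc(size_t n)
{
    /* Request n + m bytes and keep the first m for bookkeeping. */
    char *raw = (char *)malloc(sizeof(struct Extra) + n);
    if (raw == NULL) return NULL;
    return raw + sizeof(struct Extra);  /* caller's pointer is m bytes in */
}

void piggyback_free(void *p)
{
    if (p == NULL) return;
    /* Step back m bytes to recover the address malloc() returned. */
    free((char *)p - sizeof(struct Extra));
}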

An Alternative Approach to Instrumenting Memory Allocation in C++

Overview

MemTrack uses an alternative approach to instrumenting memory allocation in C++. It separates the allocation of a block of memory from the storage of the descriptive information about that memory; the information is applied later, in a process I refer to as "stamping". Since the stamping step is embedded inside a new macro, this approach is as easy to use as the placement new approach.
There's more work involved in making the stamping process work. Since the new operator does not require parentheses for its type argument, we can't write a new macro that captures the type argument, which means we can't simply pass the result of the new expression into a stamp() function that returns its argument after stamping it. However, if we make our stamping function an overloaded operator, we can accomplish the same thing: we just juxtapose the stamping operator with the new expression, with no need to capture the type in the new macro.
Astute readers will have already noticed that the stamping function must not only be an operator but that it must also be a template function. This is because the stamping operator must be capable of returning exactly the same type as the new expression. This is a lot of work simply to avoid the use of placement new. However, the use of the template introduces a new capability that's simply not available with the placement new approach: we can capture the type of the new expression inside the stamping operator and record it along with the other information we care about. It's exactly this capability that makes the stamping approach worth exploring.

The new Macro

The new macro and its component parts are the heart of MemTrack, so let's examine them in detail. These examples were drawn from the "MemTrack.h" header file, although they've been simplified somewhat for the sake of clarity.
Let's start with the new macro itself:
#define new    MemStamp(__FILE__, __LINE__) * new
This will expand the following statement
Object *object = new Object;
into something like:
Object *object = MemStamp("object.cpp", 131) * new Object;
The MemStamp temporary combines the filename and line number into a single entity, which then serves as the first argument to operator*. operator* is then called with both the stamp and the newly allocated object as arguments, and it returns the new object after stamping it.
The MemStamp class looks like this:
class MemStamp
{
    public:  // member variables
        char const * const filename;
        int const lineNum;
    public:  // construction/destruction
        MemStamp(char const *filename, int lineNum)
            : filename(filename), lineNum(lineNum) { }
        ~MemStamp() { }
};
The operator* template is quite simple, and the only thing really of note is the call to typeid(T).name() in the call to the internal stamping function.
template <class T> inline T *operator*(const MemStamp &stamp, T *p)
{
    TrackStamp(p, stamp, typeid(T).name());
    return p;
}
Since the first argument to operator* is a MemStamp value, it's very unlikely that this definition of operator* will conflict with any ordinary use of the operator.

Public MemTrack Functions

The previous section describes the new macro that MemTrack uses to automatically and transparently (as much as is possible) instrument source code. This is only half the story, though. The other half of the story is the custom allocator, the stamping function, and the mechanism for later retrieving the information that we've stamped on all those memory blocks.
The public interface to MemTrack consists of two sets of functions. The first set of functions are the Track* functions:
void *TrackMalloc(size_t size);
void TrackFree(void *p);
void TrackStamp(void *p, const MemStamp &stamp, char const *typeName);
void TrackDumpBlocks();
void TrackListMemoryUsage();
Normally there will be no need to call TrackMalloc() or TrackFree() directly, but they're available if you can think of a good reason to use them. TrackStamp() is needed for the implementation of the stamping operator, and TrackDumpBlocks() and TrackListMemoryUsage() provide ways to dump information about the currently allocated memory blocks.
The second set of functions are the overrides for global new and delete operators.
void *operator new(size_t size) { ... }
void operator delete(void *p) { ... }
void *operator new[](size_t size) { ... }
void operator delete[](void *p) { ... }
There are no prototypes for these functions in "MemTrack.h" since the signatures of these functions are automatically known to the C++ compiler. However, the override definitions appear in "MemTrack.cpp" and they will override the default versions supplied by the compiler. These functions basically just call TrackMalloc() and TrackFree().

Implementation Notes

TrackMalloc() uses the piggyback approach described earlier in the article. The extra information stored for each allocated block of memory is referred to as the prolog, and currently the prolog is broken into two parts, the "block header", and the "signature". Block headers serve two important functions. First, they store all the extra information for each memory block, such as the filename and the type name. Second, they store the forward and backward pointers of the doubly-linked list that links all currently allocated memory blocks together.
The signature consists of a specific 64-bit pattern that identifies a memory block as one that was allocated by TrackMalloc() and can safely be "stamped". If a pointer points to a block of memory returned from TrackMalloc(), then the 8 bytes before that pointer will consist of the signature. This approach is rather crude, and it introduces some risks when testing pointers to memory blocks allocated by other memory allocators. The first problem is that there is a chance of a false positive on a signature test, which would then result in a memory overwrite error when the memory block is stamped. The second problem is that the signature test could attempt to read memory not assigned to the process, resulting in a memory access fault. Neither of these problems should occur when the MemTrack allocator is the only one being used in a program, however.

MemTrack in Action

I tested MemTrack using FH-reader, a library I wrote several years ago for reading Macromedia FreeHand files. FH-reader made a good platform for testing the type capture capabilities of MemTrack since it represents FreeHand files in memory using objects of more than 70 different classes.
TrackDumpBlocks() listed 80701 blocks after one test run of the FH-reader test driver. This quantity of information is a little hard to interpret usefully. TrackListMemoryUsage() lists total memory usage by type, from highest to lowest. Table 1 shows the first 20 items from a test run of FH-reader.

Table 1

-----------------------
Memory Usage Statistics
-----------------------

allocated type                        blocks          bytes  
--------------                        ------          -----  
struct FHRDocPath::IndexedRec          11031  13.7% 2756600  45.8%
class FHRDocPath                       10734  13.3%  772848  12.8%
class FHRDocElemPropLst                13132  16.3%  420224   7.0%
struct FHRDocVDict::IndexedRec          3595   4.5%  370336   6.2%
struct FHRDocMDict::IndexedRec         13368  16.6%  208200   3.5%
class FHRDocObject *                      36   0.0%  172836   2.9%
struct FHRDocData::IndexedRec            890   1.1%  159880   2.7%
struct FHRDocLineTable::IndexedRec       408   0.5%  152824   2.5%
struct FHRDocMList::IndexedRec          2656   3.3%  119168   2.0%
class FHRDocMList                       1964   2.4%   62848   1.0%
class FHRDocVMpObj                      2096   2.6%   58688   1.0%
class FHRDocProcessColor                1259   1.6%   50360   0.8%
struct FHRDocTextBlok::IndexedRec        680   0.8%   48756   0.8%
class FHRDocUString                     1800   2.2%   43200   0.7%
class FHRDocGroup                        684   0.8%   41040   0.7%
class FHRDocObject * (__cdecl*)(void)     36   0.0%   39928   0.7%
class FHRDocXform                        516   0.6%   35088   0.6%
class FHRDocTextColumn                   403   0.5%   33852   0.6%
class FHRDocTString                      407   0.5%   29304   0.5%
struct FHRDocUString::IndexedRec        1800   2.2%   27904   0.5%

Caveats

Besides the potential for conflict with the definitions of other operator new functions, there is another problem worth mentioning. C++ allows classes to be declared inside functions, but it does not allow templates to be specialized on these classes. This restriction makes the MemTrack new macro unusable for local classes.
MemTrack also lacks a lot of features that a fully functional library would provide, such as logic to check for memory overwrites and double frees. A fully functional library should also be able to make claims about portability that MemTrack currently can't since it has only been tested on one compiler on one platform (Visual C++.Net on Windows 2000). The signature mechanism has some serious deficiencies, which could make MemTrack unusable for some projects. Finally, using any library that relies on a global macro to redefine a language keyword is potentially asking for trouble. However, MemTrack has been used successfully with the FH-reader library, providing a successful demonstration of the memory stamping and type capture techniques described in this article.

Room for Improvement

There is lots of room for improvement in the MemTrack library. Here's an incomplete list of ways the library might be improved, in no particular order:
  • Detection of double frees.
  • Detection of memory overwrites.
  • A safer alternative to the signature mechanism.
  • Support for the malloc(), calloc(), realloc(), and free() functions.
  • Conditional compilation directives should be added so MemTrack can be turned on and off with a single #define.
  • A way to handle the instantiation of local classes.
  • More testing on a variety of platforms, operating systems, and compilers, and testing in a variety of existing projects. 

Appendix A: MemTrack.h

/*
Copyright (c) 2002, 2008 Curtis Bartley
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the
distribution.

- Neither the name of Curtis Bartley nor the names of any other
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#ifndef MemTrack_H_
#define MemTrack_H_

#include <typeinfo>

namespace MemTrack
{
    /* ---------------------------------------- class MemStamp */

    class MemStamp
    {
        public:        // member variables
            char const * const filename;
            int const lineNum;
        public:        // construction/destruction
            MemStamp(char const *filename, int lineNum)
                : filename(filename), lineNum(lineNum) { }
            ~MemStamp() { }
    };

    /* ---------------------------------------- memory allocation and stamping prototypes */

    void *TrackMalloc(size_t size);
    void TrackFree(void *p);
    void TrackStamp(void *p, const MemStamp &stamp, char const *typeName);
    void TrackDumpBlocks();
    void TrackListMemoryUsage();

    /* ---------------------------------------- operator * (MemStamp, ptr) */

    template <class T> inline T *operator*(const MemStamp &stamp, T *p)
    {
        TrackStamp(p, stamp, typeid(T).name());
        return p;
    }

}    // namespace MemTrack

/* ---------------------------------------- new macro */

#define MEMTRACK_NEW MemTrack::MemStamp(__FILE__, __LINE__) * new
#define new MEMTRACK_NEW

#endif    // MemTrack_H_

Appendix B: MemTrack.cpp

/*
Copyright (c) 2002, 2008 Curtis Bartley
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the
distribution.

- Neither the name of Curtis Bartley nor the names of any other
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* ---------------------------------------- includes */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>    // for strcmp(), used when sorting blocks by type
#include <algorithm>
#include <new>

#include "MemTrack.h"
#undef new    // IMPORTANT!

/* ------------------------------------------------------------ */
/* -------------------- namespace MemTrack -------------------- */
/* ------------------------------------------------------------ */

namespace MemTrack
{

    /* ------------------------------------------------------------ */
    /* --------------------- class BlockHeader -------------------- */
    /* ------------------------------------------------------------ */

    class BlockHeader
    {
        private:    // static member variables
            static BlockHeader *ourFirstNode;
    
        private:    // member variables
            BlockHeader *myPrevNode;
            BlockHeader *myNextNode;
            size_t myRequestedSize;
            char const *myFilename;
            int myLineNum;
            char const *myTypeName;

        public:     // members
            BlockHeader(size_t requestedSize);
            ~BlockHeader();
        
            size_t GetRequestedSize() const { return myRequestedSize; }
            char const *GetFilename() const { return myFilename; }
            int GetLineNum() const { return myLineNum; }
            char const *GetTypeName() const { return myTypeName; }
        
            void Stamp(char const *filename, int lineNum, char const *typeName);
        
            static void AddNode(BlockHeader *node);
            static void RemoveNode(BlockHeader *node);
            static size_t CountBlocks();
            static void GetBlocks(BlockHeader **blockHeaderPP);
            static bool TypeGreaterThan(BlockHeader *header1, BlockHeader *header2);
    };

    /* ---------------------------------------- BlockHeader static member variables */

    BlockHeader *BlockHeader::ourFirstNode = NULL;

    /* ---------------------------------------- BlockHeader constructor */

    BlockHeader::BlockHeader(size_t requestedSize)
    {
        myPrevNode = NULL;
        myNextNode = NULL;
        myRequestedSize = requestedSize;
        myFilename = "[unknown]";
        myLineNum = 0;
        myTypeName = "[unknown]";
    }

    /* ---------------------------------------- BlockHeader destructor */

    BlockHeader::~BlockHeader()
    {
    }
        
    /* ---------------------------------------- BlockHeader Stamp */

    void BlockHeader::Stamp(char const *filename, int lineNum, char const *typeName)
    {
        myFilename = filename;
        myLineNum = lineNum;
        myTypeName = typeName;
    }

    /* ---------------------------------------- BlockHeader AddNode */

    void BlockHeader::AddNode(BlockHeader *node)
    {
        assert(node != NULL);
        assert(node->myPrevNode == NULL);
        assert(node->myNextNode == NULL);

        // If we have at least one node in the list ...        
        if (ourFirstNode != NULL)
        {
            // ... make the new node the first node's predecessor.
            assert(ourFirstNode->myPrevNode == NULL);
            ourFirstNode->myPrevNode = node;
        }

        // Make the first node the new node's successor.
        node->myNextNode = ourFirstNode;

        // Make the new node the first node.
        ourFirstNode = node;
    }

    /* ---------------------------------------- BlockHeader RemoveNode */

    void BlockHeader::RemoveNode(BlockHeader *node)
    {
        assert(node != NULL);
        assert(ourFirstNode != NULL);

        // If target node is the first node in the list...
        if (ourFirstNode == node)
        {
            // ... make the target node's successor the first node.
            assert(ourFirstNode->myPrevNode == NULL);
            ourFirstNode = node->myNextNode;
        }
        
        // Link target node's predecessor, if any, to its successor.
        if (node->myPrevNode != NULL)
        {
            node->myPrevNode->myNextNode = node->myNextNode;
        }
        
        // Link target node's successor, if any, to its predecessor.
        if (node->myNextNode != NULL)
        {
            node->myNextNode->myPrevNode = node->myPrevNode;
        }

        // Clear target node's previous and next pointers.
        node->myPrevNode = NULL;
        node->myNextNode = NULL;
    }

    /* ---------------------------------------- BlockHeader CountBlocks */

    size_t BlockHeader::CountBlocks()
    {
        size_t count = 0;
        BlockHeader *currNode = ourFirstNode;
        while (currNode != NULL)
        {
            count++;
            currNode = currNode->myNextNode;
        }
        return count;
    }

    /* ---------------------------------------- BlockHeader GetBlocks */

    void BlockHeader::GetBlocks(BlockHeader **blockHeaderPP)
    {
        BlockHeader *currNode = ourFirstNode;
        while (currNode != NULL)
        {
            *blockHeaderPP = currNode;
            blockHeaderPP++;
            currNode = currNode->myNextNode;
        }
    }

    /* ---------------------------------------- BlockHeader TypeGreaterThan */

    bool BlockHeader::TypeGreaterThan(BlockHeader *header1, BlockHeader *header2)
    {
        return (strcmp(header1->myTypeName, header2->myTypeName) > 0);
    }

    /* ------------------------------------------------------------ */
    /* ---------------------- class Signature --------------------- */
    /* ------------------------------------------------------------ */

    class Signature
    {
        private:    // constants
            static const unsigned int SIGNATURE1 = 0xCAFEBABE;
            static const unsigned int SIGNATURE2 = 0xFACEFACE;
        
        private:    // member variables
            unsigned int mySignature1;
            unsigned int mySignature2;
            
        public:        // construction/destruction
            Signature() : mySignature1(SIGNATURE1), mySignature2(SIGNATURE2) {};
            ~Signature() { mySignature1 = 0; mySignature2 = 0; }
            
        public:        // static member functions
            static bool IsValidSignature(const Signature *pProspectiveSignature)
            {
                try
                {
                    if (pProspectiveSignature->mySignature1 != SIGNATURE1) return false;
                    if (pProspectiveSignature->mySignature2 != SIGNATURE2) return false;
                    return true;
                }
                catch (...)
                {
                    return false;
                }
            }
    };

    /* ------------------------------------------------------------ */
    /* -------------------- address conversion -------------------- */
    /* ------------------------------------------------------------ */

    /* We divide the memory blocks we allocate into two "chunks", the
     * "prolog chunk" where we store information about the allocation,
     * and the "user chunk" which we return to the caller to use.
     */

    /* ---------------------------------------- alignment */

    const size_t ALIGNMENT = 4;

    /* If "value" (a memory size or offset) falls on an alignment boundary,
     * then just return it.  Otherwise return the smallest number larger
     * than "value" that falls on an alignment boundary.
     */    

    #define PAD_TO_ALIGNMENT_BOUNDARY(value) \
        ((value) + ((ALIGNMENT - ((value) % ALIGNMENT)) % ALIGNMENT))

    /* ---------------------------------------- chunk structs */
    
    /* We declare incomplete structures for each chunk, just to 
     * provide type safety.
     */

    struct PrologChunk;
    struct UserChunk;

    /* ---------------------------------------- chunk sizes and offsets */

    const size_t SIZE_BlockHeader = PAD_TO_ALIGNMENT_BOUNDARY(sizeof(BlockHeader));
    const size_t SIZE_Signature = PAD_TO_ALIGNMENT_BOUNDARY(sizeof(Signature));

    const size_t OFFSET_BlockHeader = 0;
    const size_t OFFSET_Signature = OFFSET_BlockHeader + SIZE_BlockHeader;
    const size_t OFFSET_UserChunk = OFFSET_Signature + SIZE_Signature;
    
    const size_t SIZE_PrologChunk = OFFSET_UserChunk;

    /* ---------------------------------------- GetUserAddress */

    static UserChunk *GetUserAddress(PrologChunk *pProlog)
    {
        char *pchProlog = reinterpret_cast<char *>(pProlog);
        char *pchUser = pchProlog + OFFSET_UserChunk;
        UserChunk *pUser = reinterpret_cast<UserChunk *>(pchUser);
        return pUser;
    }

    /* ---------------------------------------- GetPrologAddress */

    static PrologChunk *GetPrologAddress(UserChunk *pUser)
    {
        char *pchUser = reinterpret_cast<char *>(pUser);
        char *pchProlog = pchUser - OFFSET_UserChunk;
        PrologChunk *pProlog = reinterpret_cast<PrologChunk *>(pchProlog);
        return pProlog;
    }

    /* ---------------------------------------- GetHeaderAddress */

    static BlockHeader *GetHeaderAddress(PrologChunk *pProlog)
    {
        char *pchProlog = reinterpret_cast<char *>(pProlog);
        char *pchHeader = pchProlog + OFFSET_BlockHeader;
        BlockHeader *pHeader = reinterpret_cast<BlockHeader *>(pchHeader);
        return pHeader;
    }

    /* ---------------------------------------- GetSignatureAddress */

    static Signature *GetSignatureAddress(PrologChunk *pProlog)
    {
        char *pchProlog = reinterpret_cast<char *>(pProlog);
        char *pchSignature = pchProlog + OFFSET_Signature;
        Signature *pSignature = reinterpret_cast<Signature *>(pchSignature);
        return pSignature;
    }

    /* ------------------------------------------------------------ */
    /* -------------- memory allocation and stamping -------------- */
    /* ------------------------------------------------------------ */

    /* ---------------------------------------- TrackMalloc */
    
    void *TrackMalloc(size_t size)
    {
        // Allocate the memory, including space for the prolog.
        PrologChunk *pProlog = (PrologChunk *)malloc(SIZE_PrologChunk + size);
        
        // If the allocation failed, then return NULL.
        if (pProlog == NULL) return NULL;
        
        // Use placement new to construct the block header in place.
        BlockHeader *pBlockHeader = new (pProlog) BlockHeader(size);
        
        // Link the block header into the list of extant block headers.
        BlockHeader::AddNode(pBlockHeader);
        
        // Use placement new to construct the signature in place.
        Signature *pSignature = new (GetSignatureAddress(pProlog)) Signature;
        
        // Get the offset to the user chunk and return it.
        UserChunk *pUser = GetUserAddress(pProlog);
        
        return pUser;
    }

    /* ---------------------------------------- TrackFree */
    
    void TrackFree(void *p)
    {
        // It's perfectly valid for "p" to be null; return if it is.
        if (p == NULL) return;
    
        // Get the prolog address for this memory block.
        UserChunk *pUser = reinterpret_cast<UserChunk *>(p);    
        PrologChunk *pProlog = GetPrologAddress(pUser);
       
        // Check the signature, and if it's invalid, return immediately.
        Signature *pSignature = GetSignatureAddress(pProlog);
        if (!Signature::IsValidSignature(pSignature)) return;
        
        // Destroy the signature.
        pSignature->~Signature();
        pSignature = NULL;

        // Unlink the block header from the list and destroy it.
        BlockHeader *pBlockHeader = GetHeaderAddress(pProlog);
        BlockHeader::RemoveNode(pBlockHeader);
        pBlockHeader->~BlockHeader();
        pBlockHeader = NULL;

        // Free the memory block.    
        free(pProlog);
    }

    /* ---------------------------------------- TrackStamp */

    void TrackStamp(void *p, const MemStamp &stamp, char const *typeName)
    {
        // Get the header and signature address for this pointer.
        UserChunk *pUser = reinterpret_cast<UserChunk *>(p);
        PrologChunk *pProlog = GetPrologAddress(pUser);
        BlockHeader *pHeader = GetHeaderAddress(pProlog);
        Signature *pSignature = GetSignatureAddress(pProlog);

        // If the signature is not valid, then return immediately.
        if (!Signature::IsValidSignature(pSignature)) return;

        // "Stamp" the information onto the header.
        pHeader->Stamp(stamp.filename, stamp.lineNum, typeName);
    }

    /* ---------------------------------------- TrackDumpBlocks */

    void TrackDumpBlocks()
    {
        // Get an array of pointers to all extant blocks.
        size_t numBlocks = BlockHeader::CountBlocks();
        BlockHeader **ppBlockHeader =
            (BlockHeader **)calloc(numBlocks, sizeof(*ppBlockHeader));
        BlockHeader::GetBlocks(ppBlockHeader);

        // Dump information about the memory blocks.
        printf("\n");
        printf("=====================\n");
        printf("Current Memory Blocks\n");
        printf("=====================\n");
        printf("\n");
        for (size_t i = 0; i < numBlocks; i++)
        {
            BlockHeader *pBlockHeader = ppBlockHeader[i];
            char const *typeName = pBlockHeader->GetTypeName();
            size_t size = pBlockHeader->GetRequestedSize();
            char const *fileName = pBlockHeader->GetFilename();
            int lineNum = pBlockHeader->GetLineNum();
            printf("*** #%-6d %5d bytes %-50s\n", i, size, typeName);
            printf("... %s:%d\n", fileName, lineNum);
        }

        // Clean up.
        free(ppBlockHeader);
    }

    /* ---------------------------------------- struct MemDigest */

    struct MemDigest
    {
        char const *typeName;
        int blockCount;
        size_t totalSize;

        static bool TotalSizeGreaterThan(const MemDigest &md1, const MemDigest &md2)
            { return md1.totalSize > md2.totalSize; }
    };


    /* ---------------------------------------- SummarizeMemoryUsageForType */

    static void SummarizeMemoryUsageForType(
        MemDigest *pMemDigest,
        BlockHeader **ppBlockHeader,
        size_t startPost,
        size_t endPost
    )
    {
        pMemDigest->typeName = ppBlockHeader[startPost]->GetTypeName();
        pMemDigest->blockCount = 0;
        pMemDigest->totalSize = 0;
        for (size_t i = startPost; i < endPost; i++)
        {
            pMemDigest->blockCount++;
            pMemDigest->totalSize += ppBlockHeader[i]->GetRequestedSize();
            assert(strcmp(ppBlockHeader[i]->GetTypeName(), pMemDigest->typeName) == 0);
        }
    }

    /* ---------------------------------------- TrackListMemoryUsage */

    void TrackListMemoryUsage()
    {
        // If there are no allocated blocks, then return now.
        size_t numBlocks = BlockHeader::CountBlocks();
        if (numBlocks == 0) return;

        // Get an array of pointers to all extant blocks.
        BlockHeader **ppBlockHeader =
            (BlockHeader **)calloc(numBlocks, sizeof(*ppBlockHeader));
        BlockHeader::GetBlocks(ppBlockHeader);

        // Sort the blocks by type name.
        std::sort(
            ppBlockHeader,
            ppBlockHeader + numBlocks,
            BlockHeader::TypeGreaterThan
        );

        // Find out how many unique types we have.
        size_t numUniqueTypes = 1;
        for (size_t i = 1; i < numBlocks; i++)
        {
            char const *prevTypeName = ppBlockHeader[i - 1]->GetTypeName();
            char const *currTypeName = ppBlockHeader[i]->GetTypeName();
            if (strcmp(prevTypeName, currTypeName) != 0) numUniqueTypes++;
        }

        // Create an array of "digests" summarizing memory usage by type.
        size_t startPost = 0;
        size_t uniqueTypeIndex = 0;
        MemDigest *pMemDigestArray =
            (MemDigest *)calloc(numUniqueTypes, sizeof(*pMemDigestArray));
        for (size_t i = 1; i <= numBlocks; i++)    // yes, less than or *equal* to
        {
            char const *prevTypeName = ppBlockHeader[i - 1]->GetTypeName();
            char const *currTypeName = (i < numBlocks) ? ppBlockHeader[i]->GetTypeName() : "";
            if (strcmp(prevTypeName, currTypeName) != 0)
            {
                size_t endPost = i;
                SummarizeMemoryUsageForType(
                    pMemDigestArray + uniqueTypeIndex,
                    ppBlockHeader,
                    startPost,
                    endPost
                );
                startPost = endPost;
                uniqueTypeIndex++;
            }
        }
        assert(uniqueTypeIndex == numUniqueTypes);

        // Sort the digests by total memory usage.
        std::sort(
            pMemDigestArray,
            pMemDigestArray + numUniqueTypes,
            MemDigest::TotalSizeGreaterThan
        );

        // Compute the grand total memory usage.
        size_t grandTotalNumBlocks = 0;
        size_t grandTotalSize = 0;
        for (size_t i = 0; i < numUniqueTypes; i++)
        {
            grandTotalNumBlocks += pMemDigestArray[i].blockCount;
            grandTotalSize += pMemDigestArray[i].totalSize;
        }

        // Dump the memory usage statistics.
        printf("\n");
        printf("-----------------------\n");
        printf("Memory Usage Statistics\n");
        printf("-----------------------\n");
        printf("\n");
        printf("%-50s%5s  %5s %7s %s \n", "allocated type", "blocks", "", "bytes", "");
        printf("%-50s%5s  %5s %7s %s \n", "--------------", "------", "", "-----", "");

        for (size_t i = 0; i < numUniqueTypes; i++)
        {
            MemDigest *pMD = pMemDigestArray + i;
            size_t blockCount = pMD->blockCount;
            double blockCountPct = 100.0 * blockCount / grandTotalNumBlocks;
            size_t totalSize = pMD->totalSize;
            double totalSizePct = 100.0 * totalSize / grandTotalSize;

            printf(
                "%-50s %5d %5.1f%% %7d %5.1f%%\n",
                pMD->typeName,
                blockCount,
                blockCountPct,
                totalSize,
                totalSizePct
            );
        }
        printf("%-50s %5s %5s  %7s %s \n", "--------", "-----", "", "-------", "");
        printf("%-50s %5d %5s  %7d %s \n", "[totals]", grandTotalNumBlocks, "", grandTotalSize, "");

        // Clean up.
        free(ppBlockHeader);
        free(pMemDigestArray);
    }

}    // namespace MemTrack

/* ------------------------------------------------------------ */
/* ---------------------- new and delete ---------------------- */
/* ------------------------------------------------------------ */

/* ---------------------------------------- operator new */

void *operator new(size_t size)
{
    void *p = MemTrack::TrackMalloc(size);
    if (p == NULL) throw std::bad_alloc();
    return p;
}

/* ---------------------------------------- operator delete */

void operator delete(void *p)
{
    MemTrack::TrackFree(p);
}

/* ---------------------------------------- operator new[] */

void *operator new[](size_t size)
{
    void *p = MemTrack::TrackMalloc(size);
    if (p == NULL) throw std::bad_alloc();
    return p;
}

/* ---------------------------------------- operator delete[] */

void operator delete[](void *p)
{
    MemTrack::TrackFree(p);
}


    Thursday, September 22, 2011

    Are we out of memory?

    August 31st, 2008 by Christian Gyrling

    If I had a dollar for every time I heard the question “Do we not have any more memory?”…
    When I ask the question “How much memory are you using for subsystem X?”, very few developers know the answer. The answer is usually a rough ballpark, bigger or smaller, but never a definite number.
    Memory for any application is a crucial resource. No matter what type of application you are creating, you will benefit from having good memory management. With some basic memory management in place you can know your memory distribution between systems, find memory leaks much more easily, simplify memory alignment requirements… These are just a few of the areas where you will benefit greatly from good memory management.

    Getting the chaos under control

    The first thing you need to do, unless you have already done it, is to override the global new and delete functions. Doing this gives you a starting point for all your memory management. All calls to new and delete will be handled by your custom functions. The very first step is to just intercept the allocation request and carry it out as normal.
    void* operator new(size_t size)
    {
        return malloc(size);
    }

    void operator delete(void* mem)
    {
        free(mem);
    }

    // Don’t forget the array versions of new/delete
    void* operator new[](size_t size)
    {
        return malloc(size);
    }

    void operator delete[](void* mem)
    {
        free(mem);
    }
    I’m not going to go into any detail about intercepting calls to malloc or the like. All I will say is that you will do best to stay away from alloc, malloc, calloc, realloc and free (and all other permutations that I have forgotten). In most cases it can be tricky to intercept those calls. I would say that you should only use malloc once, and that is to allocate all the needed memory for your application… but more about that later.

    Custom versions of new and delete

    You will quickly find that when you get an allocation request in your ‘new’ handler, it would be very handy to know who made that allocation. From the code doing the allocation, it would sometimes also be very nice to be able to specify extra information, such as the alignment needed for the memory block. Aligned memory in particular is a pain to work with unless it is supported by the memory allocator. If you have a class with members that need to be 16-byte aligned (SSE for example), it will be messy.
    // Allocate memory with the needed padding to ensure that you
    // can properly align it to the required boundary. Then you need
    // to placement construct the new object in that memory.
    char* pInstanceMem = new char[sizeof(MyClass) + 0x0F];
    char* pAlignedMem =
        (char*)(((size_t)pInstanceMem + 0x0F) & ~(size_t)0x0F);
    MyClass* pMyClassInst = new (pAlignedMem) MyClass();

    // The allocation of the object is not terribly ugly, but it is
    // a pain to work with… though not as much of a pain as it is to
    // delete the instance. The code below will crash or just leak.
    // The destructor will be properly called, but the memory address
    // pointed to by pMyClassInst was never returned from a call
    // to ‘new’. The memory address actually returned by the call
    // to ‘new’ for this object is really unknown at this point.
    // What to do!?!?!?
    delete pMyClassInst;
    

    Wouldn’t it be much nicer if you could allocate your MyClass instance like this…

    // Just allocate with an extra argument to ‘new’
    MyClass* pMyClassInst = new (Align(16)) MyClass();

    // ... and deletion will be straightforward. The pointer passed to
    // ‘delete’ is the same pointer returned from the call to ‘new’.
    delete pMyClassInst;
    
    

    This is where overloaded versions of new and delete come to the rescue.

    // We use a class to represent the alignment to avoid any code
    // situations where 'new (0x12345678) MyClass()' and
    // 'new ((void*)0x12345678) MyClass()' might cause a placement
    // construction instead of construction on an aligned memory
    // address.
    class Align
    {
    public:
        explicit Align(int value) : m_value(value)
        {
        }
        int GetValue() const
        {
           return m_value;
        }
    private:
        int m_value;
    };
    
    // Overridden 'normal' new/delete
    void* operator new (size_t size);
    void* operator new[] (size_t size);
    void operator delete( void* mem );
    void operator delete[]( void* mem );
    
    // Aligned versions of new/delete
    void* operator new[] (size_t size, Align alignment);
    void* operator new (size_t size, Align alignment);
    void operator delete (void* mem, Align alignment);
    void operator delete[] (void* mem, Align alignment);
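
    A minimal sketch of what the aligned operator new might do with the Align class above (the pointer-stashing trick is my own illustration, not necessarily the author's implementation; it assumes the alignment value is a power of two):

    // Over-allocate, align upward, and stash the pointer that malloc()
    // returned just below the aligned address so it can be recovered.
    void* operator new (size_t size, Align alignment)
    {
        size_t align = (size_t)alignment.GetValue();
        char* raw = (char*)malloc(size + align + sizeof(char*));
        if (raw == NULL) return NULL;  // a real version would throw
        char* aligned = (char*)(((size_t)(raw + sizeof(char*)) + (align - 1))
                                & ~(align - 1));
        ((char**)aligned)[-1] = raw;   // remember the original pointer
        return aligned;
    }

    // Note: this overload is only invoked automatically if the
    // constructor throws; the global operator delete must also know
    // how to recover the stashed pointer for 'delete pMyClassInst'
    // to work.
    void operator delete (void* mem, Align /*alignment*/)
    {
        if (mem != NULL)
            free(((char**)mem)[-1]);   // free what malloc() returned
    }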
    
    

    Allocate all memory up front

    Now that you have ways to allocate memory and pass arbitrary parameters to the allocator, you can start organizing your memory.
    If you need to ship an application that can only use 256 MB of RAM, I would suggest that you allocate 256 MB of system memory up front and call that ‘retail memory’. Most of the time you need more memory during development of various systems, and for this I would allocate another clump of memory… call this memory ‘development memory’. You now have two huge buffers for your application, and you should not allocate any more system memory. When you receive a call to ‘new’ you can check a flag to decide whether to allocate the memory from the retail or the development memory pool.
    By clearly separating the two memory pools you make sure that debug-only allocations end up in the development pool and allocations needed to ship the game are taken from the retail pool. This way it is very clear from which pool you allocate memory, and it is easy to keep track of the allocated and available memory. You could even use another custom overloaded new/delete that passes in whether the allocation should come from retail or development memory, as sketched below.
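
    A sketch of such a pool-tagged overload (all names here are illustrative):

    enum MemPool { kRetailPool, kDevelopmentPool };

    // Stand-in for a real allocator; each pool is one big up-front
    // malloc plus an offset, as described above (no bounds checking,
    // for brevity).
    struct PoolAllocator
    {
        char*  m_pBuffer;
        size_t m_size;
        size_t m_offset;
        void* Alloc(size_t size)
        {
            void* pMem = m_pBuffer + m_offset;
            m_offset += size;
            return pMem;
        }
    };

    PoolAllocator g_retailPool = { (char*)malloc(256u * 1024 * 1024),
                                   256u * 1024 * 1024, 0 };
    PoolAllocator g_devPool    = { (char*)malloc(64u * 1024 * 1024),
                                   64u * 1024 * 1024, 0 };

    void* operator new (size_t size, MemPool pool)
    {
        return (pool == kRetailPool) ? g_retailPool.Alloc(size)
                                     : g_devPool.Alloc(size);
    }

    // Usage: debug-only systems allocate from the development pool,
    // e.g. DebugHud* pHud = new (kDevelopmentPool) DebugHud();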

    Divide and specialize

    Now that you have your chunks of memory it might be a good idea to split them up based on the needs of your application. After all, it's terribly impractical to have only one huge buffer for an application. I am very much against dynamic allocations at run time in general, but some allocations have to happen, so what is the best way of handling them?
    A good way to organize the memory is to split the memory chunks into smaller blocks managed by different allocators using various allocation algorithms. This will not just help to be more efficient about the memory usage but will also serve as a way to clearly assign the memory to the various systems.
    Not all allocations are the same. Here are just a few common allocation patterns that come to mind.

    Persistent Allocations

    Allocated once at startup/creation of a system (pools, ring buffers, arrays). This memory will never be deleted and therefore we don't need any complicated algorithms to allocate it. Linear allocation works great in this scenario. All it takes for an allocator like this is a pointer to the buffer, the size of the buffer and the current offset into the buffer (the allocated space).
    // Simple class to handle linear allocations
    class LinearAllocator
    {
    public:
        LinearAllocator(char* pBuffer, int bufferSize) :
            m_pBuffer(pBuffer),
            m_bufferSize(bufferSize),
            m_currentOffset(0)
        {
        }
        void* Alloc(size_t size)
        {
           // Out of space? Return NULL (or assert).
           if (m_currentOffset + (int)size > m_bufferSize)
              return NULL;
           // Hand back the memory at the current offset, then
           // advance the offset past the new allocation.
           void* pMemToReturn = m_pBuffer + m_currentOffset;
           m_currentOffset += (int)size;
           return pMemToReturn;
        }
        void Free(void* pMem)
        {
           // We can't easily free memory in this type of allocator.
           // Therefore we just ignore this... or you could assert.
        }
    private:
        char* m_pBuffer;
        int m_bufferSize;
        int m_currentOffset;
    };
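    
    As a quick usage sketch (the buffer size and names here are arbitrary), a persistent region could be set up like this:
    
    // Hypothetical: carve a 4 MB persistent region out of one of the
    // big up-front buffers and bump-allocate from it at startup.
    static char s_persistentRegion[4 * 1024 * 1024];
    static LinearAllocator s_persistentAlloc(s_persistentRegion,
                                             (int)sizeof(s_persistentRegion));
    
    // Allocated once at startup, never freed:
    void* pRingBufferMem = s_persistentAlloc.Alloc(64 * 1024);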

    Dynamic Allocations

    This memory is allocated and freed at random points throughout the lifetime of the application. Sometimes you just need memory to create an instance of some object and you can't predict when that will happen. This is when you can use a dynamic memory allocator. This allocator will have some type of algorithm to remember which memory blocks are free and which ones are allocated, and it will handle consolidation of neighboring free blocks. It will however suffer from fragmentation, which can greatly reduce the amount of memory actually available to you. There are tons of dynamic allocation algorithms out there, all with different characteristics: speed, overhead and more. Pick a simple one to start with… you can always change it later.

    One-Frame Allocations

    These allocations happen, for example, when one system creates data and another consumes it later in that same frame. It could be variable sized arrays used for rendering, animation or just debug text created by some system earlier in the frame. This allocator handles such data by resetting itself at the beginning (or end) of every frame, hence freeing all memory automatically. By doing this we also avoid fragmentation of the memory. The above LinearAllocator can be used here as well with the addition of a simple 'Reset()' function, sketched below.
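    
    A sketch of that addition; note that rewinding the offset runs no destructors, so this is only safe for per-frame data that doesn't need them.
    
    // Added to LinearAllocator: rewind the offset so the entire
    // buffer can be reused next frame. No destructors are run.
    void Reset()
    {
        m_currentOffset = 0;
    }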

    Conclusion

    Well… I hope this is useful for someone out there. This type of information has certainly helped me on both my personal and professional projects.
    Another thing I did not talk much about is memory tracking. I guess all I will say about that for now is… do it! Spend the time to implement something quick and easy to help you track down memory leaks and to see where in your code you are using up all the memory. This is a large topic by itself and therefore I will not cover it in this post. Make use of the overloading of the new/delete operators to pass __FILE__ and __LINE__ for each allocation, and use macros to make the code prettier; a minimal sketch follows.
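    
    In this sketch the TrackAlloc bookkeeping hook is a hypothetical function you would implement yourself; everything else is standard C++.
    
    #include <cstdlib>
    
    // Assumed: records the allocation and its call site somewhere.
    void TrackAlloc(void* mem, size_t size, const char* file, int line);
    
    // Overload operator new to accept the call site...
    void* operator new (size_t size, const char* file, int line)
    {
        void* mem = std::malloc(size);
        TrackAlloc(mem, size, file, line);
        return mem;
    }
    
    // ...and hide the ugliness behind a macro.
    #define TRACKED_NEW new (__FILE__, __LINE__)
    
    // Usage: MyClass* pInst = TRACKED_NEW MyClass();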
    Until next time…

    STACK, DOUBLE STACK, POOL

    http://www.spuify.co.uk/?p=454

    Obligatory Introductory Parable

    I really like sushi; it's tasty and convenient. I like the immediacy of being able to go into a sushi restaurant, complete with conveyor belt, take a seat and grab something fresh and delicious from the belt without blowing my entire lunch hour. Having said that, what I really wouldn't like is to be a member of staff in a sushi restaurant, especially if it was my job to show the diners to their seats. Here's why…

    A group of three diners walk in and ask to be seated. Delighted at having some custom, you kindly oblige, showing the diners to their seats. No sooner have they sat down and helped themselves to some tasty looking 'Tekka Maki' than the door opens again and four more diners walk in! Wow, you guys are on a roll (see what I did there?). Your restaurant now looks like this…
    Since it's lunch time, the restaurant quickly fills up to capacity. Finally, after eating all they could and settling the bill, the first party to arrive (the party of three) leaves, and a group of two walk in; you offer the newly vacant seats to your new customers. This happens a few more times until your restaurant looks like this…
    Finally a group of four diners walk in and ask to be seated. Ever the pragmatist, you've been carefully keeping track of how many empty seats you have left and it's your lucky day, you've got four spare seats! There is one snag though: these diners are the social type and would like to be seated together. You look around frantically, but while you have four empty seats you can't seat the diners together! It would be very rude to ask your existing customers to move mid-meal, which sadly leaves you no option but to turn your new customers away, probably never to return. This makes everyone very sad. If you're sat there wondering how this little tale relates in the slightest bit to games development, then read on.
    In programming we have to manage memory. Memory is a precious resource which must be carefully managed, much like the seats in our metaphorical sushi restaurant. Every time we allocate memory dynamically we're reserving memory from something called the "heap". In C and C++ this is typically done through the use of the malloc function and the new operator respectively. To continue the somewhat fishy analogy (last one I promise!), this is like our intrepid groups of diners asking to be seated in our sushi restaurant. The real shame though is that what happened in our hypothetical scenario also happens in the context of memory, and the results are much worse than a couple of empty tummies. It is called fragmentation and it is a nightmare!

    What’s wrong with malloc and new?

    Sadly, the rest of the discussion won't have such a fishy flavour to it, as this post is going to talk about malloc and new and why they have a very limited place in the context of embedded systems (such as games consoles). While fragmentation is just one facet of the problems caused by dynamic memory allocation, it is perhaps the most serious, but before we can come up with some alternatives we should take a look at all of the problems:
    1. malloc and new try to be all things to all programmers…
    They will just as soon allocate you a few bytes as they will a few megabytes. They have no concept of what the data is that they're allocating for you, or what its lifetime is likely to be. Put another way, they don't have the bigger picture that we have as programmers.
    2. Run-time performance is relatively bad…
    Allocations through the standard library functions or operators typically require descending into the kernel to service the allocation requests (this can involve all sorts of nasty side effects on your application's performance, including flushing of translation lookaside buffers, copying blocks of memory, etc.). For this reason alone, using dynamic memory allocation can be very expensive in terms of performance. The cost of the free or delete operations in some allocation schemes can also be expensive, as in many cases a lot of extra work is done to try to improve the state of the heap for subsequent allocations. "Extra work" is a fairly vague term, but it can mean the merging of memory blocks and in some cases can mean walking an entire list of allocations your application has made! Certainly not something you'd want to be wasting valuable processor cycles on if you can avoid it!
    3. They cause fragmentation of your heap…
    If you've never been on a project that has suffered from the problems associated with fragmentation then count yourself very lucky, but the rest of us know that heap fragmentation can be a complete and utter nightmare to address.
    4. Tracking dynamic allocations can be tricky…
    Dynamic allocation comes with the inherent risk of memory leaks. I’m sure we all know what memory leaks are, but if not, have a read of this. Most studios try to build some tracking infrastructure on top of their dynamic allocations to try and track what memory is in play and where.
    5. Poor locality of reference…
    There is essentially no way of knowing where the memory you get back from malloc or new will be in relation to any of the other memory in your application. This can lead to us suffering more (and more expensive) cache misses than we need to, as we end up dancing through memory like Billy Elliot on Ecstasy.
    So what is the alternative? The idea behind this post is to provide you with the details (and some code!) for a few different alternatives that you can use in place of malloc and new to help combat the problems we've just discussed.

    The Humble Linear Allocator

    As the name of this section suggests, this allocator is certainly the simplest of all those I'm going to present, although truth be told they're all simple (and that's part of the beauty). The linear allocator essentially assumes that there is no fine grained de-allocation of allocated memory resources, and proceeds to make allocations from a fixed pool of memory one after the other in a linear fashion. Check out the diagram below.
    A good example of where a linear allocator is exceedingly useful is found in nearly all SPU programming. The persistence of data in the local store is not important beyond the scope of the currently executing job, and in many cases the amount of data one brings into the local store (at least data that needs some degree of variance in its size) tends to be fairly restricted. However, don't be fooled, that's far from its only application. Here's an example of how one might go about implementing a simple linear allocator in C.

    /** Header structure for linear buffer. */
    typedef struct _LinearBuffer {
        uint8_t* mem;       /*!< Pointer to buffer memory. */
        uint32_t totalSize; /*!< Total size in bytes. */
        uint32_t offset;    /*!< Offset. */
    } LinearBuffer;
    
    /* non-aligned allocation from linear buffer. */
    void* linearBufferAlloc(LinearBuffer* buf, uint32_t size) {
    
        if (!buf || !size)
            return NULL;
    
        uint32_t newOffset = buf->offset + size;
        if (newOffset <= buf->totalSize) {
    
            void* ptr = buf->mem + buf->offset;
            buf->offset = newOffset;
            return ptr;
        }
        return NULL; /* out of memory */
    }
    It is of course possible to support aligned allocations by applying the required bitwise operations to the offset during the course of the allocation; a sketch of such a variant follows. This can be incredibly useful, but be aware that depending on the size of the data you're allocating from your buffer (and in some cases the order in which those allocations are made) you may find that you get some wasted space in the buffer between allocations. This wasted space is typically okay for alignments of a reasonable size, but can get prohibitively wasteful if you are allocating memory which requires a much larger alignment, for example 1 MB. The ordering of your allocations from linear allocators can have a drastic effect on the amount of wasted memory in these types of situations.
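    
    The sketch below assumes power-of-two alignments and that the buffer's base address (buf->mem) is itself aligned at least as strictly as anything you will request from it.
    
    /* aligned allocation from linear buffer. alignment must be a
       power of two; the offset is rounded up to the next multiple
       of the alignment before the block is carved out. */
    void* linearBufferAllocAligned(LinearBuffer* buf, uint32_t size,
                                   uint32_t alignment) {
    
        if (!buf || !size || !alignment)
            return NULL;
    
        const uint32_t alignedOffset =
            (buf->offset + (alignment - 1)) & ~(alignment - 1);
        const uint32_t newOffset = alignedOffset + size;
        if (newOffset <= buf->totalSize) {
    
            void* ptr = buf->mem + alignedOffset;
            buf->offset = newOffset;
            return ptr;
        }
        return NULL; /* out of memory */
    }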
    To reset the allocator (perhaps at the end of a level), all we need to do is set the value of offset to zero. As with any de-allocation, clients of the allocator need to ensure they're not hanging on to any pointers that you've effectively freed by doing this, otherwise you risk memory corruption. Any C++ objects you've allocated from the buffer will also need their destructors called manually.

    The Stack

    Let's get something out of the way before we start on this allocator. When I talk about a "stack allocator" in this particular case, I'm not talking about the call stack; however, that stack does have an important part to play in avoiding run-time heap allocations, as we shall see later on. So what am I talking about then? Well, the linear allocator I just described above is an excellent solution to many allocation problems, but what if you want slightly more control over how you free up your resources? The stack allocator will give you this.
    Towards the end of my description of the linear allocator, I mentioned that to reset the allocator you can simply set the offset to zero in order to free up all the resources in the allocator. That principle of restoring the offset to a particular value is what guides the implementation of the stack allocator. If you are not familiar with the concept of the stack data structure then now is probably a good time to get acquainted; you can do so here.
    Back? Okay. Each allocation from our stack allocator can optionally return a handle which represents the state of the stack allocator just before that allocation was made. This means that if we restore the allocator to this state (by changing its offset) we are effectively 'freeing' up the memory to be reused again. This is shown in the diagram below.
    This can be a very useful thing if you want some memory allocated temporarily for data which has a limited scope, for example the lifetime of a specific function or sub-system. This strategy can also be useful for things like level resources, which have a well defined order in which objects need to be freed up (that is, the reverse of the order in which they were allocated). Here is some example C code to illustrate what I've just explained:

    /* The stack buffer's header mirrors the LinearBuffer above. */
    typedef struct _StackBuffer {
        uint8_t* mem;       /*!< Pointer to buffer memory. */
        uint32_t totalSize; /*!< Total size in bytes. */
        uint32_t offset;    /*!< Offset. */
    } StackBuffer;
    
    typedef uint32_t StackHandle;
    
    void* stackBufferAlloc(StackBuffer* buf, uint32_t size, StackHandle* handle) {
    
        if (!buf || !size)
            return NULL;
    
        const uint32_t currOffset = buf->offset;
        if (currOffset + size <= buf->totalSize) {
    
            uint8_t* ptr = buf->mem + currOffset;
            buf->offset += size;
            if (handle)
                *handle = currOffset; /* set the handle to the old offset */
            return (void*)ptr;
        }
        return NULL; /* out of memory */
    }
    
    /* restore the allocator to the state captured by the handle,
       effectively freeing everything allocated after it. */
    void stackBufferSet(StackBuffer* buf, StackHandle handle) {
    
        if (!buf)
            return;
        buf->offset = handle;
    }
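    
    To make the save/restore pattern concrete, usage might look like the following (the myStack parameter is assumed to be an initialised StackBuffer):
    
    /* capture the stack state, make a scratch allocation for a
       sub-task, then roll the allocator back in one go. */
    void stackExample(StackBuffer* myStack) {
    
        StackHandle marker;
        void* scratch = stackBufferAlloc(myStack, 1024, &marker);
    
        /* ... use the 1 KB of scratch memory ... */
    
        stackBufferSet(myStack, marker); /* 'scratch' (and anything
                                            allocated after it) is freed */
    }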

    The Double Stack

    If you’re comfortable with the stack concept above, we can now move on to the double-ended stack. This is similar to the stack allocator we just described except that there are two stacks, one which grows from the bottom of the buffer upward and one which grows from the top of the buffer downward. This is shown in the diagram below.
    Where would this be useful? A good example would be any situation where you have data of a similar type but with distinctly different lifetimes, or perhaps data that is related and should be allocated from the same static memory buffer (i.e. part of the same subsystem) but has different size properties. It should be noted that where the two stacks meet is not mandated; they don't have to meet exactly half way through the buffer. In one case the bottom stack can grow very large and the top stack stay small, and vice versa; the only hard constraint is that the two offsets never cross. This added flexibility can sometimes be required to make the best use of your memory buffers.
    I don't think I really need to insult your intelligence by including a code sample for the double stack allocator, given its inherent resemblance to the single stack allocator discussed previously.

    The Pool

    We're going to shift gears a little now, away from the family of allocators described above that are based on linearly advancing pointers or offsets, and move to something a little different. The pool allocator I'm about to describe is designed to work with data types that are of the same kind or size. It splits the memory buffer it is managing into equally sized chunks, and then allows the client to allocate and free these chunks at will (see the diagram below). To do this, it must keep track of which chunks are free, and I've seen several ways of implementing this. I personally shy away from implementations such as those using a stack of indices (or pointers) to available chunks, due to the extra storage required, which can often be prohibitive, but I've seen them around. The implementation I shall describe here uses no additional storage to manage which chunks in the pool are free. In fact the header structure for this allocator contains only two member variables, making it the smallest of all the allocators described in this post.
    So how does it work internally? To manage which chunks are free we're going to use a data structure known as a linked list. Again, if you're not acquainted with this data structure then try reading this. Coming from a PlayStation3 and Xbox360 background, where memory access is expensive, I generally find node-based structures (such as the linked list) leave a rather sour taste, but I think this is perhaps one of the applications I may approve of. Essentially the header structure for the allocator will contain a pointer to a linked list. The linked list itself is spread throughout the pool, occupying the same space as the free chunks in the memory buffer. When we initialise the allocator, we move through the memory buffer's chunks and write a pointer in the first four (or eight) bytes of each chunk, with the address (or index) of the next free chunk in the buffer. The header then points to the first element in that linked list. A limitation of storing pointers in the pool's free chunks in this way is that chunks must be at least the same size as a pointer on your target hardware. See the diagram below.
    When allocations are made from the pool we simply need to make the linked list pointer in the header structure point at the second element in the linked list and then return the pointer we originally had to the first element. It's very simple, we always return the first element in the linked list when allocating. Similarly, to free a chunk and return it to the pool, we simply insert it at the front of the linked list. Inserting chunks we want to free at the front of the list as opposed to the back has a couple of benefits: firstly we don't need to traverse the linked list (or store an extraneous tail pointer in the header structure), and secondly (and more importantly) we stand a better chance of the node we just freed up being in the cache for subsequent allocations from the pool. After a few allocations and de-allocations your pool might look a little like the diagram below.
    Some C code for initialising the allocator and making allocations and de-allocations from it is provided below.

    /* The pool's header needs just the two members mentioned above. */
    typedef struct _Pool {
        uint8_t* mem;  /*!< Pointer to buffer memory. */
        uint8_t* head; /*!< Head of the free-chunk list. */
    } Pool;
    
    /* allocate a chunk from the pool. */
    void* poolAlloc(Pool* buf) {
    
        if (!buf)
            return NULL;
    
        if (!buf->head)
            return NULL; /* out of memory */
    
        /* pop the first chunk off the free list. */
        uint8_t* currPtr = buf->head;
        buf->head = (*((uint8_t**)(buf->head)));
        return currPtr;
    }
    
    /* return a chunk to the pool. */
    void poolFree(Pool* buf, void* ptr) {
    
        if (!buf || !ptr)
            return;
    
        /* push the chunk onto the front of the free list. */
        *((uint8_t**)ptr) = buf->head;
        buf->head = (uint8_t*)ptr;
    }
    
    /* initialise the pool header structure, and set all chunks in the
       pool as free. chunkSize must be at least sizeof(uint8_t*). */
    void poolInit(Pool* buf, uint8_t* mem, uint32_t size, uint32_t chunkSize) {
    
        if (!buf || !mem || !size || !chunkSize)
            return;
    
        /* write, into each free chunk, the address of the next one. */
        const uint32_t chunkCount = (size / chunkSize) - 1;
        for (uint32_t chunkIndex = 0; chunkIndex < chunkCount; ++chunkIndex) {
    
            uint8_t* currChunk = mem + (chunkIndex * chunkSize);
            *((uint8_t**)currChunk) = currChunk + chunkSize;
        }
    
        *((uint8_t**)&mem[chunkCount * chunkSize]) = NULL; /* terminating NULL */
        buf->mem = buf->head = mem;
    }
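    
    Putting it together, usage could look like this (the chunk size and count are arbitrary):
    
    /* a pool of 256 chunks of 64 bytes each, carved from a static
       buffer. */
    static uint8_t s_poolMem[64 * 256];
    
    void poolExample(void) {
    
        Pool myPool;
        poolInit(&myPool, s_poolMem, sizeof(s_poolMem), 64);
    
        void* a = poolAlloc(&myPool); /* grab two chunks... */
        void* b = poolAlloc(&myPool);
        poolFree(&myPool, a);         /* ...return one; 'a' heads the
                                         free list for the next alloc. */
        (void)b;
    }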

    A Note on Stack-based Allocation (alloca is your friend)

    Earlier on, you may recall that I said there’d be a mention of stack based allocations in the context of the call stack. I’m sure you’ve seen code which conceptually looks something like this:
    void myFunction() {
    
        void* myTemporaryMemoryBuffer = malloc(myMemorySize);
        /* do processing limited to this function. */
        free(myTemporaryMemoryBuffer);
    }
    There is a function which comes with most C compilers that should mean (depending on the size of your allocation) you won't have to resort to heap allocations for temporary buffers of this ilk. That function is alloca. How alloca works internally is architecture dependent, but essentially it performs allocations by adjusting the stack frame of your function to give you an area to write data to; this can be as simple as moving the stack pointer (just like the linear allocator we mentioned right at the start). The memory returned to you by alloca is then freed up when the function returns.
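    
    The earlier pseudocode rewritten with alloca might look like the following; note that the header declaring alloca varies (alloca.h on most Unix-likes, malloc.h with MSVC, where it is spelled _alloca).
    
    #include <alloca.h>
    
    void myFunction() {
    
        /* the buffer now lives in this function's stack frame and
           is reclaimed automatically on return; no free() needed. */
        void* myTemporaryMemoryBuffer = alloca(myMemorySize);
        /* do processing limited to this function. */
    }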
    There are a few caveats to using alloca that you should be aware of, however. Be sure to check the size of your allocation to make sure you're not requesting an unreasonable amount from the stack; this can cause nasty crashes later on if your allocation gets so large that the stack overflows. For this reason it is also best to know all the places where your function will be called in the context of the program's overall flow if you're contemplating allocating a large chunk with alloca. Use of alloca can affect portability in some limited circumstances (apparently), but I've yet to come across a compiler that doesn't support it.
    For more details you can consult your favourite search engine.

    A Final Thought…

    Often the memory one allocates during the course of a processing task is temporary and persists only for the lifetime of a single frame. Taking advantage of this knowledge and moving allocations of this type into separate memory buffers is essential to combating the adverse effects of fragmentation in a limited memory system. Any of the above allocation schemes will work, but I would probably suggest going with one of the linear ones, as they are much easier to reset than the pool implementation I've described here. Noel Llopis talks about this topic in more detail in this excellent blog post.
    The right allocator for you depends on many factors and on what the problem you're trying to solve demands. The advice I would offer is to think carefully about the patterns of allocations and de-allocations you wish to perform with your system; think about the sizes and lifetimes of these allocations and try to manage them with allocators that make sense in that context. It can sometimes help to draw memory layouts on paper (yeah, with a pen and everything) or to graph roughly how you expect your memory usage to look over time. Believe it or not, this can really help you to understand how a system produces and consumes data, and doing these things will help you make calls about how to separate your allocations to make memory management as easy, quick and fragmentation-free as possible. Remember, what I've talked about here is just a small selection of some of the simplest strategies that I tend to favour when writing code for console games. It is my hope that if you've made it this far, and you weren't already doing this stuff, your brain is now alive with ideas about code you've written in the past which could have taken advantage of these techniques to mitigate the substantial drawbacks associated with dynamic memory allocation, or about other strategies you can exploit to solve your memory management problems without dynamic allocation in limited memory systems.
    In closing, I believe that programmers should be mindful of the impact of making dynamic allocations (especially in console games) and think twice about using the malloc function or the new operator. It is easy to have the attitude that you're not doing many allocations so it doesn't really matter, but this type of thinking spread across a whole team quickly snowballs and results in death by a thousand paper cuts. If not nipped in the bud, the fragmentation and performance penalties arising from dynamic memory allocation can have catastrophic consequences later in your development lifecycle which are not easy to solve. Projects where memory allocation and management are not thought through properly often suffer from random crashes after prolonged play sessions due to out of memory conditions (which, by the way, are near impossible to reproduce), and cost hundreds of programmer hours spent trying to free up memory and reorganise allocations.
    Remember: You can never start thinking about memory early enough in your project and you’ll always wish you had started earlier!

    More Information

    Here are some links to related topics, or to topics that I didn’t have time to cover:
    http://gamesfromwithin.com/start-pre-allocating-and-stop-worrying
    http://en.wikipedia.org/wiki/Circular_buffer
    http://en.wikipedia.org/wiki/Buddy_memory_allocation
    http://www.memorymanagement.org/articles/alloc.html
    Thanks to Sarah, Jaymin, Richard and Dat for proof reading this drivel.