Thursday, August 9, 2012

How much memory does a C# string take up?

I've seen various answers on the web, and most of them are wrong or rest on wrong assumptions (this page for example assumes the overhead is 20 bytes based on only a couple of tests). The real answer is a bit more complex, but it makes a lot of sense once you look at what happens in the .NET runtime.

Let's assume a 32-bit system.

A C# string is a reference type. Every reference type has an 8 byte header. The first 4 bytes are used for the lock bits (to support the C# lock statement). The second 4 bytes are a pointer to the object type. The object type in turn contains the object vtable. In reality, not all of the bits in the header are needed for these two fields. For example, the object type pointer is aligned to a 4 byte boundary, so its lower two bits are always zero and can be reused by the garbage collector for marking the object in its mark-and-sweep cycle. That's 8 bytes of overhead for every object before any of its own fields are stored.

X = 8 + ...

C# strings store their length. The length is a 4 byte signed integer (giving a maximum theoretical string length of 2^31 - 1 characters).

X = 8 + 4 + ...

To speed up marshalling to native code, all .NET strings are additionally terminated with a Unicode null character. Without this terminator, every string passed to a Win32 API would need to be copied. With it, API calls that take Unicode strings can simply be given a pointer to the .NET string (after the string is pinned). That's 2 bytes.

X = 8 + 4 + 2 + ...

Then you need to store the characters.  Each .NET char takes 2 bytes.

X = 8 + 4 + 2 + (2 * LEN)

But that's not the whole story. The .NET garbage collector allocates memory with 32-bit alignment. In other words, the total amount of memory allocated at a time will always be a multiple of 4 (4, 8, 12, 16, 20, 24, 28, 32, 36 etc). In theory this means that any object could be referenced in .NET with 30 bits rather than 32 bits, because the lower two bits of an aligned address are always zero. Every field that is a reference type in .NET is aligned. Read more about data alignment here: http://en.wikipedia.org/wiki/Data_structure_alignment
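As a rough illustration of the alignment point above, here is a small Python sketch (Python is used purely as a calculator here; none of this is .NET runtime code). The helper names align_up, compress and decompress are made up for this example. It shows how a size is rounded up to a multiple of 4, and why a 4-byte-aligned 32-bit address fits in 30 bits.

```python
ALIGNMENT = 4  # 32-bit alignment, as described in the post


def align_up(size, alignment=ALIGNMENT):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


def compress(address):
    """Drop the two low bits of a 4-byte-aligned address (they are always zero)."""
    if address % ALIGNMENT != 0:
        raise ValueError("address is not 4-byte aligned")
    return address >> 2


def decompress(compressed):
    """Restore the original aligned address from its 30-bit form."""
    return compressed << 2


if __name__ == "__main__":
    for raw in (1, 14, 16, 18):
        print(raw, "->", align_up(raw))
    top = 0xFFFFFFFC  # highest 4-byte-aligned 32-bit address
    print(hex(compress(top)), compress(top) < 2 ** 30)
```

The round trip through compress and decompress loses nothing, which is the sense in which 30 bits would be enough.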

So the final answer is:

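X = ((8 + 4 + 2 + (2 * LEN)) + 4 - 1) / 4 * 4

The division here is integer division, so the expression rounds the unaligned size up to the next multiple of 4.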

In .NET prior to version 4, .NET strings had an extra field named "m_arrayLength" which was never used.  This made strings at least 4 bytes longer.  This field was removed in 4.0.
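To make the arithmetic concrete, here is a minimal Python sketch of the formula above for a 32-bit system. The function name dotnet_string_size and the pre_net4 flag are hypothetical names chosen for this example, and the flag assumes that the old m_arrayLength field simply adds 4 bytes before the size is rounded up.

```python
HEADER_BYTES = 8      # lock bits + type pointer
LENGTH_BYTES = 4      # string length field
TERMINATOR_BYTES = 2  # Unicode null terminator
CHAR_BYTES = 2        # each .NET char
ALIGNMENT = 4         # allocation granularity on 32-bit


def dotnet_string_size(length, pre_net4=False):
    """Estimate the bytes used by a .NET string of `length` chars (32-bit)."""
    if length < 0:
        raise ValueError("length must be non-negative")
    raw = HEADER_BYTES + LENGTH_BYTES + TERMINATOR_BYTES + CHAR_BYTES * length
    if pre_net4:
        raw += 4  # assumption: m_arrayLength adds 4 bytes before alignment
    # Integer division rounds the raw size up to a multiple of ALIGNMENT.
    return (raw + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT


if __name__ == "__main__":
    for n in (0, 1, 2, 10):
        print(n, dotnet_string_size(n), dotnet_string_size(n, pre_net4=True))
```

Under these assumptions an empty string comes out at 16 bytes (14 rounded up), and a 10-character string at 36 bytes (34 rounded up).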

Did you know that a Java string also has a buffer pointer (in C# the buffer comes straight after the string length) and an offset field that stores an offset within the string buffer? This allows the java.lang.String.substring(int, int) method to run in O(1) time rather than the O(n) time it takes in .NET. The new string simply points to the original string's buffer, and the offset is taken into account in all operations. This has the unfortunate side effect that a 1-character string produced by a substring call can keep 1MB of memory alive because its originating string was 1MB.
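The shared-buffer idea can be sketched with a toy model. This is not the actual java.lang.String source; the class SharedBufferString and its field and method names are invented for this illustration, and a Python str stands in for the character buffer.

```python
class SharedBufferString:
    """Toy model of a string stored as (buffer, offset, count)."""

    def __init__(self, buffer, offset=0, count=None):
        self.buffer = buffer  # shared, never copied by substring()
        self.offset = offset
        self.count = len(buffer) - offset if count is None else count

    def substring(self, begin, end):
        """O(1): reuse the same buffer and only adjust offset and count."""
        if not 0 <= begin <= end <= self.count:
            raise IndexError("substring range out of bounds")
        return SharedBufferString(self.buffer, self.offset + begin, end - begin)

    def retained_chars(self):
        """Number of characters kept alive by this string's buffer."""
        return len(self.buffer)

    def __len__(self):
        return self.count

    def __str__(self):
        return self.buffer[self.offset:self.offset + self.count]


if __name__ == "__main__":
    big = SharedBufferString("x" * 1_000_000)
    tiny = big.substring(0, 1)
    print(len(tiny), tiny.retained_chars(), tiny.buffer is big.buffer)
```

Here tiny is one character long but still holds a reference to the full million-character buffer, which is the side effect described above.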
