Why We Have Strings As Managed Reference Types


Category : C#

Introduction

This post is a little different from my usual ones. I recently answered a question on the Unity forums about the implementation of the string class in C#. The core of the question was why strings in C# are managed and collected by the garbage collector, instead of working like C++ strings, where no GC is involved and the memory is freed by the string class itself.

I can empathize with this question. In performance-critical scenarios we try to avoid the garbage collector as much as we can, but eventually we have to deal with strings or delegates. From that point on, the goal is less about avoiding the GC and more about controlling when the garbage is collected: we cache our variables as much as we can and trigger a collection when we are not in a performance-critical part of the application.

Although I have a background in low-level programming with C, and to a lesser extent C++, I don’t have experience implementing compilers, so my thoughts are purely theoretical. Still, I’m writing this post because I want these thoughts written down where they can be easily found, and because I would like to hear opinions from other people, ideally from someone with professional experience in creating a language compiler.

My Thoughts

The Different Implementations Of The String Class

In C#, a string variable is a reference (for a local variable, stored on the stack) that points to a string object on the heap.

In C++, a string declared as a local variable lives on the stack, and its contents live on the heap, referenced from inside the string object. When the string is small enough, many standard library implementations store the characters inside the string object itself and skip the heap allocation (the small string optimization), but that is the exception.

So the C++ string object sits on the stack and points to an array of chars on the heap. When the object goes out of scope, its destructor frees that memory, which means that the implementer of the string class was responsible for freeing the memory.

In contrast, the string object in C# lives in heap memory. When no reference to it remains, the object becomes eligible for garbage collection (it may or may not be collected right away) and the string contents “may” get deleted from memory too.

I say “may” because string data in C# is not always exclusive to one variable; the runtime can share a single string object between several references to save memory. For example, in C#:

string a = "A string";
string b = "A string";

Console.WriteLine(Object.ReferenceEquals(a, b)); // prints True

Here we have two string variables, but memory is allocated for only one “A string”, because the runtime interns identical string literals and both variables reference the same object.
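To make the sharing a bit more concrete, here is a small sketch of my own (the InterningDemo class and its method names are made up for illustration). It assumes a typical .NET runtime and shows that the sharing applies to literals, while a string built at run time is a separate object unless you ask for the pooled copy with string.Intern:

```csharp
using System;

// Hypothetical demo class, for illustration only.
public static class InterningDemo
{
    // True when both variables point at the very same string object.
    public static bool SameObject(string x, string y) => object.ReferenceEquals(x, y);

    // Builds a string at run time, so it is a fresh heap object.
    public static string BuildAtRuntime(string text) => new string(text.ToCharArray());

    public static void Run()
    {
        string a = "A string";
        string b = "A string";
        string c = BuildAtRuntime("A string");

        Console.WriteLine(SameObject(a, b)); // True: identical literals share one object
        Console.WriteLine(SameObject(a, c)); // False: c is its own allocation
        Console.WriteLine(a == c);           // True: == compares the contents

        // Intern returns one pooled instance per distinct content.
        Console.WriteLine(SameObject(string.Intern(a), string.Intern(c))); // True
    }
}
```

Calling InterningDemo.Run() prints the four results in order. The point is that two variables sharing one object is something the runtime decides, so the programmer cannot assume that dropping one reference frees the characters.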

So we actually have two questions to answer: why is the string managed by the GC like everything else, and why was it implemented this way?

Managed memory

For the first question, it would be weird for a managed language to have one of its basic types unmanaged. There seems to be a misconception that the garbage collector is slower at freeing memory than doing it manually, the way the string class in C++ does, but in reality freeing memory is a very small part of memory management.

The garbage collector does many other things that someone coding in C++ has to do manually, and has to do better than the GC for the code to be more efficient. That is much harder than just deleting the memory. Some examples:

You have to keep track of all the references that exist for a string and delete it only when the last one goes out of scope, unless you implement strings the way C++ does, where each string owns its own copy of the data, which is less memory efficient for big strings. The GC does this tracking automatically, at some cost to performance, but a mechanism that does it manually in C++ has a performance cost too.
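As a rough idea of what tracking the references by hand looks like, here is a minimal reference counting sketch. RefCountedText is a hypothetical type I made up for this post, not something from .NET, and the char array only stands in for memory you would own manually:

```csharp
using System;

// Hypothetical type, for illustration only; not part of .NET.
public sealed class RefCountedText
{
    private readonly char[] _buffer; // stands in for manually owned memory
    private int _refCount;
    private bool _freed;

    public RefCountedText(string text)
    {
        _buffer = text.ToCharArray();
        _refCount = 1; // the creator holds the first reference
    }

    public int RefCount => _refCount;
    public bool IsFreed => _freed;
    public int Length => _buffer.Length;

    // Every new owner has to remember to call this.
    public RefCountedText AddRef()
    {
        if (_freed) throw new ObjectDisposedException(nameof(RefCountedText));
        _refCount++;
        return this;
    }

    // Every owner has to call this exactly once when it is done.
    public void Release()
    {
        if (_freed) throw new ObjectDisposedException(nameof(RefCountedText));
        _refCount--;
        if (_refCount == 0)
        {
            _freed = true; // the point where unmanaged code would free the memory
        }
    }
}
```

Every copy of the reference now costs an AddRef call, and every owner has to pair it with exactly one Release. A forgotten Release is a leak and an extra one is a use-after-free, which is the bookkeeping the GC takes off your hands.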

You have to be careful with memory fragmentation. This is very important and is a big problem in C++: it is not as simple as freeing the memory, because eventually you end up with empty “pockets” of memory. In C# this is solved differently depending on the garbage collector.

Unity uses the Boehm GC, which keeps a pointer to each free block; every time memory needs to be allocated, it checks the size of a free block to see whether the object fits and, if not, tries the next one, and so on. The standard .NET GC instead compacts the memory after it frees objects, so that a single pointer marks the start of the free space. That makes allocation faster but collection slower, and for that reason the compaction is not done in one go.
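The difference between the two allocation strategies can be sketched with two toy classes. These are simplified models I wrote for illustration (FreeListHeap and BumpHeap are made-up names), and they are nowhere near what the real collectors do internally:

```csharp
using System;
using System.Collections.Generic;

// Toy model of a heap with scattered free blocks (first-fit search).
public sealed class FreeListHeap
{
    // Each entry is a free block: start offset and size in bytes.
    private readonly List<(int Start, int Size)> _free = new List<(int Start, int Size)>();

    public FreeListHeap(params (int Start, int Size)[] holes) => _free.AddRange(holes);

    // Walks the free blocks until one is large enough.
    // Returns the allocated offset, or -1 when no single block fits.
    public int Allocate(int size, out int blocksChecked)
    {
        blocksChecked = 0;
        for (int i = 0; i < _free.Count; i++)
        {
            blocksChecked++;
            var (start, blockSize) = _free[i];
            if (blockSize < size) continue;

            if (blockSize == size) _free.RemoveAt(i);
            else _free[i] = (start + size, blockSize - size);
            return start;
        }
        return -1;
    }
}

// Toy model of a compacted heap: one pointer marks the start of the free space.
public sealed class BumpHeap
{
    private readonly int _capacity;
    private int _next;

    public BumpHeap(int capacity) => _capacity = capacity;

    // Allocation is one bounds check and one addition.
    public int Allocate(int size)
    {
        if (_next + size > _capacity) return -1;
        int start = _next;
        _next += size;
        return start;
    }
}
```

With free blocks of 16, 8 and 64 bytes, new FreeListHeap((0, 16), (40, 8), (100, 64)).Allocate(32, out int n) has to check all three blocks before returning offset 100. A following request for 40 bytes then fails, even though 56 bytes are free in total, because no single block is big enough. The BumpHeap never searches, but it only works if something keeps the live objects packed together.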

Without a garbage collector, this is a problem for the programmer, who has to implement one of those strategies (or a different one) in a way that is faster than the GC’s implementation. If you just create and delete strings without doing that, your memory will eventually look like Swiss cheese.

Other problems: you have to lock the memory manually while you defragment it, update every pointer and reference each time you move memory, do bounds checking on your memory, and so on.

This is the answer to the first question, why the string is managed by the GC like everything else: managing memory manually, the way the string class in C++ does, involves much more than the act of freeing it if you want good performance.

Why String Is Implemented Like That

Strings in C# are optimized for memory size. In general, they are assumed to be relatively large and are immutable. C# is a general-purpose programming language, and in 99% of cases the performance hit of the GC for strings doesn’t matter.

For the remaining 1%, you can, 99% of the time, cache your strings, then null the references and call the GC when it is convenient to do so, such as when loading a level, pausing the game or showing a menu. That is much easier than managing all the memory problems above. For the few cases that are left, C# offers low-level tools to manage memory yourself, or you can use a library that provides a low-level string implementation. But this is rare, and for this reason the managed string is preferred.
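The caching approach described above could look something like the following sketch. ScoreLabelCache is a hypothetical helper with names of my own choosing, not something from Unity or .NET; the only real API calls are the Dictionary methods and GC.Collect:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public sealed class ScoreLabelCache
{
    private readonly Dictionary<int, string> _labels = new Dictionary<int, string>();

    public int Count => _labels.Count;

    // Returns the same string object for a repeated score, so showing an
    // unchanged value every frame does not allocate a new string.
    public string Get(int score)
    {
        if (!_labels.TryGetValue(score, out string label))
        {
            label = "Score: " + score; // allocates once per distinct score
            _labels[score] = label;
        }
        return label;
    }

    // Call at a convenient moment, such as a level load or a pause menu.
    public void ReleaseAll()
    {
        _labels.Clear(); // drop the references so the strings become collectable
        GC.Collect();    // request the collection now, while a pause is acceptable
    }
}
```

During gameplay you would call Get every frame, and ReleaseAll only while a loading screen or a menu is showing. Whether an explicit GC.Collect is worth it depends on the project, so treat it as something to measure and not as a rule.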

Conversions, Polymorphism and Boxing

Finally, there is the “object” problem. All types in C# derive from the object class. This helps with conversions, polymorphism and the boxing of value types. If strings weren’t managed, many things that seem natural in C# today would be affected, such as writing an int to the console through a format string, which boxes the int and calls its ToString method. They would be more complicated; slower, because a mechanism would be needed to cross between the managed and the unmanaged space; and harder for programmers, because any implicit conversion to string would need manual memory handling.
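A short sketch of the boxing and conversion path mentioned above. BoxingDemo and its methods are names I made up; string.Format and ToString are the standard .NET calls:

```csharp
using System;

// Hypothetical demo class, for illustration only.
public static class BoxingDemo
{
    // The parameter type is object, so passing an int boxes it:
    // the value is copied into a new heap object that the GC owns.
    public static string Describe(object value) => value.ToString();

    // string.Format takes object arguments, so age is boxed here,
    // and the resulting string is one more managed allocation.
    public static string FormatAge(int age) => string.Format("Age: {0}", age);
}
```

BoxingDemo.Describe(42) returns "42" and BoxingDemo.FormatAge(30) returns "Age: 30". Neither call says anything about who owns the box or the resulting string, and that only works because both are managed objects.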

Final Thoughts

So my answer to why strings are not unmanaged as in C++ is that it is an all-or-nothing deal. Either everything stored on the heap is managed, so that conversions and memory management are easier for the programmer and performance is gained by caching and by having the garbage collected outside the performance-critical parts of the app, or everything on the heap is unmanaged and the programmer deals with memory manually, in a way that is more performant for the use case than the GC, by creating mechanisms that handle all of those memory management problems and not just freeing the memory.

Mixing the two takes the worst of each: less performance, because of the travel between managed and unmanaged space and any implicit string creations needed; less safety, because of the manual memory management; and less productivity, because the programmer has to think about memory in two different ways, low-level for the unmanaged part and high-level for the managed part, often both in the same statement. For example:

Console.WriteLine($"{person.Name} {person.Age}");

(Here person is a managed instance of a class, Name is an unmanaged string and Age is an int that gets converted to a string.) What gets manually managed here and what doesn’t? Is the conversion of Age to a string performant? Does Name reference the unmanaged space and, if so, is it the only reference? When should the string created from the Age integer be freed? Will I need it later, so that I should cache it to avoid the cost of crossing the managed/unmanaged boundary? If I free it, do I need to consolidate the unmanaged memory now or later? And so on.

Conclusion

In the end, I think both managed and unmanaged languages have pros and cons depending on the use case, but a language that mixes managed and unmanaged objects is the worst of both worlds. I’m sure there are other complications in mixing managed and unmanaged instances that someone with experience in building language compilers will know.

These were the arguments I could think of for why strings are implemented as managed objects, the same way as a normal class. If you have anything to add, or a different opinion on the subject, I would love to hear your thoughts.

Thank you for reading, and as always, if you have any questions or comments you can use the comments section, or contact me directly via the contact form or by email. Also, if you don’t want to miss any of the new blog posts, you can subscribe to my newsletter or the RSS feed.

