|  | Singleton overview |  | |
| | | Guest |  |
| Posted: Fri Aug 15, 2008 12:28 pm Post subject: Singleton overview |  |
| |  | |
I have a 481 line text document at 80 characters per line. It's a brief overview of globals, singletons, the static initialization order fiasco, the static de-initialization order fiasco, and double checked locking. I'll be basing a presentation to my colleagues on this little draft. In it, I do a quick overview of the concepts, what works, what doesn't work, and why. It contains nothing new, and it's not particularly formal.
I was wondering if this was a proper venue for posting such a thing for comments and criticism, and if not, what would be?
In the interests of brevity, the things I'm most interested in are:
-1- (A little beyond the scope of C++98, but in scope for C++0x). If one programs with the Java 1.5 memory model guidelines in mind, will this work with C++98 on all major operating systems with all major threading libraries? C++0x?
-2- I'm paranoid when it comes to problems like this. I dislike how the standard singleton interface returns a pointer to the singleton. I have thought about making the suggestion that instead of a getInstance function, you declare global functions in the header file. These global functions are defined in the cpp file to use the singleton. Thus the user is unable to get a reference to the singleton, and thus cannot cause static de-initialization order problems.
-3- I know the C++98 standard does not address threads, but if statics are initialized during a multi-threaded situation, will all major compilers still destroy them in the exact opposite order? This implies some locking overhead during the "first pass" over function local statics, as the compiler has to create a hidden global list of static objects to use to determine the order of static de-initialization, and update it whenever a static constructor exits. As two different function local statics can be initialized concurrently, access to this list must be synchronized for correct results.
-4- I believe I have an implementation for a singleton in C++98 so that the following are true. (I was wondering if I missed anything.) -4-a- It is lazily initialized. It will be initialized on first use, which may be after main has started. -4-b- It is thread-safe, a.k.a. the singleton is guaranteed to be constructed only once. -4-c- There is synchronization at most for the first call from each thread. Later calls have no synchronization. -4-d- It will be destroyed during static de-initialization. -4-e- It does not any static initialization order problems nor static de-initialization order problems. (Assuming this singleton does not use other singletons. If it does, as long as it references all needed singletons in its constructor, all static order problems are gone.) -4-f- It is otherwise correct.
Here is the "implementation", which may require implementation specific code, but I don't think so. What it does is replace synchronization overhead in all cases with an extra level of function indirection in all cases.
It was inspired when a Java friend of mine claimed that Java could have a singleton in a multi-threaded environment with all of the guarantees I've listed (except 4-d, static de-initialization). I at first said impossible, but it struct me how the JVM could do it by swapping out a virtual function pointer in the virtual function table after the first call. I do something similar here.
//.hpp file class Mutex {}; //from another header file class Guard { public: Guard(Mutex&); }; //from another header file
void function_1(); void function_2();
//.cpp file #include <cassert> namespace { class T { public: void function_1(); void function_2(); };
Mutex & functionContainingStaticMutex() { static Mutex m; return m; } struct ForceInstantionOfMutex { ForceInstantionOfMutex() { functionContainingStaticMutex(); } } instance_ForceInstantionOfMutex;
T& synchronizedCall(); T& unsynchronizedCall();
//initialization of POD types with constant expressions //is guaranteed to occur before runtime. T& (*funcPtr)() = &synchronizedCall;
T& functionContainingStaticT() { static T t; return t; } T& synchronizedCall() { Guard g(functionContainingStaticMutex()); T& t = functionContainingStaticT();
funcPtr = unsynchronizedCall;
//You need a memory barrier here. You need an implementation //specific instruction to prevent another core from seeing //funcPtr == &unsynchronizedCall and t partially initialized.
//The problem is that a core may load funcPtr into its cache //without calling getSingleton(). For example, it may load //the address right next to &funcPtr during some other operation, //and it just happens to load the addresses around it as well //to the cache as a kind of look-ahead-like optimization.
return t; } T& unsynchronizedCall() { return functionContainingStaticT(); } T& getSingleton() { return funcPtr(); } } void function_1() { getSingleton().function_1(); } void function_2() { getSingleton().function_2(); }
-- [ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
| |
| | | Guest |  |
| Posted: Sun Aug 17, 2008 2:53 pm Post subject: Re: Singleton overview |  |
| |  | |
(Replying to myself, yes.)
I was looking over my singleton implementation which I claimed had lazy initialization, was threadsafe, and had no synchronization after the first call. In the implementation, I had a comment in there describing how I would need an implementation specific instruction of some kind to prevent other threads from seeing the writes out of order.
However, my several hours of googling seemed to show that there is no such instruction. All of the locking instructions I could find were reciprocal; both threads needed to use an instruction to guarantee the order that writes from one thread will be visible to another thread, for example mutex lock and release and a write memory barrier and read memory barrier.
For my algorithm to work, I need an instruction that will guarantee that all of the writes made by this thread before the instruction will be visible to all other threads before all writes made after the instruction. For a description of what it would have to do hardware- wise: 1- Block this core until all writes have gone to main memory, 2- Then invalidate all other cores' caches. (I suppoes the actual instruction implementation wouldn't have to be quite so draconian to give what's needed, but I believe that description is sufficient.)
Does such a hardware instruction exist on most / all platforms? And will C++0x give it to me?
As an alternative, I was toying around with this idea: When the first thread reaches the singleton synchronization, it modifies the threading library so that any call to the threading library will cause that thread to try and acquire the singleton initialization lock. The first thread will wait until all other currently existing threads are blocked as such, and only then would it release the lock, thus forcing all other threads to see the writes in the correct order. Unfortunately, it could take a long time before before all threads hit a thread library call, and this also requires that all code uses your threading library and not another one. This sounds possible, but you need to exert a lot more control over the entire application, and this has the very real and possibly high potential to cause the entire application to come to a screeching halt, waiting for all threads to hit a thread library call.
But it would work.
-- [ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
| |
| | | Chris M. Thomasson |  |
| Posted: Sun Aug 17, 2008 4:58 pm Post subject: Re: Singleton overview |  |
<JoshuaMaurice@gmail.com> wrote in message news:26770c04-0668-4de2-96f6-ac769398be49@v1g2000pra.googlegroups.com...
| Quote: | (Replying to myself, yes.)
|
[...]
Unless I am missing something, your singleton "pseudo-impl" is totally busted. Here is a real compliable impl (MSVC) that happens to work on an x86:
LINK (READ ALL!)
AFAICT, you miss several important points... One being that you simply need a recursive mutex... Please explain how you handle a singleton object which creates another singleton object in its ctor? WRT the memory barriers, well, you only need a dependant load on the reader side, and a release barrier on the writer side. End of story. AFAICT, C++0x does not yet support load-dependencies; think DEC Alpha and RCU for a moment...
-- [ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
| |
| | | Guest |  |
| Posted: Wed Aug 20, 2008 8:58 pm Post subject: Re: Singleton overview |  |
| |  | |
Specifically, I was reviewing the Java idiom which prompted my (potentially) failed attempt at C++ code. Let me write out the Java idiom to help explain what I want in C++.
public class SingletonGetter { public static Singleton getSingleton() { return Singleton.instance; } }
The Java spec guarantees that the Singleton class will not be loaded until the first function call to getSingleton(). Moreover, either the Java spec or the spec for some specific JVM guarantees that only the first call (or first X calls) will have any synchronization. The remaining majority of calls will not.
The C++ code I had in the first post of this thread was an attempt to mimic that. I used a changeable function pointer, very much like how the JVM can just change a virtual function pointer in the virtual function table to point to an unsynchronized version. However, with the hardware primitives I know, every other core will have to run at least one hardware synchronization primitive at the correct point to guarantee the correct ordering of writes.
I see this possible in several ways:
1- Thread Specific Storage. Now, unless there is some magical hardware support for this which I don't know, we're still looking at a synchronization primitive of some kind to access Thread Specific Storage, at least a read memory barrier, so I hope they don't mean this. (Amusingly enough, it's the chicken and the egg. We could have Thread Specific Storage without synchronization overhead if we solve this problem, but we can't use Thread Specific Storage to solve this problem without synchronization.) Either way, there's still going to be additional computation involved past the synchronization primitive, specifically some lookup in an associative container like a hash map or linked list (of pairs).
2- A magical hardware instruction which, at a minimum, will guarantee that this core's writes will finish and go to main memory, then force all other core's to invalidate their caches of this core's writes.
3- Make all other core's perform a synchronization primitive, but only at the first call. I don't know for sure any good way to do this. Waiting until other threads hit a threading library call seems like a bad idea. Maybe you could use a hardware interrupt to cause all cores to execute a signal handler which will do the required synchronization primitive. This seems possible, but it would require co-operation of the entity which knows about all threads and the hardware specific interrupts, a.k.a. the compiler.
Either way, the Java references I've found claim to be able to do a mutually exclusive / thread-safe initialization of a lazy-initialized singleton, and that once initialized, no call to get the singleton will have any synchronization. I don't know if the reference I have lies, and they actually use a read memory barrier on each call but not a full blown lock, or if JVMs actually do this without any hardware synchronization primitive. Either way, it seems like it's possible, but would require the compiler to provide the implementation.
Thus I ask again, will C++0x give me this? Unfortunately, I assume no, and it's too late for adding new material to C++0x, so maybe this new compiler support could be added to the TR2.
-- [ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
| |
| | | gpderetta |  |
| Posted: Thu Aug 21, 2008 12:57 pm Post subject: Re: Singleton overview |  |
| |  | |
On Aug 20, 10:58 pm, JoshuaMaur...@gmail.com wrote:
| Quote: | Specifically, I was reviewing the Java idiom which prompted my (potentially) failed attempt at C++ code. Let me write out the Java idiom to help explain what I want in C++.
public class SingletonGetter { public static Singleton getSingleton() { return Singleton.instance; }
}
The Java spec guarantees that the Singleton class will not be loaded until the first function call to getSingleton(). Moreover, either the Java spec or the spec for some specific JVM guarantees that only the first call (or first X calls) will have any synchronization. The remaining majority of calls will not.
|
IIRC C++0x will guarantee thread safe initialization of statics. There is no guarantee on the efficiency of the implementation, but the paper with that proposal included an algorithm that only needs a load dependant memory barrier (i.e. a plain load on most architectures) on the fast path, so it is likely that most implementations will follow the paper (IIRC gcc uses a global mutex ATM).
<snip>
| Quote: | I see this possible in several ways:
1- Thread Specific Storage. Now, unless there is some magical hardware support for this which I don't know, we're still looking at a synchronization primitive of some kind to access Thread Specific Storage, at least a read memory barrier, so I hope they don't mean this. (Amusingly enough, it's the chicken and the egg. We could have Thread Specific Storage without synchronization overhead if we solve this problem, but we can't use Thread Specific Storage to solve this problem without synchronization.) Either way, there's still going to be additional computation involved past the synchronization primitive, specifically some lookup in an associative container like a hash map or linked list (of pairs).
|
TSS is usually implemented with a dedicated register that points to a per thread memory block: you access variables as offsets from the beginning of the block (the details are architecture/Os specific). No need for synchronization primitives at all.
But I fail to see how TSS will help you.
| Quote: | 2- A magical hardware instruction which, at a minimum, will guarantee that this core's writes will finish and go to main memory, then force all other core's to invalidate their caches of this core's writes.
|
Such operations usually require inter-cpu interrupts and are usually privileged (in some cases unix signals can help).
<snip>
| Quote: | Thus I ask again, will C++0x give me this? Unfortunately, I assume no, and it's too late for adding new material to C++0x, so maybe this new compiler support could be added to the TR2.
|
Actually thread safe initialization of singletons should already be in the draft. Failing that, you could implement it yourself with load dependent memory barriers (assuming those get into the draft too).
BTW, TSS will also be in C++0x (and has been available as an extension for a long time in many C++ compilers).
-- gpd
-- [ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
| |
| | | Chris M. Thomasson |  |
| Posted: Thu Aug 21, 2008 8:55 pm Post subject: Re: Singleton overview |  |
| |  | |
<JoshuaMaurice@gmail.com> wrote in message news:d367aced-dc4c-4e68-a233-fc0a23b1a093@79g2000hsk.googlegroups.com...
| Quote: | Specifically, I was reviewing the Java idiom which prompted my (potentially) failed attempt at C++ code. Let me write out the Java idiom to help explain what I want in C++.
public class SingletonGetter { public static Singleton getSingleton() { return Singleton.instance; } }
The Java spec guarantees that the Singleton class will not be loaded until the first function call to getSingleton(). Moreover, either the Java spec or the spec for some specific JVM guarantees that only the first call (or first X calls) will have any synchronization. The remaining majority of calls will not.
The C++ code I had in the first post of this thread was an attempt to mimic that. I used a changeable function pointer, very much like how the JVM can just change a virtual function pointer in the virtual function table to point to an unsynchronized version. However, with the hardware primitives I know, every other core will have to run at least one hardware synchronization primitive at the correct point to guarantee the correct ordering of writes.
I see this possible in several ways:
1- Thread Specific Storage. Now, unless there is some magical hardware support for this which I don't know, we're still looking at a synchronization primitive of some kind to access Thread Specific Storage, at least a read memory barrier, so I hope they don't mean this. (Amusingly enough, it's the chicken and the egg. We could have Thread Specific Storage without synchronization overhead if we solve this problem, but we can't use Thread Specific Storage to solve this problem without synchronization.) Either way, there's still going to be additional computation involved past the synchronization primitive, specifically some lookup in an associative container like a hash map or linked list (of pairs).
|
[...]
Have you looked at the disassembly of TLS/TSS on popular systems? On Windows, its only a FS register load with offset. The data-structure's are all per-thread; no sync. I don't have time to properly respond to the rest of your writings. However, it seems like you really don't understand membars all that much. Before you respond, examine the disassembly of an access to a variable declared as `__thread' or your compilers equlivant. Then dissemble `TlsGetValue', or `pthread_getspecific()' on an OS with a native pthread impl...
You don't seem to know the difference between a #LoadLoad membar and a data-dependant load membar; anyway... Well, the latter, and even the former, are implied on virtually every arch out there except the Alpha. x86 non-temporal stores aside for a moment indeed.
Please study the membars involves with the RCU algorithm for starters.
-- [ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
| |
| | | Guest |  |
| Posted: Fri Aug 22, 2008 6:58 am Post subject: Re: Singleton overview |  |
| |  | |
On Aug 21, 1:55 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
| Quote: | JoshuaMaur...@gmail.com> wrote in message 1- Thread Specific Storage. Now, unless there is some magical hardware support for this which I don't know, we're still looking at a synchronization primitive of some kind to access Thread Specific Storage, at least a read memory barrier, so I hope they don't mean this. (Amusingly enough, it's the chicken and the egg. We could have Thread Specific Storage without synchronization overhead if we solve this problem, but we can't use Thread Specific Storage to solve this problem without synchronization.) Either way, there's still going to be additional computation involved past the synchronization primitive, specifically some lookup in an associative container like a hash map or linked list (of pairs).
[...]
Have you looked at the disassembly of TLS/TSS on popular systems? On Windows, its only a FS register load with offset. The data-structure's are all per-thread; no sync. I don't have time to properly respond to the rest of your writings. However, it seems like you really don't understand membars all that much. Before you respond, examine the disassembly of an access to a variable declared as `__thread' or your compilers equlivant. Then dissemble `TlsGetValue', or `pthread_getspecific()' on an OS with a native pthread impl...
|
Admittingly, no, I haven't looked at the assembly generated for accessing thread local storage. I just did a google, and I found LINK
It seems my initial assumptions about how thread local storage could be implemented are partially correct. Specifically, for the general case of dynamic linking, the current thread has a pointer to a vector of pointers, each element pointing to a thread local storage section of a (dynamically loaded) module.
When you first dynamically load a module with thread local storage, as the paper describes, your options are to stop all threads then allocate the storage, or to lazy allocate the storage.
Stopping all threads and executing an arbitrary function is one way out I proposed.
Contrary to my suppositions, the lazy initialize approach requires no any memory barriers at all on all accesses. However, even in this case, the dynamic loader must allocate some data structures for each currently existing thread when a dynamic module is loaded, presumably stopping all execution for a short period of time.
Both methods require stopping all threads for a short period of time to update a structure specific to each thread and performing the necessary synchronization to guarantee other cores see this in the right order, if only to allocate the element in vector of pointers which each point to a block thread specific storage for a module. (The actual memory allocation of the thread local storage of the module for each thread may be delayed.)
Now, let me analyze both approaches:
An average call to my implementation would be: 1- fetch the function pointer 2- jump to the value of the function pointer 3- the code of the function fetches another global memory location, the pointer to my singleton. 4- return that pointer
The average call with thread specific storage approach would be (using only single level indirection fetching with an offset [Sorry, I don't know the correct lingo, bear with me]): 1- fetch the pointer to vector of pointers to blocks of thread local storage of modules ... presumably accessible using the stack pointer or something 2- fetch (vector of pointers pointer)[offset of which dynamic module it's in] ... this is a pointer to my singleton, (or null if the first access of this thread, but we're discussing average case)
It seems I was mistaken in the cost of thread local storage. It is still more expensive than a regular load, but barely. By adding a some cost whenever you load a dynamic library and a little cost whenever you allocate a thread, it can be done quite efficiently, just 1 more dependent load than a regular memory access. My approach involves calling a (non-inlineable) function, which is most assuredly worse.
On Aug 21, 1:55 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
| Quote: | You don't seem to know the difference between a #LoadLoad membar and a data-dependant load membar; anyway... Well, the latter, and even the former, are implied on virtually every arch out there except the Alpha. x86 non-temporal stores aside for a moment indeed.
Please study the membars involves with the RCU algorithm for starters.
|
I admit that I don't fully understand when only a data dependency barrier is required and not a full read barrier, and various other intricacies of the DEC Alpha's memory model, but I am familiar enough to say all data dependency barriers can be replaced safely with read barriers, and I also saw no need to further complicate an already arcane topic.
-- [ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |
| |
|
|