Google
 
Webnews.only-4-geeks.com
Interesting places
news.only-4-geeks.com Forum Index » C++

Singleton overview

 
Jump to:  
 
Guest
PostPosted: Fri Aug 15, 2008 12:28 pm    Post subject: Singleton overview
       
I have a 481 line text document at 80 characters per line. It's a
brief overview of globals, singletons, the static initialization order
fiasco, the static de-initialization order fiasco, and double checked
locking. I'll be basing a presentation to my colleagues on this little
draft. In it, I do a quick overview of the concepts, what works, what
doesn't work, and why. It contains nothing new, and it's not
particularly formal.

I was wondering if this was a proper venue for posting such a thing
for comments and criticism, and if not, what would be?


In the interests of brevity, the things I'm most interested in are:

-1- (A little beyond the scope of C++98, but in scope for C++0x). If
one programs with the Java 1.5 memory model guidelines in mind, will
this work with C++98 on all major operating systems with all major
threading libraries? C++0x?

-2- I'm paranoid when it comes to problems like this. I dislike how
the standard singleton interface returns a pointer to the singleton. I
have thought about making the suggestion that instead of a getInstance
function, you declare global functions in the header file. These
global functions are defined in the cpp file to use the singleton.
Thus the user is unable to get a reference to the singleton, and thus
cannot cause static de-initialization order problems.

-3- I know the C++98 standard does not address threads, but if statics
are initialized during a multi-threaded situation, will all major
compilers still destroy them in the exact opposite order? This implies
some locking overhead during the "first pass" over function local
statics, as the compiler has to create a hidden global list of static
objects to use to determine the order of static de-initialization, and
update it whenever a static constructor exits. As two different
function local statics can be initialized concurrently, access to this
list must be synchronized for correct results.

-4- I believe I have an implementation for a singleton in C++98 so
that the following are true. (I was wondering if I missed anything.)
-4-a- It is lazily initialized. It will be initialized on first use,
which may be after main has started.
-4-b- It is thread-safe, a.k.a. the singleton is guaranteed to be
constructed only once.
-4-c- There is synchronization at most for the first call from each
thread. Later calls have no synchronization.
-4-d- It will be destroyed during static de-initialization.
-4-e- It does not any static initialization order problems nor static
de-initialization order problems. (Assuming this singleton does not
use other singletons. If it does, as long as it references all needed
singletons in its constructor, all static order problems are gone.)
-4-f- It is otherwise correct.

Here is the "implementation", which may require implementation
specific code, but I don't think so. What it does is replace
synchronization overhead in all cases with an extra level of function
indirection in all cases.

It was inspired when a Java friend of mine claimed that Java could
have a singleton in a multi-threaded environment with all of the
guarantees I've listed (except 4-d, static de-initialization). I at
first said impossible, but it struct me how the JVM could do it by
swapping out a virtual function pointer in the virtual function table
after the first call. I do something similar here.



//.hpp file
class Mutex {}; //from another header file
class Guard { public: Guard(Mutex&); }; //from another header file

void function_1();
void function_2();

//.cpp file
#include <cassert>
namespace
{
class T
{
public:
void function_1();
void function_2();
};

Mutex & functionContainingStaticMutex()
{ static Mutex m;
return m;
}
struct ForceInstantionOfMutex
{ ForceInstantionOfMutex()
{ functionContainingStaticMutex(); }
} instance_ForceInstantionOfMutex;


T& synchronizedCall();
T& unsynchronizedCall();

//initialization of POD types with constant expressions
//is guaranteed to occur before runtime.
T& (*funcPtr)() = &synchronizedCall;

T& functionContainingStaticT()
{ static T t;
return t;
}
T& synchronizedCall()
{ Guard g(functionContainingStaticMutex());
T& t = functionContainingStaticT();

funcPtr = unsynchronizedCall;

//You need a memory barrier here. You need an
implementation
//specific instruction to prevent another core from seeing
//funcPtr == &unsynchronizedCall and t partially
initialized.

//The problem is that a core may load funcPtr into its
cache
//without calling getSingleton(). For example, it may load
//the address right next to &funcPtr during some other
operation,
//and it just happens to load the addresses around it as
well
//to the cache as a kind of look-ahead-like optimization.

return t;
}
T& unsynchronizedCall()
{ return functionContainingStaticT();
}
T& getSingleton()
{ return funcPtr();
}
}
void function_1()
{ getSingleton().function_1();
}
void function_2()
{ getSingleton().function_2();
}

--
[ See LINK for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 

 
Guest
PostPosted: Sun Aug 17, 2008 2:53 pm    Post subject: Re: Singleton overview
       
(Replying to myself, yes.)

I was looking over my singleton implementation which I claimed had
lazy initialization, was threadsafe, and had no synchronization after
the first call. In the implementation, I had a comment in there
describing how I would need an implementation specific instruction of
some kind to prevent other threads from seeing the writes out of
order.

However, my several hours of googling seemed to show that there is no
such instruction. All of the locking instructions I could find were
reciprocal; both threads needed to use an instruction to guarantee the
order that writes from one thread will be visible to another thread,
for example mutex lock and release and a write memory barrier and read
memory barrier.

For my algorithm to work, I need an instruction that will guarantee
that all of the writes made by this thread before the instruction will
be visible to all other threads before all writes made after the
instruction. For a description of what it would have to do hardware-
wise:
1- Block this core until all writes have gone to main memory,
2- Then invalidate all other cores' caches.
(I suppoes the actual instruction implementation wouldn't have to be
quite so draconian to give what's needed, but I believe that
description is sufficient.)

Does such a hardware instruction exist on most / all platforms? And
will C++0x give it to me?



As an alternative, I was toying around with this idea:
When the first thread reaches the singleton synchronization, it
modifies the threading library so that any call to the threading
library will cause that thread to try and acquire the singleton
initialization lock. The first thread will wait until all other
currently existing threads are blocked as such, and only then would it
release the lock, thus forcing all other threads to see the writes in
the correct order. Unfortunately, it could take a long time before
before all threads hit a thread library call, and this also requires
that all code uses your threading library and not another one. This
sounds possible, but you need to exert a lot more control over the
entire application, and this has the very real and possibly high
potential to cause the entire application to come to a screeching
halt, waiting for all threads to hit a thread library call.

But it would work.

--
[ See LINK for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 

 
Chris M. Thomasson
PostPosted: Sun Aug 17, 2008 4:58 pm    Post subject: Re: Singleton overview
       
<JoshuaMaurice@gmail.com> wrote in message
news:26770c04-0668-4de2-96f6-ac769398be49@v1g2000pra.googlegroups.com...
Quote:
(Replying to myself, yes.)

[...]

Unless I am missing something, your singleton "pseudo-impl" is totally
busted. Here is a real compliable impl (MSVC) that happens to work on an
x86:

LINK
(READ ALL!)

AFAICT, you miss several important points... One being that you simply need
a recursive mutex... Please explain how you handle a singleton object which
creates another singleton object in its ctor? WRT the memory barriers, well,
you only need a dependant load on the reader side, and a release barrier on
the writer side. End of story. AFAICT, C++0x does not yet support
load-dependencies; think DEC Alpha and RCU for a moment...

--
[ See LINK for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 

 
Guest
PostPosted: Wed Aug 20, 2008 8:58 pm    Post subject: Re: Singleton overview
       
Specifically, I was reviewing the Java idiom which prompted my
(potentially) failed attempt at C++ code. Let me write out the Java
idiom to help explain what I want in C++.

public class SingletonGetter {
public static Singleton getSingleton() {
return Singleton.instance;
}
}

The Java spec guarantees that the Singleton class will not be loaded
until the first function call to getSingleton(). Moreover, either the
Java spec or the spec for some specific JVM guarantees that only the
first call (or first X calls) will have any synchronization. The
remaining majority of calls will not.


The C++ code I had in the first post of this thread was an attempt to
mimic that. I used a changeable function pointer, very much like how
the JVM can just change a virtual function pointer in the virtual
function table to point to an unsynchronized version. However, with
the hardware primitives I know, every other core will have to run at
least one hardware synchronization primitive at the correct point to
guarantee the correct ordering of writes.

I see this possible in several ways:

1- Thread Specific Storage. Now, unless there is some magical hardware
support for this which I don't know, we're still looking at a
synchronization primitive of some kind to access Thread Specific
Storage, at least a read memory barrier, so I hope they don't mean
this. (Amusingly enough, it's the chicken and the egg. We could have
Thread Specific Storage without synchronization overhead if we solve
this problem, but we can't use Thread Specific Storage to solve this
problem without synchronization.) Either way, there's still going to
be additional computation involved past the synchronization primitive,
specifically some lookup in an associative container like a hash map
or linked list (of pairs).

2- A magical hardware instruction which, at a minimum, will guarantee
that this core's writes will finish and go to main memory, then force
all other core's to invalidate their caches of this core's writes.

3- Make all other core's perform a synchronization primitive, but only
at the first call. I don't know for sure any good way to do this.
Waiting until other threads hit a threading library call seems like a
bad idea. Maybe you could use a hardware interrupt to cause all cores
to execute a signal handler which will do the required synchronization
primitive. This seems possible, but it would require co-operation of
the entity which knows about all threads and the hardware specific
interrupts, a.k.a. the compiler.

Either way, the Java references I've found claim to be able to do a
mutually exclusive / thread-safe initialization of a lazy-initialized
singleton, and that once initialized, no call to get the singleton
will have any synchronization. I don't know if the reference I have
lies, and they actually use a read memory barrier on each call but not
a full blown lock, or if JVMs actually do this without any hardware
synchronization primitive. Either way, it seems like it's possible,
but would require the compiler to provide the implementation.

Thus I ask again, will C++0x give me this? Unfortunately, I assume no,
and it's too late for adding new material to C++0x, so maybe this new
compiler support could be added to the TR2.

--
[ See LINK for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 

 
gpderetta
PostPosted: Thu Aug 21, 2008 12:57 pm    Post subject: Re: Singleton overview
       
On Aug 20, 10:58 pm, JoshuaMaur...@gmail.com wrote:
Quote:
Specifically, I was reviewing the Java idiom which prompted my
(potentially) failed attempt at C++ code. Let me write out the Java
idiom to help explain what I want in C++.

public class SingletonGetter {
public static Singleton getSingleton() {
return Singleton.instance;
}

}

The Java spec guarantees that the Singleton class will not be loaded
until the first function call to getSingleton(). Moreover, either the
Java spec or the spec for some specific JVM guarantees that only the
first call (or first X calls) will have any synchronization. The
remaining majority of calls will not.


IIRC C++0x will guarantee thread safe initialization of statics. There
is no guarantee on the efficiency of the implementation, but the paper
with that proposal included an algorithm that only needs a load
dependant memory barrier (i.e. a plain load on most architectures) on
the fast path, so it is likely that most implementations will follow
the paper (IIRC gcc uses a global mutex ATM).

<snip>
Quote:
I see this possible in several ways:

1- Thread Specific Storage. Now, unless there is some magical hardware
support for this which I don't know, we're still looking at a
synchronization primitive of some kind to access Thread Specific
Storage, at least a read memory barrier, so I hope they don't mean
this. (Amusingly enough, it's the chicken and the egg. We could have
Thread Specific Storage without synchronization overhead if we solve
this problem, but we can't use Thread Specific Storage to solve this
problem without synchronization.) Either way, there's still going to
be additional computation involved past the synchronization primitive,
specifically some lookup in an associative container like a hash map
or linked list (of pairs).


TSS is usually implemented with a dedicated register that points to a
per thread memory block: you access variables as offsets from the
beginning of the block (the details are architecture/Os specific). No
need for synchronization primitives at all.

But I fail to see how TSS will help you.

Quote:
2- A magical hardware instruction which, at a minimum, will guarantee
that this core's writes will finish and go to main memory, then force
all other core's to invalidate their caches of this core's writes.


Such operations usually require inter-cpu interrupts and are usually
privileged (in some cases unix signals can help).

<snip>
Quote:
Thus I ask again, will C++0x give me this? Unfortunately, I assume no,
and it's too late for adding new material to C++0x, so maybe this new
compiler support could be added to the TR2.

Actually thread safe initialization of singletons should already be in
the draft. Failing that, you could implement it yourself with load
dependent memory barriers (assuming those get into the draft too).

BTW, TSS will also be in C++0x (and has been available as an extension
for a long time in many C++ compilers).

--
gpd

--
[ See LINK for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 

 
Chris M. Thomasson
PostPosted: Thu Aug 21, 2008 8:55 pm    Post subject: Re: Singleton overview
       
<JoshuaMaurice@gmail.com> wrote in message
news:d367aced-dc4c-4e68-a233-fc0a23b1a093@79g2000hsk.googlegroups.com...
Quote:
Specifically, I was reviewing the Java idiom which prompted my
(potentially) failed attempt at C++ code. Let me write out the Java
idiom to help explain what I want in C++.

public class SingletonGetter {
public static Singleton getSingleton() {
return Singleton.instance;
}
}

The Java spec guarantees that the Singleton class will not be loaded
until the first function call to getSingleton(). Moreover, either the
Java spec or the spec for some specific JVM guarantees that only the
first call (or first X calls) will have any synchronization. The
remaining majority of calls will not.


The C++ code I had in the first post of this thread was an attempt to
mimic that. I used a changeable function pointer, very much like how
the JVM can just change a virtual function pointer in the virtual
function table to point to an unsynchronized version. However, with
the hardware primitives I know, every other core will have to run at
least one hardware synchronization primitive at the correct point to
guarantee the correct ordering of writes.

I see this possible in several ways:

1- Thread Specific Storage. Now, unless there is some magical hardware
support for this which I don't know, we're still looking at a
synchronization primitive of some kind to access Thread Specific
Storage, at least a read memory barrier, so I hope they don't mean
this. (Amusingly enough, it's the chicken and the egg. We could have
Thread Specific Storage without synchronization overhead if we solve
this problem, but we can't use Thread Specific Storage to solve this
problem without synchronization.) Either way, there's still going to
be additional computation involved past the synchronization primitive,
specifically some lookup in an associative container like a hash map
or linked list (of pairs).

[...]

Have you looked at the disassembly of TLS/TSS on popular systems? On
Windows, its only a FS register load with offset. The data-structure's are
all per-thread; no sync. I don't have time to properly respond to the rest
of your writings. However, it seems like you really don't understand membars
all that much. Before you respond, examine the disassembly of an access to a
variable declared as `__thread' or your compilers equlivant. Then dissemble
`TlsGetValue', or `pthread_getspecific()' on an OS with a native pthread
impl...

You don't seem to know the difference between a #LoadLoad membar and a
data-dependant load membar; anyway... Well, the latter, and even the former,
are implied on virtually every arch out there except the Alpha. x86
non-temporal stores aside for a moment indeed.

Please study the membars involves with the RCU algorithm for starters.

--
[ See LINK for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 

 
Guest
PostPosted: Fri Aug 22, 2008 6:58 am    Post subject: Re: Singleton overview
       
On Aug 21, 1:55 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
Quote:
JoshuaMaur...@gmail.com> wrote in message
1- Thread Specific Storage. Now, unless there is some magical hardware
support for this which I don't know, we're still looking at a
synchronization primitive of some kind to access Thread Specific
Storage, at least a read memory barrier, so I hope they don't mean
this. (Amusingly enough, it's the chicken and the egg. We could have
Thread Specific Storage without synchronization overhead if we solve
this problem, but we can't use Thread Specific Storage to solve this
problem without synchronization.) Either way, there's still going to
be additional computation involved past the synchronization primitive,
specifically some lookup in an associative container like a hash map
or linked list (of pairs).

[...]

Have you looked at the disassembly of TLS/TSS on popular systems? On
Windows, its only a FS register load with offset. The data-structure's are
all per-thread; no sync. I don't have time to properly respond to the rest
of your writings. However, it seems like you really don't understand membars
all that much. Before you respond, examine the disassembly of an access to a
variable declared as `__thread' or your compilers equlivant. Then dissemble
`TlsGetValue', or `pthread_getspecific()' on an OS with a native pthread
impl...

Admittingly, no, I haven't looked at the assembly generated for
accessing thread local storage. I just did a google, and I found
LINK

It seems my initial assumptions about how thread local storage could
be implemented are partially correct. Specifically, for the general
case of dynamic linking, the current thread has a pointer to a vector
of pointers, each element pointing to a thread local storage section
of a (dynamically loaded) module.

When you first dynamically load a module with thread local storage, as
the paper describes, your options are to stop all threads then
allocate the storage, or to lazy allocate the storage.

Stopping all threads and executing an arbitrary function is one way
out I proposed.

Contrary to my suppositions, the lazy initialize approach requires no
any memory barriers at all on all accesses. However, even in this
case, the dynamic loader must allocate some data structures for each
currently existing thread when a dynamic module is loaded, presumably
stopping all execution for a short period of time.


Both methods require stopping all threads for a short period of time
to update a structure specific to each thread and performing the
necessary synchronization to guarantee other cores see this in the
right order, if only to allocate the element in vector of pointers
which each point to a block thread specific storage for a module. (The
actual memory allocation of the thread local storage of the module for
each thread may be delayed.)



Now, let me analyze both approaches:

An average call to my implementation would be:
1- fetch the function pointer
2- jump to the value of the function pointer
3- the code of the function fetches another global memory location,
the pointer to my singleton.
4- return that pointer

The average call with thread specific storage approach would be (using
only single level indirection fetching with an offset [Sorry, I don't
know the correct lingo, bear with me]):
1- fetch the pointer to vector of pointers to blocks of thread local
storage of modules
... presumably accessible using the stack pointer or something
2- fetch (vector of pointers pointer)[offset of which dynamic module
it's in]
... this is a pointer to my singleton, (or null if the first access
of this thread, but we're discussing average case)

It seems I was mistaken in the cost of thread local storage. It is
still more expensive than a regular load, but barely. By adding a some
cost whenever you load a dynamic library and a little cost whenever
you allocate a thread, it can be done quite efficiently, just 1 more
dependent load than a regular memory access. My approach involves
calling a (non-inlineable) function, which is most assuredly worse.


On Aug 21, 1:55 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
Quote:
You don't seem to know the difference between a #LoadLoad membar and a
data-dependant load membar; anyway... Well, the latter, and even the former,
are implied on virtually every arch out there except the Alpha. x86
non-temporal stores aside for a moment indeed.

Please study the membars involves with the RCU algorithm for starters.

I admit that I don't fully understand when only a data dependency
barrier is required and not a full read barrier, and various other
intricacies of the DEC Alpha's memory model, but I am familiar enough
to say all data dependency barriers can be replaced safely with read
barriers, and I also saw no need to further complicate an already
arcane topic.

--
[ See LINK for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 

Page 1 of 1 .:.

Google
 
Webnews.only-4-geeks.com

Windows Update | C++ | C | PHP | JavaScript | Photoshop | Programming | Windows 2000 | Python | Windows XP | Object | Flash | Flash - ActionScript | Paint Shop Pro | Excel | PowerPoint | Access | Word | Windows 98 | Internet Explorer 6.0 | CorelDraw12 | Java | XML | asm x86 | Linux Mandrake | Linux RedHat | Outlook |  | news from newsgroups |_ | s

Web Templates

Awesome Website Templates ©

Linki zakładki Nienasycenie - Anna Maria Jopek ford Internet w Anglii reklama