|  | Code timing |  | |
| | | Martin Eisenberg |  |
| Posted: Mon Aug 18, 2008 9:42 am Post subject: Code timing |  |
| |  | |
Hi folks!
I wrote the C++ program at the end of this post to get an idea of the relative speeds of various mathematical operations. Now I'm wondering about two features of the results. Here's an example output from an Athlon 800 machine:
| Timing 2000 runs of 1000 calls | | div : 0.39 s 0.000195 ms/call | mul : 0.38 s 0.00019 ms/call | add : 0.39 s 0.000195 ms/call | null: 0.33 s 0.000165 ms/call | 1.2e+007 Repeat (y/n)? y | div : 0.39 s 0.000195 ms/call | mul : 0.38 s 0.00019 ms/call | add : 0.38 s 0.00019 ms/call | null: 0.88 s 0.00044 ms/call | 1.38e+007 Repeat (y/n)? n
Now--
1) The "null" time is always the same as for add/mul on the first run but about double that subsequently, as seen above. All other numbers behave consistently. Why might that be?
2) I expected division to be rather slower than add/mul. Sometimes the numbers are indeed higher but only by about 10% and in a minority of runs. Am I observing latency instead of throughput, which is what I'm interested in? I tried feeding dummy (the local one) back into compute() but, range issues aside, the times didn't change.
Thanks, Martin
--------------------------------- #include <iostream> #include <ostream> #include <iomanip> #include <vector> #include <cstdlib> #include <ctime> #include <cmath>
using namespace std;
const int KBlockSize = 1000, KRunCount = 2000; const int KCallCount = KBlockSize * KRunCount;
void printClocks(const char* name, clock_t clocks) { float sec = float(clocks) / CLOCKS_PER_SEC; cout << right << setprecision(3); cout << name << ": " << setw( << sec << " s " << setw( << sec * 1000 / KCallCount << " ms/call\n"; }
double dummy = 0;
template<class Float, class Compute> class Runner { const char* name_; clock_t clocks_; public: Runner(const char* name) : name_(name) { vector<Float> v(KBlockSize + 1); for(int i = 0; i <= KBlockSize; ++i) v[i] = 1e-7 + rand() / Float(RAND_MAX); Compute compute; Float dummy = 0; clock_t clocks = -clock(); for(int r = 0; r < KRunCount; ++r) { for(int i = 0; i < KBlockSize; ++i) dummy += compute(v[i], v[i+1]); } clocks += clock(); ::dummy += dummy; clocks_ = clocks; } ~Runner() { printClocks(name_, clocks_); } };
struct Null { template<class Float> Float operator() (Float x, Float y) { return x; } };
struct Add { template<class Float> Float operator() (Float x, Float y) { return x + y; } };
struct Mul { template<class Float> Float operator() (Float x, Float y) { return x * y; } };
struct Div { template<class Float> Float operator() (Float x, Float y) { return x / y; } };
int main() { srand(time(0)); cout << "Timing " << KRunCount << " runs of " << KBlockSize << " calls\n\n"; char repeat; do { dummy = 0; { Runner<float, Null> r0("null"); Runner<float, Add > r1("add "); Runner<float, Mul > r2("mul "); Runner<float, Div > r3("div "); } cout << dummy << " Repeat (y/n)? "; } while(cin >> repeat && repeat != 'n'); return 0; }
// end of code |
| |
| | | Pascal J. Bourguignon |  |
| Posted: Mon Aug 18, 2008 12:13 pm Post subject: Re: Code timing |  |
| |  | |
Martin Eisenberg <martin.eisenberg@udo.edu> writes:
| Quote: | Hi folks!
I wrote the C++ program at the end of this post to get an idea of the relative speeds of various mathematical operations. Now I'm wondering about two features of the results. Here's an example output from an Athlon 800 machine:
| Timing 2000 runs of 1000 calls | | div : 0.39 s 0.000195 ms/call | mul : 0.38 s 0.00019 ms/call | add : 0.39 s 0.000195 ms/call | null: 0.33 s 0.000165 ms/call | 1.2e+007 Repeat (y/n)? y | div : 0.39 s 0.000195 ms/call | mul : 0.38 s 0.00019 ms/call | add : 0.38 s 0.00019 ms/call | null: 0.88 s 0.00044 ms/call | 1.38e+007 Repeat (y/n)? n
Now--
1) The "null" time is always the same as for add/mul on the first run but about double that subsequently, as seen above. All other numbers behave consistently. Why might that be?
2) I expected division to be rather slower than add/mul. Sometimes the numbers are indeed higher but only by about 10% and in a minority of runs. Am I observing latency instead of throughput, which is what I'm interested in? I tried feeding dummy (the local one) back into compute() but, range issues aside, the times didn't change.
|
The function / method call times, and the cache fetch times (in a multiprocessing environment, you may get interrupted for processing for other processes, so your caches can be emptied at any time), are several orders of magnitude greater than a simple operation like add or divide.
Benchmarks are really idiotic, and more and more with time...
-- __Pascal Bourguignon__ |
| |
| | | Harold Aptroot |  |
| Posted: Mon Aug 18, 2008 3:28 pm Post subject: Re: Code timing |  |
| |  | |
"Martin Eisenberg" <martin.eisenberg@udo.edu> wrote in message news:6gt5gaFh3fk8U1@mid.uni-berlin.de...
| Quote: | Hi folks!
I wrote the C++ program at the end of this post to get an idea of the relative speeds of various mathematical operations. Now I'm wondering about two features of the results. Here's an example output from an Athlon 800 machine:
| Timing 2000 runs of 1000 calls | | div : 0.39 s 0.000195 ms/call | mul : 0.38 s 0.00019 ms/call | add : 0.39 s 0.000195 ms/call | null: 0.33 s 0.000165 ms/call | 1.2e+007 Repeat (y/n)? y | div : 0.39 s 0.000195 ms/call | mul : 0.38 s 0.00019 ms/call | add : 0.38 s 0.00019 ms/call | null: 0.88 s 0.00044 ms/call | 1.38e+007 Repeat (y/n)? n
Now--
1) The "null" time is always the same as for add/mul on the first run but about double that subsequently, as seen above. All other numbers behave consistently. Why might that be?
2) I expected division to be rather slower than add/mul. Sometimes the numbers are indeed higher but only by about 10% and in a minority of runs. Am I observing latency instead of throughput, which is what I'm interested in? I tried feeding dummy (the local one) back into compute() but, range issues aside, the times didn't change.
|
(snipped)
<unimportant part skip if in a hurry>
The function calls themselves may take more time than their bodies (depending on optimisation) Also, the loops have to be unrolled a Lot to not make a difference Random is much slower than anything else - generate an array of random numbers first and only after that run the loop over it Actually, don't run a loop over a large array of random numbers.. its timing depends too much on ram speeds and caching behaviour
</unimportant>
The latency and throughput of all instructions are known though, and the throughput of most instructions is atleast* 1 per cc there days (pipelined) but there are exceptions See LINK
*: meaning there can be more than 1 per cc |
| |
| | | Harold Aptroot |  |
| Posted: Mon Aug 18, 2008 3:29 pm Post subject: Re: Code timing |  |
| Quote: | The latency and throughput of all instructions are known though, and the throughput of most instructions is atleast* 1 per cc there days (pipelined) but there are exceptions See LINK
*: meaning there can be more than 1 per cc
|
also go here: LINK |
| |
| | | santosh |  |
| Posted: Mon Aug 18, 2008 5:18 pm Post subject: Re: Code timing |  |
Pascal J. Bourguignon wrote:
<snip>
| Quote: | Benchmarks are really idiotic, and more and more with time...
|
Usually yes. But when they are necessary, I often conduct them with all non-essential tasks killed and the program under test given realtime priority. |
| |
| | | Pascal J. Bourguignon |  |
| Posted: Mon Aug 18, 2008 7:15 pm Post subject: Re: Code timing |  |
santosh <santosh.k83@gmail.com> writes:
| Quote: | Pascal J. Bourguignon wrote:
snip
Benchmarks are really idiotic, and more and more with time...
Usually yes. But when they are necessary, I often conduct them with all non-essential tasks killed and the program under test given realtime priority.
|
Nonetheless. My remark is aimed at the thicker and thicker layers of JIT and pipeline and reordering and what not you find in newest high end processors. For example, if you try to benchmark the four operations by encapsulating inside a function/method call, all you get most probably will be the time needed to flush the call (JSR) and return (RET) pipelines. The operation itself will have a lot of time to be done in parallel during the restoring of registers and return.
-- __Pascal Bourguignon__ LINK
"I have challenged the entire quality assurance team to a Bat-Leth contest. They will not concern us again." |
| |
| | | Martin Eisenberg |  |
| Posted: Mon Aug 18, 2008 7:47 pm Post subject: Re: Code timing |  |
| |  | |
Harold Aptroot wrote:
| Quote: | The function calls themselves may take more time than their bodies (depending on optimisation) Also, the loops have to be unrolled a Lot to not make a difference Random is much slower than anything else - generate an array of random numbers first and only after that run the loop over it Actually, don't run a loop over a large array of random numbers.. its timing depends too much on ram speeds and caching behaviour
|
I suppose I should be thankful that someone responds in earnest, but somehow... The idea of posting code is that people take note of what it actually does -- like pre-generating data, repeating often enough that results stabilize (on this system), accounting for fixed costs, you get the picture.
Thank you, without irony this time!
Martin
-- I think perhaps the most important problem is that we are trying to understand the fundamental workings of the universe via a language devised for telling one another where the best fruit is. --Terry Pratchett |
| |
| | | Harold Aptroot |  |
| Posted: Mon Aug 18, 2008 10:01 pm Post subject: Re: Code timing |  |
| |  | |
"Martin Eisenberg" <martin.eisenberg@udo.edu> wrote in message news:6gu8vfFhoel0U1@mid.uni-berlin.de...
| Quote: | Harold Aptroot wrote:
The function calls themselves may take more time than their bodies (depending on optimisation) Also, the loops have to be unrolled a Lot to not make a difference Random is much slower than anything else - generate an array of random numbers first and only after that run the loop over it Actually, don't run a loop over a large array of random numbers.. its timing depends too much on ram speeds and caching behaviour
I suppose I should be thankful that someone responds in earnest, but somehow... The idea of posting code is that people take note of what it actually does -- like pre-generating data, repeating often enough that results stabilize (on this system), accounting for fixed costs, you get the picture.
See LINK
Thank you, without irony this time!
Martin
|
I'm sorry .. I must admit I only gave it a quick skim.. Ah well maybe the warning will be good for other readers eh? |
| |
| | | santosh |  |
| Posted: Tue Aug 19, 2008 7:28 am Post subject: Re: Code timing |  |
| |  | |
Pascal J. Bourguignon wrote:
| Quote: | santosh <santosh.k83@gmail.com> writes:
Pascal J. Bourguignon wrote:
snip
Benchmarks are really idiotic, and more and more with time...
Usually yes. But when they are necessary, I often conduct them with all non-essential tasks killed and the program under test given realtime priority.
Nonetheless. My remark is aimed at the thicker and thicker layers of JIT and pipeline and reordering and what not you find in newest high end processors. For example, if you try to benchmark the four operations by encapsulating inside a function/method call, all you get most probably will be the time needed to flush the call (JSR) and return (RET) pipelines. The operation itself will have a lot of time to be done in parallel during the restoring of registers and return.
|
I agree that benchmarking code fragments is becoming increasingly unreliable but there is still something to be said for profiling entire applications to discover bottlenecks, despite the fact that code speed is given exaggerated importance. |
| |
| | | Martin Eisenberg |  |
| Posted: Tue Aug 19, 2008 12:40 pm Post subject: Re: Code timing |  |
Harold Aptroot wrote:
| Quote: | I'm sorry .. I must admit I only gave it a quick skim.. Ah well maybe the warning will be good for other readers eh?
|
OK, never mind Do you (or anyone) happen to know of similar, good tables for floating-point instructions?
Martin
-- The biggest difference between time and space is that you can't reuse time. --Merrick Furst |
| |
| Page 1 of 2 .:. Goto page 1, 2 Next | |
|
|