50 ways to fill your vector ...
"The problem is all inside your head" she said to me
"The answer is easy if you take it logically"
— Paul Simon, 50 ways to leave your lover
So recently I tweaked around with these newfangled C++11 initializer lists and created an EasyHack to use them to initialize property sequences in a readable way. This caused a short exchange on the LibreOffice mailing list, which I assumed had its part in motivating Stephans interesting post "On filling a vector". For all the points being made (also in the quick follow up on IRC), I wondered how much the theoretical "can use a move constructor" discussed etc. really meant when the C++ is translated to e.g. GENERIC, then GIMPLE, then amd64 assembler, then to the internal RISC instructions of the CPU -- with multiple levels of caching in addition.
So I quickly wrote the following (thanks so much for C++11 having the nice std::chrono
now).
data.hxx:
#include <vector>
struct Data {
Data();
Data(int a);
int m_a;
};
void DoSomething(std::vector<Data>&);
data.cxx:
#include "data.hxx"
// noop in different compilation unit to prevent optimizing out what we want to measure
void DoSomething(std::vector<Data>&) {};
Data::Data() : m_a(4711) {};
Data::Data(int a) : m_a(a+4711) {};
main.cxx:
#include "data.hxx"
#include <iostream>
#include <vector>
#include <chrono>
#include <functional>
void A1(long count) {
while(--count) {
std::vector<Data> vec { Data(), Data(), Data() };
DoSomething(vec);
}
}
void A2(long count) {
while(--count) {
std::vector<Data> vec { {}, {}, {} };
DoSomething(vec);
}
}
void A3(long count) {
while(--count) {
std::vector<Data> vec { 0, 0, 0 };
DoSomething(vec);
}
}
void B1(long count) {
while(--count) {
std::vector<Data> vec;
vec.reserve(3);
vec.push_back(Data());
vec.push_back(Data());
vec.push_back(Data());
DoSomething(vec);
}
}
void B2(long count) {
while(--count) {
std::vector<Data> vec;
vec.reserve(3);
vec.push_back({});
vec.push_back({});
vec.push_back({});
DoSomething(vec);
}
}
void B3(long count) {
while(--count) {
std::vector<Data> vec;
vec.reserve(3);
vec.push_back(0);
vec.push_back(0);
vec.push_back(0);
DoSomething(vec);
}
}
void C1(long count) {
while(--count) {
std::vector<Data> vec;
vec.reserve(3);
vec.emplace_back(Data());
vec.emplace_back(Data());
vec.emplace_back(Data());
DoSomething(vec);
}
}
void C3(long count) {
while(--count) {
std::vector<Data> vec;
vec.reserve(3);
vec.emplace_back(0);
vec.emplace_back(0);
vec.emplace_back(0);
DoSomething(vec);
}
}
double benchmark(const char* name, std::function<void (long)> testfunc, const long count) {
const auto start = std::chrono::system_clock::now();
testfunc(count);
const auto end = std::chrono::system_clock::now();
const std::chrono::duration<double> delta = end-start;
std::cout << count << " " << name << " iterations took " << delta.count() << " seconds." << std::endl;
return delta.count();
}
int main(int, char**) {
long count = 10000000;
while(benchmark("A1", &A1, count) < 60l)
count <<= 1;
std::cout << "Going with " << count << " iterations." << std::endl;
benchmark("A1", &A1, count);
benchmark("A2", &A2, count);
benchmark("A3", &A3, count);
benchmark("B1", &B1, count);
benchmark("B2", &B2, count);
benchmark("B3", &B3, count);
benchmark("C1", &C1, count);
benchmark("C3", &C3, count);
return 0;
}
Makefile:
CFLAGS?=-O2
main: main.o data.o
g++ -o $@ $^
%.o: %.cxx data.hxx
g++ $(CFLAGS) -std=c++11 -o $@ -c $<
Note the object here is small and trivial to copy as one would expect from objects passed around as values (as expensive to copy objects mostly can be passed around with a std::shared_ptr
). So what did this measure? Here are the results:
Time for 1280000000 iterations on a Intel i5-4200U@1.6GHz (-march=core-avx2
) compiled with gcc 4.8.3 without inline constructors:
implementation / CFLAGS | -Os | -O2 | -O3 | -O3 -march=... |
A1 | 89.1 s | 79.0 s | 78.9 s | 78.9 s |
A2 | 89.1 s | 78.1 s | 78.0 s | 80.5 s |
A3 | 90.0 s | 78.9 s | 78.8 s | 79.3 s |
B1 | 103.6 s | 97.8 s | 79.0 s | 78.0 s |
B2 | 99.4 s | 95.6 s | 78.5 s | 78.0 s |
B3 | 107.4 s | 90.9 s | 79.7 s | 79.9 s |
C1 | 99.4 s | 94.4 s | 78.0 s | 77.9 s |
C3 | 98.9 s | 100.7 s | 78.1 s | 81.7 s |
-march=core-avx2
) compiled with gcc 4.8.3 with inline constructors:
implementation / CFLAGS | -Os | -O2 | -O3 | -O3 -march=... |
A1 | 85.6 s | 74.7 s | 74.6 s | 74.6 s |
A2 | 85.3 s | 74.6 s | 73.7 s | 74.5 s |
A3 | 91.6 s | 73.8 s | 74.4 s | 74.5 s |
B1 | 93.4 s | 90.2 s | 72.8 s | 72.0 s |
B2 | 93.7 s | 88.3 s | 72.0 s | 73.7 s |
B3 | 97.6 s | 88.3 s | 72.8 s | 72.0 s |
C1 | 93.4 s | 88.3 s | 72.0 s | 73.7 s |
C3 | 96.2 s | 88.3 s | 71.9 s | 73.7 s |
-march=...
is at best neutral: The measured times do not change much in general, they only even slightly improve performance in five out of 16 cases, and the two cases with the most significant change in performance (over 3%) are actually hurting the performance. So for the rest of this post,-march=...
will be ignored. Sorry gentooers. ;)- There is no silver bullet with regard to the different implementations: A1, A2 and A3 are the faster implementations when not inlining constructors and using
-Os
or-O2
(the quickest A* is ~10% faster than the quickest B*/C*). However when inlining constructors and using -O3, the same implementations are the slowest (by 2.4%). - Most common release builds are still done with
-O2
these days. For those, using initializer lists (A1/A2/A3) seem too have a significant edge over the alternatives, whether constructors are inlined or not. This is in contrast to the conclusions made from "constructor counting", which assumed these to be slow because of additional calls needed. - The numbers printed in bold are either the quickest implementation in a build scenario or one that is within 1.5% of the quickest implementation. A1 and A2 are sharing the title here by being in that group five times each.
- With constructors inlined, everything in the loop except
DoSomething()
could be inline. It seems to me that the compiler could -- at least in theory -- figure out that it is asked the same thing in all cases. Namely, reserve space for three ints on the heap, fill them each with 4711 and make the::std::vector<int>
data structure on the stack reflect that, then hand that to theDoSomething()
function that you know nothing about. If the compiler would figure that out, it would take the same time for all implementations. This doesnt happen either on-O2
(differ by ~18% from quickest to slowest) nor on-O3
(differ by ~3.6%).
And hey, if you want to run the tests above on other platforms or compilers, I would be interested in results!
Note: I did these runs for each scenario only once, thus no standard deviation is given. In general, they seemed to be rather stable, but this being wallclock measurements, one or the other might be outliers. caveat emptor.
Originally published on 2015-03-02 22:47:07 on wordpress.