
Archive for the 'C++' Category

C++ member function pointer, parent class, template, std::function

Today I ran into a C++ scenario where I found std::function to be useful.  I wanted to create a std::queue of pointers to member functions belonging to different classes that derive from the same parent class.  For example:

class Parent
{
public:
    Parent() {};
    virtual ~Parent() {};
};
class ChildA : public Parent
{
public:
    ChildA() : Parent() {};
    ~ChildA() {};
    void Execute01() {printf("A");};
    void Execute02() {printf("B");};
    void Execute03() {printf("C");};
};
class ChildB : public Parent
{
public:
    void Execute01() {printf("D");};
    void Execute02() {printf("E");};
    void Execute03() {printf("F");};
};

To keep it simple, this example shows two child classes (ChildA, ChildB), but I wanted it to work for many more child classes.  I wanted to push() member function pointers to ChildA::Execute01() etc. into a std::queue…  But I ran into an issue.  The declaration of a member function pointer requires the class to be specified, and the call requires an instance of that class to be specified.  So the following snippet is valid:

typedef void (ChildA::*funcPtr)(); 
std::queue<funcPtr> theQueue; 
ChildA childA; 
theQueue.push(&ChildA::Execute01); 
funcPtr fn = theQueue.front(); 
(childA.*fn)();

However, I can’t push() a pointer to a ChildB member function into this same std::queue.  If I try referencing the parent class as follows:

typedef void (Parent::*funcPtr)();
std::queue<funcPtr> theQueue;
ChildA childA;
ChildB childB;
theQueue.push(&ChildA::Execute01);  // error: a ChildA::* does not convert to a Parent::*
theQueue.push(&ChildB::Execute01);  // error C2664 below: same problem for a ChildB::*
funcPtr fn = theQueue.front();
(childA.*fn)();

Then Visual Studio 2013 gives me an error:

Error    9    error C2664: ‘void std::queue<funcPtr,std::deque<_Ty,std::allocator<_Ty>>>::push(void (__cdecl Parent::* const &)(void))’ : cannot convert argument 1 from ‘void (__cdecl ChildB::* )(void)’ to ‘void (__cdecl Parent::* &&)(void)’

Ignoring C++14’s variable templates, there are two main kinds of templates – function templates and class templates…  So we can’t just use a template on the function pointer’s typedef.  Instead, my next solution was to make a queue of Delegates and use a class template to define the Delegate class’s children:

class Delegate
{
public:
  Delegate() {};
  virtual ~Delegate() {};
  virtual void ExeFunc() = 0;
};
template<class T>
class DelegateT : public Delegate
{
public:
  DelegateT(T* obj, void (T::*func)()) { m_obj = obj; m_func = func; };
  ~DelegateT() {};
  void ExeFunc() { (m_obj->*m_func)(); }
protected:
  void (T::*m_func)();
  T* m_obj;
};

And here’s an example usage:

std::queue<Delegate *> theQueue;
ChildA childA;
ChildB childB;
theQueue.push(new DelegateT<ChildA>(&childA, &ChildA::Execute01));
theQueue.push(new DelegateT<ChildB>(&childB, &ChildB::Execute01));
while (!theQueue.empty())
{
  Delegate *del = theQueue.front();
  del->ExeFunc();
  theQueue.pop();
  delete del;
}

Finally I ran into a more modern solution (but not that modern seeing as it’s supported by Visual Studio 2010) – use std::function:

ChildA childA;
ChildB childB;
std::queue<std::function<void()>> theQueue;
theQueue.push(std::bind(&ChildA::Execute01, childA));
theQueue.push(std::bind(&ChildB::Execute01, childB));
while (!theQueue.empty())
{
  std::function<void()> func = theQueue.front();
  func();
  theQueue.pop();
}
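One thing to be aware of (my note, not part of the original example): std::bind(&ChildA::Execute01, childA) stores a copy of childA inside the std::function, so the queued call runs on that copy.  If the call should act on the original instance, a sketch like the following should work – bind a pointer, or capture one in a C++11 lambda:

std::queue<std::function<void()>> theQueue;
ChildA childA;
ChildB childB;
// Binding a pointer makes the call target the original object rather than a copy.
theQueue.push(std::bind(&ChildA::Execute01, &childA));
// A lambda that captures the object by reference does the same and reads a little cleaner.
theQueue.push([&childB]() { childB.Execute01(); });
while (!theQueue.empty())
{
  theQueue.front()();
  theQueue.pop();
}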

C++ view assembly (auto-vectorizer) (MSVC) (clang/LLVM)

This link ( http://bit.ly/1yLKB8O ) describes three ways to view the assembly code generated for your program:

1) While stepping your code in the Visual Studio debugger, press Ctrl+Alt+D or right-click -> Go To Disassembly.
[screenshot: Go To Disassembly in the debugger context menu]

2) Right-click the project -> Properties -> C/C++ -> Output Files -> ASM List Location.  The *.asm or *.cod file will be written to your build output folder, such as “x64\Release” (the equivalent compiler switches are sketched after this list).
[screenshot: ASM List Location project setting]

3) Compile the program and open it in any third-party debugger or disassembler – for example OllyDbg, WinDbg, or IDA (the Interactive Disassembler).
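As an aside (my addition, not from the linked article), option 2 corresponds to the /FA family of compiler switches, so you can also request the listing from the command line, roughly like this:

cl /O2 /FAs foo.cpp     (writes foo.asm – assembly with source interleaved)
cl /O2 /FAcs foo.cpp    (writes foo.cod – machine code, assembly, and source)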

Next I’ll show an example where I want to investigate what Visual Studio (2010 vs. 2013) gives us for some C++ code.  I’m especially curious whether the compiler auto-vectorizes it to SSE and AVX instructions.  Here is the code:

static inline float mul_f32(float s0, float s1)
{
    return s0 * s1;
}
...
if (exec == 0xFFFFFFFFFFFFFFFF)
{
    //#pragma loop(no_vector)
    for (int t = 0; t < 64; t++)
    {
        dst0[t] = mul_f32(src0[t], src1[t]);
    }
}
else
{
    for (int t = 0; exec != 0; t++, exec >>= 1)
    {
        if ((exec & 1) != 0)
        {
            dst0[t] = mul_f32(src0[t], src1[t]);
        }
    }
}


The above code snippet is reasonably friendly to auto-vectorization.  It’s important to note that relying on the compiler for auto-vectorization can (in some cases) give you better performance than writing your own vectorized code (in addition to being more portable, more concise, and more readable).  However, you still need to write your C++ code in such a way that the compiler can do a good job of auto-vectorizing it.
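As another aside (my addition): instead of only reading the assembly, you can ask the Visual Studio auto-vectorizer to report which loops it vectorized (and reason codes for the loops it skipped) with the /Qvec-report switch, roughly:

cl /O2 /Qvec-report:2 foo.cpp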

To micro-optimize your code, you don’t necessarily need to write assembly yourself.  In fact, you can often get better performance by letting the compiler do its job, but it helps to understand what the compiler is doing (what it’s capable of, the assembly it outputs, etc.).  By looking at the generated assembly and profiling or performance-testing your code, you can experiment with different C++ code, different compilers, and different compiler flags.

For this post, I made two Visual Studio solutions (.sln files) – one for 2010, one for 2013.

I compiled the same C++ code as Release x64 with different /arch options, see ( http://bit.ly/15PRdqW ).  All x64 CPUs have SSE2, so the default of “Not Set” uses SSE2 (same as /arch:SSE2).  To sanity check this, you can compile the same code with “Not Set” and with “SSE2”, then compare the resulting assembly (it should be identical) (aside: the binary’s md5sum differs even when you recompile identical code, http://bit.ly/1zAYlFM).  The VS2010 GUI doesn’t show “/arch:AVX” or “/arch:AVX2”, but you can still add them under Configuration Properties -> C/C++ -> Command Line.

[screenshots: /arch setting in the VS2010 and VS2013 project property pages]

When I enable AVX instructions, it compiles as expected.  When I run the resulting exe on my AMD FX-8120 (has AVX), it runs to completion.  However, when I run it on my Phenom II X6 1090T (does not have AVX), I get “Illegal Instruction”.  By stepping the assembly code in the VS debugger, I can see the illegal instruction:

[screenshots: Illegal Instruction exception and the offending instruction in the VS disassembly view]

00007FF704601293  vmovaps     xmmword ptr [rax-28h],xmm6

As described here ( http://intel.ly/1CdHj1r ), this is an AVX instruction:

[screenshots: VMOVAPS entry in Intel’s instruction set reference, showing it is an AVX instruction]

I wrapped my code with a simple timer to test performance on the AMD FX-8120 host (Timer class from http://bit.ly/1DhMTfG):

Timer t;
t.Start();
uint64 execReset = exec;
for (int x = 0; x < 99999999; x++)
{
    exec = execReset;
    if (exec == 0xFFFFFFFFFFFFFFFF)
    {
        //#pragma loop(no_vector)
        for (int t = 0; t < 64; t++)
        {
            dst0[t] = mul_f32(src0[t], src1[t]);
        }
    }
    else
    {
        for (int t = 0; exec != 0; t++, exec >>= 1)
        {
            if ((exec & 1) != 0)
            {
                dst0[t] = mul_f32(src0[t], src1[t]);
            }
        }
    }
}
t.Stop();
std::cout << t.Elapsed() / 1000.0 << ", ";
 

I ran it with three input values for exec: 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFE, 0x1111111111111111.  The results (in seconds) were:

vs2010 NotSet: 3.188, 7.404, 7.072 
vs2010 AVX   : 3.19 , 7.418, 7.063 
vs2010 AVX2  : cl : Command line warning D9002: ignoring unknown option '/arch:AVX2' 
vs2013 NotSet: 0.998, 6.187, 6.624 
vs2013 AVX   : 0.985, 6.193, 6.621 
vs2013 AVX2  : 2.899, 6.231, 7.013 
vs2013 NotSet with #pragma loop (no_vector): 3.235, 7.399, 7.065
 

So for my simple test case, /arch:SSE2 vs. /arch:AVX does not matter.  /arch:AVX2 seems broken (it made the all-bits-set case much slower).  vs2010 vs. vs2013 does matter (a little when less than all exec bits are set, and a lot when all exec bits are set, thanks to auto-vectorization – see http://bit.ly/1yOmxSF).

AVX extends the SIMD registers from 128-bit XMM to 256-bit YMM (so 4 floats to 8 floats per register) (aside: AVX-512 has 512-bit ZMM registers, i.e. 16 floats).  According to (http://bit.ly/1zBGApL), vs2013 uses XMM (128 bits) while GCC uses YMM (256 bits) for the same auto-vectorized code.  Looking at the assembly for /arch:AVX, I see that it’s still using XMM registers.  So if the vs2013 auto-vectorizer were using 256-bit YMM registers, we might get additional performance gains (of course the host CPU needs AVX).

How can we use 256-bit YMM registers?  Ideas (http://bit.ly/1y7gFjz):

A) Use a different compiler (or a newer version of the same compiler)

B) Use C++ intrinsics

C) Use vectorized libraries

D) Use OpenCL

How do we deal with the issue of different host CPUs?  E.g. my Phenom II X6 1090T doesn’t have AVX, while my AMD FX-8120 does:

A) Compile multiple versions of the same code.  This could be multiple binaries (#ifdefs) or multiple code paths in the same binary (run-time conditionals are fine, as long as you never go down an AVX code path on a non-AVX host – see the detection sketch after this list).

B) Compile byte code (such as LLVM IR) offline, then at run-time we JIT to host instructions (differently depending on the host CPU)
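Here’s a rough sketch of the run-time check mentioned in (A) – my own code, not from the original post.  On MSVC, CPUID reports AVX support and XGETBV reports whether the OS saves the YMM state (the _xgetbv intrinsic needs VS2010 SP1 or later):

#include <intrin.h>

static bool HasAVX()
{
    int info[4];
    __cpuid(info, 1);                           // CPUID leaf 1
    bool osxsave = (info[2] & (1 << 27)) != 0;  // ECX bit 27: OS uses XSAVE/XGETBV
    bool avx     = (info[2] & (1 << 28)) != 0;  // ECX bit 28: CPU supports AVX
    if (!osxsave || !avx)
        return false;
    unsigned __int64 xcr0 = _xgetbv(0);         // _XCR_XFEATURE_ENABLED_MASK
    return (xcr0 & 0x6) == 0x6;                 // OS saves XMM (bit 1) and YMM (bit 2) state
}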

I tried a quick implementation with SSE intrinsics, compiled with Visual Studio 2013:

__m128 rA, rB, rC;
if (exec == 0xFFFFFFFFFFFFFFFF)
{
    for (int t = 0; t < 64; t += 4)
    {
        rA = _mm_load_ps(&src0[t]);
        rB = _mm_load_ps(&src1[t]);
        rC = _mm_mul_ps(rA, rB);
        _mm_store_ps(&dst0[t], rC);
    }
}

With /arch not set, the result was 2.061 sec, which is slower than when I let Visual Studio auto-vectorize for me.

Next I tried a quick implementation with AVX intrinsics (256-bit).  Note that _mm256_load_ps/_mm256_store_ps require 32-byte-aligned memory or you may get an error, so allocate the arrays accordingly (and release them later with _mm_free):

float* dyn = (float *)_mm_malloc(width * depth * sizeof(float), 32);

Here is my quick AVX (256 bit) implementation:

__m256 rA_AVX, rB_AVX, rC_AVX;
if (exec == 0xFFFFFFFFFFFFFFFF)
{
    for (int t = 0; t < 64; t += 8)
    {
        rA_AVX = _mm256_load_ps(&src0[t]);
        rB_AVX = _mm256_load_ps(&src1[t]);
        rC_AVX = _mm256_mul_ps(rA_AVX, rB_AVX);
        _mm256_store_ps(&dst0[t], rC_AVX);
    }
}

Of course I get “Illegal Instruction” if I try to run it on my Phenom II X6 1090T (no AVX).  However, if I run it on my AMD FX-8120 (has AVX), it runs to completion.  How’s the performance?  It gave me 0.889 sec.  That’s faster than the 128-bit auto-vectorized code from Visual Studio 2013 (0.998 sec), but I expected a bigger gain.  Our 128-bit intrinsics ran significantly slower than the 128-bit auto-vectorized code, so maybe a 256-bit auto-vectorized version would likewise beat my 256-bit intrinsics?  Most likely there is room to optimize this further, at least on an AVX CPU.

Based on the above experiments, one strategy might be to have a default path that relies on SSE2-based auto-vectorization, plus an alternate path that uses AVX intrinsics (if and only if the host CPU has AVX).  Picking between these two code paths requires a conditional somewhere: do we use the SSE2 auto-vectorized version or the AVX intrinsics version?

At the highest level, the conditional could be handled by compiling two entire sets of binaries, one for SSE2 and one for AVX.  The drawback is that we’d have two copies of every binary, and we’d either ship both or make users pick the right one.

On the other extreme, we could do a run-time conditional right before each of these loops.  However, depending on our code base, we might end up doing the same runtime check (does the host have AVX?) a ridiculous number of times.

So for a larger program we should write our C++ code such that we both minimize redundant binaries and minimize redundant runtime checks.
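One way to achieve that (a sketch of my own, with hypothetical MulSSE2/MulAVX wrappers around the two loop versions above): do the HasAVX() check once and store the chosen implementation in a function pointer, so the hot path only pays for an indirect call:

// Hypothetical wrappers around the SSE2 auto-vectorized loop and the AVX intrinsics loop.
void MulSSE2(float* dst0, const float* src0, const float* src1);
void MulAVX (float* dst0, const float* src0, const float* src1);

typedef void (*MulFunc)(float*, const float*, const float*);
// Chosen once at startup, using the HasAVX() check sketched earlier.
static MulFunc g_mul = HasAVX() ? MulAVX : MulSSE2;

// Hot path: no per-call CPU feature check.
g_mul(dst0, src0, src1);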

Edit 2015/01/31: I decided to try clang (LLVM) on this example code to see how it performs.  The clang home page’s Windows downloads offered either a limited 32-bit binary or building the source with CMake and Visual Studio.  Instead, I found a more convenient download here (http://bit.ly/1EVlWzx).  I used the following to compile the above code with clang:

clang fooLLVM.c -o fooLLVM.exe -O3

The -O3 flag is a simple way to turn on aggressive optimizations (without it, my fooLLVM.exe was very slow!)

When I ran this, clang figured out my timer loop did not need to run 99999999 times – every iteration just overwrote dst0[t] with the same values, so it only needed to run once.  To defeat that optimization I changed the inner statement to accumulate:

dst0[t] += mul_f32_LLVM(src0[t], src1[t]);

With that change, vs2013 Not Set (SSE2) gave me 1.493 sec, and vs2013 AVX gave me 1.401 sec.  LLVM gave me 1.617 sec (slower than vs2013).  Next I tried this:

clang fooLLVM.c -o fooLLVM.exe -O3 -mllvm -force-vector-width=8 => 1.185 sec (faster than vs2013) (also runs fine without an AVX CPU)

Next I tried the code with SSE intrinsics.  To make the comparison fair, I needed to modify the code to do the extra add:

rA = _mm_load_ps(&src0[t]);
rB = _mm_load_ps(&src1[t]);
rC = _mm_load_ps(&dst0[t]);
rD = _mm_add_ps(rC, _mm_mul_ps(rA, rB));   // dst0[t] += src0[t] * src1[t]
_mm_store_ps(&dst0[t], rD);


clang++ fooLLVM.c -o fooLLVM.exe -O3 => 0.836 sec

Unlike with vs2013, the intrinsics version was faster than the auto-vectorized one.

Next I tried the code with AVX intrinsics:

rA_AVX = _mm256_load_ps(&src0[t]); 
rB_AVX = _mm256_load_ps(&src1[t]); 
rC_AVX = _mm256_load_ps(&dst0[t]); 
rD_AVX = _mm256_add_ps(rC_AVX, _mm256_mul_ps(rA_AVX, rB_AVX)); 
_mm256_store_ps(&dst0[t], rD_AVX);

clang++ fooLLVM.c -o fooLLVM.exe -O3 -march=corei7-avx => 1.452 sec (requires CPU with AVX)

clang++ fooLLVM.c -o fooLLVM.exe -O3 -mavx => 1.447 sec (requires CPU with AVX)

I was disappointed that the AVX intrinsics were slower than the clang SSE intrinsics.  Apparently this is an open bug, see (http://bit.ly/1yPVOka) (http://bit.ly/1696FQh) (http://bit.ly/1wLqylA).  I was also disappointed that clang doesn’t seem to have an option for AVX auto-vectorization yet.

The above numbers were on my AMD FX-8120.  On my Phenom II X6 1090T, the auto-vectorized version with (-mllvm -force-vector-width=8) was actually faster than the SSE intrinsics version (1.604 sec vs. 1.41 sec).

Summary (of my simple test case):

* auto-vectorization for AVX support is lacking for both Visual Studio 2013 and LLVM (clang) as of 2015/01/31

* AVX intrinsics were fastest for vs2013, but brokenly slow for LLVM

* vs2013 auto-vectorize was a lot faster than SSE intrinsics

* LLVM SSE intrinsics were faster than auto-vectorize

* I got slightly faster results with LLVM than vs2013 for (auto-vectorize, SSE intrinsics)

Conclusion:

* with LLVM, SSE intrinsics (which also run on non-AVX CPUs) were faster than auto-vectorize on the FX-8120 (the reverse of vs2013)

* LLVM AVX intrinsics are brokenly slow, and auto-vectorization doesn’t exist, as of 2015/01/31

* don’t trust these conclusions, because I only tested two hosts with a tiny test program that isn’t close enough to a real use case