
Archive for January, 2015 (2015/01)

C++ view assembly (auto-vectorizer) (MSVC) (clang/LLVM)

This link ( http://bit.ly/1yLKB8O ) describes three ways to view the assembly code for your program:

1) While stepping through your code in the Visual Studio debugger, press Ctrl+Alt+D or right-click -> Go To Disassembly.
[image]

2) Right-click the project -> Properties -> C/C++ -> Output Files -> ASM List Location. The *.asm or *.cod file will be output to your build output folder, such as “x64\Release” (a command-line equivalent is shown after this list).
[image]

3) Compile the program and open it in a third-party debugger such as OllyDbg or WinDbg, or in IDA (the Interactive Disassembler).
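As an aside (based on MSVC’s documented /FA switches, which option 2’s GUI setting maps to), you can also request the listing directly from the command line:

cl /c /FA foo.cpp    (assembly-only listing, foo.asm)
cl /c /FAcs foo.cpp  (assembly with machine code and source, foo.cod)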

Next I’ll show an example where I investigate what Visual Studio (2010 vs. 2013) gives us for some C++ code.  I’m especially curious about whether it auto-vectorizes to SSE and AVX instructions.  Here is the code:

static inline float mul_f32(float s0, float s1)
{
    return s0 * s1;
}
...
// exec is a 64-bit mask with one bit per array element
if (exec == 0xFFFFFFFFFFFFFFFF)
{
    // all 64 lanes active: dense loop, a good candidate for auto-vectorization
    //#pragma loop(no_vector)
    for (int t = 0; t < 64; t++)
    {
        dst0[t] = mul_f32(src0[t], src1[t]);
    }
}
else
{
    // some lanes inactive: only write elements whose exec bit is set
    for (int t = 0; exec != 0; t++, exec >>= 1)
    {
        if ((exec & 1) != 0)
        {
            dst0[t] = mul_f32(src0[t], src1[t]);
        }
    }
}


The above code snippet is reasonably friendly to auto-vectorization.  It’s important to note that relying on the compiler for auto-vectorization can (in some cases) give you better performance than writing your own assembly or intrinsics (in addition to being more portable, more concise, and more readable).  However, you still need to write your C++ code in such a way that the compiler can do a good job of auto-vectorizing it.

To micro-optimize your code, you don’t necessarily need to write assembly yourself.  In fact, you can often get better performance by letting the compiler do its job, but it helps to understand what the compiler is doing (what it’s capable of, the assembly code it outputs, etc.).  By looking at the generated assembly and profiling (or performance testing) your code, you can experiment with different C++ formulations, different compilers, and different compiler flags.
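A related aside (relying on MSVC’s documented /Qvec-report switch, available since VS2012): you can ask the compiler to report which loops it auto-vectorized, plus a reason code for each loop it could not:

cl /O2 /Qvec-report:2 foo.cpp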

For this post, I made two Visual Studio solutions (.sln files) – one for 2010, one for 2013.

I compiled the same C++ code with (Release x64) and different options for /arch, see ( http://bit.ly/15PRdqW ).  All x64 CPUs have SSE2, so the default of “Not Set” uses SSE2 (same as /arch:SSE2).  To sanity-check this, you can compile the same code with “Not Set” and with “SSE2”, then compare the resulting assembly listings (they should be the same) (aside: the binary’s md5sum differs even if you recompile the same code, http://bit.ly/1zAYlFM).  The VS2010 GUI doesn’t show “/arch:AVX” or “/arch:AVX2”, but you can still add them under Configuration Properties -> C/C++ -> Command Line.
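For the listing comparison itself, any file diff works; for example, with Windows’ built-in fc command (the listing paths here are hypothetical):

fc NotSet\foo.asm SSE2\foo.asm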

[image] [image]

When I enable AVX instructions, it compiles as expected.  When I run the resulting exe on my AMD FX-8120 (which has AVX), it runs to completion.  However, when I run it on my Phenom II X6 1090T (which does not have AVX), I get “Illegal Instruction”.  By stepping through the assembly in the VS debugger, I can see the illegal instruction:

[image] [image]

00007FF704601293  vmovaps     xmmword ptr [rax-28h],xmm6

As described here ( http://intel.ly/1CdHj1r ), this is an AVX instruction (the v prefix indicates the VEX encoding introduced with AVX):

[image]

[image]

I wrapped my code with a simple timer to test performance on the AMD FX-8120 host (Timer class from http://bit.ly/1DhMTfG):

Timer t;
t.Start();
uint64 execReset = exec;  // the else-branch loop shifts exec down to 0, so restore it each outer iteration
for (int x = 0; x < 99999999; x++)
{
    exec = execReset;
    if (exec == 0xFFFFFFFFFFFFFFFF)
    {
        //#pragma loop(no_vector)
        for (int t = 0; t < 64; t++)
        {
            dst0[t] = mul_f32(src0[t], src1[t]);
        }
    }
    else
    {
        for (int t = 0; exec != 0; t++, exec >>= 1)
        {
            if ((exec & 1) != 0)
            {
                dst0[t] = mul_f32(src0[t], src1[t]);
            }
        }
    }
}
t.Stop();
std::cout << t.Elapsed() / 1000.0 << ", ";  // convert ms to seconds
 

I ran it with three input values for exec: 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFE, 0x1111111111111111.  The results were:

vs2010 NotSet: 3.188, 7.404, 7.072 
vs2010 AVX   : 3.19 , 7.418, 7.063 
vs2010 AVX2  : cl : Command line warning D9002: ignoring unknown option '/arch:AVX2' 
vs2013 NotSet: 0.998, 6.187, 6.624 
vs2013 AVX   : 0.985, 6.193, 6.621 
vs2013 AVX2  : 2.899, 6.231, 7.013 
vs2013 NotSet with #pragma loop(no_vector): 3.235, 7.399, 7.065
 

So for my simple test case, /arch:SSE2 vs. /arch:AVX does not matter.  /arch:AVX2 seems broken (VS2010 doesn’t recognize the flag, and under VS2013 it produced slower code).  vs2010 vs. vs2013 does matter: a little when less than all exec bits are set, and a lot when all exec bits are set, due to auto-vectorization (see http://bit.ly/1yOmxSF).

AVX extends the SIMD registers from 128-bit XMM to 256-bit YMM (so 4 to 8 single-precision floats per register) (aside: AVX-512 has 512-bit ZMM registers, i.e. 16 floats).  According to (http://bit.ly/1zBGApL), vs2013 uses XMM (128 bits) while GCC uses YMM (256 bits) for the same auto-vectorized code.  Looking at the assembly generated with /arch:AVX, I see that it’s still using XMM registers.  So if the vs2013 auto-vectorizer used 256-bit YMM registers, we might get additional performance gains (provided, of course, the host CPU has AVX).
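For reference (a hand-written illustration, not output from my builds), the same multiply at the two widths:

vmulps xmm0, xmm1, xmm2   ; 128-bit: 4 floats per instruction
vmulps ymm0, ymm1, ymm2   ; 256-bit: 8 floats per instruction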

How can we use 256-bit YMM registers?  Ideas (http://bit.ly/1y7gFjz):

A) Use a different compiler (or a newer version of the same compiler)

B) Use C++ intrinsics

C) Use vectorized libraries

D) Use OpenCL

How do we deal with the issue of different host CPUs?  E.g., my Phenom II X6 1090T doesn’t have AVX, while my AMD FX-8120 does:

A) Compile multiple versions of the same code.  This could be multiple binaries (#ifdef’s) or multiple code paths in the same binary (run-time conditionals are okay, as long as you never go down an AVX code path on a non-AVX host; a sketch of such a check follows below).

B) Compile byte code (such as LLVM IR) offline, then JIT it to host instructions at run-time (differently depending on the host CPU)
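For option A’s run-time conditional, here is a minimal sketch of an AVX check (assuming MSVC on x64; HostHasAvx is my hypothetical helper name).  AVX is only safe to use if the CPU supports it and the OS saves the YMM state:

#include <intrin.h>  // __cpuid, _xgetbv

static bool HostHasAvx()
{
    int info[4];
    __cpuid(info, 1);
    bool osxsave = (info[2] & (1 << 27)) != 0;  // OS uses XSAVE/XRSTOR
    bool avx     = (info[2] & (1 << 28)) != 0;  // CPU supports AVX instructions
    if (!osxsave || !avx)
        return false;
    // XCR0 bits 1 and 2: OS preserves XMM and YMM registers on context switch
    unsigned __int64 xcr0 = _xgetbv(_XCR_XFEATURE_ENABLED_MASK);
    return (xcr0 & 0x6) == 0x6;
}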

I tried a quick implementation using SSE intrinsics, compiled with Visual Studio 2013:

__m128 rA, rB, rC;  // SSE intrinsics require <xmmintrin.h>
if (exec == 0xFFFFFFFFFFFFFFFF)
{
    for (int t = 0; t < 64; t += 4)  // 4 floats per 128-bit register
    {
        rA = _mm_load_ps(&src0[t]);   // aligned 128-bit load
        rB = _mm_load_ps(&src1[t]);
        rC = _mm_mul_ps(rA, rB);
        _mm_store_ps(&dst0[t], rC);   // aligned 128-bit store
    }
}

With /arch not set, the result was 2.061 sec, which is slower than the 0.998 sec I got when I let Visual Studio auto-vectorize for me.

Next I tried a quick implementation using AVX intrinsics (256-bit).  Note that you need 32-byte-aligned memory, or the aligned loads/stores may fault:

float* dyn = (float *)_mm_malloc(width * depth * sizeof(float), 32);  // 32-byte alignment for 256-bit loads/stores
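A related detail from the same intrinsics headers: memory from _mm_malloc must be released with its matching free, not the standard one:

_mm_free(dyn);  // not free(dyn)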

Here is my quick AVX (256-bit) implementation:

__m256 rA_AVX, rB_AVX, rC_AVX;  // AVX intrinsics require <immintrin.h>
if (exec == 0xFFFFFFFFFFFFFFFF)
{
    for (int t = 0; t < 64; t += 8)  // 8 floats per 256-bit register
    {
        rA_AVX = _mm256_load_ps(&src0[t]);   // aligned 256-bit load
        rB_AVX = _mm256_load_ps(&src1[t]);
        rC_AVX = _mm256_mul_ps(rA_AVX, rB_AVX);
        _mm256_store_ps(&dst0[t], rC_AVX);   // aligned 256-bit store
    }
}

Of course I get “Illegal Instruction” if I try to run it on my Phenom II X6 1090T (no AVX).  However, on my AMD FX-8120 (has AVX), it runs to completion.  How’s the performance?  It gave me 0.889 sec.  That’s faster than the 128-bit auto-vectorized code from Visual Studio 2013 (0.998 sec), but I expected a bigger speedup.  Our 128-bit intrinsics ran significantly slower than the auto-vectorized 128-bit code, so maybe an auto-vectorized 256-bit version would be faster than my 256-bit intrinsics?  Most likely there is room to optimize this further, at least for an AVX CPU.

Based on the above experiments, one strategy might be a default path that relies on SSE2 auto-vectorization, plus an alternate path that uses AVX intrinsics (taken if and only if the host CPU has AVX).  The question is where to put the conditional that picks between the two code paths.

At the highest level, we could compile two entire sets of separate binaries, one for SSE2 and one for AVX.  The drawback is that we’d have two copies of every binary; then we either ship both copies, or we ask users to pick the right one.

At the other extreme, we could do a run-time conditional right before each of these loops.  However, depending on our code base, we might be doing the same run-time check (does the host have AVX?) a ridiculous number of times.

So for a larger program, we should write our C++ code such that we minimize both redundant binaries and redundant run-time conditionals.
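One middle ground I’d consider (a hypothetical sketch with made-up names, reusing the HostHasAvx check from above): detect the CPU once at startup, then dispatch through a function pointer so call sites pay no repeated check:

typedef void (*MulKernel)(float* dst0, const float* src0, const float* src1);

void MulF32_Sse2(float* dst0, const float* src0, const float* src1);  // built with SSE2 auto-vectorization
void MulF32_Avx(float* dst0, const float* src0, const float* src1);   // built with AVX intrinsics

static MulKernel g_mulKernel = 0;

void InitKernels()
{
    g_mulKernel = HostHasAvx() ? MulF32_Avx : MulF32_Sse2;
}

After InitKernels() runs, every call site just invokes g_mulKernel(dst0, src0, src1).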

Edit 2015/01/31: I decided to try clang (LLVM) on this example code to see the performance.  The clang home page’s downloads for Windows offered only a limited 32-bit binary, or building the source with CMake and Visual Studio.  Instead, I found a more convenient download here (http://bit.ly/1EVlWzx).  I used the following to compile my above C code with clang:

clang fooLLVM.c -o fooLLVM.exe -O3

The -O3 flag is a simple way to turn on aggressive optimizations (without it, my fooLLVM.exe was very slow!).

When I ran this, clang figured out my timer loop did not need to run 99999999 times: since the inner loop only set dst0[t] to the same values each iteration, it needed to run just once.  So I had to change my code to do this:

dst0[t] += mul_f32_LLVM(src0[t], src1[t]);

With that change, vs2013 Not Set (SSE2) gave me 1.493 sec, and vs2013 AVX gave me 1.401 sec.  LLVM gave me 1.617 sec (slower than vs2013).  Next I tried this:

clang fooLLVM.c -o fooLLVM.exe -O3 -mllvm -force-vector-width=8 => 1.185 sec (faster than vs2013) (also runs fine without an AVX CPU)

Next I tried the code with SSE intrinsics.  To make the comparison fair, I needed to modify the code to do the extra add:

rA = _mm_load_ps(&src0[t]);
rB = _mm_load_ps(&src1[t]);
rC = _mm_load_ps(&dst0[t]);               // load the current dst0 values
rD = _mm_add_ps(rC, _mm_mul_ps(rA, rB));  // dst0[t] + src0[t] * src1[t] (rD is an extra __m128)
_mm_store_ps(&dst0[t], rD);


clang++ fooLLVM.c -o fooLLVM.exe -O3 => 0.836 sec

Unlike with vs2013, intrinsics were faster than auto-vectorization.

Next I tried the code with AVX intrinsics:

rA_AVX = _mm256_load_ps(&src0[t]);
rB_AVX = _mm256_load_ps(&src1[t]);
rC_AVX = _mm256_load_ps(&dst0[t]);  // load the current dst0 values (rD_AVX is an extra __m256)
rD_AVX = _mm256_add_ps(rC_AVX, _mm256_mul_ps(rA_AVX, rB_AVX));
_mm256_store_ps(&dst0[t], rD_AVX);

clang++ fooLLVM.c -o fooLLVM.exe -O3 -march=corei7-avx => 1.452 sec (requires CPU with AVX)

clang++ fooLLVM.c -o fooLLVM.exe -O3 -mavx => 1.447 sec (requires CPU with AVX)

I was disappointed that AVX intrinsics were slower than clang SSE intrinsics.  Apparently this is an open bug, see (http://bit.ly/1yPVOka) (http://bit.ly/1696FQh) (http://bit.ly/1wLqylA).  I was also disappointed that clang doesn’t seem to have an option for AVX auto-vectorization yet.

The above numbers were on my AMD FX-8120.  On my Phenom II X6 1090T, the auto-vectorized version with (-mllvm -force-vector-width=8) was actually faster than the SSE intrinsics version (1.41 sec vs. 1.604 sec).

Summary (of my simple test case):

* auto-vectorization with AVX support is lacking in both Visual Studio 2013 and LLVM (clang) as of 2015/01/31

* AVX intrinsics were fastest for vs2013, but brokenly slow for LLVM

* vs2013 auto-vectorization was a lot faster than SSE intrinsics

* LLVM SSE intrinsics were faster than auto-vectorization

* I got slightly faster results with LLVM than vs2013 for both auto-vectorization and SSE intrinsics

Conclusion:

* LLVM SSE intrinsics on a non-AVX CPU are faster than auto-vectorization (even though it’s the reverse for vs2013)

* LLVM AVX intrinsics are brokenly slow, and LLVM AVX auto-vectorization doesn’t exist, as of 2015/01/31

* don’t trust these conclusions, because I only tested two hosts with a tiny test program that isn’t close enough to a real use case

YuGiOh GUI in C# (2006)

I wrote this nifty GUI in 2006 (shortly after college) for playing the YuGiOh card game.

This basic GUI prototype is pretty simple: it doesn’t have any AI, networking, or OpenGL graphics.  It doesn’t enforce any of the complex card-specific rules logic.  It doesn’t display the card graphics (it just displays the card’s name).  What it does do is this: it reads a deck from a text file, shuffles the deck randomly, and lets you draw cards from the top of the deck and move a card to a specified game location (in a specified position).  It also has a simple random integer generator (for rolling dice, coin flips, etc.).  When a card is placed on the field, the text boxes are intentionally editable so you can track things like how many counters are on a card.

This basic functionality enabled two very useful use cases.  The first is that I could experiment with making decks without using physical cards: to create a deck, I just type the card names into a simple text file.  I could also manually simulate playing two decks against each other.  The other use case is that I could play with someone over instant message and track the state of their game and mine (yes, I actually did this) (the person on the other side could be using an actual physical deck instead of my program).  So this was a great example of a tool that was low effort to write but very useful.  And it could (obviously) be expanded to do a lot more.

One more feature worth mentioning: although this is a GUI application, you can also drive the entire thing in console mode using custom scripting commands (similar to how AMD SimNow works).  This could be extended to record and play back a session, though we’d need to save random number seeds for anything random (e.g. shuffling the deck).

Screen shot of the initial GUI:

[image]

Screen shot after a bit of playing:

[image]

To place the selected card from your hand, right-click the zone:

[image]

Or if a card is already placed, you can reposition or move it by right-clicking the card:

[image]

An example deck list text file:

# http://www.professional-events.com/decks/UDE/YGO!/RegionalsButler090206.htm
# 14
[Monsters]
3 Lava Golem
3 Stealth Bird
3 Des Koala
2 Des Lacooda
1 Sangan
1 Spirit Reaper
1 Morphing Jar
# 13
[Spells]
3 Chain Energy
3 Wave-Motion Cannon
2 Nightmare’s Steelcage
2 Messenger of Peace
1 Swords of Revealing Light
1 Level Limit – Area B
1 Graceful Charity
# 13
[Traps]
3 Secret Barrel
3 Just Desserts
2 Ojama Trio
1 Ring of Destruction
1 Magic Cylinder
1 Gravity Bind
1 Ceasefire
1 Torrential Tribute

Old Side Projects Posted Now (Years Later)

My idea for this new blog category is to describe some older small side projects (exclusively, or at least primarily, done outside of work) (programming, of course) that I wrote long before I created this blog.  I’ve done a lot of small side projects just to play with stuff and learn a little, so I probably have over 100 from the past 10 years, and even more if we include college and high school.

Because I’ve jumped around between different small side projects, someone reading this might think I am scatter-brained, lacking focus, or emphasizing quantity over quality.  But keep in mind I have a full-time job as a software engineer, and the old projects I plan to post are just things I did randomly for fun outside of my job, or for some specific practical use.

The real take-away is that I like programming so much that even after working 40-80 hours a week at my actual job (which has specific focus areas), I will still spend a lot of nights, weekends, and vacation time playing with other programming stuff that I don’t get exposed to at work (eg web stuff like WordPress & Adobe Flex, C#, Java, image processing, mobile development, automation & bots), or that I get at work but want more of (eg OpenGL & DirectX, game engines, Qt, cross-platform development).  These side projects are also in addition to side learning (reading books, reading articles & papers, watching lectures & tutorials).

The trigger for my motivation to create this category is that I was looking through old files and found a small YuGiOh program I wrote in C#, probably in 2006.  So I will at least post about that.  I don’t know what else I will end up posting, but we will see.  A part of me, as I write this, is already saying to myself: “writing this post is itself a small side project that is wasting time you could be using for something else” :D

It’s a little weird that it’s 2015 and my first post in this category will be about something I wrote in 2006.  That’s why I am putting these years-old side projects in a separate category.  WordPress is probably not the right tool for this, because it organizes posts by the date I posted them; it would make more sense to organize them by when I wrote the project and other attributes of the project.  Or maybe I will change the posted date to an approximation of when I wrote the project instead of when I wrote the post.  Whatever I do, I don’t want to waste a lot of time on it.

Edit: I’ve decided to merge this category with the scripting & automation category, where I just did a post about using Sikuli for a simple Game of War “bot”.  So this will include both old and recent small side projects; the emphasis is more on the type of small side project than on when I wrote it.