The original page offers details on getting gcc to automatically vectorize loops. While the examples are great, it turns out the syntax for calling those options seems to have changed a bit with the latest GCC.

In summary, the following options will work for x86 chips with SSE2, giving a log of loops that have been vectorized:

    gcc -O2 -ftree-vectorize -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5

Note that -msse is also a possibility, but it will only vectorize loops over single-precision floats, because SSE1 has no packed double-precision or XMM integer instructions. (For 32-bit code use -mfpmath=sse as well. That's the default for 64-bit but not 32-bit.)

Modern versions of GCC enable -ftree-vectorize at -O3, so just use that in GCC4.x and later:

    gcc -O3 -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5

(Clang enables auto-vectorization at -O2. ICC defaults to optimization enabled + fast-math.)

Most of the following was written by Peter Cordes, who could have just written a new answer. Over time, as compilers change, options and compiler output will change. I am not entirely sure whether it is worth tracking it in great detail here.

To also use instruction set extensions supported by the hardware you're compiling on, and tune for it, use -march=native.

Reduction loops (like sum of an array) will need OpenMP or -ffast-math to treat FP math as associative and vectorize. An example on the Godbolt compiler explorer with -O3 -march=native -ffast-math includes a reduction (array sum), which stays scalar without -ffast-math. (Well, GCC8 and later do a SIMD load and then unpack it to scalar elements, which is pointless.)
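As a rough illustration of the points above, here is a minimal sketch of a file you could compile to see which loops get vectorized. The file name and the function names (scale_add, sum_array, sum_array_omp) are made up for this example and do not come from the original page. The commands in the comments assume a GCC new enough to accept -fopt-info-vec (the newer way to request a vectorization report) and -fopenmp-simd; older releases may need -ftree-vectorizer-verbose instead, and the exact report text varies between versions.

```c
/* vec_demo.c: hypothetical file name, used only for this sketch.
 *
 * Compile to an object file and ask GCC for a vectorization report
 * (assumes a GCC that accepts -fopt-info-vec):
 *   gcc -O3 -march=native -fopt-info-vec -c vec_demo.c
 * Add -ffast-math to let the plain reduction in sum_array() vectorize,
 * or -fopenmp-simd to honour the pragma in sum_array_omp().
 */
#include <stddef.h>

/* Element-wise loop: every iteration is independent, and restrict
 * promises the compiler that the arrays do not overlap. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * k + b[i];
}

/* Reduction: FP addition is not associative, so without -ffast-math
 * the compiler has to keep the additions in source order. */
float sum_array(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Same reduction, but the OpenMP pragma explicitly permits reordering
 * the additions. Without -fopenmp-simd (or -fopenmp) the pragma is
 * ignored and this behaves like sum_array(). */
float sum_array_omp(const float *a, size_t n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Because reordering FP additions can change rounding, the vectorized reductions may return a slightly different sum than the scalar loop on large inputs. That is the trade-off -ffast-math and the OpenMP reduction clause ask you to accept.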