-rw-r--r--  libraries/atlas/AMD64K10h64SSE3.tgz  bin 0 -> 11038 bytes
-rw-r--r--  libraries/atlas/README               11
-rw-r--r--  libraries/atlas/README.SLACKWARE     100
-rwxr-xr-x  libraries/atlas/atlas.SlackBuild     229
-rw-r--r--  libraries/atlas/atlas.info           10
-rw-r--r--  libraries/atlas/atlas.patch          5072
-rw-r--r--  libraries/atlas/slack-desc           19
7 files changed, 5441 insertions, 0 deletions
diff --git a/libraries/atlas/AMD64K10h64SSE3.tgz b/libraries/atlas/AMD64K10h64SSE3.tgz
new file mode 100644
index 0000000000..727f3748db
--- /dev/null
+++ b/libraries/atlas/AMD64K10h64SSE3.tgz
Binary files differ
diff --git a/libraries/atlas/README b/libraries/atlas/README
new file mode 100644
index 0000000000..d7d3b30931
--- /dev/null
+++ b/libraries/atlas/README
@@ -0,0 +1,11 @@
+ATLAS (Automatically Tuned Linear Algebra Software) is an ongoing
+research effort focusing on applying empirical techniques in order to
+provide portable performance. At present, it provides C and Fortran77
+interfaces to a portably efficient BLAS implementation, as well as a few
+routines from LAPACK.
+
+This requires blas, and it conflicts with cblas (only one of atlas
+and cblas may be installed at any given time). Take care with LAPACK
+(see notes 3 & 4 in README.SLACKWARE).
+
+You need to read over README.SLACKWARE *before* building this.
diff --git a/libraries/atlas/README.SLACKWARE b/libraries/atlas/README.SLACKWARE
new file mode 100644
index 0000000000..3d7d9be243
--- /dev/null
+++ b/libraries/atlas/README.SLACKWARE
@@ -0,0 +1,100 @@
+ATLAS (Automatically Tuned Linear Algebra Software) is an ongoing
+research effort focusing on applying empirical techniques in order to
+provide portable performance. At present, it provides C and Fortran77
+interfaces to a portably efficient BLAS implementation, as well as a few
+routines from LAPACK.
+
+IMPORTANT NOTES:
+
+1) Please note that this SlackBuild for ATLAS by no means tries to
+ take into account all configuration/build issues of ATLAS.
+ Nevertheless, the relevant patches mentioned in the ATLAS Errata
+ are applied.
+
+2) The script takes advantage of the fact that the compilers shipped with
+ Slackware should be OK. It also assumes that you are installing on an x86
+ or x86_64 platform. If you decide to use other compilers or install on
+ another platform, you are unfortunately on your own and welcome to suggest
+ improvements or patches to this SlackBuild. Moreover, there is no "post
+ install" tuning performed.
+
+3) ATLAS does not conflict with the reference netlib BLAS (see also note 6).
+ Nevertheless, once ATLAS has been installed successfully, you should consider removing
+ netlib BLAS and (re)compiling every BLAS-dependent package (starting with
+ LAPACK) against ATLAS. Otherwise you may not gain much from installing
+ ATLAS and may even run into problems (see next note).
+
+4) There is a strong interaction between ATLAS and LAPACK. If you want to install
+ ATLAS just for testing and want to avoid problems with LAPACK, you are urged to make
+ use of the SYS_DESTDIR variable as explained later. Otherwise consider the
+ following:
+ a) It is not recommended to install LAPACK alongside ATLAS without building
+ it against ATLAS. Moreover, if LAPACK is already installed, you have to remove
+ it first and later build it against ATLAS.
+ b) If ATLAS+LAPACK doesn't work for you, just stick with (netlib) BLAS+LAPACK.
+ Netlib BLAS is also available as a SlackBuild.
+ c) If ATLAS+LAPACK is installed you have to recompile and reinstall LAPACK after
+ each ATLAS upgrade.
+
+5) ATLAS conflicts with cblas.
+
+6) You have to have netlib BLAS installed before you install ATLAS. As stated
+ above, you should consider removing it from your system afterwards.
+
+INSTALLATION DETAILS:
+
+1) Make sure CPU throttling is off before starting the install. This is
+ important, since ATLAS has to tune itself.
+
+2) For the same reason, keep the load on the system as low as possible
+ while building ATLAS.
+
+3) There are a few extra variables which you may want or need
+ to give appropriate values when calling the atlas.SlackBuild:
+ MAX_MALLOC, REF_BLAS, USE_ARCH_DEFAULTS, SYS_DESTDIR and
+ DEFAULT_DOCS.
+
+ MAX_MALLOC is for adjusting the maximal size IN BYTES(!) that ATLAS
+ is allowed to allocate. According to the ATLAS errata, too small a
+ value may strongly reduce threaded performance. The default value
+ within this SlackBuild corresponds to 256MB. (The default value in
+ the ATLAS source corresponds to 64MB.)
+
+ REF_BLAS defaults to the full path to the netlib BLAS library as
+ installed from the appropriate SlackBuilds.org script. If you have
+ the netlib BLAS elsewhere, you have to set this variable to the
+ appropriate path.
+
+ USE_ARCH_DEFAULTS defaults to "yes", which means that the library
+ will be optimized by trying to take into account former builds done
+ on a similar machine. Thus ATLAS will use predefined optimizations
+ if available. This may greatly reduce the compilation time but may
+ not give you the best result if you don't use the same compiler
+ version (gcc 4.2) as the ATLAS author.
+ Please note that with this variable set to "no", or if there are no
+ known optimizations for your machine, ATLAS compilation takes
+ about three hours! Take a nap :-)
+
+ SYS_DESTDIR is set by default to "/usr" and is the system destination
+ directory. When installing the package produced by this SlackBuild,
+ ATLAS's files will be written to $SYS_DESTDIR/include,
+ $SYS_DESTDIR/include/atlas and $SYS_DESTDIR/lib (or lib64).
+ Documentation files are written to /usr/doc/atlas-$VERSION if not
+ otherwise stated (see below).
+ You may want to change the value of SYS_DESTDIR to avoid conflicts (see
+ IMPORTANT NOTES above). IMPORTANT: The value of SYS_DESTDIR has to be
+ an absolute path.
+
+ DEFAULT_DOCS has the default value "yes", which means that docs go
+ to /usr/doc/atlas-$VERSION, but you may want the docs to
+ go to $SYS_DESTDIR/doc/atlas-$VERSION. For this, just set this
+ variable to something like "no".
+
+ All these settings may be made the usual way on the command line when
+ calling this SlackBuild; you do not have to edit the script.
+
+If you also installed LAPACK linked against ATLAS, consider the following:
+IMPORTANT: If you are actually updating this library, i.e. ATLAS, you MUST also
+rebuild and reinstall LAPACK, even if there is no update available for LAPACK!
+Otherwise you end up with a broken/incomplete LAPACK library!
+
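The MAX_MALLOC note in README.SLACKWARE stresses that the value is given in bytes. A minimal sketch of the conversion, assuming a POSIX shell; the helper name mb_to_bytes and the /opt/atlas destination in the commented invocation are illustrative and not part of the SlackBuild:

```shell
# Convert a size in megabytes to the byte count MAX_MALLOC expects.
# (mb_to_bytes is a hypothetical helper, not something the SlackBuild provides.)
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# 256 MB is the SlackBuild default; 64 MB is the value in the ATLAS source.
MAX_MALLOC_BYTES=$(mb_to_bytes 256)
echo "MAX_MALLOC=$MAX_MALLOC_BYTES"

# Possible invocation for a test install outside /usr (paths are assumptions):
# MAX_MALLOC="$MAX_MALLOC_BYTES" SYS_DESTDIR=/opt/atlas DEFAULT_DOCS=no \
#   sh atlas.SlackBuild
```

Passing the variables on the command line, as in the commented invocation, matches what the README describes; the script itself does not need to be edited.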
diff --git a/libraries/atlas/atlas.SlackBuild b/libraries/atlas/atlas.SlackBuild
new file mode 100755
index 0000000000..595b20b2b6
--- /dev/null
+++ b/libraries/atlas/atlas.SlackBuild
@@ -0,0 +1,229 @@
+#!/bin/sh
+
+# Slackware build script for ATLAS
+
+# Copyright 2010 Serban Udrea <s.udrea@gsi.de>
+# All rights reserved.
+#
+# Redistribution and use of this script, with or without modification,
+# is permitted provided that the following conditions are met:
+#
+# 1. Redistributions of this script must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ''AS IS'' AND ANY EXPRESS OR
+# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
+# INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+# IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+
+PRGNAM=atlas
+VERSION=${VERSION:-3.8.3}
+ARCH=${ARCH:-i486}
+BUILD=${BUILD:-1}
+TAG=${TAG:-_SBo}
+
+CWD=$(pwd)
+TMP=${TMP:-/tmp/SBo}
+PKG=$TMP/package-$PRGNAM
+OUTPUT=${OUTPUT:-/tmp}
+
+if [ "$ARCH" = "i486" ]; then
+ SLKCFLAGS="-O2 -march=i486 -mtune=i686"
+ LIBDIRSUFFIX=""
+ BITSize="32" # Specifically for ATLAS
+elif [ "$ARCH" = "i686" ]; then
+ SLKCFLAGS="-O2 -march=i686 -mtune=i686"
+ LIBDIRSUFFIX=""
+ BITSize="32" # Specifically for ATLAS
+elif [ "$ARCH" = "x86_64" ]; then
+ SLKCFLAGS="-O2 -fPIC"
+ LIBDIRSUFFIX="64"
+ BITSize="64" # Specifically for ATLAS
+fi
+
+# You may change this to adjust the maximal size IN BYTES(!) that ATLAS
+# is allowed to allocate. According to the ATLAS errata, too small a
+# value may strongly reduce threaded performance. The default value
+# here is 256MB. (The default value in the ATLAS source is 64MB.)
+#
+MAX_MALLOC=${MAX_MALLOC:-268435456}
+
+# If you don't want to use architectural defaults set the following to
+# something like "no".
+USE_ARCH_DEFAULTS=${USE_ARCH_DEFAULTS:-yes}
+
+# The path to a reference BLAS library. By default it is assumed that you
+# have installed the netlib BLAS reference using the appropriate slackbuild
+# from slackbuilds.org. If this is not the case, you have to run this script
+# with another value for REF_BLAS.
+REF_BLAS=${REF_BLAS:-/usr/lib${LIBDIRSUFFIX}/libblas.a}
+
+# Let's do a little check (that we deal with a regular file we can read).
+[ -f "$REF_BLAS" -a -r "$REF_BLAS" ] || \
+{ echo "ERROR: Wrong path to reference BLAS library, exiting! " && exit 1; }
+
+# This is the system destination directory. When installing the
+# package produced by this script, ATLAS's files will be written to
+# $SYS_DESTDIR/include, $SYS_DESTDIR/include/atlas, $SYS_DESTDIR/lib
+# or $SYS_DESTDIR/lib64 ond appropriate platforms, etc.
+# Nevertheless, by default the documentation files go to
+# /usr/doc/$PRGNAM-$VERSION. You may change this through the variable
+# DEFAULT_DOCS, see below.
+#
+SYS_DESTDIR=${SYS_DESTDIR:-/usr}
+
+# Check that SYS_DESTDIR is an absolute path without ".." components. If not, exit with error.
+# NOTE: The $ anchors the second pattern at the end of the line, since echo adds a \n at the end of the string.
+echo "$SYS_DESTDIR" | grep -vE '/\.\./|/\.\.$' | grep -qE '^/' || \
+{ echo "ERROR: The system destination directory is not an absolute path!" \
+&& echo " The value of SYS_DESTDIR is $SYS_DESTDIR" \
+&& echo " Please set it properly! " \
+&& exit 1; }
+
+# You may want to have the documentation files installed under
+# $SYS_DESTDIR/doc/$PRGNAM-$VERSION not /usr/doc/$PRGNAM-$VERSION.
+# To achieve this just set the following variable to something like
+# "no".
+#
+DEFAULT_DOCS=${DEFAULT_DOCS:-yes}
+
+# The build directory to be created within the source directory of
+# ATLAS.
+BLDdir=BuildDir
+
+# Get the CPU frequency for good timing.
+CPU_FREQ="$(grep "cpu MHz" /proc/cpuinfo | head -n 1 | cut -d ":" -s -f2 | tr -d '[:blank:]')"
+
+set -e # Exit on most errors
+
+rm -rf $PKG
+mkdir -p $TMP $PKG $OUTPUT
+
+cd $TMP
+rm -rf $PRGNAM-$VERSION
+tar xvf $CWD/${PRGNAM}${VERSION}.tar.bz2
+mv ATLAS $PRGNAM-$VERSION
+cd $PRGNAM-$VERSION
+
+chown -R root:root .
+
+find . \
+ \( -perm 777 -o -perm 775 -o -perm 711 -o -perm 555 -o -perm 511 \) \
+ -exec chmod 755 {} \; -o \
+ \( -perm 666 -o -perm 664 -o -perm 600 -o -perm 444 -o -perm 440 -o -perm 400 \) \
+ -exec chmod 644 {} \;
+
+# Make changes as suggested in the atlas errata.
+cat $CWD/atlas.patch | sed -e s%XXX_MaxMalloc_XXX%$MAX_MALLOC% | patch -p1
+
+# If architectural defaults are to be used, copy the file mentioned in the errata
+# to the architectural defaults directory.
+case "$USE_ARCH_DEFAULTS" in
+ [yY]|[yY][eE]|[yY][eE][sS]) cp "$CWD/AMD64K10h64SSE3.tgz" CONFIG/ARCHS; USE_ARCH_DEFAULTS="1" ;;
+ *) USE_ARCH_DEFAULTS="0" ;;
+esac
+
+mkdir -p $BLDdir
+cd $BLDdir
+
+# Configure atlas.
+# IMPORTANT: Here we assume that we are on an x86 machine (be it 32 or 64 bits)
+# and gcc or icc is the compiler to be used. This should presently be
+# a reasonable assumption with Slackware. Under other circumstances
+# "-DPentiumCPS=$CPU_FREQ" has to be replaced with "-DWALL".
+#
+../configure -Si archdef "$USE_ARCH_DEFAULTS" -b "$BITSize" -D c \
+-DPentiumCPS="$CPU_FREQ" -Fa alg -fPIC
+
+# NOTES ON THE FLAGS FOR CONFIGURE
+#
+# -Si archdef "$USE_ARCH_DEFAULTS" means that we ignore or not architectural defaults depending
+# upon the value of "$USE_ARCH_DEFAULTS".
+# -b "$BITSize" tells ATLAS about the platform's bitsize, 32 or 64.
+# -D c -DPentiumCPS="$CPU_FREQ" is for achieving good timing on x86 platforms with gcc or icc.
+# -Fa alg -fPIC is for being able to create dynamic libs too.
+
+# The next two variables are set and their values are finally saved
+# for using them to compile lapack.
+# Remember the compiler name.
+ATLAS_COMPILER="$(grep "F77 =" Make.inc | cut -d "=" -f1 --complement)"
+
+# Remember the fortran compilation flags.
+ATLAS_F77FLAGS="$(grep "F77FLAGS =" Make.inc | cut -d "=" -f1 --complement)"
+
+# Set the path to the reference BLAS.
+sed -i -e '/^ \+BLASlib/s%BLASlib = .*%BLASlib = '"$REF_BLAS"% \
+ Make.inc
+
+make
+make check
+
+# If parallel libraries have been compiled check them too.
+if [ -f lib/libptcblas.a ]; then
+ make ptcheck
+ PARALLEL_LIBS="yes" # We will use this when creating dynamic libs.
+fi
+
+# Install the static libs created during the build process.
+make install DESTDIR=$PKG$SYS_DESTDIR
+
+# Go to the ATLAS $BLDdir/lib directory and try to create and install
+# the dynamic libraries.
+# NOTE: The test for the presence of static parallel libs and the command to actually build the
+# shared parallel libs are connected by a logical OR to make sure that the subshell
+# does not exit with non-zero error code just because static parallel libs didn't
+# get built. Therefore the test is successful if the variable PARALLEL_LIBS is unset or
+# empty, i.e. when no static parallel libs got built.
+( cd lib && make shared && \
+ { [ "${PARALLEL_LIBS}1" = "1" ] || make ptshared; } && \
+ cp -p *.so "$PKG$SYS_DESTDIR/lib"
+)
+
+find $PKG | xargs file | grep -e "executable" -e "shared object" | grep ELF \
+ | cut -f 1 -d : | xargs strip --strip-unneeded 2> /dev/null || true
+
+# This is probably the easiest way to make sure that we install in the
+# proper place.
+if [ -n "$LIBDIRSUFFIX" ]; then
+ mv $PKG$SYS_DESTDIR/lib $PKG$SYS_DESTDIR/lib${LIBDIRSUFFIX}
+fi
+
+# Create the doc directory for atlas and populate it.
+case "$DEFAULT_DOCS" in
+ [nN]|[nN][oO]) DOC_DIR="$PKG$SYS_DESTDIR/doc/$PRGNAM-$VERSION" ;;
+ *) DOC_DIR="$PKG/usr/doc/$PRGNAM-$VERSION" ;;
+esac
+
+mkdir -p $DOC_DIR
+cp -a ../INSTALL.txt ../README ../doc $DOC_DIR
+
+# The following makefiles may be needed to merge atlas and lapack.
+mkdir $DOC_DIR/MAKEFILES
+cp -p Make.inc $DOC_DIR/MAKEFILES
+cp -p lib/Makefile $DOC_DIR/MAKEFILES/Makefile.lib
+
+# Create a file with the build flags for atlas. Needed to merge
+# ATLAS and LAPACK. The LAPACK SlackBuild will just have to source
+# this file to find out the compiler used for ATLAS and the build
+# flags.
+echo "ATLAS_COMPILER=\"$ATLAS_COMPILER\"" > "$DOC_DIR/SETTINGS"
+echo "ATLAS_F77FLAGS=\"$ATLAS_F77FLAGS\"" >> "$DOC_DIR/SETTINGS"
+sed -i -e s'%=" %="%' "$DOC_DIR/SETTINGS" # Remove the extra space after the "=" sign
+echo "ATLAS_NOOPT=\"-O0\" #Add more options within the quotes if needed." >> "$DOC_DIR/SETTINGS"
+
+# Add the Slackbuild script and README.SLACKWARE to the docs.
+cat $CWD/$PRGNAM.SlackBuild > $DOC_DIR/$PRGNAM.SlackBuild
+cat $CWD/README.SLACKWARE > $DOC_DIR/README.SLACKWARE
+
+mkdir -p $PKG/install
+cat $CWD/slack-desc > $PKG/install/slack-desc
+
+cd "$PKG"
+/sbin/makepkg -l y -c n $OUTPUT/$PRGNAM-$VERSION-$ARCH-$BUILD$TAG.${PKGTYPE:-tgz}
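The script records the Fortran compiler and flags from Make.inc in a SETTINGS file so that a LAPACK build can source them. The following is a rough sketch of that round trip under stated assumptions: the Make.inc content (gfortran, -O -fPIC -m64) is invented for illustration, the file lives in a temporary directory rather than the package doc directory, and `cut -f2-` stands in for the script's `cut -f1 --complement`:

```shell
# Work in a throwaway directory; the real script writes into $DOC_DIR.
workdir=$(mktemp -d)

# Illustrative Make.inc fragment (values are assumptions, not ATLAS output).
printf '   F77 = gfortran\n   F77FLAGS = -O -fPIC -m64\n' > "$workdir/Make.inc"

# Keep everything after the first "=", as the SlackBuild does.
compiler=$(grep "F77 =" "$workdir/Make.inc" | cut -d "=" -f2-)
flags=$(grep "F77FLAGS =" "$workdir/Make.inc" | cut -d "=" -f2-)

# Write the settings and drop the space that follows the "=" in Make.inc.
{
  echo "ATLAS_COMPILER=\"$compiler\""
  echo "ATLAS_F77FLAGS=\"$flags\""
} | sed -e 's%=" %="%' > "$workdir/SETTINGS"

# A consumer (for example a LAPACK build script) would source the file.
. "$workdir/SETTINGS"
echo "compiler: $ATLAS_COMPILER"
echo "flags:    $ATLAS_F77FLAGS"

rm -rf "$workdir"
```

Sourcing a plain variable file keeps the two build scripts decoupled: the consumer only needs to know where the ATLAS documentation directory is.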
diff --git a/libraries/atlas/atlas.info b/libraries/atlas/atlas.info
new file mode 100644
index 0000000000..d223fcc27d
--- /dev/null
+++ b/libraries/atlas/atlas.info
@@ -0,0 +1,10 @@
+PRGNAM="atlas"
+VERSION="3.8.3"
+HOMEPAGE="http://math-atlas.sourceforge.net/"
+DOWNLOAD="http://downloads.sourceforge.net/math-atlas/atlas3.8.3.tar.bz2"
+MD5SUM="6c13be94a87178e7582111c08e9503bc"
+DOWNLOAD_x86_64=""
+MD5SUM_x86_64=""
+MAINTAINER="Serban Udrea"
+EMAIL="S.Udrea@gsi.de"
+APPROVED="rworkman"
diff --git a/libraries/atlas/atlas.patch b/libraries/atlas/atlas.patch
new file mode 100644
index 0000000000..dea4dcc0b2
--- /dev/null
+++ b/libraries/atlas/atlas.patch
@@ -0,0 +1,5072 @@
+diff -rupN ATLAS/CONFIG/src/backend/archinfo_x86.c atlas-3.8.3/CONFIG/src/backend/archinfo_x86.c
+--- ATLAS/CONFIG/src/backend/archinfo_x86.c 2009-02-18 19:47:37.000000000 +0100
++++ atlas-3.8.3/CONFIG/src/backend/archinfo_x86.c 2009-11-12 13:47:23.777451677 +0100
+@@ -320,7 +320,7 @@ enum MACHTYPE Chip2Mach(enum CHIP chip,
+ iret = IntP4;
+ break;
+ case 3:
+- case 4:
++ case 4: ; case 6:
+ iret = IntP4E;
+ break;
+ default:
+diff -rupN ATLAS/include/atlas_lvl3.h atlas-3.8.3/include/atlas_lvl3.h
+--- ATLAS/include/atlas_lvl3.h 2009-02-18 19:47:35.000000000 +0100
++++ atlas-3.8.3/include/atlas_lvl3.h 2009-11-12 13:52:49.308496090 +0100
+@@ -126,7 +126,7 @@
+ #define CPAT Mjoin(C_ATL_, PRE);
+
+ #ifndef ATL_MaxMalloc
+- #define ATL_MaxMalloc 67108864
++ #define ATL_MaxMalloc XXX_MaxMalloc_XXX
+ #endif
+
+ typedef void (*MAT2BLK)(int, int, const TYPE*, int, TYPE*, const SCALAR);
+diff -rupN ATLAS/src/blas/gemm/ATL_cmmJITcp.c atlas-3.8.3/src/blas/gemm/ATL_cmmJITcp.c
+--- ATLAS/src/blas/gemm/ATL_cmmJITcp.c 2009-02-18 19:47:44.000000000 +0100
++++ atlas-3.8.3/src/blas/gemm/ATL_cmmJITcp.c 2009-11-12 12:44:34.816529051 +0100
+@@ -268,7 +268,8 @@ static void Mjoin(PATL,mmK)
+ {
+ NBmm0 = NBmm1 = NBmmX = Mjoin(PATLU,pKBmm);
+ if (SCALAR_IS_ZERO(beta))
+- Mjoin(PATL,gezero)(M, N, C, ldc);
++ /* Mjoin(PATL,gezero)(M, N, C, ldc); */
++ { Mjoin(PATLU,gezero)(M, N, pC, ldpc); Mjoin(PATLU,gezero)(M, N, pC+ipc, ldpc); }
+ }
+ if (nblk)
+ {
+diff -rupN ATLAS/src/blas/gemm/ATL_gereal2cplx.c atlas-3.8.3/src/blas/gemm/ATL_gereal2cplx.c
+--- ATLAS/src/blas/gemm/ATL_gereal2cplx.c 2009-02-18 19:47:44.000000000 +0100
++++ atlas-3.8.3/src/blas/gemm/ATL_gereal2cplx.c 2009-11-12 12:49:49.331651677 +0100
+@@ -43,7 +43,53 @@ void Mjoin(PATL,gereal2cplx)
+ const int ldc2 = (ldc-M)<<1;
+ int i, j;
+
+- if (ialp == ATL_rzero && ibet == ATL_rzero)
++/*
++ * Cannot read C if BETA is 0
++ */
++ if (rbet == ATL_rzero && ibet == ATL_rzero)
++ {
++ if (ialp == ATL_rzero) /* alpha is a real number */
++ {
++ if (ralp == ATL_rone) /* alpha = 1.0 */
++ {
++ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2)
++ {
++ for (i=0; i < M; i++, C += 2)
++ {
++ *C = R[i];
++ C[1] = I[i];
++ }
++ }
++ }
++ else
++ {
++ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2)
++ {
++ for (i=0; i < M; i++, C += 2)
++ {
++ *C = ralp * R[i];
++ C[1] = ralp * I[i];
++ }
++ }
++ }
++ }
++ else /* alpha is a complex number */
++ {
++ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2)
++ {
++ for (i=0; i < M; i++, C += 2)
++ {
++ ra = R[i]; ia = I[i];
++ C[0] = ralp * ra - ialp * ia;
++ C[1] = ralp * ia + ialp * ra;
++ }
++ }
++ }
++ }
++/*
++ * If alpha and beta are both real numbers
++ */
++ else if (ialp == ATL_rzero && ibet == ATL_rzero)
+ {
+ if (ralp == ATL_rone && rbet == ATL_rone)
+ {
+diff -rupN ATLAS/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c atlas-3.8.3/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c
+--- ATLAS/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c 2009-02-18 19:48:26.000000000 +0100
++++ atlas-3.8.3/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c 2009-11-12 12:35:50.453038827 +0100
+@@ -27,6 +27,13 @@
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
++#if KB > 84
++ #error "KB cannot exceed 84!"
++#endif
++#if (KB/4)*4 != KB
++ #error "KB must be a multiple of 4!"
++#endif
++
+ #ifndef ATL_GAS_x8664
+ #error "This kernel requires x86-64 assembly!"
+ #endif
+@@ -58,25 +65,25 @@
+ * Integer register usage shown be these defines
+ */
+ #define pA %rcx
+-#define pA10 %rbx
+-#define ldab %rbp
+-#define mldab %rdx
++#define pA10 %rbx
++#define ldab %rbp
++#define mldab %rdx
+ #define mldab5 %rax
+ #define pB %rdi
+ #define pC %rsi
+ #define incCn %r10
+ #define stM %r9
+ #define stN %r11
+-#define pfA %r8
+-#define pA5 pA
+-#define pB0 pB
++#define pfA %r8
++#define pA5 pA
++#define pB0 pB
+ #if MB == 0
+- #define stM0 %r12
+- #define incAm %r13
++ #define stM0 %r12
++ #define incAm %r13
+ #endif
+ /* rax used in 32/64 conversion */
+
+-#define NBso (KB*4)
++#define NBso (KB*4)
+ #define MBKBso (MB*KB*4)
+ #define NB2so (NBso+NBso)
+ #define NB3so (NBso+NBso+NBso)
+@@ -95,22 +102,22 @@
+ /*
+ * SSE2 register usage shown be these defines
+ */
+-#define rA0 %xmm0
+-#define rB0 %xmm1
+-#define rC0 %xmm2
+-#define rC1 %xmm3
+-#define rC2 %xmm4
+-#define rC3 %xmm5
+-#define rC4 %xmm6
+-#define rC5 %xmm7
+-#define rC6 %xmm8
+-#define rC7 %xmm9
+-#define rC8 %xmm10
+-#define rC9 %xmm11
+-#define rC10 %xmm12
+-#define rC11 %xmm13
+-#define rC12 %xmm14
+-#define rC13 %xmm15
++#define rA0 %xmm0
++#define rB0 %xmm1
++#define rC0 %xmm2
++#define rC1 %xmm3
++#define rC2 %xmm4
++#define rC3 %xmm5
++#define rC4 %xmm6
++#define rC5 %xmm7
++#define rC6 %xmm8
++#define rC7 %xmm9
++#define rC8 %xmm10
++#define rC9 %xmm11
++#define rC10 %xmm12
++#define rC11 %xmm13
++#define rC12 %xmm14
++#define rC13 %xmm15
+ /*
+ * Prefetch defines
+ */
+@@ -127,99 +134,99 @@
+ #if MB != 0
+ #define incAm $MBKBso-NB14so+176
+ #endif
+- .text
++ .text
+ .global ATL_asmdecor(ATL_USERMM)
+ ATL_asmdecor(ATL_USERMM):
+ /*
+ * Save callee-saved iregs
+ */
+- movq %rbp, -8(%rsp)
+- movq %rbx, -16(%rsp)
++ movq %rbp, -8(%rsp)
++ movq %rbx, -16(%rsp)
+ #if MB == 0
+- movq %r12, -32(%rsp)
+- movq %r13, -40(%rsp)
++ movq %r12, -32(%rsp)
++ movq %r13, -40(%rsp)
+ #endif
+ #ifdef BETAX
+ #define BOF -56
+- movss %xmm1, BOF(%rsp)
+- movss %xmm1, BOF+4(%rsp)
+- movss %xmm1, BOF+8(%rsp)
+- movss %xmm1, BOF+12(%rsp)
++ movss %xmm1, BOF(%rsp)
++ movss %xmm1, BOF+4(%rsp)
++ movss %xmm1, BOF+8(%rsp)
++ movss %xmm1, BOF+12(%rsp)
+ #endif
+ /*
+ * pA already comes in right reg
+ * Initialize pB = B; pC = C; NBso = NB * sizeof;
+ */
+- movq %rsi, stN
+- movq %rdi, %rax
+- movq 16(%rsp), pC
+- prefC((pC))
+- prefC(64(pC))
+- movq %r9, pB
+- prefB((pB))
+- prefB(64(pB))
+- movq %rax, stM
++ movq %rsi, stN
++ movq %rdi, %rax
++ movq 16(%rsp), pC
++ prefC((pC))
++ prefC(64(pC))
++ movq %r9, pB
++ prefB((pB))
++ prefB(64(pB))
++ movq %rax, stM
+ /*
+ * stM = pA + NBNBso; stN = pB + NBNBso;
+ */
+ #if MB == 0
+- movq stM, pfA
+- imulq $NBso, pfA
+- prefB(128(pB))
+- movq pfA, incAm
+- addq pA5, pfA
+- addq $176-NB14so, incAm
++ movq stM, pfA
++ imulq $NBso, pfA
++ prefB(128(pB))
++ movq pfA, incAm
++ addq pA5, pfA
++ addq $176-NB14so, incAm
+ #else
+- movq $MBKBso, pfA
+- addq pA5, pfA
+- prefB(128(pB))
++ movq $MBKBso, pfA
++ addq pA5, pfA
++ prefB(128(pB))
+ #endif
+ /*
+ * convert ldc to 64 bits, and then set incCn = (ldc - MB)*sizeof
+ */
+- movl 24(%rsp), %eax
+- cltq
+- movq %rax, incCn
+- subq stM, incCn
+- addq $14, incCn
++ movl 24(%rsp), %eax
++ cltq
++ movq %rax, incCn
++ subq stM, incCn
++ addq $14, incCn
+ #ifdef SREAL
+- shl $2, incCn
++ shl $2, incCn
+ #else
+- shl $3, incCn
+- prefC(128(pC))
+- prefC(192(pC))
++ shl $3, incCn
++ prefC(128(pC))
++ prefC(192(pC))
+ #endif
+ /*
+ * Find M/14 if MB is not set
+ */
+ #if MB == 0
+- cmp $84, stM
+- jne MB_LT84
+-/* movq $84/14, stM */
+- movq $6, stM
++ cmp $84, stM
++ jne MB_LT84
++/* movq $84/14, stM */
++ movq $6, stM
+ MBFOUND:
+- subq $1, stM
+- movq stM, stM0
++ subq $1, stM
++ movq stM, stM0
+ #endif
+- addq $120, pA5
+- addq $120, pB0
+- movq $KB*4, ldab
+- movq $-KB*5*4, mldab5
+- movq $-KB*4, mldab
+- subq mldab5, pA5
+- lea KB*4(pA5, ldab,4), pA10
+-/* movq $NB, stN */
++ addq $120, pA5
++ addq $120, pB0
++ movq $KB*4, ldab
++ movq $-KB*5*4, mldab5
++ movq $-KB*4, mldab
++ subq mldab5, pA5
++ lea KB*4(pA5, ldab,4), pA10
++/* movq $NB, stN */
+
+ UNLOOP:
+ #if MB == 0
+- movq stM0, stM
+- cmp $0, stM
+- je MLAST
++ movq stM0, stM
++ cmp $0, stM
++ je MLAST
+ #else
+ #ifdef ATL_DivAns
+- movq $ATL_DivAns-1, stM
++ movq $ATL_DivAns-1, stM
+ #else
+- movq $MB/14-1, stM
++ movq $MB/14-1, stM
+ #endif
+ #endif
+ #if MB == 0 || MB > 14
+@@ -227,992 +234,992 @@ UMLOOP:
+ /*
+ * rC[0-13] = pC[0-13] * beta
+ */
+- ALIGN16
++ ALIGN16
+ /*UKLOOP: */
+ #ifdef BETA1
+- movaps 0-120(pA10,mldab5,2), rC0
+- movaps 0-120(pB0), rB0
+- mulps rB0, rC0
+- addss (pC), rC0
+- movaps 0-120(pA5, mldab,4), rC1
+- mulps rB0, rC1
+- addss CMUL(4)(pC), rC1
+- movaps 0-120(pA10, mldab,8), rC2
+- mulps rB0, rC2
+- addss CMUL(8)(pC), rC2
+- movaps 0-120(pA5, mldab,2), rC3
+- mulps rB0, rC3
+- addss CMUL(12)(pC), rC3
+- movaps 0-120(pA5, mldab), rC4
+- mulps rB0, rC4
+- addss CMUL(16)(pC), rC4
+- movaps 0-120(pA5), rC5
+- mulps rB0, rC5
+- addss CMUL(20)(pC), rC5
+- movaps 0-120(pA5, ldab), rC6
+- mulps rB0, rC6
+- addss CMUL(24)(pC), rC6
+- movaps 0-120(pA5, ldab,2), rC7
+- mulps rB0, rC7
+- addss CMUL(28)(pC), rC7
+- movaps 0-120(pA10, mldab,2), rC8
+- mulps rB0, rC8
+- addss CMUL(32)(pC), rC8
+- movaps 0-120(pA5,ldab,4), rC9
+- mulps rB0, rC9
+- addss CMUL(36)(pC), rC9
+- movaps 0-120(pA10), rC10
+- mulps rB0, rC10
+- addss CMUL(40)(pC), rC10
+- movaps 0-120(pA10,ldab), rC11
+- mulps rB0, rC11
+- addss CMUL(44)(pC), rC11
+- movaps 0-120(pA10,ldab,2), rC12
+- mulps rB0, rC12
+- addss CMUL(48)(pC), rC12
+- movaps 0-120(pA5,ldab,8), rC13
+- mulps rB0, rC13
+- addss CMUL(52)(pC), rC13
++ movaps 0-120(pA10,mldab5,2), rC0
++ movaps 0-120(pB0), rB0
++ mulps rB0, rC0
++ addss (pC), rC0
++ movaps 0-120(pA5, mldab,4), rC1
++ mulps rB0, rC1
++ addss CMUL(4)(pC), rC1
++ movaps 0-120(pA10, mldab,8), rC2
++ mulps rB0, rC2
++ addss CMUL(8)(pC), rC2
++ movaps 0-120(pA5, mldab,2), rC3
++ mulps rB0, rC3
++ addss CMUL(12)(pC), rC3
++ movaps 0-120(pA5, mldab), rC4
++ mulps rB0, rC4
++ addss CMUL(16)(pC), rC4
++ movaps 0-120(pA5), rC5
++ mulps rB0, rC5
++ addss CMUL(20)(pC), rC5
++ movaps 0-120(pA5, ldab), rC6
++ mulps rB0, rC6
++ addss CMUL(24)(pC), rC6
++ movaps 0-120(pA5, ldab,2), rC7
++ mulps rB0, rC7
++ addss CMUL(28)(pC), rC7
++ movaps 0-120(pA10, mldab,2), rC8
++ mulps rB0, rC8
++ addss CMUL(32)(pC), rC8
++ movaps 0-120(pA5,ldab,4), rC9
++ mulps rB0, rC9
++ addss CMUL(36)(pC), rC9
++ movaps 0-120(pA10), rC10
++ mulps rB0, rC10
++ addss CMUL(40)(pC), rC10
++ movaps 0-120(pA10,ldab), rC11
++ mulps rB0, rC11
++ addss CMUL(44)(pC), rC11
++ movaps 0-120(pA10,ldab,2), rC12
++ mulps rB0, rC12
++ addss CMUL(48)(pC), rC12
++ movaps 0-120(pA5,ldab,8), rC13
++ mulps rB0, rC13
++ addss CMUL(52)(pC), rC13
+ #else
+- movaps 0-120(pA10,mldab5,2), rC0
+- movaps 0-120(pB0), rC13
+- mulps rC13, rC0
+- movaps 0-120(pA5, mldab,4), rC1
+- mulps rC13, rC1
+- movaps 0-120(pA10, mldab,8), rC2
+- mulps rC13, rC2
+- movaps 0-120(pA5, mldab,2), rC3
+- mulps rC13, rC3
+- movaps 0-120(pA5, mldab), rC4
+- mulps rC13, rC4
+- movaps 0-120(pA5), rC5
+- mulps rC13, rC5
+- movaps 0-120(pA5, ldab), rC6
+- mulps rC13, rC6
+- movaps 0-120(pA5, ldab,2), rC7
+- mulps rC13, rC7
+- movaps 0-120(pA10, mldab,2), rC8
+- mulps rC13, rC8
+- movaps 0-120(pA5,ldab,4), rC9
+- mulps rC13, rC9
+- movaps 0-120(pA10), rC10
+- mulps rC13, rC10
+- movaps 0-120(pA10,ldab), rC11
+- mulps rC13, rC11
+- movaps 0-120(pA10,ldab,2), rC12
+- mulps rC13, rC12
+- mulps 0-120(pA5,ldab,8), rC13
++ movaps 0-120(pA10,mldab5,2), rC0
++ movaps 0-120(pB0), rC13
++ mulps rC13, rC0
++ movaps 0-120(pA5, mldab,4), rC1
++ mulps rC13, rC1
++ movaps 0-120(pA10, mldab,8), rC2
++ mulps rC13, rC2
++ movaps 0-120(pA5, mldab,2), rC3
++ mulps rC13, rC3
++ movaps 0-120(pA5, mldab), rC4
++ mulps rC13, rC4
++ movaps 0-120(pA5), rC5
++ mulps rC13, rC5
++ movaps 0-120(pA5, ldab), rC6
++ mulps rC13, rC6
++ movaps 0-120(pA5, ldab,2), rC7
++ mulps rC13, rC7
++ movaps 0-120(pA10, mldab,2), rC8
++ mulps rC13, rC8
++ movaps 0-120(pA5,ldab,4), rC9
++ mulps rC13, rC9
++ movaps 0-120(pA10), rC10
++ mulps rC13, rC10
++ movaps 0-120(pA10,ldab), rC11
++ mulps rC13, rC11
++ movaps 0-120(pA10,ldab,2), rC12
++ mulps rC13, rC12
++ mulps 0-120(pA5,ldab,8), rC13
+ #endif
+
+ #if KB > 4
+- movaps 16-120(pA10,mldab5,2), rA0
+- movaps 16-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 16-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 16-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 16-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 16-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 16-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 16-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 16-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 16-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 16-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 16-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 16-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 16-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 16-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 16-120(pA10,mldab5,2), rA0
++ movaps 16-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 16-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 16-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 16-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 16-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 16-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 16-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 16-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 16-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 16-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 16-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 16-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 16-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 16-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 8
+- movaps 32-120(pA10,mldab5,2), rA0
+- movaps 32-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 32-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 32-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 32-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 32-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 32-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 32-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 32-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 32-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 32-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 32-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 32-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 32-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 32-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 32-120(pA10,mldab5,2), rA0
++ movaps 32-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 32-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 32-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 32-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 32-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 32-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 32-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 32-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 32-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 32-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 32-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 32-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 32-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 32-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 12
+- movaps 48-120(pA10,mldab5,2), rA0
+- movaps 48-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 48-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 48-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 48-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 48-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 48-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 48-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 48-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 48-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 48-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 48-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 48-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 48-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 48-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 48-120(pA10,mldab5,2), rA0
++ movaps 48-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 48-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 48-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 48-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 48-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 48-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 48-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 48-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 48-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 48-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 48-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 48-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 48-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 48-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 16
+- movaps 64-120(pA10,mldab5,2), rA0
+- movaps 64-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 64-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 64-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 64-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 64-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 64-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 64-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 64-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 64-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 64-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 64-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 64-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 64-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 64-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 64-120(pA10,mldab5,2), rA0
++ movaps 64-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 64-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 64-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 64-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 64-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 64-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 64-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 64-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 64-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 64-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 64-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 64-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 64-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 64-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 20
+- movaps 80-120(pA10,mldab5,2), rA0
+- movaps 80-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 80-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 80-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 80-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 80-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 80-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 80-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 80-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 80-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 80-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 80-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 80-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 80-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 80-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 80-120(pA10,mldab5,2), rA0
++ movaps 80-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 80-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 80-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 80-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 80-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 80-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 80-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 80-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 80-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 80-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 80-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 80-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 80-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 80-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 24
+- movaps 96-120(pA10,mldab5,2), rA0
+- movaps 96-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 96-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 96-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 96-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 96-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 96-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 96-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 96-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 96-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 96-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 96-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 96-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 96-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 96-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 96-120(pA10,mldab5,2), rA0
++ movaps 96-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 96-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 96-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 96-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 96-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 96-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 96-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 96-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 96-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 96-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 96-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 96-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 96-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 96-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 28
+- movaps 112-120(pA10,mldab5,2), rA0
+- movaps 112-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 112-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 112-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 112-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 112-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 112-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 112-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 112-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 112-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 112-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 112-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 112-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 112-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 112-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 112-120(pA10,mldab5,2), rA0
++ movaps 112-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 112-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 112-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 112-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 112-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 112-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 112-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 112-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 112-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 112-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 112-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 112-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 112-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 112-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+ #ifndef SREAL
+- pref2((pfA))
+- pref2(64(pfA))
++ pref2((pfA))
++ pref2(64(pfA))
+ #endif
+
+ #if KB > 32
+- movaps 128-120(pA10,mldab5,2), rA0
+- movaps 128-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 128-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 128-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 128-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 128-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 128-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 128-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 128-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 128-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 128-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 128-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 128-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 128-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 128-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 128-120(pA10,mldab5,2), rA0
++ movaps 128-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 128-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 128-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 128-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 128-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 128-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 128-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 128-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 128-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 128-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 128-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 128-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 128-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 128-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 36
+- movaps 144-120(pA10,mldab5,2), rA0
+- movaps 144-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 144-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 144-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 144-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 144-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 144-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 144-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 144-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 144-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 144-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 144-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 144-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 144-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 144-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 144-120(pA10,mldab5,2), rA0
++ movaps 144-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 144-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 144-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 144-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 144-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 144-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 144-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 144-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 144-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 144-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 144-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 144-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 144-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 144-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 40
+- movaps 160-120(pA10,mldab5,2), rA0
+- movaps 160-120(pB0), rB0
+- mulps rB0, rA0
+- addq $176, pB0
+- addps rA0, rC0
+- movaps 160-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 160-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 160-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 160-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 160-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 160-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 160-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 160-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 160-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 160-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 160-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 160-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addq $176, pA10
+- addps rA0, rC12
+- mulps 160-120(pA5,ldab,8), rB0
+- addps rB0, rC13
+- addq $176, pA5
++ movaps 160-120(pA10,mldab5,2), rA0
++ movaps 160-120(pB0), rB0
++ mulps rB0, rA0
++ addq $176, pB0
++ addps rA0, rC0
++ movaps 160-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 160-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 160-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 160-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 160-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 160-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 160-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 160-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 160-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 160-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 160-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 160-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addq $176, pA10
++ addps rA0, rC12
++ mulps 160-120(pA5,ldab,8), rB0
++ addps rB0, rC13
++ addq $176, pA5
+ #else
+- addq $176, pB0
+- addq $176, pA10
+- addq $176, pA5
++ addq $176, pB0
++ addq $176, pA10
++ addq $176, pA5
+ #endif
+
+ #if KB > 44
+- movaps 0-120(pA10,mldab5,2), rA0
+- movaps 0-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 0-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 0-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 0-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 0-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 0-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 0-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 0-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 0-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 0-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 0-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 0-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 0-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 0-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 0-120(pA10,mldab5,2), rA0
++ movaps 0-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 0-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 0-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 0-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 0-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 0-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 0-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 0-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 0-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 0-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 0-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 0-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 0-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 0-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 48
+- movaps 16-120(pA10,mldab5,2), rA0
+- movaps 16-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 16-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 16-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 16-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 16-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 16-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 16-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 16-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 16-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 16-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 16-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 16-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 16-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 16-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 16-120(pA10,mldab5,2), rA0
++ movaps 16-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 16-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 16-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 16-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 16-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 16-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 16-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 16-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 16-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 16-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 16-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 16-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 16-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 16-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 52
+- movaps 32-120(pA10,mldab5,2), rA0
+- movaps 32-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 32-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 32-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 32-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 32-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 32-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 32-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 32-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 32-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 32-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 32-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 32-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 32-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 32-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 32-120(pA10,mldab5,2), rA0
++ movaps 32-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 32-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 32-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 32-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 32-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 32-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 32-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 32-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 32-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 32-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 32-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 32-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 32-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 32-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 56
+- movaps 48-120(pA10,mldab5,2), rA0
+- movaps 48-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 48-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 48-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 48-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 48-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 48-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 48-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 48-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 48-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 48-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 48-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 48-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 48-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 48-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 48-120(pA10,mldab5,2), rA0
++ movaps 48-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 48-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 48-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 48-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 48-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 48-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 48-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 48-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 48-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 48-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 48-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 48-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 48-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 48-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 60
+- movaps 64-120(pA10,mldab5,2), rA0
+- movaps 64-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 64-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 64-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 64-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 64-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 64-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 64-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 64-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 64-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 64-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 64-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 64-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 64-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 64-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 64-120(pA10,mldab5,2), rA0
++ movaps 64-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 64-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 64-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 64-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 64-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 64-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 64-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 64-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 64-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 64-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 64-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 64-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 64-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 64-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 64
+- movaps 80-120(pA10,mldab5,2), rA0
+- movaps 80-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 80-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 80-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 80-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 80-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 80-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 80-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 80-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 80-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 80-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 80-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 80-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 80-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 80-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 80-120(pA10,mldab5,2), rA0
++ movaps 80-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 80-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 80-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 80-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 80-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 80-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 80-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 80-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 80-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 80-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 80-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 80-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 80-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 80-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 68
+- movaps 96-120(pA10,mldab5,2), rA0
+- movaps 96-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 96-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 96-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 96-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 96-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 96-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 96-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 96-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 96-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 96-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 96-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 96-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 96-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 96-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 96-120(pA10,mldab5,2), rA0
++ movaps 96-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 96-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 96-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 96-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 96-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 96-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 96-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 96-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 96-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 96-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 96-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 96-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 96-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 96-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 72
+- movaps 112-120(pA10,mldab5,2), rA0
+- movaps 112-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 112-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 112-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 112-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 112-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 112-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 112-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 112-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 112-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 112-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 112-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 112-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 112-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 112-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 112-120(pA10,mldab5,2), rA0
++ movaps 112-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 112-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 112-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 112-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 112-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 112-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 112-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 112-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 112-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 112-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 112-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 112-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 112-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 112-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 76
+- movaps 128-120(pA10,mldab5,2), rA0
+- movaps 128-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 128-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 128-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 128-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 128-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 128-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 128-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 128-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 128-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 128-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 128-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 128-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 128-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 128-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 128-120(pA10,mldab5,2), rA0
++ movaps 128-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 128-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 128-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 128-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 128-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 128-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 128-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 128-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 128-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 128-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 128-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 128-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 128-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 128-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 80
+- movaps 144-120(pA10,mldab5,2), rA0
+- movaps 144-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 144-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 144-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 144-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 144-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 144-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 144-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 144-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 144-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 144-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 144-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 144-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 144-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 144-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 144-120(pA10,mldab5,2), rA0
++ movaps 144-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 144-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 144-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 144-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 144-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 144-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 144-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 144-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 144-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 144-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 144-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 144-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 144-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 144-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ /*UKLOOP */
+@@ -1220,234 +1227,234 @@ UMLOOP:
+ * Get these bastard things summed up correctly
+ */
+
+- /* rC0 = c0a c0b c0c c0d */
+- /* rC1 = c1a c1b c1c c1d */
+- /* rC2 = c2a c2b c2c c2d */
+- /* rC3 = c3a c3b c3c c3d */
++ /* rC0 = c0a c0b c0c c0d */
++ /* rC1 = c1a c1b c1c c1d */
++ /* rC2 = c2a c2b c2c c2d */
++ /* rC3 = c3a c3b c3c c3d */
+ /* */
+- movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */
+- prefC((pC))
+- prefC(64(pC))
+- movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */
+- unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */
+- unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */
+- unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */
+- movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */
+- unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */
+- movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */
+- movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */
+- movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */
+- addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */
+- movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */
+- movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */
+- movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */
+- addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */
+- movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */
+- addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */
+-
+-
+- /* rC4 = c4a c4b c4c c4d */
+- /* rC5 = c5a c5b c5c c5d */
+- /* rC6 = c6a c6b c6c c6d */
+- /* rC7 = c7a c7b c7c c7d */
+- /* rC8 = c08a c08b c08c c08d */
+- /* rC9 = c09a c09b c09c c09d */
+- /* rC10 = c10a c10b c10c c10d */
+- /* rC11 = c11a c11b c11c c11d */
+- /* rC12 = c12a c12b c12c c12d */
+- /* rC13 = c13a c13b c13c c13d */
++ movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */
++ prefC((pC))
++ prefC(64(pC))
++ movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */
++ unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */
++ unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */
++ unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */
++ movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */
++ unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */
++ movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */
++ movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */
++ movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */
++ addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */
++ movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */
++ movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */
++ movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */
++ addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */
++ movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */
++ addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */
++
++
++ /* rC4 = c4a c4b c4c c4d */
++ /* rC5 = c5a c5b c5c c5d */
++ /* rC6 = c6a c6b c6c c6d */
++ /* rC7 = c7a c7b c7c c7d */
++ /* rC8 = c08a c08b c08c c08d */
++ /* rC9 = c09a c09b c09c c09d */
++ /* rC10 = c10a c10b c10c c10d */
++ /* rC11 = c11a c11b c11c c11d */
++ /* rC12 = c12a c12b c12c c12d */
++ /* rC13 = c13a c13b c13c c13d */
+ /* */
+- movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */
+- prefC(128(pC))
++ movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */
++ prefC(128(pC))
+ #ifdef SREAL
+- pref2((pfA))
++ pref2((pfA))
+ #else
+- prefC(192(pC))
++ prefC(192(pC))
+ #endif
+- movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */
+- movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */
+- unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */
+- unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */
+- unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */
+- unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */
+- unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */
+- movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */
+- unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */
+- movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */
+- movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */
+- unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */
+- movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */
+- movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */
+- addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */
++ movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */
++ movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */
++ unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */
++ unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */
++ unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */
++ unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */
++ unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */
++ movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */
++ unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */
++ movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */
++ movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */
++ unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */
++ movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */
++ movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */
++ addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */
+ #ifdef BETAX
+ #ifdef SREAL
+- movups (pC), rA0
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- movups 16(pC), rC4
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movups 32(pC), rC5
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- movlps 48(pC), rC1
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+- pref2(64(pfA))
+- mulps BOF(%rsp), rA0
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- mulps BOF(%rsp), rC4
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- mulps BOF(%rsp), rC5
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+- mulps BOF(%rsp), rC1
++ movups (pC), rA0
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ movups 16(pC), rC4
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movups 32(pC), rC5
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ movlps 48(pC), rC1
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ pref2(64(pfA))
++ mulps BOF(%rsp), rA0
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ mulps BOF(%rsp), rC4
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ mulps BOF(%rsp), rC5
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ mulps BOF(%rsp), rC1
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+- addps rA0, rC3
+- addq $68, pfA
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+- addps rC4, rC7
+- addps rC5, rC11
+- addps rC1, rC12
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ addps rA0, rC3
++ addq $68, pfA
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ addps rC4, rC7
++ addps rC5, rC11
++ addps rC1, rC12
+ #else /* BETA = X, complex type */
+- movups (pC), rA0
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- movups 16(pC), rC4
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */
+- movups 32(pC), rC4 /* rC4 = c4 X c5 X */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movups 48(pC), rC5 /* rC5 = c6 X c7 X */
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */
+- movups 64(pC), rC5 /* rC5 = c8 X c9 X */
+- movups 80(pC), rC1 /* rC1 = c10 X c11 X */
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- movss 96(pC), rC1
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movss 104(pC), rB0
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- unpcklps rB0, rC1
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+- prefC(256(pC))
+- mulps BOF(%rsp), rA0
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- mulps BOF(%rsp), rC4
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- mulps BOF(%rsp), rC5
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+- mulps BOF(%rsp), rC1
++ movups (pC), rA0
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ movups 16(pC), rC4
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */
++ movups 32(pC), rC4 /* rC4 = c4 X c5 X */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movups 48(pC), rC5 /* rC5 = c6 X c7 X */
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */
++ movups 64(pC), rC5 /* rC5 = c8 X c9 X */
++ movups 80(pC), rC1 /* rC1 = c10 X c11 X */
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ movss 96(pC), rC1
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movss 104(pC), rB0
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ unpcklps rB0, rC1
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ prefC(256(pC))
++ mulps BOF(%rsp), rA0
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ mulps BOF(%rsp), rC4
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ mulps BOF(%rsp), rC5
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ mulps BOF(%rsp), rC1
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+- addps rA0, rC3
+- prefC(192(pC))
+- addq $68, pfA
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+- addps rC4, rC7
+- addps rC5, rC11
+- addps rC1, rC12
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ addps rA0, rC3
++ prefC(192(pC))
++ addq $68, pfA
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ addps rC4, rC7
++ addps rC5, rC11
++ addps rC1, rC12
+ #endif
+
+ #else
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+ #ifdef SREAL
+- pref2(64(pfA))
++ pref2(64(pfA))
+ #else
+- prefC(256(pC))
++ prefC(256(pC))
+ #endif
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+ #ifndef SREAL
+- prefC(192(pC))
++ prefC(192(pC))
+ #endif
+- addq $68, pfA
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ addq $68, pfA
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+
+ #endif
+ /*
+ * Write results back to C; pC += 14;
+ */
+ #ifdef SREAL
+- movups rC3, (pC)
+- movups rC7, 16(pC)
+- movups rC11, 32(pC)
+- movlps rC12, 48(pC)
+- addq $56, pC
++ movups rC3, (pC)
++ movups rC7, 16(pC)
++ movups rC11, 32(pC)
++ movlps rC12, 48(pC)
++ addq $56, pC
+ #else
+- movss rC3, (pC)
+- movss rC7, 32(pC)
+- movhlps rC3, rC0
+- movhlps rC7, rC6
+- movss rC0, 16(pC)
+- movss rC6, 48(pC)
+- shufps $0x55, rC3, rC3
+- shufps $0x55, rC7, rC7
+- movss rC3, 8(pC)
+- movss rC7, 40(pC)
+- shufps $0x55, rC0, rC0
+- shufps $0x55, rC6, rC6
+- movss rC0, 24(pC)
+- movss rC6, 56(pC)
+-
+- movss rC11, 64(pC)
+- movhlps rC11, rC2
+- movss rC12, 96(pC)
+- movss rC2, 80(pC)
+- shufps $0x55, rC11, rC11
+- shufps $0x55, rC12, rC12
+- movss rC11, 72(pC)
+- shufps $0x55, rC2, rC2
+- movss rC12, 104(pC)
+- movss rC2, 88(pC)
++ movss rC3, (pC)
++ movss rC7, 32(pC)
++ movhlps rC3, rC0
++ movhlps rC7, rC6
++ movss rC0, 16(pC)
++ movss rC6, 48(pC)
++ shufps $0x55, rC3, rC3
++ shufps $0x55, rC7, rC7
++ movss rC3, 8(pC)
++ movss rC7, 40(pC)
++ shufps $0x55, rC0, rC0
++ shufps $0x55, rC6, rC6
++ movss rC0, 24(pC)
++ movss rC6, 56(pC)
++
++ movss rC11, 64(pC)
++ movhlps rC11, rC2
++ movss rC12, 96(pC)
++ movss rC2, 80(pC)
++ shufps $0x55, rC11, rC11
++ shufps $0x55, rC12, rC12
++ movss rC11, 72(pC)
++ shufps $0x55, rC2, rC2
++ movss rC12, 104(pC)
++ movss rC2, 88(pC)
+
+- addq $112, pC
++ addq $112, pC
+ #endif
+ /*
+ * Write results back to C
+ */
+- addq $NB14so-176, pA5
+- addq $NB14so-176, pA10
+- subq $176, pB0
++ addq $NB14so-176, pA5
++ addq $NB14so-176, pA10
++ subq $176, pB0
+ /*
+ * pC += 14; pA += 14*NB; pB -= NB;
+ */
+ /*
+ * while (pA != stM);
+ */
+- subq $1, stM
+- jne UMLOOP
++ subq $1, stM
++ jne UMLOOP
+ #endif
+
+ /*
+@@ -1459,994 +1466,994 @@ MLAST:
+ #endif
+ /*UKLOOP: */
+ #ifdef BETA1
+- movaps 0-120(pA10,mldab5,2), rC0
+- movaps 0-120(pB0), rB0
+- mulps rB0, rC0
+- addss (pC), rC0
+- movaps 0-120(pA5, mldab,4), rC1
+- mulps rB0, rC1
+- addss CMUL(4)(pC), rC1
+- movaps 0-120(pA10, mldab,8), rC2
+- mulps rB0, rC2
+- addss CMUL(8)(pC), rC2
+- movaps 0-120(pA5, mldab,2), rC3
+- mulps rB0, rC3
+- addss CMUL(12)(pC), rC3
+- movaps 0-120(pA5, mldab), rC4
+- mulps rB0, rC4
+- addss CMUL(16)(pC), rC4
+- movaps 0-120(pA5), rC5
+- mulps rB0, rC5
+- addss CMUL(20)(pC), rC5
+- movaps 0-120(pA5, ldab), rC6
+- mulps rB0, rC6
+- addss CMUL(24)(pC), rC6
+- movaps 0-120(pA5, ldab,2), rC7
+- mulps rB0, rC7
+- addss CMUL(28)(pC), rC7
+- movaps 0-120(pA10, mldab,2), rC8
+- mulps rB0, rC8
+- addss CMUL(32)(pC), rC8
+- movaps 0-120(pA5,ldab,4), rC9
+- mulps rB0, rC9
+- addss CMUL(36)(pC), rC9
+- movaps 0-120(pA10), rC10
+- mulps rB0, rC10
+- addss CMUL(40)(pC), rC10
+- movaps 0-120(pA10,ldab), rC11
+- mulps rB0, rC11
+- addss CMUL(44)(pC), rC11
+- movaps 0-120(pA10,ldab,2), rC12
+- mulps rB0, rC12
+- addss CMUL(48)(pC), rC12
+- movaps 0-120(pA5,ldab,8), rC13
+- mulps rB0, rC13
+- addss CMUL(52)(pC), rC13
++ movaps 0-120(pA10,mldab5,2), rC0
++ movaps 0-120(pB0), rB0
++ mulps rB0, rC0
++ addss (pC), rC0
++ movaps 0-120(pA5, mldab,4), rC1
++ mulps rB0, rC1
++ addss CMUL(4)(pC), rC1
++ movaps 0-120(pA10, mldab,8), rC2
++ mulps rB0, rC2
++ addss CMUL(8)(pC), rC2
++ movaps 0-120(pA5, mldab,2), rC3
++ mulps rB0, rC3
++ addss CMUL(12)(pC), rC3
++ movaps 0-120(pA5, mldab), rC4
++ mulps rB0, rC4
++ addss CMUL(16)(pC), rC4
++ movaps 0-120(pA5), rC5
++ mulps rB0, rC5
++ addss CMUL(20)(pC), rC5
++ movaps 0-120(pA5, ldab), rC6
++ mulps rB0, rC6
++ addss CMUL(24)(pC), rC6
++ movaps 0-120(pA5, ldab,2), rC7
++ mulps rB0, rC7
++ addss CMUL(28)(pC), rC7
++ movaps 0-120(pA10, mldab,2), rC8
++ mulps rB0, rC8
++ addss CMUL(32)(pC), rC8
++ movaps 0-120(pA5,ldab,4), rC9
++ mulps rB0, rC9
++ addss CMUL(36)(pC), rC9
++ movaps 0-120(pA10), rC10
++ mulps rB0, rC10
++ addss CMUL(40)(pC), rC10
++ movaps 0-120(pA10,ldab), rC11
++ mulps rB0, rC11
++ addss CMUL(44)(pC), rC11
++ movaps 0-120(pA10,ldab,2), rC12
++ mulps rB0, rC12
++ addss CMUL(48)(pC), rC12
++ movaps 0-120(pA5,ldab,8), rC13
++ mulps rB0, rC13
++ addss CMUL(52)(pC), rC13
+ #else
+- movaps 0-120(pA10,mldab5,2), rC0
+- movaps 0-120(pB0), rC13
+- mulps rC13, rC0
+- movaps 0-120(pA5, mldab,4), rC1
+- mulps rC13, rC1
+- movaps 0-120(pA10, mldab,8), rC2
+- mulps rC13, rC2
+- movaps 0-120(pA5, mldab,2), rC3
+- mulps rC13, rC3
+- movaps 0-120(pA5, mldab), rC4
+- mulps rC13, rC4
+- movaps 0-120(pA5), rC5
+- mulps rC13, rC5
+- movaps 0-120(pA5, ldab), rC6
+- mulps rC13, rC6
+- movaps 0-120(pA5, ldab,2), rC7
+- mulps rC13, rC7
+- movaps 0-120(pA10, mldab,2), rC8
+- mulps rC13, rC8
+- movaps 0-120(pA5,ldab,4), rC9
+- mulps rC13, rC9
+- movaps 0-120(pA10), rC10
+- mulps rC13, rC10
+- movaps 0-120(pA10,ldab), rC11
+- mulps rC13, rC11
+- movaps 0-120(pA10,ldab,2), rC12
+- mulps rC13, rC12
+- mulps 0-120(pA5,ldab,8), rC13
++ movaps 0-120(pA10,mldab5,2), rC0
++ movaps 0-120(pB0), rC13
++ mulps rC13, rC0
++ movaps 0-120(pA5, mldab,4), rC1
++ mulps rC13, rC1
++ movaps 0-120(pA10, mldab,8), rC2
++ mulps rC13, rC2
++ movaps 0-120(pA5, mldab,2), rC3
++ mulps rC13, rC3
++ movaps 0-120(pA5, mldab), rC4
++ mulps rC13, rC4
++ movaps 0-120(pA5), rC5
++ mulps rC13, rC5
++ movaps 0-120(pA5, ldab), rC6
++ mulps rC13, rC6
++ movaps 0-120(pA5, ldab,2), rC7
++ mulps rC13, rC7
++ movaps 0-120(pA10, mldab,2), rC8
++ mulps rC13, rC8
++ movaps 0-120(pA5,ldab,4), rC9
++ mulps rC13, rC9
++ movaps 0-120(pA10), rC10
++ mulps rC13, rC10
++ movaps 0-120(pA10,ldab), rC11
++ mulps rC13, rC11
++ movaps 0-120(pA10,ldab,2), rC12
++ mulps rC13, rC12
++ mulps 0-120(pA5,ldab,8), rC13
+ #endif
+
+ #if KB > 4
+- movaps 16-120(pA10,mldab5,2), rA0
+- movaps 16-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 16-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 16-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 16-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 16-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 16-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 16-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 16-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 16-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 16-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 16-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 16-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 16-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 16-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 16-120(pA10,mldab5,2), rA0
++ movaps 16-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 16-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 16-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 16-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 16-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 16-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 16-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 16-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 16-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 16-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 16-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 16-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 16-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 16-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 8
+- movaps 32-120(pA10,mldab5,2), rA0
+- movaps 32-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 32-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 32-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 32-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 32-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 32-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 32-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 32-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 32-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 32-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 32-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 32-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 32-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 32-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 32-120(pA10,mldab5,2), rA0
++ movaps 32-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 32-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 32-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 32-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 32-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 32-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 32-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 32-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 32-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 32-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 32-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 32-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 32-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 32-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 12
+- movaps 48-120(pA10,mldab5,2), rA0
+- movaps 48-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 48-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 48-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 48-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 48-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 48-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 48-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 48-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 48-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 48-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 48-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 48-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 48-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 48-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 48-120(pA10,mldab5,2), rA0
++ movaps 48-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 48-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 48-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 48-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 48-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 48-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 48-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 48-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 48-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 48-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 48-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 48-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 48-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 48-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 16
+- movaps 64-120(pA10,mldab5,2), rA0
+- movaps 64-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 64-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 64-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 64-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 64-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 64-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 64-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 64-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 64-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 64-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 64-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 64-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 64-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 64-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 64-120(pA10,mldab5,2), rA0
++ movaps 64-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 64-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 64-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 64-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 64-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 64-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 64-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 64-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 64-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 64-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 64-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 64-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 64-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 64-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 20
+- movaps 80-120(pA10,mldab5,2), rA0
+- movaps 80-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 80-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 80-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 80-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 80-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 80-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 80-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 80-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 80-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 80-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 80-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 80-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 80-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 80-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 80-120(pA10,mldab5,2), rA0
++ movaps 80-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 80-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 80-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 80-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 80-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 80-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 80-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 80-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 80-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 80-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 80-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 80-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 80-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 80-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 24
+- movaps 96-120(pA10,mldab5,2), rA0
+- movaps 96-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 96-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 96-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 96-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 96-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 96-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 96-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 96-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 96-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 96-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 96-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 96-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 96-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 96-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 96-120(pA10,mldab5,2), rA0
++ movaps 96-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 96-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 96-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 96-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 96-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 96-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 96-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 96-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 96-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 96-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 96-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 96-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 96-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 96-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 28
+- movaps 112-120(pA10,mldab5,2), rA0
+- movaps 112-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 112-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 112-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 112-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 112-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 112-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 112-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 112-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 112-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 112-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 112-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 112-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 112-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 112-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 112-120(pA10,mldab5,2), rA0
++ movaps 112-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 112-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 112-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 112-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 112-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 112-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 112-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 112-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 112-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 112-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 112-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 112-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 112-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 112-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 32
+- movaps 128-120(pA10,mldab5,2), rA0
+- movaps 128-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 128-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 128-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 128-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 128-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 128-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 128-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 128-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 128-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 128-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 128-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 128-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 128-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 128-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 128-120(pA10,mldab5,2), rA0
++ movaps 128-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 128-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 128-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 128-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 128-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 128-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 128-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 128-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 128-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 128-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 128-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 128-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 128-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 128-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 36
+- movaps 144-120(pA10,mldab5,2), rA0
+- movaps 144-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 144-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 144-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 144-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 144-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 144-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 144-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 144-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 144-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 144-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 144-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 144-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 144-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 144-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 144-120(pA10,mldab5,2), rA0
++ movaps 144-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 144-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 144-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 144-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 144-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 144-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 144-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 144-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 144-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 144-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 144-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 144-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 144-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 144-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+- prefB((pB,ldab))
+- prefB(64(pB,ldab))
++ prefB((pB,ldab))
++ prefB(64(pB,ldab))
+
+ #if KB > 40
+- movaps 160-120(pA10,mldab5,2), rA0
+- movaps 160-120(pB0), rB0
+- mulps rB0, rA0
+- addq $176, pB0
+- addps rA0, rC0
+- movaps 160-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 160-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 160-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 160-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 160-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 160-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 160-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 160-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 160-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 160-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 160-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 160-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addq $176, pA10
+- addps rA0, rC12
+- mulps 160-120(pA5,ldab,8), rB0
+- addps rB0, rC13
+- addq $176, pA5
++ movaps 160-120(pA10,mldab5,2), rA0
++ movaps 160-120(pB0), rB0
++ mulps rB0, rA0
++ addq $176, pB0
++ addps rA0, rC0
++ movaps 160-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 160-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 160-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 160-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 160-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 160-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 160-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 160-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 160-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 160-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 160-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 160-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addq $176, pA10
++ addps rA0, rC12
++ mulps 160-120(pA5,ldab,8), rB0
++ addps rB0, rC13
++ addq $176, pA5
+ #else
+- addq $176, pB0
+- addq $176, pA10
+- addq $176, pA5
++ addq $176, pB0
++ addq $176, pA10
++ addq $176, pA5
+ #endif
+
+ #if KB > 44
+- movaps 0-120(pA10,mldab5,2), rA0
+- movaps 0-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 0-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 0-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 0-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 0-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 0-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 0-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 0-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 0-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 0-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 0-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 0-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 0-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 0-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 0-120(pA10,mldab5,2), rA0
++ movaps 0-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 0-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 0-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 0-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 0-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 0-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 0-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 0-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 0-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 0-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 0-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 0-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 0-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 0-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 48
+- movaps 16-120(pA10,mldab5,2), rA0
+- movaps 16-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 16-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 16-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 16-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 16-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 16-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 16-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 16-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 16-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 16-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 16-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 16-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 16-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 16-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 16-120(pA10,mldab5,2), rA0
++ movaps 16-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 16-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 16-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 16-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 16-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 16-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 16-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 16-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 16-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 16-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 16-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 16-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 16-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 16-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 52
+- movaps 32-120(pA10,mldab5,2), rA0
+- movaps 32-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 32-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 32-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 32-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 32-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 32-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 32-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 32-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 32-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 32-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 32-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 32-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 32-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 32-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 32-120(pA10,mldab5,2), rA0
++ movaps 32-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 32-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 32-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 32-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 32-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 32-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 32-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 32-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 32-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 32-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 32-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 32-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 32-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 32-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 56
+- movaps 48-120(pA10,mldab5,2), rA0
+- movaps 48-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 48-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 48-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 48-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 48-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 48-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 48-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 48-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 48-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 48-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 48-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 48-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 48-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 48-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 48-120(pA10,mldab5,2), rA0
++ movaps 48-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 48-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 48-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 48-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 48-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 48-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 48-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 48-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 48-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 48-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 48-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 48-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 48-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 48-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 60
+- movaps 64-120(pA10,mldab5,2), rA0
+- movaps 64-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 64-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 64-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 64-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 64-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 64-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 64-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 64-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 64-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 64-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 64-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 64-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 64-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 64-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 64-120(pA10,mldab5,2), rA0
++ movaps 64-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 64-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 64-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 64-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 64-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 64-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 64-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 64-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 64-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 64-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 64-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 64-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 64-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 64-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+- prefB(128-176(pB,ldab))
+- prefB(192-176(pB,ldab))
++ prefB(128-176(pB,ldab))
++ prefB(192-176(pB,ldab))
+
+ #if KB > 64
+- movaps 80-120(pA10,mldab5,2), rA0
+- movaps 80-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 80-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 80-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 80-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 80-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 80-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 80-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 80-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 80-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 80-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 80-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 80-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 80-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 80-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 80-120(pA10,mldab5,2), rA0
++ movaps 80-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 80-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 80-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 80-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 80-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 80-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 80-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 80-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 80-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 80-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 80-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 80-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 80-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 80-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 68
+- movaps 96-120(pA10,mldab5,2), rA0
+- movaps 96-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 96-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 96-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 96-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 96-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 96-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 96-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 96-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 96-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 96-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 96-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 96-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 96-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 96-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 96-120(pA10,mldab5,2), rA0
++ movaps 96-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 96-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 96-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 96-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 96-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 96-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 96-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 96-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 96-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 96-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 96-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 96-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 96-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 96-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 72
+- movaps 112-120(pA10,mldab5,2), rA0
+- movaps 112-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 112-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 112-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 112-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 112-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 112-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 112-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 112-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 112-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 112-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 112-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 112-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 112-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 112-120(pA5,ldab,8), rB0
+- prefC((pC))
+- prefC((pC,incCn))
+- addps rB0, rC13
++ movaps 112-120(pA10,mldab5,2), rA0
++ movaps 112-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 112-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 112-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 112-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 112-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 112-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 112-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 112-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 112-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 112-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 112-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 112-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 112-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 112-120(pA5,ldab,8), rB0
++ prefC((pC))
++ prefC((pC,incCn))
++ addps rB0, rC13
+ #else
+- prefC((pC))
+- prefC((pC,incCn))
++ prefC((pC))
++ prefC((pC,incCn))
+ #endif
+
+ #if KB > 76
+- movaps 128-120(pA10,mldab5,2), rA0
+- movaps 128-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 128-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 128-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 128-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 128-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 128-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 128-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 128-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 128-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 128-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 128-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 128-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 128-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 128-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 128-120(pA10,mldab5,2), rA0
++ movaps 128-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 128-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 128-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 128-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 128-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 128-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 128-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 128-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 128-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 128-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 128-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 128-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 128-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 128-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ #if KB > 80
+- movaps 144-120(pA10,mldab5,2), rA0
+- movaps 144-120(pB0), rB0
+- mulps rB0, rA0
+- addps rA0, rC0
+- movaps 144-120(pA5, mldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC1
+- movaps 144-120(pA10, mldab,8), rA0
+- mulps rB0, rA0
+- addps rA0, rC2
+- movaps 144-120(pA5, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC3
+- movaps 144-120(pA5, mldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC4
+- movaps 144-120(pA5), rA0
+- mulps rB0, rA0
+- addps rA0, rC5
+- movaps 144-120(pA5, ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC6
+- movaps 144-120(pA5, ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC7
+- movaps 144-120(pA10, mldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC8
+- movaps 144-120(pA5,ldab,4), rA0
+- mulps rB0, rA0
+- addps rA0, rC9
+- movaps 144-120(pA10), rA0
+- mulps rB0, rA0
+- addps rA0, rC10
+- movaps 144-120(pA10,ldab), rA0
+- mulps rB0, rA0
+- addps rA0, rC11
+- movaps 144-120(pA10,ldab,2), rA0
+- mulps rB0, rA0
+- addps rA0, rC12
+- mulps 144-120(pA5,ldab,8), rB0
+- addps rB0, rC13
++ movaps 144-120(pA10,mldab5,2), rA0
++ movaps 144-120(pB0), rB0
++ mulps rB0, rA0
++ addps rA0, rC0
++ movaps 144-120(pA5, mldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC1
++ movaps 144-120(pA10, mldab,8), rA0
++ mulps rB0, rA0
++ addps rA0, rC2
++ movaps 144-120(pA5, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC3
++ movaps 144-120(pA5, mldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC4
++ movaps 144-120(pA5), rA0
++ mulps rB0, rA0
++ addps rA0, rC5
++ movaps 144-120(pA5, ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC6
++ movaps 144-120(pA5, ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC7
++ movaps 144-120(pA10, mldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC8
++ movaps 144-120(pA5,ldab,4), rA0
++ mulps rB0, rA0
++ addps rA0, rC9
++ movaps 144-120(pA10), rA0
++ mulps rB0, rA0
++ addps rA0, rC10
++ movaps 144-120(pA10,ldab), rA0
++ mulps rB0, rA0
++ addps rA0, rC11
++ movaps 144-120(pA10,ldab,2), rA0
++ mulps rB0, rA0
++ addps rA0, rC12
++ mulps 144-120(pA5,ldab,8), rB0
++ addps rB0, rC13
+ #endif
+
+ /*UKLOOP */
+@@ -2454,202 +2461,202 @@ MLAST:
+ * Get these bastard things summed up correctly
+ */
+
+- /* rC0 = c0a c0b c0c c0d */
+- /* rC1 = c1a c1b c1c c1d */
+- /* rC2 = c2a c2b c2c c2d */
+- /* rC3 = c3a c3b c3c c3d */
++ /* rC0 = c0a c0b c0c c0d */
++ /* rC1 = c1a c1b c1c c1d */
++ /* rC2 = c2a c2b c2c c2d */
++ /* rC3 = c3a c3b c3c c3d */
+ /* */
+- movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */
+- prefC(64(pC,incCn))
+- prefB(256-176(pB,ldab))
+- movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */
+- unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */
+- unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */
+- unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */
+- movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */
+- unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */
+- movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */
+- movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */
+- movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */
+- addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */
+- movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */
+- movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */
+- movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */
+- addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */
+- movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */
+- addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */
+-
+-
+- /* rC4 = c4a c4b c4c c4d */
+- /* rC5 = c5a c5b c5c c5d */
+- /* rC6 = c6a c6b c6c c6d */
+- /* rC7 = c7a c7b c7c c7d */
+- /* rC8 = c08a c08b c08c c08d */
+- /* rC9 = c09a c09b c09c c09d */
+- /* rC10 = c10a c10b c10c c10d */
+- /* rC11 = c11a c11b c11c c11d */
+- /* rC12 = c12a c12b c12c c12d */
+- /* rC13 = c13a c13b c13c c13d */
++ movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */
++ prefC(64(pC,incCn))
++ prefB(256-176(pB,ldab))
++ movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */
++ unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */
++ unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */
++ unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */
++ movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */
++ unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */
++ movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */
++ movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */
++ movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */
++ addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */
++ movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */
++ movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */
++ movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */
++ addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */
++ movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */
++ addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */
++
++
++ /* rC4 = c4a c4b c4c c4d */
++ /* rC5 = c5a c5b c5c c5d */
++ /* rC6 = c6a c6b c6c c6d */
++ /* rC7 = c7a c7b c7c c7d */
++ /* rC8 = c08a c08b c08c c08d */
++ /* rC9 = c09a c09b c09c c09d */
++ /* rC10 = c10a c10b c10c c10d */
++ /* rC11 = c11a c11b c11c c11d */
++ /* rC12 = c12a c12b c12c c12d */
++ /* rC13 = c13a c13b c13c c13d */
+ /* */
+- movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */
+- movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */
+- movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */
+- unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */
+- unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */
+- unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */
+- unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */
+- unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */
+- movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */
+- unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */
+- movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */
+- movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */
+- unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */
+- movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */
+- movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */
+- addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */
++ movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */
++ movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */
++ movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */
++ unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */
++ unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */
++ unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */
++ unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */
++ unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */
++ movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */
++ unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */
++ movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */
++ movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */
++ unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */
++ movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */
++ movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */
++ addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */
+ #ifdef BETAX
+ #ifdef SREAL
+- movups (pC), rA0
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- movups 16(pC), rC4
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movups 32(pC), rC5
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- movlps 48(pC), rC1
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+- mulps BOF(%rsp), rA0
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- mulps BOF(%rsp), rC4
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- mulps BOF(%rsp), rC5
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+- mulps BOF(%rsp), rC1
++ movups (pC), rA0
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ movups 16(pC), rC4
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movups 32(pC), rC5
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ movlps 48(pC), rC1
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ mulps BOF(%rsp), rA0
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ mulps BOF(%rsp), rC4
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ mulps BOF(%rsp), rC5
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ mulps BOF(%rsp), rC1
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+- addps rA0, rC3
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+- addps rC4, rC7
+- addps rC5, rC11
+- prefB(320-176(pB,ldab))
+- addps rC1, rC12
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ addps rA0, rC3
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ addps rC4, rC7
++ addps rC5, rC11
++ prefB(320-176(pB,ldab))
++ addps rC1, rC12
+ #else /* BETA = X, complex type */
+- movups (pC), rA0
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- movups 16(pC), rC4
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */
+- movups 32(pC), rC4 /* rC4 = c4 X c5 X */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movups 48(pC), rC5 /* rC5 = c6 X c7 X */
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */
+- movups 64(pC), rC5 /* rC5 = c8 X c9 X */
+- movups 80(pC), rC1 /* rC1 = c10 X c11 X */
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- movss 96(pC), rC1
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movss 104(pC), rB0
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- unpcklps rB0, rC1
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+- mulps BOF(%rsp), rA0
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- mulps BOF(%rsp), rC4
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- mulps BOF(%rsp), rC5
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+- mulps BOF(%rsp), rC1
++ movups (pC), rA0
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ movups 16(pC), rC4
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */
++ movups 32(pC), rC4 /* rC4 = c4 X c5 X */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movups 48(pC), rC5 /* rC5 = c6 X c7 X */
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */
++ movups 64(pC), rC5 /* rC5 = c8 X c9 X */
++ movups 80(pC), rC1 /* rC1 = c10 X c11 X */
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ movss 96(pC), rC1
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movss 104(pC), rB0
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ unpcklps rB0, rC1
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ mulps BOF(%rsp), rA0
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ mulps BOF(%rsp), rC4
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ mulps BOF(%rsp), rC5
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ mulps BOF(%rsp), rC1
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+- addps rA0, rC3
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+- addps rC4, rC7
+- addps rC5, rC11
+- prefB(320-176(pB,ldab))
+- addps rC1, rC12
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ addps rA0, rC3
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ addps rC4, rC7
++ addps rC5, rC11
++ prefB(320-176(pB,ldab))
++ addps rC1, rC12
+ #endif
+
+ #else
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+- prefB(320-176(pB,ldab))
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ prefB(320-176(pB,ldab))
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+
+ #endif
+ /*
+ * Write results back to C; pC += 14;
+ */
+ #ifdef SREAL
+- movups rC3, (pC)
+- movups rC7, 16(pC)
+- movups rC11, 32(pC)
+- movlps rC12, 48(pC)
+-/* addq $56, pC */
++ movups rC3, (pC)
++ movups rC7, 16(pC)
++ movups rC11, 32(pC)
++ movlps rC12, 48(pC)
++/* addq $56, pC */
+ #else
+- movss rC3, (pC)
+- movss rC7, 32(pC)
+- movhlps rC3, rC0
+- movhlps rC7, rC6
+- movss rC0, 16(pC)
+- movss rC6, 48(pC)
+- shufps $0x55, rC3, rC3
+- shufps $0x55, rC7, rC7
+- movss rC3, 8(pC)
+- movss rC7, 40(pC)
+- shufps $0x55, rC0, rC0
+- shufps $0x55, rC6, rC6
+- movss rC0, 24(pC)
+- movss rC6, 56(pC)
+-
+- movss rC11, 64(pC)
+- movhlps rC11, rC2
+- movss rC12, 96(pC)
+- movss rC2, 80(pC)
+- shufps $0x55, rC11, rC11
+- shufps $0x55, rC12, rC12
+- movss rC11, 72(pC)
+- shufps $0x55, rC2, rC2
+- movss rC12, 104(pC)
+- movss rC2, 88(pC)
++ movss rC3, (pC)
++ movss rC7, 32(pC)
++ movhlps rC3, rC0
++ movhlps rC7, rC6
++ movss rC0, 16(pC)
++ movss rC6, 48(pC)
++ shufps $0x55, rC3, rC3
++ shufps $0x55, rC7, rC7
++ movss rC3, 8(pC)
++ movss rC7, 40(pC)
++ shufps $0x55, rC0, rC0
++ shufps $0x55, rC6, rC6
++ movss rC0, 24(pC)
++ movss rC6, 56(pC)
++
++ movss rC11, 64(pC)
++ movhlps rC11, rC2
++ movss rC12, 96(pC)
++ movss rC2, 80(pC)
++ shufps $0x55, rC11, rC11
++ shufps $0x55, rC12, rC12
++ movss rC11, 72(pC)
++ shufps $0x55, rC2, rC2
++ movss rC12, 104(pC)
++ movss rC2, 88(pC)
+
+-/* addq $112, pC */
++/* addq $112, pC */
+ #endif
+ /*
+ * Write results back to C
+@@ -2660,55 +2667,55 @@ MLAST:
+ /*
+ * while (pA != stM);
+ */
+-/* subq $1, stM */
+-/* jne UMLOOP */
++/* subq $1, stM */
++/* jne UMLOOP */
+ /*
+ * pC += 14; pA += 14*NB; pB -= NB;
+ */
+-/* subq $MBKBso-NB14so+176, pA5 */
+-/* subq $MBKBso-NB14so+176, pA10 */
+- subq incAm, pA5
+- subq incAm, pA10
+- addq $NBso-176, pB0
++/* subq $MBKBso-NB14so+176, pA5 */
++/* subq $MBKBso-NB14so+176, pA10 */
++ subq incAm, pA5
++ subq incAm, pA10
++ addq $NBso-176, pB0
+ /*
+ * while (pA != stM);
+ */
+-/* subq $1, stM */
+-/* jne UMLOOP */
++/* subq $1, stM */
++/* jne UMLOOP */
+ /*
+ * pC += incCn; pA -= NBNB; pB += NB;
+ */
+- addq incCn, pC
++ addq incCn, pC
+ /*
+ * while (pB != stN);
+ */
+- sub $1, stN
+- jne UNLOOP
++ sub $1, stN
++ jne UNLOOP
+
+ /*
+ * Restore callee-saved iregs
+ */
+ DONE:
+- movq -8(%rsp), %rbp
+- movq -16(%rsp), %rbx
++ movq -8(%rsp), %rbp
++ movq -16(%rsp), %rbx
+ #if MB == 0
+- movq -32(%rsp), %r12
+- movq -40(%rsp), %r13
++ movq -32(%rsp), %r12
++ movq -40(%rsp), %r13
+ #endif
+- ret
++ ret
+ #if MB == 0
+ MB_LT84:
+- cmp $70, stM
+- jne MB_LT70
+-/* movq $70/14, stM */
+- movq $5, stM
+- jmp MBFOUND
++ cmp $70, stM
++ jne MB_LT70
++/* movq $70/14, stM */
++ movq $5, stM
++ jmp MBFOUND
+ MB_LT70:
+- cmp $56, stM
+- jne MB_LT56
+-/* movq $56/14, stM */
+- movq $4, stM
+- jmp MBFOUND
++ cmp $56, stM
++ jne MB_LT56
++/* movq $56/14, stM */
++ movq $4, stM
++ jmp MBFOUND
+ MB_LT56:
+ cmp $42, stM
+ jne MB_LT42
+diff -rupN ATLAS/tune/blas/level1/scalsrch.c atlas-3.8.3/tune/blas/level1/scalsrch.c
+--- ATLAS/tune/blas/level1/scalsrch.c 2009-02-18 19:48:25.000000000 +0100
++++ atlas-3.8.3/tune/blas/level1/scalsrch.c 2009-11-12 13:45:48.141174024 +0100
+@@ -747,7 +747,7 @@ void GenMainRout(char pre, int n, int *i
+ /*
+ * Handle all special alpha cases
+ */
+- fprintf(fpout, "%sif ( SCALAR_IS_ZERO(alpha) )\n", spc);
++ /* fprintf(fpout, "%sif ( SCALAR_IS_ZERO(alpha) )\n", spc);
+ fprintf(fpout, "%s{\n", spc);
+ if (pre == 'c' || pre == 'z')
+ {
+@@ -756,7 +756,7 @@ void GenMainRout(char pre, int n, int *i
+ }
+ else fprintf(fpout, "%s Mjoin(PATL,set)(N, ATL_rzero, X, incx);\n", spc);
+ fprintf(fpout, "%s return;\n", spc);
+- fprintf(fpout, "%s}\n", spc);
++ fprintf(fpout, "%s}\n", spc); */
+ GenAlphCase(pre, spc, fpout, 1, n, ix, iy, ia, ib);
+ GenAlphCase(pre, spc, fpout, -1, n, ix, iy, ia, ib);
+ if (pre == 'c' || pre == 'z')
diff --git a/libraries/atlas/slack-desc b/libraries/atlas/slack-desc
new file mode 100644
index 0000000000..bed245ddbe
--- /dev/null
+++ b/libraries/atlas/slack-desc
@@ -0,0 +1,19 @@
+# HOW TO EDIT THIS FILE:
+# The "handy ruler" below makes it easier to edit a package description. Line
+# up the first '|' above the ':' following the base package name, and the '|'
+# on the right side marks the last column you can put a character in. You
+# must make exactly 11 lines for the formatting to be correct. It's also
+# customary to leave one space after the ':'.
+
+ |-----handy-ruler------------------------------------------------------|
+atlas: ATLAS (Automatically Tuned Linear Algebra Software)
+atlas:
+atlas: This is ATLAS (Automatically Tuned Linear Algebra Software), an
+atlas: ongoing research effort focusing on applying empirical techniques in
+atlas: order to provide portable performance. At present, it provides C and
+atlas: Fortran77 interfaces to a portably efficient BLAS implementation as
+atlas: well as a few routines from LAPACK.
+atlas:
+atlas: Homepage: http://math-atlas.sourceforge.net/
+atlas:
+atlas: