ref: 077e719dfbf9bf2582bed80026251cc0d108c16e
parent: 1eb373945455f1ba03fa1b221529d74ca2a778ad
author: cinap_lenrek <cinap_lenrek@felloff.net>
date: Sun Nov 19 19:10:35 EST 2017
libsec: write optimized _chachablock() function for amd64 / sse2 doing 4 quarterrounds in parallel using 128-bit vector registers. for the second round, shuffle the columns and then shuffle back. code is rather obvious. only trick here is that for the first quarterround PSHUFLW/PSHUFHW is used to swap the halfwords for the <<<16 rotation.
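
the idea above can be sketched in c with sse2 intrinsics. this is a rough illustration only, not the assembly from this commit: the names rotl16, quarterround4 and doubleround are made up here, and the sketch assumes the 16-word state is stored row by row and that the compiler targets x86 with sse2 enabled.

```c
#include <emmintrin.h>	/* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* rotate every 32-bit lane left by n bits (generic shift/or form) */
#define ROTL(v, n) \
	_mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32-(n)))

/*
 * <<<16 without shifts: swap the two halfwords of each 32-bit lane.
 * PSHUFLW covers the low 64 bits, PSHUFHW the high 64 bits.
 */
static __m128i
rotl16(__m128i v)
{
	v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 3, 0, 1));
	return _mm_shufflehi_epi16(v, _MM_SHUFFLE(2, 3, 0, 1));
}

/* four quarterrounds at once: lane i of a, b, c, d is one quarterround */
void
quarterround4(__m128i *a, __m128i *b, __m128i *c, __m128i *d)
{
	*a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = rotl16(*d);
	*c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL(*b, 12);
	*a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = ROTL(*d, 8);
	*c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL(*b, 7);
}

/* one double round (column round, then diagonal round) over x[16] */
void
doubleround(uint32_t x[16])
{
	__m128i a = _mm_loadu_si128((const __m128i*)&x[0]);
	__m128i b = _mm_loadu_si128((const __m128i*)&x[4]);
	__m128i c = _mm_loadu_si128((const __m128i*)&x[8]);
	__m128i d = _mm_loadu_si128((const __m128i*)&x[12]);

	quarterround4(&a, &b, &c, &d);		/* columns */

	/* rotate rows 1, 2, 3 left by 1, 2, 3 lanes: diagonals become columns */
	b = _mm_shuffle_epi32(b, _MM_SHUFFLE(0, 3, 2, 1));
	c = _mm_shuffle_epi32(c, _MM_SHUFFLE(1, 0, 3, 2));
	d = _mm_shuffle_epi32(d, _MM_SHUFFLE(2, 1, 0, 3));

	quarterround4(&a, &b, &c, &d);		/* diagonals */

	/* shuffle back to the original row layout */
	b = _mm_shuffle_epi32(b, _MM_SHUFFLE(2, 1, 0, 3));
	c = _mm_shuffle_epi32(c, _MM_SHUFFLE(1, 0, 3, 2));
	d = _mm_shuffle_epi32(d, _MM_SHUFFLE(0, 3, 2, 1));

	_mm_storeu_si128((__m128i*)&x[0], a);
	_mm_storeu_si128((__m128i*)&x[4], b);
	_mm_storeu_si128((__m128i*)&x[8], c);
	_mm_storeu_si128((__m128i*)&x[12], d);
}
```

a full chacha20 block would run doubleround 10 times on a copy of the state and then add the original state words back in; that part is left out of the sketch.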