A crc-64 implementation, using pclmul where possible. More...

#include "CxxUtils/crc64.h"
#include "CxxUtils/AthUnlikelyMacros.h"
#include <stdio.h>

Include dependency graph for crc64.cxx:

Classes
class	CxxUtils::CRCTable
	Precomputed tables and constants for the CRC calculation. More...

Namespaces
	CxxUtils

Functions
void	CxxUtils::deleteCRCTable (CxxUtils::CRCTable *table)
	Delete a CRCTable object. More...

std::unique_ptr< CRCTable >	CxxUtils::makeCRCTable (uint64_t p, uint64_t initial=0xffffffffffffffff)
	Initialize CRC tables and constants. More...

uint64_t	CxxUtils::crc64_bytewise (const CRCTable &table, const char *data, size_t data_len)
	Find the CRC-64 of a string, using a byte-by-byte algorithm. More...

uint64_t	CxxUtils::crc64_bytewise (const char *data, size_t data_len)
	Find the CRC-64 of a string, using a byte-by-byte algorithm, with the default CRC. More...

uint64_t	CxxUtils::crc64_bytewise (const std::string &s)
	Find the CRC-64 of a string, using a byte-by-byte algorithm, with the default CRC. More...

uint64_t	CxxUtils::crc64 (const CRCTable &table, const char *data, size_t data_len)
	Find the CRC-64 of a string,. More...

uint64_t	CxxUtils::crc64 (const char *data, size_t data_len)
	Find the CRC-64 of a string, with the default CRC. More...

uint64_t	CxxUtils::crc64 (const std::string &s)
	Find the CRC-64 of a string, using the default polynomial. More...

uint64_t	CxxUtils::crc64addint (uint64_t crc, uint64_t x)
	Extend a previously-calculated CRC to include an int. More...

std::string	CxxUtils::crc64format (uint64_t crc)
	Format a CRC-64 as a string. More...

std::string	CxxUtils::crc64digest (const std::string &str)
	Return a CRC-64 digest of a string. More...

Variables
const CRCTable	CxxUtils::defaultCRCTable (0xad93d23594c935a9)

Detailed Description

A crc-64 implementation, using pclmul where possible.

Author: scott snyder snyde.nosp@m.r@bn.nosp@m.l.gov

Date: May, 2018 This is an implementation of the crc64 calculation, as used in Athena, that takes advantage of the x86_64 pclmul instruction where possible.

This implements the algorithm described in this reference:

Gopal, et al., `‘Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction,’' Intel White Paper 323102 (2009). http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf https://github.com/intel/soft-crc

An alternate implementation of this is here: https://github.com/rawrunprotected/crc

and a useful general reference for CRC calculations is: https://zlib.net/crc_v3.txt.

In general, given an arbitrarily long message M expressed as a single binary number, the CRC is calculated as

CRC = 2^w M mod P

where the arithmetic is done in a ‘carryless’ fashion; that is, ‘addition’ becomes the xor operation (and subtraction is thus the same as addition). Here, w is the number of bits in the desired result; the most-significant bit of P should be 2^w. (This is like appending w 0's to the end of M.) One often uses language in which M and P are considered to be polynomials, one term per bit in the binary expansion; and hence P is usually called the ‘polynomial’. This comes from considering the arithmetic as an instance of a Galois field. However, if one is simply trying to understand a CRC algorithm, this language often obscures more that it clarifies, so we will usually not use it here (except for calling P the polynomial).

As a concrete example, the CRC of 0x96 with a polynomial of 0x16 may be calculated like this

100101100000

10110

010011

10110

0010100

10110

00010000

10110

0110

So the remainder, and the CRC, is 6. Note that this is nothing more than binary long division with xor taking the place of addition and subtraction.

One can process more than one bit at a time in the division using a lookup table with a number of entries equal to 2 to the number of bits (see crc_v3.txt for details). Implementations commonly do a byte at a time, and that is what is implemented here as the non-vectorized version. One can use larger sizes than 8 bits, but memory requirements soon become prohibitive.

The implementation used here speeds up the CRC calculation by using the pclmul instruction, which does a carryless multiplication of two 64-bit quantities, producing a 127-bit result. This allows processing 128 bits at a time.

A few other subtleties are worth mentioning. First, when the value of p is written, the initial 1 bit is often dropped. So for the example above, we were calculating a 4-bit CRC with a p of 0x16, but in practice one might drop the initial 1 on p and write it as 0x6 (with the bit width understood). Second, as stated here, the CRC value is insensitive to the number of leading 0's on M. This is usually not a desirable feature, so the CRC calculation is very often initialized with something other than 0. A value of all bits set is common, and that is what is used as a default here. Third, the bits in each byte are sometimes reversed before the CRC calculation. This originated due to the way a UART serial interface works, but has spread to many cases that don't have anything to do with serial ports. The implementation here is such a reversed CRC (simply because that's the way the original bytewise implementation we're trying to match worked).

The way a CRC calculation works is by iteratively reducing the length of the message in such a way that the CRC does not change. For example, suppose we have a message M that is split into two parts, A of T bits, and B of U bits. Thus, M = 2^U A + B. Then suppose we find a shorter string A' with T' bits, such that A mod P = A' mod P. It is then easy to show that the string M' = 2^U A' + B has the same CRC:

M mod P = [(2^U mod P)(A mod P)] mod P + B mod P M' mod P = [(2^U mod P)(A' mod P)] mod P + B mod P

But A mod P = A' mod P, so therefore M mod P = M' mod P.

The algorithm here consists of two parts: a ‘folding’ step that is applied iteratively, and a final ‘reduction’ step.

For a folding step, we take a block of 256 bits (A in the above example) and reduce it to a block of 128 bits (A' in the above). Suppose we have a 256-bit A that's divided into a 64-bit H, a 64-bit L, and a 128-bit G:

A = 2^192 H + 2^128 L + G

Then C = A mod P = [(2^192 mod P) (H mod P)] mod P + [(2^128 mod P) (L mod P)] mod P+ G mod P

Since H and L are 64-bits and P has the 2^64 bit implicitly set, H mod P = H, and similarly for L. So

A mod P = A' mod P, where A' = (2^192 mod P) H + (2^128 mod P) L + G mod P

(where again, addition here means xor). Note that the coefficients 2^192 mod P and 2^128 mod P are constants that can be precomputed.

So the folding algorithm works by keeping a 128-bit accumulator. At each step, we divide it up into high and low 64-bit halves, do a clmul of each half with the corresponding constant, and xor these two results together with the next 128-bit block.

For the reversed case, nearly everything is the same, since bit ordering doesn't matter for carryless multiplication. The only thing to watch out for is that the results need to be shifted one place to the left, since pclmul produces a result in the 127 least-significant bits. If * is carryless multiplication, then

reflect(a)*reflect(b) << 1 = reflect(a*b)

The shift can be applied to the pre-computed constants (they also need to be bit-reflected).

Once the entire message has been reduced to 128 bits (see below for cases where it doesn't come out evenly), we need to fold in the 64 trailing 0 bits in the CRC definition. Let M be divided into 64-bit halves H and L:

M = 2^64 H + L

Then

2^64 M mod P = [(2^128 mod P) H] mod P + (2^64 L) mod P = [(2^128 mod P) H + 2^64 L] mod P

where 2^128 mod P is again one of the constants that was used for folding.

Finally, the 128-bit folded value is reduced to a 64-bit CRC using the technique of Barrett reduction:

https://en.m.wikipedia.org/wiki/Barrett_reduction https://www.nayuki.io/page/barett-reduction-algorithm

If we want to compute C = M mod P, where C < M^2, then we can precompute s = 1/P and then C = M - floor(M s)P. In the case of fixed-point arithmetic, it is convenient to scale s by a power of 2 so that it is not less than one. We precompute mu = floor(2^128 / P). Then

CRC = M - floor(M mu / 2^128) P

which we implement as (remember all operations are carryless, hence addition and subtraction are both xor):

T1 = floor(M / 2^64) mu T2 = floor(T1 / x^64) * P CRC = (M + T2) mod 64

For the reflected case, we again need to reflect/shift the constants.

In the cases where the size of the message is not divisible by 128 bits, we need one more folding round for the remainder.

Performance: On an Intel i7-4600M CPU, the version using pclmul is about ten times faster than the bytewise version for long strings (50-60k bytes). For short strings (20-40 bytes), it is about five times faster. On a Celeron C2840 CPU, the speedups are more modest, four and 1.5 times, respectively.

We currently support compiling this only with gcc. clang implements function multiversioning, but some of the intrinsics are missing/renamed. We haven't tried it on icc, but it's unlikely to work without adjustment.

Definition in file crc64.cxx.

Classes

Namespaces

Functions

Variables

Detailed Description

10110

10110

10110

10110