I need to compute $\text{atan2}(y, x)$ on an FPGA with a continuous input/output stream of data. I managed to implement it using unrolled, pipelined CORDIC kernels, but to get the accuracy I need, I had to perform 32 iterations. This led to a pretty large number of LUTs being devoted to this one task. I tried changing the flow to use partially unrolled CORDIC kernels, but then I needed a multiplied clock frequency to execute the repeated loops while still maintaining a continuous input/output flow, and with this I could not meet timing.
So now I am reaching out for alternative ways of computing $\text{atan2}(y, x)$.
I thought about using block-RAM lookup tables with interpolation, but with two input variables I would need a two-dimensional lookup table, which is very resource intensive in terms of block-RAM usage.
I then thought about using the fact that $\text{atan2}(y, x)$ is related to $\text{atan}(y/x)$ with quadrant adjustment. The problem with this is that $y/x$ needs a true divide, since $x$ is not a constant, and dividers on FPGAs are very resource intensive.
Are there any more novel ways to implement $\text{atan2}(y, x)$ on an FPGA that would result in lower LUT usage while still providing good accuracy?
Answer
You can use logarithms to get rid of the division. For $(x, y)$ in the first quadrant:
$$z = \log_2(y)-\log_2(x)\\ \text{atan2}(y, x) = \text{atan}(y/x) = \text{atan}(2^z)$$
Figure 1. Plot of $\text{atan}(2^z)$
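As a quick floating-point sanity check of this identity (plain Python, no tables yet; the helper name is mine, and the symmetry used to keep $z \ge 0$ is the one noted just below):

    # First-quadrant atan2 via z = log2(y) - log2(x) and atan(2**z),
    # folding z < 0 back with atan(2**z) = pi/2 - atan(2**-z).
    from math import atan, atan2, log2, pi

    def atan2_log(y, x):
        z = log2(y) - log2(x)
        if z >= 0:
            return atan(2.0**z)
        return pi/2 - atan(2.0**-z)

    for y, x in [(1.0, 3.0), (5.0, 0.25), (1e-6, 7.0)]:
        print(atan2_log(y, x), atan2(y, x))   # the two columns should match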
You would need to approximate $\text{atan}(2^z)$ over the range $-30 < z < 30$ to get your required accuracy of 1E-9. You can take advantage of the symmetry $\text{atan}(2^{-z}) = \frac{\pi}{2}-\text{atan}(2^z)$, or alternatively ensure that $(x, y)$ is in a known octant. To approximate $\log_2(a)$:
$$b = \text{floor}(\log_2(a))\\ c = \frac{a}{2^b}\\ \log_2(a) = b + \log_2(c)$$
$b$ can be calculated by finding the location of the most significant non-zero bit, and $c$ by a bit shift. You would need to approximate $\log_2(c)$ over the range $1 \le c < 2$.
Figure 2. Plot of $\log_2(c)$
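As a sketch of that step (the helper name and `frac_bits` width are mine, not from the answer): with an unsigned integer input, $b$ is a priority-encoder output and $c$ a barrel-shifter output.

    # Sketch: split log2(a) = b + log2(c) for an unsigned integer a > 0.
    # In hardware, b comes from a priority encoder and c from a barrel shifter.
    def log2_split(a, frac_bits=16):
        b = a.bit_length() - 1               # floor(log2(a)): index of the MSB
        # Normalize a to c = a/2**b, 1 <= c < 2, as a fixed-point number
        # with frac_bits fractional bits.
        if b >= frac_bits:
            c_fixed = a >> (b - frac_bits)
        else:
            c_fixed = a << (frac_bits - b)
        return b, c_fixed

    b, c_fixed = log2_split(1234)
    print(b, c_fixed / 2.0**16)              # 10 1.205078125 (1234 = 1.205... * 2**10)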
For your accuracy requirements, with linear interpolation and uniform sampling, $2^{14} + 1 = 16385$ samples of $\log_2(c)$ and $30\times 2^{12} + 1 = 122881$ samples of $\text{atan}(2^z)$ for $0 < z < 30$ should suffice. The latter table is pretty large, and with it the error due to interpolation depends greatly on $z$:
Figure 3. $\text{atan}(2^z)$ approximation largest absolute error for different ranges of $z$ (horizontal axis) and different numbers of samples (32 to 8192) per unit interval of $z$. The largest absolute error for $0 \le z < 1$ (omitted) is slightly less than for $\text{floor}(\log_2(z)) = 0$.
The $\text{atan}(2^z)$ table can be split into multiple subtables that correspond to $0 \le z < 1$ and different $\text{floor}(\log_2(z))$ with $z \ge 1$, which is easy to calculate. The table lengths can be chosen as guided by Fig. 3. The within-subtable index can be calculated by a simple bit string manipulation. For your accuracy requirements the $\text{atan}(2^z)$ subtables will have a total of 29217 samples if you extend the range of $z$ to $0 \le z < 32$ for simplicity.
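As an illustration of the subtable addressing (a sketch; the fixed-point format and the per-subtable sizes here are placeholders of my choosing, not the ones behind the 29217 figure, which Fig. 3 would guide):

    # Sketch of subtable selection and within-subtable indexing for
    # atan(2**z), 0 <= z < 32, with z as unsigned fixed point: 5 integer
    # bits and FRAC fractional bits.
    FRAC = 20
    INDEX_BITS = [12, 12, 11, 10, 8, 6]       # log2 of entries per subtable (made up)

    def subtable_index(z_fixed):
        zi = z_fixed >> FRAC                  # integer part of z
        if zi == 0:
            k = 0                             # subtable for 0 <= z < 1
            offset = z_fixed                  # position within [0, 1)
            span_bits = FRAC                  # this subtable spans 1 unit of z
        else:
            e = zi.bit_length() - 1           # floor(log2(z)) for z >= 1
            k = e + 1                         # subtable for 2**e <= z < 2**(e+1)
            offset = z_fixed - (1 << (e + FRAC))   # z - 2**e: drop the leading 1 bit
            span_bits = e + FRAC              # this subtable spans 2**e units of z
        # Top bits of the offset select the entry; the bits shifted out
        # would be the linear-interpolation fraction.
        return k, offset >> (span_bits - INDEX_BITS[k])

    print(subtable_index(5 << FRAC))          # z = 5.0: subtable 3, entry 256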
For later reference, here is the clunky Python script that I used to calculate the approximation errors:
from numpy import array, double, empty, zeros
from math import atan, log

# Max error of plain linear interpolation of log2 over 1 <= x < 2,
# using N table intervals and M test points per interval.
N = 10
M = 20
x = array(range(N + 1))/double(N) + 1
y = empty(N + 1, double)
for i in range(N + 1):
    y[i] = log(x[i], 2)
maxErr = 0
for i in range(N):
    for j in range(M):
        a = y[i] + (y[i + 1] - y[i])*j/M
        if N*M < 1000:
            print(str((i*M + j)/double(N*M) + 1) + ' ' + str(a))
        b = log((i*M + j)/double(N*M) + 1, 2)
        err = abs(a - b)
        if err > maxErr:
            maxErr = err
print(maxErr)

# Same table with the prefilter applied; the first and last output
# samples are filtered using values of the function outside the table.
y2 = empty(N + 1, double)
for i in range(1, N):
    y2[i] = -1.0/16.0*y[i-1] + 9.0/8.0*y[i] - 1.0/16.0*y[i+1]
y2[0] = -1.0/16.0*log(-1.0/N + 1, 2) + 9.0/8.0*y[0] - 1.0/16.0*y[1]
y2[N] = -1.0/16.0*y[N-1] + 9.0/8.0*y[N] - 1.0/16.0*log((N+1.0)/N + 1, 2)
maxErr = 0
for i in range(N):
    for j in range(M):
        a = y2[i] + (y2[i + 1] - y2[i])*j/M
        b = log((i*M + j)/double(N*M) + 1, 2)
        if N*M < 1000:
            print(a)
        err = abs(a - b)
        if err > maxErr:
            maxErr = err
print(maxErr)

# Same, but with end conditioning at the first and last sample instead
# of using values from outside the table.
y2[0] = 15.0/16.0*y[0] + 1.0/8.0*y[1] - 1.0/16.0*y[2]
y2[N] = -1.0/16.0*y[N - 2] + 1.0/8.0*y[N - 1] + 15.0/16.0*y[N]
maxErr = 0
for i in range(N):
    for j in range(M):
        a = y2[i] + (y2[i + 1] - y2[i])*j/M
        b = log((i*M + j)/double(N*M) + 1, 2)
        if N*M < 1000:
            print(str(a) + ' ' + str(b))
        err = abs(a - b)
        if err > maxErr:
            maxErr = err
print(maxErr)

# Max interpolation error of atan(2**z) per unit interval of z in
# 0 <= z < P, for 2**k samples per unit interval (data behind Fig. 3).
P = 32
NN = 13
M = 8
for k in range(NN):
    N = 2**k
    x = array(range(N*P + 1))/double(N)
    y = empty(N*P + 1, double)
    maxErr = zeros(P)
    for i in range(N*P + 1):
        y[i] = atan(2**x[i])
    for i in range(N*P):
        for j in range(M):
            a = y[i] + (y[i + 1] - y[i])*j/M
            b = atan(2**((i*M + j)/double(N*M)))
            err = abs(a - b)
            if i*M + j > 0 and err > maxErr[i//N]:
                maxErr[i//N] = err
    print(N)
    for i in range(P):
        print(str(i) + ' ' + str(maxErr[i]))
The local maximum error from approximating a function $f(x)$ by linear interpolation $\widehat{f}(x)$ between uniformly spaced samples of $f(x)$, with sampling interval $\Delta x$, can be approximated analytically by:
$$\widehat{f}(x) - f(x) \approx (\Delta x)^2\lim_{\Delta x\rightarrow 0}\frac{\frac{f(x) + f(x + \Delta x)}{2} - f(x + \frac{\Delta x}{2})}{(\Delta x)^2} = \frac{(\Delta x)^2 f''(x)}{8},$$
where $f''(x)$ is the second derivative of $f(x)$ and $x$ is at a local maximum of the absolute error. With the above we get the approximations:
$$\widehat{\text{atan}}(2^z) - \text{atan}(2^z) \approx \frac{(\Delta z)^2 2^z(1 - 4^z)\ln(2)^2}{8(4^z + 1)^2},\\ \widehat{\log_2}(a) - \log_2(a) \approx \frac{-(\Delta a)^2}{8 a^2\ln(2)}.$$
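As a quick numerical check of the $\log_2$ estimate against brute force (plain Python; $N = 2^{14}$ intervals, matching the table size suggested earlier):

    # Analytic estimate vs. measured interpolation error for the log2
    # table with N = 2**14 intervals on [1, 2); the worst case is near a = 1.
    from math import log, log2
    N = 2**14
    da = 1.0/N
    print(da*da/(8*log(2)))     # analytic estimate at a = 1: about 6.7e-10
    table = [log2(1 + i/N) for i in range(N + 1)]
    print(max(abs(table[i] + (table[i+1] - table[i])*0.5 - log2(1 + (i + 0.5)/N))
              for i in range(N)))   # measured midpoint error: about the same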
Because the functions are concave and the samples match the function exactly, the error is always in one direction. The local maximum absolute error could be halved if the sign of the error were made to alternate back and forth once every sampling interval. With linear interpolation, close to optimal results can be achieved by prefiltering each table by:
$$y[k] = \begin{cases}b_0x[k] + b_1x[k+1] + b_2x[k+2]&\text{if } k = 0,\\ c_1x[k-1] + c_0x[k] + c_1x[k+1]&\text{if } 0 < k < N,\\ b_2x[k-2] + b_1x[k-1] + b_0x[k]&\text{if } k = N,\end{cases}$$
where $x$ and $y$ are the original and the filtered table, both spanning $0 \le k \le N$, and the weights are $c_0 = \frac{9}{8}, c_1 = -\frac{1}{16}, b_0 = \frac{15}{16}, b_1 = \frac{1}{8}, b_2 = -\frac{1}{16}$. The end conditioning (the first and last rows of the above equation) gives a lower error at the ends of the table than filtering with samples of the function from outside the table would, because the first and last samples then do not need to be adjusted for an interpolation interval that extends beyond the table. Subtables with different sampling intervals should be prefiltered separately. The values of the weights $c_0, c_1$ were found by sequentially minimizing, for increasing exponent $N$, the maximum absolute value of the approximate error:
$$(\Delta x)^N\lim_{\Delta x\rightarrow 0}\frac{\left(c_1f(x - \Delta x) + c_0f(x) + c_1f(x + \Delta x)\right)(1-a) + \left(c_1f(x) + c_0f(x + \Delta x) + c_1f(x + 2 \Delta x)\right)a - f(x + a\Delta x)}{(\Delta x)^N} =\left\{\begin{array}{ll}(c_0 + 2c_1 - 1)f(x) &\text{if } N = 0, \bigg| c_1 = \frac{1 - c_0}{2}\\ 0&\text{if }N = 1,\\ \frac{1+a-a^2-c_0}{2}(\Delta x)^2 f''(x)&\text{if }N=2, \bigg|c_0 = \frac{9}{8}\end{array}\right.$$
for inter-sample interpolation positions $0 \le a < 1$, with a concave or convex function $f(x)$ (for example $f(x) = e^x$). With those weights solved, the values of the end conditioning weights $b_0, b_1, b_2$ were found by minimizing similarly the maximum absolute value of:
$$(\Delta x)^N\lim_{\Delta x\rightarrow 0}\frac{\left(b_0f(x) + b_1f(x + \Delta x) + b_2f(x + 2 \Delta x)\right)(1-a) + \left(c_1f(x) + c_0f(x + \Delta x) + c_1f(x + 2 \Delta x)\right)a - f(x + a\Delta x)}{(\Delta x)^N} =\left\{\begin{array}{ll}\left(b_0 + b_1 + b_2 - 1 + a(1 - b_0 - b_1 - b_2)\right)f(x) &\text{if } N = 0, \bigg| b_2 = 1 - b_0 - b_1\\ (a-1)(2b_0+b_1-2)\Delta x f'(x)&\text{if }N = 1,\bigg|b_1=2-2b_0\\ \left(-\frac{1}{2}a^2 + \left(\frac{23}{16} - b_0\right)a + b_0 - 1\right)(\Delta x)^2f''(x)&\text{if }N=2, \bigg|b_0 = \frac{15}{16}\end{array}\right.$$
for $0 \le a < 1$. Use of the prefilter about halves the approximation error and is easier to do than full optimization of the tables.
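For reference, here is the prefilter above as a standalone function together with a quick check (a sketch; the table size of 1024 intervals is arbitrary):

    # Sketch: the prefilter, end conditioning included, applied to a raw
    # table x of N + 1 samples; returns the filtered table y.
    from math import log2

    def prefilter(x):
        c0, c1 = 9.0/8.0, -1.0/16.0
        b0, b1, b2 = 15.0/16.0, 1.0/8.0, -1.0/16.0
        N = len(x) - 1
        y = [0.0]*(N + 1)
        y[0] = b0*x[0] + b1*x[1] + b2*x[2]
        for k in range(1, N):
            y[k] = c1*x[k-1] + c0*x[k] + c1*x[k+1]
        y[N] = b2*x[N-2] + b1*x[N-1] + b0*x[N]
        return y

    # On a log2 table with 1024 intervals, the filtered table's midpoint
    # interpolation error comes out roughly half of the raw table's.
    N = 1024
    raw = [log2(1 + i/N) for i in range(N + 1)]
    filt = prefilter(raw)
    for t in (raw, filt):
        print(max(abs(t[i] + (t[i+1] - t[i])*0.5 - log2(1 + (i + 0.5)/N))
                  for i in range(N)))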
Figure 4. Approximation error of $\log_2(a)$ from 11 samples, with and without prefilter and with and without end conditioning. Without end conditioning the prefilter has access to values of the function just outside of the table.
This article probably presents a very similar algorithm: R. Gutierrez, V. Torres, and J. Valls, “FPGA-implementation of atan(Y/X) based on logarithmic transformation and LUT-based techniques,” Journal of Systems Architecture, vol. 56, 2010. The abstract says their implementation beats previous CORDIC-based algorithms in speed and LUT-based algorithms in footprint size.