Wednesday, August 9, 2017

statistics - Determining the mean and standard deviation in real time


What would be the ideal way to find the mean and standard deviation of a signal for a real-time application? I'd like to be able to trigger a controller when a signal stays more than 3 standard deviations away from the mean for a certain amount of time.


I'm assuming a dedicated DSP would do this pretty readily, but is there any "shortcut" that may not require something so complicated?



Answer



There's a flaw in Jason R's answer, which is discussed in Knuth's "Art of Computer Programming" vol. 2. The problem comes when the standard deviation is a small fraction of the mean: the calculation of E(x^2) - (E(x))^2 suffers from severe sensitivity to floating-point rounding error, because it subtracts two large, nearly equal numbers.


You can even try this yourself in a Python session:


ofs = 1e9                                   # large offset: the mean dwarfs the spread
A = [ofs + x for x in [1, -1, 2, 3, 0, 4.02, 5]]
A2 = [x*x for x in A]
(sum(A2)/len(A)) - (sum(A)/len(A))**2       # naive E(x^2) - (E(x))^2


I get -128.0 as an answer, which clearly isn't computationally valid, since the math predicts that the result should be nonnegative.
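To see why the exact result can never be negative, note that the naive formula is just an expansion of the definition of variance:

$$\operatorname{Var}(x) \;=\; E\!\left[(x - E[x])^2\right] \;=\; E[x^2] - \left(E[x]\right)^2 \;\ge\; 0,$$

since it's the expectation of a square. The -128.0 is pure cancellation: both terms are around $10^{18}$, where adjacent double-precision numbers are spaced 128 apart, so a single ulp of rounding error in either term already swamps the true variance of about 4.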


Knuth cites an approach (due to Welford) for calculating the running mean and standard deviation, which goes something like this:


initialize:
    m = 0;
    S = 0;
    n = 0;

for each incoming sample x:
    prev_mean = m;
    n = n + 1;
    m = m + (x - m)/n;
    S = S + (x - m)*(x - prev_mean);

and then after each step, the value of m is the running mean, and the standard deviation can be calculated as sqrt(S/n) or sqrt(S/(n-1)), depending on which definition of standard deviation (population or sample) you prefer.


The update I wrote above is slightly different from the one in Knuth, but it's computationally equivalent.
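For reference, the form usually quoted from Knuth goes, in that notation,

$$M_1 = x_1, \qquad M_k = M_{k-1} + \frac{x_k - M_{k-1}}{k},$$
$$S_1 = 0, \qquad S_k = S_{k-1} + (x_k - M_{k-1})(x_k - M_k),$$

i.e. the same update with the two factors of the $S$ increment in the other order, and with the mean seeded from the first sample instead of from zero. Starting from m = 0, n = 0 as above gives m = x_1 after the first sample anyway, so the two versions agree.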


When I have a few more minutes, I'll code up the above formula in Python and show that you'll get a nonnegative answer (that hopefully is close to the correct value).




update: here it is.


test1.py:



import math

def stats(x):
    # running (Welford-style) mean and variance: one pass, no large intermediate sums
    n = 0
    S = 0.0
    m = 0.0
    for x_i in x:
        n = n + 1
        m_prev = m
        m = m + (x_i - m) / n
        S = S + (x_i - m) * (x_i - m_prev)
    return {'mean': m, 'variance': S/n}

def naive_stats(x):
    # textbook E(x^2) - (E(x))^2 formula -- prone to catastrophic cancellation
    S1 = sum(x)
    n = len(x)
    S2 = sum([x_i**2 for x_i in x])
    return {'mean': S1/n, 'variance': (S2/n - (S1/n)**2) }

x1 = [1, -1, 2, 3, 0, 4.02, 5]
x2 = [x + 1e9 for x in x1]   # same data, shifted by a large offset

print "naive_stats:"
print naive_stats(x1)
print naive_stats(x2)

print "stats:"
print stats(x1)
print stats(x2)


result:


naive_stats:
{'variance': 4.0114775510204073, 'mean': 2.0028571428571427}
{'variance': -128.0, 'mean': 1000000002.0028572}
stats:
{'variance': 4.0114775510204073, 'mean': 2.0028571428571431}
{'variance': 4.0114775868357446, 'mean': 1000000002.0028571}

You'll note that there's still some rounding error, but it's not bad, whereas naive_stats just pukes.





edit: Just noticed Belisarius's comment citing Wikipedia, which does mention the Knuth algorithm.
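To tie this back to the original question, here is a minimal streaming sketch (my own illustration, not part of the answer above; the names RunningStats, threshold_sigma and hold_samples are made up) that applies the same update sample by sample and fires once the input has stayed more than 3 standard deviations from the running mean for a given number of consecutive samples:

import math

class RunningStats(object):
    """Running mean/variance using the update described above."""
    def __init__(self):
        self.n = 0
        self.m = 0.0
        self.S = 0.0

    def push(self, x):
        self.n += 1
        m_prev = self.m
        self.m += (x - self.m) / self.n
        self.S += (x - self.m) * (x - m_prev)

    def mean(self):
        return self.m

    def std(self):
        # population definition; use self.S / (self.n - 1) for the sample definition
        return math.sqrt(self.S / self.n) if self.n > 0 else 0.0

def three_sigma_trigger(samples, threshold_sigma=3.0, hold_samples=10):
    """Yield the index of each sample at which the signal has been outside
    mean +/- threshold_sigma * std for hold_samples consecutive samples."""
    stats = RunningStats()
    run = 0
    for i, x in enumerate(samples):
        if stats.n > 1 and stats.std() > 0 and abs(x - stats.mean()) > threshold_sigma * stats.std():
            run += 1
            if run >= hold_samples:
                yield i
                run = 0
        else:
            run = 0
        stats.push(x)

For example, iterating over three_sigma_trigger(signal, hold_samples=50) would report each point at which the condition has held for 50 consecutive samples. Whether out-of-range samples should keep feeding the running statistics (as they do here) or be held out while the trigger condition is active is an application-specific choice, as is replacing the all-time running statistics with an exponentially weighted or sliding-window version so that old data eventually stops influencing the threshold.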

