I am interested in understanding the mathematical reason why applying a median filter to an image (or signal) results in a reduction of noise.
Intuition: The intuition is this: your noise consists of events that are rare and that, compared to the rest of your data, look like outliers that shouldn't really be there.
For example, if you are measuring the speeds of every car on the highway as they pass by you and plot them, you will see that they are usually in the range of, say, $50$ mph to $70$ mph. However, as you are inspecting your data for your boss, you see that you recorded a speed of $1,000,000$ mph. Not only does this value not make physical sense for the speed of an actual car on a highway, but it also sticks out wildly from the rest of your data. Chalking this event up to some strange measurement error, you remove it, and give the rest of your data to your boss.
However, as you continue your measurements day in and day out, you notice that every now and then you get those wild measurements of speed. For example, over the span of one hour you measure $1000$ cars, and their speeds are nicely between $50$ and $70$ mph; however, 3 of them have speeds of $23,424$ mph, $12,000,121$ mph, and $192,212,121,329,982,321,912$ mph, breaking not only local state laws but also those of theoretical physics.
You get tired of continuously having to go in and remove, by hand, those errant data points caused by your cheapo radar. After all, your boss is really only interested in the statistics of the speeds, not so much in every actual value. He likes to make nice histograms for his bosses.
Those errant and large numbers are a kind of 'noise', you reckon - 'noise' caused by the cheapo radar that you bought from a shady pawn shop. Is the noise additive white Gaussian noise (AWGN)? Yes and no - its spectrum is wideband and white, but it is temporally rare, sparse, and very localized. It is better referred to as 'salt and pepper' noise (especially in the image processing domain).
Thus, what you can do is run your data through a median filter. The median filter will take a block of, say, $5$ speed points (points 1 to 5), find the median, and spit that value out as the 'average' speed. Then it will take the next $5$ points (points 2 to 6), take that median, and spit it out as the average, and so on.
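For concreteness, here is a minimal sketch of that sliding-window median in Python (assuming NumPy is available; the function name and window size are just illustrative):

    import numpy as np

    def sliding_median(x, window=5):
        """Slide a length-`window` block over x and return the median of each block."""
        x = np.asarray(x, dtype=float)
        return np.array([np.median(x[i:i + window])
                         for i in range(len(x) - window + 1)])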
What happens when you come across one of your faster-than-light speeds? Let us say that your 5 speeds were [45, 65, 50, 999999, 75]. If you took the normal average (the mean), your 'average' speed here would be something quite large - about $200,047$ mph. However, if you take the median, your 'average' will be $65$. Which best approximates the typical speed that you are really trying to measure? The median.
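You can check that arithmetic directly; the snippet below (again just a sketch using NumPy) compares the two estimates on that window:

    import numpy as np

    speeds = np.array([45, 65, 50, 999999, 75], dtype=float)
    print(np.mean(speeds))    # 200046.8 -- the single outlier dominates the mean
    print(np.median(speeds))  # 65.0     -- the outlier's value never enters the estimate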
Thus, if you filter your data with a median filter, you will remove those outliers - you will have faithfully 'de-noised' your signal. In contrast, if you tried to remove the noise via traditional linear filtering (nothing but a moving weighted sum), you would instead 'smear' the error across your data rather than get rid of it, as the sketch below illustrates.
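As a rough illustration of that 'smearing' (assuming NumPy and SciPy are available), the sketch below pushes a single spike through a 5-point moving average and through a 5-point median filter:

    import numpy as np
    from scipy.signal import medfilt

    x = np.full(21, 60.0)        # a steady 60 mph signal...
    x[10] = 999999.0             # ...with one impulsive 'salt' spike in the middle

    smeared = np.convolve(x, np.ones(5) / 5, mode='same')  # moving average: the spike leaks into 5 outputs
    cleaned = medfilt(x, kernel_size=5)                     # median filter: the spike is simply discarded

    print(smeared[8:13])  # the spike's neighbours are now on the order of 200,000 mph
    print(cleaned[8:13])  # [60. 60. 60. 60. 60.]

The moving average pushes a fifth of the spike into every output it touches; the median filter never reports the spike's value at all.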
Math: The math is this: the median is what is referred to as an order statistic. That is, it returns the value of your data at some position after the data has been ordered. The max and min are also order statistics - they return the extreme points of your data after it has been ordered. The median likewise returns a value of your ordered data, but from right in the middle.
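In standard notation: if the $n$ samples in a window are sorted as $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, then the order statistics are the $x_{(k)}$, and in particular

$$\min = x_{(1)}, \qquad \max = x_{(n)}, \qquad \text{median} = x_{\left(\frac{n+1}{2}\right)} \quad \text{for odd } n.$$

For the window [45, 65, 50, 999999, 75] above, sorting gives [45, 50, 65, 75, 999999], so the median is $x_{(3)} = 65$, while the outlier sits untouched at $x_{(5)}$.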
But why are order statistics different from mean filters? Well, a mean filter computes an average using the values of all the data. Notice that with the max, min, and median, you get an answer without arithmetically combining all the data. In fact, all the median does is order your data and pick the value in the middle. It never 'touches' the outliers, like those large speeds that you measured.
This is why the median - an order statistic - is able to 'remove' outlier noise for you. Outlier noise segregates itself at the extremes of the ordered data, away from the middle, so the median never comes near it or includes it in its output, while still giving you a nice estimate of central tendency.
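To tie it back to images: the sketch below (assuming NumPy and SciPy are available; the flat grey image and $2\%$ noise level are made up purely for illustration) sprinkles salt noise on an image and compares a $3 \times 3$ mean filter with a $3 \times 3$ median filter:

    import numpy as np
    from scipy.ndimage import median_filter, uniform_filter

    rng = np.random.default_rng(0)
    img = np.full((64, 64), 0.5)               # a flat grey test image
    noisy = img.copy()
    noisy[rng.random(img.shape) < 0.02] = 1.0  # ~2% of pixels hit by 'salt' noise

    mean_out = uniform_filter(noisy, size=3)    # mean filter: each bad pixel is smeared over its 3x3 neighbourhood
    median_out = median_filter(noisy, size=3)   # median filter: isolated bad pixels never reach the output

    print(np.abs(mean_out - img).max())     # noticeably nonzero around every noisy pixel
    print(np.abs(median_out - img).max())   # essentially 0 - the salt pixels are gone

The mean filter dilutes each salt pixel across its neighbourhood; the median filter, which only ever reports the middle of the ordered neighbourhood, drops the salt entirely.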