I've spent the last couple of days playing with YoloV3 and have had very good results. My use case is sports photography, and the object detection for people, bikes, etc. is very, very good; I'm impressed. In the future I think I'll train it on my own dataset to improve it further, but out of the box it already does a fantastic job.
What I want to improve:
Once an object has been detected, how can I generate some sort of metric to quantify how well-focused it is?
Past/current approaches
- 1) My first thought/Google result was "variance". First I turn the image grayscale, then apply a Laplacian convolution to highlight the edges, and then simply compute the variance of the pixels in the bounding box (there's a rough code sketch of this after the list). High numbers "probably" mean high contrast, i.e. good focus, whereas lower numbers tend to mean low contrast and probably poor focus.
It works pretty well, but it's not 100% reliable. Imagine a person standing with their arms spread wide against a smooth bokeh background. Most of the bounding box is blurred background, so the variance ends up being low.
- 2) To improve on this, I reasoned that there will almost always be a significant portion of the object somewhere around the center of the bounding box. So I restricted the measurement to a central square, 20% of the width of the BB and 20% of its height.
Unfortunately this throws up scenarios where that square lands mostly on background, e.g. between someone's arm and body when they're running around a tight corner facing the camera.
- 3) "OK, a cross". Thickness equal to 20% of the BB, up and down through the middle and left and right through the middle.
Not bad, not bad. Still getting a lot of background on some images though, since the edges of the box are where background tends to reside.
- 4) "OK, a reduced cross". Same as above, but only extending from the center 2/3rds of the way out to the edges.
Almost fantastic. With the caveat that sometimes you end up with just a competitor's chest, and if they're wearing a single-colour top.... the variance isn't all that.
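For reference, here's roughly what approaches 1 and 4 look like in code. This is just a sketch using OpenCV and NumPy; the (x, y, w, h) box format and the helper names are mine, so adapt them to however your detector reports its boxes.

```python
import cv2
import numpy as np

def reduced_cross_mask(h, w, thickness=0.2, extent=2/3):
    """Boolean mask of a cross through the center of an h x w box.
    Arm thickness is `thickness` of the box size, and each arm only
    extends `extent` of the way from the center towards the edges."""
    mask = np.zeros((h, w), dtype=bool)
    cy, cx = h // 2, w // 2
    ty, tx = max(1, int(h * thickness / 2)), max(1, int(w * thickness / 2))
    ey, ex = int((h / 2) * extent), int((w / 2) * extent)
    mask[cy - ty:cy + ty, cx - ex:cx + ex] = True   # horizontal arm
    mask[cy - ey:cy + ey, cx - tx:cx + tx] = True   # vertical arm
    return mask

def focus_score(image_bgr, box, use_cross=True):
    """Variance of the Laplacian inside a bounding box (x, y, w, h),
    optionally restricted to the reduced cross."""
    x, y, w, h = box
    gray = cv2.cvtColor(image_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    lap = cv2.Laplacian(gray, cv2.CV_64F)           # edge response
    if use_cross:
        return lap[reduced_cross_mask(h, w)].var()  # approach 4
    return lap.var()                                # approach 1
```

`focus_score(img, (x, y, w, h))` is then the number I compare between detections.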
Examples:
In this photo, the motorbike (close enough...) apparently has great focus, while the person isn't so good, mainly due to the near-uniformity of his central cross.
Here's a more troubling example. Look at that variance, 5 FFS!
So I think that's the end of that approach.
The future...
I could go on and on with this, and I'm ALWAYS going to end up with some photos that it just doesn't work well for.
I think a different approach is needed.
One thought is just to take the largest variance over a small region, say a 10% width/height square that roams across the bounding box.
But then if the foreground is completely out of focus, and the background is sharp, we'll get a false positive from the background.
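For what it's worth, the roaming-window version would look something like this (again just a sketch; the window and stride fractions are arbitrary choices of mine):

```python
import cv2

def max_local_variance(gray_roi, win_frac=0.1, stride_frac=0.05):
    """Largest Laplacian variance of any small window inside the ROI.
    `gray_roi` is the grayscale crop of the bounding box; `win_frac`
    sets the window size as a fraction of the ROI's width/height."""
    lap = cv2.Laplacian(gray_roi, cv2.CV_64F)
    h, w = lap.shape
    wh, ww = max(2, int(h * win_frac)), max(2, int(w * win_frac))
    sh, sw = max(1, int(h * stride_frac)), max(1, int(w * stride_frac))
    best = 0.0
    for y in range(0, h - wh + 1, sh):
        for x in range(0, w - ww + 1, sw):
            best = max(best, lap[y:y + wh, x:x + ww].var())
    return best
```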
Does anyone cleverer/more experienced than me have a fantastic solution for this?
It's clearly possible, not least because http://remove.bg and Photoshop already do a fantastic job of separating foreground from background. But how?
EDIT: I completely neglected to mention that I'm using a Laplacian convolution on a grayscale version of the photos before computing the variance, to detect the edges.
Answer
For your application, image segmentation would be more useful than bounding boxes, which also contain background. Other useful keywords: instance-aware image segmentation, instance segmentation.
Figure 1. Instance segmentation example image from Mask R-CNN, by Karol Majek. Bounding boxes are also shown.
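Once you have a per-instance mask, the focus metric from the question carries over directly: compute the Laplacian variance over the masked pixels only, instead of over a box or cross. Here is a rough sketch using torchvision's pre-trained Mask R-CNN (the model choice and thresholds are just one possible option, not tied to the implementations linked below):

```python
import cv2
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pre-trained Mask R-CNN from torchvision (one possible segmentation model).
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def masked_focus_scores(image_bgr, score_thresh=0.7, mask_thresh=0.5):
    """Laplacian variance computed over each detected instance's mask."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        pred = model([tensor])[0]

    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    lap = cv2.Laplacian(gray, cv2.CV_64F)

    results = []
    for mask, score, label in zip(pred["masks"], pred["scores"], pred["labels"]):
        if score < score_thresh:
            continue
        m = mask[0].numpy() > mask_thresh      # (H, W) boolean instance mask
        if m.any():
            results.append((int(label), float(lap[m].var())))
    return results   # list of (COCO class id, focus score) pairs
```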
Examples of implementations using some version of Yolo:
Other implementation examples: