Perceptual hash

Continuing with metrics…

Everybody can easily recognize a tune, a famous painting or the face of a friend. Not so for a computer, which must first digitize the information and then operate on this dataset to find similarities. In the early days of the computer era, people thought it would be easy to recognize characters or voices, but technology had a hard time with these problems. OCR is still perfectible, and voice recognition is still not common (things are getting better, but Siri is still fallible).

This is of particular importance nowadays, since our societies tend to control the population more and more, in the name of terrorism prevention. Since they cannot employ as many operators as there are citizens, there is a strong need for automation and recognition. The general problem of recognition is to find patterns that reflect a feature.

I discovered the notion of “perceptual hash”, which is essentially the problem of recognition through the simplification of data, while wondering how websites such as TinEye manage to find similar images. I subsequently discovered that there is an open-source library called pHash that provides this kind of feature (not to mention computer vision frameworks such as OpenCV, or the incredible Accord.net developed by Cesar Souza).

To find similar pictures, there are several hurdles. First, all the pictures might not have the same size, and they can be cropped, (slightly) modified or rotated. It seems that there are two solutions, sketched in the code below. The first consists in a brutal downsizing of the images, followed by a Fourier transform: the transform coefficients capture the fingerprint of the image (this is somehow similar to how Shazam works) for comparison against a database.
Another option is to consider the histogram instead of the picture itself, and perform the comparison using a least-squares or Bhattacharyya distance.
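To make the first idea concrete, here is a minimal sketch (my own, not code from TinEye or pHash): shrink the picture, take a 2-D DCT (a real-valued cousin of the Fourier transform), keep only the low-frequency corner as the fingerprint, and compare fingerprints with a Hamming distance. File names are placeholders.

```python
# Downsize + transform fingerprint, in the spirit of perceptual hashing:
# shrink the image, take a 2-D DCT, keep the low-frequency corner and
# threshold it against its median to get a 64-bit signature.
import numpy as np
from PIL import Image
from scipy.fft import dct

def phash(path, hash_size=8, highfreq_factor=4):
    size = hash_size * highfreq_factor            # e.g. a 32x32 working image
    img = Image.open(path).convert("L").resize((size, size), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    # 2-D DCT = DCT along the rows, then along the columns
    coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = coeffs[:hash_size, :hash_size]          # low-frequency corner
    return (low > np.median(low)).flatten()       # 64 booleans = the fingerprint

def hamming(h1, h2):
    # Number of differing bits: a small distance means perceptually similar images
    return int(np.count_nonzero(h1 != h2))

# Usage (hypothetical file names):
# print(hamming(phash("original.jpg"), phash("resized_copy.jpg")))
```

And a sketch of the histogram option, assuming OpenCV is available: two pictures with similar grey-level distributions score close to 0 under the Bhattacharyya distance, wherever the pixels actually sit in the frame.

```python
# Histogram comparison sketch: build normalized grey-level histograms and
# measure how far apart they are with the Bhattacharyya distance.
import cv2

def histogram_distance(path_a, path_b, bins=64):
    hists = []
    for path in (path_a, path_b):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        hist = cv2.calcHist([img], [0], None, [bins], [0, 256])
        hists.append(cv2.normalize(hist, hist).flatten())
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_BHATTACHARYYA)
```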

Size reduction and/or histograms to find similar images

I guess this is the one used by TinEye
These methods are rather robust to slight changes in the image, because no scale information is kept and the exact location of individual pixels is erased.
Note that Google image search is based on contextual information, rather than on the actual images.

Face recognition is another problem. First, one has to detect a face. This is done through machine learning and particular metrics. I remember (from when I used it) that a popular algorithm was the Viola-Jones algorithm. The underlying idea is to compute the so-called integral image, where each pixel is the sum of all the preceding pixels, and to compare regions with Haar basis functions. It turns out that a face is (perceptively) the combination of certain Haar functions, the successful combinations being determined by learning. For facial recognition (i.e. to tell two persons apart, in software such as iPhoto or Facebook), I don’t know how they do it!
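As a toy illustration (mine, not code from any library), here are the two building blocks just mentioned: the integral image, and one two-rectangle Haar-like feature evaluated in constant time from it. The window coordinates are arbitrary.

```python
# Integral image: each entry is the sum of all pixels above and to the left.
# A Haar-like feature is then just a few lookups into that table.
import numpy as np

def integral_image(img):
    # Cumulative sum over rows then columns; pad with a zero row/column so
    # rectangle sums need no special cases at the border.
    ii = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] using only four lookups in the integral image.
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def haar_two_rect(ii, r0, c0, h, w):
    # "Bright left half, dark right half" feature: the difference of the two
    # rectangle sums. Learning decides which features, at which positions and
    # scales, actually separate faces from non-faces.
    half = w // 2
    return rect_sum(ii, r0, c0, r0 + h, c0 + half) - \
           rect_sum(ii, r0, c0 + half, r0 + h, c0 + w)

# Toy usage on a random "image":
img = np.random.randint(0, 256, (24, 24))
ii = integral_image(img)
print(haar_two_rect(ii, r0=4, c0=4, h=8, w=12))
```

The real detector combines thousands of such features, selected by boosting and arranged in a cascade; OpenCV ships pre-trained cascades of this kind behind cv2.CascadeClassifier.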

Spell correction is also a class of feature recognition. A good explanation of how it works is detailed here. I first thought that the method would rely on specific word weighting, but no: it seems to be only inferences ;-(. I also want to mention the map of the Paris metro made only with anagrams, which really impressed me. I still don’t know how the guy managed to get all these anagrams. Not by hand, I hope!
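I cannot vouch for what the linked explanation actually does, but the usual “inference only” recipe is the frequency-and-edits approach popularised by Peter Norvig’s essay. A minimal sketch (the corpus file name is a placeholder):

```python
# Norvig-style spell correction: generate every string one edit away from the
# misspelling, keep the ones that exist in a reference corpus, and pick the
# most frequent candidate.
import re
from collections import Counter

# Word frequencies from a reference corpus ("corpus.txt" is a placeholder name)
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in WORDS:                       # already a known word
        return word
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)   # most frequent known candidate wins

# print(correct("speling"))   # -> "spelling", given a large enough corpus
```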

For optical character recognition, they also use neural networks, probably in conjunction with wavelets or mathematical morphology to remove all unwanted features. But I never got very deep into that problem… Can somebody give me information on the state of the art?
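For what it is worth, here is a guess at what the morphology clean-up step could look like (assuming OpenCV; the pipeline is purely illustrative): binarise the scanned page and apply an opening to remove speckle noise before any character classifier sees it.

```python
# Morphological clean-up before OCR: threshold the page, then open it to
# knock out small isolated specks.
import cv2
import numpy as np

def preprocess_for_ocr(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Otsu's threshold separates ink from paper automatically
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Opening = erosion followed by dilation: removes small isolated specks
    kernel = np.ones((2, 2), np.uint8)
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
```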


I want to mention here the website Hook theory (and the companion iPad book), which tries to explain how music works, emotionally.
I guess that understanding why a feature is the way it is is subtler than a brute-force search for patterns.
Emotion vs. Reason. Again.