Segmentation: how do we classify images in computer vision?

Segmentation is one of the most important activities in computer vision. The ability to recognise objects in what we see is automatic in all living beings with eyes. We do not need to learn it: we naturally see shapes and objects, and we can differentiate them instantly. A computer image, however, is a simple matrix of dots (called pixels), and the computer is not able to recognise any shape in it or to associate any meaning with it. The image is a mere collection of dots, each with its associated colour, and that is it. If we want to create a system that is able to interact with what it “sees”, it is imperative that we teach it how to transform this collection of dots into a collection of objects and to assign them semantic labels.

It may seem that segmentation merely segregates the data contained in an image or video file, but in reality it assigns meaning to the components we are interested in. The segmentation process can be divided into two steps, though it may not always be necessary to perform both of them.

If we want to obtain high-quality data, we must pay close attention to each of the individual steps in these processes.

Semantic segmentation

Semantic segmentation is the process in which the matrix of dots is scanned and a label is assigned to each dot to mark its semantic class. For example, in an image of a highway it marks two kinds of items: the cars and the road (for example, for counting cars).
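To make this concrete, here is a minimal sketch of what the output of semantic segmentation looks like: a label map the same shape as the image, where every pixel carries a class id but no object identity. The tiny mask and the class ids (0 = background, 1 = road, 2 = car) are invented for illustration:

```python
import numpy as np

# A hypothetical 4x6 "image" already labelled by a semantic segmentation
# model. Class ids are made up for this example:
# 0 = background, 1 = road, 2 = car.
sem_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
])

# Every pixel has a class label, but no object identity:
# all "car" pixels share the same value 2, so we can measure
# how much of the image is road or car, but not tell cars apart.
road_pixels = int(np.sum(sem_mask == 1))
car_pixels = int(np.sum(sem_mask == 2))
print(road_pixels, car_pixels)  # 14 road pixels, 4 car pixels
```

Note that from this mask alone we cannot say whether those four car pixels belong to one car or two; that is exactly the gap instance segmentation fills.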

Instance segmentation

Instance segmentation associates a more specific label with each object. Using the highway example above, it marks each car with its own label, so we know not only how many cars there are but also which ones (for example, recognising the plates).
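The difference can be sketched in a few lines: starting from the hypothetical semantic mask above (0 = background, 1 = road, 2 = car), a simple connected-component pass, used here only as a stand-in for a real instance segmentation model, gives each separate blob of car pixels its own instance id:

```python
import numpy as np
from collections import deque

# Same hypothetical semantic mask: 0 = background, 1 = road, 2 = car.
sem_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
])

def instances(mask, cls):
    """Give every 4-connected blob of class `cls` its own id (1, 2, ...)."""
    ids = np.zeros_like(mask)
    next_id = 0
    for start in zip(*np.nonzero(mask == cls)):
        if ids[start]:
            continue                      # pixel already belongs to a blob
        next_id += 1
        ids[start] = next_id
        queue = deque([start])
        while queue:                      # flood-fill one blob
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                        and mask[nr, nc] == cls and not ids[nr, nc]):
                    ids[nr, nc] = next_id
                    queue.append((nr, nc))
    return ids

car_ids = instances(sem_mask, cls=2)
print(int(car_ids.max()))  # 2 -- two distinct cars, not just "car pixels"
```

Where the semantic mask only said "these pixels are car", the instance map now distinguishes car 1 from car 2, which is what lets us count and track individual objects.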

Depending on the problem at hand, one or both segmentations must be done properly. If a problem with pixel assignment occurs during semantic segmentation, pixels may be assigned to the wrong class, and the resulting objects may be deformed (for example, two cars might be fused together and counted as one).
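The fusion effect is easy to demonstrate. In this toy example (class ids invented as before: 1 = road, 2 = car), a single mislabelled pixel in one image row merges two car regions into one, so the count drops from two to one:

```python
import numpy as np

def count_runs(row):
    """Count maximal runs of 'car' pixels (value 2) in one image row --
    a 1-D stand-in for counting car instances."""
    is_car = (np.asarray(row) == 2).astype(int)
    # A run starts wherever a car pixel is not preceded by another car pixel.
    starts = np.diff(np.concatenate(([0], is_car))) == 1
    return int(starts.sum())

good_row = [1, 2, 2, 1, 2, 2, 1]   # two cars separated by one road pixel
bad_row  = [1, 2, 2, 2, 2, 2, 1]   # the gap pixel mislabelled as "car"

print(count_runs(good_row), count_runs(bad_row))  # 2 1
```

One wrongly labelled pixel is all it takes: the separating road pixel is what keeps the two cars apart, which is why per-pixel labelling accuracy matters so much.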

Let's consider an example of a company system dealing with the protection of property and objects:

If the system is trained on badly marked image data, i.e. the people in the pictures are not fully marked (e.g. only part of the torso or the fingers are labelled, which might well resemble bush branches), the “human model” developed by the system will differ from what it should be. It is at the segmentation stage that the system will fail to determine what it “sees”, because the data it was trained on did not give it a correct image of a person. Learning from incorrect data will make the system react too often, disrupting operations, or, even worse, fail to react to an event because it does not recognise the human figure. This can result in huge financial losses for both the client and the security company.

As the example above shows, correct segmentation plays a key role in determining what is in an image. The more accurately the data is marked, the better the system's results and the greater the potential savings for the customer.

Paweł Cyrta, Head of AI @ DataLabeling.EU

Paweł Cyrta, a specialist in sound, voice, music and multimedia. An experienced researcher and software developer specialising in the analysis and processing of music, voice and audio signals. He has extensive knowledge of IT systems, Open Source software implementation, Data Science, data mining, web mining, text mining, NLP, Big Data and Machine Learning (HMM, GMM, SVM, ..., BDN, Deep Learning, ...). He has deep expertise in sound and audio solutions, broadcast systems, and audio processing, compression and coding. He is also at home with psychoacoustics, room acoustics, 3D modelling, programming and sound engineering.