If you haven't done so yet, we suggest you start by reading our previous blog post introducing the concepts behind the reverse image search algorithm. We are reusing here all the concepts and technologies that were introduced there. In particular, we are going to reuse the same topless VGG16 algorithm. A large aerial picture of a west coast neighborhood is used as a benchmark for today's blog. The benchmark picture was taken from the Draper Satellite Image Chronology Kaggle competition.
The main differences with respect to the introduction blog post are listed below.
Satellite or aerial pictures are generally quite large compared to the size of the objects of interest in the pictures. For instance, the dimension of our benchmark picture is 3100x2329 pixels while the typical size of a car in the picture is about 40x20 pixels. In comparison, in the introductory post, we had 104 pictures of at most 166x116 pixels.
The best way to handle this difference is to divide the picture into same-size tiles. We compute the picture DNA of each tile using the topless VGG16 algorithm introduced earlier. Later on, we build up a similarity matrix by computing the cosine similarity between the DNA vectors of all pairs of tiles. The size of the tiles should be slightly larger than the size of the objects we are interested in. In this blog, we are interested in objects with a scale of the order of a meter and we've therefore decided to use tiles of 56x56 pixels. The figure below shows the benchmark picture with a grid of 56x56 cells.
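To make the similarity matrix step concrete, here is a minimal sketch in plain Python. It assumes the DNA vectors have already been computed and are available as lists of floats; the function names and the convention of returning 0 for an all-zero vector are our own choices for illustration, not necessarily what the demo uses. A real pipeline would most likely vectorize this with NumPy.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(dna_vectors):
    """Symmetric matrix of cosine similarities between all pairs of vectors."""
    n = len(dna_vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The matrix is symmetric, so each pair is computed only once.
        for j in range(i, n):
            score = cosine_similarity(dna_vectors[i], dna_vectors[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix
```

Row i of the matrix then holds the similarity of tile i to every other tile, so finding the best matches for a tile amounts to sorting its row.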
ConvNets perform better at identifying object features when those are well centered on the image. Therefore, we need to guarantee that objects of interest are never split between two tiles: we don't want the tile separation grid to cut the objects in two. The best way to guarantee that is to duplicate all tiles with an offset of a fraction of the tile size. Of course, we need to apply the offset horizontally, vertically and in both directions at once. The offset is typically half of the tile size, but it can also be smaller. In today's benchmark, we are using an offset of 14 pixels (a quarter of the tile size). With such an offset, the number of tiles is multiplied by 16. This guarantees that the object of interest is always close to the center of at least one tile, so that the ConvNet can properly extract the object features.
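The offsetting strategy is equivalent to sliding a 56x56 window over the picture with a stride of 14 pixels. The sketch below lists the top-left corner of every such tile and crops one tile out of a picture stored as a list of pixel rows. The function names are hypothetical, and dropping tiles that do not fit entirely inside the picture is an assumption of ours; the demo may handle borders differently, so its tile count can differ slightly.

```python
def tile_origins(width, height, tile=56, stride=14):
    """Top-left corners of all tiles that fit entirely inside the picture."""
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y) for y in ys for x in xs]


def crop_tile(image, x, y, tile=56):
    """Cut a tile out of an image stored as a list of rows of pixels."""
    return [row[x:x + tile] for row in image[y:y + tile]]
```

With a stride equal to the tile size we get the plain grid; with a stride of a quarter of the tile size the number of tiles approaches 16 times that of the plain grid for large pictures, the difference coming from the tiles lost at the borders.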
A note about computing time: The total number of tiles is inversely proportional to the square of the tile size. The computing time to compute the tile DNA scales linearly with the number of tiles, but the similarity matrix computation scales quadratically with the number of tiles. So you should perhaps think twice before using tiny tiles.
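The scaling argument can be checked with a few lines of arithmetic. The sketch below is a rough estimate only: it counts the tiles of a plain grid (no offsets) and ignores partial tiles at the borders, and both helper names are ours.

```python
def approx_tile_count(width, height, tile):
    """Number of whole tiles in a plain grid, ignoring partial border tiles."""
    return (width // tile) * (height // tile)


def pair_count(n):
    """Number of distinct pairs among n tiles, i.e. similarities to compute."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for tile in (56, 28):
        n = approx_tile_count(3100, 2329, tile)
        print(tile, n, pair_count(n))
```

Halving the tile size multiplies the number of tiles by about 4 and the number of similarities to compute by about 16.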
Pictures from the sky
By definition, satellite or aerial pictures are taken from the sky. This might appear as a negligible detail for what we are trying to achieve, but it is not. As previously explained, the VGG16 algorithm was trained using the ImageNet dataset, which is made of "Picasa"-like pictures. These pictures have a vertical orientation. This means that cars always have their wheels at the bottom of the pictures. Similarly, characters and animals have their heads above their legs in most of the pictures. The algorithm has learned that the pictures aren't invariant under rotation or top/down flipping. In other words, the meaning of a car picture with the wheels at the top is very different from that of the same picture with the wheels pointing down to the road.
On the contrary, pictures from the sky have an invariant meaning against either flip symmetry or rotation. In aerial pictures, the orientation of the object is totally random and has no particular meaning. A picture of a car driving toward the west or the east is still the picture of a driving car.
Since the VGG16 algorithm has learned to account for the orientation of the image, we will need some extra steps to make our similar picture finder insensitive to the orientation. We have two options:
Option 1: we retrain the VGG16 algorithm to learn that picture A and picture A flipped or picture A rotated must have the same DNA vector. This is fully unsupervised learning, as we don't need a set of labeled pictures to perform this training. We can use randomly chosen tiles from the main picture to perform this training. However, we would still need a relatively large amount of time to perform this retraining. In addition, there is some risk that the performance of the VGG16 algorithm at identifying picture features gets reduced by this loss of picture orientation. This option might be worth doing if you consider a larger project with a gigantic number of tiles to process.
Option 2: instead of computing just one DNA vector per picture, we can compute one DNA vector per picture and per symmetry transformation. Then, when we try to identify similar images, we can compare the reference picture DNA vector to the vectors of all other pictures and their symmetries. This approach is a bit heavier in terms of CPU when computing the similarity matrix, but the implementation is straightforward. Note that VGG16 is expected to be already insensitive to left/right symmetry, so the only symmetries that we should consider are actually rotations. The top/down symmetry is not needed either, as it can be decomposed into a 180-degree rotation followed by a left/right symmetry.
For today's benchmark, we opted for option 2 and the symmetries that we considered are simply the four 90-degree rotations. This increases the number of DNA vectors by yet another factor of 4. In total, we have considered 143664 tiles+symmetries in this benchmark demo. This leads to the computation of about 10 billion cosine similarities.
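The idea of comparing a reference tile against every rotation of a candidate can be sketched as follows. The code assumes a tile is stored as a list of pixel rows and that the DNA vector of each rotated tile has already been computed by the network; the function names are hypothetical and the vectors used in practice would come from the topless VGG16, not from toy data.

```python
import math


def rotate90(tile):
    """Rotate a tile (list of pixel rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*tile[::-1])]


def four_rotations(tile):
    """The tile rotated by 0, 90, 180 and 270 degrees."""
    rotations = [tile]
    for _ in range(3):
        rotations.append(rotate90(rotations[-1]))
    return rotations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_rotation_match(reference_dna, candidate_rotation_dnas):
    """Best similarity between the reference and any rotation of a candidate.

    candidate_rotation_dnas holds one DNA vector per rotation of the
    candidate tile, in the order returned by four_rotations.
    """
    scores = [cosine_similarity(reference_dna, dna)
              for dna in candidate_rotation_dnas]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return scores[best_index], best_index
```

Keeping the best score over the four rotations is what makes the search insensitive to orientation, at the price of four times more DNA vectors to store and compare.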
For this benchmark, we have created a full application demo with:
a D3js frontend: which provides a nice interface on top of the algorithm. The D3js interface that we built allows zooming in on a specific area of the picture (using the mouse wheel), moving within the picture by dragging it and selecting an area of interest by clicking on the picture. Once an area of interest is selected, the 10 most similar areas are highlighted. Moving the mouse over those areas shows the value of the cosine similarity.
a Django backend: where the heavy processing is executed. The Django backend allows us to execute our Python code on the fly and eases the access to the database storing the precomputed cosine similarities for all the tile+symmetry pairs.
a database with an index: which allows returning the most similar areas associated with a specific tile in no time.
a result filtering module: because we have several tiles covering the same area (or a part of the area) due to the offsetting strategy we chose, there is a good chance that the algorithm finds that the most similar areas are the reference area shifted by a small offset (a quarter of the tile size in our benchmark example). Although this is indeed a valid result, it is certainly not the interesting result we are naively expecting. So we have added a cross-cleaning module that rejects tiles from the result list that partially cover either the reference area or a tile with a better similarity score that is already in the result list.
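The cross-cleaning step of the result filtering module can be sketched as a greedy filter over the candidates sorted by decreasing similarity. This is our own minimal reading of the description above, not the code of the demo: tiles are represented by their top-left corner, candidates are hypothetical (x, y, score) tuples, and two tiles are considered overlapping as soon as they share a pixel.

```python
def overlaps(a, b, tile=56):
    """True when two same-size tiles, given by top-left corner, share a pixel."""
    return abs(a[0] - b[0]) < tile and abs(a[1] - b[1]) < tile


def filter_matches(reference, candidates, top_k=10, tile=56):
    """Keep the best non-overlapping matches.

    reference  -- (x, y) top-left corner of the reference tile
    candidates -- iterable of (x, y, score) tuples
    """
    kept = []
    # Visit candidates from best to worst so that a kept tile always has a
    # better score than any later tile it causes to be rejected.
    for x, y, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if overlaps((x, y), reference, tile):
            continue  # partially covers the reference area
        if any(overlaps((x, y), (kx, ky), tile) for kx, ky, _ in kept):
            continue  # partially covers a better match already kept
        kept.append((x, y, score))
        if len(kept) == top_k:
            break
    return kept
```

For example, with a reference at (0, 0), a candidate at (14, 0) is rejected even with a very high score, since it is just the reference area shifted by a quarter of a tile.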
The full demo is visible at the bottom of this blog post but is also available on this page.
Performance on specific examples
In this section, we discuss the performance of the algorithm at finding similar objects for some specific examples. You can repeat these tests by yourself in the demo app. In the series of pictures below, the top-left image shows the reference area (the one for which we are trying to find similar matches) in blue. The rest of the image is just there to show the reference tile in its context. The 9 other pictures show the closest matches in their context. For tiles that are close to the main picture borders, the missing context (outside the picture) is shown as a black area. For the matches, the similarity score is also given for reference. In both the reference and match pictures, the tile coordinates are also given, so you can try to locate the tile in the main picture (or in the demo app).
The reference object is a usual road marking symbol. In the 9 similar tiles proposed by the algorithm, we see that 6 of them have the same road marking. The 3 others are clearly false positives, but if we look closer at the reference picture, we can notice that it captures the side of a car on the left and on the right of the image as well as a large area of road. The 3 false positive pictures don't have the road marking, but they still have these 3 other features, so they could also be considered similar images. If we needed to develop such a marking finder application, we could avoid these false matches by adapting the reference area to the exact size of the road marking.
The reference object is a rather dark vehicle parked along the sidewalk. In the 9 similar tiles proposed by the algorithm, we see that 4 of them have the same type of vehicle in the same context. The others are clearly false positives, generally showing roof parts. Here the algorithm clearly focuses on the color pattern of the reference image, where the dark gray area of the road is close to a lighter area from the sidewalk. Interestingly, a similar type of pattern caused by light/shadow effects is also present in some house roof pictures.
This time, the reference object is a white vehicle parked along the sidewalk. In the 9 similar tiles proposed by the algorithm, we see that 6 of them are really perfect matches. Another one picked up a colored car instead of a white one. And the 2 others are just showing a picture with 50% road and 50% sidewalk. In these cases, the algorithm focused on features of the background rather than on the vehicle. We can also notice that in these images, there is a rectangular shape on the sidewalk which may be interpreted as a vehicle by the algorithm.
White cars in front of a house:
We can play the same game with a white car that is parked in front of a house this time. Here we have a perfect score: all the matches are indeed showing similar scenes.
Road, sidewalk, and grass:
Let's try a picture made of three "background" parts: road, sidewalk and some grass on the corner. In the absence of a "main" object, the algorithm may focus on unexpected picture features, so it's an interesting example. We would say that the algorithm is providing meaningful pictures in 8 out of 9 cases. In these pictures, there is indeed always some part of road, sidewalk, and grass. The proportion of each component may differ significantly, but they are always present. Sometimes we have an extra object in addition to these components (e.g. a car or a bush), but that's OK. We also count one false positive, again caused by a light/shadow effect on a roof.
Another complex example is this reference picture showing a road corner (with a curved sidewalk and some grass). The first four pictures are quite good matches. The first one, in particular, is almost identical to the reference picture. We can also note that the algorithm catches roads turning in a quite different way. The other 5 pictures are totally wrong, which is consistent with their relatively low similarity scores. A quick look at the global picture easily gives us the explanation: only very few parts of the image actually show road corners, so the algorithm has a hard time finding similar areas. So it returns whatever it can find that shows similar features. We can, for instance, notice the circular swimming pool, which has almost the same bend radius as the road corner.
Finding solar panels on house roofs from aerial pictures has a lot of applications in marketing, statistics, and energy forecasting. So it is interesting to see how well the algorithm performs at this simple task. We got 3/9 matches which are indeed showing solar panels. Three pictures are showing roofs with some objects on them. And the last three are complete false positives. But looking again at the global picture, we can notice that the number of roofs with solar panels is actually quite limited in this neighborhood, so again, the algorithm has a hard time finding matches and does what it can. Moreover, the tile size is not necessarily appropriate for finding objects of that size.
Finding specific species of trees also has many sorts of daily life applications. Is this algorithm capable of telling the difference between a palm tree and an oak? Let's try.
We picked a reference picture clearly showing a palm tree next to a house. The matches include 6 pictures showing palm trees, while the three other matches are showing other sorts of trees. What is interesting is that in some cases, the algorithm has a better match on the shadow of the tree rather than on the tree itself. This is quite unexpected, but looking closer at those matching pictures, we would say that this is also true for the human eye.
You can find below the full demo that we have built using the topless VGG16 algorithm wrapped in a Django backend and a D3js frontend.
Click on an area of interest in the satellite image below. The Deeper Solution algorithm for reverse image search will find the 10 tiles in the region that look the most similar to the place you selected. Move the mouse over a picture match (blue square) to see its similarity score compared to the reference area (red square). You can zoom in on the picture using the mouse wheel and pan/move the picture by holding the mouse button down while moving the mouse. Open the demo in a separate window.
In conclusion, we have seen that convolutional neural networks are quite helpful at finding similar areas in either aerial or satellite pictures. We have demonstrated that encoding small picture areas into small feature-sensitive DNA vectors makes the picture finder both efficient and fast. We were able to find very specific objects like palm trees, solar panels or specific types of cars, without even training the algorithm to recognize those objects. A small benchmark application was developed to demonstrate the ease of deploying such a technique for business solutions. Moreover, the approach and the algorithm can easily be scaled to extremely large pictures (or picture collections) covering large cities or even countries using distributed computing (through Apache Spark for instance).
Have you already faced similar types of issues? Feel free to contact us, we'd love to talk to you…
If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!