Teaching Computers to See with AI
Everything You Maybe Never Wanted to Know about Computer Vision
Computer vision stands poised to revolutionize countless sectors of our global society.
In the automotive industry, it powers self-driving cars to navigate complex urban environments, ensuring safer roads and reshaping urban mobility. In healthcare, it aids in the rapid diagnosis of diseases by analyzing medical imagery with precision, sometimes catching nuances that might escape even the trained human eye. Retailers employ it to enhance the shopping experience, allowing customers to try on clothes virtually or scan shelves for products. Agriculture harnesses its capabilities to monitor crops and detect early signs of disease or pest infestation, ensuring healthier yields. Meanwhile, in the realm of entertainment, it's used in augmented reality applications, video games, and movie production to create more immersive experiences.
Even if you don’t understand all the details, computer vision with AI has huge implications in nearly every professional field. This technology, by giving machines a semblance of human sight, holds the promise of streamlining processes, enhancing efficiencies, and opening doors to innovations previously deemed the stuff of science fiction.

I know it’s a robot. You know it’s a robot. How does my computer know it’s a robot? And what is it doing on the beach??
And what’s more, you can do it on your own computer or with free resources. We’ll go through some examples after a bit of a breakdown. If you want to skip the breakdown and get to the fun pictures, I won’t hold it against you.
Demystifying Computer Vision: Bridging Pixels and Perception
At its heart, computer vision seeks to teach machines to interpret visual information in much the same way we humans do. But, while the concept is straightforward, the underlying mechanisms are intricate. Let's peel back the layers.
Pixels to Patterns: Imagine a jigsaw puzzle. Each piece, no matter how insignificant it seems, plays a crucial role in completing the image. Similarly, every image fed into a computer is dissected into tiny blocks called pixels. Each pixel has a specific color and intensity. Computer vision starts by analyzing these pixels, attempting to identify patterns and connections. It's like recognizing that certain jigsaw pieces, based on their colors and shapes, are parts of the same section.
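If you want to see this for yourself, here’s a tiny sketch in Python (the robot.jpg file name is just a stand-in for any image you have handy):
from PIL import Image
import numpy as np

# Load the image and convert it to a NumPy array of raw pixel values
image = Image.open("robot.jpg").convert("RGB")  # hypothetical file name
pixels = np.asarray(image)

# Every pixel is just three numbers: red, green, and blue intensity
print(pixels.shape)   # (height, width, 3)
print(pixels[0, 0])   # the color of the top-left pixel, e.g. [135 206 235]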
Feature Extraction: Remember those 'spot the difference' puzzles? You'd focus on specific details or features to discern one image from another. Computer vision employs a similar technique but at an advanced scale. Algorithms sift through images identifying unique features – a curve, an edge, a color gradient. These features become the building blocks, helping the system distinguish one object from another. For instance, recognizing the curves and contours that typically represent a human face within a picture.
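A toy version of this is one import away. Pillow ships a basic edge-detection filter; it’s nowhere near what modern vision models do, but it shows features being pulled out of raw pixels (robot.jpg again being a stand-in):
from PIL import Image, ImageFilter

# Edges are one of the simplest 'features' an algorithm can extract
image = Image.open("robot.jpg").convert("L")   # grayscale simplifies edge detection
edges = image.filter(ImageFilter.FIND_EDGES)   # built-in edge-detection kernel
edges.save("robot_edges.jpg")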
Deep Learning & Neural Networks: Drawing inspiration from our brain's neural structure, computer vision uses artificial neural networks, particularly deep learning models. Think of these as layers of interconnected nodes or "neurons." Each layer processes an aspect of the image, refining and interpreting its understanding as it progresses deeper… much like how our brain processes visual cues, starting from recognizing shapes and colors to discerning a familiar face in a crowd.
Classification & Recognition: Once an image is dissected and its features understood, the system can categorize what it 'sees'. Using pre-existing datasets, it matches the analyzed features to known objects, labeling or 'classifying' them. It's akin to a child learning to identify animals: over time, by seeing many images and real-life examples of cats, they can recognize and name a cat when they see one.
So let’s take a look at some examples of computer vision ranging from wow to holy shwow!
Making it Work
Before we get into the meat, I do want to give a quick spot of recognition to Midjourney. If you haven’t heard of it, Midjourney is a VERY user-friendly way to generate images. All it takes is Discord and a subscription to use their bot. In fact, the image at the beginning of this article was made with nothing but a simple prompt.

The Midjourney Discord Bot
I may publish some tips and tricks for Midjourney down the line, but as it’s a closed ecosystem it’s not the subject of this article.
First up on the list: Segment Anything Model (SAM) from Meta AI
Segment Anything is a model trained on images and their segmentation masks. It knows (as much as a computer can “know” anything) what various objects are, what they look like, and how they break down into parts. Think of how you look at any given person. Every person looks different, with different features, shapes, and colors, but you still know what a human looks like, and you can ‘segment’ them into parts: arms, legs, head, and so on.
SAM has been trained on around 11 million images (all licensed and privacy-preserving). With that model, you can programmatically identify the ‘parts’ of a picture: prompt it with a bounding box, a click, or a few dots on the page, or just let the model go to work and segment everything.
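As a taste of that point-prompting, here’s a minimal sketch with Meta’s segment-anything package. The image path and click coordinates are made up for illustration, and you’d need to download the model checkpoint from their repo first:
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load the SAM weights (checkpoint downloaded separately from the Meta AI repo)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.asarray(Image.open("beach_robot.jpg").convert("RGB"))  # hypothetical image
predictor.set_image(image)

# A single foreground click (x, y) is enough to ask 'what object is here?'
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # made-up coordinates for illustration
    point_labels=np.array([1]),           # 1 = foreground point
)
print(masks.shape, scores)  # candidate masks for the clicked object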

Auto Segmentation with SAM
Here’s an example where I just let SAM go to work. It automatically scans the image, finds the different features, and annotates each one as a segment. Each of these segments can be extracted, modified, or analyzed: whatever you’d like.
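The automatic mode is just as compact. Something along these lines (same checkpoint caveat as before; the image path is a placeholder) is roughly what the example notebook linked below walks through:
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.asarray(Image.open("beach_robot.jpg").convert("RGB"))  # hypothetical image
masks = mask_generator.generate(image)  # one dict per discovered segment

# Each mask comes with its pixel area, bounding box, and quality scores
print(len(masks), masks[0].keys())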
The paper: https://arxiv.org/abs/2304.02643
A good explainer: https://encord.com/blog/segment-anything-model-explained/
Notebook to test it yourself from Facebook Research: https://github.com/facebookresearch/segment-anything/blob/main/notebooks/automatic_mask_generator_example.ipynb
Now let’s look at how we can go further with prompts.
Lang Segment Anything from Luca Medeiros
This library takes SAM a step further by adding in the GroundingDINO detection model, so we can use words to find objects. It’s what let me find the robot in that first picture, all with a few lines of code thanks to LangSAM. Take a look here, where I ask it to find the airplanes in an image:
from PIL import Image
from lang_sam import LangSAM
from lang_sam.utils import draw_image
import numpy as np

model = LangSAM()  # loads SAM plus the GroundingDINO text-grounded detector

# Open the image and describe what we're looking for in plain English
image_pil = Image.open("./assets/airport.jpg").convert("RGB")
text_prompt = "airplane"

# Returns a mask, bounding box, matched phrase, and confidence for each hit
masks, boxes, phrases, logits = model.predict(image_pil, text_prompt)

# Label each detection with its phrase and confidence score
labels = [f"{phrase} {logit:.2f}" for phrase, logit in zip(phrases, logits)]

# Draw the masks and boxes onto the image and save the result
image_array = np.asarray(image_pil)
image = draw_image(image_array, masks, boxes, labels)
image = Image.fromarray(np.uint8(image)).convert("RGB")
image.save('./assets/outputs/planes.png')

Found you! AND your shadow.
Look closely and you’ll see a red box around each of the planes in the picture. With LangSAM, I asked for airplanes and it automatically found them all. We’ll see how CAPTCHA deals with that in the future! For now, imagine the possibilities of uploading medical imagery and saying “find the tumors.”
Last but not Least: Let’s Make an Image!
Whereas Midjourney is a closed model, Stability AI has released Stable Diffusion XL, an open model available for anyone to work with. And it’s incredibly easy to use. Without a powerful graphics card of my own, I couldn’t run it on my laptop, but with a free Google Colab account I could connect to a GPU-capable machine and boom: computer-generated imagery.
from diffusers import DiffusionPipeline
import torch

# Load Stable Diffusion XL in half precision so it fits on a single GPU
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
pipe.to("cuda")  # move the model onto the GPU

prompt = "Two birds playing an electric guitar"
image = pipe(prompt=prompt).images[0]  # generate one image from the text prompt
image.save("birds_play_guitar.png")

And here we go!
And that was just a ridiculously simple example. You can try it yourself with a copy of my notebook.
My Colab Notebook for Stability XL: https://colab.research.google.com/drive/1MnuaKqIKmhVYB17TReXTrBde0q28tcCv?usp=sharing
Final Thoughts
As we continue to explore the capabilities of computer vision combined with AI image manipulation, we stand at the threshold of a new understanding. It’s not just about the technology but the stories, insights, and possibilities it unlocks. There’s a ton of opportunity both in learning how to work with these tools and in understanding how they will shape our future in some incredibly interesting ways.