What is Segment Anything?
Segment Anything is a Meta AI research project that introduces the Segment Anything Model (SAM), which can "cut out" any object in an image with a single click. SAM is a promptable segmentation system that generalizes zero-shot to unfamiliar objects and images without additional training.
Features of Segment Anything
- Zero-shot generalization to unfamiliar objects and images, with no additional training required
- Promptable design that enables flexible integration with other systems, accepting input prompts from sources such as object detectors or AR/VR headsets
- Output masks that can be used as inputs to other AI systems
- Multiple valid masks generated when a prompt is ambiguous
How to Use Segment Anything
- Try the demo to see how SAM segments objects in an image
- Prompt SAM with interactive points and boxes (see the sketch after this list)
- Automatically segment everything in an image
- Generate multiple valid masks when a prompt is ambiguous
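The interactive and automatic modes above map onto the official `segment_anything` Python package. Below is a minimal sketch, assuming the package is installed (`pip install segment-anything`), the ViT-H checkpoint `sam_vit_h_4b8939.pth` has been downloaded, and a local `photo.jpg` exists; the file names and click coordinates are placeholders.

```python
import cv2                      # pip install opencv-python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

# Load the ViT-H SAM checkpoint (path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")           # or "cpu"

# SamPredictor expects an HxWx3 uint8 RGB array.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

# Interactive prompting: one foreground click at pixel (x=500, y=375).
predictor = SamPredictor(sam)
predictor.set_image(image)      # runs the heavy image encoder once
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,       # return several candidate masks for an ambiguous click
)
print(masks.shape, scores)       # e.g. (3, H, W) boolean masks with quality scores

# "Segment everything" mode: no prompts, masks for the whole image.
mask_generator = SamAutomaticMaskGenerator(sam)
all_masks = mask_generator.generate(image)
print(len(all_masks), "masks found")
```

With `multimask_output=True`, SAM returns several candidate masks along with predicted quality scores, which is how ambiguous prompts are handled in practice.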
Price
- The model is currently available free of charge for research purposes; pricing for commercial use has not been disclosed.
Helpful Tips
- SAM can be used for a wide range of applications, including image editing, object tracking, and creative tasks like collaging.
- The model can be integrated with other AI systems to enable more complex tasks.
- The SA-1B dataset used to train SAM is available for download and includes over 1.1 billion segmentation masks collected on 11 million licensed and privacy-preserving images (a sketch for reading its annotations follows this list)
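If you download SA-1B, the masks are stored in COCO run-length encoding (RLE) alongside each image. The sketch below assumes a per-image JSON file with an `annotations` list, as described in the dataset documentation; the file path and key names are assumptions, so adjust them to what you actually receive.

```python
import json
from pycocotools import mask as mask_utils  # pip install pycocotools

# Path is a placeholder for one of the per-image annotation files in SA-1B.
with open("sa_000000/sa_1.json") as f:
    record = json.load(f)

for ann in record["annotations"]:
    rle = ann["segmentation"]              # COCO RLE: {"size": [h, w], "counts": ...}
    binary_mask = mask_utils.decode(rle)   # HxW uint8 array of 0s and 1s
    print(binary_mask.shape, int(binary_mask.sum()), "mask pixels")
```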
Frequently Asked Questions
What types of prompts are supported?
- Foreground and background points
- Bounding box (passing box and mask prompts programmatically is sketched after this list)
- Mask
- Text prompts (explored in the paper, but this capability is not currently released)
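Box and mask prompts go through the same `SamPredictor.predict` call as point prompts. A minimal sketch, continuing the `predictor`, `scores`, and `logits` from the earlier example; the box and click coordinates are placeholders.

```python
import numpy as np

# Bounding-box prompt in XYXY pixel coordinates.
box = np.array([100, 150, 600, 500])
box_masks, _, _ = predictor.predict(box=box, multimask_output=False)

# Mask prompt: feed back the low-resolution logits of the best previous
# prediction to refine it together with an extra background click.
best = int(np.argmax(scores))
refined, _, _ = predictor.predict(
    point_coords=np.array([[500, 375], [520, 200]]),
    point_labels=np.array([1, 0]),        # second click marks background
    mask_input=logits[best][None, :, :],  # shape (1, 256, 256)
    multimask_output=False,
)
```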
What is the structure of the model?
- A ViT-H image encoder that runs once per image and outputs an image embedding
- A prompt encoder that embeds input prompts such as clicks or boxes
- A lightweight transformer-based mask decoder that predicts object masks from the image embedding and prompt embeddings (see the sketch after this list)
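This split is what makes interactive use cheap: the heavy image encoder runs once per image, while the prompt encoder and mask decoder run once per prompt against the cached embedding. A sketch of that flow, continuing the earlier setup (click coordinates are placeholders):

```python
# set_image() runs the ViT-H image encoder exactly once and caches
# a (1, 256, 64, 64) image embedding.
predictor.set_image(image)
embedding = predictor.get_image_embedding()
print(embedding.shape)  # torch.Size([1, 256, 64, 64])

# Each predict() call only runs the small prompt encoder and mask decoder
# against the cached embedding, so new clicks respond almost immediately.
for click in [[200, 300], [450, 120], [640, 480]]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```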
What platforms does the model use?
- PyTorch for the image encoder
- ONNX Runtime for the prompt encoder and mask decoder, which can run on CPU or GPU across a variety of platforms (see the sketch after this list)
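The repository's `scripts/export_onnx_model.py` exports the prompt encoder and mask decoder as a single ONNX model that ONNX Runtime can execute, including in the browser. The sketch below assumes a decoder exported with that script and continues the earlier `predictor`/`image` setup; the input names follow the repo's ONNX example and should be treated as assumptions if your export differs.

```python
import numpy as np
import onnxruntime  # pip install onnxruntime

# Exported decoder (path is a placeholder); the image encoder still runs in
# PyTorch and produces the embedding fed in below.
session = onnxruntime.InferenceSession("sam_onnx_decoder.onnx")
embedding = predictor.get_image_embedding().cpu().numpy()

# One foreground click plus a padding point with label -1, transformed into
# the resized-image coordinate frame the exported model expects.
coords = np.array([[[500, 375], [0, 0]]], dtype=np.float32)
labels = np.array([[1, -1]], dtype=np.float32)
coords = predictor.transform.apply_coords(coords, image.shape[:2]).astype(np.float32)

masks, iou_preds, low_res_logits = session.run(None, {
    "image_embeddings": embedding,
    "point_coords": coords,
    "point_labels": labels,
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
    "orig_im_size": np.array(image.shape[:2], dtype=np.float32),
})
binary_masks = masks > 0.0  # logits above the 0.0 threshold are inside the mask
```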
How big is the model?
- The image encoder has 632M parameters
- The prompt encoder and mask decoder have 4M parameters (a quick way to verify these counts is sketched after this list)
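These counts can be checked directly on a loaded model, since the `Sam` module exposes its components as `image_encoder`, `prompt_encoder`, and `mask_decoder` attributes. A quick sketch, assuming `sam` is the ViT-H model loaded earlier:

```python
def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(f"image encoder:  {count_params(sam.image_encoder) / 1e6:.0f}M parameters")
print(f"prompt encoder: {count_params(sam.prompt_encoder) / 1e6:.2f}M parameters")
print(f"mask decoder:   {count_params(sam.mask_decoder) / 1e6:.2f}M parameters")
```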
How long does inference take?
- The image encoder takes ~0.15 seconds on an NVIDIA A100 GPU
- The prompt encoder and mask decoder take ~50ms on CPU in the browser using multithreaded SIMD execution (a way to time the PyTorch path on your own hardware is sketched below)
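To get rough numbers for the PyTorch path on your own hardware, time the two stages separately. A simple sketch, continuing the earlier `predictor`/`image` setup; it measures the Python API rather than the in-browser ONNX path quoted above.

```python
import time
import numpy as np
import torch

# Synchronize before reading the clock so GPU work is fully counted.
sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)

sync(); t0 = time.perf_counter()
predictor.set_image(image)              # image encoder
sync(); t1 = time.perf_counter()
masks, scores, _ = predictor.predict(   # prompt encoder + mask decoder
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
sync(); t2 = time.perf_counter()

print(f"encoder: {t1 - t0:.3f} s, decoder: {(t2 - t1) * 1000:.1f} ms")
```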