Segment Anything | Meta AI

Meta AI Computer Vision Research

Introduction

What is Segment Anything?

Segment Anything is a Meta AI research project that introduces the Segment Anything Model (SAM), which can "cut out" any object in an image with a single click. SAM is a promptable segmentation system that can generalize to unfamiliar objects and images without additional training.

Features of Segment Anything
  • Zero-shot generalization to unfamiliar objects and images
  • Promptable design enables flexible integration with other systems
  • Extensible outputs: output masks can be used as inputs to other AI systems
  • Can take input prompts from other systems, such as object detectors or AR/VR headsets
  • Can generate multiple valid masks for ambiguous prompts

How to Use Segment Anything
  • Try the demo to see how SAM can segment objects in an image
  • Use interactive points and boxes to prompt SAM (see the sketch after this list)
  • Automatically segment everything in an image
  • Generate multiple valid masks for ambiguous prompts
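
A minimal sketch of both workflows, assuming the official segment-anything Python package, OpenCV for image loading, and hypothetical local paths for the image and the released ViT-H checkpoint:

    import cv2
    import numpy as np
    from segment_anything import SamAutomaticMaskGenerator, SamPredictor, sam_model_registry

    # Hypothetical local paths; the checkpoint is the publicly released ViT-H weights.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

    # Interactive prompting: one foreground click, several candidate masks back.
    predictor = SamPredictor(sam)
    predictor.set_image(image)                # heavy image encoding, done once per image
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[500, 375]]),  # (x, y) pixel location of the click
        point_labels=np.array([1]),           # 1 = foreground, 0 = background
        multimask_output=True,                # return multiple valid masks for an ambiguous click
    )

    # Automatic mode: segment everything in the image without any prompt.
    mask_generator = SamAutomaticMaskGenerator(sam)
    all_masks = mask_generator.generate(image)  # list of dicts with "segmentation", "area", ...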

Price
  • The model is currently available for research use; pricing for commercial use has not been disclosed.

Helpful Tips
  • SAM can be used for a wide range of applications, including image editing, object tracking, and creative tasks like collaging.
  • The model can be integrated with other AI systems to enable more complex tasks.
  • The SA-1B dataset used to train SAM is available for download and includes more than 1.1 billion segmentation masks collected on about 11 million licensed, privacy-preserving images.

Frequently Asked Questions

What types of prompts are supported?

  • Foreground/background points
  • Bounding box (see the prompting sketch after this list)
  • Mask
  • Text prompts (not currently released)
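
A sketch of the released prompt types, continuing the setup from the usage sketch above (the predictor already holds an encoded image); the coordinates and the refinement step are illustrative:

    import numpy as np

    # Box prompt: XYXY pixel coordinates of a rough box around the object.
    masks, scores, low_res_logits = predictor.predict(
        box=np.array([100, 100, 400, 380]),
        multimask_output=False,
    )

    # Mask prompt: feed the best low-resolution logits back in, together with an
    # extra foreground click, to refine the previous prediction.
    refined_masks, _, _ = predictor.predict(
        point_coords=np.array([[250, 240]]),
        point_labels=np.array([1]),
        mask_input=low_res_logits[np.argmax(scores)][None, :, :],  # shape (1, 256, 256)
        multimask_output=False,
    )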

What is the structure of the model?

  • A ViT-H image encoder that runs once per image and outputs an image embedding
  • A prompt encoder that embeds input prompts such as clicks or boxes
  • A lightweight transformer-based mask decoder that predicts object masks from the image embedding and prompt embeddings (see the sketch after this list)
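
These three components are exposed as submodules of the loaded model, so the split can be inspected directly; a sketch assuming the segment-anything package and a hypothetical local checkpoint path:

    from segment_anything import sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

    for name, module in [("image encoder", sam.image_encoder),
                         ("prompt encoder", sam.prompt_encoder),
                         ("mask decoder", sam.mask_decoder)]:
        n_params = sum(p.numel() for p in module.parameters())
        print(f"{name}: {n_params / 1e6:.1f}M parameters")

    # The expensive image encoder runs once per image (SamPredictor.set_image);
    # each new click or box only re-runs the small prompt encoder and mask decoder.

For the ViT-H checkpoint this should roughly match the parameter counts quoted in the model-size question below.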

What platforms does the model use?

  • PyTorch for the image encoder
  • ONNX Runtime for the prompt encoder and mask decoder, which can run on CPU or GPU across a variety of platforms (see the sketch after this list)
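
A sketch of loading a decoder exported with the repository's ONNX export script and checking what it expects; the file name is hypothetical, and the exact input names should be read from the exported model rather than assumed:

    import onnxruntime as ort

    # Hypothetical path to a prompt-encoder + mask-decoder model produced by the
    # export script in the segment-anything repository.
    session = ort.InferenceSession("sam_decoder.onnx", providers=["CPUExecutionProvider"])

    # Inspect the inputs the exported decoder expects (image embedding, prompt
    # coordinates and labels, optional mask input, original image size, ...).
    for inp in session.get_inputs():
        print(inp.name, inp.shape, inp.type)

    # At inference time the image embedding still comes from the PyTorch image
    # encoder (e.g. SamPredictor.get_image_embedding()); session.run(None, feed)
    # then returns masks for each new prompt on CPU or GPU.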

How big is the model?

  • The image encoder has 632M parameters
  • The prompt encoder and mask decoder have 4M parameters

How long does inference take?

  • The image encoder takes ~0.15 seconds on an NVIDIA A100 GPU
  • The prompt encoder and mask decoder take ~50 ms on CPU in the browser using multithreaded SIMD execution (a timing sketch follows this list)
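
These numbers depend on hardware, and the sketch below times the PyTorch decoder rather than the in-browser ONNX path, so it is only indicative. It continues the usage sketch above; move the model with sam.to("cuda") first to time the encoder on a GPU:

    import time

    import numpy as np
    import torch

    def timed(fn):
        # Synchronize around the call so asynchronous GPU work is actually counted.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        out = fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return out, time.perf_counter() - start

    _, encode_s = timed(lambda: predictor.set_image(image))  # heavy ViT-H image encoder
    _, decode_s = timed(lambda: predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    ))
    print(f"image encoding: {encode_s:.3f} s, prompt decoding: {decode_s * 1000:.1f} ms")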
