Trobz has been around for more than 15 years. That’s 15 years of company trips, team events, client photoshoots, and office moments — all photographed, and most of them sitting in various Google Drive folders that nobody remembers the name of.
The problem: whenever we need a photo for a blog post or the website, we spend 15–20 minutes browsing through nested folders, usually settle for something mediocre, and occasionally give up and use Unsplash instead. That felt wrong. To get a sense of the scale: after indexing five of our Drive folders, we had 8,473 files — 7,826 unique photos after removing duplicates (Google Drive’s “Make a copy” feature had created hundreds of identical files under different names). We sampled a few dozen of those files to check average size — roughly 5.6 MB per photo — which puts the estimated archive at around 47 GB for those folders alone. We have real Trobz photos, we just can’t find the good ones quickly. And the archive keeps growing with every new event.
The goal was to fix that: build a tool that lets anyone on the team describe what they’re looking for, and get back the most relevant and visually appealing photos from our own archive.
We found an open-source project that had the right idea: local-image-search, an MCP server that lets Claude search images by natural language. Type a description, get file paths back. The concept was exactly what we needed. The implementation had several problems we had to solve before it was usable for our setup.
What Didn’t Work Out of the Box
The original project used MLX + CLIP for embeddings — a stack that only runs on Apple Silicon Macs. Our machines run Linux. The MLX dependency doesn’t install on non-Apple hardware, full stop.
Beyond the platform problem:
- No Google Drive support — the tool only scanned local directories. Our photos live in Drive, not on anyone’s laptop.
- MCP startup froze Claude — the server loaded the 800MB model at launch, causing Claude to hang for 30–60 seconds every time it started. Users noticed and uninstalled.
- No quality signal — results were ranked purely by semantic similarity. A blurry candid could outrank a professional event photo if the description matched slightly better.
- Single-folder only — no way to index multiple Drive folders independently without duplicate entries.
We fixed these in two PRs, working on a standard developer laptop: Intel Core i7-1260P (12 cores, no GPU, 32GB RAM). All timing numbers below reflect this setup — no GPU acceleration.
PR #1 — Making It Run on Linux
feat: replace MLX/CLIP with SigLIP for cross-platform support
The fix was a model swap: out with MLX/CLIP, in with HuggingFace SigLIP (google/siglip-so400m-patch14-384). SigLIP runs anywhere — Linux, macOS, Windows — on CPU, CUDA, or Apple Silicon MPS.
SigLIP’s role is to generate embedding vectors — compact numerical representations (1,152 numbers) that capture the visual meaning of an image or the semantic meaning of a text query. Two things that mean the same thing end up with similar vectors, even if they share no words. That’s what makes natural language search work: the vector for “people in a meeting room” lands close to the vectors of actual meeting room photos.
We chose SigLIP over a standard CLIP port for two reasons beyond platform support:
Better fine-grained matching. CLIP uses softmax loss, which compares each image against all others in the training batch. SigLIP uses sigmoid loss, evaluating each image-text pair independently. In practice this means queries like “person wearing a watch” or “two people reviewing a document” return more accurate results — SigLIP doesn’t get confused by similar-looking images in the same batch.
Larger embedding dimension. 1,152 vs 512 for CLIP base. More dimensions means more expressive representations, which helps with subtle visual differences.
# Before (macOS only)
import mlx.core as mx
from mlx_clip import clip

# After (Linux, macOS, Windows — CPU or GPU)
import torch
from transformers import AutoModel, AutoProcessor

MODEL_NAME = "google/siglip-so400m-patch14-384"
EMBED_DIM = 1152

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(MODEL_NAME).to(device).eval()
processor = AutoProcessor.from_pretrained(MODEL_NAME)
The model downloads from HuggingFace automatically on first use. No manual conversion step, no Apple Silicon required.
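To make the similarity mechanics concrete, here’s a minimal sketch of how an image and a text query end up comparable. The get_image_features/get_text_features calls are the standard transformers SigLIP API; the explicit normalization step is our choice:

# Image side: one 1,152-dim vector per photo
img_inputs = processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    img_vec = model.get_image_features(**img_inputs)

# Text side: a vector in the same space
txt_inputs = processor(text=["people in a meeting room"],
                       padding="max_length", return_tensors="pt").to(device)
with torch.no_grad():
    txt_vec = model.get_text_features(**txt_inputs)

# Cosine similarity of normalized vectors = the relevance score used later
img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)
txt_vec = txt_vec / txt_vec.norm(dim=-1, keepdim=True)
relevance = (img_vec @ txt_vec.T).item()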
PR #2 — Google Drive, Quality Ranking, and a Smarter MCP
feat: add Google Drive folder indexing support
With Linux support working, we moved to the main challenge: indexing our company photo archive from Google Drive and making results actually useful for content work.
Connecting to Google Drive
Drive images are downloaded to memory via the Google Drive API — nothing is written to disk. OAuth2 runs once on first use and caches a refresh token. Folder traversal is recursive: all images at any subfolder depth are indexed automatically. Video files (.mp4) and metadata (.xml) are skipped.
from io import BytesIO
from googleapiclient.http import MediaIoBaseDownload
from PIL import Image

request = service.files().get_media(fileId=file_id)
buf = BytesIO()
downloader = MediaIoBaseDownload(buf, request)
done = False
while not done:
    _, done = downloader.next_chunk()  # stream chunks into memory
buf.seek(0)  # rewind before decoding
image = Image.open(buf).convert("RGB")
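The recursive traversal is a straightforward walk over the Drive API’s files().list endpoint. A sketch under our assumptions (the IMAGE_MIMES filter and the iter_drive_images helper name are ours; the query syntax is standard Drive API v3):

IMAGE_MIMES = {"image/jpeg", "image/png", "image/webp"}

def iter_drive_images(service, folder_id):
    # Yields metadata for every image at any depth under folder_id
    stack = [folder_id]
    while stack:
        parent, page_token = stack.pop(), None
        while True:
            resp = service.files().list(
                q=f"'{parent}' in parents and trashed = false",
                fields="nextPageToken, files(id, name, mimeType, md5Checksum)",
                pageToken=page_token,
            ).execute()
            for f in resp.get("files", []):
                if f["mimeType"] == "application/vnd.google-apps.folder":
                    stack.append(f["id"])  # descend into subfolder
                elif f["mimeType"] in IMAGE_MIMES:  # skips .mp4, .xml, ...
                    yield f
            page_token = resp.get("nextPageToken")
            if page_token is None:
                break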
Each embedding is stored in LanceDB — an open-source columnar vector database designed for ML workloads. It stores the embedding vectors alongside metadata (filename, Drive URL, aesthetic score) in a single file on disk, with no separate database server to manage. Each indexed Drive image gets a drive_url column — search results return a direct clickable link to the original file.
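A sketch of the storage layer, assuming LanceDB’s Python API (the table and column names here are illustrative):

import lancedb

db = lancedb.connect("./photo_index")  # a directory on disk, no server process
table = db.create_table("photos", data=[{
    "vector": embedding.tolist(),  # 1,152-dim SigLIP vector
    "filename": "DATG7197.JPG",
    "drive_url": f"https://drive.google.com/file/d/{file_id}/view",
    "aesthetic_score": 0.0,  # filled in by a later scoring pass
}])

# At query time: embed the text, then nearest-neighbor search
hits = table.search(query_vector).limit(10).to_list()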
Indexing Time: CPU Reality Check
We indexed five Drive folders across separate sessions on our i7-1260P laptop with no GPU:
| Folder | Images | Time (CPU, no GPU) |
|---|---|---|
| Photo & Video Source | 505 | ~50 min |
| Hinh anh của Trobz | 1,582 | ~2.5 h |
| ALL STAFF | 43 | ~5 min |
| Hinh Anh | 4,179 | ~5 h |
| Photoshoot 10.2024 | 2,164 | ~3 h |
| Total | 8,473 | ~11.5 h |
With a CUDA GPU, the same workload would take roughly 30–60 minutes — about a 10x speedup. If you’re indexing a large archive for the first time, plan accordingly.
The incremental design mitigates this: subsequent runs only process new or changed files. After the initial index, re-syncing a folder with a few new photos takes seconds.
The Deduplication Bug — Two Layers
Layer 1: same file, different folder. When we added a second Drive folder, we discovered 43 duplicate entries in the database. The root cause: our skip logic checked whether the file’s drive_folder_id matched the current folder — so a file already indexed under folder A was re-indexed when we added folder B, creating two identical embeddings with different folder IDs.
The fix: skip based on the file’s unique path (Drive file ID), not the folder ID. A file already in the DB is always skipped, regardless of which folder we’re currently indexing.
Layer 2: same content, different file. After finishing the full index, search results started surfacing near-identical pairs: “DATG7197.JPG” and “Copy of DATG7197.JPG” appearing side by side with identical relevance and aesthetic scores. Drive’s “Make a copy” feature creates a new file with a new ID — so our path-based dedup doesn’t catch it. Across five folders, this pattern produced 647 duplicate entries out of 8,473 indexed files.
The fix: use the md5Checksum field from the Drive API. If a file’s checksum already exists in the DB, skip it regardless of its file ID or name. We also ran a one-time cleanup pass over the existing DB to remove the duplicates already present, bringing the index down to 7,826 unique photos.
# Drive API returns md5Checksum for each file
meta = service.files().get(fileId=file_id, fields="id,md5Checksum").execute()
md5 = meta.get("md5Checksum")

# Skip during indexing if content already exists
if md5 and md5 in existing_md5s:
    continue
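The one-time cleanup over the existing DB applied the same rule retroactively. A sketch assuming LanceDB’s to_pandas() and delete() methods; the md5 and file_id column names are from our schema:

seen, duplicate_ids = set(), []
for row in table.to_pandas().to_dict("records"):
    if row["md5"] in seen:
        duplicate_ids.append(row["file_id"])  # a later copy of known content
    else:
        seen.add(row["md5"])

# Delete all duplicates in one pass with a SQL-style predicate
if duplicate_ids:
    id_list = ", ".join(f"'{i}'" for i in duplicate_ids)
    table.delete(f"file_id IN ({id_list})")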
Multi-Folder Metadata Without Re-Embedding
After fixing deduplication, we added a drive_folder_id column to track which folder each image came from. The problem: thousands of existing entries had this column as None.
Re-downloading and re-embedding all those Drive images just to write a metadata field would have taken another 8 hours. Instead, we implemented a three-way classification per file:
| Case | Action | Time |
|---|---|---|
| File not in DB | Download + embed | ~6s per image |
| File in DB, drive_folder_id is null | Metadata update only | ~100ms for 500 files |
| File in DB, folder ID correct | Skip | 0ms |
Backfilling thousands of entries took under a second. The same pattern applies to any new column added later.
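A sketch of the three-way dispatch, reusing the iter_drive_images helper from earlier; download_embed_and_insert and rows_by_id are hypothetical stand-ins for our pipeline, and LanceDB’s update takes a SQL-style predicate:

for f in iter_drive_images(service, folder_id):
    row = rows_by_id.get(f["id"])  # existing DB rows keyed by Drive file ID
    if row is None:
        download_embed_and_insert(f)  # ~6s: full download + embed
    elif row["drive_folder_id"] is None:
        table.update(  # ~ms: write metadata, keep the existing embedding
            where=f"file_id = '{f['id']}'",
            values={"drive_folder_id": folder_id},
        )
    # else: already indexed with the correct folder ID, skip entirely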
Fixing the MCP Startup Freeze
The original server loaded SigLIP at startup. With no GPU, this meant ~30–60 seconds of heavy CPU usage right as Claude was initializing — the two processes competed for the same resources, and Claude’s UI froze.
The solution was lazy load + idle unload:
Claude starts: ~50 MB (reads LanceDB index only)
First search: ~900 MB (SigLIP loads on demand, ~5-10s)
Idle > 5 min: ~50 MB (model freed automatically)
The model loads only when search_images is actually called. After 5 minutes of inactivity, it’s freed. A background re-indexing loop also waits 2 minutes after startup before its first run, so it doesn’t compete with Claude’s initialization.
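A sketch of the lazy load + idle unload pattern, assuming a lock-protected singleton and a reaper thread (names and intervals are illustrative):

import gc, threading, time

_model, _last_used, _lock = None, 0.0, threading.Lock()
IDLE_SECONDS = 300  # 5 minutes

def get_model():
    global _model, _last_used
    with _lock:
        if _model is None:  # first search pays the ~5–10s load, not startup
            _model = AutoModel.from_pretrained(MODEL_NAME).to(device).eval()
        _last_used = time.time()
        return _model

def _idle_reaper():
    global _model
    while True:
        time.sleep(30)
        with _lock:
            if _model is not None and time.time() - _last_used > IDLE_SECONDS:
                _model = None  # drop the only reference...
                gc.collect()   # ...and return ~800MB to the OS

threading.Thread(target=_idle_reaper, daemon=True).start()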
One non-obvious gotcha: stdout is reserved for the MCP JSON-RPC protocol. Any print() statement in the server corrupts the communication channel. All logging must go to stderr. We lost 20 minutes to this before realizing why responses were malformed.
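The fix is one line of logging configuration, so nothing ever reaches stdout:

import logging, sys

# stdout belongs to JSON-RPC; route every log line to stderr
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logging.info("index loaded")  # safe, unlike print()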
Quality Ranking: Finding the Best, Not Just the Relevant
Finding a photo that matches a description is only half the problem. From the start, the real goal was to surface photos worth publishing — not just semantically relevant ones.
“Priority on the best ones — sort pictures based on quality / likeability”
Semantic search doesn’t capture this. A query for “team meeting” returns semantically relevant photos — but the top results might be slightly blurry candid shots that happened to match the description better than a well-lit professional photo taken at the same event.
For blog posts and marketing materials, you want the photo that looks good, not just the one that matches the text.
Aesthetic Scoring
We added cafeai/cafe_aesthetic, a ViT classifier that scores images on a 0–1 aesthetic quality scale. Running it across our archive produced a clear pattern:

The distribution is strongly right-skewed: 76% of photos score above 0.7, meaning the model considers the vast majority of our archive to be high quality. This reflects how the archive was built — most photos come from professional event photographers, not casual phone snapshots. The tail of low-scoring images (the 7% below 0.4) is made up of the candid shots, blurry group selfies, and poorly-lit captures that ended up in the same folders as the professional work.
What the scores look like in practice:
Low scores (below 0.4) — 7%
Blurry candid shots, poor lighting, unflattering angles. Technically captures the moment, unusable for content.
High scores (above 0.7) — 76%
Professional event photography. Good lighting, sharp focus, subjects aware of the camera. Ready to publish.
Running the scorer on our full archive took ~12 hours on our CPU — Drive images had to be re-downloaded for scoring since we don’t store them locally. With a GPU this would be roughly 30–60 minutes. To avoid losing progress to a crash or network error, we checkpoint to the DB every 500 images. Restarting after an interruption skips already-scored images.
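A sketch of the scoring loop with checkpointing, assuming the transformers image-classification pipeline and hypothetical helpers (download_drive_image, flush_scores_to_db, unscored_rows); cafe_aesthetic reports an “aesthetic” label whose confidence we take as the 0–1 score:

from transformers import pipeline

scorer = pipeline("image-classification", model="cafeai/cafe_aesthetic")

pending = []
for row in unscored_rows:  # rows without an aesthetic_score yet
    image = download_drive_image(row["file_id"])  # re-download, nothing stored locally
    results = scorer(image, top_k=2)
    score = next(r["score"] for r in results if r["label"] == "aesthetic")
    pending.append((row["file_id"], score))
    if len(pending) >= 500:  # checkpoint: survive crashes and network errors
        flush_scores_to_db(pending)
        pending.clear()
flush_scores_to_db(pending)  # final partial batch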
Three Search Modes
search_images now supports three ranking strategies via a sort_by parameter:
relevance — classic semantic search. Ranks by SigLIP cosine similarity to the query. Good when you need a specific type of image and quality is secondary.
quality — filters by a minimum relevance threshold, then ranks by aesthetic score. This is what our colleague wanted: filter by topic, rank by beauty.
# Without min_relevance: returns beautiful photos unrelated to your query
# (portraits, landscapes, anything with high aesthetic score)
# With min_relevance=0.08: only on-topic images, ranked by how good they look
if relevance < min_relevance:
    continue
final_score = aesthetic_score
combined — weighted blend of both dimensions:
final_score = (1 - quality_weight) * relevance + quality_weight * aesthetic
In practice, sort_by="combined", quality_weight=0.6 gives the best results for content work.
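Putting the three modes side by side, the ranking step reduces to a few lines (a sketch; the parameter names mirror the sort_by, quality_weight, and min_relevance described above):

def rank(hits, sort_by="relevance", quality_weight=0.6, min_relevance=0.08):
    if sort_by == "relevance":
        key = lambda h: h["relevance"]
    elif sort_by == "quality":
        hits = [h for h in hits if h["relevance"] >= min_relevance]  # topic gate
        key = lambda h: h["aesthetic"]
    else:  # combined: weighted blend of both axes
        key = lambda h: (1 - quality_weight) * h["relevance"] + quality_weight * h["aesthetic"]
    return sorted(hits, key=key, reverse=True)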
The Difference in Practice
Same query — “team collaboration meeting” — with different modes:
| Mode | Top result | Relevance | Aesthetic |
|---|---|---|---|
| relevance | trobzvideo19-30.JPG | 0.149 | 0.966 |
| quality + min_relevance=0.08 | AMAD6877.jpg | 0.082 | 0.993 |
| combined weight=0.7 | AMAD6909.jpg | 0.132 | 0.985 |
In Claude, you just ask naturally:
“Find me 5 high-quality photos of people collaborating for a blog post”
Claude calls search_images with the right parameters and returns direct Drive links. The whole thing takes under 10 seconds.
Web UI for the Whole Team
The MCP server works well when we’re working with our agents. But for other contexts we needed a browser-based interface that anyone could open.
Building the search UI itself was straightforward — a FastAPI server on top of the same LanceDB index and SigLIP model. The harder question was how to display the images.
Our photos live in Google Drive. Not everyone on the team has access to every folder, and sharing the folders publicly wasn’t something we wanted to do. We looked at three options:
| Approach | Pros | Cons |
|---|---|---|
| Make files public ("anyone with the link") | Simple, direct Drive URLs work | Changes permissions on thousands of files |
| Cache images on server | Fast, no Drive round-trip per request | ~15–20 GB disk, needs sync logic |
| Proxy through server | No permission changes, no local cache | ~2–3s latency per image request |
We went with the proxy approach. The server already has OAuth2 credentials from the indexing step — it uses those same credentials to download images from Drive on demand, resize them, and stream them back to the browser. Visitors never need a Google account. Nothing is stored locally.
# GET /image/<file_id>?size=400
request_obj = service.files().get_media(fileId=file_id)
buf = BytesIO()
downloader = MediaIoBaseDownload(buf, request_obj)
done = False
while not done:
    _, done = downloader.next_chunk()  # full download, not just the first chunk
buf.seek(0)
img = Image.open(buf).convert("RGB")
img.thumbnail((size, size), Image.LANCZOS)
out = BytesIO()
img.save(out, format="JPEG", quality=85)
# stream out.getvalue() back as image/jpeg
The latency is acceptable for a browsing use case — thumbnails load in about a second, which is fast enough when you’re scanning a grid of results. Clicking a card opens a full-size preview in a lightbox that loads the higher-resolution version in the background.


What We Learned
Relevance and quality are orthogonal dimensions. A photo can perfectly match a description and still be unusable. Treating them as separate axes — and letting the caller choose how to combine them — is more useful than any single score.
min_relevance is mandatory in quality mode. Without a relevance floor, the quality ranker finds your archive’s most beautiful photo regardless of topic. A threshold of 0.08–0.10 acts as a topic gate; everything below it is filtered before quality ranking takes over. We discovered this the hard way when quality mode returned portrait headshots for a query about team meetings.
Metadata backfills are underrated. Adding a column to a large database sounds trivial. When each row requires re-downloading and re-processing a file, it’s an 8-hour job. Separating “needs embed” from “needs metadata update” turned hours into milliseconds.
Stdout is sacred in MCP servers. The MCP protocol uses stdout for JSON-RPC. One print() statement in the wrong place corrupts the entire message stream. Log to stderr or stay silent.
Lazy model loading is non-negotiable. An 800MB model loaded at startup turns your MCP server into a liability. Users feel the freeze and remove it. Load on first use, unload on idle — the server disappears from RAM when not needed.
Deduplication needs two layers for Drive archives. Path-based dedup (skip if file ID exists) handles the obvious case. It misses the subtler one: Google Drive’s “Make a copy” creates a new file ID for identical content. At scale, this silently inflates your index — we had 647 ghost entries (8% of our archive) before catching it. md5Checksum from the Drive API is the right key: same checksum means same pixels, regardless of filename or location.
Try It
If your team’s photos live in Google Drive, setup takes about 10 minutes: create a Google Cloud project, enable the Drive API, run OAuth2 once, and point the indexer at your folders. After that, finding the right photo for a blog post takes a natural language query instead of 20 minutes of browsing.
# Local photos — add to Claude Code
claude mcp add local-image-search -- uvx local-image-search ~/Pictures
# Google Drive photos
uv run --extra drive python embed.py --drive-folder <folder-url>
uv run --extra drive python embed.py --add-aesthetic-scores
uv run --extra drive python server.py
Full source, setup guide, and both PRs: github.com/Eventual-Inc/local-image-search