June 15, 2026
Find Any Product on Retail Shelf Quickly by Name: Grounding DINO + EasyOCR
Open-vocabulary retail shelf detection tool that finds products by text query using Grounding DINO zero-shot object detection and EasyOCR text extraction. No training required—just describe what you're looking for.
Locate Any Product on a Retail Shelf — Instantly
Ever needed to verify product placement on packed retail shelves? Retail Query uses state-of-the-art zero-shot object detection (Grounding DINO) combined with OCR (EasyOCR) to find any product by name—no model training required. Just query by text.
This tool is designed for retail operations, supply chain verification, and shelf analytics. It handles dense shelf images with thousands of products and detects items even when text is small, partially occluded, or at angles.
GitHub: tech-microcosm/retail-query
How It Works
The detection pipeline combines two complementary approaches:
1. Grounding DINO: Visual Object Detection
Grounding DINO is a transformer-based zero-shot object detector that understands natural language queries. Unlike traditional object detectors that need retraining for each product category, Grounding DINO can detect objects based on text descriptions—no training data required.
The model encodes both the shelf image and the query text (e.g., “shrimp snacks”) into a shared embedding space, then localizes regions where the text semantics match the visual features.
2. EasyOCR: Text Extraction & Matching
EasyOCR extracts text from detected bounding boxes and the full image. The pipeline uses multi-pass preprocessing (CLAHE, unsharp masking, color inversion) to handle challenging text:
- Small fonts on product labels
- Curved surfaces (jars, bottles)
- Low contrast (dark text on dark backgrounds)
- Specular reflections
3. Fusion Scoring
Detections are scored using a weighted combination:
combined_score = 0.75 × detection_confidence + 0.25 × text_similarity
This balances visual evidence (the detector is confident an object is present) with textual confirmation (OCR reads matching text). High-confidence matches (≥0.70) are highlighted with thick green boxes; moderate matches (0.35–0.70) fall back to orange.
4. SAHI Slicing (Optional)
For very large images (>6000px), the tool can use Slicing Aided Hyper Inference (SAHI) to tile the image into overlapping patches, run detection on each patch, then merge results. This preserves detection accuracy on small objects at high resolution.
Quick Start
Requirements
- CUDA-capable GPU (12GB+ VRAM recommended)
- Python 3.12
- Linux or WSL
Installation
git clone https://github.com/tech-microcosm/retail-query.git
cd retail-query
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Run a Query
python query_shelf.py \
--image test_images/shelf_image5.jpg \
--query "arabic coffee" \
--output result_arabic.jpg
The tool will:
- Load the image
- Run Grounding DINO detection + OCR-first scan in parallel
- Refine detections with multi-pass OCR preprocessing
- Draw annotated bounding boxes
- Save the result image and a zoomed crop (if high-confidence detections exist)
Demo Examples
Example 1: Finding “Arabic Coffee” in a Dense Coffee Shelf
Query: "arabic"
Image: High-resolution shelf with hundreds of coffee bags
The tool correctly identifies “100% Arabica” coffee packages in the top-left corner, even though the text “Arabica” is partially obscured by shelf reflections.
Raw shelf image:

Detection result (green box = high confidence ≥0.70):

The green box marks the region where OCR extracted “AR” and “Arabita” text that semantically matches “arabic” with 0.85 similarity. Only one high-confidence detection was found, so orange boxes (moderate matches) are suppressed.
Example 2: Finding “Maximo Delight” Snacks (Small Text)
Query: "maximo delight"
Image: Snack shelf with 8000×6000px resolution
This is a challenging case—the “MAXIMO DELIGHT” packages are small (occupying ~100×80 pixels in the original image), tucked on the bottom shelf behind other products. The algorithm:
- Runs OCR across the full image
- Finds exact text match “MAXIMO DELIGHT” on package labels
- Expands OCR bounding boxes to capture the full package
- Scores at 0.86 (high confidence) due to perfect text match
Raw shelf image:

Detection result (6 green boxes):

Zoomed crop (auto-generated for blog use):

The zoomed crop shows the thick green outlines clearly marking the tiny “MAXIMO DELIGHT” packages on the bottom shelf. The boxes are scaled to image resolution—36px thick on this 8000px-wide image, ensuring visibility.
Example 3: Finding “Shrimp Snacks” Across Two Shelves
Query: "shrimp"
Image: Snack aisle with multiple brands
The detector finds 16 instances of “SHRIMP SNACKS” bags across two full shelves. OCR reads “SHRIMP” with 100% confidence (perfect text match), and Grounding DINO confirms the visual packaging. All detections score ≥0.79.
Raw shelf image:

Detection result (16 green boxes):

Zoomed crop:

The zoom crop shows the union of all high-confidence detections with padding—useful for verifying that the algorithm correctly distinguished “shrimp” snacks from neighboring “garlic” and “tempura” products.
Technical Details
Model Architecture:
- Grounding DINO:
IDEA-Research/grounding-dino-basevia Hugging Face Transformers - OCR Engine: EasyOCR with English language support
- Detection Thresholds:
det_threshold=0.35,text_threshold=0.25 - Inference: PyTorch FP32 (deformable attention doesn’t support FP16)
Performance:
- Typical Runtime: 30-50 seconds for 8000×6000px image on RTX 4050 (6GB VRAM)
- Accuracy: ~85% recall on small-text retail labels (compared to ~65% with OWL-ViT baseline)
Preprocessing Pipeline:
- LANCZOS upscaling if image width/height < 1400px
- CLAHE (Contrast Limited Adaptive Histogram Equalization)
- Unsharp masking for edge enhancement
- Color inversion for dark labels with light text
- LAB color space conversion for challenging contrast scenarios
Next Steps: From Desktop Tool to Universal Access
This tool currently requires:
- A desktop/laptop with CUDA GPU (12GB+ VRAM)
- Python environment setup
- Command-line usage
Future directions to make this accessible to everyone:
Web Application
Deploy the model to a cloud GPU backend (RunPod, Modal, AWS) and wrap it in a FastAPI server. Users upload an image via web interface, enter a query, and receive annotated results in seconds. No local GPU required.
Tech stack: FastAPI + React, model inference on A100 pods, results cached in S3.
Mobile Application
Fine-tune a distilled version of Grounding DINO (e.g., MobileDINO) to run on-device using CoreML (iOS) or TensorFlow Lite (Android). This enables real-time shelf scanning in retail stores without internet connectivity.
Challenges: Model size (the base model is 600MB), inference latency on mobile GPUs, battery consumption.
Progressive Enhancement
- API Gateway: Expose the tool as a REST API with rate limiting
- Batch Processing: Allow users to upload multiple images and queue jobs
- Custom Models: Fine-tune Grounding DINO on specific retail categories for higher accuracy
- Real-Time Video: Process video streams for shelf audits (identify misplaced products, out-of-stock items)
The core detection pipeline is production-ready. The main engineering effort is scaling the backend and optimizing the user experience for non-technical users.
Retail Query: Zero-shot shelf detection, powered by Grounding DINO and EasyOCR.
GitHub: tech-microcosm/retail-query