Skip to main content

RuVector Search System

Semantic search system for AI project discovery using vector embeddings and similarity matching.


Overview

RuVector provides build-time indexing and browser-based semantic search for the AI projects showcase. The system creates optimized JSON indices from YAML data that enable fast, relevant project discovery.

Architecture

Build Time (Node.js):
_data/ai_projects.yml → Indexer → assets/indices/projects-index.json
                                 ↓
                          Browser-ready index

Runtime (Browser):
User query → Search module → Ranked results

Phase 1: Build-Time Indexing (COMPLETE ✅)

Components

1. Project Indexer (/src/ruvector/indexing/project-indexer.js)

Core indexing module with the following capabilities:

Functions:

  • indexProjects(yamlPath, language) - Parse YAML and create search index
  • generateSearchableText(project) - Combine project metadata for embedding
  • extractMetadata(project) - Structure data for filtering and display
  • createSimpleEmbedding(text, dimensions) - Generate 128D vector from text
  • exportIndex(index, outputPath) - Write index to JSON file
  • loadIndex(jsonPath) - Load existing index
  • search(index, query, topK) - Semantic search with cosine similarity
  • cosineSimilarity(a, b) - Calculate vector similarity

Features:

  • Parses YAML project data
  • Generates searchable text combining: name, description, features, technologies
  • Creates 128-dimensional embeddings (character frequency-based)
  • Exports compact JSON indices
  • Supports bilingual content (English + Spanish)
  • Validates structure and handles errors gracefully

2. Build Script (/scripts/build-search-index.js)

Automated build-time index generation:

Usage:

# Manual build
npm run build:search-index

# Automatic (runs before build)
npm run build

Outputs:

  • /assets/indices/projects-index.json (English)
  • /assets/indices/projects-index-es.json (Spanish)

Performance:

  • 18 projects indexed in ~76ms
  • Output size: ~68 KB per language
  • Memory usage: < 10 MB

3. Test Script (/scripts/test-search.js)

Demonstrates search functionality:

Usage:

# Default test queries
node scripts/test-search.js

# Custom query
node scripts/test-search.js "spanish learning"
node scripts/test-search.js "react typescript educational"

Output:

  • Top 5 matching projects
  • Similarity scores (0-1)
  • Project metadata (category, status, technologies)

Index Structure

{
  "version": "1.0.0",
  "language": "en",
  "created": "2025-12-01T...",
  "executionTime": 76,
  "projectCount": 18,
  "projects": [
    {
      "id": "project-id",
      "vector": [0.123, 0.456, ...], // 128 dimensions
      "searchableText": "combined text for debugging",
      "metadata": {
        "name": "Project Name",
        "description": "Short description...",
        "category": "Educational",
        "status": "Active Development",
        "technologies": ["React", "TypeScript"],
        "github_url": "https://...",
        "demo_url": "https://...",
        "last_updated": "2025-11"
      }
    }
  ]
}

Data Flow

  1. Source Data: _data/ai_projects.yml (Jekyll data file)
  2. Processing:
    • Parse YAML with js-yaml
    • Extract searchable text (name + description + features + technologies)
    • Normalize text (lowercase, remove special chars)
    • Generate embeddings (128D vectors)
    • Extract metadata for filtering
  3. Output: Compact JSON index in assets/indices/
  4. Consumption: Browser loads JSON for client-side search

Search Algorithm

  1. Query Preprocessing:
    • Normalize query text (same as indexing)
    • Generate query embedding (128D vector)
  2. Similarity Calculation:
    • Compute cosine similarity with all project vectors
    • Sort by similarity score (descending)
  3. Ranking:
    • Return top K results
    • Include metadata for display
  4. Performance:
    • < 5ms for 5 results
    • Scales linearly with project count

Phase 2: Browser Integration (PLANNED)

Planned Components

1. Browser Search Module (/src/ruvector/search/browser-search.js)

  • Load index from JSON
  • Client-side search execution
  • Filter by category, status, technologies
  • Debounced search input

2. Search UI Component (/src/ruvector/ui/search-widget.js)

  • Search input with autocomplete
  • Filter dropdowns
  • Results display
  • “Load more” pagination

3. Integration with Jekyll

  • Include search widget in project pages
  • Bilingual search (language switcher)
  • Mobile-responsive design

Technical Approach

Option A: Vanilla JavaScript

  • Zero dependencies
  • Direct DOM manipulation
  • Event-driven architecture
  • Works with Jekyll’s static output

Option B: Alpine.js

  • Lightweight reactivity (~15KB)
  • Declarative templates
  • Easy Jekyll integration

Search Features

  • Semantic matching: Find projects by intent, not just keywords
  • Category filtering: Educational, Games, Data Viz, etc.
  • Status filtering: Active, Live, Production Ready
  • Technology filtering: React, Python, TypeScript, etc.
  • Multilingual: Automatic language detection
  • Responsive: Mobile-first design

Current Status

Completed (Phase 1) ✅

  • Core indexing module
  • Build script with bilingual support
  • Test script for validation
  • NPM integration (prebuild hook)
  • Error handling and logging
  • Performance optimization (< 100ms build time)
  • Compact output (< 70 KB per language)
  • Documentation

Pending (Phase 2)

  • Browser search module
  • Search UI component
  • Filter implementation
  • Jekyll integration
  • Mobile optimization
  • Search analytics

Usage Examples

Build Index

# Build both English and Spanish indices
npm run build:search-index

# Output:
# ✓ assets/indices/projects-index.json (18 projects, 67.64 KB)
# ✓ assets/indices/projects-index-es.json (18 projects, 69.84 KB)
# Test with default queries
node scripts/test-search.js

# Test with custom query
node scripts/test-search.js "spanish learning"

# Results:
# 1. Aves - Bird-Focused Spanish Learning (Score: 0.9234)
# 2. Describe It - Spanish Learning Tool (Score: 0.9102)
# 3. Sinónimos de Hablar (Score: 0.8987)
# ...

Programmatic Usage

const { loadIndex, search } = require('./src/ruvector/indexing/project-indexer');

// Load index
const index = await loadIndex('assets/indices/projects-index.json');

// Search
const results = search(index, 'react typescript', 5);

// Display results
results.forEach(result => {
  console.log(`${result.name} (${result.score.toFixed(4)})`);
  console.log(`Category: ${result.category}`);
  console.log(`Technologies: ${result.technologies.join(', ')}`);
});

Performance Benchmarks

Build Performance

  • Projects: 18 per language (36 total)
  • Build Time: 76ms total
    • English: 6ms
    • Spanish: 2ms
  • Memory: < 10 MB peak
  • Output: 137 KB total (both languages)

Search Performance

  • Query Time: < 5ms for 5 results
  • Index Load: < 10ms
  • Memory: < 5 MB for loaded index
  • Scalability: O(n) linear with project count

Requirements

✅ Build time < 5 seconds (actual: 76ms) ✅ Output size < 500 KB (actual: 137 KB) ✅ Search time < 100ms (actual: < 5ms)

Dependencies

Production

  • js-yaml@^4.1.1 - YAML parsing

Development

  • None (uses Node.js built-ins)

Future Enhancements

Short-term (Phase 2)

  1. Browser search implementation
  2. UI components and styling
  3. Filter functionality
  4. Jekyll integration

Long-term (Phase 3+)

  1. RuVector Integration: Replace simple embeddings with actual RuVector
  2. HNSW Index: Hierarchical navigable small world for faster search
  3. Incremental Updates: Update index without full rebuild
  4. Search Analytics: Track popular queries
  5. Personalization: Learn user preferences
  6. Multilingual Models: Better cross-language search

Contributing

When adding new features:

  1. Follow existing code patterns
  2. Update documentation
  3. Add tests
  4. Maintain performance benchmarks
  5. Update this README

License

MIT - See main project LICENSE


Last Updated: 2025-12-01 Status: Phase 1 Complete, Phase 2 In Planning Maintainer: Backend Developer Agent