CCSeeker: AI-Powered YouTube Creator Discovery

A Streamlit application that automates niche creator discovery using a blend of AI and search algorithms for relevance and similarity rankings.

Client: Personal Project | Role: Solo Developer & Product Designer

Python · Streamlit · YouTube Data API v3 · Google Gemini AI · Pandas
CCSeeker application showing YouTube creator search results


Impact

  • 95% Time Saved
  • 2 Search Modes
  • 10 min Avg. Search Time

The Problem

Digital marketers spend 4-6 hours per campaign manually searching for niche YouTube creators. Existing tools either cost $300-500/month or only match keywords in channel names—missing specialists like “Dave’s Reviews” who posts nothing but tent comparisons.

The hardest searches are when you don’t know the niche vocabulary. How do you find “manga YouTubers” if you don’t know shounen from seinen?

The Solution

I built CCSeeker around one insight: the best way to find niche creators is often to start with one you already know.

Keyword Search → When You Know Your Niche

Enter topics like “manga reviews” and CCSeeker finds channels where multiple videos match—not just channels with keywords in their name. A channel called “Alex Reads Comics” with 40 manga videos ranks higher than “MANGA REVIEWS 2024” with 3 random uploads.
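The content-over-name ranking idea can be sketched as a tiny helper (hypothetical, not CCSeeker's actual code): score a channel by the fraction of its videos that match the query, so a channel's name alone never decides its rank.

```python
def channel_match_ratio(video_titles: list[str], query: str) -> float:
    """Fraction of a channel's videos whose title mentions any query term."""
    terms = [t.strip().lower() for t in query.split() if t.strip()]
    hits = sum(1 for title in video_titles
               if any(term in title.lower() for term in terms))
    return hits / len(video_titles) if video_titles else 0.0

# "Alex Reads Comics" with 40/50 manga videos outranks "MANGA REVIEWS 2024"
# with 3/50 matching uploads, regardless of what the channel is called.
```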

Channel-as-Seed → When You Have an Example

Paste a YouTube URL you like, and CCSeeker extracts what makes it tick (topics, posting frequency, engagement, tags), then finds similar creators ranked on a 100-point scale.

| Starting Point | Mental Model | Example |
| --- | --- | --- |
| “I know my niche” | Filter and rank | “Show me camping gear reviewers” |
| “I have an example” | Find similar | “Find channels like this one” |

Result: 4-6 hours → under 10 minutes. That’s the 95% time reduction.

How It Works

CCSeeker Product Flow

Key Features

Smart Filtering Set a minimum subscriber count, filter by country, or require recent activity. Filters apply before expensive API calls.

AI Integration Google Gemini enhances relevance and similarity scoring, and generates channel summaries and personalized outreach emails (English/Spanish).

Debug Panel Real-time API usage, quota tracking, performance timing, and cache-effectiveness metrics.

ML Feedback Loop Captures user satisfaction after each search; the data feeds BI analysis and machine-learning training, with automated updates to scoring weights on the roadmap.

Results & Learnings

What Worked:

  • Hybrid search (video content + channel names) outperformed single-signal approaches
  • Per-channel caching reduced redundant API calls by 75%
  • Debug panel eliminated “why is this slow?” questions

What I’d Do Differently:

  • A/B test the 80/20 AI blend ratio with more feedback data
  • Weight video and channel descriptions as fallback for channels without tags
  • Add French/Portuguese/German stopwords for broader language support
  • Make scoring even more transparent to users

This project reflects my approach: start with a clear user problem, design for real-world constraints (API quotas, cost), make the system transparent enough that users trust it, and keep improving it with feedback.

Scoring Methodology

Human control + AI intelligence

Both search modes blend 80% algorithmic scoring (fast, deterministic) with 20% AI analysis (semantic, catches edge cases). This delivers quality results while staying within free API quotas.

Relevance Score (Keyword Mode)

Measures how well a channel’s content matches your query. Up to 50 of the latest videos are analyzed, and each video is checked for keyword matches. Titles are weighted 2:1 over tags, and the average of all video scores is the final relevance score.

Similarity Score (Seed Mode)

Measures similarity across five dimensions:

| Factor | Weight | What It Measures |
| --- | --- | --- |
| Tag Overlap | 30% | Similar video tags (Jaccard similarity) |
| Keyword Overlap | 30% | Similar title keywords |
| Engagement Rate | 17% | Similar audience interaction |
| Subscriber Tier | 15% | Similar channel size |
| Upload Frequency | 8% | Similar posting pace |

The AI component catches stylistic similarities the algorithm misses.
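A minimal sketch of the two building blocks above, assuming the table's weights and per-factor similarities already normalized to 0-1 (names are illustrative, not CCSeeker's actual API):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |intersection| / |union| (0.0 when both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Weights from the table above (sum to 1.0).
WEIGHTS = {"tags": 0.30, "keywords": 0.30, "engagement": 0.17,
           "subs": 0.15, "frequency": 0.08}

def similarity_score(factors: dict[str, float]) -> float:
    """Weighted sum of per-factor similarities, scaled to 0-100."""
    return 100 * sum(WEIGHTS[k] * factors.get(k, 0.0) for k in WEIGHTS)
```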


AI Integration

Google Gemini (gemini-2.0-flash-lite) enhances CCSeeker in four ways:

| Feature | What It Does |
| --- | --- |
| Semantic Relevance | Evaluates if keyword matches actually make sense |
| Similarity “Vibe” | Rates how similar top 10 candidates feel to seed |
| Channel Summaries | One-paragraph overviews from video titles |
| Outreach Drafts | Personalized emails for top 3 matches (EN/ES) |

Graceful Degradation: All AI features are optional. Without Gemini, scoring falls back to 100% algorithmic—still effective.
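The degradation path can be sketched as a guard around the AI call; `call_gemini_relevance` is a hypothetical wrapper standing in for the real Gemini client:

```python
def score_with_optional_ai(algorithmic: float, ai_enabled: bool,
                           call_gemini_relevance) -> float:
    """Blend in an AI score when possible; otherwise stay fully algorithmic."""
    if not ai_enabled:
        return algorithmic
    try:
        ai = call_gemini_relevance()   # may raise on quota/network errors
    except Exception:
        return algorithmic             # degrade gracefully, don't crash
    return 0.8 * algorithmic + 0.2 * ai
```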

Search Pipeline

  • Both modes share a 10-step pipeline.
  • Key optimization: filters apply at Steps 3-4, before the expensive video fetch at Step 5.
  • Step 5 (video details) is the bottleneck without AI; Step 7 (AI scoring) is the bottleneck with AI enabled.
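The ordering above is the whole optimization: cheap metadata filters run first, so the expensive per-channel video fetch only happens for survivors. A sketch under assumed field names (`subs`, `id`), not CCSeeker's actual schema:

```python
def run_pipeline(channels: list[dict], min_subs: int, fetch_videos) -> list[dict]:
    """Filter on already-fetched stats before spending quota on video details."""
    # Steps 3-4: filter and cap using data we already have (no API cost)
    survivors = [c for c in channels if c["subs"] >= min_subs][:50]
    # Step 5: the expensive call runs only for channels that passed the filters
    for c in survivors:
        c["videos"] = fetch_videos(c["id"])
    return survivors
```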

Technical Pipeline

CCSeeker Technical Pipeline

Pipeline Step Reference
| Step | Code | Description | Cache |
| --- | --- | --- | --- |
| 0.5 | P0 | Query validation (max 2 terms) | |
| 1 | P1 | Hybrid search + initial ranking | 3-day |
| 2 | P2 | Fetch channel stats | 7-day |
| 3 | F1 | User filters (subs, country, activity) | |
| 4 | F2 | Backend filters (score threshold, cap 50) | |
| 5-6 | D1 | Deep video analysis (10 videos/channel) | 24h smart |
| 7 | SC1 | Blended relevance score (both modes) | |
| 8 | SC2 | Similarity score (seed mode only) | |
| 9 | O1 | AI summary generation | |
| 10 | O2 | Results display | |
| | O3 | Outreach drafts (optional) | |

Entry Points: E1 (Keywords), E2 (Seed) | Exit Points: X1 (Relevance), X2 (Similarity)

APIs: YT-1/2/3 = YouTube Data API v3 | GEM-1/2/3 = Google Gemini

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│  PRESENTATION    │  app/main.py, debug_ui.py - UI, session state │
├─────────────────────────────────────────────────────────────────┤
│  CACHE           │  app/cache/ - @cache_data, per-channel cache  │
├─────────────────────────────────────────────────────────────────┤
│  CORE            │  app/core/ - Pure logic                       │
├─────────────────────────────────────────────────────────────────┤
│  ANALYTICS       │  app/analytics/ - ML, feedback, quota tracker │
├─────────────────────────────────────────────────────────────────┤
│  EXTERNAL        │  YouTube Data API v3  │  Google Gemini AI    │
└─────────────────────────────────────────────────────────────────┘

Project Structure

CCSeeker/
├── app/
│   ├── core/                     # Pure business logic (testable)
│   │   ├── pipeline.py           # Search orchestration
│   │   ├── relevance.py          # Keyword scoring
│   │   ├── youtube_api.py        # API wrappers
│   │   ├── gemini_api.py         # AI wrappers
│   │   ├── scoring_version.py    # Centralized scoring weights
│   │   ├── seed_topics.py        # Topic extraction
│   │   └── similarity.py         # Multi-factor similarity (Streamlit-agnostic)
│   ├── cache/                    # Streamlit caching layer
│   │   ├── cache_layer.py        # @cache_data wrappers
│   │   └── smart_cache.py        # Per-channel video caching (24h TTL)
│   ├── analytics/                # ML, analytics, and tracking
│   │   ├── ml_trainer.py         # Logistic regression, cross-validation
│   │   ├── weight_optimizer.py   # Scoring weight optimization
│   │   ├── fabric_export.py      # Power BI export
│   │   ├── feedback_tracker.py   # User feedback persistence
│   │   └── quota_tracker.py      # API usage tracking (pure logic)
│   ├── main.py                   # UI (~1,675 lines)
│   └── debug_ui.py               # Debug panel UI
├── tests/                        # Unit tests (367 tests, mocked APIs)
└── docs/

Scoring Algorithms

Relevance (Keyword Mode)

def calculate_keyword_relevance(df, query, title_weight=2.0, tags_weight=1.0):
    # Per-video: title match (0.67) + tags match (0.33) = 1.0 max
    # Channel score = average of video scores → 0.0 to 1.0
    # Final = 80% algorithmic + 20% Gemini
    ...

Similarity (Seed Mode)

Two-pass optimization:

  1. Calculate algorithmic scores for ALL candidates (fast)
  2. Enhance only top 10 with Gemini (expensive)
  3. Re-sort after AI enhancement
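The three steps above can be sketched end to end; `ai_score` stands in for the Gemini call, and the 0.8/0.2 blend follows the scoring methodology described earlier (illustrative code, not the real pipeline):

```python
def two_pass_rank(candidates: list[dict], ai_score) -> list[dict]:
    """Score everyone algorithmically, enhance only the top 10 with AI, re-sort."""
    # Pass 1: fast algorithmic score for ALL candidates
    ranked = sorted(candidates, key=lambda c: c["algo"], reverse=True)
    # Pass 2: expensive AI call only for the current top 10
    for c in ranked[:10]:
        c["final"] = 0.8 * c["algo"] + 0.2 * ai_score(c)
    for c in ranked[10:]:
        c["final"] = c["algo"]
    # Re-sort: AI enhancement can reshuffle the head of the list
    return sorted(ranked, key=lambda c: c["final"], reverse=True)
```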

Key Technical Decisions

| Decision | Rationale |
| --- | --- |
| Layered architecture | Core logic testable without Streamlit |
| Callback pattern | Progress updates without st.progress() dependency |
| Dataclass results | Type safety, errors as data (not exceptions) |
| Filter before fetch | Save API quota by eliminating work early |
| Per-channel caching | Same channel in multiple searches shares cache |
| Soft penalties | “2024” might be noise or relevant—don’t hard-block |
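A soft penalty scales a score down per ambiguous token instead of excluding the channel outright; a minimal sketch with an illustrative token list and penalty factor:

```python
NOISY_TOKENS = {"2024", "official", "tv"}   # illustrative, not the real list

def apply_soft_penalty(score: float, title: str, factor: float = 0.9) -> float:
    """Each noisy token multiplies the score by `factor` (10% shave by default)."""
    penalties = sum(1 for t in title.lower().split() if t in NOISY_TOKENS)
    return score * (factor ** penalties)
```

A hard block would zero out a genuinely relevant channel that happens to put a year in its title; the multiplier just nudges it down the ranking.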

Performance & Optimization

Caching Strategy

| Cache Type | TTL | Rationale |
| --- | --- | --- |
| Search results | 3 days | Queries repeat; results stable |
| Channel stats | 7 days | Subscriber counts change slowly |
| Video details | 24 hours | Per-channel, not per-query |

Cache key normalization: search("manga, anime") and search("anime, manga") hit the same cache.
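Order-insensitive keys fall out of normalizing the query before it reaches the cache; a sketch of one plausible normalization (lowercase, split on commas, sort):

```python
def cache_key(query: str) -> str:
    """Canonical cache key: comma-split terms, lowercased and sorted."""
    terms = sorted(t.strip().lower() for t in query.split(",") if t.strip())
    return "|".join(terms)

# cache_key("manga, anime") and cache_key("anime, manga") collide on purpose,
# so either spelling hits the same cached search result.
```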

Benchmarks (Streamlit Cloud, Jan 2026)

Keyword Search Mode

| Scenario | Time | Quota | Bottleneck |
| --- | --- | --- | --- |
| 1 term, cold, no AI | 9-10s | 400 units | Video details (84-87%) |
| 1 term, warm, no AI | <0.1s | 100 units | Relevance filtering |
| 1 term, warm, with AI | 17-19s | 100 units | AI relevance (92-94%) |
| 2 terms, warm, with AI | 20-25s | 200 units | AI relevance |

Seed-Based Search Mode

| Scenario | Time | Quota | Bottleneck |
| --- | --- | --- | --- |
| 1 term, cold, no AI | 12-15s | 450 units | Video details (70-75%) |
| 1 term, warm, with AI | 25-30s | 100 units | AI relevance (55-60%) |
| 2 terms, warm, with AI | 35-45s | 200 units | AI + similarity calc |

Cache benefit: 99% faster, 75% less quota on repeat searches.

Storage Behavior

| Cache Type | Local | Streamlit Cloud |
| --- | --- | --- |
| @st.cache_data | Persists while running | Resets on restart |
| .feedback_data.json | Persists indefinitely | Resets on restart |

App restarts: idle timeout (~7 days), git push, platform maintenance.


Observability

Debug Panel tracks: API calls, quota units, timing per stage, cache hits (<50ms = hit).

Feedback System captures user satisfaction after each search:

  • Inputs: Thumbs up/down with optional reason (poor fit, low quality, wrong topic)
  • Data collected: Timestamp, search mode, query, top 5 results with scores, filter settings, AI enabled flag
  • Seed mode extras: Full scoring component breakdown (tag, keyword, subscriber, engagement, frequency scores)

Analytics Pipeline enables ML-powered improvements:

  • Logistic regression models with cross-validation
  • Weight optimization based on feedback correlation
  • Export to Microsoft Fabric/Power BI for dashboards
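The training step above can be sketched with scikit-learn; features and labels here are synthetic stand-ins for the per-search scoring components and thumbs-up/down labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Columns: tag, keyword, engagement, subscriber, frequency scores (0-1)
X = rng.random((200, 5))
# Label: thumbs up (1) / thumbs down (0); synthetic here, tag-driven
y = (X[:, 0] + 0.1 * rng.random(200) > 0.5).astype(int)

model = LogisticRegression(max_iter=1000)
fold_accuracy = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
model.fit(X, y)
# Large positive coefficients flag factors that predict satisfaction, which is
# the signal a weight optimizer would use to retune the 30/30/17/15/8 split.
```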

Feedback Loop Status: Data collection active, ML training implemented, automated weight updates on the roadmap (with version signatures).


Testing Strategy

367 total tests covering all core modules with mocked API clients.

| Test File | Tests | Focus |
| --- | --- | --- |
| test_query_utils.py | 21 | URL parsing, validation |
| test_relevance.py | 13 | Scoring accuracy, edge cases |
| test_youtube_api.py | 29 | Search results, channel stats, error handling |
| test_gemini_api.py | 31 | AI scoring, summary generation, API failures |
| test_pipeline.py | 26 | Full flow, filters, early exits, callbacks |
| test_seed_topics.py | 46 | Topic extraction, language detection |
| test_similarity.py | 63 | Similarity scoring, callbacks, Gemini integration |
| test_analytics.py | 27 | ML training, weight optimization |
| test_feedback_tracker.py | 27 | Feedback persistence, export |
| test_quota_tracker.py | 42 | Quota calculations, persistence, tracking |
| test_scoring_version.py | 26 | Scoring weights, version management |
| test_performance.py | 16 | Performance benchmarks, timing |

Approach: Mock APIs, test edge cases, verify callbacks, validate ML pipelines.


Tech Stack

| Layer | Technology | Why |
| --- | --- | --- |
| Language | Python 3.11 | Type hints, broad ecosystem |
| Framework | Streamlit 1.49 | Rapid prototyping, free hosting |
| APIs | YouTube v3, Gemini 2.0 Flash Lite | ToS-compliant, generous free tiers |
| Data | Pandas | Efficient filtering/grouping |
| Testing | pytest (367 tests) | Standard, good mocking, full coverage |
| ML | scikit-learn | Logistic regression, cross-validation |

Known Limitations

| Limitation | Impact | Potential Fix |
| --- | --- | --- |
| YouTube quota (10K/day) | ~25 cold searches/day | BYOK option |
| Language (EN/ES only) | Quality degrades elsewhere | Add FR/DE/PT stopwords |
| Tag dependency | Tagless channels max 70/100 | Weight descriptions as fallback |
| Ephemeral storage | Feedback resets on restart | Migrate to cloud storage |
Ephemeral storageFeedback resets on restartMigrate to cloud storage