CCSeeker: AI-Powered YouTube Creator Discovery
A Streamlit application that automates niche creator discovery using a blend of AI and search algorithms for relevance and similarity rankings.
The Problem
Digital marketers spend 4-6 hours per campaign manually searching for niche YouTube creators. Existing tools either cost $300-500/month or only match keywords in channel names—missing specialists like “Dave’s Reviews” who posts nothing but tent comparisons.
The hardest searches are when you don’t know the niche vocabulary. How do you find “manga YouTubers” if you don’t know shounen from seinen?
The Solution
I built CCSeeker around one insight: the best way to find niche creators is often to start with one you already know.
Keyword Search → When You Know Your Niche
Enter topics like “manga reviews” and CCSeeker finds channels where multiple videos match—not just channels with keywords in their name. A channel called “Alex Reads Comics” with 40 manga videos ranks higher than “MANGA REVIEWS 2024” with 3 random uploads.
Channel-as-Seed → When You Have an Example
Paste a YouTube URL you like, and CCSeeker extracts what makes it tick (topics, posting frequency, engagement, tags), then finds similar creators ranked on a 100-point scale.
| Starting Point | Mental Model | Example |
|---|---|---|
| “I know my niche” | Filter and rank | “Show me camping gear reviewers” |
| “I have an example” | Find similar | “Find channels like this one” |
Result: 4-6 hours → under 10 minutes, a ~95% time reduction.
How It Works
Key Features
Smart Filtering: Set a minimum subscriber count, filter by country, or require recent activity. Filters apply before expensive API calls.
AI Integration: Enhances relevance and similarity scoring, and generates channel summaries and personalized outreach emails (English/Spanish) via Google Gemini.
Debug Panel: Real-time API usage, quota tracking, performance timing, and cache-effectiveness metrics.
ML Feedback Loop: Captures user satisfaction after each search to feed BI analysis and ML training (implemented), with automated updates to scoring weights on the roadmap.
Results & Learnings
What Worked:
- Hybrid search (video content + channel names) outperformed single-signal approaches
- Per-channel caching reduced redundant API calls by 75%
- Debug panel eliminated “why is this slow?” questions
What I’d Do Differently:
- A/B test the 80/20 AI blend ratio with more feedback data
- Weight video and channel descriptions as fallback for channels without tags
- Add French/Portuguese/German stopwords for broader language support
- Make scoring even more transparent for users
This project reflects my approach: start with a clear user problem, design for real-world constraints (API quotas, cost), make the system transparent enough that users trust it, and finally, improve it with feedback.
Scoring Methodology
Human control + AI intelligence
Both search modes blend 80% algorithmic scoring (fast, deterministic) with 20% AI analysis (semantic, catches edge cases). This delivers quality results while staying within free API quotas.
Relevance Score (Keyword Mode)
Measures how well a channel’s content matches your query. Up to 50 of the latest videos are analyzed, and each video is checked for keyword matches. Titles are weighted 2:1 over tags, and the average of all video scores is the final relevance score.
Similarity Score (Seed Mode)
Measures similarity across five dimensions:
| Factor | Weight | What It Measures |
|---|---|---|
| Tag Overlap | 30% | Similar video tags (Jaccard similarity) |
| Keyword Overlap | 30% | Similar title keywords |
| Engagement Rate | 17% | Similar audience interaction |
| Subscriber Tier | 15% | Similar channel size |
| Upload Frequency | 8% | Similar posting pace |
The AI component catches stylistic similarities the algorithm misses.
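As a minimal sketch, the weighted combination might look like the following (function and key names are hypothetical; the real implementation lives in app/core/similarity.py, and component scores are assumed to be normalized to 0-1 before weighting):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |intersection| / |union| (0.0 when both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Weights from the table above.
WEIGHTS = {"tags": 0.30, "keywords": 0.30, "engagement": 0.17,
           "subscribers": 0.15, "frequency": 0.08}

def similarity_score(components: dict[str, float]) -> float:
    """Weighted sum of the five factors, scaled to the 100-point scale."""
    return 100 * sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)
```

For example, `jaccard({"manga", "anime"}, {"anime", "review"})` returns 1/3, and a candidate matching the seed perfectly on all five components would score 100.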
AI Integration
Google Gemini (gemini-2.0-flash-lite) enhances CCSeeker in four ways:
| Feature | What It Does |
|---|---|
| Semantic Relevance | Evaluates if keyword matches actually make sense |
| Similarity “Vibe” | Rates how similar top 10 candidates feel to seed |
| Channel Summaries | One-paragraph overviews from video titles |
| Outreach Drafts | Personalized emails for top 3 matches (EN/ES) |
Graceful Degradation: All AI features are optional. Without Gemini, scoring falls back to 100% algorithmic—still effective.
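The fallback policy can be sketched as a small wrapper (hypothetical names; the real code wraps the Gemini client in app/core/gemini_api.py):

```python
from typing import Callable, Optional

def score_with_fallback(algo_fn: Callable[[], float],
                        ai_fn: Optional[Callable[[], float]] = None,
                        algo_weight: float = 0.8) -> float:
    """Blend algorithmic and AI scores, degrading gracefully: if no AI
    scorer is configured, or the AI call raises, return the algorithmic
    score alone."""
    algo = algo_fn()                 # deterministic score, always available
    if ai_fn is None:
        return algo                  # AI disabled: 100% algorithmic
    try:
        ai = ai_fn()                 # semantic pass (may hit quota/network errors)
    except Exception:
        return algo                  # API failure: fall back silently
    return algo_weight * algo + (1 - algo_weight) * ai
```

The same wrapper covers both "AI disabled" and "AI failed mid-search", so the pipeline never needs a special code path for either.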
Search Pipeline
- Both modes share a 10-step pipeline.
- Key optimization: filters apply at Steps 3-4, before the expensive video fetch at Step 5.
- Step 5 (video details) is the bottleneck without AI; Step 7 (AI scoring) is the bottleneck with AI enabled.
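The filter-before-fetch idea can be sketched as follows (illustrative names and dict shapes; the real steps live in app/core/pipeline.py):

```python
def run_pipeline(channels, filters, fetch_videos):
    """Apply cheap metadata filters (Steps 3-4) before the expensive
    per-channel video fetch (Step 5). `channels` is assumed to be a list
    of dicts with 'id', 'subs', and 'country'; `fetch_videos` is the
    costly API call we avoid for filtered-out channels."""
    survivors = [
        c for c in channels
        if c["subs"] >= filters.get("min_subs", 0)
        and (not filters.get("country") or c["country"] == filters["country"])
    ]
    # Only survivors pay the video-details quota cost; cap at 50 (Step 4).
    return {c["id"]: fetch_videos(c["id"]) for c in survivors[:50]}
```

Because filtering uses data already fetched in Steps 1-2, every channel dropped here is pure quota savings.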
Technical Pipeline
Pipeline Step Reference
| Step | Code | Description | Cache |
|---|---|---|---|
| 0.5 | P0 | Query validation (max 2 terms) | — |
| 1 | P1 | Hybrid search + initial ranking | 3-day |
| 2 | P2 | Fetch channel stats | 7-day |
| 3 | F1 | User filters (subs, country, activity) | — |
| 4 | F2 | Backend filters (score threshold, cap 50) | — |
| 5-6 | D1 | Deep video analysis (10 videos/channel) | 24h smart |
| 7 | SC1 | Blended relevance score (both modes) | — |
| 8 | SC2 | Similarity score (seed mode only) | — |
| 9 | O1 | AI summary generation | — |
| 10 | O2 | Results display | — |
| — | O3 | Outreach drafts (optional) | — |
Entry Points: E1 (Keywords), E2 (Seed) | Exit Points: X1 (Relevance), X2 (Similarity)
APIs: YT-1/2/3 = YouTube Data API v3 | GEM-1/2/3 = Google Gemini
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ PRESENTATION │ app/main.py, debug_ui.py - UI, session state │
├─────────────────────────────────────────────────────────────────┤
│ CACHE │ app/cache/ - @cache_data, per-channel cache │
├─────────────────────────────────────────────────────────────────┤
│ CORE │ app/core/ - Pure logic │
├─────────────────────────────────────────────────────────────────┤
│ ANALYTICS │ app/analytics/ - ML, feedback, quota tracker │
├─────────────────────────────────────────────────────────────────┤
│ EXTERNAL │ YouTube Data API v3 │ Google Gemini AI │
└─────────────────────────────────────────────────────────────────┘
Project Structure
CCSeeker/
├── app/
│ ├── core/ # Pure business logic (testable)
│ │ ├── pipeline.py # Search orchestration
│ │ ├── relevance.py # Keyword scoring
│ │ ├── youtube_api.py # API wrappers
│ │ ├── gemini_api.py # AI wrappers
│ │ ├── scoring_version.py # Centralized scoring weights
│ │ ├── seed_topics.py # Topic extraction
│ │ └── similarity.py # Multi-factor similarity (Streamlit-agnostic)
│ ├── cache/ # Streamlit caching layer
│ │ ├── cache_layer.py # @cache_data wrappers
│ │ └── smart_cache.py # Per-channel video caching (24h TTL)
│ ├── analytics/ # ML, analytics, and tracking
│ │ ├── ml_trainer.py # Logistic regression, cross-validation
│ │ ├── weight_optimizer.py # Scoring weight optimization
│ │ ├── fabric_export.py # Power BI export
│ │ ├── feedback_tracker.py # User feedback persistence
│ │ └── quota_tracker.py # API usage tracking (pure logic)
│ ├── main.py # UI (~1,675 lines)
│ └── debug_ui.py # Debug panel UI
├── tests/ # Unit tests (367 tests, mocked APIs)
└── docs/
Scoring Algorithms
Relevance (Keyword Mode)
```python
def calculate_keyword_relevance(df, query, title_weight=2.0, tags_weight=1.0):
    # Per-video: title match (0.67) + tags match (0.33) = 1.0 max
    # Channel score = average of video scores → 0.0 to 1.0
    # Final = 80% algorithmic + 20% Gemini
    ...
```
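An illustrative expansion of that signature (DataFrame columns `channel_id`, `title`, and `tags` are assumptions; the real logic is in app/core/relevance.py):

```python
import pandas as pd

def calculate_keyword_relevance(df: pd.DataFrame, query: str,
                                title_weight: float = 2.0,
                                tags_weight: float = 1.0) -> pd.Series:
    """Per-video: fraction of query terms found in the title and in the
    tags, weighted 2:1, normalized so a full match scores 1.0.
    Channel score is the mean of its video scores (0.0-1.0)."""
    terms = [t.strip().lower() for t in query.split(",") if t.strip()]
    total = title_weight + tags_weight

    def video_score(row):
        title = row["title"].lower()
        tags = " ".join(row["tags"]).lower()
        title_hit = sum(t in title for t in terms) / len(terms)
        tag_hit = sum(t in tags for t in terms) / len(terms)
        return (title_weight * title_hit + tags_weight * tag_hit) / total

    scores = df.apply(video_score, axis=1)
    return scores.groupby(df["channel_id"]).mean()
```

With the default weights, a title match contributes 2/3 (≈0.67) and a tag match 1/3 (≈0.33) of a video's score, matching the comments above.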
Similarity (Seed Mode)
Two-pass optimization:
- Calculate algorithmic scores for ALL candidates (fast)
- Enhance only top 10 with Gemini (expensive)
- Re-sort after AI enhancement
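The steps above can be sketched as (hypothetical callables; the real code is in app/core/similarity.py):

```python
def two_pass_rank(candidates, algo_score, ai_enhance, top_n=10):
    """Pass 1: cheap algorithmic scores for every candidate.
    Pass 2: blend expensive AI scores into only the top `top_n`,
    then re-sort that head before returning the full ranking."""
    ranked = sorted(candidates, key=algo_score, reverse=True)
    head, tail = ranked[:top_n], ranked[top_n:]
    blended = {id(c): 0.8 * algo_score(c) + 0.2 * ai_enhance(c) for c in head}
    head.sort(key=lambda c: blended[id(c)], reverse=True)
    return head + tail
```

The tail keeps its algorithmic order, so AI cost stays constant regardless of how many candidates survive filtering.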
Key Technical Decisions
| Decision | Rationale |
|---|---|
| Layered architecture | Core logic testable without Streamlit |
| Callback pattern | Progress updates without st.progress() dependency |
| Dataclass results | Type safety, errors as data (not exceptions) |
| Filter before fetch | Save API quota by eliminating work early |
| Per-channel caching | Same channel in multiple searches shares cache |
| Soft penalties | “2024” might be noise or relevant, so don't hard-block |
Performance & Optimization
Caching Strategy
| Cache Type | TTL | Rationale |
|---|---|---|
| Search results | 3 days | Queries repeat; results stable |
| Channel stats | 7 days | Subscriber counts change slowly |
| Video details | 24 hours | Per-channel, not per-query |
Cache key normalization: search("manga, anime") and search("anime, manga") hit the same cache.
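A minimal sketch of that normalization (helper name hypothetical; the real key construction lives in app/cache/):

```python
def normalize_query(query: str) -> str:
    """Order-insensitive cache key: lowercase, trim, deduplicate,
    and sort the comma-separated terms."""
    terms = {t.strip().lower() for t in query.split(",") if t.strip()}
    return ",".join(sorted(terms))
```

Both `normalize_query("manga, anime")` and `normalize_query("anime, manga")` produce `"anime,manga"`, so they share one cache entry.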
Benchmarks (Streamlit Cloud, Jan 2026)
Keyword Search Mode
| Scenario | Time | Quota | Bottleneck |
|---|---|---|---|
| 1 term, cold, no AI | 9-10s | 400 units | Video details (84-87%) |
| 1 term, warm, no AI | <0.1s | 100 units | Relevance filtering |
| 1 term, warm, with AI | 17-19s | 100 units | AI relevance (92-94%) |
| 2 terms, warm, with AI | 20-25s | 200 units | AI relevance |
Seed-Based Search Mode
| Scenario | Time | Quota | Bottleneck |
|---|---|---|---|
| 1 term, cold, no AI | 12-15s | 450 units | Video details (70-75%) |
| 1 term, warm, with AI | 25-30s | 100 units | AI relevance (55-60%) |
| 2 terms, warm, with AI | 35-45s | 200 units | AI + similarity calc |
Cache benefit: 99% faster, 75% less quota on repeat searches.
Storage Behavior
| Cache Type | Local | Streamlit Cloud |
|---|---|---|
| @st.cache_data | Persists while running | Resets on restart |
| .feedback_data.json | Persists indefinitely | Resets on restart |
App restarts: idle timeout (~7 days), git push, platform maintenance.
Observability
Debug Panel tracks: API calls, quota units, timing per stage, cache hits (<50ms = hit).
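The sub-50ms cache-hit heuristic can be sketched as a timing wrapper (hypothetical helper; the real panel lives in app/debug_ui.py):

```python
import time

def timed_call(fn, *args, hit_threshold_ms=50):
    """Run `fn`, report elapsed milliseconds, and flag sub-threshold
    calls as cache hits, mirroring the debug panel's heuristic."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms, elapsed_ms < hit_threshold_ms
```

A real API round-trip takes hundreds of milliseconds, so anything returning in under 50ms almost certainly came from cache.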
Feedback System captures user satisfaction after each search:
- Inputs: Thumbs up/down with optional reason (poor fit, low quality, wrong topic)
- Data collected: Timestamp, search mode, query, top 5 results with scores, filter settings, AI enabled flag
- Seed mode extras: Full scoring component breakdown (tag, keyword, subscriber, engagement, frequency scores)
Analytics Pipeline enables ML-powered improvements:
- Logistic regression models with cross-validation
- Weight optimization based on feedback correlation
- Export to Microsoft Fabric/Power BI for dashboards
Feedback Loop Status: Data collection active, ML training implemented, automated weight updates on the roadmap (tied to a scoring-version signature).
Testing Strategy
367 total tests covering all core modules with mocked API clients.
| Test File | Tests | Focus |
|---|---|---|
| test_query_utils.py | 21 | URL parsing, validation |
| test_relevance.py | 13 | Scoring accuracy, edge cases |
| test_youtube_api.py | 29 | Search results, channel stats, error handling |
| test_gemini_api.py | 31 | AI scoring, summary generation, API failures |
| test_pipeline.py | 26 | Full flow, filters, early exits, callbacks |
| test_seed_topics.py | 46 | Topic extraction, language detection |
| test_similarity.py | 63 | Similarity scoring, callbacks, Gemini integration |
| test_analytics.py | 27 | ML training, weight optimization |
| test_feedback_tracker.py | 27 | Feedback persistence, export |
| test_quota_tracker.py | 42 | Quota calculations, persistence, tracking |
| test_scoring_version.py | 26 | Scoring weights, version management |
| test_performance.py | 16 | Performance benchmarks, timing |
Approach: Mock APIs, test edge cases, verify callbacks, validate ML pipelines.
Tech Stack
| Layer | Technology | Why |
|---|---|---|
| Language | Python 3.11 | Type hints, broad ecosystem |
| Framework | Streamlit 1.49 | Rapid prototyping, free hosting |
| APIs | YouTube v3, Gemini 2.0 Flash Lite | ToS-compliant, generous free tiers |
| Data | Pandas | Efficient filtering/grouping |
| Testing | pytest (367 tests) | Standard, good mocking, full coverage |
| ML | scikit-learn | Logistic regression, cross-validation |
Known Limitations
| Limitation | Impact | Potential Fix |
|---|---|---|
| YouTube quota (10K/day) | ~25 cold searches/day | BYOK option |
| Language (EN/ES only) | Quality degrades elsewhere | Add FR/DE/PT stopwords |
| Tag dependency | Tagless channels max 70/100 | Weight descriptions as fallback |
| Ephemeral storage | Feedback resets on restart | Migrate to cloud storage |