CCSeeker: AI-Powered YouTube Creator Discovery
A Streamlit application that automates niche creator discovery using a blend of AI and search algorithms for relevance and similarity rankings.
The Problem
Digital marketers spend 4-6 hours per campaign manually searching for niche YouTube creators. Existing tools either cost $300-500/month or only match keywords in channel names—missing specialists like “Dave’s Reviews” who posts nothing but tent comparisons.
The hardest searches are when you don’t know the niche vocabulary. How do you find “manga YouTubers” if you don’t know shounen from seinen?
The Solution
I built CCSeeker around one insight: the best way to find niche creators is often to start with one you already know.
Keyword Search → When You Know Your Niche
Enter topics like “manga reviews” and CCSeeker finds channels where multiple videos match—not just channels with keywords in their name. A channel called “Alex Reads Comics” with 40 manga videos ranks higher than “MANGA REVIEWS 2024” with 3 random uploads.
Channel-as-Seed → When You Have an Example
Paste a YouTube URL you like, and CCSeeker extracts what makes it tick (topics, posting frequency, engagement, tags), then finds similar creators ranked on a 100-point scale.
| Starting Point | Mental Model | Example |
|---|---|---|
| “I know my niche” | Filter and rank | “Show me camping gear reviewers” |
| “I have an example” | Find similar | “Find channels like this one” |
Result: 4-6 hours → under 10 minutes, a ~95% time reduction.
How It Works
Key Features
Smart Filtering: Set a minimum subscriber count, filter by country, or require recent activity. Filters apply before expensive API calls.
AI Integration: Enhances relevance and similarity scoring, and generates channel summaries and personalized outreach emails (English/Spanish) via Google Gemini.
Debug Panel: Real-time API usage, quota tracking, performance timing, and cache-effectiveness metrics.
ML Feedback Loop: Captures user satisfaction after each search to feed BI analysis and ML training (implemented), with automated updates to scoring weights on the roadmap.
Results & Learnings
What Worked:
- Hybrid search (video content + channel names) outperformed single-signal approaches
- Per-channel caching reduced redundant API calls by 75%
- Debug panel eliminated “why is this slow?” questions
What I’d Do Differently:
- A/B test the 80/20 AI blend ratio with more feedback data
- Weight video and channel descriptions as fallback for channels without tags
- Add French/Portuguese/German stopwords for broader language support
- Make scoring even more transparent for users
This project reflects my approach: start with a clear user problem, design for real-world constraints (API quotas, cost), make the system transparent enough that users trust it, and finally, improve it with feedback.
Scoring Methodology
Human control + AI intelligence
Both search modes blend 80% algorithmic scoring (fast, deterministic) with 20% AI analysis (semantic, catches edge cases). This delivers quality results while staying within free API quotas.
Relevance Score (Keyword Mode)
Measures how well a channel’s content matches your query. Up to 50 of the latest videos are analyzed, and each video is checked for keyword matches. Titles are weighted 2:1 over tags, and the average of all video scores is the final relevance score.
Similarity Score (Seed Mode)
Measures similarity across five dimensions:
| Factor | Weight | What It Measures |
|---|---|---|
| Tag Overlap | 30% | Similar video tags (Jaccard similarity) |
| Keyword Overlap | 30% | Similar title keywords |
| Engagement Rate | 17% | Similar audience interaction |
| Subscriber Tier | 15% | Similar channel size |
| Upload Frequency | 8% | Similar posting pace |
The AI component catches stylistic similarities the algorithm misses.
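As a minimal sketch, the weighted combination might look like the following (function and key names are hypothetical; the real implementation lives in app/core/similarity.py, and component scores are assumed to be normalized to 0-1 before weighting):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |intersection| / |union| (0.0 when both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Weights from the table above.
WEIGHTS = {"tags": 0.30, "keywords": 0.30, "engagement": 0.17,
           "subscribers": 0.15, "frequency": 0.08}

def similarity_score(components: dict[str, float]) -> float:
    """Weighted sum of the five factors, scaled to the 100-point scale."""
    return 100 * sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)
```

For example, `jaccard({"manga", "anime"}, {"anime", "review"})` returns 1/3, and a candidate matching the seed perfectly on all five components would score 100.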
AI Integration
Google Gemini (gemini-2.0-flash-lite) enhances CCSeeker in four ways:
| Feature | What It Does |
|---|---|
| Semantic Relevance | Evaluates if keyword matches actually make sense |
| Similarity “Vibe” | Rates how similar top 10 candidates feel to seed |
| Channel Summaries | One-paragraph overviews from video titles |
| Outreach Drafts | Personalized emails for top 3 matches (EN/ES) |
Graceful Degradation: All AI features are optional. Without Gemini, scoring falls back to 100% algorithmic—still effective.
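The fallback policy can be sketched as a small wrapper (hypothetical names; the real code wraps the Gemini client in app/core/gemini_api.py):

```python
from typing import Callable, Optional

def score_with_fallback(algo_fn: Callable[[], float],
                        ai_fn: Optional[Callable[[], float]] = None,
                        algo_weight: float = 0.8) -> float:
    """Blend algorithmic and AI scores, degrading gracefully: if no AI
    scorer is configured, or the AI call raises, return the algorithmic
    score alone."""
    algo = algo_fn()                 # deterministic score, always available
    if ai_fn is None:
        return algo                  # AI disabled: 100% algorithmic
    try:
        ai = ai_fn()                 # semantic pass (may hit quota/network errors)
    except Exception:
        return algo                  # API failure: fall back silently
    return algo_weight * algo + (1 - algo_weight) * ai
```

The same wrapper covers both "AI disabled" and "AI failed mid-search", so the pipeline never needs a special code path for either.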
Search Pipeline
- Both modes share a 10-step pipeline.
- Key optimization: filters apply at Steps 3-4, before the expensive video fetch at Step 5.
- Step 5 (video details) is the bottleneck without AI; Step 7 (AI scoring) is the bottleneck with AI enabled.
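The filter-before-fetch idea can be sketched as follows (illustrative names and dict shapes; the real steps live in app/core/pipeline.py):

```python
def run_pipeline(channels, filters, fetch_videos):
    """Apply cheap metadata filters (Steps 3-4) before the expensive
    per-channel video fetch (Step 5). `channels` is assumed to be a list
    of dicts with 'id', 'subs', and 'country'; `fetch_videos` is the
    costly API call we avoid for filtered-out channels."""
    survivors = [
        c for c in channels
        if c["subs"] >= filters.get("min_subs", 0)
        and (not filters.get("country") or c["country"] == filters["country"])
    ]
    # Only survivors pay the video-details quota cost; cap at 50 (Step 4).
    return {c["id"]: fetch_videos(c["id"]) for c in survivors[:50]}
```

Because filtering uses data already fetched in Steps 1-2, every channel dropped here is pure quota savings.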
Technical Pipeline
Pipeline Step Reference
| Step | Code | Description | Cache |
|---|---|---|---|
| 0.5 | P0 | Query validation (max 2 terms) | — |
| 1 | P1 | Hybrid search + initial ranking | 3-day |
| 2 | P2 | Fetch channel stats | 7-day |
| 3 | F1 | User filters (subs, country, activity) | — |
| 4 | F2 | Backend filters (score threshold, cap 50) | — |
| 5-6 | D1 | Deep video analysis (10 videos/channel) | 24h smart |
| 7 | SC1 | Blended relevance score (both modes) | — |
| 8 | SC2 | Similarity score (seed mode only) | — |
| 9 | O1 | AI summary generation | — |
| 10 | O2 | Results display | — |
| — | O3 | Outreach drafts (optional) | — |
Entry Points: E1 (Keywords), E2 (Seed) | Exit Points: X1 (Relevance), X2 (Similarity)
APIs: YT-1/2/3 = YouTube Data API v3 | GEM-1/2/3 = Google Gemini
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ PRESENTATION │ app/main.py, debug_ui.py - UI, session state │
├─────────────────────────────────────────────────────────────────┤
│ CACHE │ app/cache/ - @cache_data, per-channel cache │
├─────────────────────────────────────────────────────────────────┤
│ CORE │ app/core/ - Pure logic │
├─────────────────────────────────────────────────────────────────┤
│ ANALYTICS │ app/analytics/ - ML, feedback, quota tracker │
├─────────────────────────────────────────────────────────────────┤
│ EXTERNAL │ YouTube Data API v3 │ Google Gemini AI │
└─────────────────────────────────────────────────────────────────┘
Project Structure
CCSeeker/
├── app/
│ ├── core/ # Pure business logic (testable)
│ │ ├── pipeline.py # Search orchestration
│ │ ├── relevance.py # Keyword scoring
│ │ ├── youtube_api.py # API wrappers
│ │ ├── gemini_api.py # AI wrappers
│ │ ├── scoring_version.py # Centralized scoring weights
│ │ ├── seed_topics.py # Topic extraction
│ │ └── similarity.py # Multi-factor similarity (Streamlit-agnostic)
│ ├── cache/ # Streamlit caching layer
│ │ ├── cache_layer.py # @cache_data wrappers
│ │ └── smart_cache.py # Per-channel video caching (24h TTL)
│ ├── analytics/ # ML, analytics, and tracking
│ │ ├── ml_trainer.py # Logistic regression, cross-validation
│ │ ├── weight_optimizer.py # Scoring weight optimization
│ │ ├── fabric_export.py # Power BI export
│ │ ├── feedback_tracker.py # User feedback persistence
│ │ └── quota_tracker.py # API usage tracking (pure logic)
│ ├── main.py # UI (~1,675 lines)
│ └── debug_ui.py # Debug panel UI
├── tests/ # Unit tests (367 tests, mocked APIs)
└── docs/
Scoring Algorithms
Relevance (Keyword Mode)
```python
def calculate_keyword_relevance(df, query, title_weight=2.0, tags_weight=1.0):
    # Per-video: title match (0.67) + tags match (0.33) = 1.0 max
    # Channel score = average of video scores → 0.0 to 1.0
    # Final = 80% algorithmic + 20% Gemini
    ...
```
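An illustrative expansion of that signature (DataFrame columns `channel_id`, `title`, and `tags` are assumptions; the real logic is in app/core/relevance.py):

```python
import pandas as pd

def calculate_keyword_relevance(df: pd.DataFrame, query: str,
                                title_weight: float = 2.0,
                                tags_weight: float = 1.0) -> pd.Series:
    """Per-video: fraction of query terms found in the title and in the
    tags, weighted 2:1, normalized so a full match scores 1.0.
    Channel score is the mean of its video scores (0.0-1.0)."""
    terms = [t.strip().lower() for t in query.split(",") if t.strip()]
    total = title_weight + tags_weight

    def video_score(row):
        title = row["title"].lower()
        tags = " ".join(row["tags"]).lower()
        title_hit = sum(t in title for t in terms) / len(terms)
        tag_hit = sum(t in tags for t in terms) / len(terms)
        return (title_weight * title_hit + tags_weight * tag_hit) / total

    scores = df.apply(video_score, axis=1)
    return scores.groupby(df["channel_id"]).mean()
```

With the default weights, a title match contributes 2/3 (≈0.67) and a tag match 1/3 (≈0.33) of a video's score, matching the comments above.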
Similarity (Seed Mode)
Two-pass optimization:
- Calculate algorithmic scores for ALL candidates (fast)
- Enhance only top 10 with Gemini (expensive)
- Re-sort after AI enhancement
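The steps above can be sketched as (hypothetical callables; the real code is in app/core/similarity.py):

```python
def two_pass_rank(candidates, algo_score, ai_enhance, top_n=10):
    """Pass 1: cheap algorithmic scores for every candidate.
    Pass 2: blend expensive AI scores into only the top `top_n`,
    then re-sort that head before returning the full ranking."""
    ranked = sorted(candidates, key=algo_score, reverse=True)
    head, tail = ranked[:top_n], ranked[top_n:]
    blended = {id(c): 0.8 * algo_score(c) + 0.2 * ai_enhance(c) for c in head}
    head.sort(key=lambda c: blended[id(c)], reverse=True)
    return head + tail
```

The tail keeps its algorithmic order, so AI cost stays constant regardless of how many candidates survive filtering.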
Key Technical Decisions
| Decision | Rationale |
|---|---|
| Layered architecture | Core logic testable without Streamlit |
| Callback pattern | Progress updates without st.progress() dependency |
| Dataclass results | Type safety, errors as data (not exceptions) |
| Filter before fetch | Save API quota by eliminating work early |
| Per-channel caching | Same channel in multiple searches shares cache |
| Soft penalties | “2024” might be noise or relevant, so don't hard-block |
Performance & Optimization
Caching Strategy
| Cache Type | TTL | Rationale |
|---|---|---|
| Search results | 3 days | Queries repeat; results stable |
| Channel stats | 7 days | Subscriber counts change slowly |
| Video details | 24 hours | Per-channel, not per-query |
Cache key normalization: search("manga, anime") and search("anime, manga") hit the same cache.
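A minimal sketch of that normalization (helper name hypothetical; the real key construction lives in app/cache/):

```python
def normalize_query(query: str) -> str:
    """Order-insensitive cache key: lowercase, trim, deduplicate,
    and sort the comma-separated terms."""
    terms = {t.strip().lower() for t in query.split(",") if t.strip()}
    return ",".join(sorted(terms))
```

Both `normalize_query("manga, anime")` and `normalize_query("anime, manga")` produce `"anime,manga"`, so they share one cache entry.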
Benchmarks (Streamlit Cloud, Jan 2026)
Keyword Search Mode
| Scenario | Time | Quota | Bottleneck |
|---|---|---|---|
| 1 term, cold, no AI | 9-10s | 400 units | Video details (84-87%) |
| 1 term, warm, no AI | <0.1s | 100 units | Relevance filtering |
| 1 term, warm, with AI | 17-19s | 100 units | AI relevance (92-94%) |
| 2 terms, warm, with AI | 20-25s | 200 units | AI relevance |
Seed-Based Search Mode
| Scenario | Time | Quota | Bottleneck |
|---|---|---|---|
| 1 term, cold, no AI | 12-15s | 450 units | Video details (70-75%) |
| 1 term, warm, with AI | 25-30s | 100 units | AI relevance (55-60%) |
| 2 terms, warm, with AI | 35-45s | 200 units | AI + similarity calc |
Cache benefit: 99% faster, 75% less quota on repeat searches.
Storage Behavior
| Cache Type | Local | Streamlit Cloud |
|---|---|---|
| @st.cache_data | Persists while running | Resets on restart |
| .feedback_data.json | Persists indefinitely | Resets on restart |
App restarts: idle timeout (~7 days), git push, platform maintenance.
Observability
Debug Panel tracks: API calls, quota units, timing per stage, cache hits (<50ms = hit).
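The sub-50ms cache-hit heuristic can be sketched as a timing wrapper (hypothetical helper; the real panel lives in app/debug_ui.py):

```python
import time

def timed_call(fn, *args, hit_threshold_ms=50):
    """Run `fn`, report elapsed milliseconds, and flag sub-threshold
    calls as cache hits, mirroring the debug panel's heuristic."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms, elapsed_ms < hit_threshold_ms
```

A real API round-trip takes hundreds of milliseconds, so anything returning in under 50ms almost certainly came from cache.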
Feedback System captures user satisfaction after each search:
- Inputs: Thumbs up/down with optional reason (poor fit, low quality, wrong topic)
- Data collected: Timestamp, search mode, query, top 5 results with scores, filter settings, AI enabled flag
- Seed mode extras: Full scoring component breakdown (tag, keyword, subscriber, engagement, frequency scores)
Analytics Pipeline enables ML-powered improvements:
- Logistic regression models with cross-validation
- Weight optimization based on feedback correlation
- Export to Microsoft Fabric/Power BI for dashboards
Feedback Loop Status: Data collection active, ML training implemented, automated weight updates on the roadmap (tied to a scoring-version signature).
Testing Strategy
367 total tests covering all core modules with mocked API clients.
| Test File | Tests | Focus |
|---|---|---|
| test_query_utils.py | 21 | URL parsing, validation |
| test_relevance.py | 13 | Scoring accuracy, edge cases |
| test_youtube_api.py | 29 | Search results, channel stats, error handling |
| test_gemini_api.py | 31 | AI scoring, summary generation, API failures |
| test_pipeline.py | 26 | Full flow, filters, early exits, callbacks |
| test_seed_topics.py | 46 | Topic extraction, language detection |
| test_similarity.py | 63 | Similarity scoring, callbacks, Gemini integration |
| test_analytics.py | 27 | ML training, weight optimization |
| test_feedback_tracker.py | 27 | Feedback persistence, export |
| test_quota_tracker.py | 42 | Quota calculations, persistence, tracking |
| test_scoring_version.py | 26 | Scoring weights, version management |
| test_performance.py | 16 | Performance benchmarks, timing |
Approach: Mock APIs, test edge cases, verify callbacks, validate ML pipelines.
Tech Stack
| Layer | Technology | Why |
|---|---|---|
| Language | Python 3.11 | Type hints, broad ecosystem |
| Framework | Streamlit 1.49 | Rapid prototyping, free hosting |
| APIs | YouTube v3, Gemini 2.0 Flash Lite | ToS-compliant, generous free tiers |
| Data | Pandas | Efficient filtering/grouping |
| Testing | pytest (367 tests) | Standard, good mocking, full coverage |
| ML | scikit-learn | Logistic regression, cross-validation |
Known Limitations
| Limitation | Impact | Potential Fix |
|---|---|---|
| YouTube quota (10K/day) | ~25 cold searches/day | BYOK option |
| Language (EN/ES only) | Quality degrades elsewhere | Add FR/DE/PT stopwords |
| Tag dependency | Tagless channels max 70/100 | Weight descriptions as fallback |
| Ephemeral storage | Feedback resets on restart | Migrate to cloud storage |