Building a Multimodal Phishing Agent: Beyond Simple URL Scanning

Beginner-friendly guide to building a multimodal phishing agent that goes beyond simple URL scanning.

Posted Mar 1, 2026

By Sujay Sundar Raj

2 min read

Building a Multimodal Phishing Agent: Beyond Simple URL Scanning

🛡️ Building a Multimodal Phishing Agent: Beyond Simple URL Scanning

In the world of 2026 CyberScale threats, a simple “URL Reputation” check is a relic of the past. Sophisticated attackers now use cloaking to hide malicious payloads from automated scanners, while legitimate marketing emails often trigger false positives due to complex tracking redirects.

To solve this, I’ve built a Multimodal Forensic Engine, an “Expert-in-a-Box”, that combines visual evidence with technical header analysis using Llama 3.2-Vision.

1. The Stealth Detonator (The Eyes)

Before we can analyze a site, we have to see it without being caught. Attackers often serve a benign page if they detect a headless browser. We use playwright-stealth to mask our automation footprint and simulate real human interaction.

        
      
async def scan_url(url: str):
    # Wrap the entire browser context in a stealth layer
    async with Stealth().use_async(async_playwright()) as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
        )
        page = await context.new_page()
        
        # Navigate and capture the visual evidence
        await page.goto(url, wait_until="networkidle")
        
        # Human-like jitter to bypass basic anti-bot
        await page.mouse.move(random.randint(100, 700), random.randint(100, 700))
        
        screenshot_path = f"static/screenshots/scan_{uuid.uuid4().hex}.png"
        await page.screenshot(path=screenshot_path)
        
        return {"local_path": screenshot_path, "final_url": page.url}

2. The Forensic Ingestion (The Context)

Context is king. An email might look visually perfect, but a mismatch between the From address and the Return-Path is a massive red flag. Our parser extracts these technical “breadcrumbs” to feed the AI.

        
      
def extract_bundle(raw_eml: str):
    msg = email.message_from_string(raw_eml, policy=policy.default)
    headers = {
        "from": msg.get("From"),
        "return_path": msg.get("Return-Path"),
        "spf": msg.get("Authentication-Results", "None")
    }
    # Extracting the body and truncating to stay within AI context limits
    body = msg.get_body(preferencelist=('plain')).get_content()
    return {"headers": headers, "body": body[:1500]}

3. The Multimodal Brain (The Verdict)

The magic happens when we hand the Screenshot + Headers + Body to Llama 3.2-Vision. Instead of just “reading” the text, the AI “looks” at the branding and compares it to the technical data.

Because local models can be “chatty,” we implemented a Heuristic Recovery layer to ensure we always get a structured verdict, even if the AI decides to write a three-paragraph review.

        
      
# The 'Heuristic Recovery' logic for stubborn local models
# Prioritizes JSON, falls back to Regex keyword extraction
verdict_pattern = r"(?:Verdict|Answer|Conclusion):\s*\*?\*?(CLEAN|SUSPICIOUS|MALICIOUS)\*?\*?"
match = re.search(verdict_pattern, raw_ai_output, re.IGNORECASE)

if match:
    return {
        "verdict": match.group(1).upper(),
        "confidence": 85,
        "status": "success"
    }

Conclusion: Privacy-First Triage

By running this entire stack locally via Ollama and FastAPI, we’ve created a privacy-first security agent. It doesn’t leak sensitive corporate emails to third-party APIs, it’s cynical by design, and—most importantly—it sees the whole picture.

The result? A 100% automated triage loop that can distinguish between a complex IndiGo marketing mailer and a pixel-perfect credential harvester in under 90 seconds.

Projects, Security

This post is licensed under CC BY 4.0 by the author.