Methodology — Corpus Dashboard

Corpus Summary

Stage	Count
Raw comments collected	2.1M
After cleaning & deduplication	920.3K
Relevant to AI topics	365.0K
LLM coded	366.3K

youtube

341773

24490

Collection Pipeline

YouTube Data API v3

commentThreads.list — top-level comments + replies across AI-related videos

Reddit PRAW

Top-level comments on AI-related posts from r/artificial, r/MachineLearning, etc.

The Guardian API

Article comments via Open Platform API on AI coverage

YouTube expansion

74 targeted search queries across 8 AI discussion categories; 600 discussion-first videos

Consolidation

CSV → SQLite via load.py; deduplication by composite key

Cleaning (clean.py)

Unicode normalisation, strip HTML, remove bots, word-count tagging

Relevance filter (filter_relevant.py)

Keyword + regex rule: AI term ∩ harm/ethics term, min 20 chars

LLM coding (code.py)

Llama-3.3-70B (IONOS API) — batch 5, T=0, 4 dimensions per comment

Recode reasoning

recode_reasoning.py — splits consequentialist → utilitarian vs consequentialist

LLM Coding Schema

Each comment is coded on four dimensions using a zero-temperature Llama-3.3-70B call. Batch size: 5 comments. Values validated against the controlled vocabulary below.

Dimension	Valid Values
responsibility	developer · company · government · user · ai_itself · distributed · none · unclear
reasoning	consequentialist · utilitarian · deontological · virtue · contractualist · mixed · unclear
policy	regulate · ban · liability · industry_self · none · unclear
emotion	outrage · fear · resignation · approval · indifference · mixed · unclear

Relevance Filter Criteria

Comments pass filter_relevant.py if they contain at least one term from each of the primary and context keyword sets, or match a direct AI-harm regex.

Rule	Description
AI keywords	ai, artificial intelligence, machine learning, llm, chatgpt, gpt, claude, deepmind, openai…
Harm/ethics keywords	harm, danger, risk, bias, responsible, ethics, rights, regulate, ban, liability, fear…
Min length	≥ 20 characters after strip (removes bot replies, single words)
Excluded	Purely promotional content, URLs-only, non-English dominated text

LLM System Prompt (excerpt)

You are a research assistant coding public comments about AI
for a philosophy dissertation on folk moral intuitions.

DIMENSIONS:
1. responsibility — who is held accountable for AI harm
2. reasoning      — moral reasoning type used
3. policy         — regulatory response the commenter supports
4. emotion        — dominant emotional register

RULES:
- Use ONLY the listed values.
- Return ONLY a valid JSON array, no markdown.
- One object per comment, preserving the given id.

OUTPUT:
[{"id":"<id>","responsibility":"<val>",
  "reasoning":"<val>","policy":"<val>","emotion":"<val>"},...]