Methodology

Data collection pipeline, LLM coding schema, and corpus quality notes for the Folk AI Ethics study.

Corpus Summary
Stage                             Count
Raw comments collected            2.1M
After cleaning & deduplication    920.3K
Relevant to AI topics             365.0K
LLM coded                         366.3K
Platform    Comments
youtube     341773
reddit      24490
Collection Pipeline
1. YouTube Data API v3: commentThreads.list, top-level comments + replies across AI-related videos
2. Reddit PRAW: top-level comments on AI-related posts from r/artificial, r/MachineLearning, etc.
3. The Guardian API: article comments via Open Platform API on AI coverage
4. YouTube expansion: 74 targeted search queries across 8 AI discussion categories; 600 discussion-first videos
5. Consolidation: CSV → SQLite via load.py; deduplication by composite key
6. Cleaning (clean.py): Unicode normalisation, strip HTML, remove bots, word-count tagging
7. Relevance filter (filter_relevant.py): keyword + regex rule, AI term ∩ harm/ethics term, min 20 chars
8. LLM coding (code.py): Llama-3.3-70B (IONOS API), batch size 5, T=0, 4 dimensions per comment
9. Recode reasoning (recode_reasoning.py): splits consequentialist into utilitarian vs consequentialist
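
Steps 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not the contents of load.py or clean.py: the table layout, the composite key (platform, comment_id), and the helper names clean_text and load_rows are all assumptions. Bot removal is left out because the detection rule is not described here.

```python
import html
import re
import sqlite3
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML tag matcher (assumption)
WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalise Unicode, strip HTML tags and entities, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()


def load_rows(conn: sqlite3.Connection, rows) -> int:
    """Insert (platform, comment_id, raw_text) rows, skipping duplicate keys.

    The composite primary key is a hypothetical choice; the real key
    used by load.py is not specified.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "platform TEXT NOT NULL, comment_id TEXT NOT NULL, "
        "text TEXT NOT NULL, word_count INTEGER NOT NULL, "
        "PRIMARY KEY (platform, comment_id))"
    )
    for platform, comment_id, raw in rows:
        text = clean_text(raw)
        # INSERT OR IGNORE silently drops rows whose key already exists.
        conn.execute(
            "INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?)",
            (platform, comment_id, text, len(text.split())),
        )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]


conn = sqlite3.connect(":memory:")
sample_rows = [
    ("youtube", "a1", "<b>AI</b> is&nbsp;scary"),
    ("youtube", "a1", "<b>AI</b> is&nbsp;scary"),  # same key, ignored
    ("reddit", "a1", "Regulate it"),               # same id, other platform
]
n_stored = load_rows(conn, sample_rows)
```

Keying on the platform as well as the comment id keeps identical ids from different sources apart, which is one plausible reading of "composite key".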
LLM Coding Schema

Each comment is coded on four dimensions by Llama-3.3-70B at temperature 0, with 5 comments sent per call. Returned values are validated against the controlled vocabulary below.

Dimension         Valid values
responsibility    developer · company · government · user · ai_itself · distributed · none · unclear
reasoning         consequentialist · utilitarian · deontological · virtue · contractualist · mixed · unclear
policy            regulate · ban · liability · industry_self · none · unclear
emotion           outrage · fear · resignation · approval · indifference · mixed · unclear
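
Validation against this vocabulary might look like the sketch below. The value sets are copied from the table; the function name validate_record and the choice to return a list of problems are assumptions, since the handling of invalid values in code.py is not described.

```python
VOCAB = {
    "responsibility": {"developer", "company", "government", "user",
                       "ai_itself", "distributed", "none", "unclear"},
    "reasoning": {"consequentialist", "utilitarian", "deontological",
                  "virtue", "contractualist", "mixed", "unclear"},
    "policy": {"regulate", "ban", "liability", "industry_self",
               "none", "unclear"},
    "emotion": {"outrage", "fear", "resignation", "approval",
                "indifference", "mixed", "unclear"},
}


def validate_record(record: dict) -> list:
    """Return one message per dimension that is missing or off-vocabulary.

    An empty list means the record is valid. (Hypothetical helper.)
    """
    problems = []
    for dimension, allowed in VOCAB.items():
        value = record.get(dimension)
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{dimension}={value!r}")
    return problems


good = {"id": "c1", "responsibility": "company", "reasoning": "utilitarian",
        "policy": "regulate", "emotion": "fear"}
bad = {"id": "c2", "responsibility": "company", "reasoning": "utilitarian",
       "policy": "tax", "emotion": "fear"}
good_problems = validate_record(good)
bad_problems = validate_record(bad)
```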
Relevance Filter Criteria

Comments pass filter_relevant.py if they contain at least one term from the primary (AI) keyword set and at least one term from the context (harm/ethics) keyword set, or if they match a direct AI-harm regex.

Rule                    Description
AI keywords             ai, artificial intelligence, machine learning, llm, chatgpt, gpt, claude, deepmind, openai…
Harm/ethics keywords    harm, danger, risk, bias, responsible, ethics, rights, regulate, ban, liability, fear…
Min length              ≥ 20 characters after strip (removes bot replies, single words)
Excluded                Purely promotional content, URLs-only, non-English dominated text
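
The keyword rule can be sketched as below, under several assumptions. Only the terms listed above are used (the full lists are truncated in the table), matching is whole-word and case-insensitive, the length check is applied before either rule, and DIRECT_RE is a hypothetical stand-in for the direct AI-harm regex, whose real pattern is not given. The exclusion rules are not modelled.

```python
import re

# Partial lists: only the terms shown in the table above.
AI_TERMS = ["ai", "artificial intelligence", "machine learning", "llm",
            "chatgpt", "gpt", "claude", "deepmind", "openai"]
HARM_TERMS = ["harm", "danger", "risk", "bias", "responsible", "ethics",
              "rights", "regulate", "ban", "liability", "fear"]

MIN_CHARS = 20


def _any_term(terms):
    """Compile a whole-word, case-insensitive alternation (assumption)."""
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


AI_RE = _any_term(AI_TERMS)
HARM_RE = _any_term(HARM_TERMS)

# Hypothetical placeholder, not the pattern used in filter_relevant.py.
DIRECT_RE = re.compile(
    r"\bAI\s+(?:will|could|might)\s+(?:kill|destroy|replace)\b",
    re.IGNORECASE,
)


def is_relevant(comment: str) -> bool:
    """AI term AND harm/ethics term, or a direct AI-harm match."""
    text = comment.strip()
    if len(text) < MIN_CHARS:
        return False
    if DIRECT_RE.search(text):
        return True
    return bool(AI_RE.search(text)) and bool(HARM_RE.search(text))
```

Whole-word matching keeps short terms such as "ai" and "ban" from firing inside words like "said" or "urban", at the cost of missing inflected forms such as "dangerous".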
LLM System Prompt (excerpt)
You are a research assistant coding public comments about AI for a philosophy dissertation on folk moral intuitions.

DIMENSIONS:
1. responsibility — who is held accountable for AI harm
2. reasoning — moral reasoning type used
3. policy — regulatory response the commenter supports
4. emotion — dominant emotional register

RULES:
- Use ONLY the listed values.
- Return ONLY a valid JSON array, no markdown.
- One object per comment, preserving the given id.

OUTPUT:
[{"id":"<id>","responsibility":"<val>", "reasoning":"<val>","policy":"<val>","emotion":"<val>"},...]
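
Given the output format in the prompt, batching and response handling could be sketched as follows. The API call itself is omitted. The helper names batches and parse_response are assumptions, as is the decision to reject a response whose ids do not match the batch; code.py may handle such cases differently.

```python
import json

DIMENSIONS = ("responsibility", "reasoning", "policy", "emotion")


def batches(items, size=5):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_response(raw: str, expected_ids) -> dict:
    """Parse the model's JSON array into {id: {dimension: value}}.

    Raises ValueError if the payload is not an array of objects, if a
    dimension is missing, or if the ids differ from the batch sent.
    """
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    by_id = {}
    for obj in data:
        if not isinstance(obj, dict) or "id" not in obj:
            raise ValueError("each element must be an object with an id")
        missing = [d for d in DIMENSIONS if d not in obj]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        by_id[str(obj["id"])] = {d: obj[d] for d in DIMENSIONS}
    if set(by_id) != {str(i) for i in expected_ids}:
        raise ValueError("ids in response do not match the batch")
    return by_id


# Illustrative response text, not real model output.
sample = (
    '[{"id":"c1","responsibility":"company","reasoning":"utilitarian",'
    '"policy":"regulate","emotion":"fear"}]'
)
coded = parse_response(sample, ["c1"])
batch_sizes = [len(b) for b in batches(list(range(12)))]
```

Checking ids against the batch catches responses that drop or merge comments, which matters when several comments share one call.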