Methodology
Data collection pipeline, LLM coding schema, and corpus quality notes for the Folk AI Ethics study.
Corpus Summary
| Stage | Count |
|---|---|
| Raw comments collected | 2.1M |
| After cleaning & deduplication | 920.3K |
| Relevant to AI topics | 365.0K |
| LLM coded | 366.3K |
youtube
341773
reddit
24490
Collection Pipeline
1
YouTube Data API v3
commentThreads.list — top-level comments + replies across AI-related videos
2
Reddit PRAW
Top-level comments on AI-related posts from r/artificial, r/MachineLearning, etc.
3
The Guardian API
Article comments via Open Platform API on AI coverage
4
YouTube expansion
74 targeted search queries across 8 AI discussion categories; 600 discussion-first videos
5
Consolidation
CSV → SQLite via load.py; deduplication by composite key
6
Cleaning (clean.py)
Unicode normalisation, strip HTML, remove bots, word-count tagging
7
Relevance filter (filter_relevant.py)
Keyword + regex rule: AI term ∩ harm/ethics term, min 20 chars
8
LLM coding (code.py)
Llama-3.3-70B (IONOS API) — batch 5, T=0, 4 dimensions per comment
9
Recode reasoning
recode_reasoning.py — splits consequentialist → utilitarian vs consequentialist
LLM Coding Schema
Each comment is coded on four dimensions using a zero-temperature Llama-3.3-70B call. Batch size: 5 comments. Values validated against the controlled vocabulary below.
| Dimension | Valid Values |
|---|---|
| responsibility | developer · company · government · user · ai_itself · distributed · none · unclear |
| reasoning | consequentialist · utilitarian · deontological · virtue · contractualist · mixed · unclear |
| policy | regulate · ban · liability · industry_self · none · unclear |
| emotion | outrage · fear · resignation · approval · indifference · mixed · unclear |
Relevance Filter Criteria
Comments pass filter_relevant.py if they contain at least one term from
each of the primary and context keyword sets, or match a direct AI-harm regex.
| Rule | Description |
|---|---|
| AI keywords | ai, artificial intelligence, machine learning, llm, chatgpt, gpt, claude, deepmind, openai… |
| Harm/ethics keywords | harm, danger, risk, bias, responsible, ethics, rights, regulate, ban, liability, fear… |
| Min length | ≥ 20 characters after strip (removes bot replies, single words) |
| Excluded | Purely promotional content, URLs-only, non-English dominated text |
LLM System Prompt (excerpt)
You are a research assistant coding public comments about AI
for a philosophy dissertation on folk moral intuitions.
DIMENSIONS:
1. responsibility — who is held accountable for AI harm
2. reasoning — moral reasoning type used
3. policy — regulatory response the commenter supports
4. emotion — dominant emotional register
RULES:
- Use ONLY the listed values.
- Return ONLY a valid JSON array, no markdown.
- One object per comment, preserving the given id.
OUTPUT:
[{"id":"<id>","responsibility":"<val>",
"reasoning":"<val>","policy":"<val>","emotion":"<val>"},...]