Raw LLM Responses
Inspect the exact model output for any coded comment.
Look up by comment ID
Random samples
If this argument wins it will still lose. A temporary cottage industry of people…
ytc_UgwZ2amTO…
*there is NO real need for ai*
Citation needed? Citation needed for your follow…
ytr_UgwOJul1_…
Ok but sometimes Siri goes off when I’m playing with the dog… telling the dog “o…
ytc_Ugy1VE59H…
T9 anyone who is confused, AI doesn't typically take frames from videos, they ta…
ytc_UgzdJSMFf…
quarterly results are more important than long term planning and sustainability.…
ytc_UgyOOtqZ2…
It’s too expensive for the customer to be shipped in a self driving car. The pri…
ytc_UgwF6M5uu…
imo we really need to stay away from AI _confirming_ targets. AI _identifying_ p…
ytc_UgxbTIG_Y…
I have taken dozens of rides in Waymos all over the covered area in LA and I can…
ytc_UgwnCJe3z…
Comment
Fine then I'll talk.
1: The title has nothing to do with the paper. This is not a quote, doesn't take into account what the paper says about the various improvements of the model, etc.
2: The quote used isn't in full. To quote:
>Figure 4: Code generation. (a) Overall performance drifts. For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June. The drop was also large for GPT-3.5 (from 22.0% to 2.0%). GPT-4’s verbosity, measured by number of characters in the generations, also increased by 20%. (b) An example query and the corresponding responses. In March, both GPT-4 and GPT-3.5 followed the user instruction (“the code only”) and thus produced directly executable generation. **In June, however, they added extra triple quotes before and after the code snippet, rendering the code not executable.**
Which means that by the paper's own admission, the problem is not the code given but that their test doesn't work.
For the prime numbers, the problem was fixed in March, notably because their prompt didn't work, which means they didn't manage to test what they were trying to test. Quote:
> Figure 2: Solving math problems. (a): monitored accuracy, verbosity (unit: character), and answer overlap of GPT-4 and GPT-3.5 between March and June 2023. Overall, a large performance drifts existed for both services. (b) an example query and corresponding responses over time. GPT-4 followed the chain-of-thought instruction to obtain the right answer in March, but ignored it in June with the wrong answer. GPT-3.5 always followed the chain-of-thought, but it insisted on generating a wrong answer (\[No\]) first in March. This issue was largely fixed in June.
>
>\[...\] This interesting phenomenon indicates that the same prompting approach, even these widely adopted such as chain-of-thought, could lead to substantially different performance due to LLM drifts.
The "sensitive question
reddit
AI Harm Incident
1689753378.0
♥ 106
Coding Result
| Dimension | Value |
|---|---|
| Responsibility | none |
| Reasoning | deontological |
| Policy | none |
| Emotion | outrage |
| Coded at | 2026-04-25T08:33:43.502452 |
Raw LLM Response
[{"id":"rdc_jsm5wzy","responsibility":"company","reasoning":"deontological","policy":"none","emotion":"outrage"},
{"id":"rdc_jsl8ta1","responsibility":"company","reasoning":"consequentialist","policy":"none","emotion":"outrage"},
{"id":"rdc_jsl0p6a","responsibility":"none","reasoning":"consequentialist","policy":"none","emotion":"indifference"},
{"id":"rdc_jskabl2","responsibility":"none","reasoning":"consequentialist","policy":"none","emotion":"indifference"},
{"id":"rdc_jskaeh0","responsibility":"none","reasoning":"deontological","policy":"none","emotion":"outrage"}]