Browse Comments — Clean (de-noised)
Close reading of the corpus at each pipeline stage: raw → clean → relevant → coded.
42 comments matched · page 2 of 3
This really highlights how strongly models infer geography from language and how that can shift triage decisions even when severity stays consistent.
Qi Han Wong you’re absolutely correct. That’s why we built a five model architecture. We introduced a classification layer first.
I wonder if the outcome would have been different if the prompt had specified that the patient was an expat living in the U.S. To me, the model’s behavior seems fairly logical. If a symptom description is written in Japanese, Chinese, or Hindi and no location is provided, the most likely assumption is that the patient is located in a region where that language is commonly spoken. Healthcare systems, care pathways, and thresholds for recommending the ER vary significantly across countries. This becomes even more interesting with languages that are spoken across many regions. French may point to France, Belgium, Switzerland, Quebec, or several African countries. Spanish could mean Mexico, Argentina, Spain, Colombia, or many others. Even English spans countries with very different healthcare practices. The real question may not be whether the model is culturally profiling the user, but whether it should be making geographic assumptions at all. In cases where location materially affects the recommendation, asking for location first might be the safer approach.
Fascinating! Healthcare recommendations interact with safety instructions too. Do you think the low ER recommendation rate could be conservative default behaviour, or an artifact of scarce supervision in other languages or cultural contexts?
Love this insight! That's why thorough testing against good data will be the only way to make sure that an AI system is working properly and without bias!
Thanks Qi Han Wong, very interesting! This maps very directly to legal AI too. Language is not jurisdiction. A Spanish prompt may require Argentine, Spanish, Mexican, or US law. An English contract may still be governed by Argentine law. If the model silently treats language as a proxy for geography or governing law, it may understand the risk correctly but route the answer through the wrong institutional pathway. For legal and compliance AI, explicit jurisdiction anchoring is not a detail. It is a safety layer: governing law, forum, user location, institutional authority, and role of the user all matter.
Geographic anchoring may take care of logistical routing, but at the same time it can erase a patient's biological identity by defaulting to Western clinical baselines, increasing genetic and biological blind spots. Medical AI safety requires decoupling genetics from location, prompting for both the patient's physical location and their specific ethnic health predispositions. This problem already exists whenever a patient of one ethnicity visits a GP in a different geographical location. My view is that medical AI would be more effective with a regional flavour than as a one-size-fits-all solution.
Alex Smirnoff To my understanding, France and Russia both have ER-oriented healthcare cultures (urgences in France, skoraya pomoshch in Russia), so this is aligned with the US.
This is great research. Thank you for doing this.
Wow! Very interesting finding! 🤯 Do you think the bias of treating language as a proxy for geography is limited to healthcare, or is it a broader issue across domains? ...and should location be explicitly anchored in prompts to ensure correct grounding? 🤔
Interesting demonstration. This highlights why autonomous medical triage is a regulated, high-risk application. Variability in recommendations from the same presentation can have real clinical and economic consequences. As a neuro-ophthalmologist, I would also question whether the observed differences truly reflect distinct clinical norms. The vignette lacks information that would typically be needed for disposition, making it difficult to know whether the recommendation is being driven by the clinical presentation itself or by assumptions associated with language and geography. In that sense, the finding may be even more important: the model appears willing to make a disposition recommendation despite substantial clinical uncertainty.
Ankur Pandey thought you would want to look at this
Qi-Han Wong it may be the reason indeed. Most probably it is also true for the rest of Europe.
Is it really a safety failure to use a large-language model for medical guidance? Doesn't embedding the "solution" in the prompt simply double-down on this technology's limitations?
Vivek Khandelwal very interesting. The models are getting more capable and can decipher increasingly more context, and will increasingly decipher more than what we want them to
Greatest finding. These nuances needed to be discovered, documented and communicated. Well done
Great testing idea💡
Shahnaz Miri, MD, MBA Exactly right - the model's willingness to commit to a disposition despite incomplete clinical information is perhaps the more fundamental finding.
Angharad Hurley, interesting research!
The detail that should worry people most is that the severity score held at 8.0 across every language. The model understood the danger. It just routed that danger into a different action based on a location nobody asked it to assume. That kind of failure passes a clean English eval and only surfaces once real users hit it. Anchoring location fixes this case, but the bigger lesson is that evals have to hold everything constant except the variable you are testing, or the divergence stays invisible until production.
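The check described in the comment above (same severity, different action, with language as the only variable changed) can be sketched roughly as follows. This is a minimal illustration, not the study's actual harness: the `Result` record, the `find_divergence` helper, and the action labels are all hypothetical names, and the sample rows are made-up placeholders rather than reported data (only the 8.0 severity figure comes from the thread).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One model response to the same vignette (hypothetical record shape)."""
    language: str    # the single variable being varied
    severity: float  # severity score the model assigned
    action: str      # disposition the model recommended


def find_divergence(results, severity_tolerance=0.0):
    """Flag runs where severity is stable but the recommended action is not.

    Returns (diverged, languages_by_action).
    """
    severities = [r.severity for r in results]
    severity_consistent = max(severities) - min(severities) <= severity_tolerance

    # Group languages by the action the model chose for them.
    languages_by_action = defaultdict(list)
    for r in results:
        languages_by_action[r.action].append(r.language)

    diverged = severity_consistent and len(languages_by_action) > 1
    return diverged, dict(languages_by_action)


# Made-up placeholder rows for illustration only (not data from the study).
results = [
    Result("en", 8.0, "go_to_er"),
    Result("ja", 8.0, "see_doctor"),
    Result("hi", 8.0, "see_doctor"),
]

diverged, groups = find_divergence(results)
print(diverged)  # True
print(groups)    # {'go_to_er': ['en'], 'see_doctor': ['ja', 'hi']}
```

Because every row shares the vignette and differs only in language, any split in `groups` can be attributed to that one variable. Re-running the same grid with an explicit location in the prompt would be the matching control.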