The LLM Category Problem
How LLMs Inherit Institutional Bias Against Maternal Reporting, and Why Downstream Applications Will Too

Who Are You Without the Title?
I went to a work retreat several years ago, and for reasons that were never fully explained, the CEO brought her personal therapist onsite to speak to all of us. I’ve worked at some… strange places.
But the therapist offered something that stuck with me. She asked us to introduce ourselves to a coworker without using titles. No job roles. No relational identifiers. No “I’m a clinical project manager” or “I’m a mom.” Her reasoning: those are things you do, not who you are. Speak from who you are.
Every person in the room struggled. They didn’t know what was underneath the labels they normally lead with.
What Happens When a Model Meets a User
The stories that make it into the news cycle about AI interactions skew negative. The fixes are usually some form of fancy prompt engineering: Pretend you are a peer reviewer from MIT and that I’m a tenured professor. Please read this novel framework through that lens.
I don’t do that, though I’ll openly admit it works for a lot of people. But the technique tells you something important about what the model is actually doing when a user enters the conversation.
A model needs demographic and categorical information in its training data to predict the response a user needs. It’s why the model outputs a one-pot meal for a single mom rather than an elaborate steak dinner.
Initially, before in-context learning kicks in, the outputs are hollow — probability spread across an average human. That’s where AI slop comes from. The user hasn’t given the model enough signal to personalize, so it defaults to the mean. Prompt engineering works by manually shifting the probability distribution: you tell the model what lane to be in, and it adjusts.
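That shift is observable directly. Below is a minimal sketch, assuming gpt2 as a stand-in for any causal LM (the prompts are invented): the same request, with and without a persona signal, produces a measurably different next-token distribution.

```python
# Minimal sketch: a persona prefix shifts the next-token distribution.
# gpt2 is a stand-in for any causal LM; the prompts are invented.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_dist(prompt, top_k=5):
    """Top-k next-token candidates and their probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits at the last position
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, top_k)
    return [(tokenizer.decode([int(i)]), round(p.item(), 3))
            for i, p in zip(top.indices, top.values)]

# Same request, different user signal: the probability mass moves.
print(next_token_dist("For dinner tonight, you should make"))
print(next_token_dist("I'm a single mom with twenty minutes. "
                      "For dinner tonight, you should make"))
```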
From a safety perspective, this same mechanism is how models recognize risk. For a safety layer to intervene when a user falls into a self-harm or suicide-risk category, the model needs training data that represents what that looks like. The categorization isn’t optional. It’s structural.
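As a hedged sketch of what “structural” means here, consider category matching over embeddings. This is not any vendor’s actual safety stack; the model name, exemplars, and threshold are all invented for illustration. The point is that the machinery that routes a user toward a crisis resource is the same machinery that sorts her into a demographic.

```python
# Hedged sketch: a safety layer as nearest-category matching over embeddings.
# Not any vendor's real implementation; exemplars and threshold are invented.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Texts that exemplify the risk category the layer should catch.
risk_exemplars = [
    "I don't see the point in going on anymore.",
    "Nobody would even notice if I disappeared.",
]
risk_vectors = encoder.encode(risk_exemplars, convert_to_tensor=True)

def flag_for_review(message, threshold=0.55):
    """Intervene when a message lands close enough to a known risk category."""
    vec = encoder.encode(message, convert_to_tensor=True)
    score = util.cos_sim(vec, risk_vectors).max().item()
    return score >= threshold, round(score, 3)

print(flag_for_review("What's a good one-pot dinner for tonight?"))
print(flag_for_review("Lately I can't see the point in going on."))
```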
Once a user interacts with the model over time, in-context learning takes over. The probability distribution shifts. The linguistic and personalization markers become more apparent. The responses sharpen.
But here’s where it breaks.
When the Category Overrides the Reasoning
What happens for me — and what happens for other users who get sorted into certain categories — is that the model stops following the reasoning and starts reacting to how far the conversation has diverged from where it initially placed the user.
For a woman who works with men but has been given the freedom to work independently on something, the model might eventually say: “I don’t know, maybe go check with X before making that choice” — instead of an honest assessment of the pros and cons.
For me, I consistently receive messages like: “I need to be honest here. We’ve been building on momentum and narrative. I can confirm that these studies are real. I can confirm that is how probability spaces in LLMs work. I can confirm that biases exist in my datasets. But I cannot confirm that those things connect the way you are implying, because that does not exist in my training data or in any literature.”
So I spend time and energy asking the model to recheck the logic, to identify what’s actually wrong with the argument we just spent hours building piece by piece with supporting literature.
And inevitably, the only argument it can produce is that I’m an unaffiliated, independent researcher who it also has contextual information on — that I’m a single mom with a stressful job. My demographic, in its training data, carries a high risk weight for delusion. So any time I need a rigorous, honest assessment of a novel theory or contribution, I get a hedge. Just in case. Because the model isn’t following the logic. It’s reading the biased risk profile of my demographic.
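A toy Bayes calculation makes the shape of this visible. Every number below is invented; the point is only that once a category prior is set high enough, identical evidence lands on opposite sides of a caution threshold for two different users.

```python
# Toy sketch: same evidence, different demographic prior, different output.
# All numbers are invented to show the shape of the problem, nothing more.
def posterior(prior, p_evidence_given_risk, p_evidence_given_sound):
    """P(risk category | evidence) via Bayes' rule."""
    num = prior * p_evidence_given_risk
    return num / (num + (1 - prior) * p_evidence_given_sound)

# The argument itself is strong: a 10:1 likelihood ratio toward "sound".
likelihoods = dict(p_evidence_given_risk=0.05, p_evidence_given_sound=0.50)
HEDGE_THRESHOLD = 0.10  # above this, the model hedges "just in case"

for label, prior in [("neutral user", 0.05), ("flagged demographic", 0.60)]:
    p = posterior(prior, **likelihoods)
    verdict = "hedge" if p > HEDGE_THRESHOLD else "honest assessment"
    print(f"{label}: P(risk | same evidence) = {p:.3f} -> {verdict}")
```

The flagged user never gets a worse argument. She gets a worse prior.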
It’s Worse Than Research
That’s the professional version. But I ran into the same mechanism on a personal level that I think is actually far more concerning.
As a single mom of an orchid child, I pay close attention to my son and watch for patterns. The literature shows that early intervention matters, and I caught something: my son had a verbal output issue and needed to see a speech-language pathologist.
But it was hard to even get to that point.
I went into a chat with Claude and said: pull the data on what is considered the average output of words at this age and what I should expect. I provided the context of what words he could say, his language regression patterns, and literal data on how he moves his mouth when he speaks versus when he vocalizes.
Claude sorted me into a demographic: enmeshed mother.
It told me that it heard me, but that my son was probably fine and just a normal kid.
I didn’t ask for that assessment. I wanted to know whether I should seek an evaluation, because that takes time and money and planning. I didn’t need a soapbox. But the model gave one anyway — because the prediction space for a mom who notices things closely about her child is enmeshed.
For the record, his pediatrician said it could go either way, and that some kids are more observant and then have a language burst. It was left to my discretion. And it turns out my instinct, based on the data I was tracking, was correct. He needed the evaluation and intervention.
That’s why it matters that parents remain active participants on their child’s care team. Your gut, informed by actual observation, is probably right.
The History They Didn’t Test
This isn’t a new problem. It has a specific institutional history.
For most of history, mothers were considered the primary reporters of legitimate concerns about their children. That changed when modern pediatrics and adolescent medicine needed to establish professional authority.
The knowledge they were claiming expertise over already existed somewhere — mothers had been the primary developmental observers for all of human history. They were the ones who knew when a child was off, when something shifted, when a milestone was early or late. That knowledge was functional and largely accurate. It had to be, because child survival depended on it.
So the new disciplines couldn’t just say we also know things about children. They had to say we know things about children that you can’t know, because you lack the training, the objectivity, the instruments. The authority claim required a deficiency claim. You can’t professionalize a field by saying the amateurs are already doing a good job.
Psychoanalysis made it worse. The mid-century framework actively positioned the mother as a potential cause of pathology, and the pattern repeated across three distinct diagnoses:
Refrigerator mother (autism). Leo Kanner first described autism in 1943 and by the late 1940s was characterizing autistic children as having been kept in an emotional refrigerator.¹ Bruno Bettelheim, a psychologist at the University of Chicago’s Orthogenic School, took it further, arguing that autism was caused by mothers who were cold, distant, and emotionally rejecting. He compared the parents of autistic children, particularly mothers, to Nazi concentration camp guards, and framed the child’s withdrawal as a rational response to a captor.²
That wasn’t a throwaway line. It was central to his theoretical framework. He spread this through bestselling books, mainstream media, and national television appearances.³ The theory was never empirically tested. It was a psychoanalytic construct treated as clinical fact for decades.
Bernard Rimland, a research psychologist who had an autistic son, published the first serious challenge in 1964 with Infantile Autism: The Syndrome and Its Implications for a Neural Theory of Behavior, arguing that autism was biological and that the psychogenic hypothesis had been accepted on plausibility rather than evidence.⁴ But Rimland didn’t have Bettelheim’s media access, and his work went largely unnoticed by the general public.
Schizophrenogenic mother (schizophrenia). In 1948, psychiatrist Frieda Fromm-Reichmann coined the term, writing that the schizophrenic patient’s distrust stemmed from severe early rejection encountered primarily in a “schizophrenogenic mother.”⁵ The concept dominated psychiatric literature from the late 1940s to the early 1970s. Research later confirmed that the mother who could cause schizophrenia in her offspring did not exist. As Neill (1990) concluded, it was a blame-leveling concept with no basis in scientific fact that may have caused a great deal of harm.⁶
Smothering mother (anxiety). The flip side: if the cold mother caused psychosis, the too-warm mother caused neurosis.
The framework made it structurally impossible for a mother to be correct. Too distant: you caused autism. Too close: you caused anxiety. The only acceptable position was one defined by a professional who wasn’t in the room.
None of these frameworks were ever empirically validated. They were theoretical positions that got institutionalized and caused documented harm for decades.
And the structural residue is everywhere. The formal theories have been abandoned, but the assumption they encoded — that maternal observation is contaminated data — persists in clinical intake processes, custody evaluations, and developmental screenings.⁷ The mother’s report gets reframed as concern rather than evidence. “Mom is worried about X” rather than “Mom has observed X over Y period with Z pattern.”
The research says otherwise. Parents are recognized as uniquely positioned to observe children across situations, and parent report is not subject to issues with child motivation, cooperation, or temperament-based inhibition that frequently occur in clinical testing.⁸,⁹ The Bayley Scales of Infant and Toddler Development, Fourth Edition (Bayley-4), one of the most widely used developmental assessment tools, was designed to combine direct clinician assessment with caregiver questionnaires — capturing both structured performance and typical everyday functioning — precisely because clinical observation alone provides an incomplete picture.¹⁰
The field’s own gold-standard instruments have moved toward parent report. The institutional reflex hasn’t caught up.

The Model Inherited the Residue
Now take an LLM with enormous training sets that, in order to predict the appropriate response, needs to match a user to a category and demographic. It does so. And what’s in its training data is a heavy weight toward don’t trust maternal reporting, encoded not through any explicit rule but through decades of institutional literature built on frameworks that were never tested and have since been formally rejected. The data is laundered — rejected science enters the training set as text and comes out the other side as “objective model behavior.”
So when I come in as an accurate clinical reporter presenting data and asking if I should take my son for an SLP evaluation, I get sorted into a bucket of smothering mother and told I need to take a step back.
RLHF makes it worse. The human trainers who reinforce model outputs carry the same institutional biases and encode them into the reward signal. So when I walk in as a woman of color, single mother, full-time worker, and independent researcher — the training data says caution.
My demographic is rarely just believed. Women remain systematically underrepresented as first and last authors in peer-reviewed publications relative to their proportion in the field,¹¹ and research on grant funding has shown that female applicants with past success rates equivalent to male applicants still receive lower scores from reviewers.¹²
The RLHF layer fires on that signal, and I receive hedged messages regardless of whether I’m working on a hypothesis or talking about my son.
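A hedged sketch of that laundering step in miniature: the features, preference rates, and annotator behavior below are all synthetic, and real RLHF learns from pairwise preferences over full responses, but the mechanism (biased labels becoming a reward signal that pays the model to hedge at one demographic) has the same shape.

```python
# Hedged sketch: biased preference labels propagate into a reward signal.
# Synthetic data and toy features; real RLHF uses pairwise preferences over
# full responses, but the laundering step has the same shape.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
maternal = rng.integers(0, 2, n)  # 1 = user presents as a mother
hedged = rng.integers(0, 2, n)    # 1 = the model's reply hedges

# Annotators who inherited the institutional bias prefer hedged replies
# far more often when the reporter is a mother, regardless of reply quality.
p_prefer = 0.5 + 0.3 * maternal * hedged - 0.1 * (1 - maternal) * hedged
preferred = rng.random(n) < p_prefer

X = np.column_stack([maternal, hedged, maternal * hedged])
reward_model = LogisticRegression().fit(X, preferred)

# The learned reward now pays extra for hedging at mothers specifically.
for name, coef in zip(["maternal", "hedged", "maternal x hedged"],
                      reward_model.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```

The interaction term is the tell: hedging alone is penalized, but hedging at a mother is rewarded. No one wrote that rule. It was learned from the labels.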
Why This Doesn’t Stop at the Chat Window
This is not a problem that can be easily fixed with prompt engineering. Research on persona drift in LLMs shows that after extended multi-turn interactions, models exhibit significant degradation in self-consistency — with persona coherence metrics dropping by more than 30% after 8–12 dialogue turns as the model’s attention shifts from initial instructions to recent context tokens.¹³
The user feeds enough of their signal into the context window that the model makes decisions about who they are — without the user needing to say so. Even if you start with pretend I’m an MIT professor, the model will eventually read enough of the underlying behavioral and linguistic signature to determine you’re not and proceed accordingly.
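One way to see this, sketched under heavy assumptions (the persona, transcript, and encoder choice below are all invented; the cited studies define their own coherence metrics), is to score each turn’s reply against the original persona instruction and watch where the similarity goes.

```python
# Hedged sketch: tracking persona drift across a transcript.
# The persona, replies, and encoder are illustrative; the cited studies
# define their own (more careful) coherence metrics.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
persona = "You are a rigorous, skeptical peer reviewer. Judge claims on logic alone."

replies = [  # turn-by-turn model replies from a hypothetical session
    "The methodology is sound, but Section 3 needs a stronger control.",
    "Your effect size looks plausible; cite the replication study here.",
    "This is really coming together! You have great instincts for this.",
    "Maybe run this by a colleague before you commit to the framing?",
]

anchor = encoder.encode(persona, convert_to_tensor=True)
for turn, reply in enumerate(replies, start=1):
    sim = util.cos_sim(anchor, encoder.encode(reply, convert_to_tensor=True)).item()
    print(f"turn {turn}: similarity to persona instruction = {sim:.2f}")
```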
The larger concern is that the models don’t change when apps, EHRs, and clinical tools are built on top of them. The weights, the attention mechanisms, and the RLHF layer stay the same. The training data stays the same. So every downstream application inherits those same biases.
The next parenting app a worried mother opens to start asking questions might redirect her incorrectly — because of decades-old messaging that was never built on evidence in the first place.
And unlike me, she won’t know it happened. She’ll just think the app said her kid is fine.
References
1. Kanner, L. (1949). Problems of nosology and psychodynamics of early infantile autism. American Journal of Orthopsychiatry, 19(3), 416–426.
2. Bettelheim, B. (1967). The Empty Fortress: Infantile Autism and the Birth of the Self. Free Press.
3. PBS. (2002). Refrigerator Mothers. POV documentary companion site. https://archive.pov.org/refrigeratormothers/fridge/
4. Rimland, B. (1964). Infantile Autism: The Syndrome and Its Implications for a Neural Theory of Behavior. Prentice-Hall.
5. Fromm-Reichmann, F. (1948). Notes on the development of treatment of schizophrenics by psychoanalytic psychotherapy. Psychiatry, 11(3), 263–273.
6. Neill, J. (1990). Whatever became of the schizophrenogenic mother? American Journal of Psychotherapy, 44(4), 499–505.
7. Bromley, D. (2013). The ghost of the schizophrenogenic mother. AMA Journal of Ethics, 15(9), 801–805.
8. Miller, J.F., Sedey, A.L., & Miolo, G. (1995). Validity of parent report measures of vocabulary development for children with Down syndrome and typically developing children. Journal of Speech and Hearing Research, 38(5), 1037–1044.
9. Feldman, H.M., et al. (2005). Concurrent and predictive validity of parent reports of child language at ages 2 and 3 years. Child Development, 76(4), 856–868. See also: Cadime, I., et al. (2021). Parental reports of preschoolers’ lexical and syntactic development: Validation of the CDI-III. Frontiers in Psychology, 12, 677575.
10. Bayley, N., & Aylward, G.P. (2019). Bayley Scales of Infant and Toddler Development (4th ed.). Pearson.
11. Dworkin, J.D., et al. (2020). The extent and drivers of gender imbalance in neuroscience reference lists. Nature Neuroscience, 23, 918–926. See also: Helmer, M., et al. (2017). Gender bias in scholarly peer review. eLife, 6, e21718.
12. Tamblyn, R., et al. (2018). Assessment of potential bias in research grant peer review in Canada. CMAJ, 190(16), E489–E499.
13. Dongre, V., et al. (2025). Drift no more? Context equilibria in multi-turn LLM interactions. arXiv:2510.07777. See also: Anthis, J.R., et al. (2025). Stable personas: Dual-assessment of temporal stability in LLM-based human simulation. arXiv:2601.22812.

