Researchers from the University of Zurich, the University of Amsterdam, Duke University, and New York University have demonstrated in a new study that modern artificial intelligence language models still cannot convincingly mimic the emotional expression of human communication. The research tested nine popular open-source large language models (LLMs), including Llama 3.1 variants and Mistral 7B, against social media posts from platforms such as X (formerly Twitter), Bluesky, and Reddit. Using newly developed classifier algorithms, the team distinguished AI-generated texts from human-written ones with 70–80% accuracy.
This research presents an updated version of the "Computational Turing Test," which employs automated linguistic analysis to detect subtle emotional and stylistic differences that betray the artificial origin of AI-generated texts. The investigators found that despite fine-tuning and prompt refinement, AI models consistently exhibited an overly polite, smooth, and less toxic tone compared to the more informal, sarcastic, and emotionally varied style typical of human interaction online. Attempts to increase realism by supplying user examples or contextual information only partially closed the gap, smoothing out differences in sentence length and structure but failing to replicate the nuanced emotional cues of human speech.
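The article does not detail the study's actual classifier, but the idea of flagging AI text by its "overly polite, smooth, and less toxic" style can be sketched with a toy linear classifier over crude stylistic features. The feature set, hand-set weights, and example posts below are all illustrative assumptions, not the paper's method.

```python
# Minimal sketch of a style-based "computational Turing test" classifier.
# Assumption: informal, emphatic, slang-heavy text signals a human author,
# while smooth, even prose signals a model. Weights are hand-set stand-ins
# for a trained classifier, chosen only to make the toy examples work.
import re

def stylistic_features(text):
    """Extract crude style signals from a social media post."""
    words = text.split()
    return [
        len(words),                                        # post length
        sum(len(w) for w in words) / max(len(words), 1),   # avg word length
        len(re.findall(r"[!?]{2,}|\.\.\.", text)),         # emphatic punctuation
        sum(w.isupper() and len(w) > 1 for w in words),    # all-caps "shouting"
        len(re.findall(r"\b(lol|wtf|omg|smh)\b", text.lower())),  # slang markers
    ]

def score(features, weights, bias):
    """Linear decision score: positive means 'predicted human'."""
    return sum(f * w for f, w in zip(features, weights)) + bias

WEIGHTS = [0.0, -0.5, 1.5, 1.0, 2.0]  # penalise smooth prose, reward informality
BIAS = 1.0

def predict(text):
    return "human" if score(stylistic_features(text), WEIGHTS, BIAS) > 0 else "ai"
```

A real system would learn such weights from labelled data over far richer features (toxicity scores, sentiment variance, emoji use), but the decision structure is the same.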
A surprising insight from the study was that instructional training, which aims to make models more helpful and aligned, actually reduced their ability to imitate genuine human emotional expression. For instance, baseline models like Llama 3.1 8B and Mistral 7B v0.1 outperformed their instruction-tuned counterparts in producing more "human-like" responses. Moreover, scaling up model size did not enhance human-likeness; the larger Llama 3.1 with 70 billion parameters was found less convincing than its 8 billion parameter sibling. Intriguingly, when AI texts tried harder to disguise themselves as human, their semantic similarity to real user posts diminished, making them paradoxically easier to identify as machine-generated.
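The trade-off described above, where a post rewritten to sound more human drifts away from the original meaning, can be illustrated with a simple similarity measure. Bag-of-words cosine similarity is a simplification; the embedding method the study actually used is not specified in this article.

```python
# Sketch of the semantic-similarity drop the study reports: a faithful
# paraphrase stays close to the real post, while a heavily "humanised"
# rewrite loses shared content. Word-count vectors are an illustrative
# stand-in for proper semantic embeddings.
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two texts via word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

real = "traffic downtown is a nightmare again today"
paraphrase = "downtown traffic is a nightmare today"     # close in meaning
disguised = "ugh commuting sucks lol"                    # human-styled, content lost
```

Here the style-disguised rewrite scores lower against the real post than the plain paraphrase does, mirroring the paradox the researchers observed.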
Platform-specific differences emerged as significant: AI-generated content was most convincing on X (Twitter), where detection accuracy was lowest, perhaps because the platform's style of communication is more formulaic. On Bluesky, AI performance was moderate, while on Reddit, where user communications are more varied and nuanced, AI texts stood out more distinctly. The researchers suggest these disparities may reflect differences in platform user behaviour and in the extent to which each platform's data was included in the LLMs' training sets.
Despite rapid advancements in generating grammatically correct and contextually relevant text, modern LLMs still struggle with replicating the spontaneous emotional expressiveness and ambiguity that characterises human communication. This "emotional smoothness" remains a key signature distinguishing AI outputs from genuine human text.
These findings contrast with some recent studies demonstrating that AI models like GPT-4 and GPT-4.5 can pass interactive Turing tests under certain controlled conditions, being judged human a significant portion of the time. For example, GPT-4.5 was judged human up to 73% of the time in some three-party Turing tests, and GPT-4 passed an interactive conversation-based Turing test 54% of the time. However, those assessments often focus on conversational fluency and language coherence rather than the nuanced emotional texture and social media-style authenticity explored in the present study. Additionally, research by AI21 Labs found that approximately one-third of participants could not differentiate human from AI conversational bots, reflecting the growing sophistication of AI dialogue.
Overall, while AI language models are increasingly effective at mimicking human language on a surface level, significant challenges remain in emulating the deeper emotional and affective dimensions of human communication. The researchers' "Computational Turing Test" underscores that AI's polite and "too nice" demeanour in text is an enduring indicator of its artificial nature, limiting its ability to fully pass as human in everyday social media interactions.
📌 Reference Map:
- [1] (Hi-Tech.ua) - Paragraphs 1, 2, 3, 4, 5, 6, 7
- [2] (arXiv:2511.04195) - Paragraphs 2, 3
- [3] (arXiv:2407.08853) - Paragraph 8
- [4] (arXiv:2405.08007) - Paragraph 8
- [5] (arXiv:2503.23674) - Paragraph 8
- [6] (Ars Technica) - Paragraph 2
- [7] (PR Newswire) - Paragraph 8
Source: Noah Wire Services