Johns Hopkins Study Highlights AI's Limitations in Social Scene

AI Still Struggles with Human Social Understanding, Johns Hopkins Study Finds
Published: April 24, 2025

Despite remarkable progress in object recognition and language understanding, artificial intelligence models still fall short when it comes to interpreting human social interactions in dynamic scenes, according to a new study by Johns Hopkins University researchers.

“AI systems struggle to determine basic human cues such as identifying who is speaking, predicting when a person will cross the street, or recognizing simple conversational engagement,” reports ScienceDaily.

These findings, also highlighted in INDIA New England News, cast doubt on the readiness of AI for autonomous roles that demand real-time social awareness, such as self-driving vehicles or assistive robots.

The Study

The investigation was led by Dr. Leyla Isik, a cognitive scientist at Johns Hopkins, along with Kathy Garcia, a doctoral student in her lab. The research was presented at the International Conference on Learning Representations (ICLR) in Singapore, underlining its importance to the machine learning community (The Hub, ScienceDaily).

The researchers argue that the social-understanding gap may arise from current AI architectures, which are modeled after neural networks optimized for static images rather than contextual, real-world interactions (The Hub, ScienceDaily).

Methodology

More than 150 human participants were shown three-second video clips of various social scenarios (people interacting, performing parallel activities, or acting alone) and were asked to rate key social features on a five-point scale (INDIA New England News, ScienceDaily).

Researchers then tested 350+ AI models from across language, video, and image domains. These models were tasked with predicting:

Human ratings for each clip
Neural response patterns based on text descriptions

The models included:

Large language models (text-based)
Video models (motion sequences)
Image models (still frames)
(The Hub, ScienceDaily, INDIA New England News)
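To make the comparison between model predictions and human ratings concrete, here is a minimal sketch of one way such a comparison could be scored. It assumes the score is a Pearson correlation between a model's per-clip prediction and the mean human rating. The article does not state which metric the researchers used, and the clip names, ratings, and function names below are made up for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def mean_rating_per_clip(ratings_by_clip):
    """Average the 1-5 ratings from all participants for each clip."""
    return {clip: mean(ratings) for clip, ratings in ratings_by_clip.items()}


def score_model(model_predictions, ratings_by_clip):
    """Correlate a model's per-clip predictions with the mean human ratings."""
    human = mean_rating_per_clip(ratings_by_clip)
    clips = sorted(human)  # fixed order so both lists line up clip by clip
    return pearson([model_predictions[c] for c in clips],
                   [human[c] for c in clips])


# Hypothetical data: three clips, three raters each (assumption, not study data).
human_ratings = {
    "clip_a": [5, 4, 5],  # e.g. two people in conversation
    "clip_b": [1, 2, 1],  # e.g. a person acting alone
    "clip_c": [3, 3, 4],  # e.g. people doing parallel activities
}
aligned_model = {"clip_a": 4.5, "clip_b": 1.5, "clip_c": 3.0}
misaligned_model = {"clip_a": 2.0, "clip_b": 4.0, "clip_c": 3.0}

print(round(score_model(aligned_model, human_ratings), 2))
print(round(score_model(misaligned_model, human_ratings), 2))
```

A score near 1 would mean the model ranks clips much as people do, while a score near 0 or below would mean it does not. Under this assumed metric, the study's finding corresponds to models scoring well below the agreement level that humans reach with each other.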

Key Findings

Human participants showed high consistency in their understanding of the scenes, indicating a shared sense of social dynamics. AI models, however, failed to match these judgments, regardless of size or training (ScienceDaily, INDIA New England News).

Video models performed poorly at identifying interactions, frequently misclassifying conversational engagement as independent activity.
Image models often failed at even simple scene understanding when only still frames were provided.
Large language models fared slightly better in predicting human judgments of behavior.
Video models showed some capability in predicting neural activity, suggesting complementary strengths across modalities.
(The Hub, ScienceDaily)
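The consistency among human raters noted above can be illustrated with a simple spread measure. The sketch below assumes agreement is summarized as the average per-clip standard deviation of the five-point ratings, where a lower value means closer agreement. This is an illustrative choice, not the measure reported in the study, and all data and names are hypothetical.

```python
from statistics import mean, pstdev


def rating_spread(ratings_by_clip):
    """Population standard deviation of the 1-5 ratings for each clip."""
    return {clip: pstdev(ratings) for clip, ratings in ratings_by_clip.items()}


def mean_spread(ratings_by_clip):
    """Average spread across clips; lower means raters agree more closely."""
    return mean(rating_spread(ratings_by_clip).values())


# Hypothetical ratings from four participants per clip (not study data).
consistent_raters = {"clip_a": [5, 5, 4, 5], "clip_b": [1, 1, 2, 1]}
divided_raters = {"clip_a": [1, 5, 2, 4], "clip_b": [5, 1, 3, 1]}

print(mean_spread(consistent_raters) < mean_spread(divided_raters))
```

A small average spread, as in the first hypothetical group, is the kind of pattern that would support the claim that people share a common reading of a scene.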

Implications for Real-World AI

These limitations raise concerns about deploying AI in environments where social-cue interpretation is critical:

Self-driving cars need to interpret pedestrian behavior accurately (Bloomberg Law).
Assistive robots in healthcare and hospitality must read body language and conversational tone (INDIA New England News).
On manufacturing floors, robots must anticipate human movements to ensure team coordination and safety (Bloomberg Law).

Charting the Path Forward

Experts argue that future AI systems must:

Move beyond static-image-based architectures
Integrate motion-sensitive processing modules
Leverage context-aware memory
Adopt multimodal fusion strategies
(The Hub, ScienceDaily)

Additionally, AI training datasets need to include more richly annotated, dynamic social interactions to help models generalize better in real-world applications (INDIA New England News, ScienceDaily).

Conclusion

While AI continues to excel in object recognition and language processing, this study highlights a critical human capability AI lacks: the ability to interpret social context in motion.

“Good news: the robots aren’t taking over just yet.” (INDIA New England News)

For now, human judgment remains indispensable in fields requiring nuanced social perception.
