Abstract

Developing socially intelligent robots requires tight integration of robotics, computer vision, speech processing, and web technologies. We present the Socially-interactive Robot Software (SROS) platform, an open-source framework that addresses this need through a modular layered architecture. SROS connects the Robot Operating System (ROS) layer responsible for movement with web and Android interface layers using standard messaging and application programming interfaces (APIs). Specialized perceptual and interaction skills are implemented as reusable ROS services that can be deployed on any robot. This enables rapid prototyping of collaborative behaviors by synchronizing perception of the environment with physical actions. We experimentally validated core SROS technologies like computer vision, speech processing, and autocomplete speech using GPT2 - all implemented as plug-and-play ROS services. The abilities validated confirm SROS's effectiveness in developing socially interactive robots through synchronized cross-domain interaction. Demonstrations synchronizing multimodal behaviors on an example platform illustrate how SROS lowers barriers for researchers to advance state-of-the-art in customizable, adaptive human-robot systems through novel applications integrating perception and social skills.

Key Technologies

Computer Vision: Landmark detection using Media Pipe extracts 2D body/face key points from camera images in real-time. DeepFace performs facial analysis to classify emotion expressions based on detected landmarks.

Speech Processing: Audio feature extraction with Praat calculates MFCCs and prosodic metrics from microphone input. Google Cloud Speech API performs Persian speech recognition.

Dialogue & NL Understanding: Natural language parsing and GPT-2 based response generation enables basic conversational skills in Persian.

Mechanical system (Movement Module): Dynamixel controller and ROS-based commanding

Affective expression system (cognitive module): Desinging the interactive behavior, audio, and facial expressions

Android client: updating the robot's face on demand

user interface: monitoring and control across domains

II. SYSTEM ARCHITECTURE

A. Overview

We present a new modular and customizable software design for building social robot abilities using ROS. It has key layers to integrate advanced sensing, reasoning and interaction skills needed for natural human-robot partnerships. A remote graphic user interface allows personalized access from different devices. A central controller promotes useful robot behaviors by coordinating independent parts asynchronously. Processing tasks are split across ROS nodes to optimize efficiency. Core functions like multi-sensing, activity understanding, and emotion display are standardized as interchangeable services. This streamlines combining capabilities quickly. The design advocates distributed, synchronized real-time processing of streaming data across pipelines. Modularity permits custom interfaces and behavior tweaks. This adaptable framework is well-suited for next-gen robot assistants, friends and tutors with high-level sensing and socio-emotional abilities for meaningful social exchanges with people. The goal is to advance the field by empowering fast prototype tests and rigorous studies of novel applications. The synergistic approach aims to greatly speed progress.

B. System Design

The SROS architecture incorporates modular hardware and software components tailored for social robot capabilities. The overall system architecture consists of four main layers as shown in the figure below:

Graphical User Interface (GUI) Layer: Allows remote monitoring and control of the robot via a web interface.
Web Layer: Manages user requests, parses the GUI, coordinates services between GUI and other layers, and handles the database.
ROS Layer: Handles low-level robot I/O, sensor processing, and actuator control, while providing reusable perception, analysis, and generation capabilities.
Hardware Layer: Handles the robot’s hardware (Camera, mic, actuators, etc.)

B.1 Graphical User Interface Layer:

The GUI is a web-based control panel developed using HTML, CSS, and JavaScript. This control panel provides remote access to monitor and control the robot. It allows users to define temporary sequences of multimedia behaviors combining specific face and sound pairings. It connects to roscore through ROS Bridge and subscribes to needed topics on the page load. A live video feed from the robot’s camera can be viewed through image_raw and image_raw/landmarked topics with options to turn facial landmark analysis graphics overlay on or off. Virtual joystick controls enable teleoperating the robot’s movements by sending geometry_msgs/Twist messages on cmd_vel_wheel topic as shown in the figure below. Future work may include additional features to further enhance the remote control and visualization capabilities.

B.2. Web Layer:

The layered web services architecture plays an integral role in realizing the systematic vision behind SROS. By providing a stable and standardized integration layer based on RESTful APIs and asynchronous messaging, it facilitates tight yet modular coordination across robotics, perception, and user interfaces.

The web layer itself consists of four different layers:

Database
Django Web application
Python WSGI
Web Server

The Django framework acts as the coordinator, exposing APIs that allow different parts to interact independently. This lets each skill be developed separately and tested on its own.

Most importantly, this architecture supports SROS's synchronized dataflow approach by treating perception, actuation, and interface elements as interconnected yet specialized domains. Standardized communication protocols promote flexible yet tight collaboration both within and across these domains.

Gunicorn and Nginx help integrate the skills together in real-time. They handle incoming requests, route them to the right skills, and make sure everything runs smoothly and can scale up. This architecture supports synchronized data sharing. It treats sensing, moving, and interfaces as connected but separate areas. Common communication standards help them tightly collaborate both within and across areas. The layered web services stack is at the core. It brings robotics, AI and human interaction together through breakable yet unified modules.

The database provides a standard way to store information. It currently holds emotion profiles linking visuals and audio to feelings. This data helps sensing understand emotions. Profiles for each robot are also stored so interfaces can customize for its abilities. In the future, the database could define individual skill parameters. This would treat low-level components as independent services that higher levels combine as needed.

By providing a common interface, the database enhances standardized communication. It helps optimize complex behaviors through centralized access to specifications. This exemplifies coordinating usually separate research areas like sensing and movement control in a unified robot system.

B.3 ROS Layer:

The ROS layer executables follow the following structure pattern:

                  ROS Layer
                  ├── basic
                  │   ├── node1
                  │   ├── node2
                  │   └── ...
                  ├── middleware
                  │   ├── node3
                  │   ├── node4
                  │   └── ...
                  ├── services
                  │   ├── service1
                  │   ├── service2
                  │   └── ...
                  └── tests
                  ├── test1
                  ├── test2
                  └── ...

The system distributes computationally intensive tasks like computer vision and audio processing across separate ROS nodes, improving parallel efficiency. Nodes in the "basic" directory handle robot I/O inputs and initial processed data, that need to be published and accessed constantly for research purposes. The nodes in "middleware" manage to route heavier data processing pipelines. These nodes call the corresponding services in the "services" directory and publish the output to the defined topic. This design facilitates customization in two ways. First, users can redefine individual nodes to update algorithms or I/O. Second, launch file configuration determines node inclusion/exclusion in experiments.

Figure below illustrates the proposed ROS-based computational graph for real-time multi-task computer vision processing. At the start of the pipeline, raw color images and the camera information are published by the video stream node from an RGB camera. The images and the camera information are published to the image_raw and camera_info topics respectively for downstream consumption.

The core image processing is handled by the OpenCV client node, which bridges ROS and OpenCV. It transforms subscribed image and calibration data into an OpenCV image format before publishing preprocessed images to the image_cv2 topic. Here we must emphasize that since ROS does not support multidimensional array messages, OpenCV data such as the frame and landmarks are first transformed from Numpy arrays to lists of lists, and then published as custom messages. Other system limitations are also overcome following this methodology.

Two computer vision pipelines branch off from the image_cv2 topic. The FaceEmotionAnalysis node implements the DeepFace module to perform facial emotion classification. In parallel, the Landmark_detection node utilizes Media Pipe for facial landmark localization, publishing full-body pose data accurately as shown in the figure below

Detected landmarks then get utilized in other sensory data processing services such as gaze_detector service as shown in the figure below

The gaze pose estimator service as a downstream node of our example pipeline (e.g. the computer vision pipeline), is responsible only for the gaze detection functionalities and calculations. Its service estimates the head’s position (pitch and yaw values), and then projects the pupil location on the 3D plane of the face to estimate the gaze position. This methodology helps keep a wide variety of functionalities separate which helps significantly while expanding the architecture by adding new modules.

This modular architecture supports various extensions and applications. For example, face recognition can be added by calling relevant services or subscribing to the image_raw/landmarked topic1 to bypass early processing steps. Our goal is to provide a configurable toolkit that equips researchers with fundamental algorithms and interfaces through ROS, rather than an all-in-one solution. This is intended to streamline the prototyping of social robotics concepts by furnishing the necessary building blocks to pace development. The extensible design aims to facilitate expedited workflow in collaborative robotics research.

The remaining capabilities of the proposed system follow the same data processing protocol of vision. The proposed ROS-based functionalities include broadcasting camera image converted to OpenCV image (as custom-defined messages for lists of lists since ROS doesn’t yet support

Demonstration

We experimentally validate SROS's capabilities using both a physical social robot platform and a virtual machine. Landmark detection enables tracking of face/body poses. DeepFace accurately classifies facial expressions in real-time. Camera images are processed using Media Pipe for full-body pose data. Audio samples are processed using Praat and emotions are classified via a Random Forest model. Speech recognition is demonstrated via test utterances processed through the Google STT API. User input of simulated emotions triggers behaviors. The Android app receives video/audio streams over WiFi and outputs the status of the streaming action. Gaze estimation from landmark and pose data is also simulated.

Conclusion

SROS is a modular framework that enables the development of socially intelligent robots. It promotes modular reuse across different robot morphologies and supports additional packages without platform changes. Ongoing work involves expanding the toolkit of social competencies and exploring applications in human-robot collaboration.

Overall Architecture

Overall architecture of the proposed framework

Modular Customizable ROS-Based Framework for Rapid Development of Social Robots

to interact with children on the autism spectrum

Technical Flowcharts, Simulations, Specification

Abstract

Key Technologies

II. SYSTEM ARCHITECTURE

A. Overview

B. System Design

B.1 Graphical User Interface Layer:

B.2. Web Layer:

B.3 ROS Layer:

Demonstration

Conclusion

Overall Architecture

Interested in more details?

Modular Customizable ROS-Based Framework for Rapid Development of Social Robots

to interact with children on the autism spectrum

Technical Flowcharts, Simulations, Specification

Abstract

Key Technologies

Design V.3 of the Face

Singing with the music

Design V.2 of the Face

Design V.1 of the Face

II. SYSTEM ARCHITECTURE

A. Overview

B. System Design

B.1 Graphical User Interface Layer:

Showing the communication between the interface (right) and the simulated face (left)

B.2. Web Layer:

B.3 ROS Layer:

Demonstration

Conclusion

Overall Architecture

Interested in more details?