This paper introduces an approach for estimating evoked expression categories from videos with temporal position fusion. Models pre-trained on large-scale computer vision and audio datasets were used to extract deep representations for each timestamp in the video. A temporal convolutional network, rather than an RNN-like architecture, was applied to explore temporal relationships because of its advantages in memory consumption and parallelism. Furthermore, to address noisy labels, the temporal position was fused with the learned deep features so that the network can differentiate the time steps at which noisy labels were removed from the training set. This technique gives the system a considerable improvement over other methods. We conducted experiments on EEV, a large-scale dataset for evoked expression from videos, and achieved a state-of-the-art Pearson correlation coefficient of 0.054. Further experiments on a subset of the LIRIS-ACCEDE dataset (MediaEval 2018 benchmark) also demonstrated the effectiveness of our approach.
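For illustration, a minimal PyTorch-style sketch of the temporal position fusion idea described above: a normalized position channel is concatenated to the per-timestep features before a stack of dilated 1D convolutions. The module name, feature dimension, number of levels, and number of output categories are assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class TemporalPositionTCN(nn.Module):
        """Sketch: fuse a normalized temporal position with per-timestep features,
        then model time with dilated 1D convolutions (TCN-style)."""
        def __init__(self, feat_dim=2048, hidden=256, num_classes=15, levels=4):
            super().__init__()
            layers, in_ch = [], feat_dim + 1   # +1 for the temporal position channel
            for i in range(levels):
                layers += [nn.Conv1d(in_ch, hidden, kernel_size=3,
                                     padding=2 ** i, dilation=2 ** i),
                           nn.ReLU()]
                in_ch = hidden
            self.tcn = nn.Sequential(*layers)
            self.head = nn.Conv1d(hidden, num_classes, kernel_size=1)

        def forward(self, x):                  # x: (batch, time, feat_dim)
            b, t, _ = x.shape
            pos = torch.linspace(0, 1, t, device=x.device).view(1, t, 1).expand(b, t, 1)
            x = torch.cat([x, pos], dim=-1).transpose(1, 2)   # (batch, feat_dim+1, time)
            return self.head(self.tcn(x)).transpose(1, 2)     # (batch, time, num_classes)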
2021
IEEE Multimedia
End-to-End Learning for Multimodal Emotion Recognition in Video With Adaptive Loss
Huynh, Van-Thong,
Yang, Hyung-Jeong,
Lee, Guee-Sang,
and Kim, Soo-Hyung
This work presents an approach for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner, with three key points: 1) a lightweight feature extractor, 2) an attention strategy, and 3) an adaptive loss. We proposed a lightweight deep architecture of approximately 1 MB for feature extraction, the most crucial part of an emotion recognition system. Temporal relationships among features are explored with a temporal convolutional network instead of an RNN-based architecture to leverage parallelism and avoid the vanishing-gradient problem. The attention strategy is employed to adjust the temporal network's knowledge along the time dimension and to learn each modality's contribution to the final result. The interaction between the modalities is also investigated by training with an adaptive objective function that adjusts the network's gradients. Experimental results on a large-scale Korean emotion recognition dataset demonstrate the superiority of our method when employing the attention mechanism and the adaptive loss during training.
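As a rough illustration of the fusion and adaptive-loss ideas, the sketch below weights each modality's embedding with a learned attention score and applies an uncertainty-style adaptive weighting to per-modality losses. The dimensions, class count, and the specific adaptive-loss form are assumptions; the paper's exact loss may differ.

    import torch
    import torch.nn as nn

    class ModalityAttentionFusion(nn.Module):
        """Sketch: learn a weight per modality (e.g., visual/audio/language) and
        fuse the weighted embeddings before the final emotion classifier."""
        def __init__(self, dim=128, num_modalities=3, num_classes=7):
            super().__init__()
            self.score = nn.Linear(dim, 1)            # scores each modality embedding
            self.classifier = nn.Linear(dim, num_classes)
            # one learnable log-variance per modality for adaptive loss weighting
            self.log_vars = nn.Parameter(torch.zeros(num_modalities))

        def forward(self, embeddings):                # (batch, num_modalities, dim)
            weights = torch.softmax(self.score(embeddings), dim=1)   # (batch, M, 1)
            fused = (weights * embeddings).sum(dim=1)                # (batch, dim)
            return self.classifier(fused)

        def adaptive_loss(self, per_modality_losses):
            # weight each modality's loss by a learned uncertainty term
            losses = torch.stack(per_modality_losses)
            return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()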
2020
FG 2020
Multimodality Pain and related Behaviors Recognition based on Attention Learning
Huynh, Van-Thong,
Yang, Hyung-Jeong,
Lee, Guee-Sang,
and Kim, Soo-Hyung
In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)
Our work aimed to study facial data as well as movement data for recognition of pain and related behaviors in the context of everyday physical activities, which were provided as three tasks in the EmoPain 2020 challenge. For the facial data, we explored deep visual representations and geometric features, including head pose, facial landmarks, and action units, combined with fully connected layers to estimate pain. For the tasks with movement data, we employed long short-term memory layers to learn temporal information within each segment of 180 frames. We examined an attention mechanism to investigate the relationships among multiple sources and to aggregate their information. Experiments on the EmoPain dataset showed that our methods significantly outperformed the baseline results on the pain recognition tasks.
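A minimal sketch of the movement-data setup described above: an LSTM over 180-frame segments with attention pooling over time. The input dimension, hidden size, and number of classes are hypothetical placeholders.

    import torch
    import torch.nn as nn

    class MovementSegmentModel(nn.Module):
        """Sketch: LSTM over 180-frame movement segments with attention pooling
        across time steps, following the segment length given in the abstract."""
        def __init__(self, in_dim=66, hidden=128, num_classes=2):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
            self.attn = nn.Linear(hidden, 1)
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, x):                       # x: (batch, 180, in_dim)
            h, _ = self.lstm(x)                     # (batch, 180, hidden)
            w = torch.softmax(self.attn(h), dim=1)  # attention over time steps
            pooled = (w * h).sum(dim=1)             # (batch, hidden)
            return self.classifier(pooled)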
IEEE Access
Semantic Segmentation of the Eye With a Lightweight Deep Network and Shape Correction
Huynh, Van-Thong,
Yang, Hyung-Jeong,
Lee, Guee-Sang,
and Kim, Soo-Hyung
This paper presents a method to address the multi-class eye segmentation problem, an essential step for gaze tracking or for applying a biometric system in a virtual reality environment. Our system can run in resource-constrained environments, such as mobile and embedded devices, for real-time inference while still ensuring accuracy. To achieve those ends, we deployed the system in three major stages: obtain a grayscale image from the input, divide the image into three distinct eye regions with a deep network, and refine the results with image processing techniques. The deep network is built upon an encoder-decoder scheme with depthwise separable convolutions for low-resource systems. Image processing based on the geometric properties of the eye removes incorrect regions and corrects the shape of the eye. The experiments were conducted on OpenEDS, a large dataset of eye images captured with a head-mounted display with two synchronized eye-facing cameras. We achieved a mean intersection over union (mIoU) of 94.91% with a 0.4-megabyte model that takes 16.56 seconds to iterate over the test set of 1,440 images.
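The sketch below shows the kind of depthwise separable convolution block that keeps an encoder-decoder small enough for low-resource devices: a per-channel 3x3 depthwise convolution followed by a 1x1 pointwise convolution. The exact block composition in the paper may differ.

    import torch.nn as nn

    def depthwise_separable_conv(in_ch, out_ch, stride=1):
        """Sketch: depthwise 3x3 convolution followed by a pointwise 1x1
        convolution, a common way to shrink parameter count."""
        return nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                      padding=1, groups=in_ch, bias=False),       # depthwise
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # pointwise
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )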
2019
ICCVW 2019
Eye semantic segmentation with a lightweight model
Huynh, Van-Thong,
Kim, Soo-Hyung,
Lee, Guee-Sang,
and Yang, Hyung-Jeong
In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
This paper describes an approach for the engagement prediction task, a sub-challenge of the 7th Emotion Recognition in the Wild Challenge (EmotiW 2019). Our method involves three fundamental steps: feature extraction, regression, and model ensembling. In the first step, an input video is divided into multiple overlapping segments (instances) and features are extracted for each instance. Combinations of long short-term memory (LSTM) and fully connected layers are then deployed to capture temporal information and regress the engagement intensity from those features. In the last step, we performed fusion of the models to achieve better performance. Our approach achieved a mean square error of 0.0597, which is 4.63% lower than the best result from the previous year.
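A minimal sketch of the instance-based pipeline described above: per-frame features are split into overlapping segments, and an LSTM plus fully connected layers regresses an engagement intensity per instance. Segment length, stride, and layer sizes are assumptions; instance predictions would then be averaged and multiple models fused.

    import torch
    import torch.nn as nn

    def split_into_instances(features, seg_len=30, stride=15):
        """Sketch: divide per-frame features (time, dim) into overlapping
        segments (instances); segment length and stride are assumptions."""
        return [features[s:s + seg_len]
                for s in range(0, features.shape[0] - seg_len + 1, stride)]

    class EngagementRegressor(nn.Module):
        """Sketch: LSTM + fully connected layers regressing engagement per instance."""
        def __init__(self, in_dim=512, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
            self.fc = nn.Sequential(nn.Linear(hidden, 32), nn.ReLU(), nn.Linear(32, 1))

        def forward(self, x):                 # x: (batch, seg_len, in_dim)
            h, _ = self.lstm(x)
            return self.fc(h[:, -1])          # engagement intensity per instance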
ICMLSC 2019
Emotion recognition by integrating eye movement analysis and facial expression model
Huynh, Van-Thong,
Yang, Hyung-Jeong,
Lee, Guee-Sang,
Kim, Soo-Hyung,
and Na, In-Seop
In Proceedings of the 3rd International Conference on Machine Learning and Soft Computing