This paper introduces an approach for estimating evoked expression categories from videos with temporal position fusion. Models pre-trained on large-scale computer vision and audio datasets were used to extract deep representations for timestamps in the video. A temporal convolutional network, rather than an RNN-like architecture, was applied to explore temporal relationships, owing to its advantages in memory consumption and parallelism. Furthermore, to address noisy labels, the temporal position was fused with the learned deep features so that the network can differentiate the time steps at which noisy labels were removed from the training set. This technique gives the system a considerable improvement over other methods. We conducted experiments on EEV, a large-scale dataset for evoked expression from videos, and achieved a state-of-the-art Pearson correlation coefficient of 0.054. Further experiments on a subset of the LIRIS-ACCEDE dataset (the MediaEval 2018 benchmark) also demonstrated the effectiveness of our approach.
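The temporal position fusion idea above can be sketched minimally in NumPy; the feature shapes and the normalized-position channel below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def fuse_temporal_position(features: np.ndarray) -> np.ndarray:
    """Append a normalized temporal-position channel to per-timestep features.

    features: (T, D) array of deep features for T timestamps.
    Returns a (T, D + 1) array whose extra channel encodes each timestep's
    relative position in [0, 1], letting later layers distinguish time
    steps even after timestamps with noisy labels are dropped.
    """
    t = features.shape[0]
    position = np.linspace(0.0, 1.0, num=t).reshape(-1, 1)
    return np.concatenate([features, position], axis=1)

fused = fuse_temporal_position(np.zeros((5, 8)))
# fused.shape == (5, 9); the last column runs from 0.0 to 1.0
```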
@article{HUYNH2023245,title={Prediction of evoked expression from videos with temporal position fusion},journal={Pattern Recognition Letters},volume={172},pages={245--251},year={2023},issn={0167-8655},author={Huynh, Van-Thong and Yang, Hyung-Jeong and Lee, Guee-Sang and Kim, Soo-Hyung},}
2021
IEEE Multimedia
End-to-End Learning for Multimodal Emotion Recognition in Video With Adaptive Loss
Huynh, Van-Thong,
Yang, Hyung-Jeong,
Lee, Guee-Sang,
and Kim, Soo-Hyung
This work presents an approach for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner, with three key points: 1) a lightweight feature extractor, 2) an attention strategy, and 3) an adaptive loss. We proposed a lightweight deep architecture of approximately 1 MB for feature extraction, the most crucial part of emotion recognition systems. Temporal relationships among features are explored with a temporal convolutional network instead of an RNN-based architecture, to leverage parallelism and avoid the vanishing gradient problem. The attention strategy is employed to adjust the knowledge of the temporal networks along the time dimension and to learn each modality's contribution to the final results. The interaction between the modalities is also investigated by training with an adaptive objective function, which adjusts the network's gradient. Experimental results obtained on a large-scale Korean emotion recognition dataset demonstrate the superiority of our method when employing the attention mechanism and adaptive loss during training.
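As a rough illustration of the modality-weighting part of such an attention strategy (the logits and vector shapes here are hypothetical, not the paper's learned parameters):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_fuse(modalities: list, logits: np.ndarray) -> np.ndarray:
    """Weight per-modality feature vectors by attention logits and
    sum them into a single fused representation."""
    weights = softmax(np.asarray(logits, dtype=float))
    return sum(w * m for w, m in zip(weights, modalities))

# With equal logits every modality contributes equally (a plain average).
fused = attention_fuse([np.ones(4), np.zeros(4)], np.array([0.0, 0.0]))
```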
@article{vthuynhIEEEM,author={Huynh, Van-Thong and Yang, Hyung-Jeong and Lee, Guee-Sang and Kim, Soo-Hyung},journal={IEEE MultiMedia},title={End-to-End Learning for Multimodal Emotion Recognition in Video With Adaptive Loss},year={2021},volume={28},number={2},pages={59--66},}
2020
FG 2020
Multimodality Pain and related Behaviors Recognition based on Attention Learning
Huynh, Van-Thong,
Yang, Hyung-Jeong,
Lee, Guee-Sang,
and Kim, Soo-Hyung
In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)
Our work studied facial data as well as movement data for recognition of pain and related behaviors in the context of everyday physical activities, provided as three tasks in the EmoPain 2020 challenge. We explored deep visual representations and geometric features, including head pose, facial landmarks, and action units, combined with fully connected layers to estimate pain from facial data. For the tasks with movement data, we employed long short-term memory layers to learn temporal information in each segment of 180 frames. We examined an attention mechanism to investigate the relationships among multiple sources and aggregate their information. Experiments on the EmoPain dataset showed that our methods significantly outperformed the baseline results on the pain recognition tasks.
@inproceedings{yang2020multimodality,title={Multimodality Pain and related Behaviors Recognition based on Attention Learning},author={Huynh, Van-Thong and Yang, Hyung-Jeong and Lee, Guee-Sang and Kim, Soo-Hyung},booktitle={2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)},pages={814--818},year={2020},organization={IEEE},}
IEEE Access
Semantic Segmentation of the Eye With a Lightweight Deep Network and Shape Correction
Huynh, Van-Thong,
Yang, Hyung-Jeong,
Lee, Guee-Sang,
and Kim, Soo-Hyung
This paper presents a method to address the multi-class eye segmentation problem, an essential step for gaze tracking or applying a biometric system in a virtual reality environment. Our system can run in resource-constrained environments, such as mobile and embedded devices, for real-time inference while still ensuring accuracy. To achieve this, we deployed the system in three major stages: obtain a grayscale image from the input, divide the image into three distinct eye regions with a deep network, and refine the results with image processing techniques. The deep network is built upon an encoder-decoder scheme with depthwise separable convolutions for low-resource systems. The image processing stage relies on the geometric properties of the eye to remove incorrect regions and to correct the shape of the eye. The experiments were conducted on OpenEDS, a large dataset of eye images captured with a head-mounted display with two synchronized eye-facing cameras. We achieved a mean intersection over union (mIoU) of 94.91% with a 0.4-megabyte model that takes 16.56 seconds to iterate over the test set of 1,440 images.
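A depthwise separable convolution factors a standard convolution into a per-channel spatial filter followed by a 1×1 pointwise mix, cutting parameters from k·k·C_in·C_out to k·k·C_in + C_in·C_out. A naive NumPy sketch of the operation (not the paper's network; shapes are illustrative):

```python
import numpy as np

def depthwise_separable_conv2d(x: np.ndarray,
                               dw_kernels: np.ndarray,
                               pw_weights: np.ndarray) -> np.ndarray:
    """x: (H, W, C) input; dw_kernels: (k, k, C) one spatial filter per
    channel; pw_weights: (C, C_out) pointwise 1x1 mixing weights.
    Uses zero padding so the spatial size is preserved."""
    h, w, c = x.shape
    k = dw_kernels.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    dw = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + k, j:j + k, :]            # (k, k, C) window
            dw[i, j, :] = np.sum(patch * dw_kernels, axis=(0, 1))
    return dw @ pw_weights                             # pointwise 1x1 conv

# Delta kernels plus an identity pointwise mix reproduce the input.
delta = np.zeros((3, 3, 2)); delta[1, 1, :] = 1.0
x = np.arange(32, dtype=float).reshape(4, 4, 2)
y = depthwise_separable_conv2d(x, delta, np.eye(2))
```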
@article{vthuynh2020semantic,title={Semantic Segmentation of the Eye With a Lightweight Deep Network and Shape Correction},author={Huynh, Van-Thong and Yang, Hyung-Jeong and Lee, Guee-Sang and Kim, Soo-Hyung},journal={IEEE Access},volume={8},pages={131967--131974},year={2020},publisher={IEEE},}
2019
ICCVW 2019
Eye semantic segmentation with a lightweight model
Huynh, Van-Thong,
Kim, Soo-Hyung,
Lee, Guee-Sang,
and Yang, Hyung-Jeong
In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
@inproceedings{vthuynh2019eye,title={Eye semantic segmentation with a lightweight model},author={Huynh, Van-Thong and Kim, Soo-Hyung and Lee, Guee-Sang and Yang, Hyung-Jeong},booktitle={2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)},pages={3694--3697},year={2019},organization={IEEE},}
ICMI 2019
Engagement Intensity Prediction with Facial Behavior Features
Huynh, Van-Thong,
Kim, Soo-Hyung,
Lee, Guee-Sang,
and Yang, Hyung-Jeong
In 2019 International Conference on Multimodal Interaction
This paper describes an approach for the engagement prediction task, a sub-challenge of the 7th Emotion Recognition in the Wild Challenge (EmotiW 2019). Our method involves three fundamental steps: feature extraction, regression, and model ensemble. In the first step, an input video is divided into multiple overlapping segments (instances) and features are extracted for each instance. Combinations of long short-term memory (LSTM) and fully connected layers are deployed to capture the temporal information and regress the engagement intensity from the features of the previous step. In the last step, we performed fusions to achieve better performance. Our approach achieved a mean squared error of 0.0597, which is 4.63% lower than the previous year's best result.
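The instance construction in the first step can be sketched as follows (segment length and stride here are hypothetical choices, not the paper's settings):

```python
def overlapped_segments(n_frames: int, seg_len: int, stride: int):
    """Start/end frame indices of overlapping segments (instances)
    covering a video of n_frames frames."""
    return [(start, start + seg_len)
            for start in range(0, n_frames - seg_len + 1, stride)]

segments = overlapped_segments(n_frames=10, seg_len=4, stride=2)
# → [(0, 4), (2, 6), (4, 8), (6, 10)]
```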
@inproceedings{thong2019engagement,title={Engagement Intensity Prediction with Facial Behavior Features},author={Huynh, Van-Thong and Kim, Soo-Hyung and Lee, Guee-Sang and Yang, Hyung-Jeong},booktitle={2019 International Conference on Multimodal Interaction},pages={567--571},year={2019},}
ICMLSC 2019
Emotion recognition by integrating eye movement analysis and facial expression model
Huynh, Van-Thong,
Yang, Hyung-Jeong,
Lee, Guee-Sang,
Kim, Soo-Hyung,
and Na, In-Seop
In Proceedings of the 3rd International Conference on Machine Learning and Soft Computing
@inproceedings{van2019emotion,title={Emotion recognition by integrating eye movement analysis and facial expression model},author={Huynh, Van-Thong and Yang, Hyung-Jeong and Lee, Guee-Sang and Kim, Soo-Hyung and Na, In-Seop},booktitle={Proceedings of the 3rd International Conference on Machine Learning and Soft Computing},pages={166--169},year={2019},}
2018
ICMLSC 2018
Learning to detect tables in document images using line and text information
@inproceedings{huynh2018learning,title={Learning to detect tables in document images using line and text information},author={Huynh, Van-Thong and Nguyen-An, Khuong and Khanh, Trinh Le Ba and Yang, Hyung-Jeong and Tran, Tuan Anh and Kim, Soo-Hyung},booktitle={Proceedings of the 2nd International Conference on Machine Learning and Soft Computing},pages={151--155},year={2018},}