We are building a never-before-seen humanoid robot that uses facial recognition, speech recognition, natural language understanding, emotion recognition, context and intent recognition, and dialog management to provide a holistic, empathetic understanding of the human user during interactive conversations. The humanoid robot will also express its own emotions through speech, facial expressions and gestures. The robot will be capable of learning on the fly during its interactions with users and the environment. We plan to install this robot in public areas of HKUST, where students and visitors can interact with it in different languages. We plan to showcase this robot to primary and high school students to attract interest in STEM fields. This project will add interest to campus life and also stimulate intellectual debates on artificial intelligence.
Intelligent systems have evolved to the stage where virtual assistants are ubiquitous in smartphones and consumer-grade robots are affordable. Better human-machine interaction depends on machines being able to empathize with human emotions and discover human intent. We are building empathetic machines that can recognize meaning and intent during conversations with humans, from their speech, language, facial expressions and context. In addition to understanding the content of speech, the machine needs to recognize if and when the human user is being sarcastic or joking, or if the human user is distressed and needs comforting. The “empathy module” is indispensable for the machines and robots of the future that will become caretakers of and companions to humans.


End-to-end task-oriented dialog systems usually suffer from the challenge of incorporating knowledge bases. We propose a novel yet simple end-to-end differentiable model called memory-to-sequence (Mem2Seq) to address this issue. Mem2Seq is the first neural generative model that combines multi-hop attention over memories with the idea of pointer networks. We empirically show how Mem2Seq controls each generation step, and how its multi-hop attention mechanism helps in learning correlations between memories. In addition, our model is quite general, without complicated task-specific designs. As a result, we show that Mem2Seq can be trained faster and attains state-of-the-art performance on three different task-oriented dialog datasets.
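The core loop of multi-hop attention over a memory can be sketched as follows; this is a minimal toy illustration (the vectors, dimensions and hop count are ours, not the trained Mem2Seq model), where the final attention weights serve as a pointer distribution over memory positions:

```python
# Minimal sketch of multi-hop attention over a memory, in the spirit of
# Mem2Seq. All values here are illustrative toy numbers.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_hop_attention(query, memory, hops=3):
    """Refine `query` over several hops; the final attention weights
    act as a pointer distribution over memory positions."""
    q = list(query)
    attn = None
    for _ in range(hops):
        scores = [dot(q, m) for m in memory]
        attn = softmax(scores)
        # read vector: attention-weighted sum of memory slots
        o = [sum(p * m[d] for p, m in zip(attn, memory))
             for d in range(len(q))]
        # update the query for the next hop
        q = [qi + oi for qi, oi in zip(q, o)]
    return q, attn

memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q, pointer = multi_hop_attention([1.0, 0.2], memory)
print(pointer)  # attention concentrates on the first memory slot
```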


Artificial intelligence systems can have a negative social impact if they are used to create fake news, or if they are biased. Machine learning often produces models that contain the same biases as the human samples they are trained on. We work on research topics that actively combat these negative forces. To date we have been working on online abusive language detection, including sexist and racist language. Abusive language detection models tend to be biased toward the identity words of certain groups of people because of imbalanced training datasets. For example, “You are a good woman” was considered “sexist” when trained on an existing dataset. Such bias is an obstacle to making models robust enough for practical use. In this work, we measure gender bias in models trained on different abusive language datasets, while analyzing the effect of different pre-trained word embeddings and model architectures. We also experiment with three bias mitigation methods: (1) debiased word embeddings, (2) gender-swap data augmentation, and (3) fine-tuning with a larger corpus. These methods can effectively reduce gender bias by 90-98% and can be extended to correct model bias in other scenarios. We also work on computational ways to measure inherent sexist or racist bias in chatbots.
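The gender-swap augmentation idea can be sketched as below; the swap list and example are illustrative, and a real implementation would use a curated list of identity-word pairs and handle ambiguous cases such as possessive "her":

```python
# Illustrative sketch of gender-swap data augmentation: each training
# sentence is duplicated with gendered identity words exchanged, so the
# classifier sees both variants with the same label. The word list below
# is a toy subset, not the list used in the paper.
SWAP = {"he": "she", "she": "he", "him": "her", "her": "him",
        "man": "woman", "woman": "man", "men": "women", "women": "men"}

def gender_swap(sentence):
    out = []
    for tok in sentence.split():
        low = tok.lower()
        swapped = SWAP.get(low, low)
        # restore capitalization of sentence-initial words (mixed-case
        # tokens are not handled in this sketch)
        out.append(swapped.capitalize() if tok[0].isupper() else swapped)
    return " ".join(out)

def augment(corpus):
    """corpus: list of (sentence, label). Keep the originals and add one
    gender-swapped copy of each example."""
    return corpus + [(gender_swap(s), y) for s, y in corpus]

print(augment([("You are a good woman", 0)]))
```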
We also work on fact checking statements made online.
Fact-checking of textual sources requires effectively extracting relevant information from large knowledge bases. In this paper, we extend an existing pipeline approach to better tackle this problem. We propose a neural ranker using a decomposable attention model that dynamically selects sentences, achieving a promising 38.80% improvement in evidence-retrieval F1 with a 65× speedup over a TF-IDF method. Moreover, we incorporate lexical tagging methods into our pipeline framework to simplify the tasks and render the model more generalizable. As a result, our framework achieves promising performance on a large-scale fact extraction and verification dataset, with significant speedup. Many challenges remain in quantitatively measuring machine bias and in building computational models to alleviate such bias. How we can use AI actively for good is an overriding theme of our research group.
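For context, the TF-IDF baseline that the neural ranker is compared against can be sketched as a cosine-similarity ranker over candidate evidence sentences; the tokenization and IDF formula here are a common textbook variant, not necessarily the exact configuration used in our experiments:

```python
# Toy TF-IDF evidence ranker: rank candidate sentences by cosine
# similarity to the claim. Whitespace tokenization and smoothed IDF are
# simplifications for illustration.
import math
from collections import Counter

def tfidf_rank(claim, sentences):
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cos(a, b):
        num = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return num / (na * nb) if na and nb else 0.0

    q = vec(claim.lower().split())
    order = sorted(enumerate(docs), key=lambda p: -cos(q, vec(p[1])))
    return [i for i, _ in order]  # sentence indices, best first

sents = ["the eiffel tower is in paris",
         "bananas are yellow fruit",
         "paris is the capital of france"]
print(tfidf_rank("where is the eiffel tower", sents))
```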


The objective in multi-label learning problems is the simultaneous prediction of many labels for each input instance. In recent years, many embedding-based approaches have been proposed to solve this problem by considering label dependencies and decreasing learning and prediction cost. However, compressing the data loses part of the information contained in the label space. The idea in this work is to divide the whole label space into small independent groups, which allows independent learning and prediction for each small group in the original space rather than in a compressed space. We use subspace clustering to extract these partitions such that the labels in each group carry no information that improves results for the labels in other groups. Experiments on datasets with various numbers of features and labels show that the approach improves prediction quality at lower computational cost compared to the state-of-the-art.
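The partitioning idea can be illustrated with a simplified sketch: here labels are grouped by thresholded pairwise correlation and a union-find pass, which is a deliberate simplification of the subspace-clustering procedure actually used in the work:

```python
# Simplified sketch of label-space partitioning: labels whose pairwise
# correlation exceeds a threshold end up in the same group; each group
# can then be learned and predicted independently. The threshold and
# greedy grouping are our illustration, not the paper's method.
def label_groups(Y, threshold=0.1):
    """Y: list of binary label vectors, one per instance.
    Returns connected groups of correlated label indices."""
    n_labels = len(Y[0])
    n = len(Y)
    means = [sum(row[j] for row in Y) / n for j in range(n_labels)]

    def corr(i, j):
        cov = sum((r[i] - means[i]) * (r[j] - means[j]) for r in Y) / n
        vi = sum((r[i] - means[i]) ** 2 for r in Y) / n
        vj = sum((r[j] - means[j]) ** 2 for r in Y) / n
        return cov / ((vi * vj) ** 0.5) if vi and vj else 0.0

    parent = list(range(n_labels))  # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n_labels):
        for j in range(i + 1, n_labels):
            if abs(corr(i, j)) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for lbl in range(n_labels):
        groups.setdefault(find(lbl), []).append(lbl)
    return sorted(groups.values())

# labels 0,1 always co-occur; labels 2,3 always co-occur; no cross signal
Y = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
print(label_groups(Y))  # two independent groups
```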


We research multimodal deep learning of affect, personality, emotions and sentiment from voice, language and facial expressions. We are building empathetic machines that can recognize meaning and intent during conversations with humans, from their speech, language, facial expressions and context. In addition to understanding the content of speech, the machine needs to recognize if and when the human user is being sarcastic or joking, or if the human user is distressed and needs comforting.
We propose a multilingual model to recognize Big Five personality traits from text data in four different languages: English, Spanish, Dutch and Italian. Our analysis shows that words with similar semantic meanings in different languages do not necessarily correspond to the same personality traits. We propose a personality alignment method, GlobalTrait, which maps each trait from the source language to the target language (English), such that words that correlate positively with each trait are close together in the multilingual vector space. Using these aligned embeddings for training, we can transfer personality-related training features from high-resource languages such as English to other low-resource languages, and obtain better multilingual results compared to using simple monolingual and unaligned multilingual embeddings.

We also propose a tri-modal architecture to predict Big Five personality trait scores from video clips, with different channels for audio, text, and video data. For each channel, stacked convolutional neural networks are employed. The channels are fused both at the decision level and by concatenating their respective fully connected layers. We show that a multimodal fusion approach outperforms each single-modality channel, with an improvement of 9.4% over the best individual modality (video). Full backpropagation is also shown to be better than a linear combination of modalities, meaning complex interactions between modalities can be leveraged to build better models. Furthermore, we can see the prediction relevance of each modality for each trait. The described model can be used to increase the emotional intelligence of virtual agents.
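The feature-level fusion step can be sketched as below; the per-channel vectors stand in for the stacked-CNN outputs, and the weights are random placeholders rather than trained values:

```python
# Sketch of feature-level fusion for the tri-modal model: each modality
# channel yields a fixed-size feature vector, the vectors are
# concatenated, and a shared linear layer maps them to five trait
# scores. Dimensions and weights are illustrative placeholders.
import random

random.seed(0)

def linear(x, w, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def fuse_and_predict(audio_feat, text_feat, video_feat, w, b):
    fused = audio_feat + text_feat + video_feat   # feature-level concat
    return linear(fused, w, b)                    # five Big Five scores

dim = 4  # per-channel feature size (illustrative)
w = [[random.uniform(-0.1, 0.1) for _ in range(3 * dim)] for _ in range(5)]
b = [0.0] * 5
scores = fuse_and_predict([0.2] * dim, [0.5] * dim, [0.1] * dim, w, b)
print(len(scores))  # one score per Big Five trait
```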


We are collaborating with artists, designers and art and design students at the Central Academy of Fine Arts in Beijing, on using AI tools for creative arts, ranging from machine generated art and paintings, to exploring the artistic meaning of AI technology. We are interested in using AI for the future of design and in using AI as a new medium of art creativity. Meanwhile, we are interested in exploring machine creativity and machine learning of aesthetic taste.


Businesses often need to analyze their customer calls, customer feedback, and user comments to find out what they can do to improve sales, improve services and identify new business opportunities. These data are far too voluminous to analyze manually, so they must be transcribed and analyzed automatically. The objective of our research project is to design algorithms that perform sentiment analysis and emotion recognition from speech and text data. We use statistical modeling and machine learning methods to train our algorithms to determine whether a user's sentiment toward a product or service is positive or negative. We also use audio and lexical features from speech and its transcription to recognize whether a caller is satisfied or dissatisfied with a service. We use the same methods to find out which types of sales calls lead to successful transactions.
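As a toy illustration of polarity classification over a transcript (a real system learns its cues from labeled calls rather than relying on this hand-written lexicon, which is ours purely for illustration):

```python
# Toy lexicon-based polarity scorer over a call transcript: count
# positive and negative cue words and compare. A trained classifier
# replaces these hand-picked word lists in practice.
POS = {"great", "good", "love", "satisfied", "thanks"}
NEG = {"bad", "terrible", "angry", "refund", "dissatisfied"}

def polarity(transcript):
    tokens = transcript.lower().split()
    score = sum(t in POS for t in tokens) - sum(t in NEG for t in tokens)
    if score > 0:
        return "positive"
    return "negative" if score < 0 else "neutral"

print(polarity("I love the new plan, thanks"))
print(polarity("Terrible service, I want a refund"))
```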


With the rapid penetration of the global market by smartphones and tablet computers, the need for a universal voice-enabled user interface has become increasingly urgent. The most recent example is the launch of the Siri personal assistant on the iPhone 4S, where most user-to-machine interaction can be achieved by spoken commands. Despite the maturity of speech recognition performance in major languages, it is still a major challenge to build recognition systems for languages with little written data for training. There are large numbers of speakers of non-standard languages who still do not have access to practical speech recognition systems. Most Indian languages, spoken by millions of people, do not have adequate online data for training speech recognition systems. In Hong Kong, many meetings and official speeches are delivered in Cantonese, a major non-standard Chinese language. We propose to investigate methods that meet the acoustic and language modeling challenges of large-vocabulary, non-standard-language speech recognition with little online written text, with the near-term application of non-standard Chinese language recognition. Our proposed methods can also be applied to specific applications for which there is little training data initially.


Mixed language is an increasingly common occurrence in today's globalized entertainment, academic and business worlds. We propose new approaches to acoustic and language modeling of mixed-language speech collected in a naturalistic setting, with the objective of building a speaker-independent mixed-language speech recognition system. Such a system must be able to recognize embedded foreign-language speech without sacrificing performance on the native (matrix) language. Mixed language is a heterogeneous genre produced by speakers who use borrowed foreign words, or who switch from one language to another in the middle of a sentence (intra-sentential code-mixing) or at sentence boundaries (inter-sentential code-switching). Previous work on automatic recognition of mixed-language speech has been restricted to speaker-dependent systems, to speech with fewer than 2% borrowed words, or to short phrases with constrained grammar. In reality, we have found English in around 20% of the fluent speech recorded at meetings conducted in Chinese, for example. There has not yet been any research on speaker-independent systems of this kind.

Speech recognition of mixed language is difficult to adapt to the end-to-end deep learning framework, due to the lack of data and to overlapping phone sets, as in words such as "one" in English and "wan" in Chinese. We propose a CTC-based end-to-end automatic speech recognition model for intra-sentential English-Mandarin code-switching. The model is trained by joint training on monolingual datasets and fine-tuning on the mixed-language corpus. During decoding, we apply beam search and combine CTC predictions with a language model score. The proposed method is effective in leveraging monolingual corpora and detecting language transitions, and it reduces the character error rate.
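The decoding rule can be sketched as a beam search whose hypotheses combine the acoustic log-probability with a weighted language-model log-probability; real CTC decoding must also merge repeated tokens and blanks, which this simplified sketch omits (the probabilities and weight below are illustrative):

```python
# Simplified shallow-fusion beam search: at each step, extend every
# hypothesis with each candidate token and score it as
#   log P_acoustic + lm_weight * log P_lm.
# Blank/repeat merging of true CTC decoding is omitted for clarity.
import math

def beam_search(step_probs, lm, lm_weight=0.3, beam=2):
    """step_probs: per-step dict token -> acoustic probability.
    lm: dict (prev_token, token) -> bigram probability."""
    beams = [((), 0.0)]  # (token sequence, combined log score)
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            prev = seq[-1] if seq else "<s>"
            for tok, p in probs.items():
                lm_p = lm.get((prev, tok), 1e-4)  # floor for unseen bigrams
                s = score + math.log(p) + lm_weight * math.log(lm_p)
                candidates.append((seq + (tok,), s))
        beams = sorted(candidates, key=lambda c: -c[1])[:beam]
    return list(beams[0][0])

# toy acoustics confusing English "one" with Mandarin-like "wan"
steps = [{"one": 0.6, "wan": 0.4}, {"thing": 0.5, "ting": 0.5}]
lm = {("<s>", "one"): 0.5, ("one", "thing"): 0.6, ("wan", "ting"): 0.3}
print(beam_search(steps, lm))
```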


We are interested in exploring unsupervised learning methods to extract bilingual sentences and bilingual lexicons from online data. We were among the first to propose using distributional semantics to match sentences and words to their counterparts in another language. More recently, we have been exploring deep learning methods for translating terminology in an unsupervised manner.
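The distributional-matching idea can be sketched as follows: once word vectors for two languages live in a shared space, lexicon entries are extracted by taking cosine nearest neighbours across languages (the tiny vectors and word pairs below are fabricated purely for illustration):

```python
# Sketch of bilingual lexicon extraction by cross-lingual nearest
# neighbour in a shared embedding space. Real systems first align the
# two monolingual spaces; here the toy vectors are already aligned.
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def extract_lexicon(src_vecs, tgt_vecs):
    """Map each source word to its most similar target word."""
    return {sw: max(tgt_vecs, key=lambda tw: cosine(sv, tgt_vecs[tw]))
            for sw, sv in src_vecs.items()}

src = {"chien": [0.9, 0.1], "chat": [0.1, 0.9]}
tgt = {"dog": [0.8, 0.2], "cat": [0.2, 0.8]}
print(extract_lexicon(src, tgt))
```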