We are building a never-before-seen humanoid robot that uses facial recognition, speech recognition, natural language understanding, emotion recognition, context and intent recognition, and dialog management to provide a holistic, empathetic understanding of the human user during interactive conversations. The humanoid robot will also express its own emotions through speech, facial expressions and gestures. The robot will be capable of learning on the fly during its interactions with users and the environment. We plan to install this robot in public areas of HKUST, where students and visitors can interact with it in different languages. We plan to showcase this robot to primary and high school students to attract interest in STEM fields. This project will add interest to campus life and also stimulate intellectual debate on artificial intelligence.


Intelligent systems have evolved to the stage where virtual assistants are ubiquitous in smartphones and consumer-grade robots are affordable. Better human-machine interaction depends on machines being able to empathize with human emotions and discover human intent. We are building empathetic machines that are able to recognize meaning and intent during conversations with human users, from their speech, language, facial expressions and context. In addition to understanding the content of speech, the machine needs to recognize if and when the human user is being sarcastic or joking, or if the human user is distressed and needs comforting. The "empathy module" is indispensable for the machines and robots of the future that will become caretakers and companions to humans.


Businesses often need to analyze their customer calls, customer feedback and user comments to find out what they can do to improve sales, improve services and identify new business opportunities. This Big Data must be transcribed and analyzed automatically, as it is far too large to be analyzed manually. The objective of our research project is to design algorithms that perform sentiment analysis and emotion recognition on speech and text data. We use statistical modeling and machine learning methods to train our algorithms to determine whether a user's sentiment towards a product or service is positive or negative. We also use acoustic and lexical features from speech and its transcription to recognize whether a caller is satisfied or dissatisfied with a service. We use the same methods to find out what types of sales calls lead to successful transactions.
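As a much-simplified illustration of the statistical approach, the sketch below trains a toy Naive Bayes polarity classifier over bag-of-words counts. The training snippets, feature choice and add-one smoothing are illustrative assumptions, not the project's actual models or data.

```python
import math
from collections import Counter, defaultdict

# Toy labeled feedback snippets (illustrative only, not real customer data).
TRAIN = [
    ("great service very helpful staff", "pos"),
    ("love the product works perfectly", "pos"),
    ("terrible support long wait times", "neg"),
    ("product broke after one day", "neg"),
]

def train_nb(examples):
    """Count per-class word frequencies for a Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Return the class with the highest add-one-smoothed log-posterior."""
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb(TRAIN)
```

A production system would replace the word counts with richer lexical and acoustic features, but the decision rule (pick the class maximizing a learned score) is the same.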


With the rapid penetration of the global market by smartphones and tablet computers, the need for a universal voice-enabled user interface has become increasingly urgent. The most recent example is the launch of the Siri personal assistant on the iPhone 4S, where most user-to-machine interaction can be achieved by spoken commands. Despite the maturity of speech recognition performance in major languages, it remains a major challenge to build recognition systems for languages with little written data for training. Large numbers of speakers of non-standard languages still do not have access to practical speech recognition systems. Most of the Indian languages, spoken by millions of people, do not have adequate online data for training speech recognition systems. In Hong Kong, many meetings and official speeches are delivered in Cantonese, a major non-standard Chinese language. We propose to investigate methods to meet the acoustic and language modeling challenges of large-vocabulary, non-standard language speech recognition with little online written text, with the near-term application of non-standard Chinese language recognition. Our proposed methods can also be applied to specific applications for which there is little training data initially.


Mixed language is an increasingly common occurrence in today's globalized entertainment, academic and business worlds. We propose new approaches to acoustic and language modeling of mixed language speech collected in a naturalistic setting, with the objective of building a speaker-independent mixed language speech recognition system. Such a system must be able to recognize embedded foreign language speech without sacrificing performance on the native (matrix) language speech. Mixed language is a heterogeneous genre produced by speakers who use borrowed foreign words, switch from one language to another in the middle of a sentence (intra-sentential code mixing), or switch at the end of a sentence (inter-sentential code switching). Previous work on automatic recognition of mixed language speech has been restricted to speaker-dependent systems, to speech with less than 2% borrowed words, or to short phrases with constrained grammar. In reality, we have found English in around 20% of the fluent speech recorded at meetings conducted in Chinese, for example. There has not been any research on speaker-independent systems of this kind so far.
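One small preprocessing step in this setting is locating the code-switch points themselves. The sketch below uses a crude script-based heuristic (CJK codepoints vs. Latin) to tag each token's language and report switch positions; this is an illustrative assumption that only works for written Chinese-English mixing, not our actual acoustic or language modeling method.

```python
def token_language(tok):
    """Crude script-based language ID: any CJK codepoint -> 'zh', else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in tok) else "en"

def find_switch_points(tokens):
    """Indices where the language changes between adjacent tokens."""
    langs = [token_language(t) for t in tokens]
    return [i for i in range(1, len(tokens)) if langs[i] != langs[i - 1]]
```

On a Cantonese-English utterance such as "我 哋 book 咗 meeting room 喇", this flags four intra-sentential switches; a real system must detect such points from the acoustics, where no script cue exists.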


The World Wide Web is a "boundless world of information interconnected by hypertext links". We argue that the Web is a virtually infinite and continuously growing corpus for natural language processing. Rather than taking a snapshot of it at one moment and using the result as a static corpus, we propose to continuously crawl the Web for new, comparable data for mining parallel sentences. Rather than focusing on a single domain such as news, or on translated parallel sites with matching structures, we propose to look for sites that are comparable in content, HTML structure, link structure, URL and temporal distance, as they potentially contain parallel sentences.

Much effort has been made in the past to automatically extract parallel resources from comparable corpora on the one hand, and to use the Web as a corpus on the other. Both approaches (often combined) allow more diversity in the harvested data. Resnik and Smith (2003) directly extracted parallel texts from the Web, relying mostly on URL names. Some work has been done to extract parallel resources (sentences, sub-sentential fragments, lexicons) from comparable data. Munteanu and Marcu (2005) showed that relevant parallel sentences can be extracted with a supervised approach on newspaper corpora, although their main goal was to show how such resources can improve Statistical Machine Translation. Fung and Cheung (2004) and Wu and Fung (2005) extracted parallel sentences from quasi-comparable corpora, that is, corpora containing documents from the same domains as well as documents from different domains. We need to combine advanced IR and Web-crawling techniques with advanced NLP methods in order to obtain large, high-quality sets of parallel sentences. From this point of view, we do not want to focus on one particular domain (such as news, as is often the case in related work). We are aware, and will keep in mind, that better results can be obtained from certain kinds of documents (for example, Wikipedia constitutes a large source of highly comparable, easy-to-harvest and well-structured documents), but we propose a general approach for mining from any website, in any dominant Web language. We strive to reduce language dependency and domain dependency to a minimum.
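The cited systems use supervised classifiers over rich features; as a minimal sketch of the underlying idea, the code below scores candidate cross-lingual sentence pairs by translation overlap through a toy bilingual lexicon and keeps pairs above a threshold. The lexicon, the threshold and the all-pairs search are illustrative assumptions, not the proposed pipeline.

```python
def translation_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with a dictionary translation in the target."""
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if any(t in tgt for t in lexicon.get(w, ())))
    return hits / len(src_tokens) if src_tokens else 0.0

def mine_parallel(src_sents, tgt_sents, lexicon, threshold=0.5):
    """Keep sentence pairs whose lexical overlap meets the threshold."""
    return [(s, t) for s in src_sents for t in tgt_sents
            if translation_overlap(s.split(), t.split(), lexicon) >= threshold]
```

For Web-scale comparable corpora, the all-pairs loop would be replaced by IR-style candidate retrieval, with document-level cues (URL, structure, date) pruning the search space first.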


With the advent of online digital music services, people have access to an unprecedented amount of music. The Long Tail Theory, a well-known model of the online economy, states that, in addition to a large supply of content, the key is to give users personalized and structured access to this content. All major service providers today allow users to search for music by discrete labels, such as singer and song titles, as well as genre and mood labels. In particular, mood and genre are among the most common tags users rely on to retrieve music they like. Music search engine logs show that 28.2% of queries are directly emotion-related and 33% are theme-related (e.g. "rainy morning"). However, tagging each musical piece manually with its genre and mood labels is time-consuming and expensive when we consider that online music services today provide from about a million songs (e.g. Pandora) to 18 million songs (e.g. Spotify). Automatic Music Information Retrieval (MIR) systems are developed with the objective of organizing this vast amount of music data into structured form.
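As a toy illustration of automatic mood tagging, the sketch below classifies a song by nearest neighbours over a labelled feature library. The two-dimensional feature vectors and mood labels are made up for illustration; a real MIR system would use timbre, tempo and energy statistics extracted from the audio.

```python
import math

# Toy (feature_vector, mood) pairs; values are invented for illustration.
LIBRARY = [
    ([0.9, 0.8], "happy"),
    ([0.8, 0.9], "happy"),
    ([0.2, 0.1], "sad"),
    ([0.1, 0.2], "sad"),
]

def predict_mood(features, k=3):
    """k-nearest-neighbour mood tagging over the labelled library."""
    dists = sorted((math.dist(features, vec), mood) for vec, mood in LIBRARY)
    top = [mood for _, mood in dists[:k]]
    return max(set(top), key=top.count)
```

The point of such automation is exactly the one made above: with catalogues in the millions of songs, per-track manual tagging does not scale.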


We propose a one-step rhetorical structure parsing, chunking and extractive summarization approach to automatically generate meeting minutes from parliamentary speech using acoustic and lexical features. We investigate how to use lexical features extracted from imperfect ASR transcriptions, together with acoustic features extracted from the speech itself, to form extractive summaries with the structure of meeting minutes. Each business item in the minutes is modeled as a rhetorical chunk that consists of smaller rhetorical units. Principal Component Analysis (PCA) plots of both acoustic and lexical features in meeting speech show clear self-clustering of utterances according to the underlying rhetorical state; for example, acoustic and lexical feature vectors from the question-and-answer session or the motion of a parliamentary speech are grouped together. We then propose a Conditional Random Field (CRF)-based approach that performs rhetorical structure modeling and extractive summarization in one step, by chunking, parsing and extracting salient utterances. Extracted salient utterances are grouped under the label of each rhetorical state, emulating meeting minutes, to yield summaries that are more easily understood by humans. We compare this approach to different machine learning methods.
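The sequence-labeling view behind the CRF can be illustrated with a plain Viterbi decoder: each utterance gets a per-state score (standing in for the CRF's feature potentials) and transitions between rhetorical states are scored separately. The states, scores and transition weights below are toy assumptions, not our trained model.

```python
def viterbi(obs_scores, trans_scores, states):
    """Best rhetorical-state sequence under utterance and transition log-scores.

    obs_scores:   list of {state: log-score}, one dict per utterance
    trans_scores: {(prev_state, cur_state): log-score}
    """
    V = [{s: obs_scores[0][s] for s in states}]  # best score ending in s at t
    back = []                                    # backpointers for recovery
    for t in range(1, len(obs_scores)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + trans_scores[(p, s)])
            back[t - 1][s] = prev
            V[t][s] = V[t - 1][prev] + trans_scores[(prev, s)] + obs_scores[t][s]
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))
```

In the full system, the per-utterance scores come from acoustic and lexical features, and salient utterances within each decoded state segment are extracted into the minutes.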


The focus in automatic speech recognition (ASR) research has gradually shifted from read speech to spontaneous speech. ASR systems can reach an accuracy above 90% when evaluated on read speech, but accuracy on spontaneous speech is much lower. This high error rate is due in part to the poor modeling of pronunciations within spontaneous speech. An analysis of pronunciation variations at the acoustic level reveals that they include both complete changes and partial changes. Complete changes are the replacement of a canonical phoneme by another phone, such as 'b' being pronounced as 'p'. Partial changes are variations within a phoneme, described by diacritics such as nasalization, centralization, devoicing and voicing. Most current work in pronunciation modeling attempts to represent pronunciation variations either by alternative phonetic representations or by the concatenation of subphone units at the state level. In this work, we show that partial changes are far less clear-cut than previously assumed and cannot be modeled merely by alternative or concatenated phone units. When partial changes occur, a phone is not completely substituted, deleted or inserted, and the acoustic representation at the phone level is often ambiguous. We suggest that in addition to phonetic representations of pronunciation variations, the ambiguity of acoustic representations caused by partial changes should be taken into account. The acoustic model for spontaneous speech should be different from that of read and planned speech: it should have a strong ability to cover partial changes. We propose modeling partial changes by combining the pronunciation model with the acoustic model at the state level. Based on this pronunciation model, we reconstruct the acoustic model to improve its resolution without sacrificing the model's identity, with the goal of accommodating pronunciation variations.
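For contrast, the conventional symbolic treatment of complete changes can be sketched as lexicon expansion with phone substitution rules; partial changes, as argued above, resist exactly this kind of expansion and require state-level acoustic modeling. The phone inventory and rules below are illustrative assumptions.

```python
# Toy substitution rules for complete changes (e.g. 'b' realized as 'p').
# Partial changes cannot be captured by symbolic expansion like this,
# which is the limitation the proposed state-level modeling addresses.
RULES = {"b": ["b", "p"], "ae": ["ae", "ax"]}

def expand_pronunciations(phones):
    """Enumerate alternative pronunciations allowed by the substitution rules."""
    variants = [[]]
    for ph in phones:
        alts = RULES.get(ph, [ph])
        variants = [v + [a] for v in variants for a in alts]
    return [" ".join(v) for v in variants]
```

Each added variant also enlarges the decoder's search space and can introduce new confusions, which is a further argument for handling sub-phonemic variation inside the acoustic model rather than in the lexicon.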