Natural Language Processing

The ICST has been working on computer network and database applications since the 1990s. In 1994 it developed a news publishing management system that has been widely used by Chinese newspapers. After 12 years of research and development, the team completed a novel digital asset management system for the news industry, which has had a significant impact on the press industry in China and won the second prize of the "State Scientific and Technological Progress Award" in 2006.

Fig 2.5 Different approaches to parsing for graph-structured syntactic-semantic representations

With the development of the Internet since 2000, we have gradually shifted our emphasis to digital publishing and new media, and now focus on inventing and developing novel technologies for digital content analysis and knowledge-based intelligent services. Our research outcomes have been published in top-tier international NLP and database journals and conferences, such as TPAMI, ACM TOIS, Computational Linguistics, VLDB Journal, TKDE, TASLP, TACL, ACL, IJCAI, AAAI, SIGMOD, VLDB, ICDE, SIGIR, and EMNLP.

Our current research mainly focuses on:

·             Chinese language understanding, such as deep syntactic parsing and compositional semantic analysis.

·             Automatic text generation, including automatic text summarization, natural language generation, etc.

·             Social media mining, including real-time filtering and retrieval for Weibo and WeChat, short text analysis, social network analysis, sentiment analysis, opinion mining, dissemination effect analysis, etc.

·             Semantic analysis and knowledge-based intelligent services, including open-domain knowledge base construction, question answering, news event extraction/mining, user profiling, personalized recommendation, etc.

·             Massive information storage and automatic semantic web construction, including graph data management and the automatic construction and dynamic extension of massive RDF semantic data.

·             Knowledge mining and semantic search for massive data, including semantic concept disambiguation, knowledge mining, association analysis, structure analysis, fuzzy semantic search, etc.

·             Internet public opinion analysis, including massive semi-structured data search and mining, public opinion search and mining, public opinion monitoring and early warning systems, etc.

·             Natural-language-based human-computer interaction, including open-domain and domain-specific dialogue systems, dialogue retrieval and generation, controllable and extensible dialogue systems, etc.

Our scientific achievements include:

·           Digital Asset Management System for News Industry

Our digital asset management system digitized the news collection and editing process, as well as the operations management process of news media. The system won the second prize of the "State Scientific and Technological Progress Award" in 2006.

·           Internet Public Opinion Analysis and Early Warning System

This system automatically collects, monitors, and analyzes Internet public opinion, and is widely used by the government. It has made a significant contribution to the healthy development of the Internet and, in turn, to the construction of a harmonious society.

·           Digital Newspaper and Cross-Media Publication System

This system provides core functions for digital publishing services, including layout processing, digital newspaper production, multi-channel publishing, automatic upload, digital distribution control, and multi-terminal display. It automates newspaper production, standardizes content exchange, and diversifies operation modes. Thousands of news agencies use the system, which has contributed greatly to digital publishing technology in China.

·           Chinese Semantic Knowledge Base — PKUBase

PKUBase is a large-scale open-domain Chinese knowledge graph automatically constructed from multiple online encyclopedias. It includes over 13,000,000 Chinese entities, over 50,000,000 high-quality knowledge triples, more than 100,000 concepts, and nearly 3,000,000 category-related triples.

·           Knowledge Graph Data Management System — gStore

gStore is an open-source, graph-based RDF data storage and querying system. It supports structured queries over 10 billion RDF triples with response times on the order of seconds.

·           Chinese Natural Language Processing Toolkit — PKUNLP

PKUNLP provides high-precision lexical analysis and syntactic parsing for Chinese, and also supports deep semantic dependency analysis.

·           Auto-Writing System — PKUWriter

PKUWriter automatically generates readable short or long news reports in real time from structured data and textual materials. It has been widely adopted by both traditional newspapers and Internet news providers, such as ByteDance (Toutiao), Southern Metropolis Daily, and Guangzhou Daily, and has produced thousands of news reports on sports and society. The system has been widely covered by media both in China and abroad.

·           Human-Computer Dialogue System

By learning from massive human dialogue corpora and combining retrieval-based and generation-based dialogue techniques, our self-developed human-computer dialogue system generates controllable dialogue that incorporates specific information, either explicitly or implicitly, into its output. The system has been successfully deployed in real products.
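To make the notion of "knowledge triples" and "structured querying" used in the PKUBase and gStore descriptions above more concrete, here is a minimal Python sketch of subject-predicate-object triples and pattern matching over them. It is an illustration only: the data, the `match` function, and the `universities_with_city` helper are hypothetical, and the code does not reflect how PKUBase is stored or how gStore evaluates queries.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# A triple is (subject, predicate, object), as in RDF.
Triple = Tuple[str, str, str]

# Hypothetical toy data for illustration; not taken from PKUBase.
TRIPLES: List[Triple] = [
    ("Peking_University", "type", "University"),
    ("Peking_University", "locatedIn", "Beijing"),
    ("Beijing", "type", "City"),
    ("Beijing", "capitalOf", "China"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching a pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (
            o is None or t[2] == o
        ):
            yield t


def universities_with_city(triples: List[Triple]) -> List[Tuple[str, str]]:
    """Join two patterns: (?x type University) and (?x locatedIn ?city)."""
    results = []
    for x, _, _ in match(triples, p="type", o="University"):
        for _, _, city in match(triples, s=x, p="locatedIn"):
            results.append((x, city))
    return results


if __name__ == "__main__":
    print(list(match(TRIPLES, p="type")))
    print(universities_with_city(TRIPLES))
```

A linear scan like this is only workable for toy data; a system handling billions of triples needs indexing and query planning, which this sketch deliberately leaves out.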

