Main Page

From Data Management Lab
Jump to: navigation, search
Data Management Lab is subordinate to Institute of Computer Science and Technology of Peking University. The instructor is Lei Zou. Our lab's research is mainly about Graph-based Data Management.Graph Data Management involves how to store and search semantic data based on RDF(Resource Description Framework), graph stream data analysis, distributed graph database engine development and how to use new hardware technology to improve graph computing and graph query.


  • [2017-11-27]Sen Hu's paper on Knowledge Base Question Answering (Answering Natural Language Questions by Subgraph Matching over Knowledge Graphs) is accepted by TKDE 2017. This is an extension paper over previous work on Question Answering that is published on SIGMOD 2014. This paper also follow the semantic parsing based approach, it first translate the natural language question to semantic query graph, then find the appropriate matches over the knowledge graph. The framework in previous work is relation first which have some obstacles including can not tackle implicit relations. The core idea of the extension is to propose a node first framework to build semantic query graph, which can tackle implicit relations by utilizing data mining on offline data graph. It is robust because it does not rely on any templates to build the query graph structure and allow some false edge. Those false edges would be removed on the evaluation process according the matches. Thus the node first framework can tackle more complex questions. The experiment results show that our solution not only improves the precision but also speeds up query perfermence greatly.
  • [2017-09-28]Youhuan Li’s paper on longest increasing subsequence (LIS) computation over data stream (Longest Increasing Subsequence Computation over Streaming Sequences) is accepted by TKDE 2017. This is an extension paper over previous work on streaming LIS computation that is published on VLDB 2017. This paper also focuses on the LIS computation over data stream with the proposed quadruple neighbor list (QN-list) data structure. The core idea of the extension part lies in the computation of LIS with all existing constraints. In previous work, we design algorithm for computing LIS with extreme gap/weight, while, in this version, we not only include the computation of LIS with extreme width (width is defined as the positional distance between the head item and the tail item of an LIS), but also consider the range/slope-constrained LIS computation. The range-constrained LIS is the LIS where any two consecutive items satisfy that their difference and distance are within given ranges. The slope-constrained LIS is the LIS where the slope between any two consecutive items (i.e., the ratio of the difference over the distance) should be not less than a certain parameter m. Computing range/slope-constrained LIS are quite different that of LIS with extreme gap/weight/width while we innovatively design efficient algorithm to support range/slope-constrained LIS and our solution is better than comparative work not only theoretically but also experimentally.
  • [2017-08-05]Shuo Han’s paper (Keyword Search on RDF Graphs — A Query Graph Assembly Approach) is accepted by CIKM 2017. Keyword search provides ordinary users an easy-to-use interface for querying RDF data. In this paper, we study how to assemble a query graph that is to represent user’s query intention accurately and efficiently. Based on the input keywords, we first obtain the elementary query graph building blocks, such as entity/class vertices and predicate edges. Then, we formally define the query graph assembly (QGA) problem. Although we prove theoretically that QGA is a NP-complete problem,we design some heuristic lower bounds and propose a bipartite graph matching-based best-first search algorithm. The algorithm’s time complexity does not depend on the RDF graph size, which guarantees the good scalability of our system in large RDF graphs.
  • [2017-05-19]Our team gStreamPKU wins the finalist award (top-5) in SIGMOD Programming Contest 2017 among 100+ teams. SIGMOD Programming Contest is held annually for undergraduate and graduate students to solve challenging database problems. The task this year is to search documents and return matchings from a dynamic updated dictionary. Please see for details. It is worth mentioning that we are also the finalists in the last year's contest.
  • [2017-04-10]Weiguo Zheng's paper (Efficient SimRank-based Similarity Join) is accepted by ACM Transactions on Database Systems (ACM TODS). Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, which can benefit friend recommendation in social networks and link prediction, etc. In this paper, we adopt “SimRank” to evaluate the similarity between two vertices in a large graph because of its generality. Specifically, we define a SimRank-based join (SRJ) query to find all the vertex pairs satisfying the threshold in a data graph G. In order to reduce the search space, we propose a shortest-path distance based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover+ vertex pairs), based on which, the SimRank score of any vertex pair can be computed easily. To find the h-go cover+ vertex pairs, we propose an efficient method without building the vertex-pair graph. Hence, large graphs can be dealt with easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.
  • [2017-03-28]Peng Peng's paper(Answering top-K query combined keywords and structural queries on RDF graphs) is published on Information System. Although SPARQL has been the predominant query language over RDF (Resource Description Framework) graphs, some query intentions cannot be captured well using only SPARQL syntax. On the other hand, keyword search enjoys widespread usage because of its intuitive way of specifying information needs, but suffers from the problem of low precision. To maximize the advantages of both SPARQL and keyword search, we introduce a novel paradigm that combines them and propose a hybrid query (called a SPARQLKeyword (SK) query) that integrates SPARQL and keyword search. To answer SK queries efficiently, we propose a novel integrated query algorithm based on a structural index. We also present a distance-based optimization technique to further improve the efficiency of SK queries evaluation. We test our method in three large real RDF graphs and the experiments demonstrate both the effectiveness and efficiency of our method.
  • [2017-03-08]Xinbo Zhang's paper(Improving the Precision of RDF Question/Answering Systems— A Why Not Approach) is accepted by WWW 2017 as a poster. In this paper, we propose a self-learning solution based on the users' feedback over the original query graph. Specifically, our method automatically refines the SPARQL query into a new query graph with minimum modifications. The new query will fix the errors and omissions of the query results. Furthermore, each amendment can also be used to improve the precision in answering subsequent natural language questions.
  • More...


  • For recruitment information, please see the navigation link.
  • Our lab is recruiting interns. See Interns Recruitment for more details.

Upcoming Conferences

  • WWW 2017
  • ICDE 2017
  • SIGMOD 2017 Ⅱ