Web Information Retrieval

Spring 2013

Min Zhang & Yiqun Liu
Department of Computer Science and Technology, Tsinghua University


Course Info in Spring, 2013



Course Description

This course surveys new research branches, introduces state-of-the-art technologies, and discusses open problems and challenges in Web information retrieval (Web IR). It also focuses on practical applications in the Internet environment, with case studies and detailed analysis of commercial search engines (SE). The main topics of the course include (but are not limited to): general IR architecture; models in IR, such as Boolean, vector space model (VSM), probabilistic models, and language models; IR in the Web environment, such as crawling, link analysis, and anti-spam; challenges in Web IR, including scale, quality, Web conventions, multi-source fusion, evaluation, and UI; opinion and sentiment analysis; social media and IR; human computation; user behavior analysis, such as eye-tracking studies and click models; visual IR, including image retrieval and video retrieval; and evaluation issues.

The course is composed of lectures and student-conducted discussions.
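
To make one of the models named in the description concrete, the following is a minimal sketch of vector space model (VSM) ranking. It is an illustration only, not course material: the function names, the whitespace tokenizer, the toy documents, and the weighting scheme (raw term frequency times log(N/df), compared by cosine similarity) are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_tfidf(docs):
    """Return (per-document TF-IDF vectors, idf table) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = build_tfidf(docs)
    q_tf = Counter(tokenize(query))
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in q_tf.items() if t in idf}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    toy_docs = [
        "web crawling and link analysis",
        "probabilistic retrieval models",
        "link analysis with pagerank",
    ]
    for score, idx in rank("link analysis", toy_docs):
        print(f"{score:.3f}  {toy_docs[idx]}")
```

In this toy collection the shorter document that contains both query terms ranks first, because cosine similarity normalizes by vector length. A real system would use an inverted index rather than scoring every document.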

Reference books and readings


Dr. Min Zhang is an associate professor in the Department of Computer Science & Technology (DCST), Tsinghua University. She received her bachelor's and PhD degrees from DCST at Tsinghua University in 1999 and 2003, respectively. In past years she has been a visiting researcher at DFKI Germany, City University of Hong Kong, Kyoto University, and MSRA. Dr. Zhang specializes in information retrieval, Web user behavior analysis, and machine learning. She has published more than 100 papers in important international journals and conferences, such as JASIST, JIR, SIGIR, WWW, WSDM, and CIKM. She has participated in the TREC (Text REtrieval Conference) benchmarks as team leader since 2002, and over those 10 years her team has repeatedly achieved top performances. She also contributed to the INTENT tasks in the NTCIR evaluation as task co-organizer from 2011 to 2013. Dr. Zhang serves as an area chair or senior PC member at CIKM and AIRS, and as a PC member at SIGIR, WWW, WSDM, KDD, ACL, etc. She is currently also the executive director of the Tsinghua University-Microsoft Research Asia Joint Research Lab on Media and Search, and the vice director of the Tsinghua-Sohu Joint Research Lab of Search Technology.

Dr. Yiqun Liu is an associate professor in the Department of Computer Science and Technology, Tsinghua University. His research interests are Web search technology and Web user behavior analysis, with a particular focus on improving search performance with the help of user behavior analysis. His recent projects cover Web page quality estimation, Web spam page and illegal resource identification, search performance evaluation, on-line advertising performance evaluation, and search engine query recommendation. He has published several high-quality papers in TWeb, JASIST, SIGIR, IJCAI, WWW, CIKM, WSDM, and other important journals and conferences. Algorithms and systems developed in these projects have been adopted by Sogou.com, one of the most popular search engines in China.

Teaching Assistant:


The course requirements include lecture discussions, homework, and a course project. This is an M.S.-level class, and by the end of it you should understand the basic concepts of Web information retrieval and IR-related Web applications. You should also be able to use this knowledge to build a campus-based demonstration search engine system. The grading breakdown is as follows:




(NOTE: The syllabus is subject to change.)

·         Week 1: Basic IR Concepts I: Introduction to Information retrieval slides

    History of IR, IR-related concepts, architectures of IR systems, introduction to research branches of IR

·         Week 2: Basic IR Concepts II: Architecture of Web IR systems slides

    Web environment and IR systems, overview of crawling system, indexing system and ranking system

·         Week 3: Basic IR Concepts III: Performance Evaluation of Web IR system slides

    Basic problem and the classification of evaluation

    User survey based evaluation

    Cranfield-like methodology

    Impact factors for user experience on Web search

    Automatic evaluation technology

·         Week 4: Web IR Technologies I: Crawling slides

    Functions and performance requirements

    Crawling frontend

    Web page analysis

    Scheduling strategy

    Repetition and low quality detection

·         Week 5: Web IR Technologies II: Indexing

    Functions and performance requirements

    Dictionary construction

    Inverted index construction

    Parallel indexing

    Index compression

·         Week 6: Web IR Technologies III: Hyperlink Analysis slides

    Functions and performance requirements

    Web hyperlink graph

    HITS algorithm

    PageRank algorithm

    Hyperlink and anchor text

·         Week 7: Web IR Technologies IV: Ranking slides

    Functions and performance requirements

    Ranking factors

    Boolean model

    Vector space model

    Probabilistic model

    Learning to rank

·         Week 8: User behavior analysis for IR I: Eye Tracking Studies slides

    Sources of user behavior information

    Categorization of user behavior and user intention

    Eye-tracking Studies and observations for IR

    User behavior analysis vs. social psychology and cognition

·        Week 9: User behavior analysis for IR II: Click Models slides

    User modeling for IR

    Click models (cascade model, UBM, etc.)

    Problems and future directions

·        Week 10: Web IR Challenges I: Scale and Data quality estimation slides

    Challenges in Scale and data quality

    Web data quality estimation with hyperlink analysis

    Web data quality estimation with user behavior analysis

    Data quality estimation for Web 2.0 services

    Problems and future directions

·        Week 11: Web IR Challenges II: Web spam fighting

    Definition of Web spam

    Web spamming techniques

    Spam fighting with content analysis

    Spam fighting with hyperlink analysis

    Spam fighting with user behavior analysis

    Problems and future directions

·        Week 12: Web IR Challenges III: Web conventions, multi-source fusion

    Rethinking Web conventions and their effects on Web search

    Problems and future directions

    Multi-source fusion and meta search

·        Week 13: Web IR Challenges IV: Search Engine Evaluation and UI

    Integration and Evaluation

    Novelty and Diversity

    Challenges in New UI design

·        Week 14: Visual IR I: Web Image Retrieval slides

    What visual IR is and why it is important

    Different Approaches in Image Retrieval

    New techniques and applications

·        Week 15: Visual IR II: Web Video Retrieval slides

    Challenges in Video retrieval

    Different Approaches in Image Retrieval

    New solutions in the Web age

·        Week 16: Social IR and Human Computation slides

    Social network and social media

    Information requests in the Web 2.0 environment

    Link analysis

    Social media retrieval models

    Social network: virtual society vs. human society


Homework assignments (subject to change)

Project (subject to change)

Students should select one of the following topics as the course project. Both the code and a working report need to be submitted via http://learn.tsinghua.edu.cn/ on the day they are due.


© 2012~2013 Min Zhang & Yiqun Liu @ Department of Computer Science and Technology, Tsinghua University
Last Updated: 2013/08/06