**19319510** INFORMATION RETRIEVAL

Pro-Seminar; Winter Semester 2018/19; 2 ECTS

by Prof Christoph Schommer

### What's it all about?

The Big Data hype has engulfed the digital world as well as our society. Technical innovations, scientific achievements (in particular in the field of Artificial Intelligence), and an insistently information-centric way of thinking have pushed data - and with it, information - to the fore more than ever before. This has consequences: for example, the demand for a deep understanding of data and information, the necessity to clearly differentiate between correlation and causality, and the need to identify appropriate technical means to make reliable information available whenever it is needed.

### Organisation

The course is a mixture of seminar and lecture and is held as a block course from **Monday, 19 November until Thursday, 22 November** as well as from **Monday, 26 November until Thursday, 29 November**. The **lecture takes place from 14h00 - 15h30**, whereas the **seminar presentation starts at 15h45** (= 45 minutes of presentation plus 15 minutes of discussion).

**LECTURE PART:**

**Monday, 15 October:** Course overview, presentation of the projects.

**Monday, 19 November:** What is Information Retrieval? Structured and unstructured data. The structure of a search engine. Distance measures (Minkowski metrics, cosine/angle, edit distance, Hamming distance, Jaccard index, and n-grams). Term-document incidence and Boolean retrieval: Boolean matrix, posting lists, and skip pointers. Inverted index list and the merge algorithm. Boolean query optimization.

**Tuesday, 20 November:** a) Selected aspects of Natural Language Processing: stop words, tokenization, lemmatization, stemming. The Porter stemmer (original paper from 1980 on suffix stripping). Edit distance, a dynamic programming approach to calculate the distance between strings. SoundEx, a coding algorithm based on phonetics. b) Use of hash tables and balanced trees to allow wildcard searches, for example Ber*. Use of a reverse balanced tree to answer wildcard requests like Be*in.

**Wednesday, 21 November:** Final remarks on a) wildcard searches with double wildcards (à la B*r*in) and b) positional indexes (posting lists with frequencies for the whole text collection as well as the individual positions in the documents). __Main topic of the day__: possibilities for query extension. Use of the Apriori algorithm (association discovery; data mining) to calculate associative rules from recorded user queries. Algorithmic idea: two phases, a) a rule production phase and b) a rule generation phase. ad a) iterations of join and pruning steps based on a minimum-frequency threshold (called min-support). ad b) use of the conditional probability against a minimum-conditional-probability threshold (called min-confidence).
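The two-phase Apriori procedure described above can be sketched in a few lines; the toy query log, the terms, and both thresholds below are invented purely for illustration:

```python
from itertools import combinations

# Toy log of recorded user queries; terms and thresholds are invented.
queries = [
    {"jaguar", "car"}, {"jaguar", "car", "price"},
    {"jaguar", "animal"}, {"car", "price"}, {"jaguar", "car", "price"},
]
min_support = 2       # minimum frequency of an itemset (min-support)
min_confidence = 0.6  # minimum conditional probability (min-confidence)

def support(itemset):
    """Number of queries that contain every term of the itemset."""
    return sum(1 for q in queries if itemset <= q)

# Phase a) rule production: find all frequent itemsets via join + pruning.
items = {t for q in queries for t in q}
frequent = {frozenset([t]) for t in items if support({t}) >= min_support}
all_frequent, k = set(frequent), 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = {c for c in candidates if support(c) >= min_support}  # pruning
    all_frequent |= frequent
    k += 1

# Phase b) rule generation: keep rules lhs -> rhs whose conditional
# probability support(lhs + rhs) / support(lhs) reaches min-confidence.
rules = []
for itemset in (s for s in all_frequent if len(s) > 1):
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = support(itemset) / support(lhs)
            if conf >= min_confidence:
                rules.append((set(lhs), set(itemset - lhs), conf))
```

On this toy log, for example, the rule {jaguar} -> {car} survives with confidence 3/4.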
Possible applications of the discovered associative rules: a) present a list with potential extensions __before__ the user submits the query, or b) present an extended query __after__ the user has indicated dissatisfaction with the result set.

**Thursday, 22 November:**

**Monday, 26 November:** Evaluation of search results: error, accuracy. More important: the relevance of retrieved documents, expressed by Precision (= P(relevant | retrieved)) and Recall (= P(retrieved | relevant)). The F-measure combines both statistical values; it is founded on the harmonic mean as well as on van Rijsbergen's effectiveness function E. **Remarks on PageRank** (= today's talk): from a random-walk perspective, we use a damping factor **d** (=> teleporting) and obtain the PageRank of an arbitrary page **j** as Pr(j) = **(1 - d)** + **d** * (**SUM_(i->j) Pr(i) * P(i->j)**), where P(i->j) is the probability of moving from node i to node j. With that, we can set up one equation per node. The values correspond to the eigenvalue problem Pr(.) = x * M^k, where x is the start vector and M the transition matrix. Consequence: convergence, independent of the vector we start with.

**Tuesday, 27 November:** Evaluation (continued): remarks concerning __a)__ van Rijsbergen's effectiveness function E_alpha and the F_beta-measure: E_alpha = 1 - F_beta, with beta = Recall/Precision and alpha = 1/(1 + beta^2). Reason: the gradients dE/dP and dE/dR, respectively, are equal exactly for beta^2 = (1 - alpha)/alpha. __b)__ Cohen's kappa statistics with P(agreement) and P(expected agreement).

**Wednesday, 28 November:** **Summary of the contents (= chapters 1-4, 6-10, 13).**

**Thursday, 29 November:** Individual final talks with the participants:

14h00 - 14h20: Mr Pfeifer, 14h25 - 14h45: Mr Bergmann, 14h50 - 15h10: Mr Nguyen, 15h15 - 15h35: Mr Chen, 15h40 - 16h00: Mr Casares.
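The evaluation measures from the lecture take only a few lines to compute; the retrieved set and the relevance judgments below are invented for illustration:

```python
# Toy result set and relevance judgments; document ids are invented.
retrieved = {1, 2, 3, 4, 5}
relevant = {1, 2, 6, 7}

tp = len(retrieved & relevant)           # relevant documents actually found
precision = tp / len(retrieved)          # = P(relevant | retrieved)
recall = tp / len(relevant)              # = P(retrieved | relevant)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
e = 1 - f1                               # van Rijsbergen's E for alpha = 1/2
```

Here precision is 2/5, recall is 2/4, and F1 is their harmonic mean 4/9; E is simply its complement.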

**SEMINAR PART:**

- A seminar presentation includes a presentation with or without slides.
- Each person gives exactly one talk.
- **The talks can be held either in German or in English.**

__TALKS:__

**Mr Pfeifer:** Index construction (Chapters 4.2 - 4.5; pp. 63-73). __Date__: **Tuesday, 20 November 2018, 15h45 - 16h30**

__Contents:__ Posting lists, blocked sort-based indexing, single-pass in-memory indexing, distributed indexing, MapReduce, dynamic indexing.
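The in-memory core of single-pass indexing, together with the merge algorithm for posting lists, can be sketched as follows (without the disk-block handling of the full algorithm); the three-document collection is invented:

```python
# Invented three-document collection for illustration.
docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

# Single pass over the collection: term -> sorted posting list of doc ids.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # doc ids arrive in order, so lists stay sorted

def intersect(p1, p2):
    """Merge algorithm: intersect two sorted posting lists in linear time."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

hits = intersect(index["home"], index["july"])  # documents matching home AND july
```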

**Mr Bergmann:** Parametric and zone indexes (Chapter 6.1; pp. 100-107). __Date__: **Wednesday, 21 November 2018, 15h45 - 16h30**

__Contents:__ Zones in texts, parametric and zone indexes, weighted zone scoring, learning the weights, finding an optimal weight g.
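Weighted zone scoring can be sketched in a few lines: a document's score is the g-weighted sum of Boolean matches per zone. The zone names, the weights, and the example document below are invented for illustration:

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Sum the weight g of every zone whose term set contains all query terms."""
    score = 0.0
    for zone, g in weights.items():
        if query_terms <= doc_zones.get(zone, set()):
            score += g
    return score

# Invented zone weights (they sum to 1) and an invented document.
weights = {"title": 0.5, "abstract": 0.3, "body": 0.2}
doc = {"title": {"information", "retrieval"},
       "body": {"information", "retrieval", "engines"}}

s = weighted_zone_score({"information", "retrieval"}, doc, weights)
```

Here the query matches the title and the body but not the (missing) abstract, so the score is 0.5 + 0.2 = 0.7.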

**Mr Nguyen:** PageRank (Chapters 21.1 - 21.3; pp. 421-433). __Date__: **Monday, 26 November 2018, 15h30 - 16h30**

__Contents:__ General idea: the web as a graph, dead-end nodes, teleporting, hubs and authorities, topic-specific PageRank, hyperlink-induced topic search, summary.
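A minimal power-iteration sketch of PageRank, following the lecture's formula Pr(j) = (1 - d) + d * SUM_(i->j) Pr(i) * P(i->j); the four-node link graph is invented for illustration:

```python
# Invented link graph: node -> list of outgoing links. Node 3 has no inlinks.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
d, n = 0.85, 4  # damping factor d models teleporting

pr = [1.0] * n
for _ in range(50):  # iterate until (near) convergence
    new = []
    for j in range(n):
        # P(i -> j) is 1/outdegree(i) for a random surfer on node i
        inflow = sum(pr[i] / len(links[i]) for i in links if j in links[i])
        new.append((1 - d) + d * inflow)
    pr = new
```

Node 2, which is linked from every other node, ends up with the highest rank, while the unlinked node 3 stays at the teleport baseline 1 - d = 0.15; as noted above, the result is independent of the start vector.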

**Mr Chen:** A Vector Space Model for XML Retrieval and Result Evaluation (Chapters 10.3 - 10.4; pp. 188-196). __Date__: **Tuesday, 27 November 2018, 15h05 - 16h20**

__Contents:__ Structured and unstructured data, XML concepts, challenges in XML retrieval, a vector space model for XML retrieval, evaluation of XML retrieval. Cumhuriyet example.

**Mr Casares:** Text Classification and Naive Bayes (Chapters 13.1 - 13.2; pp. 234-243). __Date__: **Wednesday, 28 November 2018, 14h00 - 15h00**

__Contents:__ The text classification problem, Naive Bayes text classification: relation to the multinomial unigram language model, the Naive Bayes algorithm, Laplace correction/smoothing.
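Multinomial Naive Bayes with Laplace (add-one) smoothing fits in a short sketch; the tiny two-class training set below is invented here in the spirit of the book's China example:

```python
import math
from collections import Counter

# Invented tiny training set: (class label, document text).
train = [
    ("china", "chinese beijing chinese"),
    ("china", "chinese chinese shanghai"),
    ("china", "chinese macao"),
    ("other", "tokyo japan chinese"),
]

classes = {c for c, _ in train}
vocab = {w for _, text in train for w in text.split()}
prior = {c: sum(1 for cc, _ in train if cc == c) / len(train) for c in classes}
counts = {c: Counter(w for cc, t in train if cc == c for w in t.split())
          for c in classes}

def predict(text):
    """Return argmax_c log P(c) + sum_w log P(w | c), with add-one smoothing."""
    best, best_score = None, -math.inf
    for c in classes:
        total = sum(counts[c].values())          # token count of class c
        score = math.log(prior[c])
        for w in text.split():
            if w in vocab:                       # ignore unseen words
                score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```

With add-one smoothing, even a test document containing the unseen-in-class words "tokyo japan" is still assigned to "china" when "chinese" dominates it.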

**Evaluation**

The **seminar presentation counts for 2/3** of the mark whereas **1/3 of the mark is from an oral examination (Thursday, 29 November 2018)**.

### Aims and Goals

The course aims at the exchange of contents and experiences in the field of **Information Retrieval**. Every participant should be able to understand and to critically question existing methods, but also to realize their own concepts and ideas.

### Selected References

**MAIN REFERENCE:** C. Manning, P. Raghavan, H. Schütze: *Introduction to Information Retrieval*. Cambridge University Press. See also the companion website at Stanford that accompanies the book: **https://nlp.stanford.edu/IR-book/information-retrieval-book.html**

- S. Büttcher, C. L. A. Clarke, G. V. Cormack: *Information Retrieval: Implementing and Evaluating Search Engines*. MIT Press.
- R. Baeza-Yates, B. Ribeiro-Neto: *Modern Information Retrieval*. ACM Press Books.

### Additional Information

- The topic assignment took place between 15 and 22 October 2018.
**Contact**: christoph.schommer @ {fu-berlin.de ; uni.lu}