19319510 INFORMATION RETRIEVAL
Pro-Seminar; Winter Semester 2018/19; 2 ECTS
by Prof Christoph Schommer
What's it all about?
The Big Data hype has engulfed the digital world as well as our society. Technical innovations, scientific achievements (in particular in the field of Artificial Intelligence), and an insistent information-centric mindset have pushed data - and with it, information - to the fore more than ever before. This has consequences: for example, the demand for a deep understanding of data and information, the necessity to clearly differentiate between correlation and causality, and the need to identify appropriate technical means to make reliable information available whenever it is needed.
Organisation
The course is a mixture of seminar and lecture and is held as a block course from Monday, 19 November until Thursday, 22 November, as well as from Monday, 26 November until Thursday, 29 November. The lecture takes place from 14h00 to 15h30, whereas the seminar presentation starts at 15h45 (45 minutes of presentation plus 15 minutes of discussion).
LECTURE PART:
- Monday, 15 October: Course overview, Presentation of the projects.
- Monday, 19 November: What is Information Retrieval? Structured and Unstructured data. The structure of a Search Engine. Distance measures (Minkowski Metrics, cosine/angle, Edit distance, Hamming distance, Jaccard-index and n-grams). Term-Document incidence and Boolean Retrieval: Boolean Matrix, Posting Lists, and skip pointers. Inverted Index List and the Merge algorithm. Boolean Query Optimization.
- Tuesday, 20 November: a) Selected aspects of Natural Language Processing: stop words, tokenization, lemmatization, stemming. Porter Stemmer (original paper from 1980 on suffix stripping). Edit distance, a dynamic-programming approach to calculating the distance between strings. SoundEx, a coding algorithm based on phonetics. b) Use of hash tables and balanced trees to allow a wildcard search, for example Ber*. Use of a reverse balanced tree to answer wildcard requests like Be*in.
- Wednesday, 21 November: Final remarks on a) wildcard searches with double wildcards (à la B*r*in) and b) positional indexes (posting lists with frequencies for the whole text collection as well as the individual positions in the documents). Main topic of the day: the possibility of a query extension. Use of the Apriori algorithm (association discovery; data mining) to calculate associative rules based on recorded user queries. Algorithmic idea: 2 phases, with a) a Rule Production Phase and b) a Rule Generation Phase. ad a) iterations of join and pruning steps based on a minimum-frequency threshold (called min-support). ad b) usage of the conditional probability against a minimum conditional-probability threshold (called min-confidence). Possible applications of the discovered associative rules: a) present a list with potential extensions before the user submits the query, or b) present an extended query after the user has indicated dissatisfaction with the result set.
- Thursday, 22 November: Ranking, options 1 and 2 with the Jaccard index and the parametrical weighting approach (see yesterday's talk): both work but are insufficient since frequencies are not taken into account! A better way: a) use a combination of term frequency and inverse document frequency to capture both quantity (the more often a term appears in a document, the more similar the document) and rarity (the rarer a term, the more similar a document is if it contains this term). Apply the logarithm to dampen the effect. b) To calculate, transfer each document and each query into the k-dimensional vector space (k terms define the axes; the tf.idf values are the coordinates). Use the cosine to find a similarity by projecting the vectors onto the unit sphere via length normalisation. Examples: 3 of Jane Austen's masterpieces and 4 selected query terms.
- Monday, 26 November: Evaluation of search results: error, accuracy. More important: relevance of retrieved documents, expressed by Precision (= P(relevant | retrieved)) and Recall (= P(retrieved | relevant)). The F-measure combines both statistical values: it is founded on the harmonic mean as well as on van Rijsbergen's effectiveness function E. Remarks on PageRank (= today's talk): from a random-walk perspective, we use a damping factor d (=> teleporting) and obtain the PageRank of an arbitrary page j as Pr(j) = (1 - d) + d * ( SUM_(i->j) Pr(i) * P(i->j) ), where P(i->j) is the probability of going from node i to node j. With that, we can establish a system of equations, depending on the number of nodes concerned (worked example). The values are close to the eigenvalue problem Pr(.) = x*M^k, where x is the start vector and M the transition matrix. Consequence: convergence, independently of the vector we start with.
- Tuesday, 27 November: Evaluation (cont.): Remarks concerning a) van Rijsbergen's effectiveness function E_alpha and the F_beta-measure: E_alpha = 1 - F_beta, with beta = Recall/Precision and alpha = 1/(1 + beta^2). Reason: the gradients dE/dP and dE/dR are equal exactly for beta^2 = (1 - alpha)/alpha. b) Cohen's kappa statistic with P(agreement) and P(expected agreement).
- Wednesday, 28 November: Summary of the contents (= chapters 1-4, 6-10, 13).
- Thursday, 29 November: Individual final talk with the participants:
14h00 - 14h20: Mr Pfeifer, 14h25 - 14h45: Mr Bergmann, 14h50 - 15h10: Mr Nguyen, 15h15 - 15h35: Mr Chen, 15h40 - 16h00: Mr Casares.
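The merge algorithm for intersecting two sorted posting lists (Monday, 19 November) can be sketched as follows; the document IDs are invented for illustration:

```python
def intersect(p1, p2):
    """Merge two sorted posting lists, keeping doc IDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller current doc ID
        else:
            j += 1
    return answer

# Example: postings for two query terms (hypothetical doc IDs)
print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54, 101]))  # [2, 31]
```

Skip pointers would let the `else` branches jump several positions at once; the linear merge above is the baseline they accelerate.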
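The dynamic-programming edit distance from Tuesday's lecture, in a minimal form with unit costs for insertion, deletion, and substitution:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```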
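Wednesday's query-extension idea can be illustrated with a minimal Apriori sketch restricted to term pairs; the query log and both thresholds below are made up:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(queries, min_support, min_confidence):
    """Tiny Apriori sketch: frequent terms -> frequent pairs -> rules a => b."""
    # support of a term = number of queries containing it
    term_counts = Counter(t for q in queries for t in set(q))
    frequent_terms = {t for t, c in term_counts.items() if c >= min_support}
    # join step: only pairs of frequent terms can be frequent
    pair_counts = Counter()
    for q in queries:
        for a, b in combinations(sorted(set(q) & frequent_terms), 2):
            pair_counts[(a, b)] += 1
    rules = []
    for (a, b), c in pair_counts.items():
        if c < min_support:
            continue  # pruning step
        if c / term_counts[a] >= min_confidence:
            rules.append((a, b))  # rule a => b
        if c / term_counts[b] >= min_confidence:
            rules.append((b, a))  # rule b => a
    return rules

log = [["jaguar", "car"], ["jaguar", "car", "price"], ["jaguar", "zoo"]]
print(apriori_pairs(log, min_support=2, min_confidence=0.6))
```

A discovered rule such as ("jaguar", "car") could then be offered to the user as a query extension.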
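Thursday's tf.idf weighting with log damping and cosine ranking, as a small sketch; the toy term counts are invented and do not come from Austen's actual texts:

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, df, N, vocab):
    """Coordinates in the k-dim vector space: log-dampened tf times idf."""
    tf = Counter(doc_terms)
    return [(1 + math.log10(tf[t])) * math.log10(N / df[t]) if tf[t] else 0.0
            for t in vocab]

def cosine(u, v):
    """Length-normalised dot product (projection onto the unit sphere)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "SaS": ["affection", "affection", "jealous"],
    "PaP": ["affection", "jealous", "jealous"],
    "WH":  ["wuthering"],
}
N = len(docs)
vocab = sorted({t for d in docs.values() for t in d})
df = {t: sum(t in d for d in docs.values()) for t in vocab}
vecs = {name: tfidf_vector(d, df, N, vocab) for name, d in docs.items()}
query = tfidf_vector(["affection", "jealous"], df, N, vocab)
ranked = sorted(vecs, key=lambda n: cosine(query, vecs[n]), reverse=True)
print(ranked)  # "WH" shares no term with the query and ranks last
```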
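The PageRank equation from Monday, 26 November, Pr(j) = (1 - d) + d * SUM_(i->j) Pr(i) * P(i->j), can be iterated directly until convergence; the three-page graph below is hypothetical:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate Pr(j) = (1 - d) + d * sum over i->j of Pr(i) / outdeg(i)."""
    nodes = list(links)
    pr = {n: 1.0 for n in nodes}      # convergence is independent of the start vector
    for _ in range(iters):
        new = {}
        for j in nodes:
            s = sum(pr[i] / len(links[i]) for i in nodes if j in links[i])
            new[j] = (1 - d) + d * s
        pr = new
    return pr

# Hypothetical 3-page web graph: A links to B and C, B to C, C back to A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```

Here P(i->j) = 1/outdeg(i), i.e. the random surfer follows each outgoing link with equal probability; page C collects the most rank because everything eventually flows into it.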
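Precision, Recall, and the F_beta-measure from the evaluation lectures, with E_alpha = 1 - F_beta; the retrieved and relevant sets are invented:

```python
def precision_recall(retrieved, relevant):
    """P = P(relevant | retrieved), R = P(retrieved | relevant)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    return p, r

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of P and R; the plain F1 for beta = 1."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

p, r = precision_recall({1, 2, 3, 4}, {2, 3, 5})
print(p, r, f_beta(p, r))   # van Rijsbergen's E_alpha is 1 - f_beta(p, r)
```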
SEMINAR PART:
- A seminar talk may be given with or without slides.
- Each person gives exactly one talk. Talks can be held in either German or English.
TALKS:
- Mr Pfeifer: Index construction (Chapters 4.2 - 4.5; pp. 63-73). Date: Tuesday, 20 November 2018, 15h45 - 16h30
Content: Posting lists, blocked sort-based indexing, single-pass in-memory indexing, distributed indexing, MapReduce, dynamic indexing.
- Mr Bergmann: Parametric Zone Indexes (Chapter 6.1; pp. 100-107). Date: Wednesday, 21 November 2018, 15h45 - 16h30
Content: Zones in texts, parametric zone indexes, weighted zone scoring, learning the weights, finding an optimal weight g.
- Mr Nguyen: PageRank (Chapters 21.1 - 21.3; pp. 421-433). Date: Monday, 26 November 2018, 15h30 - 16h30
Content: General idea: the web as a graph, dead-end nodes; teleporting; hubs and authorities; topic-specific PageRank; Hyperlink-Induced Topic Search (HITS); summary.
- Mr Chen: A Vector Space Model for XML Retrieval and Result Evaluation (Chapters 10.3 - 10.4; pp. 188-196). Date: Tuesday, 27 November 2018, 15h05 - 16h20
Content: Structured and unstructured data, XML concepts, challenges in XML retrieval, a VSM for XML retrieval, evaluation of XML retrieval. Cumhuriyet example.
- Mr Casares: Text Classification and Naive Bayes (Chapters 13.1 - 13.2; pp. 234-243). Date: Wednesday, 28 November 2018, 14h00 - 15h00
Content: The text classification problem, Naive Bayes text classification: relation to the multinomial unigram language model, the Naive Bayes algorithm, Laplace correction/smoothing.
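As a companion to the Naive Bayes talk, a minimal multinomial Naive Bayes classifier with Laplace (add-one) smoothing; the tiny training corpus is an invented toy example in the spirit of the textbook's chapter 13:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (term_list, class_label) pairs."""
    classes = Counter(c for _, c in docs)
    vocab = {t for terms, _ in docs for t in terms}
    cond = {}
    for c in classes:
        counts = Counter(t for terms, cc in docs if cc == c for t in terms)
        total = sum(counts.values())
        # Laplace smoothing: add 1 to every count, |vocab| to the denominator
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    prior = {c: n / len(docs) for c, n in classes.items()}
    return prior, cond, vocab

def classify(terms, prior, cond, vocab):
    """Pick the class maximising log P(c) + sum of log P(t | c)."""
    scores = {c: math.log(prior[c]) +
                 sum(math.log(cond[c][t]) for t in terms if t in vocab)
              for c in prior}
    return max(scores, key=scores.get)

train = [("Chinese Beijing Chinese".split(), "china"),
         ("Chinese Chinese Shanghai".split(), "china"),
         ("Chinese Macao".split(), "china"),
         ("Tokyo Japan Chinese".split(), "not-china")]
prior, cond, vocab = train_nb(train)
print(classify("Chinese Chinese Chinese Tokyo Japan".split(),
               prior, cond, vocab))  # china
```

Summing log probabilities instead of multiplying the raw probabilities avoids floating-point underflow on longer documents.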
Evaluation
The seminar presentation counts for 2/3 of the mark, whereas the remaining 1/3 comes from an oral examination (Thursday, 29 November 2018).
Aims and Goals
The course aims at the exchange of contents and experiences in the field of Information Retrieval. Every participant should be able to understand and to critically question existing methods, but also to realise their own concepts and ideas.
Selected References
- MAIN REFERENCE: C. Manning, P. Raghavan, H. Schütze: Introduction to Information Retrieval. Cambridge University Press. See also the companion website at Stanford to follow the book: https://nlp.stanford.edu/IR-book/information-retrieval-book.html
- S. Büttcher, C. L. A. Clarke, G. V. Cormack: Information Retrieval: Implementing and Evaluating Search Engines. MIT Press.
- R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval. ACM Press Books.
Additional Information
- The topics were assigned between 15 and 22 October 2018.
- Contact: christoph.schommer @ {fu-berlin.de ; uni.lu}