IR Basics

I’m relating what I learn here, to what I am applying to iQ. Here I’m reiterating some key terms; follows by a definition, an IR Book example, and an iQ example.

  • Information need
  • Query
    • How the user communicates their information need to a computer
      • IR: Westlaw Search Engine: disab! /p access! /s work-site work-place (employment /3 place)
      • iQ: …
  • Document
    • Where we search; or “whatever size of unit we build the retrieval system over”
      • IR: A Shakespeare play. Or, a chapter. Or, a memo.
      • iQ: An iXBRL document. Or, a value. Or, an element.
        • Any opinions here?
    • A Document Collection is many Documents
  • Term
    • What the user might be searching for in a Document
      • IR: A word, “Friends”
      • iQ: An element, “Cash”. Or…
  • Boolean retrieval model
    • A system supporting a query in a Boolean expression of terms: and, or, not
      • IR: friends AND romans AND (NOT countrymen)
      • iQ: name(‘cash’).or().name(‘equivalent’).and().date(‘>2009’)
      • NOTE:
        • And, aka Conjunctive, aka Intersection
        • Or, aka Disjunctive, aka Union
  • Linear scanning
    • An approach to answering a query: search through all text (“linearly”) in a document collection to find the term the user wants
      • IR: Friends, Romans, countrymen ….
      • iQ: Income: $500, $600, £1,000, £1,200…
    • Works on a small scale. Called grepping on a *nix system; appropriate in case of “Find” commands, etc.
  • Index
    • A different approach to answering a query, creates a lookup before executing the query; instead of searching all text, search a smaller data-structure
      • IR: “Friends” appears in “Julius Caesar”
      • iQ “$” is used on “500”
  • Incidence Matrix
    • A specific type of index; a (2-D) table which says if a Term (row) exists in a Document (column), then the intersection of that row holds the value of 1; if the term doesn’t exist there, a 0
      • IR: All of Shakespeare’s plays are columns (Documents), All of the terms in his plays ; “Friends”, “Romans” are rows (Terms)
      • iQ: All of the Values are columns (Documents), All of the Elements (“Cash”, “Assets”) are rows (Terms)
      • This table is (number of values ) x (number of unique elements); this is a big table!
      • Here is an example.
      • Problematic: Notice the wasted “0”s. This is a “sparse” array.
  • Inverted Index
    • Another specific type of index; solves the “sparse” problem of an Incidence Matrix., aka “Inverted File”
      • All of the terms are still the Rows.
      • Conveniently works with the common data type “Dictionary” (Javascript object)
      • but instead of holding a “1” or else a “0” for All Documents…
      • it only stores the “1”‘s
      • *The term “inverted index” is admittedly “redundant”; It map terms (keys) back to the part of the document where they occur (values)
      • I think of the “uninverted index” as the map of the parts of the document (keys) to all its different terms (values)
      • Here is an example of the “uninverted index” for part of the example in Javascript/JSON format (which is conveniently Python format); mapping parts of the document (keys) to its different terms (values)
        • { “500”: “Cash”,
        •   “600”: “Cash”,
        •  “1000”: “Cash”,
        •  “2000”: “Cash”,
        •  “500” : “Assets”,
        •  “550” : “Assets”
        •  “500”: “Liabilities” … }
      • And here is the “inverted index”; mapping the terms (keys) to their location in the document (value); “inverting” the keys and index of the above;
        • { “Cash”: [“500”, “600”, “1000”, “2000”],
        •  “Assets”: [“500”, “550”],
        • “Liabilities”: [“500”]  …}
          • Like { “friends”: [“Julius Caesar”, “Midsummer Night’s Dream”], “Romans”: [“Julius Caesar”] … }
      • Another example of an inverted index:
  • Vocabulary:
    • The set of terms that represents “keys” in the Dictionary used for an inverted index
      • imagine calling Object.keys on the Javascript object above…
        • iQ: [“Cash”, “Assets”, “Liabilities”]
        • IR: [“friends”, “Romans”, “countrymen”]
        • A wonderful tool I could use to populate a datalist-style Intellisense  (think: “autocomplete”) when the user types:
        • iQ.element(‘Ca…
          • Cash
          • CashAndCashEquivalents
          • CashSettlementValueDerivatives
  • Posting / Postings
    • An item(s) representing a document, or a position in a document.
      • Imagine calling dictionary.values() on the Dictionary above, in Python…
        • iQ: [“500”, “600”, “1000”,  … ]
        • IR: [“Julius Caesar”, “Midsummer Night’s Dream”…]
    • Postings List
      • The list for a given key; Imagine calling dictionary[“Assets”]
        • iQ: [“500”, “550”]
        • IR: dictionary[“friends”] = [“Julius Caesar”, “Midsummer Night’s Dream” ]
  • Sorting and Grouping
    • Other (meta)data stored with the postings list:
    • Document Frequency
      • How often the term appears across all documents:
      • iQ: dictionary[“Cash”].length
      • IR: dictionary[“Romans”].length
    • Term Frequency
      • How often a term appears in a specific document:
      • iQ: Not applicable when I use each iXBRL “value” as a document; but if the entire iXBRL document were a document; the appearance of the element “Cash” within specific documents; i.e. used 10 times in the Google iXBRL (see above, choice of Document…)
      • IR: How often “Romans” was used in “Julius Caesar”
    • Position
      • Where position is located
      • iQ: In the entire-iXBRL-document-as-document model, whether “Cash ” appeared clustered together, or near the start of a document
      • IR: in thep lay “Julius Caesar”, whether “Romans” was only used once in the very beginning
  • Misc
    • “Vocabulary” can be stored in memory, but “Posting Lists /Postings” get large, and are stored on disk
    • The latter can use linked-lists or variable-length arrays; the latter is
    • Posting Lists are often sorted by their doc ID; in iXBRL, this would be the ID I assign them as I traverse them; and DOM Query Selectors return results in the order they appear in the document. the principal of sorted Doc IDs is required for the merging algorithm defined in the IR book
    • Indexing doesn’t improve “theta time complexity”, over a linear scan but helps by a huge constant.
    • Precision :
      • vs Recall; So elegantly stated: “What fraction of the returned results are relevant to the information need” In other words, are the results accurate.
        • General: Search for “mars” and get mostly results about Bruno Mars.
        • iQ:  Search for “cash on hand” receive results about “handover”
    • Recall :
      • vs Precision; “What fraction of the relevant documents in the collection were returned by the system?” In other words, are the results comprehensive.
    • vs Precision; “What fraction of the relevant documents in the collection were returned by the system?” In other words, are the results comprehensive

        • General: Search for “mars” and results don’t include a large number of sites dedicated to the planet Mars.
        •  iQ: Search for “cash on hand” , results don’t include “cash and cash equivalents

Now I want to model how my Workers should do their business..


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: