Indexing by ID

I have a new strategy for iQ. Inspired by Information Retrieval, HTML5 WebWorkers, and Set objects. Here’s the problem:

How many things will the user search on?

iQ will use the principle of “tranference”, by which I mean; every characteristic of a value’s characteristic, also applies to the value. For instance, a value 5 uses the element Cash. Cash is a credit. By transference, the value 5 is a credit. This means the user could search on any characteristic.

How can I maximize the computer’s performance?

Using multiple processes. The basic idea is: how can I do many things at once. I have read only a bit about threading. HTML5 offers true multi-tasking with WebWorkers.

What kind of object should the filter search?

WebWorkers are created by a Main Thread, but WebWorkers cannot access the DOM of the Main Thread. In previous posts I’ve discussed storing and filtering the native  Node/Element object representing each of the iQ DOM Nodes. But DOM objects can’t be passed around; WebWorkers can only communicate in terms of JSON objects. I read some of the first chapter of Information Retrieval. It presents the idea of the inverted index. In the picture below, the user is…

  1. searching with keykword inputs, like “Brutus AND Caesar”
    1. Notice how keywords correspond to the “Dictionary” below
  2. and expecting document outputs, like “Document 1”, “2”, and “4”
    1. Notice how documents correspond to the “Postings” below

Inverted Index.

iQ has similar use-cases;

  • the user searches by element name input (i.e. “Cash”)
  • and expects value output (i.e. all the values tagged with “Cash”)

…or the same, except with “end of year 2013”, or “Credit values” Generalize it;

  • The Dictionary represents all the key ways the user searches for input; in our case, Element Name, Date, Balance Type, etc (see transference, above)
  • The Postings represent identifiers for the documents.

If I could transfer a JSON object representing the iXBRL instance document to a WebWorker, a separate WebWorker could create an Index (Dictionary + Postings… essentially a JSON object), like above; Each separate WebWorker could do it  simultaneously; and wait for me to send functions by which this could be filtered. This JSON object has keys (aka properties, aka Dictionary), representing the characteristic the user is looking for;

  • string for Element Name
  • date for Start Date
  • date for End Date
  • string for Prefix
  • string for Unit
  • etc…

And each value (aka Posting) is an array (aka list) of results which represent our documents. But I can’t pass DOM Nodes to WebWorkers. How can I represent documents? I know! I can’t pass around DOM node objects themselves; but I can pass things (strings) which represent them! I’ll modify the DOM with the Main Thread while preprocessing; I’ll change each Node with a string, so I can use document.querySelectorAll (see previous post in which the advantages are broached) with this string and some key to connect:

  • from a flattened list of these strings… (sent back from WebWorkers once they process iQ Queries)
  • into a list of the iXBRL Nodes they represent… (when the Main Thread does the “trivial/thoughtless” work of selecting the Nodes using these strings.

This idea presumes many things. I want to test any assumptions I make. But I put the cart before the horse and I want to figure out how best to do the last part of the solution; That is, what string to use to identify Nodes; so I can most-efficiently/most-quickly convert a list of these JSON-passable strings (to and from WebWorkers) into actual DOM Nodes (once the Main Thread receives them again.) Bingo; IDs! I conducted four tests; I iterated through all iXBRL Nodes in Massive Dynamic, and I changed each of them in all of four ways;

  1. I gave each a unique ID
  2. I gave each a the same data-attribute (data-id), with unique values
  3. I gave them all the same data-attribute (data-iq-result)
  4. I gave them all the same class (‘iq-result’)

For each of the four ways I modified the iXBRL Nodes, I added to an array that represented a selector that could be used with document.querySelectorAll. For example:

  1. [‘#iQ_1’, ‘#iQ_2’, ‘#iQ_3’, ‘#iQ_4’…]
    1. (Many) ID selectors; ONE for EACH iXBRL value
  2. [‘[data-id=”iQ_1″]’, ”[data-id=”iQ_2″]’, ‘[data-id=”iQ_3]’, ‘[data-id=”iQ_4″]’,…]
    1. (Many) Attribute-Value Match selectors; ONE for EACH iXBRL value
  3. [‘[data-iq-result]’]
    1. (One) Attribute Match selector for ALL iXBRL values
  4. [‘.iq-result’]
    1. (One) Class selector for ALL iXBRL values

Then I ran each of these through document.querySelectorAll 10 times (each of them return the same NodeList; all 533 iXBRL values in the Massive Dynamic example) and measured the performance. Here are the results (in milliseconds):

  1. 219.1 (using IDs)
  2. 454.8 (using unique attributes values)
  3. 000.6 (using document.getElementsByClassName, with a single string of ‘className’)
    1. 005.7 (using document.querySelectorAll, with an array of ‘.className’ )
  4. 007.5 (using unique attributes)

The problem with approaches 3 and 4: I cannot distinguish among Nodes. It’s “ALL” or nothing. How useful is that? Not very useful at all. But it was interesting to see how lightning-fast document.getElementsByClassName operated.

  • Specifically; it took 5 ms the first time,
  • then 1 ms the second time,
  • then 0ms the remaining 8 times.

Maybe there is some caching involved! So now, among the first 2 approaches; the fastest way to take (each) unique strings and find their correspond Nodes? Between using a data-attribute and an ID, an ID selector is almost 2x as fast. Unfortunately, there is no such thing as document.getElementsById; in plural (only getElementById; the singular). But the ‘#idName’ selector if pretty fast, nonetheless. It takes 533 strings and produces. Now it’s a matter of creating the lists of these ID strings with a combination of UNION and INTERSECTION of the “Dictionaries” of indexes for Elements, Start Dates, End Dates, Units, etc…

Update 1

Holy cats, after wishful thinking above  (that I could use document.getElementByIds), I tried the next best thing; using map to “pluralize” document.getElementById. Specifically:

  • nodeList=idSelectorArray.map(document.getElementById) });
  • console.log(nodeList.length); //533
  • Time: 001.0ms! That’s right, just a single milisecond on average.
    • That’s 200x faster than using document.querySelectorAll([‘#iQ_1’, ‘#iQ_2’…]);

Cool!

Update 2

Did some searching and created my first jsperf to find the best tool for UNION and INTERSECTION;

To be clear, no set of “Properties” would ever have duplicate value IDs; a value is only processed once in relation to each characteristic. But I would UNION or INTERSECT multiple properties (i.e. multiple arrays of value IDs); if user searched for *Cash, and multiple terms contained cash.

And then again with a set of value IDs for the date match.

Depending on the user of AND() or OR().

Seems like lodash is the winner. (Does d3 have something I could use?)

Also stumbled on this which made me rethink the “Native is best” ideology.

Update 3

So I’m leaning towards assigning a series of IDs to the.

Some considerations:

What if an iXBRL author uses an ID of their own?

  • I could displace it (put the original @id into @data-id; and replace @id with my new iQ id)
  • Or I could use theirs whenever they’ve got one; it’s unlikely IDs would conflict (remember IDs must be unique)

How is document.getElementById so fast?

  • The browser is probably caching each of those objects.
  • It seems I want exactly that — but it probably comes at the cost of some caching memory. So it might consume more RAM? Processing power? Not sure which/what…
Advertisements

One thought on “Indexing by ID

  1. […] moving forward with the Indexing idea. And I’m learning a lot in the process. Here I’ve summarized a few […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: