Sunday, November 12, 2006

Merge vs Hashing

Last semester I did search engine project. That included the implementation of a search engine and a report where I had to do some theoretical analysis of the problem and describe my methodology and implementation. My previous post about this project could be found here.

I went to the exam and I got about 8(with base 5 and maximum 10). I didn't like the grade and the external examiner wanted to defend his opinion. So, he told me why I didn't get a higher grade. Let me explain the problem first.

Suppose that we have two sorted tables with integer values. At one part of the implementation I had to write an algorithm to find all the common integers at the two tables.

What I did, was something like the implementation of Merge algorithm. That algorithm is used in Merge sort in order to take two sorted tables and combine(merge) them into one, that will still be sorted. The complexity of this algorithm is O(n), where n is the total number of the elements(the sum of the lengths of the 2 tables).

What he said that I should have done in order to make it faster, is to use hash tables. What he suggested was that I can use hash tables instead of arrays, then check which one is the smaller, then take it's elements one by one and check if they exist at the other table. That will result at maximum half "operations" comparing to the Merge algorithm, because it will check only the n/2 elements and each check can be done in O(1) since we use Hash tables. The complexity in terms of big O notation is still O(n), but if we want to be a bit more precisely, we could say O(n/2).

Unfortunately for me, I didn't know much about branch predictions at that time and I couldn't defend my algorithm, which seems slower at first glance. But since they are both linear complexity algorithms, one could go more into comparisons.

Now, at my thesis, I am working again with these, and I can show by experiments and by theory based on branch predictions that my algorithm was faster. The cpu's architecture benefits predicable loops instead of accessing random blocks at the memory.

Below I have a graph of the results of some of my experiments.
The first raw of the table is the number of elements at each table(1.000,2.000..)
The numbers in the table is the time in milliseconds that each algorithm spent to find the amount of common elements.

1 comment:

Nikos said...

Ωραία πράματα που όμως δεν καταλαβαίνω...χεχεχε