1) The results may be varying, but in general I would expect to see that the hashtable provides better performance. Here are the results I got using Java with the standard API: You should see a difference around 2-3 times faster. For operations1 file I get: tree : 10.5 Seconds hash : 3.0 Seconds For operations2 file I get: tree : 6.7 Seconds hash : 3.0 Seconds ==================================================================================== 2) Below are the queries and sample timings. ------------------------- Regular Data set without indexes: a) mysql> select count(*) from Employee as E, Employee as M where E.Manager = M.Name and (E.Salary - M.Salary) >= 5000; +----------+ | count(*) | +----------+ | 3573 | +----------+ 1 row in set (56.79 sec) b) mysql> create temporary table temp select department, count(distinct name) as "NumOfStudents" from Employee, Course where name = student group by department; Query OK, 8 rows affected (30.65 sec) Records: 8 Duplicates: 0 Warnings: 0 mysql> select emp.department, avg(emp.salary) "Avg Salary" from Employee as emp where emp.department in (select department from temp where numOfStudents > 1) group by emp.department; +------------+------------+ | department | Avg Salary | +------------+------------+ | Finance | 65577.1689 | | Hardware | 63690.6934 | | HR | 65499.1119 | | Marketing | 65273.7196 | | QA | 64742.1731 | | Sales | 64492.2232 | | Software | 66089.9083 | | Switch | 63754.7826 | +------------+------------+ 8 rows in set (0.24 sec) c) select avg(Salary) from Employee as E, Course as C where E.name = C.student and C.prof = "Roe"; +-------------+ | avg(Salary) | +-------------+ | 64032.7869 | +-------------+ 1 row in set (7.65 sec) Index creation: mysql> create index NameIndx on Employee (Name); Query OK, 10000 rows affected (0.57 sec) Records: 10000 Duplicates: 0 Warnings: 0 mysql> create index ManagerIndx on Employee (Manager); Query OK, 10000 rows affected (0.45 sec) Records: 10000 Duplicates: 0 Warnings: 0 mysql> create index StudentIndx on Course (Student); Query OK, 6519 rows affected (0.33 sec) Records: 6519 Duplicates: 0 Warnings: 0 mysql> create index ProfIndx on Course (Prof); Query OK, 6519 rows affected (0.49 sec) Records: 6519 Duplicates: 0 Warnings: 0 *** There is also an "alter table ..." syntax for adding indexes. Both are acceptable. Other dbs may have a different syntax too. With index Regular data set: a) mysql> select count(*) from Employee as E, Employee as M where E.Manager = M.Name and (E.Salary - M.Salary) >= 5000; +----------+ | count(*) | +----------+ | 3573 | +----------+ 1 row in set (0.46 sec) b) mysql> create temporary table temp select department, count(distinct name) as -> "NumOfStudents" from Employee, Course where name = student group by -> department; Query OK, 8 rows affected (0.33 sec) Records: 8 Duplicates: 0 Warnings: 0 mysql> select emp.department, avg(emp.salary) "Avg Salary" from Employee as emp -> where emp.department in (select department from temp where -> numOfStudents > 1) group by emp.department; +------------+------------+ | department | Avg Salary | +------------+------------+ | Finance | 65577.1689 | | Hardware | 63690.6934 | | HR | 65499.1119 | | Marketing | 65273.7196 | | QA | 64742.1731 | | Sales | 64492.2232 | | Software | 66089.9083 | | Switch | 63754.7826 | +------------+------------+ 8 rows in set (0.17 sec) mysql> select avg(Salary) from Employee as E, Course as C where E.name = C.student and C.prof = "Roe"; +-------------+ | avg(Salary) | +-------------+ | 64032.7869 | +-------------+ 1 row in set (0.10 sec) --------------------------------- Query results with "large" data set: a) mysql> select count(*) from Employee as E, Employee as M where E.Manager = M.Name and (E.Salary - M.Salary) >= 5000; +----------+ | count(*) | +----------+ | 35483 | +----------+ b) mysql> create temporary table temp select department, count(distinct name) as -> "NumOfStudents" from Employee, Course where name = student group by -> department; Query OK, 8 rows affected mysql> select emp.department, avg(emp.salary) "Avg Salary" from Employee as emp -> where emp.department in (select department from temp where -> numOfStudents > 1) group by emp.department; +------------+------------+ | department | Avg Salary | +------------+------------+ | Finance | 64280.0580 | | Hardware | 64416.1395 | | HR | 64902.0766 | | Marketing | 64404.5612 | | QA | 64551.0241 | | Sales | 64312.5170 | | Software | 64315.6083 | | Switch | 64696.4572 | +------------+------------+ c) mysql> select avg(Salary) from Employee as E, Course as C where E.name = C.student and C.prof = "Roe"; +-------------+ | avg(Salary) | +-------------+ | 64074.5447 | +-------------+ --------------------------------- Query results with "huge" data set: a) mysql> select count(*) from Employee as E, Employee as M where E.Manager = M.Name and (E.Salary - M.Salary) >= 5000; +----------+ | count(*) | +----------+ | 356527 | +----------+ b) mysql> create temporary table temp select department, count(distinct name) as -> "NumOfStudents" from Employee, Course where name = student group by -> department; Query OK, 8 rows affected mysql> select emp.department, avg(emp.salary) "Avg Salary" from Employee as emp -> where emp.department in (select department from temp where -> numOfStudents > 1) group by emp.department; +------------+------------+ | department | Avg Salary | +------------+------------+ | Finance | 64437.4888 | | Hardware | 64516.3733 | | HR | 64460.9010 | | Marketing | 64550.8071 | | Operations | 64574.0719 | | QA | 64401.9087 | | Sales | 64523.7697 | | Software | 64534.5903 | | Switch | 64521.1729 | +------------+------------+ c) mysql> select avg(Salary) from Employee as E, Course as C where E.name = C.student and C.prof = "Roe"; +-------------+ | avg(Salary) | +-------------+ | 64540.8515 | +-------------+ ===================================================================================== 3) Without index: mysql> load data local infile 'employee_hw3.txt' into table Employee; Query OK, 10000 rows affected (0.15 sec) With regular indexes: mysql> load data local infile 'employee_hw3.txt' into table Employee; Query OK, 10000 rows affected (0.45 sec) With large data set: Without indexes: mysql> load data local infile 'employee_hw3_large.txt' into table Employee; Query OK, 100000 rows affected (0.61 sec) With indexes: mysql> load data local infile 'employee_hw3_large.txt' into table Employee; Query OK, 100000 rows affected (4.78 sec) ===================================================================================== 4) The tables sizes with a scale factor of 0.02 are: 25 nation 4000 part 16000 partsupp 5 region 200 supplier a) The answer is (((region-nation)-supplier)-partsupp)-part because the total number of intermediate tuples is 16225. region-nation would generate 25 rows; (region-nation)-supplier, 200; ((region-nation)-supplier)-partsupp), 16000. The final join would also output 16000, but we're only counting intermediate rows in our cost computation. b) We are looking for an explanation why the above order was selected. It could be based on an algorithm or a logical one. The basic intuition should be that joining tables that produce small results early is good. Below is a dynamic programming solution: ==>table cardinalities region 5 nation 25 part 4000 supplier 200 partsupp 16000 ==>2-way joins selectivity factors resulting selectivity rows factor region-nation 25 / (5*25) = .04 supplier-nation 200 / (200*25) = .04 supplier-partsupp 16000 / (200*16000) = .005 part-partsupp 16000 / (4000*16000) = .00025 ==>all remaining selectivity factors are 1 region-part 20000 region-supplier 1000 region-partsupp 80000 nation-part 100000 nation-partsupp 400000 part-supplier 800000 ==>3-way joins (don't bother considering cartesian products) (cost is number of intermediate rows) resulting cost best rows plan ----------- ------- ----------------------------------------------- region-nation-supplier 200 25 (region-nation)-supplier nation-supplier-partsupp 16000 200 (nation-supplier)-partsupp supplier-partsupp-part 16000 16000 (supplier-partsupp)-part or (part-partsupp)-supplier ==>4-way joins resulting cost best rows plan ------ -------- ------------------------------------ region-nation-supplier-partsupp 16000 200+25 ((region-nation)-supplier)-partsupp nation-supplier-partsupp-part 16000 1600+200 ((nation-supplier)-partsupp)-part ==>solution region-nation-supplier-partsupp-part 16000 16000+200+25 (((region-nation)-supplier)-partsupp)-part c) This is a pretty open item. Any predicate that significantly alters the size of an intermediate join is valid, provided that the answer is accompanied by numbers. Note that the selection may alternate the order of the joins. For example if I add a select based on a specific supplier: select PART.name, SUPPLIER.name, NATION.name, REGION.name from PART, SUPPLIER, NATION, REGION, PARTSUPP where PART.partkey = PARTSUPP.partkey and SUPPLIER.suppkey = PARTSUPP.suppkey and SUPPLIER.nationkey = NATION.nationkey and NATION.regionkey = REGION.regionkey and SUPPLIER.suppkey = 1 The supplier table will now produce a single entry. partsupp table has 9520 lines for supplier "1". The join would change: ((supplier-nation)-region)-partsupp)-part The intermediate row cost would be: 1+1+9520