And now, the moment you've all been waiting for: it's time to build our search engine!
Your goal is to create a search engine in BYOB that can index and search through a list of quotes. You can download the framework for the search engine here; it provides the data that your search engine should process as well as some helper functions.
Requirements
- Start off with the basic framework provided above.
- Crawler. The crawler will be provided to you. It returns a list in which each element is a separate source to be analyzed by your search engine.
- Indexer. This is where most of the work takes place and where most of the processing time will be spent. Create a keyword index that can be used to quickly determine which quotes contain which words. When the indexer finishes, it should have produced an efficient keyword index that the Searcher can use. Use the hash table blocks to store the data in a way that can be searched quickly (perhaps using techniques from earlier in this lab). Remember that multiple quotes may include the same word, and you need to be able to locate every quote that contains a search term. Parts of the indexer have been created for you, but you will need to set up the hash table and insert all of the data into it.
- Searcher. Add content to the "search for keyword" operator block so that it queries the index for a particular word and returns a list of all of the quotes that contain that word. The searcher should be able to perform multiple searches on the index without having to regenerate any of the information.
- Additional requirement: add a special feature to either the indexer or the searcher that improves your search engine (either by speeding it up or by improving the quality of its results in some way). Be creative! Some ideas:
  - Rank the returned search results by the number of times the keyword appears in each source. For example, a source that contains the word "fireplace" 12 times would appear before a source that contains the word "fireplace" 5 times.
  - Make your search engine look for different forms of the same keyword. One simple form of this is called suffix stripping; with suffix stripping, a search for "player" would also check for words like "play," "played," and "plays."
  - Remove all punctuation from words before inserting them into your index.
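
If it helps to see the overall shape before you start snapping blocks together, here is a rough sketch of the indexer and searcher in Python rather than BYOB. It is only an illustration under some assumptions: a Python dictionary stands in for the hash table blocks, the sample quotes are made up (your real data comes from the crawler), and the names `normalize`, `build_index`, and `search` are hypothetical, not part of the provided framework. It also folds in two of the optional ideas above (ranking by occurrence count and trimming punctuation), which your own design may handle differently.

```python
import string

# Hypothetical sample data; the real quotes come from the provided crawler.
QUOTES = [
    "The fireplace was warm, and the fireplace was bright.",
    "A warm welcome is worth more than a bright fireplace.",
    "Play the game; do not let the game play you.",
]


def normalize(word):
    """Lowercase a word and strip punctuation from both of its ends."""
    return word.strip(string.punctuation).lower()


def build_index(quotes):
    """Map each word to {quote position: number of occurrences}."""
    index = {}
    for position, quote in enumerate(quotes):
        for raw in quote.split():
            word = normalize(raw)
            if not word:
                continue  # the token was nothing but punctuation
            counts = index.setdefault(word, {})
            counts[position] = counts.get(position, 0) + 1
    return index


def search(index, quotes, keyword):
    """Return the quotes containing keyword, most occurrences first."""
    counts = index.get(normalize(keyword), {})
    # Sort by descending count; ties keep the original quote order.
    ranked = sorted(counts, key=lambda pos: (-counts[pos], pos))
    return [quotes[pos] for pos in ranked]


# The index is built once and then reused by every search.
INDEX = build_index(QUOTES)
```

With this sample data, `search(INDEX, QUOTES, "fireplace")` returns the first quote ahead of the second, because the word appears twice in the first quote and once in the second. Notice that each word maps to every quote that contains it, and that searching never rebuilds the index; those are the two properties the requirements above ask for.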