Topics: Binary Search Trees, Binary Search, Sorted Data

 

Part 1: (Required)

Write a program to count the number of times each word appears in a file 'story.txt' ---

 

once upon a time there were three little pigs who lived all alone in the deep forest where they amused themselves by building little houses  one day a wolf came to the forest and blew down their little houses and ate the little pigs

 

You may assume the file has no punctuation and is in all lower-case letters.  Your program should write the results to a file called ‘wordCounts.txt’ with the following format:

 

a 2

all 1

alone 1

and 2

amused 1

ate 1

building 1

blew 1

by 1

came 1

day 1

deep 1

down 1

forest 2

houses 2

in 1

little 4

lived 1

     ---  Etc

 

You should use a Binary Search Tree to keep track of the words. Each node should have a 'word' field and a 'count' field.  A word is only entered into the tree once.  If it appears again, its count is simply incremented

 

Part 2: (Optional extra credit)

Now extend your program from part 1 to also spellcheck the words in the file.  Use the dictionary file largedictionary.txt to decide if a word is spelled correctly or not.  For a given input text file, write the same output (‘wordCount.txt’) as in part 1 but with an added “(fake word)” written in front of each fake word.  Test your program on the following file whale.txt

 

Run time requirement:  Your program must be fast.  You should check the spelling of each word with the “binary search” algorithm (or something similarly efficient).

 

Here is a (made up) example output file :

 

a 5002

aardvark 2

(fake word) abocoda 3

abysmal 103

baboon 2

(fake word) barkeranimal 53

breakfast 27

(fake word) perrocaliente 17

Etc…

 

 

Part 3: (Optional extra credit)

Extend your program from parts 1 and 2 to create an additional output file called ‘topMistakes.txt’ containing the top 100 misspelled words in whale.txt.  Your misspelled words must be written in order from highest count to lowest count.

 

Run time requirement: Design an efficient (fast) way to generate the list of 100.  Slow solutions aren’t good enough.

 

Here is a (made up) example output file:

 

ishmael 743

whaleslime 500

ahab 478

moby 423

tashtego 217

perrocaliente 17

waterfish 10

snailfish 10

queequeg 7

     etc…