Topics: Binary
Search Trees, Binary Search, Sorted Data
Part 1: (Required)
Write a program
to count the number of times each word appears in a file 'story.txt' ---
once upon a time
there were three little pigs who lived all alone in the deep forest where they
amused themselves by building little houses one day a wolf came to the forest and
blew down their little houses and ate the little pigs
You may assume
the file has no punctuation and is in all lower-case letters. Your program should write the results to a
file called ‘wordCounts.txt’ with the following format:
a
2
all
1
alone
1
and
2
amused
1
ate
1
building
1
blew
1
by 1
came
1
day
1
deep
1
down
1
forest
2
houses
2
in
1
little
4
lived
1
--- Etc…
You
should use a Binary Search Tree to keep track of the words. Each node should
have a 'word' field and a 'count' field. A word is only entered into the
tree once. If it appears again, its count is simply incremented.
Part
2: (Optional extra credit)
Now
extend your program from part 1 to also spellcheck the words in the file. Use the dictionary file largedictionary.txt to decide if a word
is spelled correctly or not. For a given
input text file, write the same output (‘wordCount.txt’) as in part 1 but with
an added “(fake word)” written in front of each fake word. Test your program on the following file whale.txt
Run
time requirement: Your
program must be fast. You should check
the spelling of each word with the “binary search” algorithm (or something
similarly efficient).
Here
is a (made up) example output file :
a
5002
aardvark
2
(fake
word) abocoda 3
abysmal
103
baboon
2
(fake
word) barkeranimal 53
breakfast
27
(fake
word) perrocaliente 17
Etc…
Part
3: (Optional extra credit)
Extend
your program from parts 1 and 2 to create an additional output file called
‘topMistakes.txt’ containing the top 100 misspelled words in whale.txt.
Your misspelled words must be written in order from highest count to
lowest count.
Run
time requirement: Design an efficient (fast) way to generate the list of 100. Slow solutions aren’t good enough.
Here
is a (made up) example output file:
ishmael
743
whaleslime
500
ahab
478
moby
423
tashtego
217
perrocaliente
17
waterfish
10
snailfish
10
queequeg
7
etc…