Now let's start working with our data. Our goal is to count every word in a file and find which words appear most often. There are Python libraries that would let us do this in a couple of lines, but for the sake of education, let's write all of this code ourselves from scratch.
Modify word_analyzer.py by removing any print statements that tested the functions from the previous lab page. Then add and complete the count_words function as shown below:
def count_words(text):
    """Takes a text and returns a dictionary mapping each word to its count, for example:

    count_words(["Fruits and Vegetables and Vegetables on a Budget and Vegetables at a Store and Vegetables to Clean Fruit and Vegetables"])

    would return:

    {'and': 5, 'on': 1, 'Vegetables': 5, 'Budget': 1, 'to': 1, 'Fruit': 1, 'a': 2, 'Clean': 1, 'Fruits': 1, 'Store': 1, 'at': 1}
    """
    return ???

text = read_file("horse_ebooks.txt")
print(count_words(text))
Your job is to fill out the count_words function so that it returns a dictionary with the right counts. If you're stuck, look back at Exercise 6 from the Data Structures lab. Remember that, depending on your Python version, a dictionary may list its keys in a different order, so the call above may not print the keys in the order shown in the docstring.
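If you have worked through Exercise 6 and want to check your answer, here is a minimal sketch of one possible approach, assuming (as in the docstring example) that text is a list of strings:

def count_words(text):
    """Takes a text and returns a dictionary mapping each word to its count."""
    counts = {}
    for line in text:  # text is assumed to be a list of strings
        for word in line.split():  # split each line on whitespace
            # get() returns the current count, or 0 if the word is new
            counts[word] = counts.get(word, 0) + 1
    return counts

This is only one way to do it; a loop with an if/else membership test, as in Exercise 6, works just as well.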
Test your code using the autograder by running the following line:
$ python -m doctest word_analyzer.py
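One caveat worth knowing: doctest only executes examples written with the interactive >>> prompt, so a docstring example phrased in prose (like the one above) will not be picked up by the command. If you want an order-independent doctest of your own, one option (an illustrative pattern, not part of the original lab) is to compare sorted items:

>>> counts = count_words(["the cat and the hat"])
>>> sorted(counts.items())
[('and', 1), ('cat', 1), ('hat', 1), ('the', 2)]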
One issue with our count_words function is that it includes punctuation as part of each word. For example, "cows" and "cows." would be counted as different words. As an optional exercise, write a remove_punctuation function that returns a string without any punctuation, then modify count_words so that it uses this new function.
Hint: use "".join() with a list comprehension; see the previous lab for examples.
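As a minimal sketch, assuming the standard library's string.punctuation covers the characters you want to strip:

import string

def remove_punctuation(text):
    """Returns text with every character in string.punctuation removed."""
    return "".join([ch for ch in text if ch not in string.punctuation])

count_words could then call remove_punctuation on each line before splitting it into words.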