Now that we know how to count the number of occurrences of each word in a file, let's use that to find the most popular words in a file.
At the bottom of word_analyzer.py, add the code below (not part of any function):
text = read_file("horse_ebooks.txt")
counts = count_words(text)
print(sorted(counts))
The sorted function takes an iterable as input and returns a new, sorted list of its items. As you'll see when you run this program, it prints a list of the words in the dictionary sorted alphabetically. However, we want them sorted by their values (i.e. by their counts in the original file).
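To see why you get words rather than counts, here is a minimal sketch using a small made-up dictionary (the name sample_counts and its contents are just for illustration, they are not taken from horse_ebooks.txt). Iterating over a dictionary yields its keys, so sorted never looks at the values. Python also compares strings by character code, so capitalized words come before lowercase ones:

```python
# Hypothetical word counts, for illustration only.
sample_counts = {"to": 1, "and": 5, "Budget": 1}

# sorted iterates over the dictionary's keys and ignores the values.
alphabetical = sorted(sample_counts)
print(alphabetical)  # ['Budget', 'and', 'to']
```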
Luckily for us, sorted is a higher-order function. sorted can take an optional input called key, a function that is applied to each item to produce the value used for comparison. For example, we can sort a list of numbers by their absolute value by passing the abs function as the key parameter, as follows. Here we're using a special syntax for optional parameters in Python, where we set the optional parameter with an equals sign. There is no equivalent for this in Snap!, but the idea should hopefully be clear.
>>> sorted([-5, -1, 2, 3], key = abs)
[-1, 2, 3, -5]
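It may help to see the values that sorted actually compares. The sketch below (with made-up variable names) first computes the key for each number by hand, then shows that any one-argument function works as a key, using the built-in len to sort some example words by length:

```python
numbers = [-5, -1, 2, 3]

# These are the values sorted compares when key is abs.
keys = [abs(x) for x in numbers]
print(keys)  # [5, 1, 2, 3]

# The original numbers come back, ordered by those keys.
by_abs = sorted(numbers, key=abs)
print(by_abs)  # [-1, 2, 3, -5]

# Any function of one argument can be a key, e.g. len for string length.
words = ["banana", "fig", "apple"]
by_length = sorted(words, key=len)
print(by_length)  # ['fig', 'apple', 'banana']
```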
In our case, to find the most popular words, we want to sort the keys of the dictionary by their values. This is probably a bit mind-blowing, but it means that we need to pass the get method of counts as the key, as shown below. Try copying and pasting this into word_analyzer.py to see what happens:
text = read_file("horse_ebooks.txt")
counts = count_words(text)
print(sorted(counts, key = counts.get))
$ python word_analyzer.py
['on', 'Budget', 'to', 'Fruit', 'Clean', 'Fruits', 'Store', 'at', 'a', 'and', 'Vegetables']
What this does is call the get method of the counts dictionary on each word, which looks up that word's count, and use the result for sorting. You'll see that the words are now sorted by frequency (i.e. their value in the dictionary). Because the sorted function always puts the smallest items first, the less common words come first. If we want the opposite order, we can use another optional parameter of the sorted function called reverse, as shown below:
text = read_file("horse_ebooks.txt")
counts = count_words(text)
print(sorted(counts, key = counts.get, reverse = True))
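If you want to experiment without reading a file, here is a self-contained sketch of the same idea. The dictionary sample_counts is made up for illustration, and its counts are all different so that the order is unambiguous:

```python
# Hypothetical word counts, for illustration only.
sample_counts = {"and": 5, "on": 1, "a": 2}

# get looks up the value stored for a key.
print(sample_counts.get("and"))  # 5

# Sort the words by their counts, smallest first.
ascending = sorted(sample_counts, key=sample_counts.get)
print(ascending)  # ['on', 'a', 'and']

# reverse=True flips the order, so the most common word comes first.
descending = sorted(sample_counts, key=sample_counts.get, reverse=True)
print(descending)  # ['and', 'a', 'on']
```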
Using the examples above as inspiration, write a function top_n_words as defined below:
def top_n_words(counts, n):
    """Returns the top n words by count. For example:
    top_n_words({'and': 5, 'on': 1, 'Vegetables': 5, 'Budget': 1, 'to': 1, 'Fruit': 1, 'a': 2, 'Clean': 1, 'Fruits': 1, 'Store': 1, 'at': 1}, 2)
    would return ["and", "Vegetables"].
    In the case of a tie, it doesn't matter which words are chosen to break the tie."""
Make sure to test that top_n_words works. We haven't told you explicitly how to do this, but using what you know so far, you should try to figure out how to test top_n_words using print statements.
After playing around with print statements, you may validate your work by running the autograder:
$ python -m doctest word_analyzer.py
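For context, doctest is a module in Python's standard library. It looks through docstrings for text that resembles an interactive session (lines starting with >>>), runs those lines, and reports any output that differs from what the docstring shows. The sketch below uses a hypothetical function named double that has nothing to do with the lab, and it assumes nothing about the tests in your copy of word_analyzer.py:

```python
def double(x):
    """Returns twice x.

    >>> double(4)
    8
    >>> double(-1)
    -2
    """
    return 2 * x
```

Running python -m doctest on a file containing this function prints nothing when every example passes. Adding the -v flag lists each example as it runs.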
Hint by example: x[:5] returns the first 5 items of a list.
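Here is a short sketch of that slicing syntax on a made-up list, including what happens when the list has fewer items than you ask for:

```python
letters = ["a", "b", "c", "d", "e", "f", "g"]

# The first 5 items; the original list is left unchanged.
first_five = letters[:5]
print(first_five)  # ['a', 'b', 'c', 'd', 'e']

# The number can be a variable, too.
n = 2
first_n = letters[:n]
print(first_n)  # ['a', 'b']

# Asking for more items than exist is not an error.
short = ["x", "y"]
print(short[:5])  # ['x', 'y']
```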
Complete this section of the lab by writing a function print_top_n_words that prints out the top n words along with their counts, with one word on each line.
def print_top_n_words(counts, n):
    """Prints the top n words along with their counts. For example:
    print_top_n_words({'and': 5, 'on': 1, 'Vegetables': 5, 'Budget': 1, 'to': 1, 'Fruit': 1, 'a': 2, 'Clean': 1, 'Fruits': 1, 'Store': 1, 'at': 1}, 2)
    would print:
    and 5
    Vegetables 5"""
Hint: Make sure to use your top_n_words function when writing this function.
Hint 2: "and " + 5 will cause a TypeError, since Python can't concatenate a string and a number. To fix this, use the str function to convert the number to a string, e.g. "and " + str(5)
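The sketch below shows both the error and the fix, using made-up variables word and count. The try/except is only there so the example keeps running after the error:

```python
word = "and"
count = 5

# Adding a string and a number raises a TypeError.
try:
    line = word + " " + count
except TypeError as error:
    print("TypeError:", error)

# Converting the number with str makes the concatenation work.
line = word + " " + str(count)
print(line)  # and 5
```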
Try printing the top words in a few of the provided files (beatles.txt, nietzsche.txt, etc.). You may need to make n fairly large to see anything interesting.
You may additionally use the autograder to check your work:
$ python -m doctest word_analyzer.py
You might find it annoying that the most common words for most texts are boring things like "the". Create another function top_n_words_except(counts, n, boring) that returns the top words except for anything that appears in boring, which is a list of boring words.
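One building block that may help here is Python's in operator, which tests whether an item appears in a list. This is only a sketch of that operator on made-up lists, not the solution, and how you combine it with your earlier functions is up to you:

```python
boring = ["the", "a", "of"]

# in and not in test membership in a list.
print("the" in boring)        # True
print("horse" not in boring)  # True

# Keep only the words that are not in the boring list.
words = ["the", "horse", "of", "ebooks"]
interesting = []
for word in words:
    if word not in boring:
        interesting.append(word)
print(interesting)  # ['horse', 'ebooks']
```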