Classifying SPAM

Now that you know that the average length of ham and spam messages differs, build a simple classifier based on the number of words in a message. That is, write a block that takes a message and, based on how many words it contains, reports either "ham" or "spam" as its classification.

[Image: the classify block applied to a message, reporting "ham"]

[Image: the classify block applied to a message, reporting "spam"]
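For readers who want to check their Snap! block against a text-based version, here is a minimal Python sketch of the same idea. The function name `classify` mirrors the block; the `threshold` parameter and the sample messages are illustrative assumptions, not values from the data.

```python
def classify(message, threshold):
    """Label a message by word count: more than `threshold` words means spam."""
    # Splitting on whitespace is a simple stand-in for counting words.
    word_count = len(message.split())
    return "spam" if word_count > threshold else "ham"


# Hypothetical messages and threshold, chosen only to show both outcomes.
short_message = "See you at lunch"
long_message = " ".join(["word"] * 30)

print(classify(short_message, 20))
print(classify(long_message, 20))
```

A message with exactly `threshold` words is classified as ham here, because only messages above the threshold count as spam.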

Implement this block in Snap!, and use it to classify the messages in our data. You can use a regular loop to call your classify block on the second item in each row of the data, or you can use this faster method with the keep block:

[Image: using keep to apply a custom classify block to the data]
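The keep approach corresponds roughly to filtering a list. The Python sketch below assumes each row is a two-item list with the message in the second position (index 1 in Python, item 2 in Snap!) and, as a further assumption, the true label in the first. The rows and the threshold of 5 are made-up toy values.

```python
def classify(message, threshold):
    """Label a message by word count: more than `threshold` words means spam."""
    return "spam" if len(message.split()) > threshold else "ham"


# Made-up toy rows in the assumed [label, message] layout.
data = [
    ["ham", "See you at lunch"],
    ["spam", "Congratulations you have won a free prize call now to claim"],
    ["ham", "Running late be there soon"],
]

# Keep only the rows whose message the classifier labels as spam.
flagged = [row for row in data if classify(row[1], 5) == "spam"]

print(len(flagged))
```

As with keep, this builds a new list and leaves the original data unchanged.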

Play around with different threshold values (that is, the number of words above which you decide that a message is spam), then answer the following questions:

What was the best threshold you found?

How many messages did it classify incorrectly?
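One way to explore thresholds systematically is to count the errors for each candidate value and pick the smallest count. The sketch below assumes each row holds the true label first and the message second. The helper `count_errors`, the toy rows, and the range of thresholds tried are all hypothetical, so the result says nothing about the real data.

```python
def classify(message, threshold):
    """Label a message by word count: more than `threshold` words means spam."""
    return "spam" if len(message.split()) > threshold else "ham"


def count_errors(data, threshold):
    """Count rows whose predicted label differs from the true label."""
    return sum(1 for label, message in data if classify(message, threshold) != label)


# Made-up toy rows in the assumed [label, message] layout.
data = [
    ["ham", "See you at lunch"],
    ["ham", "Running late be there soon"],
    ["spam", "Congratulations you have won a free prize call now to claim"],
    ["spam", "Free entry text WIN to claim your cash prize today"],
    ["ham", "Can you pick up some milk and bread on your way home tonight please"],
]

# Try every threshold from 0 to 15 words and record the error count for each.
errors_by_threshold = {t: count_errors(data, t) for t in range(16)}

# Pick the threshold with the fewest errors (the lowest one wins a tie).
best_threshold = min(errors_by_threshold, key=errors_by_threshold.get)

print(best_threshold, errors_by_threshold[best_threshold])
```

Even the best threshold on these toy rows misclassifies the long ham message, which illustrates why a word-count classifier alone is unlikely to be perfect.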