So far we have talked about two kinds of features: continuous features and categorical features. A third kind of feature that appears in many applications is text. For example, if we want to classify an email message as either legitimate or spam, the content of the email will certainly contain important information for this classification task.
Text data is usually represented as strings, made up of characters. This kind of feature is clearly very different from the numeric features that we’ve discussed so far, and we will need to process the data before we can apply our machine learning algorithms to it.
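To make the difference concrete, here is a minimal sketch using only the Python standard library. The two email messages and the `word_counts` helper are made up for illustration, and counting whitespace-separated words is just one simple way such processing could look, not a prescribed method.

```python
from collections import Counter

# Two hypothetical email messages, stored as plain Python strings.
emails = [
    "Win a free prize now",
    "Meeting moved to 3pm, see agenda attached",
]

# Each string is a sequence of characters, and the lengths differ,
# so the raw messages do not form a fixed-width numeric feature matrix.
lengths = [len(text) for text in emails]

def word_counts(text):
    """Hypothetical helper: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())

# One possible numeric view of each message: a mapping from word to count.
counts = [word_counts(text) for text in emails]

print(lengths)
print(counts[0])
```

Note that splitting on whitespace leaves punctuation attached to words (for example `"3pm,"`), which hints at why text usually needs more careful processing than this.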