Working with Text Data

So far we have discussed two kinds of features: continuous features and categorical features. A third kind of feature, found in many applications, is text. For example, if we want to classify an email message as legitimate or spam, the content of the email will certainly contain important information for this classification task.

Text data is usually represented as strings, made up of characters. This kind of feature is clearly very different from the numeric features that we’ve discussed so far, and we will need to process the data before we can apply machine learning algorithms to it.
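To make "processing the data" concrete, here is a minimal sketch of one way to turn strings into numeric features: counting how often each word occurs. The toy messages and the helper name `to_counts` are invented for illustration, and splitting on whitespace is a simplifying assumption; a real pipeline would usually rely on a library tokenizer and vectorizer instead.

```python
from collections import Counter

# Hypothetical toy messages, invented for illustration only.
messages = [
    "win a free prize now",
    "meeting moved to friday",
    "free free prize inside",
]

# Split each message into lowercase word tokens on whitespace
# (a simplifying assumption; punctuation is not handled here).
tokenized = [message.lower().split() for message in messages]

# Build a sorted vocabulary of every distinct word seen in the messages.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def to_counts(tokens):
    """Return a list of word counts aligned with `vocabulary`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# One numeric row per message, one column per vocabulary word.
features = [to_counts(tokens) for tokens in tokenized]

for message, row in zip(messages, features):
    print(row, "<-", message)
```

Each message becomes a fixed-length row of integers, which is the kind of numeric input that the models discussed so far can work with. Word order is discarded in this representation.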

Types of Data Represented as Strings 09 Jun 2021
Representing Data as a Bag of Words 11 Jun 2021
Stop Words 11 Jun 2021
Rescaling the Data with tf–idf 11 Jun 2021
