This tutorial contains an introduction to word embeddings. You will train your own word embeddings using a simple Keras model for a sentiment classification task, and then visualize them in the Embedding Projector (shown in the image below).

Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model. In this section, you will look at three strategies for doing so.

One-hot encodings

As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, you will create a zero vector with length equal to the vocabulary, then place a one in the index that corresponds to the word. This approach is shown in the following diagram. To create a vector that contains the encoding of the sentence, you could then concatenate the one-hot vectors for each word.

A one-hot encoded vector is sparse (meaning, most indices are zero). Imagine you have 10,000 words in the vocabulary. To one-hot encode each word, you would create a vector where 99.99% of the elements are zero.

Encode each word with a unique number

A second approach you might try is to encode each word using a unique number. Continuing the example above, you could assign 1 to "cat", 2 to "mat", and so on. You could then encode the sentence "The cat sat on the mat" as a dense vector like [5, 1, 4, 3, 5, 2]. Instead of a sparse vector, you now have a dense one (where all elements are full).

There are two downsides to this approach, however:
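The two encodings described above can be sketched in plain Python. This is a minimal illustration, not the tutorial's Keras code; it assumes the vocabulary is numbered in alphabetical order, as the "assign 1 to cat, 2 to mat" example suggests.

```python
# Example sentence from the text (lowercased for simplicity).
sentence = "the cat sat on the mat"
words = sentence.split()

# The vocabulary: the unique words, sorted alphabetically.
vocab = sorted(set(words))  # ['cat', 'mat', 'on', 'sat', 'the']

# One-hot encoding: a zero vector the length of the vocabulary,
# with a single 1 at the word's index. Most entries are zero (sparse).
def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

one_hot_sentence = [one_hot(w) for w in words]

# Unique-number encoding: assign 1 to "cat", 2 to "mat", and so on.
# Every element of the resulting vector carries information (dense).
word_to_index = {w: i + 1 for i, w in enumerate(vocab)}
dense_sentence = [word_to_index[w] for w in words]
print(dense_sentence)  # [5, 1, 4, 3, 5, 2]
```

Note how the one-hot version of the sentence uses 6 × 5 = 30 numbers, almost all zero, while the dense version uses only 6; with a realistic vocabulary of tens of thousands of words, the gap becomes enormous.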