Creating a Spell Checker
In this article, we will implement a command line Spell Checker in Haskell and cover the basics concepts over how to suggest corrections. We will see that it can be done using different techniques and how those choices critically impact on the suggestion accuracy and performance.
String distance algorithms
First of all, we need a way to mesure how similar one string is from another, for example, using the Levenshtein distance algorithm cat
and rat
distance is 1. Several algorithms approach this problem in different ways:
- Hamming distance
- Levenshtein distance
- Damerau-Levenshtein distance
- Optimal String Alignment
- Longest Common Substring distance
- q-gram distance
- Cosine distance
- Jaccard distance
- Jaro distance
- Jaro-Winkler distance
For this project, we will use the Levenshtein distance, that is probably one of the most famous and straightforward algorithms in this category. The Levenshtein distance is the minimum number of single-character edits (i.e., insertions, deletions or substitutions) required to change one word into the other. In the previous example, to transform cat
in rat
we only need to substitute the first character from c
to r
, and that is why the distance is 1.
Searching for near-matches Strings
Now that we can metrify how distant one string is from another, we need a way to go over our dictionary and suggest possible corrections.
The problem is that brute forcing a dictionary with a million of words to find near-matches can take some time, just imagine if we had to calculate the word distance against the entire dictionary and then filter the ones that have the distance below or equal to N.
So to escape from a linear complexity, we should maintain our dictionary in the smartest structure as possible. To solve that we have some options:
- BK-Tree
- Norving
- LinSpell
- SymSpell
We are going to use the BK-Tree, as it is efficient enough for our purpose. BK-Trees or Burkhard-Keller Trees is a tree-based data structure commonly used for quickly finding near-matches to a string. As many articles already cover this topic, I will recommend one and left here my implementation.
module BkTree
( insert
, search
, BkTree.lookup
, Node (Empty, Node)
, Distance (metric)
) where
import qualified Data.Map as M
import Data.Maybe (mapMaybe)
class Distance a where
metric :: a -> a -> Int
data Node a = Empty
| Node { value :: a
, children :: M.Map Int (Node a)} deriving (Show, Eq, Read)
insert :: (Distance a) => a -> Node a -> Node a
insert str Empty = Node str M.empty
insert str (Node value children) =
Node value newChildren
where
distance = metric value str
sameDistanceNode = M.lookup distance children
newChildren = case sameDistanceNode of
Nothing -> M.insert distance (Node str M.empty) children
Just node -> M.insert distance (insert str node) children
search' :: (Distance a) => a -> Int -> [Node a] -> [a] -> [a]
search' _ _ [] acc = acc
search' input n (Node value children : tail) acc
| length acc == 3 = acc -- limiting to 3 suggestions
| otherwise =
let distance = metric input value
inRange = mapMaybe (`M.lookup` children) [(distance-n)..(distance+n)]
validNodes = inRange ++ tail
in if distance <= n then
search' input n validNodes (value:acc)
else search' input n validNodes acc
search :: (Distance a) => a -> Int -> Node a -> [a]
search input n node = search' input n [node] []
lookup :: (Distance a) => a -> Node a -> Bool
lookup input (Node value children) =
let distance = metric input value
in (distance == 0) || case M.lookup distance children of
Nothing -> False
Just x -> BkTree.lookup input x
Try to insert the elements in the tree ordered by word relevance, in that way you will get better accuracy as these words will be nearest to the root.
Loading the dictionary
Now the missing part is the dictionary; you can pick one from the LibreOffice project here. With the dictionary file in hands, we only need to load it in memory and build the BK-Tree.
module Main where
import qualified BkTree as BK
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
fileLines :: FilePath -> IO [T.Text]
fileLines p = do
content <- TIO.readFile p
return $ T.splitOn (T.pack "\n") content
bkTreeFromDictionary :: [T.Text] -> BK.Node T.Text
bkTreeFromDictionary = foldl (flip BK.insert) BK.Empty
main :: IO ()
main = do
bkTree <- fileLines "dictionary" >>= \x -> return $ bkTreeFromDictionary x
Finding typos and suggesting words
You could argue that for precise matches creating a HashSet from the dictionary could have a better performance than trying to find it in the BK-Tree, and that is true, but the time that takes to build a HashSet with almost a million of Strings is significant, so for a command line spellchecker is just faster to search in the BK-Tree for a node with distance of 0.
module Main where
import System.Environment
import Data.Char
import Data.Text.Metrics (levenshtein)
import qualified BkTree as BK
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
spellCheck :: [T.Text] -> BK.Node T.Text -> [T.Text]
spellCheck words tree = filter (\x -> not $ BK.lookup x tree) words
guess :: BK.Node T.Text -> T.Text -> (T.Text, [T.Text])
guess bktree miss = (miss, BK.search miss 1 bktree)
bkTreeFromDictionary :: [T.Text] -> BK.Node T.Text
bkTreeFromDictionary = foldl (flip BK.insert) BK.Empty
instance BK.Distance T.Text where
metric = levenshtein
main :: IO ()
main = do
bkTree <- fileLines "dictionary" >>= \x -> return $ bkTreeFromDictionary x
content <- TIO.getContents >>= \x -> return $ T.words x
print $ map (guess tree) (spellCheck content tree)
In the code above we create a list of words from the input, and then for each word we check if it exists in the dictionary. For the words that could not be found we search three possible corrections that are at a maximum distance of 1.
Using Levenshtein function from text-metrics package.
Running the application
Now we should be able to run our command line application. Follows some examples:
» echo "Testando o correror ortografico." | spell-checker
[("correror",["corredor","corretor"]),("ortografico",["ortográfico"])]
Execution Time
» wc input.txt
46 1776 11871 input.txt
» time cat input.txt | spell-checker
cat input.txt 0,00s user 0,00s system 79% cpu 0,001 total
spell-checker-exe 11,38s user 4,23s system 184% cpu 8,472 total
It took 8.472s to build the BK-Tree with 985563 nodes and process a file with 1776 words. Note that the most time consuming part of this process is to build the Tree.
In Haskell, the expressions are not evaluated when they are bound. Because of that, the execution time drops considerably when analyzing smaller texts. For example, it should not take much more than one second to check a single word as just a tiny part of the Tree will be built in the process.
Possible Optimizations
-
Save the in-memory BK-Tree to disk in a binary format. Avoiding transversing the Tree and calculating the distance between nodes at every program startup.
-
Use SymSpell instead of BK-Tree as it claims to be 1,870 times faster. A downside is that the SymSpell algorithm seems to take more time to build the Tree and may not worth the faster lookup as the startup time is the bottleneck for our command line spellchecker.
Wrap up
As said in the beginning, to create a spellchecker, technical decisions around algorithms and data structures had to be made. Those decisions had an impact on time and space complexity of the program.
To achieve a more desirable correction suggestion, you will need a good dictionary. Without one, the program will almost always suggest nonsense words. A simple set of words will not do much; it is important to understand the dictionary words relevance too.
In the end, we achieved a naive solution as we had just scratched the surface of spellchecking. I wish that this post serves for you as an introduction to the subject.
Til next time,
Felipe Beline Baravieira
at 00:44