Evaluation and Comparison

You have probably seen the demo, but a demo is an imperfect way to evaluate a grammar checker. When we try out a grammar checker, we deliberately type sentences with errors, but the errors DeepGrammar is designed to catch are the ones we make without noticing. Also, the specific corrections DeepGrammar offers can distract us from what is important, which is whether DeepGrammar flags erroneous text as an error.

To better evaluate DeepGrammar, I compared DeepGrammar with four other methods:

  1. Word for Windows: the grammar checker in Microsoft Word. (I tested the Windows version; the Mac version seems to work slightly differently.)
  2. Grammarly: the best-known standalone grammar checker, which has been in development for over six years.
  3. Google: the grammar checker in Google Docs.
  4. Language Tool 3.1: a rule-based open-source grammar checker that has been in development for over ten years.

To serve as ground truth in the evaluation, I created two sets of evaluation texts, each consisting of text snippets. Each text snippet is a small amount of text, usually a sentence, although some snippets are phrases and some are multiple short sentences. The first set contains snippets with grammar errors. I created it by taking the snippets with grammar errors from the Language Tool website, adding to them, and making some modifications. This resulted in 242 error snippets, each with one error. They can be found here.

Since reducing the pain from false positives is as important as finding grammar errors, the second set of text snippets consists of text that does not have grammar errors. I created this set of snippets by taking 176 of my tweets and making each tweet into a text snippet. Tweets are an interesting choice because they reflect the kind of informal writing that we actually do, where grammar checkers often give a lot of false positives. These snippets without errors can be found here.

This gives a total of 418 text snippets.

I first ran each method over the snippets with grammar errors. A method’s response to a snippet was marked as correct if the method flagged the word with the error or a word next to it. Since we are more concerned with having the error pointed out than with the specific correction, I still counted a response as correct if the suggested correction was wrong, or even if the text was only underlined as a possible error. If the method flagged an error in another part of the snippet but did not flag the word where the error occurred (or a word next to it), the response was considered incorrect.
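As a rough sketch of this scoring rule (not the script used for the evaluation; the function name and the word-index representation are assumptions made for illustration), a response counts as a hit when any flagged word sits on the error word or directly beside it:

```python
def is_hit(flagged_positions, error_position, window=1):
    """Return True if any flagged word index is within `window` words
    of the word that actually contains the error.

    `flagged_positions` is a list of 0-based word indexes a checker flagged;
    `error_position` is the 0-based index of the erroneous word.
    Both representations are illustrative assumptions.
    """
    return any(abs(pos - error_position) <= window for pos in flagged_positions)


# One of the error snippets from the evaluation set.
snippet = "We are looking for people who has expertise in this area."
words = snippet.split()
error_position = words.index("has")

# Flagging "who" (the neighboring word) still counts as correct.
neighbor_flag = is_hit([words.index("who")], error_position)

# Flagging only "looking" elsewhere in the snippet counts as incorrect.
distant_flag = is_hit([words.index("looking")], error_position)
```

Note that the quality of the suggested correction plays no part here; only the location of the flag matters.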

I then ran each method over the snippets without grammar errors. If a method flagged a snippet as having a grammar error, the response was marked as incorrect. For Grammarly, I still counted the response as correct if the “errors” flagged were for ending in a preposition, using passive voice, or not writing out numbers, since some people might want to follow those rules.
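The rule for the error-free snippets can be sketched the same way. The category labels below are hypothetical names chosen for illustration, not identifiers from Grammarly or any other tool, and the sketch assumes a response is correct only when every flag falls into a tolerated style category:

```python
# Hypothetical labels for the style rules tolerated in Grammarly's output.
TOLERATED_STYLE_FLAGS = frozenset(
    {"ending_preposition", "passive_voice", "unwritten_number"}
)


def clean_snippet_is_correct(flag_categories, tolerated=frozenset()):
    """Score a checker's response to a snippet that has no grammar errors.

    `flag_categories` lists the category of each flag the checker raised.
    The response is correct when there are no flags, or when every flag
    belongs to a tolerated style category.
    """
    return all(category in tolerated for category in flag_categories)


# No flags at all: correct for any checker.
no_flags = clean_snippet_is_correct([])

# A passive-voice flag is forgiven only when the tolerated set is supplied.
forgiven = clean_snippet_is_correct(["passive_voice"], TOLERATED_STYLE_FLAGS)
not_forgiven = clean_snippet_is_correct(["passive_voice"])
```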

The results show that DeepGrammar is comparable to these other methods, even though it has been in development for under a year and is the work of a single person.

The first graph compares the methods on all 418 text snippets.

Since we are interested in finding errors when they are there, the second graph shows the results on the 242 snippets with grammar errors.

And since we don’t want to be bugged by too many false positives, the third graph shows the results on the 176 snippets with no grammar errors. It is great to see that DeepGrammar has so few false positives, since false positives often plague unsupervised learning methods.
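For reference, the three numbers behind the graphs can be derived from per-snippet outcomes along these lines. This is a minimal sketch with made-up placeholder outcomes, not the actual results; the function names are assumptions:

```python
def accuracy(outcomes):
    """Fraction of responses marked correct (True) in a list of booleans."""
    return sum(outcomes) / len(outcomes)


def summarize(error_outcomes, clean_outcomes):
    """Return the accuracy for each of the three comparisons:
    all snippets, snippets with errors, and snippets without errors."""
    return {
        "all": accuracy(error_outcomes + clean_outcomes),
        "errors": accuracy(error_outcomes),
        "clean": accuracy(clean_outcomes),
    }


# Placeholder outcomes for illustration only. In the evaluation there
# would be 242 entries for the error set and 176 for the clean set.
error_outcomes = [True, False, True, True]
clean_outcomes = [True, True]

summary = summarize(error_outcomes, clean_outcomes)
```

Because the two sets differ in size, the overall figure is weighted toward the larger error set rather than being a plain average of the other two.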

To get a better sense of where DeepGrammar does well, here are errors that DeepGrammar found and all of the other grammar checkers missed. DeepGrammar doesn’t always come up with a good correction, but it identifies that each snippet contains an error.

  1. It may be more expensive on some filesystems then others.
  2. This is pretty much one of its main use cases so you don't have to right one.
  3. The element can have for different kinds of child elements.
  4. How is would this constructor be useful?
  5. We are looking for people who has expertise in this area.
  6. It seems that is might be straightforward to implement a multi-threaded merge.
  7. I believe the CLucene developers are more focused on providing an indexing/searching library that other can build an application with.
  8. One problem is that you often don't where a sentence starts and ends.
  9. Is sorting works properly.
  10. Here we, have a local directory on the left.
  11. This may be prove fruitful than you might think.