HOW INFORMATION SYSTEMS WILL 'READ' HANDWRITING IN THE TALLY OF THE CENSUS

15 April 2001

WIDNES, UNITED KINGDOM, May 15--Automated recognition technology will be used to read the handwriting on the millions of Census forms that will pour through the Census processing centre in Widnes, Cheshire.

It should come as no surprise that information technology is an integral part of Census processing. But how do scanning and recognition systems cope with the wide variation in handwriting? Can computer software determine what the respondent meant to write, and do so quickly and reliably enough to capture the volume of data on the forms processed during this year's Census? And will computers be able to cope with the "accidents" that befall Census forms en route, such as tea stains, torn corners, or pens that leak ink?

Lockheed Martin systems engineers, who worked with the U.K. Census offices to develop a Data Capture and Coding System (DCCS) to process the 2001 Census, report that the system will not only handle the wide variation in handwriting but will do so more accurately and with greater confidence levels than any previous system.

Able to handle all kinds of handwriting
"The systems we've developed will be able to deal with all kinds of handwriting types," says Fred Highland, systems architect who led the systems design for Lockheed Martin Mission Systems. "We have tested it exhaustively and validated the results in dry runs and we know that it is fully capable of handling as large an image recognition project as this." Highland notes that the technology was also used successfully to handle the data capture of some 147 million forms in the United States Census last year.

Tom Roe, service director for ICL's Census 2001 project, said: "As part of this prestigious national contract, ICL has filled over 1,000 jobs in the north-west of England which will play an important role in the accurate conversion and processing of the 33 million Census forms from May until March next year when the project is due for completion. To achieve this, ICL is training these employees to utilise the handwriting recognition technology which is contributing to the speed and efficiency of this first Census of the new millennium."

How does it all work? DCCS supports the entire Census processing operation, from check-in of returned forms to the point where the final captured data is forwarded to Census analysts. The scanning and recognition activity, though, is at the heart of the operation. After a check-in process in which forms are taken out of their boxes and all staples are removed, the forms are scanned. Scanning is akin to taking a computer "photo" of the form: it digitises everything entered on it, including the handwriting.

Fat, slanted, little and tall "A's" are all recognised
The system next evaluates the zones, or areas, where handwritten entries are expected. If anything is present in one of these zones, the system attempts to recognise it. First, a zone is segmented into characters by looking for breaks in the writing. (Note that respondents are instructed to print.) After segmentation into possible characters, the digitised handwriting images are analysed according to the sizes and shapes of the characters written. This process, called optical character recognition (OCR), uses a type of statistical analysis that is programmed to recognise each letter of the alphabet in a number of variations: a big fat A, a little A, an A with a decided slant, a capital A, a lower case a, all specifically programmed for UK handwriting styles.
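The segmentation step described above, looking for breaks in the writing, can be illustrated with a minimal sketch. It assumes the zone has already been reduced to a grid of 0 (blank) and 1 (ink) values and treats any fully blank column as a break; the function name and the sample grid are hypothetical, and the release does not describe how DCCS actually performs this step.

```python
def segment_columns(zone):
    """Split a binary zone image into character spans at blank columns.

    `zone` is a list of rows; each row is a list of 0 (blank) or 1 (ink).
    Returns (start, end) column pairs, with `end` exclusive.
    """
    if not zone:
        return []
    width = len(zone[0])
    # A column "has ink" if any row contains a 1 in that column.
    has_ink = [any(row[x] for row in zone) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                  # a character begins here
        elif not ink and start is not None:
            spans.append((start, x))   # a blank column ends the character
            start = None
    if start is not None:
        spans.append((start, width))   # character runs to the zone edge
    return spans


# Hypothetical 3-row zone containing three separate marks.
zone = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
print(segment_columns(zone))
```

A break-based approach like this only works when letters do not touch, which is consistent with respondents being asked to print.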

Note to Editors: The Optical Character Recognition (OCR) engine (a systems term for the software that does this) comes up with its best judgment of what the letter is and also attaches a confidence factor, a number from 0 to 100 indicating just how sure it is of the letter being recognised. Sometimes the OCR engine also provides a second choice (with a lower, but still credible, confidence level). The functions of the commercial OCR engine are complemented by unique software, developed by Lockheed Martin engineers, that enhances the evaluation of characters in context. This context checking includes cross checking of related fields, checking the structure of postcodes, and "trigram" analysis.

Trigram analysis is the evaluation of all three-letter combinations in a word to determine its likelihood of being a valid English word. A table is created containing all three-letter combinations of letters in the English alphabet (i.e., AAA, AAB…ZZZ). Associated with each three-letter combination is its frequency of occurrence at the beginning, middle, and end of English words. The initial confidence of accurate recognition is then adjusted, based on the trigram analysis. If, for example, three "i's" in succession were discovered in a word, the software would reduce its confidence rating, since there is no word with three "i's" in a row. Conversely, a letter combination that appears frequently would get a high rating. Where possible, invalid trigrams are replaced with valid ones, using the alternate recognition choices provided by the OCR engine.
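As a rough illustration of how trigram evidence might be combined with OCR confidence, the sketch below scores every reading that can be built from the OCR engine's first and second choices and keeps the best one. Everything in it is an assumption made for illustration: the tiny frequency table, the function names, and the formula that blends OCR confidence with trigram frequency. It also ignores the beginning, middle, and end positions that the real table distinguishes.

```python
from itertools import product

# Hypothetical, tiny table: trigram -> relative frequency between 0 and 1.
# A full table would cover all 26**3 combinations, split by word position.
TRIGRAM_FREQ = {"SMI": 0.6, "MIT": 0.7, "ITH": 0.9}


def trigrams(word):
    """All overlapping three-letter combinations in a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def word_score(word):
    """Mean trigram frequency; trigrams missing from the table count as 0."""
    grams = trigrams(word)
    if not grams:
        return 0.0
    return sum(TRIGRAM_FREQ.get(g, 0.0) for g in grams) / len(grams)


def best_reading(choices):
    """Pick the most plausible word from per-character OCR choices.

    `choices` holds, for each character position, a list of
    (letter, confidence 0-100) options with the first choice first.
    The blend of OCR confidence and trigram score is an assumed formula.
    """
    if not choices:
        return ("", 0.0)
    best = None
    for combo in product(*choices):  # every combination of choices
        word = "".join(letter for letter, _ in combo)
        ocr_conf = sum(conf for _, conf in combo) / len(combo)
        score = ocr_conf * (0.5 + 0.5 * word_score(word))
        if best is None or score > best[1]:
            best = (word, score)
    return best


# The OCR engine's first choice for the third letter is "L", second is "I".
choices = [[("S", 95)], [("M", 90)], [("L", 60), ("I", 55)],
           [("T", 92)], [("H", 94)]]
print(best_reading(choices)[0])
```

Here the first-choice reading "SMLTH" contains no trigram found in the table, so the second-choice "I" wins despite its lower OCR confidence. Trying every combination grows quickly with the number of alternates, so a practical system would need to limit the search.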

Poorly written words sometimes need human operator help
Using the combination of the OCR engine, the trigram analysis, and the confidence factors, each word is judged to be either recognised with high confidence or not. Words recognised with high confidence are placed directly in the output data for the form being processed. Words not recognised with high confidence are forwarded to a human operator, who is shown images of the low confidence words and keys in the values. Reasons for failure to recognise a word with high confidence include the presence of characters that are poorly written (an "A" that resembles an "R"), words that have been scratched out and rewritten elsewhere on the form, and words that have been written in script rather than printed.
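The routing decision described above amounts to a threshold test. In this minimal sketch the cut-off of 90, the function name, and the field names are hypothetical; the release does not state what level counts as high confidence.

```python
HIGH_CONFIDENCE = 90  # hypothetical cut-off on the 0-100 confidence scale


def route_words(words, threshold=HIGH_CONFIDENCE):
    """Split recognised words into accepted output and a keying queue.

    `words` is a list of (field, text, confidence) triples.
    Returns (accepted, to_key): a dict of field -> text that goes straight
    to the output data, and a list of fields whose images go to an operator.
    """
    accepted, to_key = {}, []
    for field, text, confidence in words:
        if confidence >= threshold:
            accepted[field] = text
        else:
            to_key.append(field)
    return accepted, to_key


# Hypothetical recognition results for one form.
form = [("surname", "SMITH", 97), ("town", "W1DNES", 41), ("job", "TEACHER", 92)]
accepted, to_key = route_words(form)
print(accepted, to_key)
```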

How accurate is the whole process? "If the automated results were accepted completely, and no low confidence characters were keyed by human operators, approximately 85 percent of the characters would be entirely correct. This is pretty good, when you consider the range of hand printing that must be processed," according to Lockheed Martin's Highland. However, the ONS requires that the data be at least 98 percent accurate, so the low confidence characters are forwarded to keyers for human recognition and entry. By having human keyers enter the 15 percent of the words with the lowest automated recognition confidence, the desired accuracy will be achieved.
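The arithmetic behind these figures can be sketched as a weighted average. The 98 percent target and the 15 percent keyed share come from the text above; the 99 percent keyer accuracy is purely an assumed figure for illustration, and the function names are hypothetical.

```python
def blended_accuracy(keyed_fraction, keyer_accuracy, auto_accuracy):
    """Overall accuracy when part of the data is keyed and the rest is automatic."""
    return keyed_fraction * keyer_accuracy + (1.0 - keyed_fraction) * auto_accuracy


def required_auto_accuracy(target, keyed_fraction, keyer_accuracy):
    """Accuracy the automatically accepted share must reach to hit `target`."""
    auto_fraction = 1.0 - keyed_fraction
    return (target - keyed_fraction * keyer_accuracy) / auto_fraction


# Assumption: keyers are 99 percent accurate (not stated in the release).
needed = required_auto_accuracy(target=0.98, keyed_fraction=0.15, keyer_accuracy=0.99)
print(f"{needed:.4f}")
```

Under that assumption, the 85 percent of the data accepted automatically would need to be about 97.8 percent accurate. That is only achievable if recognition errors are concentrated in the low confidence items sent for keying, which is the purpose of the confidence factor.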

System reduces need for keying support by 85 percent
Accuracy at these levels means that the system can be counted on to process and verify the information without the need for extensive keying operations. In fact, the use of DCCS has reduced by almost 85 percent the keying support that would have been required if past systems had been used.

Continuous quality control to flag problems
DCCS also includes ongoing quality control that can spot problems virtually as they occur. Highland explains: "Dynamically, as data from forms is collected, we will siphon off data samples and send them to a human operator for keying. This provides a continuous rolling measure of OCR and keyer accuracy. If for any reason there is a problem, we will process the forms over again until we are sure that we have it right." Ultimately, Lockheed Martin's role is to ensure that the data collected is accurate.
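A continuous rolling measure of this kind could be kept with a fixed-size window of recent sample comparisons. The sketch below is one possible arrangement, not a description of DCCS; the class name, window size, and alert level are all assumptions.

```python
from collections import deque


class RollingAccuracy:
    """Rolling share of sampled values where captured data matched a re-keyed reference."""

    def __init__(self, window=1000, alert_below=0.98):
        # deque(maxlen=...) silently drops the oldest result once full.
        self.results = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, captured, reference):
        """Store whether the captured value agrees with the sample keyed for QC."""
        self.results.append(captured == reference)

    def accuracy(self):
        """Accuracy over the current window, or None before any samples arrive."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        """True when the rolling accuracy has fallen below the alert level."""
        acc = self.accuracy()
        return acc is not None and acc < self.alert_below


# Hypothetical samples: the fourth captured value disagrees with the keyed one.
monitor = RollingAccuracy(window=4, alert_below=0.98)
for captured, keyed in [("SMITH", "SMITH"), ("WIDNES", "WIDNES"),
                        ("TEACHER", "TEACHER"), ("J0NES", "JONES")]:
    monitor.record(captured, keyed)
print(monitor.accuracy(), monitor.needs_attention())
```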

And what about the tea spills on forms? "When we encounter a form with spots or tears or other damage, the system automatically identifies the problem. It alerts an operator who makes a judgement on whether the form can indeed be read. Sometimes, a corner might be missing, but not in an area where information has been written, and we will be able to continue with processing. If there is any doubt at all about the readability, it will go to a human operator to decipher."

Now that we have read it, do we know what it means? Another novel feature of DCCS is a process called coding. Certain critical information on the form is read and grouped into a standard set of categories for further analysis. For example, one of the questions on the form asks, "What is the full title of your main job?" If you respond "school teacher", the system must determine which code to use from the 350 possible job codes in the UK Standard Occupation Coding Index. The system uses sophisticated software, trained with thousands of examples of correctly coded responses, to automatically recognise approximately 70 percent of the responses. Unrecognised responses are sent to highly trained operators to code. "Coding is a difficult and expensive process but our automatic coding software combined with highly productive user tools makes it possible to code all of the data accurately with a small number of operators," says Highland.
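A much-simplified sketch of automatic coding follows. It matches a response to the closest previously coded example by word overlap and leaves weak matches for an operator. The example responses, the codes, the similarity measure, and the threshold are all made up for illustration; the release does not describe the actual coding software, and the codes shown are not real occupation codes.

```python
# Hypothetical training examples: response text -> made-up occupation code.
EXAMPLES = {
    "school teacher": "T01",
    "primary school teacher": "T01",
    "bus driver": "D07",
    "lorry driver": "D08",
}


def tokens(text):
    """Lower-cased set of words in a response."""
    return set(text.lower().split())


def auto_code(response, min_similarity=0.6):
    """Return (code, similarity) for the closest example.

    Similarity is the share of words the two texts have in common
    (Jaccard overlap). If the best match is weaker than `min_similarity`,
    the code is None and the response should go to an operator.
    """
    words = tokens(response)
    best_code, best_sim = None, 0.0
    for example, code in EXAMPLES.items():
        example_words = tokens(example)
        union = words | example_words
        sim = len(words & example_words) / len(union) if union else 0.0
        if sim > best_sim:
            best_code, best_sim = code, sim
    if best_sim >= min_similarity:
        return best_code, best_sim
    return None, best_sim


print(auto_code("School Teacher"))
print(auto_code("astronaut"))
```

The threshold sets the trade-off between the share of responses coded automatically and the share sent to operators.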

Handwriting on millions of Census forms, like this representative sample, left, will be "read" by the Data Capture and Coding System. At right, an operator during system testing feeds Census forms into a scanner, which will produce a digitised record of the information on the form.

© 2007 Lockheed Martin Corporation. All rights reserved.