HOW INFORMATION SYSTEMS WILL 'READ' HANDWRITING IN THE
TALLY OF THE CENSUS
15 April 2001
WIDNES, UNITED KINGDOM, May 15--Automated recognition
technology will be used to read the handwriting on the millions
of Census forms that will pour through the Census processing
centre in Widnes, Cheshire.
It should come as no surprise that information technology is
an integral part of the processing of the Census. But how do
scanning and recognition systems cope with the great variances
in handwriting? Can computer software determine what the
respondent meant to write, and do so quickly and reliably enough
to capture the volumes of data that will be represented on the
forms processed during this year's Census? And will computers
be able to cope with some of the "accidents" that befall Census
forms en route, such as tea stains, torn corners, or pens that
leak ink?
Lockheed Martin systems engineers, who worked with the U.K.
Census offices to develop a Data Capture and Coding System (DCCS)
to process the 2001 Census, report that not only will the system
handle the huge variances in handwriting--but that it will do so
more accurately and with greater confidence levels than any
previous system.
Able to handle all kinds of handwriting
"The systems we've developed will be able to deal with all
kinds of handwriting types," says Fred Highland, systems
architect who led the systems design for Lockheed Martin Mission
Systems. "We have tested it exhaustively and validated the
results in dry runs and we know that it is fully capable of
handling as large an image recognition project as this."
Highland notes that the technology was also used successfully to
handle the data capture of some 147 million forms in the United
States Census last year.
Tom Roe, service director for ICL's Census 2001 project, said:
"As part of this prestigious national contract, ICL has filled
over 1,000 jobs in the north-west of England which will play an
important role in the accurate conversion and processing of the
33 million Census forms from May until March next year when the
project is due for completion. To achieve this, ICL is training
these employees to utilise the handwriting recognition
technology which is contributing to the speed and efficiency of
this first Census of the new millennium."
How does it all work? DCCS supports the entire Census
processing from check-in of returned forms to the point where
the final captured data is forwarded to Census analysts. It is
the scanning and recognition activity, though, that is at the
heart of the operation. Forms will be scanned after a check-in
process in which they are removed from their boxes and all
staples are taken out. Scanning is akin to taking a computer
"photo" of the form: it digitises all the data entered,
including the handwriting.
Fat, slanted, little and tall "A's" are all recognised
The system next evaluates the zones or areas where handwritten
entries are expected. If anything is present in one of these
zones, the system attempts to recognise it. First, a zone is
segmented into characters by looking for breaks in the writing.
(Note that respondents are instructed to print.) After
segmentation into possible characters, the digitised handwriting
images are analysed according to the size and shape of
the characters or alphabet letters written. It is a process
called optical character recognition (OCR) and is done by a type
of statistical analysis that is programmed to recognise each
alphabet letter in a number of variations--a big fat A, a little
A, an A with a decided slant, a capital A, a lower case a--all
specifically programmed for UK handwriting styles.
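The segmentation step described above can be illustrated with a minimal sketch. It assumes the zone has already been reduced to a grid of 0s (blank) and 1s (ink) and treats any fully blank column as a break between characters. The function name and the sample grid are illustrative only and are not taken from DCCS.

```python
def segment_columns(image):
    """Split a binary zone image into character spans at blank columns.

    `image` is a list of rows, each a list of 0 (blank) or 1 (ink).
    Returns (start, end) column pairs, with `end` exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # A column counts as inked if any row has ink in it.
    inked = [any(row[x] for row in image) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x                  # a character begins here
        elif not has_ink and start is not None:
            spans.append((start, x))   # a blank column ends the character
            start = None
    if start is not None:
        spans.append((start, width))   # character touching the right edge
    return spans


# Hypothetical 2-row zone containing three separated marks.
zone = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
]
print(segment_columns(zone))
```

Printed characters that touch each other, or script writing, would not be separated by this simple rule, which is consistent with the instruction to respondents to print.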
Note to Editors: The Optical Character Recognition
(OCR) engine (that is a systems term for the software that does
this) comes up with its best judgment of what the letter is and
also attaches a confidence factor, a number from 0 to 100
indicating just how sure it is of the letter being recognised.
Sometimes the OCR engine also provides a second choice (with a
lower, but still credible, confidence level). The functions of
the commercial OCR engine are complemented with unique software
developed by Lockheed Martin engineers that enhances the
evaluation of characters in context. Included in this unique
context checking are cross checking of related fields, checking
the structure of postcodes and "trigram" analysis. Trigram
analysis is the evaluation of all three-letter combinations in a
word to determine its likelihood of being a valid English word.
A table is created containing all three-letter combinations of
letters in the English alphabet (i.e., AAA, AAB…ZZZ). Associated
with each three-letter combination is its frequency of
occurrence at the beginning, middle, and end of English words.
The initial confidence of accurate recognition is then adjusted,
based on the trigram analysis. If, for example, three "i's" in
succession were discovered in a word, the software will reduce
its confidence rating since there is no word with three "i's" in
a row. Conversely, a letter combination that appears frequently
would get a high rating. If possible, invalid trigrams are
replaced with alternative valid trigrams, using alternate
recognition choices provided by the OCR engine.
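As a rough illustration of the trigram adjustment described above, the sketch below scores a word by its weakest trigram and then picks, from the OCR engine's first and second choices, the reading with the best adjusted confidence. The frequency table, the way the scores are combined (lowest character confidence multiplied by the lowest trigram frequency), and all names are assumptions made for the example; the release does not specify the actual formula.

```python
from itertools import product

# Hypothetical miniature table: trigram -> (start, middle, end) frequencies.
# A full table would cover every combination from AAA to ZZZ.
TRIGRAM_FREQ = {
    "SMI": (0.9, 0.1, 0.0),
    "MIT": (0.1, 0.8, 0.3),
    "ITH": (0.0, 0.5, 0.9),
}


def trigram_score(word):
    """Return the lowest position-aware trigram frequency found in `word`."""
    word = word.upper()
    if len(word) < 3:
        return 1.0  # no trigram to check, so leave the confidence unchanged
    last = len(word) - 3
    scores = []
    for i in range(last + 1):
        start, middle, end = TRIGRAM_FREQ.get(word[i:i + 3], (0.0, 0.0, 0.0))
        if i == 0:
            scores.append(start)   # a three-letter word is scored as a start
        elif i == last:
            scores.append(end)
        else:
            scores.append(middle)
    return min(scores)


def best_reading(choices):
    """Pick the candidate word with the highest adjusted confidence.

    `choices` holds, for each character position, a list of
    (letter, confidence 0-100) options from the OCR engine.
    """
    best_word, best_conf = None, -1.0
    for combo in product(*choices):
        word = "".join(letter for letter, _ in combo)
        ocr_conf = min(conf for _, conf in combo)
        adjusted = ocr_conf * trigram_score(word)
        if adjusted > best_conf:
            best_word, best_conf = word, adjusted
    return best_word, best_conf


# The third character is read as "L" first, with "I" as the second choice.
choices = [[("S", 95)], [("M", 90)], [("L", 70), ("I", 60)],
           [("T", 92)], [("H", 94)]]
print(best_reading(choices))
```

In this made-up case the first-choice reading "SMLTH" contains trigrams absent from the table, so the second-choice letter wins and the word is read as "SMITH".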
Poorly written words sometimes need human operator help
Using the combination of the OCR engine, the trigram
analysis, and the confidence factors, each word is judged to be
either recognised with high confidence or not. Those words
recognised with high confidence are placed directly in the
output data for the form being processed. Those words not
recognised with high confidence are forwarded to a human
operator who is presented with images of low confidence words
and keys the values. Reasons for failure to recognise a word
with high confidence include the presence of characters that are
poorly written (an "A" that resembles an "R"), words that have
been scratched out and rewritten elsewhere on the form, and
words that have been written in script rather than printed.
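The routing decision described above amounts to a threshold test on each word's confidence. The sketch below is a minimal illustration; the cut-off of 90 and the field names are hypothetical, since the release does not say what threshold DCCS uses.

```python
HIGH_CONFIDENCE = 90  # hypothetical cut-off on the 0-100 confidence scale


def route_words(words):
    """Separate recognised words into accepted output and keying work.

    `words` is a list of (field_name, text, confidence) tuples.
    Returns (accepted, to_keyer): accepted holds (field_name, text) pairs,
    to_keyer holds the field names whose images go to a human operator.
    """
    accepted, to_keyer = [], []
    for field, text, confidence in words:
        if confidence >= HIGH_CONFIDENCE:
            accepted.append((field, text))
        else:
            to_keyer.append(field)
    return accepted, to_keyer


# Made-up results for one form.
form = [("surname", "SMITH", 96), ("job_title", "TEACHEP", 41)]
print(route_words(form))
```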
How accurate is the whole process? "If the automated results
were accepted completely, and no low confidence characters were
keyed by human operators, approximately 85 percent of the
characters would be entirely correct. This is pretty good, when
you consider the range of hand printing that must be processed,"
according to Lockheed Martin's Highland. However, the ONS
requires that the data be at least 98 percent accurate, so the
low confidence characters are forwarded to keyers for human
recognition and entry. By having human keyers enter the 15
percent of the words with the lowest automated recognition
confidence, the desired accuracy will be achieved.
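The arithmetic behind these figures can be sketched as a weighted average. The calculation below assumes that the keyed share and the accuracy figures are measured in the same unit and that overall accuracy is the share-weighted mix of keyed and automatically accepted items. Both are simplifying assumptions, not statements from the ONS or Lockheed Martin, and the keyer accuracy values are hypothetical.

```python
def required_auto_accuracy(target, keyed_share, keyer_accuracy):
    """Accuracy needed on the automatically accepted items to hit `target`.

    Solves target = keyed_share * keyer_accuracy
                    + (1 - keyed_share) * auto_accuracy
    for auto_accuracy. All arguments are fractions between 0 and 1.
    """
    return (target - keyed_share * keyer_accuracy) / (1.0 - keyed_share)


# 98 percent target, 15 percent keyed, keyers assumed to be perfect.
print(round(required_auto_accuracy(0.98, 0.15, 1.0), 4))
```

Under these assumptions the high-confidence 85 percent would need to be roughly 97.6 percent accurate on its own, and slightly more if keyers make occasional errors.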
System reduces need for keying support by 85 percent
Accuracy at this level means that the system can be counted on
to process and verify the information without the need for
extensive keying operations. In fact, the use of DCCS has
reduced by almost 85 percent the keying support that would
have been required if past systems had been used.
Continuous quality control to flag problems
DCCS also includes ongoing quality control that can spot
problems virtually as they occur.
Highland explains: "Dynamically, as data from forms is
collected, we will siphon off data samples and send them to a
human operator for keying. This provides a continuous rolling
measure of OCR and keyer accuracy. If for any reason there is a
problem, we will process the forms over again until we are sure
that we have it right." Ultimately, Lockheed Martin's role is to
ensure that the data collected is accurate.
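A continuous rolling measure of the kind Highland describes could look something like the sketch below, which compares automatically captured values against independently keyed values over a fixed-size window of recent samples. The class name, window size, and sample values are illustrative assumptions, not details of DCCS.

```python
from collections import deque


class RollingAccuracy:
    """Track agreement between OCR output and keyed samples over a window."""

    def __init__(self, window):
        # Only the most recent `window` comparisons are kept.
        self.results = deque(maxlen=window)

    def record(self, ocr_value, keyed_value):
        """Store whether the OCR value matched the operator's keyed value."""
        self.results.append(ocr_value == keyed_value)

    def accuracy(self):
        """Return the share of matches in the window, or None if empty."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)


# Made-up sample pairs: (value captured by OCR, value keyed by an operator).
monitor = RollingAccuracy(window=4)
for ocr, keyed in [("SMITH", "SMITH"), ("JONES", "JONES"),
                   ("BR0WN", "BROWN"), ("PATEL", "PATEL"),
                   ("EVANS", "EVANS")]:
    monitor.record(ocr, keyed)
print(monitor.accuracy())
```

A falling value in such a measure would be the signal to investigate and, as the release notes, to reprocess the affected forms.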
And what about the tea spills on forms? "When we encounter a
form with spots or tears or other damage, the system
automatically identifies the problem. It alerts an operator who
makes a judgement on whether the form can indeed be read.
Sometimes, a corner might be missing, but not in an area where
information has been written, and we will be able to continue
with processing. If there is any doubt at all about the
readability, it will go to a human operator to decipher."
Now that we have read it, do we know what it means? Another
novel feature of the DCCS system is the process called coding.
Certain critical information on the form is read and grouped
into a standard set of categories for further analysis. For
example, one of the questions on the form asks, "What is the
full title of your main job?" If you respond "school teacher"
the system must determine which code to use from the 350
possible job codes in the UK Standard Occupation Coding Index.
The system uses sophisticated software trained with thousands of
examples of correctly coded responses to automatically recognise
approximately 70% of the responses. Unrecognised responses are
sent to highly trained operators to code. "Coding is a difficult
and expensive process but our automatic coding software combined
with highly productive user tools makes it possible to code all
of the data accurately with a small number of operators," says
Highland.
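The idea of coding from previously coded examples can be illustrated with a very small sketch. It lets each word of a response vote for the codes it appeared under in the training examples and accepts the winner only when it holds a clear share of the votes; anything else is left for an operator. The training pairs, the codes, and the 0.6 threshold are made up for the example and are not entries from the Standard Occupation Coding Index or the method used by the actual software.

```python
from collections import Counter, defaultdict

# Made-up training pairs; the codes are placeholders, not real index entries.
TRAINING = [
    ("school teacher", "T01"),
    ("primary school teacher", "T01"),
    ("bus driver", "D07"),
    ("lorry driver", "D08"),
]


def build_index(pairs):
    """Map each word to a count of the codes it was seen with."""
    index = defaultdict(Counter)
    for text, code in pairs:
        for token in text.lower().split():
            index[token][code] += 1
    return index


def auto_code(response, index, min_share=0.6):
    """Return a code for `response`, or None to refer it to an operator."""
    votes = Counter()
    for token in response.lower().split():
        votes.update(index.get(token, {}))
    if not votes:
        return None                      # no known words at all
    code, count = votes.most_common(1)[0]
    if count / sum(votes.values()) >= min_share:
        return code
    return None                          # too ambiguous to code automatically


index = build_index(TRAINING)
for job in ["School Teacher", "bus driver", "driver", "astronaut"]:
    print(job, "->", auto_code(job, index))
```

Here "driver" alone is split evenly between two codes and "astronaut" matches nothing, so both would go to an operator.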

Handwriting on millions of Census forms, like this
representative sample, left, will be "read" by the Data Capture
and Coding System. At right, an operator during system testing
feeds Census forms into a scanner, which will produce a
digitised record of the information on the form.