Sunday, March 19, 2006

Office XP Proofing Tools for Indian Text Entry, Spell Checking, and Web pages


By Marko Malyj, August 2003

A new Microsoft product, Office XP Proofing Tools, finally opens up the evolving world of Windows to the major languages of India. You can type in Hindi, Gujarati, Kannada, Marathi, Punjabi, Tamil, and Telugu, using the global Unicode standard for world languages. This means that you can now create Indian-language documents or web sites that will remain readable for many decades to come on future computing systems.

What makes this software a must-have is its spell-checking capability. For the first time, you can type in any one of these Indian languages and be notified instantly of your spelling errors. You can even get a list of possible corrections, just by right-clicking on the mistake.

Office XP Proofing Tools, also known as Office 2002 Proofing Tools, is available for about $75 U.S. from Microsoft. To type in the Indian languages, you must have Windows XP and some version of Microsoft Office 2002.

Installation.

Run the setup program for Office XP Proofing Tools. Select all the options that you desire to use.

Configuration.

When the setup routine is done, click on Start, Control Panel, and then Regional and Language Options (available from Control Panel’s “Classic View”). Choose the Languages tab, then under “Supplemental Language Support”, check the box labeled “Install files for complex script and right-to-left languages”. Then, still on the Languages tab, under “Text Services” click on Details.

A window titled “Text Services and Input Languages” will pop up. Under “Installed services” click on the Add button, then select the language you want to work with and click OK. You may select more than one language by clicking on the Add button again. Still in the “Text Services and Input Languages” window, under “Preferences” click on the “Language Bar” button, then check “Show the Language bar on the desktop” and “Show additional Language bar icons in the taskbar”. Back under “Preferences”, click on “Key Settings”, highlight “Switch between input languages”, click on “Change Key Sequence”, then choose Left-Alt-Shift to switch input languages and Ctrl-Shift to switch keyboard layouts. Click OK in each window until all are closed, then exit Control Panel.

Keyboard Layouts and Fonts.

On the Windows XP task bar you will see a little blue box that tells you which language is currently in use. Normally it will show “EN” for English. In Microsoft Word, when you press Left-Alt-Shift, it will switch to the next available language that you installed, for example, “GU” for Gujarati. The font will also automatically change, depending on the language. Here is the list of fonts for the different languages:


Language | Font | Unicode Subrange
Hindi | Mangal | Devanagari
Gujarati | Shruti | Gujarati
Kannada | Tunga | Kannada
Tamil | Latha | Tamil
Telugu | Gautami | Telugu
Marathi | Mangal | Devanagari
Punjabi | Raavi | Gurmukhi


These fonts are available on all Windows XP systems. A second font is also available for each of the languages: Arial Unicode MS. This font is available on Windows XP, and also on Windows 2000 computers, which means that any documents that you prepare can be read (though not edited) on Windows 2000 machines.
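The pairing of languages, scripts, and fonts can be sketched in Python as a lookup by Unicode range. This is only an illustration, not part of Office or Windows: the block ranges come from the Unicode standard, the font names come from the list above, and the function name `font_for` and the choice of Arial Unicode MS as the fallback for everything else are assumptions made for this example.

```python
# (first code point, last code point, Unicode block, font from the list above)
SCRIPT_FONTS = [
    (0x0900, 0x097F, "Devanagari", "Mangal"),   # Hindi, Marathi
    (0x0A00, 0x0A7F, "Gurmukhi", "Raavi"),      # Punjabi
    (0x0A80, 0x0AFF, "Gujarati", "Shruti"),
    (0x0B80, 0x0BFF, "Tamil", "Latha"),
    (0x0C00, 0x0C7F, "Telugu", "Gautami"),
    (0x0C80, 0x0CFF, "Kannada", "Tunga"),
]

# Assumption for this sketch: fall back to the second font the article mentions.
FALLBACK_FONT = "Arial Unicode MS"


def font_for(text):
    """Return the font for the first Indian-script character found in text."""
    for ch in text:
        code_point = ord(ch)
        for first, last, _block, font in SCRIPT_FONTS:
            if first <= code_point <= last:
                return font
    return FALLBACK_FONT


print(font_for("\u0A97\u0AC1\u0A9C"))  # Gujarati letters
print(font_for("\u0939\u093F"))        # Devanagari letters
```

The lookup stops at the first character that falls inside one of the listed blocks, which is enough for text written in a single script.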

The keyboard layout makes all the letters of the alphabet available for typing. For instance, the Gujarati keyboard layout is shown at the top of this article.

It is very easy to combine characters. For example, to create the Gujarati combination નિ, you type the letter ન, followed by the symbol 'િ'. The combination મો would be મ followed by 'ો'.

Conjunct consonants require the '્' key, which Unicode calls the “virama”. For example, the Gujarati ધ્ય is produced by typing ધ, the virama '્', then ય. The one-letter word હ્યું takes five keystrokes: the letter હ, then the virama '્', followed by the ય, 'ુ' and 'ં' symbols.
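To see what is actually stored when you type such a combination, a short Python sketch can list the Unicode code points behind the five-keystroke word above. It uses only the standard `unicodedata` module; the helper name `describe` is made up for this example, and the word is written with escapes so that each keystroke is visible.

```python
import unicodedata

# The five keystrokes from the text: HA, VIRAMA, YA, VOWEL SIGN U, ANUSVARA.
WORD = "\u0AB9\u0ACD\u0AAF\u0AC1\u0A82"


def describe(text):
    """Return a (code point, Unicode name) pair for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(WORD):
    print(code_point, name)
```

Although the word displays as a single cluster, the string holds five separate code points, one per keystroke, and the virama is stored as its own character (GUJARATI SIGN VIRAMA).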

Some character combinations are not obvious by looking at the keyboard layout. For the Gujarati language, here is a list of the most commonly used conjunct consonants.


Shruti | Arial Unicode MS | Shruti | Arial Unicode MS
ભ્ર  ભ્-ર | ભ્ર  ભ્-ર | રૂ  ર-ૂ | રૂ  ર-ૂ
હૃ  હ-ૃ | હૃ  હ-ૃ | ર્બ  ર્-બ | ર્બ  ર્-બ
હ્ય  હ્-ય | હ્ય  હ્-ય | ર્ગ  ર્-ગ | ર્ગ  ર્-ગ
ગ્ર  ગ્-ર | ગ્ર  ગ્-ર | ર્દ  ર્-દ | ર્દ  ર્-દ
ગ્ન  ગ્-ન | ગ્ન  ગ્-ન | ર્પ  ર્-પ | ર્પ  ર્-પ
ઘ્ય  ઘ્-ય | ઘ્ય  ઘ્-ય | ર્થ  ર્-થ | ર્થ  ર્-થ
દ્બ  દ્-બ | દ્બ  દ્-બ | ર્મ  ર્-મ | ર્મ  ર્-મ
દ્ભ  દ્-ભ | દ્ભ  દ્-ભ | ર્વ  ર્-વ | ર્વ  ર્-વ
દ્ગ  દ્-ગ | દ્ગ  દ્-ગ | ર્ય  ર્-ય | ર્ય  ર્-ય
દ્ઘ  દ્-ઘ | દ્ઘ  દ્-ઘ | ક્ર  ક્-ર | ક્ર  ક્-ર
દ્દ  દ્-દ | દ્દ  દ્-દ | ખ્ર  ખ્-ર | ખ્ર  ખ્-ર
દ્ધ  દ્-ધ | દ્ધ  દ્-ધ | ત્ત  ત્-ત | ત્ત  ત્-ત
દ્ર  દ્-ર | દ્ર  દ્-ર | છ્વ  છ્-વ | છ્વ  છ્-વ
દ્મ  દ્-મ | દ્મ  દ્-મ | મ્ર  મ્-ર | મ્ર  મ્-ર
દ્ન  દ્-ન | દ્ન  દ્-ન | ન્ન  ન્-ન | ન્ન  ન્-ન
દ્વ  દ્-વ | દ્વ  દ્-વ | સ્ત્ર  સ્-ત્ર | સ્ત્ર  સ્-ત્ર
દ્ય  દ્-ય | દ્ય  દ્-ય | સ્ર  સ્-પ | સ્ર  સ્-પ
ધ્ય  ધ્-ય | ધ્ય  ધ્-ય | રુ  ર-ુ | રુ  ર-ુ
જી  જ-ી | જી  જ-ી | શ્ચ  શ્-ચ | શ્ચ  શ્-ચ
પ્ર  પ્-ર | પ્ર  પ્-ર | શ્વ  શ્-વ | શ્વ  શ્-વ


You will notice that the Arial Unicode MS font does not display all the symbols in the same way – in some cases it displays the virama instead of the conjunct consonant. However, the virama is recognized by most native language readers.

If you see '?' (question mark) or □ (square) characters in this article, it means that you do not have the Arial Unicode MS and Shruti fonts installed on your system. The Arial Unicode MS font is not a free download (see http://lists.webjunction.org/wjlists/web4lib/2002-August/020299.html). It is bundled with Office XP/Word 2002 and Office 2003/Word 2003. To install it, see http://support.microsoft.com/kb/q287247/. If you do not have these products, you can purchase the Arial Unicode MS font for $20 from http://www.myfonts.com/fonts/microsoft/arial/. The Shruti.ttf font is bundled with Windows XP and Windows Server 2003. For other versions of Windows, you can download it for free from http://www.readgujarati.com/readinghelp.asp.

Numbers.

The keyboard layout allows you to enter Western-style numerals, 0 through 9. If you wish to type numbers in the Indian-language style, open the Windows Character Map utility by clicking on Start, then All Programs, Accessories, System Tools, and Character Map. Select the Indian font that you will be using (see the list of fonts by language above). For example, choose Shruti if you are typing in Gujarati. You can then click on all the Indian-style numerals that you want, and then click on Copy. Switch back to your Word document, click on the Edit menu, then Paste Special, and Unformatted Unicode Text. The Gujarati-style numerals look like this:
૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯.
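If the numbers already exist as Western digits, the substitution can also be scripted rather than done by hand in Character Map. The following Python sketch is one possible approach, not something the Proofing Tools provide; it relies on the fact that the Gujarati digits occupy the consecutive Unicode code points U+0AE6 (zero) through U+0AEF (nine). The function name `to_gujarati_digits` is hypothetical.

```python
# Gujarati digits are U+0AE6 (zero) through U+0AEF (nine), in order.
GUJARATI_DIGITS = "".join(chr(0x0AE6 + i) for i in range(10))

# Translation table mapping each Western digit to its Gujarati counterpart.
WESTERN_TO_GUJARATI = str.maketrans("0123456789", GUJARATI_DIGITS)


def to_gujarati_digits(text):
    """Replace Western digits in text with Gujarati digits; leave the rest alone."""
    return text.translate(WESTERN_TO_GUJARATI)


print(to_gujarati_digits("August 2003"))
```

Only the digits change; letters and punctuation pass through untouched. The same pattern would work for the other scripts by swapping in the starting code point of that script's digit zero.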

Spell checking.

When a possible misspelling is detected, it will be underlined with a wavy red line. To get a list of possible corrections, simply right-click on the misspelled word. You will get a pop-up list of many different words to choose from! This is available for Hindi, Gujarati, Kannada, Marathi, Punjabi, Tamil, and Telugu.

Producing web pages.

Of course, you could save a Microsoft Word document as an HTML file. However, you will see the typical mess of Microsoft tags, a black hole to any professional HTML developer. Also, Unicode data is better handled by XHTML than by HTML.

So, if you are a serious web page designer, and you want to produce clean XHTML files from your Word documents, here’s one way to do this (this was written in 2003):

a) Use Microsoft Word 2002 – it makes it easy to enter text and check spelling. Copy the text, then
b) Use WordPad – Paste, and right away Copy
c) Use Microsoft FrontPage 2002 – create a blank HTML file, switch from Normal to HTML view, then replace all the FrontPage-style code with the following basic set of XHTML tags:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link href="chars-descn.css" rel="stylesheet" type="text/css" />
</head>
<body>
<span class="gu">
Gujarati text gets pasted here
</span>
</body>
</html>

The chars-descn.css file here is a Cascading Style Sheet. It would include a section that defines the “gu” style like this:

span.gu {
  font-size: 1.25em;
  font-family: Shruti, "Arial Unicode MS", sans-serif;
}

Notice that it calls for the Shruti font if available, otherwise the Arial Unicode MS font. This will cover all Windows XP and Windows 2000 computers.

Now paste the Gujarati text that you had in WordPad into position between the tags, and Save your XHTML document. The XHTML tags will trick FrontPage into not converting the Gujarati text into a series of unrecognizable special characters. Otherwise અમેરિકા would become &#2693;&#2734;&#2759;&#2736;&#2751;&#2709;&#2750; !
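The string of special characters shown above is a sequence of numeric character references, one per Unicode code point, written in decimal. A minimal Python sketch makes the correspondence explicit and shows that the conversion can be reversed. It illustrates the encoding only and does not reproduce FrontPage's own logic; the function name `to_char_refs` is made up for this example.

```python
import html


def to_char_refs(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return "".join(f"&#{ord(ch)};" if ord(ch) > 127 else ch for ch in text)


# The Gujarati word from the paragraph above, written as escapes.
word = "\u0A85\u0AAE\u0AC7\u0AB0\u0ABF\u0A95\u0ABE"

encoded = to_char_refs(word)
print(encoded)

# html.unescape turns the references back into the original characters.
print(html.unescape(encoded) == word)
```

Python's built-in `word.encode("ascii", "xmlcharrefreplace")` produces the same references as bytes. A browser displays both forms identically; the difference is only that the references are unreadable when you look at the page source.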

d) Use Dreamweaver MX – You can now open up the XHTML file you created in FrontPage, format it just the way you want to, and still be able to read the Indian text when you view the HTML Code.

Conclusion.

Microsoft has finally recognized that there are a billion people in India. Many thanks to the Indian language experts and software developers that they paid to do this. And if Microsoft lowers the price on their software, they will truly achieve monumental success in India.