Lionbridge Innovation


Free Hacking Tip: Use a Web Browser as an Encoding Tool


Another great post by Jim Compton. Today he shares a free hacking tip...


With the gradual proliferation of Unicode, the issue of different character encodings is fortunately becoming less and less of a headache, but we're not quite there yet. Pre-Unicode encodings are still occasionally a factor for projects, and it can be challenging and sometimes confusing to identify what encoding you're looking at, or to perform transcoding without causing character corruption.

It may come as a surprise to you that right now you're probably looking at a capable and free encoding-diagnostics and transcoding tool - your web browser.

The Transcoding Challenge

Let's take a typical situation where we want to convert a file from some flavor of ANSI encoding into Unicode. To do this, both you and the application that will do the actual conversion of the bytes need to know the language-class of the file that you are converting.

Unlike Unicode, ANSI does not have a built-in mechanism (such as a byte-order mark) to identify its language-class to applications. Some applications, such as Windows Notepad, will try to guess using clues such as the language of the operating system.

As is always the case when it comes to guessing, such applications can get it wrong.
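To make the byte-order mark idea concrete, here is a minimal Python sketch of how a program might check for one. The function name `sniff_bom` is made up for this illustration, and this is not a claim about how Notepad or any browser actually does it. When no mark is found, as with any ANSI file, the bytes alone say nothing about the language-class.

```python
import codecs

# Check longer marks first: the UTF-32 LE mark begins with the same
# two bytes (FF FE) as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


# A Russian ANSI (Windows-1251) file carries no such mark.
ansi_bytes = "Привет".encode("cp1251")
print(sniff_bom(ansi_bytes))
```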

Here's what it looks like when a Russian ANSI file is misinterpreted as being Western European.


Figure 1 - Russian ANSI misinterpreted as Western European ANSI

The file is not corrupted per se, but is being improperly displayed by the well-intentioned but ill-informed application. Any conversion at this point would cause corruption, however, as the resultant Unicode would essentially be exactly what we're seeing here, but with every character re-encoded using multiple bytes.
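The same effect can be reproduced in a few lines of Python. This is only a sketch: the sample word is my own, and I am assuming that "Russian ANSI" corresponds to Python's `cp1251` codec and "Western European ANSI" to `cp1252`.

```python
russian = "Привет"

# The bytes as they would sit on disk in a Russian ANSI file.
ansi_bytes = russian.encode("cp1251")

# An application that wrongly assumes Western European sees this instead.
misread = ansi_bytes.decode("cp1252")
print(misread)

# Converting to Unicode at this point bakes the mistake in: each of the
# six wrong characters now takes two bytes in UTF-8.
corrupted = misread.encode("utf-8")
print(len(ansi_bytes), len(corrupted))
```

The original bytes are still intact in `ansi_bytes`; it is only the converted output that carries the wrong characters.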

This means that it is important to perform character-encoding conversion from within an application that lets you explicitly specify the language-class of the ANSI.

Enter the Web Browser!

Web browsers, because they have to deal with the possibility of encountering any variety of encodings while surfing the World Wide Web, tend to have rich multi-lingual read/write abilities.

To make use of this functionality, open up your text file in your favorite browser using either the File->Open method or by simply dragging the text file into the browser window.

(Note: If your file has an extension that your browser doesn't recognize, such as .properties or .rc, the browser may assume that the file is binary and try to "download" it for you. Since we're talking about text-based resource formats, you can temporarily trick the browser by first adding a .txt suffix to the file name. A more permanent solution would be to configure the browser to recognize these MIME-types as ANSI text.)

Once open in the web browser, the file may or may not be automatically recognized with the correct type of encoding. Just like Notepad, the browser takes a guess, and it would be safest to assume that its guess was wrong.
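To give a feel for why such guesses can go wrong, here is a toy Python sketch of one way a guess could be made: decode the bytes with each candidate encoding and see which result contains the most common Russian letters. This is purely an illustration. The candidate list, the letter set, and the function names are my own assumptions, and I am not claiming that any browser or Notepad works this way.

```python
# Ten frequent lowercase Russian letters (an assumption for this sketch).
COMMON_RUSSIAN = set("оеаинтсрвл")

# Candidate encodings to try, by Python codec name.
CANDIDATES = ["cp1251", "koi8_r", "cp866", "cp1252"]


def score(data: bytes, encoding: str) -> int:
    """Count how many decoded characters are common Russian letters."""
    text = data.decode(encoding, errors="replace")
    return sum(ch in COMMON_RUSSIAN for ch in text)


def guess_encoding(data: bytes, candidates=CANDIDATES) -> str:
    """Return the candidate whose decoding looks most like Russian text."""
    return max(candidates, key=lambda enc: score(data, enc))


sample = "Привет, мир".encode("cp1251")
print(guess_encoding(sample))
```

A heuristic like this only works if the right encoding is among the candidates and the text is long enough to be representative, which is exactly why a guess should not be trusted blindly.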

Here is what the same Russian file looks like when opened by Microsoft Internet Explorer, which incorrectly guessed that the file is a UTF-8 file sans byte-order mark.

 
Figure 2 - Russian ANSI misinterpreted as UTF-8
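For comparison, this is roughly what happens when Russian ANSI bytes are treated as UTF-8 in Python. It is a sketch with my own sample word, again assuming `cp1251` stands in for Russian ANSI. A strict decoder refuses the bytes outright, while a lenient one substitutes the Unicode replacement character, which many programs draw as a box or a question mark.

```python
ansi_bytes = "Привет".encode("cp1251")

# Strict UTF-8 decoding rejects these bytes as malformed.
try:
    ansi_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding replaces whatever it cannot interpret with U+FFFD.
lenient = ansi_bytes.decode("utf-8", errors="replace")

print(strict_ok)
print(lenient)
```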

Fortunately, browsers include a method by which we can explicitly tell the browser what kind of encoding it should be interpreting the file as. In Internet Explorer, for example, there is an encoding selector under the View menu.


Figure 3 - Encoding selector in Microsoft Internet Explorer

Note: there are many more encodings to choose from under View->Encoding->More.

Since the file is in actuality encoded as Russian ANSI (or "Cyrillic (Windows)"), we need to tell the browser this fact by choosing this encoding from View->Encoding. Here is what our Russian ANSI file looks like when we've explicitly told the browser.


Figure 4 - Russian ANSI correctly interpreted

It is important to understand that even though the display has changed, we haven't made any actual changes to the underlying file at this point. The file is byte-for-byte identical to what it was when it looked like a bunch of box-characters; the only difference is that the browser is now interpreting the bytes correctly.
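The same point can be checked outside the browser. Here is a small Python sketch, in which the helper names `read_with_encoding` and `sha256_of` are made up for this illustration: it reads a file under an explicitly chosen encoding and uses a checksum to confirm that the file on disk stays the same no matter how it is interpreted.

```python
import hashlib
from pathlib import Path


def read_with_encoding(path, encoding):
    """Interpret the file's bytes using an explicitly chosen encoding."""
    return Path(path).read_bytes().decode(encoding)


def sha256_of(path):
    """Checksum of the raw bytes on disk, independent of interpretation."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

Calling `read_with_encoding(path, "cp1251")` and then `read_with_encoding(path, "cp1252")` on the same file gives two different strings, but `sha256_of(path)` returns the same value before and after both calls.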

Once the browser is properly interpreting the actual encoding of the file, we can use the browser to change the encoding to Unicode. In Microsoft Internet Explorer, this can be done through the Save As dialog by choosing Unicode from the Encoding drop-down.


Figure 5 - Using "Save As" to change encoding

The resultant file should be properly encoded as Unicode (UTF-16, little-endian byte order) with a byte-order mark. Since the file is now explicitly Unicode, Notepad should be able to open it up without issue.
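For readers who would rather script this step, here is a minimal Python sketch of the same conversion. The function names are my own, and I am assuming that the "Unicode" option corresponds to UTF-16 little endian with a byte-order mark, and that the source file is Russian ANSI (`cp1251`) unless told otherwise.

```python
import codecs
from pathlib import Path


def transcode_to_utf16le(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes with the stated encoding, then re-encode as
    UTF-16 little endian, prefixed with a byte-order mark."""
    text = raw.decode(source_encoding)
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def convert_file(src, dst, source_encoding="cp1251"):
    """Write a UTF-16 LE copy of src to dst, leaving src untouched."""
    Path(dst).write_bytes(
        transcode_to_utf16le(Path(src).read_bytes(), source_encoding)
    )
```

As with the browser, the result is only as good as the encoding you name: pass the wrong `source_encoding` and the output will be valid Unicode containing the wrong characters.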


Figure 6 - Russian file encoded as Unicode

Ta-da!

Comments

Thanks for this article, Jim. I've used web browsers to view files in different encodings for a long time now, but it had never occurred to me that I could also use a web browser to convert from one encoding to another. This is good to know!  
--Tom
Posted @ Thursday, October 08, 2009 3:04 PM by Tom Roland
Good to hear from you Tom! Happy encoding!
Posted @ Monday, October 12, 2009 4:01 PM by Jim Compton
Hi Jim, just would like to say that this is absolutely a great article!  
THANK YOU VERY MUCH! I'm Russian myself, and your instruction and helpful screen shots were very clear. 
I'm not a programmer, but I was also wondering, is there any possibility to make notepad guessing more accurate, or is it all on Windows development level? Or maybe there is a program like a browser in your case, but actually more like notepad, that does recognizing better? 
(Currently using Windows 7, x64) 
Thanks again!
Posted @ Friday, December 18, 2009 2:26 PM by Rinat
Hi Rinat, 
 
Thanks for the comments and questions. I haven't made the leap to Windows 7 yet (although I'm looking forward to it), but I can tell you that under Windows XP (SP3), Notepad in particular takes its clues from whatever is specified in the "Language for non-Unicode programs" setting on the "Advanced" tab of the "Regional and Language Options" Control Panel. 
 
This doesn't make its guessing any better per se, but it does hand-off enough information to Notepad that it will open up a natively encoded ANSI file with the correct characters. 
 
Of course, this won't help you if you don't know the language of the file to begin with. And given that changing this setting requires a system restart, it is a fairly inefficient way of making Notepad do the right thing (imho), especially if you're dealing with lots of different languages. 
 
Many applications attempt to make educated guesses about encoding (presumably using some sort of probability analysis about likely frequency of characters), including but not limited to Microsoft Word. In the example I used, Microsoft Word 2007 correctly guessed "Cyrillic Windows." 
 
But again, I've never seen this work 100% of the time, and would use such systems with caution. If you find something that seems to work consistently, please let me know! Thanks, and happy New Year!
Posted @ Tuesday, January 12, 2010 9:35 AM by Jim Compton