Free Hacking Tip: Use a Web Browser as an Encoding Tool
Posted on Tue, Sep 22, 2009
Another great post by Jim Compton. Today he shares a free hacking tip...
With the gradual spread of Unicode, differing character encodings are fortunately becoming less and less of a headache, but we're not quite there yet. Pre-Unicode encodings are still occasionally a factor in projects, and it can be challenging, and sometimes confusing, to identify what encoding you're looking at or to transcode without corrupting characters.
It may come as a surprise to you that right now you're probably looking at a capable and free encoding-diagnostics and transcoding tool - your web browser.
The Transcoding Challenge
Let's take a typical situation where we want to convert a file from some flavor of ANSI encoding into Unicode. To do this, both you and the application that will do the actual conversion of the bytes need to know the language-class of the file that you are converting.
Unlike Unicode, ANSI does not have a built-in mechanism (such as a byte-order mark) to identify its language-class to applications. Some applications, such as Windows Notepad, will try to guess, using clues such as the language of the operating system.
As is always the case when it comes to guessing, such applications can get it wrong.
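To make the byte-order mark idea concrete, here is a minimal Python sketch of the check an application can run before it falls back to guessing. The function name sniff_bom and the sample word are made up for illustration; cp1251 is Python's name for the Windows Cyrillic code page, which is assumed here to be what "Russian ANSI" means.

```python
import codecs

# Longer marks go first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading byte-order mark, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


unicode_bytes = codecs.BOM_UTF16_LE + "Привет".encode("utf-16-le")
ansi_bytes = "Привет".encode("cp1251")  # hypothetical Russian ANSI content

print(sniff_bom(unicode_bytes))  # utf-16-le
print(sniff_bom(ansi_bytes))     # None: nothing in the bytes names the code page
```

When the result is None, the application has nothing to go on but heuristics, which is exactly where the guessing starts.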
Here's what it looks like when a Russian ANSI file is misinterpreted as being Western European.

Figure 1 - Russian ANSI misinterpreted as Western European ANSI
The file is not corrupted per se, but is being improperly displayed by the well-intentioned but ill-informed application. Any conversion at this point would cause corruption, however, as the resultant Unicode would essentially be exactly what we're seeing here, but with every character re-encoded using multiple bytes.
This means it is important to perform character-encoding conversion in an application that lets you state explicitly which language-class the ANSI file belongs to.
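To see why converting after a wrong guess does damage, here is a small Python sketch. It assumes that "Russian ANSI" is the Windows Cyrillic code page (cp1251 in Python) and that "Western European ANSI" is cp1252; the sample word is arbitrary.

```python
original = "Привет"
raw = original.encode("cp1251")   # the bytes as stored in the Russian ANSI file

wrong = raw.decode("cp1252")      # application guesses Western European
right = raw.decode("cp1251")      # application is told the file is Cyrillic

print(wrong)  # Ïðèâåò
print(right)  # Привет

# Converting after the wrong guess writes the wrong characters into the
# Unicode output, so the two conversions produce different files.
converted_wrong = wrong.encode("utf-16-le")
converted_right = right.encode("utf-16-le")
print(converted_wrong == converted_right)  # False
```

The raw bytes are the same in both cases. Only the code page named at decode time differs, and that choice decides which characters end up in the converted file.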
Enter the Web Browser!
Web browsers, because they have to deal with the possibility of encountering any variety of encodings while surfing the World Wide Web, tend to have rich multilingual read/write abilities.
To make use of this functionality, open up your text file in your favorite browser using either the File->Open method or by simply dragging the text file into the browser window.
(Note: If your file has an extension that your browser doesn't recognize, such as .properties or .rc, the browser may assume that the file is binary and try to "download" it for you. Since we're only talking about text-based resource formats, you can temporarily trick the browser by first adding a .txt suffix to the file name. A more permanent solution would be to configure the browser to treat these file types as plain text.)
Once open in the web browser, the file may or may not be automatically recognized with the correct type of encoding. Just like Notepad, the browser takes a guess, and it would be safest to assume that its guess was wrong.
Here is what the same Russian file looks like when opened by Microsoft Internet Explorer, which incorrectly guessed that the file is a UTF-8 file sans byte-order mark.
Figure 2 - Russian ANSI misinterpreted as UTF-8
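The UTF-8 guess is one that a program can often rule out on its own, because Cyrillic ANSI text is rarely valid UTF-8. The sketch below is a hedged illustration in Python: looks_like_utf8 is a hypothetical helper, and cp1251 is again assumed to be the file's real encoding.

```python
raw = "Привет".encode("cp1251")  # hypothetical Russian ANSI content

# A lenient decoder substitutes U+FFFD (the replacement character) wherever
# the bytes do not form valid UTF-8, which is what shows up on screen.
as_utf8 = raw.decode("utf-8", errors="replace")
print("\ufffd" in as_utf8)  # True


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(looks_like_utf8(raw))                        # False
print(looks_like_utf8("Привет".encode("utf-8")))   # True
```

A failed strict decode proves the file is not UTF-8. A successful one is only evidence, since pure ASCII text, for example, is valid in UTF-8 and in every ANSI code page alike.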
Fortunately, browsers include a way to tell them explicitly which encoding to use when interpreting the file. In Internet Explorer, for example, there is an encoding selector under the View menu.

Figure 3 - Encoding selector in Microsoft Internet Explorer
Note: there are many more encodings to choose from under View->Encoding->More.
Since the file is actually encoded as Russian ANSI (or "Cyrillic (Windows)"), we need to tell the browser so by choosing this encoding from View->Encoding. Here is what our Russian ANSI file looks like once we've explicitly told the browser its encoding.

Figure 4 - Russian ANSI correctly interpreted
It is important to understand that even though the display has changed, we haven't made any actual changes to the underlying file at this point. The file is byte-for-byte identical to what it was when it looked like a bunch of box-characters, but now the browser is interpreting the bytes correctly.
Once the browser is properly interpreting the actual encoding of the file, we can use the browser to change the encoding to Unicode. In Microsoft Internet Explorer, this can be done through the Save As dialog by choosing Unicode from the Encoding drop-down.

Figure 5 - Using "Save As" to change encoding
The resultant file should be properly encoded as Unicode (little endian byte-order) with a byte-order mark. Since the file is now explicitly Unicode, Notepad should be able to open it up without issue.

Figure 6 - Russian file encoded as Unicode
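For comparison, the same two steps (state the source encoding explicitly, then save as little-endian Unicode with a byte-order mark) can be sketched in a few lines of Python. This is an illustration under assumptions, not what the browser does internally: the function name, the file names, and the choice of cp1251 as the source code page are all hypothetical.

```python
import codecs
import os
import tempfile


def transcode_to_utf16le(src_path, dst_path, source_encoding):
    """Decode src_path with the stated encoding; write UTF-16 LE with a BOM."""
    with open(src_path, "rb") as src:
        text = src.read().decode(source_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(codecs.BOM_UTF16_LE)       # the mark Notepad looks for
        dst.write(text.encode("utf-16-le"))


# Demonstration with a throwaway file standing in for the Russian ANSI file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "russian_ansi.txt")
    dst = os.path.join(tmp, "russian_unicode.txt")
    with open(src, "wb") as f:
        f.write("Привет".encode("cp1251"))
    transcode_to_utf16le(src, dst, "cp1251")
    with open(dst, "rb") as f:
        result = f.read()

print(result[:2] == codecs.BOM_UTF16_LE)  # True
print(result.decode("utf-16"))            # Привет
```

The source encoding is a required argument on purpose: as with the browser, the conversion is only safe when a person who knows the file's language-class supplies it.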
Ta-da!