Posted on Tue, Nov 10, 2009
Today Jim Compton's thinking about XLIFF and its use in structured translation.
In the early nineties, as CAT (computer-assisted translation) tools started to really take off, so was born the translation pivot file: an intermediary format that bridges translatable material from its native environment into the world of structured translation. The development of the pivot format facilitated the use of practices that we take for granted today, including translation memory, electronic glossaries, rule-based QA, etc.
At that time, the de facto translation pivot format was RTF - a clever application of the RTF standard that used character styles and hidden text to perform a variety of functions, including segregation of translatable from non-translatable material, and storage of the bilingual content within the file itself.
In 2002, the translation pivot format enjoyed an evolutionary jump with the advent of XLIFF, an XML-based format that works all of its magic using XML markup - you know, "tags."
If you're not using XLIFF with your own structured translation endeavors, here are a few reasons why you might want to consider doing so:
Solid Character Encoding
XLIFF uses Unicode to define encoding, which is straightforward, unambiguous, and relatively safe from transcoding errors and other forms of encoding corruption.
This may seem like an obvious "must", but the translation pivot format has not always been based on Unicode, and has historically been prone to transcoding and encoding issues that can be expensive and time-consuming to correct.
While the risk of encoding trouble isn't absent with an XLIFF-based translation process, it is certainly greatly minimized when compared to non-Unicode-based alternatives.
Strong, Normalized Metadata Support
The flexible support for standard and custom metadata at various levels (at the file level and at the translation unit level) opens up a world of possibilities for improving the localization process, including, but not limited to:
- Improving TM leveraging through the inclusion of context
- Pairing of content to specific glossaries or reference material
- Including length-limits and other "handling instructions" for translators
- Including information as to the content's state of completion or likely reliability
- etc.
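To make the list above a little more concrete, here is a minimal Python sketch that builds a single XLIFF 1.2 translation unit carrying a length limit, a state, a note for the translator, and a piece of context. The element and attribute names (trans-unit, maxwidth, size-unit, state, note, context-group) come from the XLIFF 1.2 specification as I understand it; the function name, the file name strings.properties, and the sample values are purely hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_xliff_unit(source_text, max_chars, note_text):
    """Build a one-unit XLIFF 1.2 document (illustrative sketch only)."""
    root = ET.Element("xliff", {"version": "1.2", "xmlns": XLIFF_NS})
    # File-level metadata: where the content came from and its languages.
    file_el = ET.SubElement(root, "file", {
        "original": "strings.properties",  # hypothetical source file
        "source-language": "en",
        "target-language": "de",
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, "body")
    # Unit-level metadata: a length limit expressed in characters.
    unit = ET.SubElement(body, "trans-unit", {
        "id": "1",
        "maxwidth": str(max_chars),
        "size-unit": "char",
    })
    ET.SubElement(unit, "source").text = source_text
    # The state attribute records how complete the translation is.
    ET.SubElement(unit, "target", {"state": "needs-translation"})
    # Context that a TM could use to improve leveraging.
    group = ET.SubElement(unit, "context-group", {"purpose": "match"})
    ET.SubElement(group, "context", {"context-type": "sourcefile"}).text = (
        "strings.properties"
    )
    # A handling instruction for the translator.
    ET.SubElement(unit, "note").text = note_text
    return root


root = build_xliff_unit("Save", 12, "Button label; keep it short.")
print(ET.tostring(root, encoding="unicode"))
```

A real pre-processing filter would of course generate many units from a source file according to its rules; this only shows where the metadata lives.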
Rules-based Pre-Processing
While technically not inextricably bound, XLIFF is closely affiliated with rules-based pre-processing, mostly because there are very few non-rule-based mechanisms for converting source files into XLIFF files.
The issue here is consistency. Rules-based pre-processing encourages pre-processing consistency, which in turn encourages segmentation consistency, which in turn encourages effective TM leveraging and therefore lower overall translation costs.
Extensibility
As a type of XML, the format itself is extensible, meaning that it is free to evolve into whatever it needs to become. Today, as OASIS (the organization that developed XLIFF) works on version 2.0 of the specification, the format continues to evolve and improve.
Openness and Standardization
That XLIFF is an open standard encourages participation from the world of developers and would-be tool developers, and ensures that their solutions are likely to be interoperable - creating operational freedom and flexibility.
And in fact, CAT tools built around the XLIFF standard continue to get better and better.
At Lionbridge, our own collection of XLIFF-based tools (which I would argue has become one of the best in existence) is helping us to automate more and more previously manual activities and to bring the localization process entirely online - moving us further toward the "El Dorado" of Localization 2.0.
Some Online Resources
Have I inspired you to research further? Here are some links that you may find helpful:
Of course, Lionbridge is willing and able to help you to establish an XLIFF-based structured localization process.
As always, your feedback is encouraged and appreciated. Thanks!
Posted on Tue, Sep 22, 2009
Another great post by Jim Compton. Today he shares a free hacking tip...
With the gradual proliferation of Unicode, the issue of different character encodings is fortunately becoming less and less of a headache, but we're not quite there yet. Pre-Unicode encodings are occasionally still a factor for projects, and it can be challenging and sometimes confusing to identify what encoding you're looking at, or to perform transcoding without causing character corruption.
It may come as a surprise to you that right now you're probably looking at a capable and free encoding-diagnostics and transcoding tool - your web browser.
The Transcoding Challenge
Let's take a typical situation where we want to convert a file from some flavor of ANSI encoding into Unicode. To do this, both you and the application that will do the actual conversion of the bytes need to know the language-class of the file that you are converting.
Unlike Unicode, ANSI does not have a built-in mechanism (such as a byte-order mark) to identify its language-class to applications. Some applications, such as Windows Notepad, will try to guess using clues such as the language of the operating system.
As is always the case when it comes to guessing, such applications can get it wrong.
Here's what it looks like when a Russian ANSI file is misinterpreted as being Western European.

Figure 1 - Russian ANSI misinterpreted as Western European ANSI
The file is not corrupted per se, but is being improperly displayed by the well-intentioned but ill-informed application. Any conversion at this point would cause corruption, however, as the resultant Unicode would essentially be exactly what we're seeing here, but with every character re-encoded using multiple bytes.
This means that it is important to perform character-encoding conversion from within an application that lets you explicitly specify the language-class of the ANSI.
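For anyone who likes to see this at the byte level, here is a small Python sketch of the same problem. It assumes that "Russian ANSI" means the Windows-1251 code page and "Western European ANSI" means Windows-1252; the sample word is my own.

```python
RUSSIAN_ANSI = "cp1251"   # Windows Cyrillic code page
WESTERN_ANSI = "cp1252"   # Windows Western European code page

# The bytes as they sit on disk in a Russian ANSI file.
raw = "Привет".encode(RUSSIAN_ANSI)

# Misinterpretation: the bytes are untouched, only the reading is wrong.
misread = raw.decode(WESTERN_ANSI)
print(misread)  # Ïðèâåò

# Converting while misinterpreted bakes the mistake into the new file.
corrupted = misread.encode("utf-8")

# Converting with the right language-class gives real Russian text.
correct = raw.decode(RUSSIAN_ANSI).encode("utf-8")

print(corrupted.decode("utf-8"))  # still Ïðèâåò
print(correct.decode("utf-8"))    # Привет
```

Note that raw never changes in this example. The damage only happens when the misread text is written back out in a new encoding.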
Enter the Web Browser!
Web browsers, because they have to deal with the possibility of encountering any variety of encodings while surfing the World Wide Web, tend to have rich multi-lingual read/write abilities.
To make use of this functionality, open up your text file in your favorite browser using either the File->Open method or by simply dragging the text file into the browser window.
(Note: If your file has an extension that your browser doesn't recognize, such as .properties, .rc, etc., the browser may assume that the file is a binary and try to "download" it for you. Since we're talking only about text-based resource formats, you can temporarily trick the browser by first adding a .txt suffix to the file name. A more permanent solution would be to configure the browser to recognize these MIME-types as ANSI text.)
Once open in the web browser, the file may or may not be automatically recognized with the correct type of encoding. Just like Notepad, the browser takes a guess, and it would be safest to assume that its guess was wrong.
Here is what the same Russian file looks like when opened by Microsoft Internet Explorer, which incorrectly guessed that the file is a UTF-8 file sans byte-order mark.
Figure 2 - Russian ANSI misinterpreted as UTF-8
Fortunately, browsers include a method by which we can explicitly tell the browser what kind of encoding it should be interpreting the file as. In Internet Explorer, for example, there is an encoding selector under the View menu.

Figure 3 - Encoding selector in Microsoft Internet Explorer
Note: there are many more encodings to choose from under View->Encoding->More.
Since the file is in actuality encoded as Russian ANSI (or "Cyrillic (Windows)"), we need to tell the browser this fact by choosing this encoding from View->Encoding. Here is what our Russian ANSI file looks like when we've explicitly told the browser.

Figure 4 - Russian ANSI correctly interpreted
It is important to understand that even though the display has changed, we haven't made any actual changes to the underlying file at this point. The file is byte-for-byte identical to what it was when it looked like a bunch of box-characters, but now the browser is interpreting the bytes correctly.
Once the browser is properly interpreting the actual encoding of the file, we can use the browser to change the encoding to Unicode. In Microsoft Internet Explorer, this can be done through the Save As dialog by choosing Unicode from the Encoding drop-down.

Figure 5 - Using "Save As" to change encoding
The resultant file should be properly encoded as Unicode (little endian byte-order) with a byte-order mark. Since the file is now explicitly Unicode, Notepad should be able to open it up without issue.
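If you would rather script the conversion than click through a browser, the same two steps (interpret the bytes with an explicitly stated encoding, then save as little-endian Unicode with a byte-order mark) can be sketched in Python. This is only a sketch: the function names are hypothetical, and it assumes the source is Windows-1251 unless you say otherwise.

```python
import codecs


def ansi_to_unicode(raw, ansi_encoding="cp1251"):
    """Return the UTF-16 little-endian bytes, with BOM, for ANSI bytes."""
    # Step 1: interpret the bytes using the explicitly stated encoding.
    text = raw.decode(ansi_encoding)
    # Step 2: re-encode, writing the byte-order mark ourselves so the
    # result does not depend on the byte order of the machine.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def convert_file(src_path, dst_path, ansi_encoding="cp1251"):
    """Read an ANSI file and write a Unicode copy of it."""
    # Binary mode keeps line endings exactly as they were in the source.
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(ansi_to_unicode(raw, ansi_encoding))
```

As with the browser, the script cannot know the language-class on its own; passing the wrong ansi_encoding produces the same kind of corruption described earlier.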

Figure 6 - Russian file encoded as Unicode
Ta-da!
Posted on Fri, Aug 21, 2009
Today Jim's talking about Subversion 1.6...
Howdy!
Are you a fan and/or user of the open-source version-control system Subversion®? If you're like me, you may have missed that back in March of this year the tool underwent a fairly substantial update - version 1.6.
Someone at CollabNet must have been reading my mind, because the 1.6 update includes a significant change that addresses one long-standing pain point for me - file-level granularity in the svn:externals property. Before 1.6, you could specify folders but not files, but now the externals property supports individual file definitions too. :-)
Why do I care about being able to include individual files within an externals definition, you ask? To answer that, I should briefly describe one of the common ways that Subversion is used at Lionbridge on localization projects.
Subversion in a Localization Project
Like other code development projects for which Subversion was actually designed, a localization project can be looked at as a set of files which undergo a series of changes based on the contributions of the project participants. In the localization world those contributions tend to be more focused on language activities (i.e. translation), however.
Subversion is great for managing localization projects. In addition to providing version control, an SVN Repository can act as the hub of an on-demand, pull-based file distribution system - allowing dozens of translators and other participants to be working simultaneously without the need of a file dispatcher (read: bottleneck). Combined with Logoport-based online TMs, such a mechanism supports round-the-clock progress and parallel processing on even very complex, high-volume projects.
When the process is designed properly, files remain in their natural directory structure throughout, avoiding the need to shuffle files around into different packages at different stages - at best an inefficient process and at worst a potential source of error.
The Pain Point
One challenge with having contributors interact directly with a Repository, however, has been in trying to prevent those folks from having to check out files which aren't relevant to them. German translators, for example, probably have little interest in the localized resources for French, Spanish, Japanese, Czech, Russian, etc. - so it makes sense to try to spare them the burden of having to download these files on check-out.
This is easier said than done, unfortunately. Of course if resources are segregated by language-specific sub-directories, you could always provide the German translators the path to the subdirectory containing the German resources, but if we have many language subdirectories (one per component, for example), we're hardly making their lives easier by doing so.
Enter "Custom Modules"
One technique that we've employed is the concept of modules (borrowed from the CVS world) - essentially an empty directory that is unique to a particular role and is "loaded" with externals pointing to all relevant subdirectories for that role.
For example, the German translators would be directed to check-out ProjectName/CustomModules/GermanResourcesModule
...an empty directory that includes an externals property which "links" to the following subdirectories within the "natural directory structure":
ProjectName/NaturalDirectoryStructure/SomeComponent/GermanResources
ProjectName/NaturalDirectoryStructure/AnotherComponent/GermanResources
ProjectName/NaturalDirectoryStructure/YAComponent/GermanResources
In essence the German translators get a custom package that includes only those directories which are relevant to them, but the files themselves remain linked to their natural home within the natural directory structure - the best of both worlds!
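For the curious, here is a rough Python sketch of how the externals definition for such a module could be generated. It assumes the Subversion 1.5+ externals syntax (URL first, then local path, with ^/ meaning "relative to the repository root"); the function name and the choice of local folder names are my own illustration, not a description of our actual tooling.

```python
def module_externals(components, language_dir,
                     root="^/ProjectName/NaturalDirectoryStructure"):
    """Build svn:externals text linking one language folder per component."""
    lines = []
    for component in components:
        # Where the folder really lives in the natural directory structure.
        url = f"{root}/{component}/{language_dir}"
        # Where it should appear inside the translator's working copy.
        local_path = f"{component}/{language_dir}"
        lines.append(f"{url} {local_path}")
    return "\n".join(lines)


print(module_externals(
    ["SomeComponent", "AnotherComponent", "YAComponent"],
    "GermanResources",
))
```

The output could be saved to a text file and applied to the module directory with something like svn propset svn:externals -F externals.txt GermanResourcesModule, followed by a commit.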
Unfortunately, since the externals property only supported directories prior to version 1.6, this technique would only work if your languages were segregated by subdirectory. If all your language resources live together in one directory (identified by a prefix or suffix in the file name, for example) this technique would simply not work.
But now that you can specify individual files in the externals property (ideally with some kind of regular-expression syntax), this limitation is hopefully a thing of the past.
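As far as I know, an externals definition names each target explicitly rather than by pattern, so in practice the pattern matching may need to happen in a small script that writes one externals line per matching file. Here is a minimal Python sketch of that idea. The file names, the _de suffix convention, and the repository path are hypothetical, and I'm assuming that a file external uses the same "URL local-name" form as a directory external.

```python
from fnmatch import fnmatchcase


def file_externals(filenames, pattern, repo_dir):
    """Build one svn:externals line per file name matching the pattern."""
    lines = []
    for name in sorted(filenames):
        # fnmatchcase is case-sensitive on every platform.
        if fnmatchcase(name, pattern):
            lines.append(f"{repo_dir}/{name} {name}")
    return "\n".join(lines)


all_files = [
    "strings_de.properties",
    "strings_fr.properties",
    "menu_de.properties",
    "menu_ja.properties",
]
print(file_externals(
    all_files,
    "*_de.properties",
    "^/ProjectName/NaturalDirectoryStructure/SomeComponent",
))
```

The list of file names could come from svn list on the shared directory, and the script would be re-run whenever new resource files are added.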
I haven't actually installed version 1.6 yet, but this upgrade would seem to be worth the effort.
What do you think?
Posted on Tue, Jul 14, 2009
Please welcome Jim Compton, today's Lionbridge blogger, who's sharing his thoughts about collective intelligence in a corporate setting...
For my inaugural contribution to this blog I thought that I would share some of the experiences we've had at Lionbridge using wiki for the purpose of collecting and capturing institutional knowledge and expertise.
First, a little internal background: Lionbridge (as I would expect is true for many companies) has historically found it challenging to:
- ensure that expertise and knowledge is documented (as opposed to locked-up in people's brains), and
- keep said documentation up to date and accurate
Traditional approaches to documentation - asking the experts to author documents or having folks author content to be hosted through a web server (i.e. static HTML pages) - have always suffered from inconvenience. Getting an expert to author a complete, holistic document on a subject was often a huge, time-consuming chore that took a back seat to actual project work (if it ever happened at all). Adding the need to coordinate with a webmaster created a situation in which the cost of participation was just too high for many - so they opted not to.
And getting people to contribute is only part of the battle. Once you have documentation in hand, you also have the issues of needing to pair the information with those who need the information, and of making sure that the information evolves as the facts themselves evolve.
When we started experimenting with wiki several years ago, we quickly noticed a profound effect on our state of documented institutional knowledge. With the wiki-characteristics of low-cost participation (people can contribute just a single fact if they know one) and instant web-based distribution, we started to see profound levels of contribution from people who hadn't previously contributed at all.
The number of collectively-authored articles on our system started to grow rapidly, and when we hit a sort of critical mass, we started to notice an interesting phenomenon: the collected articles started to create a form of intelligence that was greater than that of the individual articles themselves.
Folks who are familiar with the phenomenon of Wikipedia will know what I'm talking about. The unique behavior of wiki-linking enables an author to create a link to information that may not even exist yet. As more authors write on various subjects, eventually the information for missing subjects will get authored. Instead of pointing to nothing, those articles which contain wiki links become more valuable by acting as referral articles. By following these links, a user can often glean more information than if they were to have actually consulted the experts themselves.
I liken this phenomenon of articles organically linking themselves together to my layman's understanding about the way synapses work in a brain to create intelligence. Intelligence isn't just about collecting facts (or experiences); it is about growing meaningful connections between them.
Anyhow, I feel that we've just scratched the surface of what these technologies have to offer. We live in exciting times, don't you think?!
If you've had similar (or contradictory) experiences with wiki in your organization, I'd love to hear your story.