Very rarely do technology releases “break the internet” the way popular culture can (what color was that dress again?), but ChatGPT, released just 4 months ago, has thrust itself into the collective consciousness so quickly and deeply that it already feels like “generative” and “AI” are two words we’ve all been using in close proximity to one another since time immemorial!
The achievements of the team at OpenAI, both in the Large Language Model (LLM) based technology itself and in how effectively they were able to drive almost instantaneous global adoption, are truly remarkable by any yardstick. ChatGPT does an amazing job of producing extremely fluent, mostly correct, and highly useful answers to questions on essentially any topic, and its super simple, accessible user interface has precipitated the breakthrough of language model AI technology into the mainstream. Now seemingly everyone is trying to figure out how they can put this new set of “superpowers” to work to boost productivity and efficiency in all industries and all walks of life, and especially when it comes to website translation.
That said, language models are nothing particularly new in the world of Natural Language Processing. The fundamental innovation that led to ChatGPT is the "transformer" model architecture, a deep neural network built around self-attention mechanisms and first described with an encoder/decoder structure in the paper published in 2017 by a gaggle of Googlers (Vaswani et al.). The same period saw the move from statistical machine translation to neural machine translation (NMT), with Google Translate starting its move to NMT in 2016 and transformer-based models following soon after the paper. That shift delivered a step change in the quality of machine translation, and the transformer remains the foundational technology for state-of-the-art machine translation today.
ChatGPT can also translate, but it is not a bigger or better NMT, and LLMs are not the future of website translation. While both NMTs and LLMs are transformer-based models trained to generate a response to a prompt, the prompts and responses they are trained to handle are very different, as are the volume of training data used and the cost of training, in terms of both sheer computational resources and human resources, all of which are many orders of magnitude higher for LLMs than for NMTs. LLMs are multi-purpose tools which, like Swiss Army knives and duct tape, are unbelievably useful, but if you want to put a nail in the wall, you want a hammer. NMT is a hammer for translation: a task-specific tool built to do one thing really well.
Compared to LLMs, NMTs offer vastly better economics for producing the highest possible quality of translation across high-, medium-, and low-resource languages, especially when looking to incorporate specific brand voice requirements like adherence to style guides and glossaries.
Where LLM tech can be really useful, though, is in the many tasks around the edges of translation and localization. That includes preparing training data to drive up the quality of NMT output (i.e. the Swiss Army knife can be used to build a better hammer), as well as doing some other things that hammers can’t, such as transcreation: writing new content with the same objective as some existing content, but for a different cultural context and language.
Building Better Hammers
The process of building a high quality NMT engine has two main stages:
- Establish a base "Generic" model for the language pair, either by building and training one from scratch or by using a pre-trained model.
- Domain adaptation. Fine-tune the “Generic” model with more specific training data to make it perform better within a given domain (such as adopting the vernacular of an industry segment or the brand voice of a specific company).
Domain adaptation materially improves the quality of translations produced, especially when you start to measure quality not just in terms of grammatical fluency and correspondence of meaning, but in terms of vectors like the level of adherence to industry or company-specific style guides and glossaries.
While LLMs like ChatGPT can translate at a quality level comparable to Generic NMT models (e.g. Google Translate), at least for high resource languages, the more relevant comparison for most organizations is to a well-trained domain-adapted NMT model, which will produce much better translation quality, much faster and much cheaper. This is because both the model size (the number of parameters in the neural net) and the volume of training data that needs to be prepped and then used are so much larger for an LLM that the computational cost of both ongoing training and inference (doing a translation) is orders of magnitude higher than with NMTs. This is also why LLMs are much slower than NMTs.
Where LLMs are helpful is in data augmentation: generating synthetic training data to add to the real training data you have curated for the purpose of training NMTs. This is most useful in medium to low resource languages, for which it is hard to source enough aligned sentence pairs to train an effective NMT. The LLM may have enough knowledge of the target language to generate synthetic data that augments your real data, such that an NMT trained on the augmented data produces higher quality results than one trained on your real data alone.
Similarly, an LLM can be used to generate synthetic data for domain adaptation when you cannot curate enough real data for that training to be effective in improving quality. An LLM, well prompted with enough examples from your real data, can produce more data that is different but similar, which is useful.
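To make the augmentation idea a little more concrete, here is a minimal Python sketch. It is an illustration only, not a description of any particular production pipeline: the tab-separated prompt format, the function names, and the `generate` callable (a stand-in for whatever LLM API you use) are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_prompt(examples: List[Pair], n_new: int,
                 src_lang: str, tgt_lang: str) -> str:
    """Build a few-shot prompt that asks for new, similar sentence pairs."""
    lines = [
        f"Here are {src_lang} sentences with their {tgt_lang} translations.",
        f"Write {n_new} new pairs in the same domain and style.",
        "Put each pair on one line, source and target separated by a tab.",
        "",
    ]
    lines += [f"{src}\t{tgt}" for src, tgt in examples]
    return "\n".join(lines)


def parse_pairs(reply: str) -> List[Pair]:
    """Keep only well-formed 'source<TAB>target' lines from the LLM reply."""
    pairs = []
    for line in reply.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and all(p.strip() for p in parts):
            pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


def augment(real: List[Pair], generate: Callable[[str], str], n_new: int,
            src_lang: str = "English", tgt_lang: str = "Spanish") -> List[Pair]:
    """Return the real pairs plus synthetic pairs whose source text is new.

    `generate` is a hypothetical hook: it takes a prompt string and returns
    the LLM's text reply.
    """
    prompt = build_prompt(real, n_new, src_lang, tgt_lang)
    seen = {src.lower() for src, _ in real}
    merged = list(real)
    for src, tgt in parse_pairs(generate(prompt)):
        if src.lower() not in seen:  # skip duplicates of existing sources
            seen.add(src.lower())
            merged.append((src, tgt))
    return merged
```

In practice the synthetic pairs would also need quality filtering before being mixed into a training set; the deduplication here is only the simplest possible safeguard.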
At MotionPoint, our NMT team has been training brand-adapted NMT models for our customers for some time now and is observing very significant increases in the quality of output from our brand-adapted models in comparison with the generic models that we started from, as we apply a variety of techniques to how we source, enhance and clean training data sets.
- We have seen BLEU scores improve from the 30-40 range for Generic models (intelligible but not great) to the 70-80 range (essentially human quality) in high resource languages. While the scores for both generic and domain-adapted NMT are lower in lower resource languages, domain adaptation still approximately doubles the BLEU score.
- Linguists assessed that the domain-adapted models produced significantly improved consistency in formality of tone, improved readability, and a significantly decreased level of effort for human post-editing, and in many cases achieved human quality.
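For readers less familiar with BLEU, the sketch below shows roughly what the metric measures: the overlap of 1- to 4-word sequences between a machine translation and a reference translation, with a penalty for output that is too short, reported here on a 0-100 scale. This is a simplified illustration that assumes a single reference per sentence, whitespace tokenization, and no smoothing; it is not how the scores above were computed, and real evaluations normally use an established tool.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str],
                max_n: int = 4) -> float:
    """Simplified corpus-level BLEU on a 0-100 scale (one reference each)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram is credited at most as many
            # times as it appears in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any n-gram order with zero matches gives 0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

Because the score depends on exact word overlap with a reference, a single changed word in a short sentence can move it substantially, which is one reason BLEU is usually read alongside linguist assessment rather than on its own.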
The advances in NMT, supercharged by all the noise around GPT, are leading to rising expectations that translation should be shifting rapidly to machine translation, with linguist involvement being the exception rather than the rule, and that the cost of the whole affair should be plummeting.
But it’s not that simple.
Releasing the Hounds
Moving from Human Translation to Machine Translation with Human Post-Editing (MTPE) is a relatively easy first step on the path to leveraging machine translation to reduce costs. It generates the same high-quality output at a slightly lower cost due to the productivity acceleration that the NMT generated starting point provides the linguist. It's a low-risk way to get started.
Much more challenging is harnessing the potential of using Machine Translation (MT) as-is without the post-edit. This is where the much higher potential cost savings are seemingly within reach as well-trained domain-adapted NMTs are demonstrably capable of producing translations that are perfectly good enough to use as-is without human post-edit, in many cases and for many contexts.
But this potential remains largely untapped in large organizations because NMT translation quality is fundamentally variable. Even if it produces high quality translations eight times out of ten, the other two matter and require a smart approach to managing the cost-risk dynamic.
One way to attempt to get some benefit from an MT-only workflow while containing risk is to send the most quality-sensitive content (e.g. high traffic pages in websites, legal statements etc.) through an MTPE workflow and the less quality-sensitive content through an MT-only workflow, and then rely on a fast fail-forward approach to correcting egregious errors in the machine translated content post-publishing, as and when they get discovered.
There are several problems with this, though. The most obvious is that most large organizations, especially in verticals like finance and healthcare, simply cannot tolerate even short periods of bad translations on their site for commercial, legal or compliance reasons. And with non-website content, making corrections after the content has left the stable is much harder than with website content. Further, even if you can tolerate mistakes getting out into the wild, you still have two problems with the split workflow approach.
- For the quality-sensitive translations being put through MTPE workflow, some of the machine translations will be perfectly good enough to use as-is without post-editing and in those cases, you are paying for a human post-edit you don't need. This is fundamentally wasteful.
- For the rest going through the MT-only workflow, you carry some risk of unacceptably low-quality translations being published, which is potentially reputationally harmful even if relatively few people see them. In today's world, one person sees it, memes it on Twitter, and your brand is a laughingstock.
There has to be a better way to eliminate waste, manage the risk, and fully release the potential of machine translation to materially reduce costs.
A better approach is to put all translations through a single dynamic workflow designed to deliver a translation of at least the required quality at the lowest cost possible. This would consist of:
- Determining the minimum quality required based on the content itself and/or the context it is presented and accessed in.
- Doing the best possible machine translation.
- Doing a rapid and cheap quality assessment of the machine translation.
- If, and only if, the assessed quality is below the required quality, sending it for human post-edit.
This approach maximizes the cost advantage of harnessing good machine translation to minimize human post-editing costs, and fully mitigates the risk of bad machine translation, for all content types in all contexts.
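The routing logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any actual implementation: `machine_translate`, `estimate_quality`, and `post_edit` are hypothetical hooks standing in for an NMT engine, an automated quality estimator, and a human post-editing queue, and quality is assumed to be expressed on a 0-100 scale.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    text: str           # the translation that gets published
    mt_quality: float   # estimated quality of the raw machine translation
    post_edited: bool   # True if a human post-edit was required


def translate_dynamic(
    source: str,
    required_quality: float,
    machine_translate: Callable[[str], str],
    estimate_quality: Callable[[str, str], float],
    post_edit: Callable[[str, str], str],
) -> Result:
    """Route one piece of content through the dynamic workflow."""
    # Do the best possible machine translation.
    draft = machine_translate(source)
    # Do a rapid, cheap quality assessment of that translation.
    score = estimate_quality(source, draft)
    # Only pay for a human post-edit when the draft falls short.
    if score >= required_quality:
        return Result(draft, score, post_edited=False)
    return Result(post_edit(source, draft), score, post_edited=True)
```

The whole approach stands or falls on how trustworthy `estimate_quality` is: an estimator that over-scores bad drafts reintroduces the risk the workflow is meant to remove, while one that under-scores good drafts gives back the cost savings.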
The first problem to solve here is the codification of quality for the purposes of steps 1, 3 and 4. Several industry standards around this exist (MQM, DQF, etc.) but to some extent the scoring needs to be adaptable to the specific sensitivities of the organization or industry vertical and the market/language.
Determining what level of quality requirement to apply to each translation task can be relatively simple, as in the split workflow example given previously. That was a choice between two possible classifications, MTPE or MT only, based on whether the content is in a high traffic part of the site or not. But in the dynamic workflow world, any piece of content in a given context could be tagged with any minimum quality requirement on a sliding scale like 0-100, and far more nuanced approaches could then be taken that consider many other variables, such as user behavioral analytics or other signals that provide context for how business critical the quality of the translation really is. This could get highly tuned to optimize for translation cost vs business outcome, as opposed to just translation cost vs translation quality.
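As a rough illustration of what tagging content on a 0-100 scale might look like, here is a small Python sketch. Everything specific in it is an assumption made for the example: the three signals, the traffic thresholds, and the point values are invented and would need to be tuned per organization, industry, and market.

```python
from dataclasses import dataclass


@dataclass
class ContentSignals:
    monthly_views: int            # traffic to the page holding the content
    is_legal_or_compliance: bool  # e.g. legal statements, disclosures
    on_conversion_path: bool      # e.g. checkout or sign-up flows


def required_quality(signals: ContentSignals) -> int:
    """Map content signals to a minimum quality requirement (0-100).

    The thresholds and weights are illustrative placeholders only.
    """
    if signals.is_legal_or_compliance:
        return 100  # never publish this content below the top bar
    score = 50  # baseline for ordinary content
    if signals.monthly_views >= 10_000:
        score += 25
    elif signals.monthly_views >= 1_000:
        score += 10
    if signals.on_conversion_path:
        score += 15
    return min(score, 100)
```

A rule table like this is the simplest starting point; the more nuanced approaches mentioned above would replace the hand-set weights with ones learned from behavioral analytics and business outcomes.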
The quality evaluation and routing decisions could in principle be done by linguists, but doing this cost-effectively on a large volume of translations is another good use case for AI classification technology, applied both to evaluating quality and to assessing whether it is “good enough”.
There are clearly several technical challenges here and this is not a solved problem in the industry today. But it is a solvable problem, which is why at MotionPoint we have invested heavily in our R&D to develop Adaptive Translation™, including a superior NMT working in concert with AI capabilities for Translation Quality Evaluation and Dynamic Workflow Routing, with the clear goal of enabling our customers to fully realize the cost-saving potential of AI, stretch their localization budgets to supporting additional markets, and deliver more business value than ever before.
Learn more about the future of website translation in light of AI from our recent webinar. Download it for free here.
Last updated on April 14, 2023