LLM Case Study: Lessons Learnt from Translating 500 Million Data Points into 25 Languages

At Visable, my big LLM project and success of 2024 was translating 500 million data points into 25 languages. Here's what I learned implementing it.

What did I learn?

Be prepared for difficult feedback
Everyone uses LLMs, and everyone is convinced they are an expert at prompting. Especially when it comes to language and translations, internal users often saw just one example and said "That's not a good translation" or "I can get better translations with my own prompt". Plan time to show people why your case is more complex than meets the eye, and have examples ready that illustrate this.

This leads me to the second learning:

Show: Shit in, shit out
When I presented the solution to the full org, one bold piece of public feedback was: "Ah, this translation thing is really bad. This sounds extremely unnatural." They even linked to the sentence they had found and thought it was representative. It read, in English: "We. Manufacture. Solutions." I presented them with the original, in German: "Wir. Erstellen. Lösungen." and said: "Well, it's already very questionable in the original. How should the translation make it better than that?" I gave the audience three hints, but the responsibility was actually on me: 1. Look at a representative sample, not at single items. 2. Look at the input before judging the prompt. 3. Ask yourself: how much better could a human translator really do? (And that is not even considering price and consistency.)

Break tasks down
One of the first prompts contained: "Translate this string into the following 25 languages". This was a bad idea for multiple reasons: 1. output token limits were reached very quickly, and, more importantly, 2. the quality of the translations was consistently lower than what we got when we tested manually. Breaking it down into single tasks, e.g. "Translate this string into DE", improved the quality a lot, as sketched below.
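
Here is a minimal sketch of that split, assuming the OpenAI Python SDK; the model name, prompt wording, and language list are illustrative, not our exact production setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative subset; the real project covered 25 target languages.
TARGET_LANGUAGES = ["DE", "FR", "ES", "IT", "PL"]

def translate(text: str, target: str) -> str:
    """One LLM call per target language instead of one call for all 25."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any current chat model works here
        messages=[
            {"role": "system", "content": f"Translate this string into {target}."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

translations = {lang: translate("Wir. Erstellen. Lösungen.", lang)
                for lang in TARGET_LANGUAGES}
```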

This also led to better control over the quality via…

Use deterministic tooling where you can

Early prompts were run against GPT-3.5 (the best price/value in January 2024), and it often did not really recognize the input language (once you went beyond the EN/DE/ES/FR biggies) and produced gibberish output. One idea was to first ask GPT-3.5 which language it thought it had been given. Bad idea: slow, inconsistent, and generally poor results. LLMs are not good at detecting languages, and there are much better purpose-built tools for this job. lingua-py, for example, was much better and much, much faster. Eventually, we dropped the "detect language" step altogether, as GPT-4o-mini did not need an input language. But the lesson still holds: an LLM might not be the best tool for every part of your problem.
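
A minimal sketch of that deterministic detection step, assuming the lingua-py package (pip install lingua-language-detector); the language list is illustrative:

```python
from lingua import Language, LanguageDetectorBuilder

# Restricting the detector to the expected languages makes it faster and
# more accurate; in production you would list all 25 source languages.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.GERMAN, Language.SPANISH, Language.FRENCH
).build()

language = detector.detect_language_of("Wir. Erstellen. Lösungen.")
print(language)  # Language.GERMAN -- fast, consistent, and no LLM call
```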

This is especially true for testing the quality of your output. Yes, you can ask the LLM to evaluate the result ("Does this sentence look like a good translation of '<original>' into '<language>'?"), but the problems we faced most often were much simpler, and we did not have any guardrails at first. At some point, around 6% of translation tasks produced empty translations. It took the reviewers some time to come across them (because of random sampling), but these errors are classic candidates for easy deterministic tests on every task. Raise red flags for "Is it empty?" or "Is it shorter/longer than x chars?" or "Does it (not) contain y?". If any of these flags are raised, dismiss the task and rerun it; a sketch follows below.
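
A minimal sketch of such guardrails; the thresholds and forbidden substrings are illustrative assumptions, not our production values:

```python
def red_flags(source: str, translation: str) -> list[str]:
    """Cheap deterministic checks to run on every single translation task."""
    flags = []
    if not translation.strip():
        flags.append("empty translation")
    # Length sanity check: a translation should stay within a plausible
    # ratio of the source length (the 0.3/3.0 bounds are assumptions).
    if source and translation and not 0.3 <= len(translation) / len(source) <= 3.0:
        flags.append("suspicious length ratio")
    # The output should not echo prompt fragments or refusals back at us.
    for needle in ("Translate this string", "As an AI"):
        if needle in translation:
            flags.append(f"contains forbidden text: {needle!r}")
    return flags

# If any flag is raised, dismiss the result and re-queue the task.
assert red_flags("Wir. Erstellen. Lösungen.", "") == ["empty translation"]
```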

Questions and Answers

Why did we do it?
Translations were done by human translators, and that was too expensive, in both time and money, to cover all the content on two big e-commerce platforms. We already catered to 25 countries and knew that localized content was very important for B2B buyers. When we did our yearly calculation of how expensive it would be to translate everything at once, we finally got to a "Let's do it now!"

What was my role?
I was Head of Product and the acting product manager for this project. As I had spent one week of educational leave ("Bildungszeit") at the end of 2023 working only with LLMs, I was already an expert on prompting and had some first-hand private experience (back then, I extracted company information from websites). While not planned, I also wrote quite a bit of code myself, e.g. this translation POC (in JavaScript) and this Python tool to evaluate the translations.

Who did I work with?
I worked with a group of backend engineers at Alibaba and a small team of data scientists who were mainly on another LLM project but could give good directional feedback. I also had a small number of content and language experts in the company whom I could ask to review samples of what we produced.
