Discovering Data Treasures for Your LLM
It is fun to demo ChatGPT for the first time and witness the awe in people’s eyes. However, the following day at the office, a more somber realization sets in: ChatGPT does not fully grasp the intricacies of your business. Of course it would not; it has not been trained on your proprietary data, which hopefully sits safely behind firewalls.
Crafting Your Own LLM
Language models are like new employees. They need to get accustomed to the corporate jargon, and some of them will also need to polish up their multilingual skills. To achieve this, LLMs immerse themselves in vast quantities of relevant, high-quality data. Much like us, they also need to engage in lifelong learning. As technology advances and products evolve, language continually adapts as well. Therefore, LLM training must be established as an ongoing and consistent process.
It’s Not a Bug
Sceptics often raise concerns about generative AI’s tendency to make things up. At the same time, its creativity is praised. What some call “hallucination” is actually not a bug, it’s a feature! When the model’s temperature is lowered to ice cold, the generated texts become dead boring. However, with careful and precise instructions we can strike the right balance and achieve engaging and relevant results. This process has its own challenges, much like the coder joke where one programmer quips, “Omg, customers only have to precisely articulate their requirements and ChatGPT will simply code it,” to which the other responds, “Our jobs are safe!”
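To make the temperature trade-off concrete, here is a minimal sketch using the Hugging Face transformers library; the model, prompt, and temperature values are illustrative choices, not recommendations.

```python
# Minimal sketch: how sampling temperature shifts output from safe to creative.
# Model, prompt, and temperature values are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Our new product release includes"

for temperature in (0.2, 0.8, 1.5):
    result = generator(
        prompt,
        do_sample=True,           # sampling must be enabled for temperature to matter
        temperature=temperature,  # low = predictable, high = adventurous
        max_new_tokens=40,
    )
    print(f"T={temperature}: {result[0]['generated_text']}\n")
```

At low temperature the model keeps picking the most probable next tokens, which is exactly the “dead boring” effect described above; higher values flatten the probability distribution and let the creativity back in.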
Look No Further
So, we need terminology to teach the model the corporate lingo, lots of content to teach it style and substance, and structured knowledge to prompt it smartly and later verify that it has not strayed too far on the creative side. Ideally, this data is available in all the languages your customers speak. There’s a data haven within most enterprises where, with some luck, you find all these ingredients: your content localization department.
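As a taste of what prompting with structured knowledge can look like, here is a minimal sketch that injects an approved term base into a prompt and later checks the output against it; the term base, template, and function names are all hypothetical.

```python
# Hypothetical sketch: ground a prompt in approved terminology, then verify.
TERM_BASE = {
    "dashboard": "Control Center",   # corporate term replaces the generic one
    "widget": "SmartWidget",
}

def build_prompt(task: str, terms: dict) -> str:
    glossary = "\n".join(f"- say '{v}' instead of '{k}'" for k, v in terms.items())
    return f"Follow this glossary strictly:\n{glossary}\n\nTask: {task}"

def verify(text: str, terms: dict) -> list:
    # Return any approved terms the generated text failed to use.
    return [v for v in terms.values() if v not in text]

prompt = build_prompt("Write a release note about the new dashboard.", TERM_BASE)
print(prompt)
print(verify("The Control Center now ships with SmartWidget support.", TERM_BASE))  # -> []
```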
Capitalizing on Linguistic Assets
Modern localization departments maintain terminology databases in the commercially used languages. They store bilingual sentence pairs in translation memories and keep repositories of translated files. These databases are a treasure trove for LLM training. More advanced teams use multilingual knowledge systems like Coreon, which combine terminology with knowledge graphs. These graphs in turn guide LLMs in text generation and subsequent verification for correctness.
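To illustrate how a translation memory can feed an LLM, here is a minimal sketch that converts a TMX export into JSONL training pairs; the file name, language codes, and record format are assumptions for illustration.

```python
# Sketch: turn a translation memory (TMX export) into JSONL fine-tuning pairs.
# File name, language codes, and record format are illustrative assumptions.
import json
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # the xml:lang attribute

def tmx_to_jsonl(tmx_path: str, src_lang: str, tgt_lang: str, out_path: str) -> None:
    tree = ET.parse(tmx_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for tu in tree.iter("tu"):            # one <tu> per translation unit
            segments = {}
            for tuv in tu.iter("tuv"):
                lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
                seg = tuv.find("seg")
                if seg is not None and seg.text:
                    segments[lang] = seg.text.strip()
            if src_lang in segments and tgt_lang in segments:
                record = {
                    "prompt": f"Translate to {tgt_lang}: {segments[src_lang]}",
                    "completion": segments[tgt_lang],
                }
                out.write(json.dumps(record, ensure_ascii=False) + "\n")

tmx_to_jsonl("memory.tmx", "en", "de", "train.jsonl")
```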
Often the data will need some cleansing and enrichment. If you are really lucky, your forward-looking localization managers are already implementing a LangOps strategy. LangOps is an approach that parallels the principles of DevOps, but for language. In our scenario, LangOps plays a crucial role: it makes the training and deployment of LLMs streamlined, efficient, and secure.
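What that cleansing can look like in practice, as a minimal sketch: deduplicate bilingual sentence pairs and drop likely misalignments. The length-ratio thresholds here are illustrative assumptions, not recommendations.

```python
# Sketch: common cleansing filters for bilingual sentence pairs.
def clean_pairs(pairs):
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue                       # drop empty segments
        if (src, tgt) in seen:
            continue                       # drop exact duplicates
        ratio = len(src) / len(tgt)
        if not 0.3 <= ratio <= 3.0:
            continue                       # wildly different lengths: likely misaligned
        seen.add((src, tgt))
        yield src, tgt

pairs = [
    ("Hello world", "Hallo Welt"),
    ("Hello world", "Hallo Welt"),                                # duplicate
    ("OK", "Dieser sehr lange Satz passt nicht zum Quelltext."),  # misaligned
]
print(list(clean_pairs(pairs)))   # only the first pair survives
```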
Embrace the Future, Today
LangOps, still a burgeoning concept, is currently being explored by many businesses as they figure out how to make use of LLMs. We all know, or at least intuitively sense, that generative AI will transform all aspects of our operations, and soon! If you want to learn more, contact us. We would love to organize a workshop with your localization department, evaluate the available data, and develop or enhance a LangOps strategy. Additionally, we can establish a robust data pipeline to drive your AI initiatives forward. Let’s pioneer this exciting new frontier together!