Anfangen
2 min read · Jul 10, 2021


Let us start (Lass uns anfangen)

Table of contents

  • What is a truecaser?
  • Method
  • Conclusion

What is a truecaser?

Truecasing is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages). (https://en.wikipedia.org/wiki/Truecasing)
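
For example (an illustrative input/output pair, not the output of any particular model), a truecaser should restore the German noun capitals in an all-lowercase sentence:

# illustrative example of the task, not produced by a specific model
lowercased = "ich habe eine frage zur versicherung"
truecased = "ich habe eine Frage zur Versicherung"  # nouns like 'Frage' and 'Versicherung' regain their capitals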

Method

Hugging Face truecaser model from https://huggingface.co/flozi00/byt5-german-grammar

First of all, install the dependencies:

%%capture
!pip install sacremoses
!pip install phonetics  # also needed for the data-manipulation snippet below

Unfortunately, this model does not work as a truecaser. The data-manipulation snippet below shows what it is actually trained for: sentences are lowercased, punctuation is stripped, and random words are replaced with random characters or their metaphone keys, i.e. general text repair rather than just restoring capitalization.

import re
import phonetics  # uses the metaphone algorithm to create the phonetic key of a string
import random

chars_to_ignore_regex = r"[^A-Za-z0-9öäüÖÄÜß\-,;.:?! ]+"
broken_chars_to_ignore_regex = r"[^A-Za-z0-9öäüÖÄÜß\- ]+"

def do_manipulation(string):
    # keep only the allowed characters
    text = re.sub(chars_to_ignore_regex, '', string)
    # lowercase and strip punctuation to create the "broken" version
    broken_text = re.sub(broken_chars_to_ignore_regex, "", text.lower())

    if random.randint(0, 100) >= 50:
        for _ in range(int(len(broken_text.split(" ")) / 4)):
            if random.randint(0, 100) > 30:
                randc = random.choice(broken_text.split(" "))
                if random.randint(0, 10) > 4:
                    # replace the chosen word with random characters of the same length
                    broken_text = broken_text.replace(randc, ''.join(random.choice('abcdefghijklmnopqrstuvxyz') for _ in range(len(randc))).lower())
                else:
                    # replace the chosen word with its metaphone key
                    broken_text = broken_text.replace(randc, phonetics.metaphone(randc).lower())

    return text, broken_text

string = 'ich liebe das leben'
do_manipulation(string)
('ich liebe das leben', 'ich liebe das leben')
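
For reference, this is roughly how the model itself would be queried; a minimal sketch using the standard transformers text2text pipeline (the exact prompt format and generation settings are assumptions, see the model card):

# minimal sketch, assuming the standard transformers text2text-generation API
from transformers import pipeline

fixer = pipeline("text2text-generation", model="flozi00/byt5-german-grammar")

# the model tries to repair the whole sentence (spelling, grammar and casing),
# which is more than plain truecasing; it may also expect a task-specific prefix
print(fixer("ich habe eine frage", max_length=64)[0]["generated_text"])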

MosesTruecaser (https://github.com/alvations/sacremoses)

  • Simple to train and fast
  • Import German text
  • Wiki_Versicherungswesen.xml (specific insurance-domain contexts)
  • German Wikipedia dump (Wikidump de)

Import wiki_versicherungswesen.xml

Using BeautifulSoup, the text can be extracted from the messy wiki XML file. Afterwards, it can be cleaned with simple regular expressions that keep only German letters and remove the parentheses within the contexts.

from bs4 import BeautifulSoup as bs

with open('wiki_versicherungswesen.xml', 'r', encoding="utf8") as file:
    bs_content = bs(file)

# collect the raw text of every <text> element
text_list = []
for i in bs_content.find_all(['text']):
    text_list.append(i.get_text(' ', strip=True))

from collections import Counter
import pandas as pd
import re

# cleaning processes
contexts = "".join(text_list)
contexts = contexts.replace('|', ' ')                    # drop wiki markup separators
contexts = re.sub(r"[^A-Za-zäöüÄÖÜß ]+", ' ', contexts)  # keep only German letters
contexts = re.sub(r'\b\w{1,2}\b', '', contexts)          # drop one- and two-letter tokens
# have a look
contexts[:100]
' Lückenhaft Darstellung bezieht sich nur auf Erstversicherer nicht auf Rückversicherer qualifizie'

# save the file as txt
with open('wiki_versicherung.txt', 'w') as f:
    f.write(contexts)

Train MosesTruecaser

%%capture
from sacremoses import MosesTruecaser, MosesTokenizer

# Train a new truecaser from a 'wiki_versicherung.txt' file.
mtr = MosesTruecaser()
mtok = MosesTokenizer(lang='de')

# Tokenize the corpus and save the truecase model to 'wiki_versicherung.truecasemodel' using `save_to`
tokenized_docs = [mtok.tokenize(line) for line in open('wiki_versicherung.txt')]
mtr.train(tokenized_docs, save_to='wiki_versicherung.truecasemodel')
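
Once trained, the saved model can be reloaded later instead of retraining; a minimal sketch following the sacremoses README:

from sacremoses import MosesTruecaser

# load the previously saved truecase model from disk
mtr = MosesTruecaser('wiki_versicherung.truecasemodel')

# return_str=True joins the truecased tokens back into a single string
mtr.truecase("ich habe eine frage", return_str=True)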

Results

The truecaser works, even if the results are not perfect, and there is room for further improvement. For instance, "Ich" is the first word of the test sentence and should therefore be capitalized (not a big deal, and it can be fixed easily; see the sketch after the output below). Nevertheless, the overall result still looks very promising.

text_list = mtr.truecase("Ich habe eine frage")
" ".join(text_list)
'ich habe eine Frage'
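
As mentioned above, capitalizing the sentence-initial word is a simple post-processing step; a hypothetical helper (not part of sacremoses):

def capitalize_first(text):
    # uppercase only the first character and leave the rest untouched
    return text[:1].upper() + text[1:]

capitalize_first(" ".join(mtr.truecase("Ich habe eine frage")))
# 'Ich habe eine Frage' (based on the truecaser output shown above)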

