Alignment methods

Cecilia Hemming

Department of Languages, University College of Skövde

Swedish National Graduate School of Language Technology

 

 

 

 

1.  Introduction

 

Parallel inscriptions of the same text in two or more languages, are known since the earliest writing systems. This has of course been a very important information source when trying to decode ancient and since long disappeared languages and writing systems. Nowadays, with the increasing importance of mulltilinguality, due to business globalization and worldwide information exchange, corpora of parallel texts become a useful and important resource. The problem of finding relations between two texts that are mutual translations is called text alignment. Text alignment is important when making use of multilingual text corpora in several tasks:

 

machine translation and computer assisted translation

word sense disambiguation

terminology

lexicography

multilingual information retrieval

language learning.

 

 

In chapter 2, two principal approaches to alignments methods are introduced by a presentation of an early study of each. In the next chapters it is shown how researchers have combined and developed these methods further, and finally some important approaches are presented.

 

 

2. Early principal approaches to sentence alignment

 

The first sentence alignment methods were developed in the eighties. One can roughly divide them into two different types of studies: alignments based on  sentence length on the one hand and lexical characteristics (cognates or anchor words) on the other. In sentence alignment, one tries to find a group of sentences in one language that correspond in content to a group of sentences in another language. This grouping is often referred to as a bead of sentences (Brown et al. 1991).

 

2.1 Sentence length

This group of studies is based on statistical modelling of translations that only consider the length of sentences and paragraphs. These length-based methods rely on the assumption that corresponding text segments tend to have a relationship regarding either the number of words (Brown et al. 1991) or the number of characters. Gale and Church explained their choice of length measure by: 

 

Apparently, the performance is better with characters because there is less variability in the differences of sentence lengths so measured.

(Gale and Church, 1993:16)

 

They found a very strong correlation between the number of characters in a paragraph and its correspondent translation, and used it as a strong clue for sentence alignment.

 

Figure1.       The horizontal axis shows the length of English paragraphs and the vertical axis shows the corresponding German translation. The correlation is quite high: 0.991.

(Gale and Church, 1993:6)

 

Thus, short sentences in the source language (SL) tend to be translated as short sentences in the target language (TL) and long sentences as long sentences. It is also shown that the character length ratio is approximately the same in different kinds of texts (Véronis, 2000). Best results are achieved with alignments ­­of related languages with a high proportion of literal translation. In fact most alignment methods, even modern ones, rely on some further common assumptions, of which the most important are that:

 

- the order of the phrases in the parallel texts, are identical or very similar

- the alignments are mostly 1:1-relations (one SL-sentence -> one TL-sentence)

- the few n:m-relations are limited to small values of m and n (typically 2)

- the translation has few omissions and insertions

                                                                                                             (Véronis, 2000:11)

 

The goal of the methods is to find alignment A with the highest probability given two parallel texts S and T: arg maxA P(A|S, T)=argmaxA P(A, S, T). To estimate the probabilities, the aligned text is transformed to a sequence of aligned beads. Each bead is assumed to be independent of all others, depending only on the sentences within it: P(A, S, T) » Pk=1K P(Bk).

 

To allow the information to be combined in a consistent way, the distance measure in Gale and Church’s algorithm is based on a probabilistic model. Each proposed pair of sentences is assigned a probabilistic score based on the ratio of the number of characters in the two sentences, and the variance of this ratio. The measure is a matching estimate of -log Prob(match|d), where d is given by comparing the difference of the sentence-lengths to the mean, μ, and the variance, v2, of the whole corpus text: d = (t - s μ) / Ö s v2. The prior probability of an alignment type match, Prob(match), is determined from the maximum likelihood score calculated from a hand marked subcorpus. The cost for each alignment is determined from the formula: cost(s, t ) = - log P (a align | d(s, t,μ, v2) where a align is the corresponding alignment type. Gale and Church use the following alignment types:

 

1:1 (substitution  Si->Ti)    1:0 (deletion Si->Ř)

0:1 (insertion  Ř->Ti)        2:1 (contraction Si+Sj->Ti)

1:2 (expansion Si->Ti+Tj)   2:2 (merging Si+Sj->Ti+Tj)       i = 1-I, j = 1-J

 

A dynamic programming algorithm compares the final scores of all possible alignments and chooses the most likely among them (= the alignment with the minimum cost). Brown et al., see above, use instead hidden Markov models to formulate the problem and their approach was another: they did not want to align entire articles, but just a subcorpus, big enough for further research.

 

Gale and Church were rather surprised to find that their simple approach led to such good results. They applied their method on a trilingual corpus of 15 economic reports issued by the Union Bank of Switzerland (UBS) in English, French and German, and 96% of the 725 sentences were correctly aligned. They found though, that there were more errors in the English-French UBS-subcorpus than in the English-German one. When applying the method to the Canadian Hansard Corpus they produced an even higher success rate, this because the corpus contains many 1:1 alignments and literal translations.

 

2.2 Lexical characteristics

1988 Kay and Röscheisen proposed a method based on the hypothesis that if two phrases in different languages correspond to each other, then the words within these phrases also will correspond to each other (Véronis, 2000). Their system starts with a couple of initial alignment candidate phrases. An easy choice is to use the first and the last phrase in the texts.  They have good chances to correspond to each other. Then an iterative process is run:

 

  1. Form an envelope of alignment candidates from the cross product of the sentence lists in the SL and the TL.

 

  1. Choose word pairs that tend to co-occur in these potential partial alignments.

 

  1. Find those pairs of SL- and TL-sentences that contain many possible lexical correspondences. The most reliable of these pairs are used to induce a set of partial alignments that will be part of the final result. Repeat the steps above.

(Manning and Schütze, 1999:479)

 

So, partial word alignments are used to induce a maximum likelihood sentence alignment, which is used again to refine the word level alignment, and so forth in an iterative process. Typically, around 5 iterations are needed to obtain fair resluts.

 

Though the method can be computationally intensive (depending on the chosen envelope size) it is very robust, non possible alignments will never match. It also shows good results: in Kay and Röscheisen’s initial study the method was used on Scientific American articles and achieved 96% coverage, the remaining segments were of the alignment type 1:0 or 0:1. When used on 1000 sentences from the Hansard Corpus, the algorithm produced 7 errors after 5 iterations. 5 of these errors were due to the fact that the program had missed some sentence boundaries. (Manning & Schütze, 1999:478-480).

 

 

3. Method developments and combinations

 

Debili and Sammouda (1992), extended Kay and Röscheisen’s method by using a dictionary to guide the initial word correspondence hypothesis. This is an efficient way to reduce the algorithm’s search space, the envelope.

 

Stanley Chen applies the basic ideas in Gale and Church’s model above but uses instead a simple translation model to estimate the cost of each alignment (for simplicity and efficiency reasons, according to Chen). The model ignores issues of word order and the possibility that a SL word can correspond to more than one TL word. The word beads used are restricted to the alignment types 1:0, 0:1 and 1:1. To find the best word beading, the algorithm makes a greedy search instead of deriving it from the sum of possible word beadings in the sentences. Nevertheless, the best total alignment is found by using dynamic programming. (Chen, 1993).

 

Several other researchers have presented methods based on Gale and Church’s program presented above. Simard, Foster and Isabelle 1992; Church 1993; Johansson, Ebeling and Hofland 1993 and McEnery and Oakes 1995, have all included anchor methods to find cognates: identical or graphically similar occurrences in the SL and TL texts. They use date expressions, various symbols or graphically similar words like the English-French pair language-langue. Formally, cognates are words with etymological ties, often reflected both in meaning and orthography. Simard and his colleagues considered as cognates, words that had at least four initial letters in common. This excluded though pairs like government-gouvernement. (Véronis, 2000).  McEnery and Oakes proposed instead to calculate the similarity of the two words using the Dice’s coefficient, which corresponds to: the number of characters in common in the SL and TL words multiplied with 2 and divided by the total sum of the two words characters:

 

2 x nbr of characters in s ∩ t / nbr of characters in s + t

 

A draw back of most of the length-based methods was that they depended on their ability to identify paragraph and sentence boundaries. This creates problems when dealing with text divisions as headers, footnotes, tables, etc. or noisy output of optical character recognition.

 

3.1 Char-Align

Proper segmentation is not a trivial problem and incorrect segmentation easily lead to wrong alignments. Kenneth Church therefore developed the Char-align program, inspired by the work of Simard et al. and working not with sentences or words but at the character level (Simard and Plamondon, 1996). It consists of an approximate mapping between offsets (regions) in the two texts. The SL and TL are concatenated and a two-dimensional xy-array is made. For each SL 4-gram at position x having an identical TL correspondence at position y, a dot is placed in the array, see figure 2.

 

 

Figure 2.    A SAMPLE DOT PLOT XY-ARRAY.

 

For each 4-gram correspondence between position x and position y, the coordinate (x,y) is marked with a dot. The darker shades at the upper left and lower right corner show that the SL-text has more random correspondences with itself than with the TL-text. The two diagonals in the dark shades are very black because there is a perfect match between each text with itself. The light regions represent the matches between the two languages and the diagonals here show the correspondences between the SL and the TL.

 

A heuristic search finds the best path along the diagonal and thus provides an alignment in terms of offsets in the texts.

                                                                 (Church, 1993:5)

 

In addition to this, another set of time saving heuristics is used: as some matches are more interesting than others, the dots are weighted. Content words are much more important and thus higher ranked than function words. This is especially important for large input files. Church also obtained good results using this cognate method program with non-Roman writing systems such as an English-Japanese corpus, which in the first view can be quite confusing. Cognates presuppose related languages even at the character level. But, the success was due to the fact that the texts treated regarded a technical manual with many identical technical terms in both languages, e.g. BEGIN, END, getline, print, printf.

 

Applying Gale and Church’s original sentence-length based method on Chinese, Dekai Wu found that the statistical correlation of sentence lengths varied much more in English-Chinese corpora than in corpora with other Indo-European languages. He argues that the length-based methods not are that language independent as previously stated: “if translated sentences share cognates, then the character lengths of those cognates are of course correlated”. Wu proposed to extend the algorithm using information about domain specific lexical indexes (Wu, 1993).

 

Haruno and Yamazaki (1996) argue that for structurally very different languages, function words impede alignment. That is why they eliminate all function words in the corpus using a POS Tagger. They also suggest the use of matching word pairs from a dictionary to avoid sparse data problems in aligning short texts with the Kay & Röscheisen-method. (Manning and Schütze, 1999:483)                                       

 

3.2 The Kvec and DKvec methods

Pascal Fung (Fung and Church, 1994) present an alternative alignment strategy called Kvec. The bitext is split into K segments and binary vectors are created for the SL and TL words. It starts by estimating lexicon candidates. If a word in question (with its translation) is found in a segment, a flag is set in the vector. Then the similarities of the words are calculated with the mutual information method:

 

I(s,t) = prob(s ,t ) / prob(s) prob(t)

 

prob(s ,t ) is the probability that the SL word s and the TL word t occur in corresponding text segments. prob(s) prob(t) are the respective probabilities that s and t occur in the whole text. These values are then used to rank the translation candidates, from which the most likely pair will be chosen. The program discovers, for example, that the English word fisheries is similar to the French word Pęches, by noting that the distribution of fisheries in the English text is similar to the distribution of pęches in the French text. Kvec does not depend on sentence boundaries. (Fung and Church, 1994:1) 

 

All the chosen pairs are then used as reference points when aligning the parallel texts. The Kvec method was later developed into the DKvec method, to handle parallel text documents with missing or incomplete segments, so called noisy parallel corpora. In the algorithm the distances between the occurrences of the SL and the TL words are stored in recency vectors. As the word frequencies differ, the size of the recency vectors also differs. First these recency vectors are computed for each s and t. To locate the translation candidates, all candidate pairs that occur for the first time in the second half of the text, are removed. Then, all pairs with a vector size less than half of the other vector lengths also are removed. The absolute difference of the remaining vectors is computed by Dynamic Time Warping, and so closely correlated pairs can be identified. As with the Kvec method, these pairs are then used as reference points to align the parallel text. (Ahrenberg et al. 1998).

 

3.3 The Smooth Injective Map Recognizer - SIMR

Dan Melamed has developed a method for mapping bilingual text correspondences, which like Char-align and Simard’s Salign (Simard and Pamondon, 1996) uses cognates to align bilingual corpora at the character level (Melamed, 1997). The bitext corpus used is seen as a two-dimensional space with the SL and the TL character positions as the x and y axes. In this space the algorithm searches for “true points of correspondence” (TPC’s), identified by machine-readable dictionaries  or cognate based matching algorithms. The most linear and constant path between the original and the terminal bitext space is chosen by filtering path candidates in several steps. The obtained alignment path is then used to align the sentences in the texts (Ahrenberg et al. 1998).

 

 

4. Summary

 

Text alignment is the first and important step when using multilingual text corpora in several tasks: e.g. machine translation, computer assisted translation (e.g. Brown et al., 1990), terminology (Tiedemann, 2000) or (Dagan and Church, 1994), lexicography (Melamed, 1996), word sense disambiguation (Ide and Véronis, 1998) or multilingual information retrieval (Nie et al. 1998), just to mention some specific work.

 

Many research teams are working with different solutions.  Still, sentence alignment is generally processed in the two following steps: first the score for all pairs of candidate sentences are calculated, and then the minimal-cost alignments are selected. One can distinguish  three main methods to calculate these scores. The first uses sentence lengths calculated by the number of words or characters, which is due to the assumption that corresponding sentences in SL and TL are of similar length. Although it has been suggested that length-based methods are language independent, deeper studies have shown that the correlation more is due to historical relationships of the languages. In Chinese for instance, as opposed to Indo-European languages, the most words are formed by only one or two characters.

 

The second approach in alignment methodology relies on lexical characteristics to perform an initial partial word alignment, used to obtain the final sentence alignment. These methods are very robust, non possible alignments will never match, they can be computationaly very intensive though. Finally, there are mixed methods using proportions of lexical and formal characteristics trying to both gain the advantages and leave out the drawbacks of the two approaches. Most modern systems can be said to join this group.

 

 

References

 

 

Ahrenberg, L., Merkel, M., Ridings, D., Sĺgvall Hein, A., Tiedemann, J. . 1998. Automatic Processing of Parallel Corpora: A Swedish Perspective. Department of Computer and Information Science, Linköpings University, Department of Swedish Language, Göteborg University, Department of Liguistics, Uppsala University.

 

Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J.D., Mercer, R. L. and Roossin, P. S. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2): 79-85, June.

 

Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R.L. 1991. Word -sense disambiguation using statistical methods. In ACL 29, pp. 264-270.

 

Chen, S. F. 1993. Aligning sentences in bilingual corpora using lexical information. In Association for Computational Linguistics, ACL 31, pp. 9-16.        

 

Church, K. W. 1993. Char-align: A Program for Aligning Parallel Texts at the Character Level. In Proceedings of the Workshop on Very Large Corpora: Acedemic and Industrial Perspectives. Association for Computational Linguistics, ACL.

 

Dagan, I. and Church K. W. 1994. TERMIGHT: Identifying and Translating Technical Terminology. In proceedings of the 4th Conference on Applied Natural Language Processing (ANLP).

 

Debili, F. and Sammouda, E. 1992. Appariement des phrases de textes bilingues français-anglais et français-arabe. In Proceedings of COLING-92, Nantes.

 

Fung, P. & Church, K. W. 1994. Kvec: A new approach for aligning parallel texts. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), 1096--1102, Kyoto.

 

Gale, W. A., Church K. W. 1993. A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(3): 75-102.

Ide, N., Véronis, J. 1998. Word Sense Disambiguation:The State of the Art. Laboratoire Parole et Langage,  Université de Provence, Aix-en-Provence

 

Kay, M. and Röscheisen, M. 1988. Text-Translation Alignment, Technical report. Xerox Palo Alto Research Center (Martin Kay and Martin Röscheisen (1988). Text-Translation Alignment. Technical Report, Xerox Palo Alto Research Center.

Published in Computational Linguistics, MIT Press, March 1993.)

 

Klavans, J. and Tzoukermann, E. 1995. Combining Corpus and Machine-readable Dictionary Data for Building Bilangual Lexicons. Machine Translation, 10(3).

 

Maklovitch, E. and Hannan, M.-L. 1996. Line ‘Em Up: Advances In Alignment Technology And Their Impact On Translation Support Tools. In Proccedings of the Second Conference of the Association for Machine Translation in the Americas (AMTA-96). Montréal, Québec.

 

Manning, C., Schütze, H. 1999. Foundations of statistical Natural Language Processing. Cambridge, MA: MIT Press. (http://nlp.stanford.edu/links/statnlp.html Last modified: Sun Dec 30 10:32:52 PST 2001)

 

Melamed, I. D. 1997. A Portable Algorithm for Mapping Bitext Correspondence. 35th Conference of the Association for Computational Linguistics (ACL'97), Madrid, Spain.

 

Nie, J., Isabelle, P., Plamondon, P., Foster, G. 1998. Using a Probabilistic Translation Model for Cross-language Information Retrieval. In proceedings of the 6th ACL Workshop on Very Large Corpora (WVLC). Montréal, Canada.

 

Simard, M. 1999. Text-Translation Alignment: Three Languages Are Better Than Two. RALI, http://www.iro.umontreal.ca/~simardm/emnlp99/final.html

 

Simard, M., Plamondon, P. 1996. Bilingual Sentence Alignment: Balancing Robustness and Accuracy. RALI, Université de Montréal.

 

Tiedemann, J. 2000 Extracting Phrasal Terms using Bitext. In Proceedings of the Workshop on Terminology Resources and Computation, held in conjunction with LREC 2000, Athens/ Greece, May 2000.

 

Véronis, J. 2000. Chapitre 6 : Alignement de corpus multilingue. In Ingénierie des langues, Traité IC2-Série Informatique et SI par Jean-Marie Pierrel. Čditions Hermes Science, Paris.

 

Wu, D. 1993. Aligning Parallell English-Chinese Texts Statistically with Lexical Criteria. Technical report, HKUST-CS93-9. The Hong Kong University of Science & Technology.