Cecilia Hemming
Department of Languages,
University College of Skövde
Swedish National Graduate
School of Language Technology
Parallel inscriptions of the same text in two or more languages, are
known since the earliest writing systems. This has of course been a very
important information source when trying to decode ancient and since long
disappeared languages and writing systems. Nowadays, with the increasing
importance of mulltilinguality, due to business globalization and worldwide
information exchange, corpora of parallel texts become a useful and important
resource. The problem of finding relations between two texts that are mutual
translations is called text alignment. Text alignment is important when making
use of multilingual text corpora in several tasks:
machine translation and
computer assisted translation
word sense disambiguation
terminology
lexicography
multilingual information
retrieval
language learning.
In chapter 2, two
principal approaches to alignments methods are introduced by a presentation of
an early study of each. In the next chapters it is shown how researchers have
combined and developed these methods further, and finally some important
approaches are presented.
The first sentence alignment methods were developed in the eighties. One
can roughly divide them into two different types of studies: alignments based
on sentence length on the one
hand and lexical characteristics (cognates or anchor words) on
the other. In sentence alignment, one tries to find a group of sentences in one
language that correspond in content to a group of sentences in another
language. This grouping is often referred to as a bead of sentences
(Brown et al. 1991).
This group of
studies is based on statistical modelling of translations that only consider
the length of sentences and paragraphs. These length-based methods rely on the
assumption that corresponding text segments tend to have a relationship regarding
either the number of words (Brown et al. 1991) or the number of characters. Gale
and Church explained their choice of length measure by:
Apparently,
the performance is better with characters because there is less variability in
the differences of sentence lengths so measured.
(Gale and Church, 1993:16)
They found a very
strong correlation between the number of characters in a paragraph and its
correspondent translation, and used it as a strong clue for sentence alignment.

Figure1. The horizontal axis shows the
length of English paragraphs and the vertical axis shows the corresponding German translation.
The correlation is quite high: 0.991.
(Gale and Church,
1993:6)
Thus, short sentences in the source language (SL) tend to be translated as
short sentences in the target language (TL) and long sentences as long
sentences. It is also shown that the character length ratio is approximately
the same in different kinds of texts (Véronis, 2000). Best results are achieved
with alignments of related languages with a high proportion of literal
translation. In fact most alignment methods, even modern ones, rely on some further
common assumptions, of which the most important are that:
- the order of the
phrases in the parallel texts, are identical or very similar
- the alignments
are mostly 1:1-relations (one SL-sentence -> one TL-sentence)
- the few
n:m-relations are limited to small values of m and n (typically 2)
- the translation
has few omissions and insertions
(Véronis, 2000:11)
The goal of the methods is to find alignment A with the highest probability
given two parallel texts S and T: arg maxA P(A|S, T)=argmaxA P(A, S, T).
To estimate the probabilities, the aligned text is transformed to a
sequence of aligned beads. Each bead is assumed to be independent of all others,
depending only on the sentences within it: P(A, S, T) » Pk=1K P(Bk).
To allow the
information to be combined in a consistent way, the distance measure in Gale
and Church’s algorithm is based on a probabilistic model. Each proposed pair
of sentences is assigned a probabilistic score based on the ratio of the
number of characters in the two sentences, and the variance of this ratio. The measure is a matching
estimate of -log Prob(match|d), where d is given by comparing
the difference of the sentence-lengths to the mean, μ, and
the variance, v2,
of the whole corpus text: d =
(t - s
μ) / Ö
s v2. The prior probability of an alignment type match, Prob(match), is
determined from the maximum likelihood score calculated from a hand marked subcorpus.
The cost for each alignment is determined from the formula: cost(s, t ) = - log P (a align | d(s, t,μ, v2) where a align is the
corresponding alignment type. Gale and Church use the following alignment
types:
1:1 (substitution Si->Ti) 1:0 (deletion
Si->Ř)
0:1 (insertion Ř->Ti) 2:1 (contraction Si+Sj->Ti)
1:2 (expansion Si->Ti+Tj) 2:2 (merging Si+Sj->Ti+Tj) i = 1-I, j = 1-J
A dynamic
programming algorithm compares the final scores of all possible alignments and
chooses the most likely among them (= the alignment with the minimum cost). Brown
et al., see above, use instead hidden Markov models to formulate the problem and
their approach was another: they did not want to align entire articles, but
just a subcorpus, big enough for further research.
Gale and Church
were rather surprised to find that their simple approach led to such good
results. They applied their method on a trilingual corpus of 15 economic
reports issued by the Union Bank of Switzerland (UBS) in English, French and
German, and 96% of the 725 sentences were correctly aligned. They found though,
that there were more errors in the English-French UBS-subcorpus than in the
English-German one. When applying the method to the Canadian Hansard Corpus
they produced an even higher success rate, this because the corpus contains
many 1:1 alignments and literal translations.
1988 Kay and
Röscheisen proposed a method based on the hypothesis that if two phrases in
different languages correspond to each other, then the words within these
phrases also will correspond to each other (Véronis, 2000). Their system starts
with a couple of initial alignment candidate phrases. An easy choice is to use
the first and the last phrase in the texts.
They have good chances to correspond to each other. Then an iterative
process is run:
(Manning and Schütze, 1999:479)
So, partial word
alignments are used to induce a maximum likelihood sentence alignment, which is
used again to refine the word level alignment, and so forth in an iterative
process. Typically, around 5 iterations are needed to obtain fair resluts.
Though the method
can be computationally intensive (depending on the chosen envelope size) it is
very robust, non possible alignments will never match. It also shows good
results: in Kay and Röscheisen’s initial study the method was used on
Scientific American articles and achieved 96% coverage, the remaining segments
were of the alignment type 1:0 or 0:1. When used on 1000 sentences from the
Hansard Corpus, the algorithm produced 7 errors after 5 iterations. 5 of these
errors were due to the fact that the program had missed some sentence
boundaries. (Manning & Schütze, 1999:478-480).
Debili and Sammouda
(1992), extended Kay and Röscheisen’s method by using a dictionary to guide the
initial word correspondence hypothesis. This is an efficient way to reduce the
algorithm’s search space, the envelope.
Stanley Chen applies
the basic ideas in Gale and Church’s model above but uses instead a simple
translation model to estimate the cost of each alignment (for simplicity and
efficiency reasons, according to Chen). The model ignores issues of word order
and the possibility that a SL word can correspond to more than one TL word. The
word beads used are restricted to the alignment types 1:0, 0:1 and 1:1. To find
the best word beading, the algorithm makes a greedy search instead of deriving
it from the sum of possible word beadings in the sentences. Nevertheless, the
best total alignment is found by using dynamic programming. (Chen, 1993).
Several other researchers
have presented methods based on Gale and Church’s program presented above. Simard,
Foster and Isabelle 1992; Church 1993; Johansson, Ebeling and Hofland 1993 and
McEnery and Oakes 1995, have all included anchor methods to find cognates:
identical or graphically similar occurrences in the SL and TL texts. They use
date expressions, various symbols or graphically similar words like the English-French
pair language-langue. Formally, cognates are words with etymological
ties, often reflected both in meaning and orthography. Simard and his
colleagues considered as cognates, words that had at least four initial letters
in common. This excluded though pairs like government-gouvernement. (Véronis,
2000). McEnery and Oakes proposed
instead to calculate the similarity of the two words using the Dice’s
coefficient, which corresponds to: the number of characters in common in the SL
and TL words multiplied with 2 and divided by the total sum of the two words
characters:
2 x nbr of characters in s ∩ t /
nbr of characters in s + t
A draw back of most
of the length-based methods was that they depended on their ability to identify
paragraph and sentence boundaries. This creates problems when dealing with text
divisions as headers, footnotes, tables, etc. or noisy output of optical
character recognition.
Proper segmentation
is not a trivial problem and incorrect segmentation easily lead to wrong
alignments. Kenneth Church therefore developed the Char-align program, inspired
by the work of Simard et al. and working not with sentences or words but at the
character level (Simard and Plamondon, 1996). It consists of an approximate
mapping between offsets (regions) in the two texts. The SL and TL are
concatenated and a two-dimensional xy-array is made. For each SL 4-gram at
position x having an identical TL correspondence at position y, a dot is placed
in the array, see figure 2.
Figure 2.
A SAMPLE DOT PLOT XY-ARRAY.
For each 4-gram
correspondence between position x and position y, the coordinate (x,y) is
marked with a dot. The darker shades at the upper left and lower right corner
show that the SL-text has more random correspondences with itself than with the
TL-text. The two diagonals in the dark shades are very black because there is a
perfect match between each text with itself. The light regions represent the
matches between the two languages and the diagonals here show the
correspondences between the SL and the TL.
A heuristic search
finds the best path along the diagonal and thus provides an alignment in terms
of offsets in the texts.
(Church, 1993:5)
In addition to
this, another set of time saving heuristics is used: as some matches are more
interesting than others, the dots are weighted. Content words are much more
important and thus higher ranked than function words. This is especially
important for large input files. Church also obtained good results using this
cognate method program with non-Roman writing systems such as an
English-Japanese corpus, which in the first view can be quite confusing.
Cognates presuppose related languages even at the character level. But, the
success was due to the fact that the texts treated regarded a technical manual
with many identical technical terms in both languages, e.g. BEGIN, END,
getline, print, printf.
Applying Gale and
Church’s original sentence-length based method on Chinese, Dekai Wu found that
the statistical correlation of sentence lengths varied much more in
English-Chinese corpora than in corpora with other Indo-European languages. He argues
that the length-based methods not are that language independent as previously
stated: “if translated sentences share cognates, then the character lengths
of those cognates are of course correlated”. Wu proposed to extend the
algorithm using information about domain specific lexical indexes (Wu, 1993).
Haruno and Yamazaki
(1996) argue that for structurally very different languages, function words
impede alignment. That is why they eliminate all function words in the corpus using
a POS Tagger. They also suggest the use of matching word pairs from a
dictionary to avoid sparse data problems in aligning short texts with the Kay
& Röscheisen-method. (Manning and Schütze, 1999:483)
Pascal Fung (Fung
and Church, 1994) present an alternative alignment strategy called Kvec. The
bitext is split into K segments and binary vectors are created for the
SL and TL words. It starts by estimating lexicon candidates. If a word in
question (with its translation) is found in a segment, a flag is set in the
vector. Then the similarities of the words are calculated with the mutual
information method:
I(s,t) = prob(s ,t ) / prob(s) prob(t)
prob(s ,t ) is the probability that the SL word s and the TL word
t occur in corresponding text segments. prob(s) prob(t) are the
respective probabilities that s and t occur in the whole text. These values are
then used to rank the translation candidates, from which the most likely pair
will be chosen. The program discovers, for example, that the English word fisheries is similar to the French word Pęches, by noting that the distribution
of fisheries in the English text is
similar to the distribution of pęches
in the French text. Kvec does not depend on sentence boundaries. (Fung and
Church, 1994:1)
All the chosen pairs are then used as reference points when aligning the
parallel texts. The Kvec method was later developed into the DKvec method, to
handle parallel text documents with missing or incomplete segments, so called
noisy parallel corpora. In the algorithm the distances between the occurrences
of the SL and the TL words are stored in recency vectors. As the word
frequencies differ, the size of the recency vectors also differs. First these
recency vectors are computed for each s and t. To locate the translation
candidates, all candidate pairs that occur for the first time in the second
half of the text, are removed. Then, all pairs with a vector size less than
half of the other vector lengths also are removed. The absolute difference of
the remaining vectors is computed by Dynamic Time Warping, and so closely
correlated pairs can be identified. As with the Kvec method, these pairs are
then used as reference points to align the parallel text. (Ahrenberg et al.
1998).
Dan Melamed has developed a method for mapping bilingual text
correspondences, which like Char-align and Simard’s Salign (Simard and
Pamondon, 1996) uses cognates to align bilingual corpora at the character level
(Melamed, 1997). The bitext corpus used is seen as a two-dimensional space with
the SL and the TL character positions as the x and y axes. In this space the
algorithm searches for “true points of correspondence” (TPC’s), identified by
machine-readable dictionaries or cognate
based matching algorithms. The most linear and constant path between the
original and the terminal bitext space is chosen by filtering path candidates
in several steps. The obtained alignment path is then used to align the
sentences in the texts (Ahrenberg et al. 1998).
Text alignment is the first and important step when using multilingual
text corpora in several tasks: e.g. machine translation, computer assisted
translation (e.g. Brown et al., 1990), terminology (Tiedemann, 2000) or (Dagan
and Church, 1994), lexicography (Melamed, 1996), word sense disambiguation (Ide
and Véronis, 1998) or multilingual information retrieval (Nie et al. 1998),
just to mention some specific work.
Many research teams are working with different solutions. Still, sentence alignment is generally
processed in the two following steps: first the score for all pairs of
candidate sentences are calculated, and then the minimal-cost alignments are selected.
One can distinguish three main methods
to calculate these scores. The first uses sentence lengths calculated by the
number of words or characters, which is due to the assumption that corresponding
sentences in SL and TL are of similar length. Although it has been suggested
that length-based methods are language independent, deeper studies have shown
that the correlation more is due to historical relationships of the languages.
In Chinese for instance, as opposed to Indo-European languages, the most words
are formed by only one or two characters.
The second approach in alignment methodology relies on lexical
characteristics to perform an initial partial word alignment, used to obtain
the final sentence alignment. These methods are very robust, non possible
alignments will never match, they can be computationaly very intensive though. Finally,
there are mixed methods using proportions of lexical and formal characteristics
trying to both gain the advantages and leave out the drawbacks of the two
approaches. Most modern systems can be said to join this group.
Ahrenberg, L., Merkel, M., Ridings, D., Sĺgvall Hein, A., Tiedemann, J. .
1998. Automatic Processing of Parallel Corpora: A Swedish Perspective. Department of
Computer and Information Science, Linköpings University, Department of Swedish
Language, Göteborg University, Department of Liguistics, Uppsala University.
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J.,
Jelinek, F., Lafferty, J.D., Mercer, R. L. and Roossin, P. S. 1990. A
statistical approach to machine translation. Computational Linguistics,
16(2): 79-85, June.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R.L.
1991. Word -sense disambiguation using statistical methods. In ACL 29,
pp. 264-270.
Chen, S. F. 1993. Aligning sentences in bilingual corpora using lexical
information. In Association for Computational Linguistics, ACL
31, pp. 9-16.
Church, K. W. 1993. Char-align: A Program for Aligning Parallel Texts
at the Character Level. In Proceedings of the Workshop on Very Large
Corpora: Acedemic and Industrial Perspectives. Association for
Computational Linguistics, ACL.
Dagan, I. and Church K. W. 1994. TERMIGHT: Identifying and Translating
Technical Terminology. In proceedings of the 4th Conference on
Applied Natural Language Processing (ANLP).
Debili, F. and Sammouda, E.
1992. Appariement
des phrases de textes bilingues français-anglais et français-arabe. In Proceedings of
COLING-92, Nantes.
Fung, P. & Church, K. W. 1994. Kvec: A new approach for aligning
parallel texts. In Proceedings of the 15th International Conference on
Computational Linguistics (COLING 94), 1096--1102, Kyoto.
Gale, W. A., Church
K. W. 1993. A Program for Aligning Sentences
in Bilingual Corpora. Computational Linguistics, 19(3): 75-102.
Ide, N., Véronis,
J. 1998. Word Sense Disambiguation:The State of the Art. Laboratoire Parole et
Langage, Université de Provence,
Aix-en-Provence
Kay, M. and Röscheisen, M. 1988. Text-Translation Alignment,
Technical report. Xerox Palo Alto Research Center (Martin Kay and Martin
Röscheisen (1988). Text-Translation Alignment. Technical Report, Xerox Palo
Alto Research Center.
Published in Computational Linguistics, MIT Press, March 1993.)
Klavans, J. and Tzoukermann, E. 1995. Combining Corpus and
Machine-readable Dictionary Data for Building Bilangual Lexicons. Machine
Translation, 10(3).
Maklovitch, E. and Hannan, M.-L. 1996. Line ‘Em Up: Advances In
Alignment Technology And Their Impact On Translation Support Tools. In Proccedings
of the Second Conference of the Association for Machine Translation in the
Americas (AMTA-96). Montréal, Québec.
Manning, C., Schütze, H. 1999. Foundations of statistical Natural Language Processing. Cambridge, MA:
MIT Press. (http://nlp.stanford.edu/links/statnlp.html Last modified: Sun
Dec 30 10:32:52 PST 2001)
Melamed, I. D. 1997. A Portable Algorithm for Mapping Bitext
Correspondence. 35th Conference of the Association for Computational
Linguistics (ACL'97), Madrid, Spain.
Nie, J., Isabelle, P., Plamondon, P., Foster, G. 1998. Using a
Probabilistic Translation Model for Cross-language Information Retrieval.
In proceedings of the 6th ACL Workshop on Very Large Corpora (WVLC).
Montréal, Canada.
Simard, M. 1999. Text-Translation Alignment: Three Languages Are
Better Than Two. RALI,
http://www.iro.umontreal.ca/~simardm/emnlp99/final.html
Simard, M., Plamondon, P. 1996. Bilingual Sentence Alignment:
Balancing Robustness and Accuracy. RALI, Université de Montréal.
Tiedemann, J. 2000 Extracting Phrasal Terms using Bitext. In Proceedings
of the Workshop on Terminology Resources and Computation, held in conjunction
with LREC 2000, Athens/ Greece, May 2000.
Véronis, J. 2000. Chapitre 6 : Alignement de corpus multilingue. In Ingénierie
des langues, Traité IC2-Série Informatique et SI par Jean-Marie Pierrel. Čditions Hermes
Science, Paris.
Wu, D. 1993. Aligning Parallell English-Chinese Texts Statistically
with Lexical Criteria. Technical report, HKUST-CS93-9. The Hong Kong
University of Science & Technology.