Neural machine translation systems are usually trained on large corpora consisting of pairs of pre-translated sentences. The paper Unsupervised Machine Translation Using Monolingual Corpora Only by Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato proposes an unsupervised neural machine translation system that can be trained without such parallel data. The authors offer two motivations for their work:

- To translate between languages for which large parallel corpora do not exist.
- To provide a strong lower bound that any semi-supervised machine translation system is expected to exceed.

Note: What is a corpus (plural corpora)?

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language territory. A corpus may contain texts in a single language (a monolingual corpus) or text data in multiple languages (a multilingual corpus).

Overview of unsupervised translation system

The unsupervised translation scheme has the following outline:

- The word-vector embeddings of the source and target languages are aligned in an unsupervised manner.
- Sentences from the source and target language are mapped to a common latent vector space by an encoder, and then mapped to probability distributions over sentences in the target or source language by a decoder.
- A de-noising auto-encoder loss encourages the latent-space representations to be insensitive to noise.

The key idea here is to build a common latent space between languages. I shall describe these components in the following sections; a rough sketch of how they fit together is shown below.
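The following is a minimal sketch of that outline, not the architecture used in the paper: a single shared encoder maps sentences from either language into one latent space, and a per-language decoder maps latent states back to distributions over words. All module choices, sizes, and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # assumed toy vocabulary size, shared across languages
EMBED_DIM = 64
HIDDEN_DIM = 128

class SharedEncoder(nn.Module):
    """Maps a batch of token ids (from either language) to latent vectors."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.rnn = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        states, _ = self.rnn(self.embed(tokens))
        return states                              # (batch, seq_len, HIDDEN_DIM)

class Decoder(nn.Module):
    """Maps latent vectors to a distribution over words at each position."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(HIDDEN_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, latent):                     # latent: (batch, seq_len, HIDDEN_DIM)
        states, _ = self.rnn(latent)
        return self.out(states).log_softmax(dim=-1)   # per-word log-probabilities

encoder = SharedEncoder()                          # shared between both languages
decoders = {"src": Decoder(), "tgt": Decoder()}    # one decoder per language

# Encoding a toy source-language batch and decoding it back into the source
# language is the auto-encoding path; decoding with decoders["tgt"] instead
# would be the translation path.
tokens = torch.randint(0, VOCAB_SIZE, (2, 7))      # batch of 2 sentences, length 7
latent = encoder(tokens)
log_probs = decoders["src"](latent)
print(log_probs.shape)                             # torch.Size([2, 7, 1000])
```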
In the figure from the paper, on the left, the model is trained to reconstruct a sentence from a noisy version of it in the same language: x is the target, C(x) is the noisy input, and \hat{x} is the reconstruction. At each step, the LSTM decoder outputs a probability distribution over words, which should be interpreted as the distribution of the next word according to the decoder. The probability the decoder assigns to a sentence is then the product of the probabilities computed for each word in this manner.
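As a toy illustration of that last point, here is a short sketch with made-up numbers and an assumed three-word vocabulary: the decoder's per-step distributions are taken as given, and the sentence is scored by multiplying the probability assigned to each of its words.

```python
import math

# Assumed per-step distributions over a toy vocabulary {"a": 0, "b": 1, "c": 2}.
step_distributions = [
    [0.7, 0.2, 0.1],   # decoder's distribution over the 1st word
    [0.1, 0.6, 0.3],   # decoder's distribution over the 2nd word
    [0.2, 0.2, 0.6],   # decoder's distribution over the 3rd word
]
sentence = [0, 1, 2]   # token ids of the sentence being scored

# Product of per-word probabilities, accumulated in log space for stability.
log_prob = sum(math.log(dist[word]) for dist, word in zip(step_distributions, sentence))
print(f"{math.exp(log_prob):.3f}")   # 0.7 * 0.6 * 0.6 = 0.252
```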
De-noising auto-encoders

De-noising auto-encoders were introduced by Vincent et al. (2008), who provided numerous justifications, one of which is particularly illuminating. A de-noising auto-encoder is a function optimized to map a corrupted sample from some dataset back to the original, un-corrupted sample.
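To make this concrete, here is a minimal de-noising auto-encoder sketch on toy vectors rather than sentences; the architecture, noise model, and hyper-parameters are illustrative assumptions, not those of the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
data = torch.rand(256, 16)                       # toy "clean" samples x

model = nn.Sequential(                           # encoder + decoder
    nn.Linear(16, 8), nn.ReLU(),                 # encode to a smaller latent code
    nn.Linear(8, 16),                            # decode back to the input space
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(200):
    corrupted = data + 0.3 * torch.randn_like(data)   # C(x): additive noise
    reconstruction = model(corrupted)                  # \hat{x}
    loss = loss_fn(reconstruction, data)               # compare against the clean x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final reconstruction loss: {loss.item():.4f}")
```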