Computational Linguistics Laboratory

Research

STILVEN

STILVEN is a project that was approved in December 2007 and started its activities in February 2008. It is accessible at the following link. Its task was to create a computational infrastructure for the analysis and translation of the Veneto language. Veneto is regarded as a dialect nowadays, but it was the official language of the Veneto Republic for as long as eight centuries, until the Republic, occupied first by the French and then by the Austrians, became part of the newborn Italian nation in the second half of the XIXth century. Since then, Veneto has been slowly abandoned in favour of Italian. The same happened all over Italy.
The first problems to be addressed are therefore:
- accounting for the varieties
  - in the lexicon
  - in the grammar
  - in the orthography
- accounting for the orthographic variations
As to the first of these problems, we have implemented a number of different lexica which refer at the same time to the four main varieties, to Italian and to English.
As to the second problem, it has been only partially solved. The currently implemented solution takes into account possible orthographic ambiguities and produces a uniform output to be matched against the main translation lexicon.
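The normalization step described above can be illustrated with a minimal sketch. It is not the project's implementation: the variant table, the toy lexicon and the function names are hypothetical, and it assumes, purely for illustration, that "£" and "ł" are two spellings of the same sound.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical variant table: spelling variant -> canonical character used
# for lexicon lookup. These are NOT the project's actual rules.
VARIANTS: Dict[str, str] = {"£": "ł"}


def normalize(token: str) -> str:
    """Return a lowercase, NFC-normalized token with spelling variants unified."""
    token = unicodedata.normalize("NFC", token).lower()
    return "".join(VARIANTS.get(ch, ch) for ch in token)


def lookup(token: str, lexicon: Dict[str, str]) -> Optional[str]:
    """Match the normalized token against a lexicon keyed by normalized forms."""
    return lexicon.get(normalize(token))


# Toy one-entry lexicon, for illustration only.
LEXICON = {normalize("caramełe"): "sweets"}

print(lookup("carame£e", LEXICON))
print(lookup("caramełe", LEXICON))
```

Both spellings reach the same lexicon entry because the lookup key is always the normalized form; an unknown token simply returns `None`.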
STILVEN aims at translating free text input by taking advantage of a combination of statistical, pattern-matching and rule-based methods. The following goals and premises were defined for the project:
1. use simple NLP tools and resources,
2. use bilingual hand-made dictionaries,
3. use Italian as intermediate language,
4. use translation units at sentence boundaries,
5. use different tagsets for source language (SL) and target language (TL).
The first task we completed was collecting as much text as possible from the web and from people collaborating on a voluntary basis. The collected texts were then homogenized as to their orthography. Texts belonging to different varieties were kept separate. In total, we collected texts amounting to 200,000 tokens, which were then used to compile frequency lists. The lists were in turn the basis for the wordform lexicon of Veneto, which we compiled following similar lexica available in our laboratory for Italian, English, German and French. The wordform lexicon has been compiled on the basis of the Italian one, so that each entry contains the corresponding Italian wordform and lemma. Semantic and syntactic properties of the Veneto wordform can then be derived directly from the fully specified, subcategorized Italian lexicon.
We then normalized a large translation lexicon (50,000 entries) containing lemmas of Veneto paired with Italian and English. This lexicon will be used to generate all wordforms of Veneto in this year's activities; our current task is also the implementation of a morphological analyser for Veneto. The need for the analyser is clear if we consider that Veneto makes use of enclitics, as do Italian and other Romance languages.
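Compiling a frequency list from the collected texts can be sketched as follows. This is a minimal illustration, not the procedure actually used in the project: the tokenizer (runs of letters, lowercased) and the function name `frequency_list` are assumptions, and the two sample sentences are taken from the examples further down.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed tokenizer: a token is a maximal run of Unicode letters.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def frequency_list(texts: Iterable[str]) -> List[Tuple[str, int]]:
    """Count lowercased tokens over all texts, most frequent first."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(text))
    return counts.most_common()


# One list per variety would be built by calling the function on each
# variety's texts separately; here a single toy sample is used.
sample = ["Ti te parli massa.", "I bocia i magna."]
for wordform, count in frequency_list(sample):
    print(wordform, count)
```

The resulting (wordform, count) pairs are the kind of input from which a wordform lexicon could then be compiled by hand.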
Problems related to translating Veneto into English and vice versa are very close to those encountered when translating from/into Italian. Basically, the most interesting types of problems are the following:
a. Subject Clitic Doubling
b. Complementizer Doubling in questions
c. Amalgams (prepositions + article; verb + enclitic)
d. Order of clitics dative/accusative
e. Ambiguous 3rd person singular/plural inflection in present tense
f. Proper Noun preceded by article
g. Subject clitic erased with unambiguous verb inflection (1st sing/plur)
h. Subject adjoined as enclitics in interrogative sentences
(1) Go poduo dargheo. / I managed to give it to him/her.
(2) Ti te parli massa. / You speak too much.
(3) No so cossa che fassa e£ Giani. / I don’t know what John is doing.
(4) Partito de boto? / Do you leave at once?
(5) I bocia i magna £e carame£e. / The kids eat sweets.
(6) Qua ghe dorme e£ Giani. / Here sleeps John.
(7) Dime chi che xe vegnuo. / Tell me who has come.
(8) Cossa xe che i fa? / What do they do?
In example (1) the subject pronoun is lexically unexpressed; we then have a dative clitic pronoun “ghe”, which is ambiguous between feminine and masculine. This clitic must be detached from the verb and separated from the accusative “o” or “lo” (it). Most importantly, the accusative/dative order required in English sequences of pronouns is reversed in Veneto, where it is identical to Italian.
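The detachment and reordering of enclitics can be sketched for the wordform "dargheo" of example (1). The pattern, the toy inventories and the function names below are hypothetical and cover only this one verb and clitic combination; a real morphological analyser would need a full lexicon and rule set.

```python
import re
from typing import Optional, Tuple

# Toy inventories, for illustration only.
VERBS = {"dar": "give"}
DATIVE = {"ghe": "to him/her"}       # ambiguous between masculine and feminine
ACCUSATIVE = {"o": "it", "lo": "it"}

# Assumed shape of the amalgam: verb + dative clitic + accusative clitic.
ENCLITIC_RE = re.compile(r"^(?P<verb>.+?)(?P<dat>ghe)(?P<acc>lo|o)$")


def split_enclitics(wordform: str) -> Optional[Tuple[str, str, str]]:
    """Detach the dative and accusative enclitics from the verb, if present."""
    match = ENCLITIC_RE.match(wordform.lower())
    if match is None:
        return None
    return match.group("verb"), match.group("dat"), match.group("acc")


def to_english(wordform: str) -> Optional[str]:
    """Translate the amalgam, emitting accusative before dative as English requires."""
    parts = split_enclitics(wordform)
    if parts is None or parts[0] not in VERBS:
        return None
    verb, dat, acc = parts
    return f"{VERBS[verb]} {ACCUSATIVE[acc]} {DATIVE[dat]}"


print(split_enclitics("dargheo"))
print(to_english("dargheo"))
```

The Veneto order (dative, then accusative) is kept in the analysis and only swapped at generation time, so the same analysis could feed an Italian target, where the order does not change.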
Another case of ambiguity which requires additional information is constituted by 3rd person plural and singular verb forms, which are identical. The only remarkable morphological marker in English is the “s” of the third person singular of the present tense. In this case, the agreement needs to be recovered from the subject if it is linguistically expressed, or else from the context. One such case is presented below. The presence of these features in Veneto does not guarantee the effectiveness of statistical models, due to the high sparsity of data.
During the project we used GETARUNS as the parser.