Identification of Lithuanian Arbitrary Collocations

  • Jolanta Kovalevskaitė
  • Erika Rimkutė
  • Jurgita Vaičenonienė
Keywords: arbitrary collocations, Lithuanian, lexical restrictedness, vector space model, semi-automatic approach, linguistic manual approach


    This article describes the methodological approaches to arbitrary collocation (AC) identification developed within the project “Arbitrary Collocations of Lithuanian: Identification, Description and Usage (ARKA)”. The source of the research was the “Database of Lithuanian Multiword Expressions”, which contains over 12,000 collocations. The 2,400 identified arbitrary collocations had the following structures: adjective (participle)+noun, verb+noun, and noun+noun. Arbitrary collocations were identified by combining a manual approach with semi-automatic methods of computational linguistics. The manual approach relied on two major AC identification criteria: (1) lexical restrictedness and/or (2) meaning transfer. Lexical restrictedness was measured using two tests: (a) synonym substitution of the pre-modifier and/or (b) semantic field comparison of the head noun. The semi-automatic approach consisted of three stages: (1) automatic generation of vector strings with potential synonyms; (2) manual editing of the vector strings; and (3) comparison of collocation pre-modifiers with the generated synonym pairs. Approximately half of the ACs were detected using the manual methods, about one third using the semi-automatic methods, and one fifth using a combination of both approaches. This suggests that the best results in identifying Lithuanian arbitrary collocations are achieved by combining the manual and semi-automatic approaches.
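
The three semi-automatic stages can be illustrated with a minimal sketch. It is one possible reading of the procedure, not the project's actual implementation. Everything in it is an assumption made for illustration: the toy vectors, the English stand-in words, the cosine-similarity threshold, the set of attested collocations, and the function names `candidate_synonyms`, `edit_manually` and `is_lexically_restricted`.

```python
import math

# Hypothetical toy embeddings; a real study would use vectors trained on a corpus.
VECTORS = {
    "strong": [0.9, 0.1, 0.0],
    "powerful": [0.85, 0.15, 0.05],
    "heavy": [0.1, 0.9, 0.0],
    "green": [0.0, 0.1, 0.9],
}

# Hypothetical set of attested (pre-modifier, head noun) collocations.
ATTESTED = {("strong", "tea"), ("strong", "engine"), ("powerful", "engine")}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_synonyms(word, vectors, threshold=0.9):
    """Stage 1: list potential synonyms of `word`, most similar first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, score in scored if score >= threshold]


def edit_manually(candidates, rejected):
    """Stage 2: stand-in for manual editing; drop candidates a linguist rejected."""
    return [word for word in candidates if word not in rejected]


def is_lexically_restricted(collocation, synonyms, attested):
    """Stage 3: flag the collocation if no synonym can replace its pre-modifier.

    Returns False when there are no synonyms, since the test is then inconclusive.
    """
    _, head = collocation
    if not synonyms:
        return False
    return not any((synonym, head) in attested for synonym in synonyms)


synonyms = edit_manually(candidate_synonyms("strong", VECTORS), rejected=set())
tea_restricted = is_lexically_restricted(("strong", "tea"), synonyms, ATTESTED)
engine_restricted = is_lexically_restricted(("strong", "engine"), synonyms, ATTESTED)
```

In this toy setup, "strong tea" is flagged as a candidate AC because the substituted pair ("powerful", "tea") is not attested, while "strong engine" is not flagged because "powerful engine" is attested. A flagged item would still need the manual checks described above, such as meaning transfer.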