One remarkable feature of natural language is that speakers can understand previously unseen words and coin new ones. This suggests that, in comprehending and producing words, people operate on sublexical elements, that is, chunks of word structure. How these chunks are recognised, interpreted, and arranged in the brain is one of the most intriguing questions of modern linguistics. Traditionally, it was assumed that sublexical elements, called morphemes, have independent cognitive status: they are stored alongside whole words, have their own meanings, and can be combined with each other to produce new words. More recently, a new family of theories has emerged that posit words, not morphemes, as the basic units of lexical processing. Under this approach, sublexical elements are believed to be parsed out of individual lexemes thanks to human sensitivity to statistical regularities in the input.
The morpheme-based approach provides an intuitive explanation for regular patterns of language structure, but many complex words are idiosyncratic. For example, English has several hundred bound bases of Romance origin, most of which are semantically so obscure that it is hard to see how speakers could recognise and use them as independent units. The word-based approach seems to overcome many problems that come with the explicit representation of morphemes. It is closely tied to the development of full-fledged computational models that capture morphological effects through iterative whole-word processing. In these models, words can be encoded in several ways, from simple n-grams to units obtained through complex character-convolution algorithms. This approach, however, is not without its problems. Crucially, it remains unclear what exactly is represented in the mental lexicon and how speakers acquire chunks of particular sizes in the process of learning.
In this project, I propose a new computational model of morphology that is based on graph theory and is intended to elaborate the word-based approach. The model represents a network of morphological elements segmented out of individual words through distributional analysis, which is driven by two general factors: formal similarity and co-occurrence frequency. Hence, when several related words include overlapping parts, these parts can be identified as sublexical units. In the model, a single learning mechanism accounts for both the emergence of morphological structure and the formation of complex words. It is based on a key concept of graph theory, the shortest path, that is, the optimal route between two nodes in a network. Semantics is also represented in the model distributionally. I contend that just as word meanings are inherently context-determined, so are the meanings of those units of word structure that are singled out from words by the shortest path analysis.
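To make the idea of shortest path analysis more concrete, here is a minimal sketch in Python. It is not the project's model: the toy lexicon, the helper names `chunk_counts` and `segment`, the `min_words` threshold, and the cost function are all assumptions made purely for illustration. Character positions in a word serve as nodes, every substring shared by at least two lexicon words serves as an edge, and an edge is cheaper the more words share that substring. Dijkstra's algorithm then finds the cheapest route from the start of the word to its end, and the edges on that route are the proposed chunks.

```python
import heapq
from collections import Counter


def chunk_counts(lexicon):
    """For every substring, count how many distinct lexicon words contain it."""
    counts = Counter()
    for word in set(lexicon):
        n = len(word)
        substrings = {word[i:j] for i in range(n) for j in range(i + 1, n + 1)}
        counts.update(substrings)
    return counts


def segment(word, counts, min_words=2):
    """Split `word` into chunks along the cheapest path from position 0 to len(word).

    Assumed toy cost per chunk: 1 + 1 / (number of words sharing the chunk).
    The constant 1 discourages over-segmentation; the second term rewards
    chunks that recur across many words. Multi-character chunks must occur in
    at least `min_words` words; single characters are always allowed so that
    a path exists for any input.
    """
    n = len(word)
    best = {0: 0.0}   # cheapest known cost to reach each position
    back = {}         # predecessor position on the cheapest path
    queue = [(0.0, 0)]
    while queue:
        cost, i = heapq.heappop(queue)
        if cost > best.get(i, float("inf")):
            continue  # stale queue entry
        if i == n:
            break     # end of word reached at minimal cost
        for j in range(i + 1, n + 1):
            shared = counts[word[i:j]]
            if shared < min_words and j - i > 1:
                continue
            new_cost = cost + 1.0 + 1.0 / max(shared, 1)
            if new_cost < best.get(j, float("inf")):
                best[j] = new_cost
                back[j] = i
                heapq.heappush(queue, (new_cost, j))
    # Walk the predecessor links backwards to recover the chunks.
    chunks, j = [], n
    while j > 0:
        i = back[j]
        chunks.append(word[i:j])
        j = i
    return chunks[::-1]


# Hypothetical toy lexicon; "jumping" is deliberately left out.
LEXICON = ["walk", "walked", "walking", "talk", "talked", "talking", "jump", "jumped"]
COUNTS = chunk_counts(LEXICON)

print(segment("walked", COUNTS))   # ['walk', 'ed']
print(segment("jumping", COUNTS))  # ['jump', 'ing']
```

In this toy setting, the unseen word "jumping" is split into "jump" and "ing" only because both parts recur in other words of the lexicon. How the actual model weights formal similarity and co-occurrence frequency, and how it handles meaning, is not specified here and would differ from this sketch.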

Duration | 01.05.2025-30.04.2028 |
Funding program | FWF ESPRIT |
Grant amount | € 346.505 |
Unit | Department of Linguistics |
Principal investigator | Sergei Monakhov |