Morphology is the study of word forms. Morphological parsers are computational tools that automatically produce a morphological analysis for a given word form. Such tools have proven to be quite useful as spelling checkers, as morphological grammar checkers, in producing interlinear text and in adaptation of a text from one related language to another. This document is designed to help the reader do morphological parsing using the approach allowed by Stage 1 of the FieldWorks Language Explorer parser.
The purpose of this documentation is to provide an introduction to the key concepts and notions in the FieldWorks Language Explorer approach to morphological parsing. It is divided into two main sections: morphotactics and morphophonemics. The first has to do with controlling which morphemes can co-occur with which other morphemes within a well-formed word. The second has to do with controlling the phonological shape of individual morphemes. (There is one other main section that deals with some issues related to lexical entries.)
Please note that the mechanisms described here are the ones available for Stage 1, the first, rather simple-minded (linguistically speaking) instantiation of FieldWorks Language Explorer. Later stages will provide much more power and many more capabilities.[1] The main reason why we have stages in the FieldWorks Language Explorer development project is to avoid trying to develop tools that meet all the user interface challenges in one fell swoop. Doing that would be quite a daunting task and would take a long time before any product could be released. Instead, we are staging the development to handle the basic items first. Then we'll add more and more as we go along.
We begin by addressing some of the key issues that any general morphological parser must face. Before we can tell the computer what to do, we need to understand what is going on linguistically. What kinds of language phenomena must such a computational tool be able to handle if it will indeed be a general tool?
Many, if not most, languages inflect verbs and/or nouns. Consider the nominal Orizaba Nahuatl forms shown in (1) and the verbal ones shown in (2).[2]
| (1) |
|
| (2) |
|
Notice how each possessed noun in (1) has at least a possessor prefix. Certain nouns require this possessor inflection. Similarly the verbs in (2) require subject markers (with the possible exception of 3rd person). A morphological parser must account for such inflectional items.
Consider the English forms[3] in (3). What is happening here? How do you get a dumb computer to “understand” these forms correctly?
| (3) |
|
In (3a) institute is a verb root (e.g. We need to institute some changes around here.). By adding the suffix ‑ion as in (3b), the word is changed to a noun. The suffix ‑al can be added to a noun stem to change it to an adjective, as in (3c). The suffix ‑ize changes an adjective into a verb (3d). Further category changes occur with the addition of each suffix in (3e-g). From this English example, we have seen that the computer needs to be able to distinguish between roots and suffixes, with each one restricted as to what category it attaches to and what category it changes the stem to. (Note, for example, that the suffix ‑ly cannot be added to either a verb stem or a noun stem: *institutely, *institutionly.)
A Huallaga Quechua example showing similar category changes along with various types of verbal and nominal affixes is given in (4). The verb root meaning ‘to see’ has the imperfective aspect marker added, followed by the first person object marker, yielding ‘to see me.’ The addition of the nominalizer changes the form to a noun meaning ‘seeing me.’ The noun form can now be possessed by the second person possessive marker and then the purpose marker may optionally follow, finally giving ‘in order that you might be seeing me.’[4]
| (4) |
|
A morphological parser must account for such derivational items.
Ambiguity is also apparent in (3a), since institute can be either a verb, as above, or a noun, as in Australian Institute of Marine and Power Engineers. Note that there are different types of ambiguity in natural language as well. For example, the word bank (among other things) can mean either the side of a river or a building that holds money. With either meaning, bank is a noun.
Now consider the following word:
Note that cooks is ambiguous not only in the root meaning but also as to the suffix: the -s is a nominal plural morpheme in (5a) but a verbal third person singular present tense morpheme in (5b).
A morphological parser must be able to deal with the fact that individual words can legitimately be ambiguous. That is, a morphological parser must be able to discover and report all possible analyses of a word form. In many cases, the ambiguity is eliminated when the word is seen in context, so ideally a morphological parser is used in the context of computational tools that look beyond a single word.
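To make the "report all possible analyses" requirement concrete, here is a minimal sketch in Python. The toy lexicon, the category labels, and the function name `analyze` are all assumptions made for illustration; they are not FieldWorks Language Explorer data structures.

```python
from typing import List, Tuple

# Toy lexicon (illustrative assumption, not FieldWorks data structures).
ROOTS = {"cook": ["N", "V"]}  # root form -> possible categories
# suffix form -> list of (gloss, category it attaches to)
SUFFIXES = {"s": [("PL", "N"), ("3SG.PRES", "V")]}

def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every (root, category, suffix gloss) analysis of `word`."""
    analyses = []
    for i in range(1, len(word) + 1):
        root, rest = word[:i], word[i:]
        for category in ROOTS.get(root, []):
            if rest == "":
                # Bare root: one analysis per category of the root.
                analyses.append((root, category, ""))
            for gloss, attaches_to in SUFFIXES.get(rest, []):
                if attaches_to == category:
                    analyses.append((root, category, gloss))
    return analyses
```

Calling `analyze("cooks")` returns both the noun-plus-plural and the verb-plus-third-singular analyses rather than picking one, which leaves the choice to whatever tool looks at the surrounding context.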
There are still other types of challenges for morphological parsing. For example, consider the Caquinte word in (6):[5]
| (6) |
|
The (t) in two places on the second line (which shows the word broken into morphemes) are not really morphemes at all. Instead, they are epenthetic consonants added to serve as onsets to syllables. Caquinte does not allow vowel clusters nor syllables without onsets (in this part of the verb), so whenever two vowels come together at a morpheme break, an epenthetic t is inserted. A morphological parser needs to be able to correctly account for forms that include epenthetic segments inserted to preserve syllable structure.
Now consider the Caquinte form in (7), which is the same word as in (6), but changed to future tense:
| (7) |
|
What is the challenge here? The future tense is realized as a discontinuous morpheme: it is composed of the prefix n‑ and the suffix ‑e. The computer must be able to check these noncontiguous parts of the word to correctly analyze the future tense in Caquinte; one part cannot be present without the other.
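The co-occurrence check just described can be sketched as follows. This is only an illustration in Python: it assumes the word has already been segmented into a list of morpheme shapes, and the placeholder `"STEM"` and the function name are hypothetical. Only the shapes n‑ and ‑e come from the text.

```python
from typing import List

# Only the shapes n- and -e come from the text; everything else is hypothetical.
FUTURE_PREFIX = "n-"
FUTURE_SUFFIX = "-e"

def future_is_well_formed(morphemes: List[str]) -> bool:
    """Accept a segmentation only if both halves of the future marker
    are present together, or both are absent."""
    has_prefix = FUTURE_PREFIX in morphemes
    has_suffix = FUTURE_SUFFIX in morphemes
    return has_prefix == has_suffix
```

A segmentation with only one half of the marker, such as `["n-", "STEM"]`, is rejected.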
The Tagalog forms in (8) (from Spencer 1991:12-13) illustrate another challenge:
| (8) |
|
What is happening here? This is a case of infixation, where the root sulat splits into two parts so that one of the focus morphemes, ‑um‑ or ‑in‑, can be inserted. A parser must correctly recognize the root even though it is broken apart by the infix.
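One way a program might recover the root is sketched below in Python. It assumes that the infix sits immediately after the first consonant of the root, which is what inserting ‑um‑ or ‑in‑ into sulat yields (sumulat, sinulat); the function name and the five-vowel inventory are assumptions for the sketch.

```python
from typing import List, Tuple

INFIXES = ["um", "in"]  # the two focus infixes discussed above
VOWELS = set("aeiou")   # simplified vowel inventory (assumption)

def strip_infix(form: str) -> List[Tuple[str, str]]:
    """Return (root, infix) candidates, assuming the infix sits right
    after the first consonant of the root."""
    candidates = []
    if not form or form[0] in VOWELS:
        return candidates
    onset, rest = form[0], form[1:]
    for infix in INFIXES:
        if rest.startswith(infix):
            # Rejoin the onset with what follows the infix.
            candidates.append((onset + rest[len(infix):], infix))
    return candidates
```

The candidates are only proposals. A real parser would still have to confirm that the recovered root (here sulat) exists in the lexicon.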
Look at the additional Tagalog forms in (9) to determine how the imperfective aspect is marked:
| (9) |
|
We know from (8a) that sulat means ‘to write’. So in (9a) it appears that the imperfective marker is su, but we cannot tell if it is a prefix or an infix without looking at other forms. In example (9b) the causative ‘to make someone’ is the prefix pa‑. The mag‑ is what some call the actor focus or actor voice morpheme. But the imperfective of this causative form is not *sumagpasulat, *magsupasulat, nor *magpasusulat as we would expect from either prefixing or infixing su. Instead, we have magpapasulat in (9c) where it is clear that the marker for imperfective is the extra pa. The correct analysis is therefore that imperfective aspect is marked in Tagalog by reduplicating either the first syllable of the stem or the initial consonant and vowel of the first syllable of the stem.
A morphological parser must be able to recognize reduplication within a word form.
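A rough sketch of how such recognition might work for the consonant-plus-vowel case is given below in Python. The function name and the simplified vowel inventory are assumptions. The sketch only handles a copy at the very start of the string, so a prefix such as mag‑ would have to be removed first.

```python
from typing import Optional

VOWELS = set("aeiou")  # simplified vowel inventory (assumption)

def strip_cv_reduplication(form: str) -> Optional[str]:
    """If `form` begins with a copy of the stem's first consonant and
    vowel, return the stem without the copy; otherwise return None."""
    if len(form) < 4:
        return None
    copy = form[:2]
    is_cv = copy[0] not in VOWELS and copy[1] in VOWELS
    if is_cv and form[2:4] == copy:
        return form[2:]
    return None
```

For example, `strip_cv_reduplication("papasulat")` gives back the causative stem pasulat, while an unreduplicated form such as sulat yields `None`.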
Semitic languages pose a special challenge with their root and pattern morphology. These languages have roots composed of three consonants, as exemplified in the Silt'i data in (10), where ‘buy’ is the root wkb. The aspect markers are composed of vowel patterns that fit between or around the root consonants, such as the a-a vowel pattern indicating the perfective aspect shown in (10). The parser needs to be able to find the root consonants and corresponding vowels of the aspect, even though they are intermingled in the surface form of the word.[6]
| (10) |
|
Now study the following Caquinte word.
| (11) |
|
What change takes place at the juncture between the final two morphemes? Notice that where one might expect the sequence keahi, what surfaces is kehai, where the h and a switch positions.[7] Such a transposition of phonemes is called metathesis. Furthermore, notice that the metathesis process in (11) crosses morpheme boundaries.
Such data imply that a morphological parser must be able to correctly identify morphemes even when some segments within the morphemes may have switched positions.
For a final challenge, consider these Caquinte forms (you do not need to understand all the morpheme glosses here; just concentrate on the initial subject prefixes):
| (12) |
|
What is the problem with the subject prefixes? In (12a) we see that the first person inclusive subject marker is a‑, and in (12b) the third person feminine subject marker is o‑. Yet, in (12c), the gloss shows ambiguity between ‘we’ and ‘she’ as the subject, and both of these are represented as null. This is because both subject prefixes are vowels and the stem in (12c) is vowel-initial, yielding two vowels together. Recall from (6) that Caquinte generally does not allow vowel clusters, and therefore adds an epenthetic ‑t‑ when necessary to avoid such clusters. It turns out that epenthesis is only used in the suffixes. Within the prefixes, the initial vowel of a cluster deletes, causing the ambiguity seen in (12c).
This means that a morphological parser must be able to identify a morpheme even when the morpheme has no overt segments.
Given the challenges of morphological parsing exemplified in the preceding section, how can a computer program go about analyzing words into their constituent morphemes? Let's say that the task of a morphological parser is to take a form like itsavetacojitiro from (6) above and
What are some of the things our parser is going to have to know and what are some of the things that it is going to have to do?
Things the parser needs to KNOW:
Things the parser needs to DO:
Clearly, properly using and controlling the constraints is the major task in implementing a parser for a given language. Since a morphological parser must model linguistic reality, it is a good idea to use constraints that model appropriate linguistic notions. Two major concepts for morphology are morphotactics and morphophonemics. Morphotactics deals with which morphemes can co-occur with which other morphemes. Morphophonemics deals with what shape a given morpheme will have in various phonological and morphological environments. The next two major sections outline the constraints available with the Stage 1 FieldWorks Language Explorer parser and how to use them.
Morphotactics has to do with controlling the order of the morphemes in a well-formed word and controlling which morphemes can co-occur with which other morphemes. As examples of the former, one would not expect to find a prefix at the end of a word or a suffix at the beginning of a word. As an example of the latter, while one would expect a tense affix to appear with a verb root in a verbal word, one would not expect a tense affix to show up on a pronoun. The morphotactic mechanisms described in this section delineate what one can do within the FieldWorks Language Explorer model to control such things. The idea is to use the morphotactic mechanisms to correctly describe the facts of the language and thereby not only provide correct parses, but also rule out false parses.
By the way, correctly describing the facts of the language also provides the basis for a grammatical description, something that FieldWorks Language Explorer provides. By making a correct description of the facts we can both generate a description that people can read to learn about the language and feed the information to a parser that can put our description to work checking spellings, adapting to other languages, and verifying the fit of our description.
Note that for words which consist solely of a single morpheme, there are no special morphotactic considerations. One merely adds appropriate lexical entries for these and ensures that the morpheme type of each allomorph[9] in the entry is set to a root or stem type.
This section has four major sub-sections. The first deals with handling affixation to stems (section 2.1). The second deals with stem compounding (section 2.2). The third discusses issues related to clitics (section 2.3). The fourth is for those cases where the parser is producing parses that are incorrect, but the Stage 1 mechanisms do not allow any other way to eliminate the false parses (section 2.4).
This section discusses issues relating to adding affixes to stems. Linguists typically divide affixes into two major categories: inflectional and derivational. Therefore, FieldWorks Language Explorer allows you to declare a given affix as being either inflectional or derivational. In the process of analyzing a language, however, sometimes one does not yet know whether a given affix is inflectional or derivational. There are certain affixes which are truly difficult to classify in this fashion. For this reason, FieldWorks Language Explorer also allows you to label a given affix as being unclassified with respect to inflection and derivation. As you study the language more, you should eventually figure out whether such affixes are inflectional or derivational and then you can change their status from being unclassified to the appropriate one.
You can label an affix as “unclassified” when you do not know if it is derivational or inflectional. Please understand, though, that when you do this, the affix is relatively unconstrained as to where it can appear. As a result, the FieldWorks Language Explorer parser may return a number of incorrect parses for some word forms which happen to contain a sequence of characters that match one or more allomorphs of an unclassified affix. One partial solution to this is to indicate the category of the stem to which the affix may attach. The best solution, of course, is to classify the affix as being either inflectional or derivational so it will only show up where it should.
Inflectional affixes typically reflect what some call “grammatical meaning.” These are things like person, number, case, gender, tense, aspect, etc. One can also typically create a paradigm of word forms with the various inflectional categories as labels on the chart.[10]
For example, consider the information for a possessed noun in Orizaba Nahuatl given in (1) above, but this time displayed in a different fashion:
| (13) |
|
What are the inflectional affixes here? Given that every form has the sequence kal, it appears that there are six possessor prefixes which occur before the noun stem. Similar paradigms for other singular possessed nouns would show the same situation (ignoring any morphophonology). Therefore we could posit that the singular possessed noun has an inflectional template that consists of a possessor prefix followed by the stem. We could diagram this as in (14).
| (14) |
|
Now consider the plural possessed noun data from (1) above, but displayed in a similar fashion to (13).
| (15) |
|
What are the inflectional affixes here? Notice that there is the same stem (kal) and the same set of six possessor prefixes as in (13). In addition, there is a plural suffix ‑van. Similar paradigms for other plural possessed nouns would show the same situation (ignoring any morphophonology). Therefore we could posit that the plural possessed noun has an inflectional template that consists of a possessor prefix followed by the stem which, in turn, is followed by a plural suffix. Since plural is an instance of the notion of number, we could diagram this as an inflectional template as shown in (16).
| (16) |
|
Notice what we have described here: for a particular category (possessed noun), we have an inflectional template with one prefix slot (for possessor) and one suffix slot (for number). The possessor slot can be filled by any of the inflectional prefixes listed in (13). The number slot can be filled by the plural suffix.
Now you may well have noticed that there is a potential problem here with the template in (16). If we treat each slot in the template as being obligatory, then the template says we must have a number suffix in order for the template to be satisfied. This means that a possessed singular noun will not meet the requirements of this template because it does not have a suffix in the number slot. It turns out that FieldWorks Language Explorer actually does treat each slot as being obligatory unless it is overtly marked as being optional.
What can we do about this? There are at least three options available within the FieldWorks Language Explorer approach:
Which of these three should we use? Options 1 and 2 will effectively give the same result, although option 1 is definitely simpler. Following the general principle known as Occam's Razor,[11] option 1 is thus better.
Option 3 requires us to posit a null suffix and some argue that if an affix is always null (as it would be here) then what we really have is a default feature: unless there is an overt number suffix, assume that the number is singular. While Stage 1 of FieldWorks Language Explorer does not allow us to mark such default features, later stages of FieldWorks Language Explorer will.
Therefore, from a long-term perspective, we recommend following option 1.
This means that to model this inflectional template, we will need to do the following:
| (17) |
|
Once we have done this, we will have successfully set up the inflectional morphotactics for possessed nominals in Orizaba Nahuatl.
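The behavior of a template whose Number slot is marked optional can be illustrated with a small Python sketch. The `Slot` class, the function name, and the gloss labels used in the usage note are hypothetical; the sketch only mirrors the rule stated above that every slot is obligatory unless it is marked optional.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Slot:
    name: str
    optional: bool = False  # slots are obligatory unless marked otherwise

# Template from (16): Possessor - STEM - Number, with Number marked optional.
PREFIX_SLOTS = [Slot("Possessor")]
SUFFIX_SLOTS = [Slot("Number", optional=True)]

def satisfies_template(filled: Dict[str, str]) -> bool:
    """`filled` maps slot names to the gloss of the affix found there.
    Every obligatory slot must be filled and no unknown slot may appear."""
    slots = PREFIX_SLOTS + SUFFIX_SLOTS
    known = {slot.name for slot in slots}
    if not set(filled) <= known:
        return False
    return all(slot.optional or slot.name in filled for slot in slots)
```

With this setup, a singular possessed noun such as `{"Possessor": "1SgPoss"}` satisfies the template, and so does a plural one that also fills the Number slot, but a form with no possessor prefix does not.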
In the previous section we suggested that using optional affix slots in a template was a good choice for handling Orizaba Nahuatl nominal possession. Since we noted that within the FieldWorks Language Explorer approach, one could add more than one template to a category, one might wonder when it would be appropriate to choose such an option.
Orizaba Nahuatl happens to provide such a case. Consider the information for an intransitive, present tense verb given in (2) above, but this time displayed in a fashion more conducive to our purposes here:
| (18) |
|
What are the inflectional affixes here? At least under one analysis, there are four subject prefixes and a plural suffix. Third person subject is the default or is null. Similarly, singular number is the default or null.
Where do these inflectional affixes appear? Notice that all the subject ones appear just before the stem and that the plural suffix appears right after the stem. Similar paradigms for other intransitive verbs would show the same situation (ignoring any morphophonology). Therefore we could posit that the present tense, intransitive verb has an inflectional template that consists of a subject inflectional affix followed by the stem which is followed by a number inflectional suffix. We could diagram this as in (19).
| (19) |
|
At first glance, this is very much like what we saw for possessed nominals in example (16) above. We might think initially that we can do exactly what we did for possessed nominals and merely mark the Number slot as optional for these intransitive verbs. If we were to do that, however, notice what would happen for a form like timiki which is supposed to only mean ‘you(sg.) die.’ Because the Number slot would be optional, the FieldWorks Language Explorer parser would allow a parse of 1PlSubj-to.die as well (this, of course, is because both 2SgSubj and 1PlSubj have the same shape: ti‑). At this point, we would have nothing to prevent this incorrect parse.[12]
To eliminate this problem (as well as to eliminate the possibility of the parser allowing a parse for an ill-formed word such as *anmiki), we can create two inflectional templates: one for singular and one for plural. The singular one will be like this:
| (20) |
|
The plural one will be like this:
| (21) |
|
Notice how this method places the singular subject markers in the singular template and puts the plural subject markers in the plural template. This way we force the presence of the plural suffix for the plural subject prefixes.
What needs to be done to handle the 3rd person cases? We will need to mark the subject slot as optional in both templates in order to allow for the 3rd person cases.
This means that to model this inflectional template, we will need to do the following:
| (22) |
|
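The effect of splitting the verb template in two can be sketched in Python as follows. The sketch assumes that the singular template has no Number slot and that the plural template requires one. The gloss shown for an‑ is an assumption (the text only implies that it is a plural subject prefix), and the function name is hypothetical. Third person forms, which have no subject prefix, are not modeled here.

```python
from typing import List

# Subject prefix shapes and their glosses. ti- is ambiguous, as noted above.
# The gloss for an- is an assumption; the text only implies it is plural.
SUBJECT_PREFIXES = {
    "ti": ["2SgSubj", "1PlSubj"],
    "an": ["2PlSubj"],
}
SINGULAR = {"2SgSubj"}          # fillers of the singular template's subject slot
PLURAL = {"1PlSubj", "2PlSubj"}  # fillers of the plural template's subject slot

def subject_analyses(prefix: str, has_plural_suffix: bool) -> List[str]:
    """Glosses allowed for `prefix` under the two templates: the singular
    template has no Number slot; the plural template requires one."""
    allowed = PLURAL if has_plural_suffix else SINGULAR
    return [g for g in SUBJECT_PREFIXES.get(prefix, []) if g in allowed]
```

For timiki, which has no plural suffix, `subject_analyses("ti", False)` returns only the second person singular gloss, and `subject_analyses("an", False)` returns nothing, which corresponds to rejecting *anmiki.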
In section 1.1.5 above, we noted that in Caquinte, the future tense is realized as a discontinuous morpheme: it is composed of the prefix n‑ and the suffix ‑e. We repeat the example here:
| (23) |
|
How do we fulfill this requirement that both the future prefix and future suffix appear? One way is to create a future tense inflectional template which has both the prefix and the suffix required. The template might look like this:
| (24) |
|
The categories in FieldWorks Language Explorer are organized in a hierarchical fashion. For example, one can have a major category of verb and then nest other verb types underneath it (e.g. intransitive verb, transitive verb, etc.). One can even nest other types under these if one so wishes (e.g. one might put bitransitive verb under transitive verb).
The exact hierarchy one uses can make a difference for how FieldWorks Language Explorer handles the inflectional templates and their slots. When one defines the slots for a given category, those slots may be used in any template for this category and any of its nested categories. The same is true for templates. You may well need to keep this in mind as you design your category hierarchy.
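The inheritance of slots down the category hierarchy might be pictured with the following Python sketch. The `Category` class and the slot names (`Subject`, `Tense`, `Object`) are hypothetical and chosen only for illustration; this is not how FieldWorks Language Explorer stores its categories.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Category:
    name: str
    parent: Optional["Category"] = None
    own_slots: List[str] = field(default_factory=list)

    def available_slots(self) -> List[str]:
        """Slots defined on this category plus those inherited from ancestors."""
        inherited = self.parent.available_slots() if self.parent else []
        return inherited + self.own_slots

# Hypothetical hierarchy and slot names.
verb = Category("verb", own_slots=["Subject", "Tense"])
transitive = Category("transitive verb", parent=verb, own_slots=["Object"])
bitransitive = Category("bitransitive verb", parent=transitive)
```

Here a slot defined on verb is usable in templates for every nested category, whereas the slot defined on transitive verb is not available to verb itself.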
Now consider the Yalálag Zapotec data given in (25)‑(26):[13]
| (25) |
|
| (26) |
|
What is the phonological shape of the Future marker? It appears to be u‑ in (25) but the “fortifier” segment/feature :‑ (i.e. a colon) in (26). Notice that there do not appear to be any phonological reasons for the different allomorphs. In fact, the stem has the same phonological shape in (25a) and in (26a).[14] This problem is not isolated to these pairs of forms; it turns out that verb roots in general divide into two groups, those that take the u‑ future and those that take the :‑ future.
How do we handle this kind of allomorphy when the choice of allomorphs is not motivated by the phonological environment but by the choice of the lexical root? The FieldWorks Language Explorer approach is to use inflection classes. An inflection class is “a set of lexemes whose members each have the same type of inflectional forms” (Aronoff 1994:64). Inflection classes correspond to the traditional idea of declension classes or conjugation classes. For Yalálag Zapotec, we would create two inflection classes within the verb category (so that they apply to all verbs, not just one particular subtype of verb). One class would be for roots that select the u‑ allomorph and the other would be for those that take the “fortifier” :‑ allomorph.
This means that to model these inflection classes, we will need to do the following:
| (27) |
|
Now consider the following Latin data which also illustrates the use of inflection classes.[15]
| (28) |
|
Note that while there are five distinct declensions in Latin, there are only three forms for the dative plural: ‑is, ‑ibus, and ‑ebus. In particular, notice that ‑is is used for both declension class I and II and, similarly, ‑ibus is used for both declension class III and IV. So to model this Latin data in FieldWorks Language Explorer, we will need to do the following:[16]
| (29) |
|
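The idea of tying each allomorph to the set of inflection classes it serves can be sketched in Python. The dictionary layout and the function name are assumptions; the distribution of the three dative plural forms over the five declensions follows the description above.

```python
from typing import Dict, FrozenSet, Optional

# Dative plural allomorphs keyed to the declension classes they serve,
# following the distribution described for (28).
DATIVE_PLURAL: Dict[str, FrozenSet[str]] = {
    "is": frozenset({"I", "II"}),
    "ibus": frozenset({"III", "IV"}),
    "ebus": frozenset({"V"}),
}

def dative_plural_for(inflection_class: str) -> Optional[str]:
    """Return the allomorph whose inflection classes include the stem's class."""
    for allomorph, classes in DATIVE_PLURAL.items():
        if inflection_class in classes:
            return allomorph
    return None
```

Because one allomorph can list several classes, three entries are enough to cover all five declensions.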
Another mechanism offered by FieldWorks Language Explorer can be illustrated by the Spanish noun data given in (30) below:
| (30) |
|
Notice that the main difference between these nouns is the gender suffix. If the ‑a ‘Feminine’ suffix is used, then the cas root means ‘house’. On the other hand, if the ‑o ‘Masculine’ suffix is used, then the cas root means ‘case’.
For a human, it is not necessarily difficult to keep these facts straight, but for a morphological parser, we need some way to prevent it from thinking that casa has the masculine root cas that means ‘case’. Similarly we need a way to keep the parser from thinking that caso has the feminine root cas that means ‘house’. That is, we need a way to prevent the parser from giving “analyses” such as the ones shown in (31), where the asterisk (*) indicates that the analysis is incorrect.
With the FieldWorks Language Explorer parser we use inflection features to deal with this issue. Inflection features are typically characteristics of a morpheme that play a role in the inflection of a word and/or play a role in the syntax (such as agreement within a noun phrase or agreement between a verbal affix and the noun phrase it agrees with). Note that if you use the Morphological Glossing Assistant tool for glossing inflectional affixes, then FieldWorks Language Explorer will automatically add some inflection features for you.
Coming back to the Spanish data in (30) and (31) above, how exactly does one use inflection features to rule out incorrect parses such as the ones in (31)? The problem here is that there is a mismatch between the gender of the root and the gender of the affix. If we can mark the root for the correct gender and also mark the suffixes for the gender they agree with, then the FieldWorks Language Explorer parser will only produce the correct parses.
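The gender check can be illustrated with a short Python sketch. The list-and-dictionary layout and the function name `parse_noun` are assumptions; the two roots and two suffixes are the ones discussed for (30).

```python
from typing import List, Tuple

# Two homophonous roots and two gender suffixes, as discussed for (30).
ROOTS = [("cas", "house", "fem"), ("cas", "case", "masc")]
SUFFIXES = {"a": "fem", "o": "masc"}

def parse_noun(word: str) -> List[Tuple[str, str]]:
    """Return (root gloss, gender) pairs whose root and suffix genders agree."""
    parses = []
    for form, gloss, gender in ROOTS:
        suffix = word[len(form):]
        # Keep the parse only when the suffix carries the root's gender.
        if word.startswith(form) and SUFFIXES.get(suffix) == gender:
            parses.append((gloss, gender))
    return parses
```

With both the roots and the suffixes marked for gender, casa is parsed only with the root meaning ‘house’ and caso only with the root meaning ‘case’.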
Note that in cases where a noun has a noun class, say, and in addition has a possessive affix with a different noun class, we must be careful to keep the two noun classes from clashing with each other. If we merely use a feature of “Class” for both the noun and the possessive affix, then the values will differ and the parser will not analyze the word. Instead, we need to use separate noun agreement and possessor agreement complex features. Within each of these complex features, we use the “Class” feature and its values. In this way, not only does the parser correctly analyze the word, it will also have the correct features demarcated for eventual syntactic analysis.
How does one create and use an inflection feature in FieldWorks Language Explorer?
| (32) |
|
Many languages will use one or more of the inflection features listed in the chart shown in (33) below.
| (33) |
|
These are just some examples. Your language may use these or may need others. You may want to check with a linguistic consultant who is familiar with your language family for ideas as to which inflection features are appropriate for your language. Or you may just want to add them only when you find a need for them, such as when the FieldWorks Language Explorer parser gives incorrect parses for forms.
The Spanish data illustrates how we can use gender inflection features to rule out incorrect parses when a gender affix shows up incorrectly on a root. Some possible situations where inflection features could play a similar role in ruling out incorrect parses include those shown in (34).
| (34) |
|
When modeling a given language, one may well wonder if a given phenomenon should be handled by inflection classes or by inflection features. Here are some guidelines to help one decide:
Look at the various affixes involved.
| If they ... | then use ... |
|---|---|
| | inflection class |
| have semantic differences (i.e. actually have different meaning) | inflection features |
| are involved in (syntactic) agreement | inflection features |
| are really declension classes or conjugation classes | inflection classes |
| are noun classes or gender | inflection features |
Derivational affixes typically reflect what some call “lexical meaning.” They go on a stem to produce a new stem. The new stem may then be inflected (if the category of the new stem has inflection). Derivational affixes often change syntactic category. See Bickford (1998:135ff) for more on this.
The English data from example (3) is repeated below with more information:
| (35) |
|
What do we have here? We have five derivational suffixes, each of which changes the major category of the resulting stem. Recall that these suffixes only go on stems of a certain category. For example, the ‑al suffix only goes on noun stems. It does not go on other stems (*institutal, *institutionalal, and *quicklyal). These affixes are summarized in (36) below.
| (36) |
|
How do we model these category changing affixes in FieldWorks Language Explorer? We need to do the following:
| (37) |
|
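The from-category and to-category bookkeeping can be sketched in Python as follows. The dictionary layout and the function name `derive` are assumptions. The entries for ‑ion, ‑al, and ‑ize follow the discussion of (3); the entry for ‑ly is an assumed one, since the text only states that ‑ly rejects verb and noun stems.

```python
from typing import List, Optional

# suffix -> (from category, to category)
DERIVATIONAL = {
    "ion": ("V", "N"),
    "al": ("N", "Adj"),
    "ize": ("Adj", "V"),
    "ly": ("Adj", "Adv"),  # assumed entry; the text only says -ly rejects V and N
}

def derive(root_category: str, suffixes: List[str]) -> Optional[str]:
    """Apply suffixes in order; return the final category, or None if a
    suffix meets a stem of the wrong category."""
    category = root_category
    for suffix in suffixes:
        from_cat, to_cat = DERIVATIONAL[suffix]
        if category != from_cat:
            return None
        category = to_cat
    return category
```

Starting from the verb root institute, `derive("V", ["ion", "al", "ize"])` ends in a verb, while `derive("V", ["ly"])` returns `None`, matching the ill-formed *institutely. The sketch compares categories for exact equality and so ignores nested categories.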
Now consider the pairs of data in (38)-(40) from Turkish:[19]
| (38) |
|
| (39) |
|
| (40) |
|
What is the key difference in each pair? It is the addition of the passive morpheme. Notice how the number of arguments changes from two (subject and object) to one (just subject) with the addition of the passive.
Is passive, then, a category changing derivational affix? While it does not change major category (i.e. it does not change a verb into a noun, say) it does change a transitive verb into an intransitive verb. That is, passive is a case where the sub-category is changed. Many languages have other such sub-category changing derivational affixes such as causatives, applicatives, and transitivizers. As far as FieldWorks Language Explorer is concerned, these are category changing derivational affixes since the result of the derivation produces a different sub-category that potentially requires a different inflectional template to complete the word form.
How do we model these sub-category changing affixes in FieldWorks Language Explorer? We need to do the following:
| (41) |
|
Now consider the following Yalálag Zapotec data:[21]
| (43) |
|
The addition of the repetitive prefix does not change either the major category or the sub-category of the words in (42)-(43). One might wonder, then, if the repetitive in Yalálag Zapotec is actually an inflectional prefix. The evidence that it is derivational is that it actually changes the inflection class of the resulting stem. In (42a) the stem is inflection class 2 (because it takes the “fortifier”:‑ allomorph of the future prefix). After the a‑ repetitive prefix is added in (42b), the resulting stem uses the inflection class 1 allomorph of future (u/w‑).
How do we model these non-category changing affixes in FieldWorks Language Explorer? We need to do the following:
| (44) |
|
Notice that in this case the from‑ and to‑ categories will be the same, but we do need to deal with the change in inflection class. This leads us to the next topic below.
If the language you are studying has inflection classes (see section 2.1.2.6), then what happens when derivational affixes are attached? Does the inflection class of the stem stay the same or does it change?
As we saw from the Yalálag Zapotec data in 2.1.3.3, the inflection class can indeed change. How do we model this? In addition to what we've done for the categories, we need to do the following:
| (45) |
|
There are cases, though, where a derivational affix is attached and it does not change the inflection class of the resulting stem. For example, consider the following data from Atzingo Popoloca:[22]
| (46) |
|
The applicative suffix Apl adds an argument to the verb, but it does not change the inflection class of the resulting stem. The root in (46) belongs to inflection class 1 and so takes the t‑ allomorph of the present tense morpheme. Adding the applicative does not change this (46b). Similarly, the root in (47) belongs to inflection class 2 and so takes a null allomorph of the present tense. Once again, adding the applicative does not change the inflection class of the resulting stem (47b).
To model this in FieldWorks Language Explorer, one does the following:
| (48) |
|
If the language you are studying has inflection features (see section 2.1.2.7), then what happens when derivational affixes are attached to a stem with, say, agreement features? Or what happens when a derivational affix changes the category of the stem to a category that has agreement features? For example, consider the Spanish data in (49) and (50):[23]
| (50) |
|
Here we have a verb (e.g. apretar) and a noun derived from that verb (e.g. apretón). Recall from section 2.1.2.7 that Spanish nouns are marked for gender (masculine or feminine). While Spanish verbs are not marked for gender, a noun derived from a verb will have gender. In the case of the ‑ón derivational suffix, the resulting noun has masculine gender. To properly model this, we would need to indicate that the resulting noun has this gender.
How does one mark a derivational affix for inflection features in FieldWorks Language Explorer?
| (51) |
|
As we noted in section 2.1.2.5, the categories in FieldWorks Language Explorer are organized in a hierarchical fashion.
The exact hierarchy one uses can make a difference for how FieldWorks Language Explorer handles the categories of derivational affixes. When one indicates the “from category”, FieldWorks Language Explorer will allow the derivational affix to apply to stems of this category and any of its nested categories. You may well need to keep this in mind as you design your category hierarchy.
Sometimes this implies that one will need to have more than one mapping for a given derivational affix. For example, one might need a causative to map as follows:
| “from category” | “to category” |
|---|---|
| intransitive verb | transitive verb |
| transitive verb | ditransitive verb |
| noun | transitive verb |
To do this, you need to add a separate mapping for each possible from/to pair.
If a derivational affix only changes meaning (i.e. it does not change the category or the sub-category), then one can use the highest level category for both the “from category” and the “to category”. In this case, the FieldWorks Language Explorer parser will pass on the (sub-)category of the stem to which the derivational affix attaches as the resulting category of the new stem. For example, if one chooses to model an adverbial affix on a verb as being derivational, then if one marks both the “from category” and the “to category” as "verb," then when this affix attaches to an intransitive verb, the resulting stem will still be intransitive. If it attaches to a transitive verb, then the resulting stem will still be transitive.
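The matching behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the FieldWorks Language Explorer implementation: the `PARENT` table, the functions `is_same_or_nested` and `derive`, and the tiny category hierarchy are all assumptions made for the example.

```python
# Hypothetical category hierarchy: child -> parent (None marks a top-level category).
PARENT = {
    "verb": None,
    "intransitive verb": "verb",
    "transitive verb": "verb",
    "ditransitive verb": "verb",
    "noun": None,
}


def is_same_or_nested(category, ancestor):
    """True if `category` is `ancestor` or is nested somewhere under it."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENT[category]
    return False


def derive(stem_category, mappings):
    """Return the categories a derived stem may have, given (from, to) mappings."""
    results = []
    for from_cat, to_cat in mappings:
        if not is_same_or_nested(stem_category, from_cat):
            continue  # this mapping does not apply to the stem
        # When from and to are identical, the stem's own (sub-)category is passed on.
        results.append(stem_category if from_cat == to_cat else to_cat)
    return results


# One mapping per from/to pair, as in the causative table above.
CAUSATIVE = [
    ("intransitive verb", "transitive verb"),
    ("transitive verb", "ditransitive verb"),
    ("noun", "transitive verb"),
]
# A meaning-only affix uses the highest level category on both sides.
ADVERBIAL = [("verb", "verb")]

print(derive("intransitive verb", CAUSATIVE))
print(derive("transitive verb", ADVERBIAL))
```

In this sketch the adverbial affix attaches to any kind of verb and leaves its sub-category alone, while the causative needs its three separate mappings because each stem type yields a different result.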
Derivational affixation tends to be close to the root. Since derivation sometimes changes the category of a stem, this is not surprising. Derivational affixes, then, normally occur inside of inflectional ones.
However, there are cases in some languages where a stem will be inflected, then a category changing derivational affix will be attached and the resulting stem will be inflected.
The Quechua example we saw in (4) is such a case. It is repeated below in (52).[4]
| (52) |
|
At least under one analysis, the verb root meaning ‘to see’ has the imperfective aspect marker added, followed by the first person object marker, yielding ‘to see me.’ We thus have a verb stem inflected with an aspect and an object marker. To this inflected form, the nominalizer derivational affix is attached, resulting in a noun meaning ‘seeing me.’ The noun form then has the second person possessive marker and the purpose marker added, finally giving ‘in order that you might be seeing me.’ That is, the resulting noun stem is now inflected by a possessive and a (kind of) case marker. We could diagram this process as in (53).
In (53) the Infl nodes represent inflected forms. Note how the derivational suffix ‑na changes the inflected verb into a noun stem (Stem[n]). This stem is then inflected.
It turns out that while the Infl[n] node is a fully inflected noun, the Infl[v] is actually only a partially inflected verb: it lacks a required subject suffix. That is, a form such as rikaykaamaa with the analysis of to.see‑Imp‑1Obj is ill-formed. Thus, the verbal inflectional template given in (54) is a special kind of template: it does not represent a fully inflected form. Rather, it requires that there be a derivational affix attached in order for the word to be well-formed. We will refer to this kind of template as a “non-final” template; that is, this kind of template cannot be the final part of a word. Note that all of the other templates we have seen are final templates. In fact, FieldWorks Language Explorer will assume that an inflectional template is final unless it is told otherwise.
| (54) |
|
How does one handle such derivation outside of inflection in FieldWorks Language Explorer? One needs to perform the following:
| (55) |
|
Even when one has correctly classified the affixes in a language as being derivational or inflectional, sometimes a morphological parser will find combinations of stem and affix that are simply incorrect. This may be due to historical or some other seemingly arbitrary reasons.
For example, consider the following Orizaba Nahuatl data:
| (56) |
|
Notice that in this data, the “Absolutive” suffix (which normally goes on singular, unpossessed nouns) appears to derive a noun from a verb. When one models this, one may find that other nouns which have the absolutive suffix now analyze as derived nouns. For example, one might get these:
| (57) |
|
The FieldWorks Language Explorer parser allows one to rule out such incorrect combinations via what have sometimes been called exception “features.”[24] The basic idea is to tag the affix with an exception “feature.” The only time the FieldWorks Language Explorer parser will then allow this affix to occur is when the stem to which it attaches also has been tagged with the same exception “feature.” Thus you can restrict the productivity of the affix so that it occurs only on certain stems. Note that this is only possible for affixes which have been classified as either derivational or inflectional. Exception “features” are not available for unclassified affixes.
If a given affix has two or more exception “features,” then the stem to which it attaches must be tagged with all of the exception “features” that the affix has. Note that if an affix does not have any exception “features” but the stem to which it is being attached does have one or more exception “features,” then the affix will still be allowed to attach (as far as the exception “features” are concerned).
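The rule in the two paragraphs above amounts to a subset test. The following is a minimal sketch of that test; the function name `affix_may_attach` and the feature labels are hypothetical and are not taken from FieldWorks Language Explorer.

```python
def affix_may_attach(affix_features, stem_features):
    """Exception "feature" check: every feature tagged on the affix must
    also be tagged on the stem. An affix with no features always passes."""
    return set(affix_features) <= set(stem_features)


# Hypothetical feature labels, for illustration only.
print(affix_may_attach({"takes-absolutive-nominalizer"}, {"takes-absolutive-nominalizer"}))
print(affix_may_attach({"featA", "featB"}, {"featA"}))
print(affix_may_attach(set(), {"featA"}))
```

The third call shows the asymmetry noted above: a stem may carry exception “features” that the affix lacks, and the affix is still allowed to attach.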
To tag affixes and stems with exception “features,” do the following:
| (58) |
|
This section relates to the compounding of two or more stems within a single orthographic word.[25]
There are two basic kinds of compounds: headed compounds (section 2.2.1) and non-headed compounds (section 2.2.2). We also discuss issues relating to incorporation (section 2.2.3), issues relating to compounding when stems contain affixes (section 2.2.4), and issues relating to the organization of categories (section 2.2.5).
Consider the following Orizaba Nahuatl data:[26]
| (59) |
|
What are the categories of the two members of the compound? The left one is an adjective and the right one is a noun. What is the category of the compound? It is a noun. Thus the examples in (59) show an adjective compounding with a noun where the result has the same category as the right member of the compound. We can therefore say that the “head” of the compound is the right member.
Now consider the following Orizaba Nahuatl data:
| (60) |
|
In (60) the left member is a noun and the right member is an adjective. As in (59), the result is a noun. Thus the “head” of the compound is the left member in the cases in (60).
Both of these are instances of headed compounds. Either the left or the right member of the compound is the head of the compound. That is, the category of the resulting compound is the same as either the left or the right member of the compound.
How do we model these kinds of rules for Stage 1 of FieldWorks Language Explorer?
| (61) |
|
Now consider the following Spanish data:
| (62) |
|
Which member of the compound is the head? Clearly it is not the left member since the resulting compound in both cases is a noun and the left members are a preposition in paracaídas and a verb in sacamuelas. But is the head really the right member of the compound? While the right member is a noun, this noun is not inflected for the correct gender and/or number. Thus, these examples show the need for the other kind of compound rule: non-headed compounds. In non-headed compounds, the category and/or agreement features of the resulting stem are not simply inherited from a head member. Instead, the new stem may be something different.
To model this in FieldWorks Language Explorer, we do the following:
| (63) |
|
Some languages allow the incorporation of lexical roots within the stem. The resulting stem may or may not differ from the non-incorporated stem in terms of category and/or features. This means that if the language you are modeling has incorporation, you will need to consider whether to use a headed or a non-headed compound rule for it.
Consider the following Yalálag Zapotec data:[27]
| (64) |
|
| (65) |
|
At least under one analysis, in (64b) the adverb to is incorporated onto the verb ej. In (65b), a different adverb, :cha:ch, is incorporated.
Notice that the resulting stem appears to have all of the characteristics of the verbal stem which is the left member of the compound as indicated by (64a) and (65a). Therefore, this kind of data can be modeled as a left-headed endocentric compound rule.
Now consider the following Orizaba Nahuatl data:[28]
| (66) |
|
| (67) |
|
What is happening here? Notice how the nouns in (66a) and (67a) replace the 3Obj marker in (66b) and (67b) to produce the forms in (66c) and (67c). In particular notice that the resulting stem no longer requires a transitive verb inflectional template, but rather an intransitive verb one. We can say that this is because the noun has been incorporated as the object and the result is an intransitive stem. We can model this in FieldWorks Language Explorer as a headed compound, but override the category of the head stem. We could diagram it something like this (where “[vt]” means a transitive verb stem and “[vi]” means an intransitive verb stem):
| (68) |
|
That is, we create a right-headed compound rule and set the “Overriding category” in the rule to be an intransitive verb. The rule will use all the characteristics of the head stem except for the category. It will override the category of the head stem with the specified “Overriding category”.
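As a rough sketch of how a headed compound rule with an optional overriding category might behave, consider the following. The `Stem` class, the `headed_compound` function, the placeholder forms, and the `+` separator are all assumptions for illustration; they do not reproduce the Nahuatl data or the parser's internals.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Stem:
    form: str
    category: str
    inflection_class: Optional[str] = None


def headed_compound(left, right, head, overriding_category=None):
    """Build a compound stem that takes its characteristics from the head
    member ("left" or "right"), optionally overriding only the category."""
    head_stem = left if head == "left" else right
    result = replace(head_stem, form=left.form + "+" + right.form)
    if overriding_category is not None:
        result = replace(result, category=overriding_category)
    return result


# Placeholder stems: an incorporated noun and a transitive verb head.
noun = Stem("NOUN", "n")
verb = Stem("VERB", "vt", inflection_class="class 1")

incorporated = headed_compound(noun, verb, head="right", overriding_category="vi")
print(incorporated)
```

Here the compound keeps the inflection class of the verbal head, but its category is replaced by the intransitive one, so it would call for an intransitive verb inflectional template.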
Consider the Wanca Quechua form given in (69) below:[29]
| (69) |
|
Here we have a (reduplicated) compound consisting of a root, a suffix, the same root, and the same suffix. This forms a compound as shown in (70):
In Stage 1 of FieldWorks Language Explorer, we must treat suffixes like ‑n in a special way. Affixes which can appear between roots in compounds we call “interfixes.” In order to tell the Stage 1 FieldWorks Language Explorer parser that a suffix like ‑n can appear in compounds like it does in (69), we must give it a morpheme type of “suffixing interfix”. This tells the Stage 1 FieldWorks Language Explorer parser that this suffix can appear either as a “regular” suffix (merely after a root) or as a suffix before another root in a compound. Note that it is the leftmost instance of ‑n that is crucial here.
There are three varieties of interfixes:
| Type | Description |
|---|---|
| infixing interfix | An infixing interfix is an infix that can occur between two roots or stems. |
| prefixing interfix | A prefixing interfix is a prefix that can occur between two roots or stems. |
| suffixing interfix | A suffixing interfix is a suffix that can occur between two roots or stems. |
If the language you are modeling has these kinds of compounds and you want the parser to analyze them via a compound rule, then you will need to mark any affixes which can appear between roots with these special morpheme types.
As we noted in sections 2.1.2.5 and 2.1.3.6 above, the categories in FieldWorks Language Explorer are organized in a hierarchical fashion.
The exact hierarchy one uses can make a difference for how FieldWorks Language Explorer handles the categories in compound rules. When one indicates the information for a left or right member of a compound, FieldWorks Language Explorer will consider stems of this category and any of its nested categories to match. You may well need to keep this in mind as you design your category hierarchy.
We turn now to consider clitics. Consider the Shipibo data below[30] and notice the ‑ra morpheme. Where does it occur and on what kinds of words does it appear?
| (71) |
|
| (72) |
|
In both (71) and (72), the indicative ‑ra morpheme appears at the end of the first word. In (71) it attaches to a subject and in (72) it attaches to the object.
This morpheme can also attach to other categories as the following examples demonstrate:
| (73) |
|
| (74) |
|
| (75) |
|
| (76) |
|
The ‑ra morpheme attaches to an adjective in (73), a postposition in (74), a verb in (75), and an adverb in (76). Notice that it actually appears at the end of the first constituent (a noun phrase in (73) and a postposition phrase in (74)).
Morphemes like this are often analyzed as being clitics. How do we model such clitics in FieldWorks Language Explorer?
| (77) |
|
FieldWorks Language Explorer will do the rest: such morphemes will be allowed to appear at the end (for enclitics) or at the beginning (for proclitics) of words. More than one clitic may appear on a single word. There is no ordering restriction between clitics (other than ad hoc morpheme co-prohibitions).
When one uses a morphological parser, it is not unusual for the parser to sometimes return a parse that is simply incorrect. These are sometimes due to allomorphs matching in places one would not have expected them to match. When one has used all the mechanisms provided by the parser to the best of one's ability and such incorrect parses continue to surface, one may well wish for some kind of mechanism to rule them out. FieldWorks Language Explorer provides “Ad hoc Co-prohibitions” for such situations. Note that it may well be the case that later stages of FieldWorks Language Explorer will provide more well-motivated means to rule out these infelicitous parses, but for now, these ad hoc solutions may have to do.
There are two main types of ad hoc co-prohibitions: morpheme-oriented ones and allomorph-oriented ones. This section deals with morpheme-oriented ones (see section 3.7 for allomorph-oriented ones). The basic idea is to list a key morpheme and then to list one or more other morphemes that cannot co-occur with the key one. One can constrain these other morphemes to never occur in one of the following ways with respect to the key morpheme:
| (78) |
|
Note that when there are two or more morphemes listed for “other morphemes,” their relative order is significant. They should be listed in the same linear order they have in a word.
How does one create a morpheme-oriented ad hoc co-prohibition in FieldWorks Language Explorer?
| (79) |
|
Occasionally one finds a situation where a set of ad hoc constraints have a common theme. Perhaps they all relate to a particular morpheme or to particular morphemes of a certain variety. This may be a hint as to what is really happening and may lead you to discover a linguistically-motivated way to model them. Or it could be that the FieldWorks Language Explorer model (or the currently implemented stage of FieldWorks Language Explorer) just does not happen to provide the appropriate linguistic mechanism to model the phenomenon correctly.
Yalálag Zapotec dependent pronominal suffixes exemplify such a situation (see López y Newberg 1990:9). In Yalálag Zapotec, a verb may have both a subject and an object person suffix on it. Being a VSO language, the subject occurs before the object. What is different here is that there is a pronominal hierarchy among these dependent pronominal suffixes. Given the subject suffix, the only dependent object suffixes which may follow are those that are lower down on the person hierarchy. This is illustrated in (80).
| (80) |
|
How would one model such a hierarchy in FieldWorks Language Explorer? Well, one could create a number of different transitive verb inflectional templates in order to force the hierarchy to come out. But this does not really capture the facts all that well and also complicates and obscures what is common in the transitive verb template. (By the way, neither the subject nor the object is required to be filled by a suffix.) Probably the better approach is to create a morpheme ad hoc co-prohibition group and place the set of appropriate ad hoc co-prohibitions for the hierarchy in that group. This way one can document the fact of the hierarchy and have it all in one place. It also documents the fact that the FieldWorks Language Explorer model does not have an overt mechanism to deal with such a hierarchy.
How does one create such a group?
| (81) |
|
Finally, note that FieldWorks Language Explorer allows one to group both allomorph and morpheme ad hoc co-prohibitions together. Please be sure to only do so if these co-prohibitions truly do have something in common.
Besides constraining the overall positions where morphemes can occur (i.e. deal with morphotactics), we need to be able to account for the surface forms that the morphemes have and the particular environments where an allomorph is legitimate.
Consider the following Orizaba Nahuatl data:
| (82) |
|
What are the shapes of the 1SgSubj and the 2SgSubj allomorphs? The first person singular subject marker appears to be ni‑ before consonants and n‑ before vowels. Similarly, the second person singular subject marker alternates between ti‑ and t‑.
How can we encode this information? There are at least two ways to deal with such phonological information:
Generative phonology uses the first approach (also known as the item and process approach, Hockett (1954)), in which processes derive the surface forms. Stage 1 of FieldWorks Language Explorer, however, chooses the second approach (also known as the item and arrangement approach, Hockett (1954)), in which the allomorphs are simply listed along with the environments where each may occur. Plans for Stages 2 and 3 of FieldWorks Language Explorer include also allowing the first.[32]
For Stage 1 of FieldWorks Language Explorer, then, the basic mechanism available is to list allomorphs and then have the option to constrain individual allomorphs by their environment. To define an environment, one may well want to use natural classes of segments (e.g. consonant, vowels, voiceless stops, nasals, etc.). To define such natural classes, we need to know what the possible segments are.
In order to use environments which refer to phonemes or which have natural classes, you need to create a list of all the phonemes in your language. For each phoneme, you need to indicate one or more written representations for it. For example, in Greek, the /s/ phoneme has two such representations: ς (which is used word finally) and σ (which is used everywhere else).
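One way to picture the relationship between phonemes and their representations is as a lookup table from each representation to its phoneme. The sketch below is only an illustration under that assumption: the `REPRESENTATIONS` table, the phoneme labels, and the longest-match strategy in `to_phonemes` are hypothetical and say nothing about how FieldWorks Language Explorer stores or uses this information.

```python
# Hypothetical table: written representation -> phoneme label.
# Both Greek sigmas map to the same /s/ phoneme.
REPRESENTATIONS = {"σ": "s", "ς": "s", "α": "a"}


def to_phonemes(word):
    """Convert a written word into a list of phoneme labels,
    trying longer representations before shorter ones."""
    by_length = sorted(REPRESENTATIONS, key=len, reverse=True)
    phonemes = []
    i = 0
    while i < len(word):
        for rep in by_length:
            if word.startswith(rep, i):
                phonemes.append(REPRESENTATIONS[rep])
                i += len(rep)
                break
        else:
            raise ValueError(f"no phoneme defined for {word[i]!r}")
    return phonemes


print(to_phonemes("σας"))
```

Because both σ and ς resolve to the same phoneme, a natural class or environment that mentions /s/ applies regardless of which representation appears in the text.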
In addition to these phonemes, you may also need to refer to word boundaries in an environment. For this reason, Stage 1 of FieldWorks Language Explorer comes with a predefined word boundary marker: the # symbol.
Stage 1 of FieldWorks Language Explorer also comes with a potential set of phonemes already defined. That is, you do not need to start from scratch when building the list of phonemes for your language. However, you may well need to edit the list of phonemes initially included for a new language project. This initial set of phonemes is given in (83) below.
| (83) |
|
To define the set of phonemes for the language you are modeling, do what is shown in the following:
| (84) |
|
Once you have the phonemes defined, then you can create natural classes of phonemes. Some common ones include such things as consonants, vowels, voiceless stops, back vowels, etc. To do this in FieldWorks Language Explorer do the following:
| (85) |
|
We highly recommend that you give these natural classes unique abbreviations. While it is possible to have two or more natural classes with abbreviations spelled exactly the same way, we do not recommend doing so on purpose. Duplicate abbreviations will not confuse FieldWorks Language Explorer, because it uniquely identifies every natural class internally. They may well, however, confuse you or a reader of your grammar.
Once the set of phonemes and natural classes are defined for the language you are modeling, you can define environments for allomorphs. You can add them either in the environment editor or with a given allomorph.
In Stage 1 of FieldWorks Language Explorer, you key these environments using a special notation. This notation is one that is reminiscent of what is used in many generative-style rules. The basic rules of thumb are:
| (86) |
|
Example (87) gives some sample environments along with what they mean.
| (87) |
|
A given allomorph may have more than one environment, in which case the various environments are logically ORed with each other. For example, if a given allomorph can appear either before a consonant or word finally, then you can list both an environment for “before a consonant” and one for “before a word boundary.” Example (88) shows what this might look like, assuming that you have a natural class of consonants with an abbreviation of C.
| (88) |
|
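To make the OR-ing of environments concrete, here is a minimal sketch of an environment checker. It assumes a simplified reading of the notation (`/`, then space-separated items to the left and right of `_`, with `[X]` for a natural class and `#` for a word boundary); the `CLASSES` table and all function names are hypothetical, and the sketch covers only this subset of the notation.

```python
# Hypothetical natural classes for the example.
CLASSES = {"C": set("ptkmnslwy"), "V": set("aeiou")}


def segment_at(word, pos):
    """Character at pos, '#' just outside the word, None further out."""
    if 0 <= pos < len(word):
        return word[pos]
    if pos in (-1, len(word)):
        return "#"
    return None


def token_matches(token, segment):
    if segment is None:
        return False
    if token.startswith("[") and token.endswith("]"):
        return segment in CLASSES[token[1:-1]]
    return token == segment  # literal segment or the '#' boundary


def env_matches(env, word, start, end):
    """Check one environment around the allomorph occupying word[start:end]."""
    left, right = env.strip().lstrip("/").split("_")
    left_ok = all(token_matches(t, segment_at(word, start - 1 - i))
                  for i, t in enumerate(reversed(left.split())))
    right_ok = all(token_matches(t, segment_at(word, end + i))
                   for i, t in enumerate(right.split()))
    return left_ok and right_ok


def allomorph_allowed(envs, word, start, end):
    """No environments means unconstrained; otherwise any one match suffices."""
    return not envs or any(env_matches(e, word, start, end) for e in envs)


envs = ["/ _ [C]", "/ _ #"]  # before a consonant OR word finally
print(allomorph_allowed(envs, "anta", 0, 2))
print(allomorph_allowed(envs, "ana", 0, 2))
```

The words `anta` and `ana` are made-up strings used only to exercise the two environments.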
It is crucial to note that allomorphs are ordered in the sense that their respective environments are disjunctively ordered. For example, for the Nahuatl 1SgSubj allomorphs above in example (82), we could list the two allomorphs in any of the ways shown in (89)-(92).
| (89) |
|
| (90) |
|
| (91) |
|
| (92) |
|
Note in particular that for the two implicit methods, one does not have to overtly state the environment for the last allomorph. This is because each allomorph automatically inherits the negation of the environments of any preceding allomorphs. Thus, for the Implicit 1 method, the n is automatically treated as having an environment of "not before a consonant." Similarly, for the Implicit 2 method, the ni is automatically treated as having an environment of "not before a vowel." For more on this, see section 4.1.2.
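The effect of disjunctive ordering can be sketched as "the first listed allomorph whose environment fits wins." The code below illustrates the two implicit listings for the 1SgSubj allomorphs ni and n. The function names, the vowel set, and the two stems are assumptions for illustration; the stems are not taken from (82).

```python
VOWELS = set("aeiou")  # assumed vowel inventory for the example


def before_consonant(stem):
    return stem[0] not in VOWELS


def before_vowel(stem):
    return stem[0] in VOWELS


def choose_allomorph(ordered_allomorphs, stem):
    """Return the first allomorph whose environment holds for a non-empty stem.
    An environment of None means 'no overt environment' (the elsewhere case)."""
    for form, environment in ordered_allomorphs:
        if environment is None or environment(stem):
            return form
    return None


# Implicit 1: ni is constrained; n inherits "not before a consonant".
IMPLICIT_1 = [("ni", before_consonant), ("n", None)]
# Implicit 2: n is constrained; ni inherits "not before a vowel".
IMPLICIT_2 = [("n", before_vowel), ("ni", None)]

for stem in ("choka", "ahsi"):  # illustrative stems
    print(stem, choose_allomorph(IMPLICIT_1, stem), choose_allomorph(IMPLICIT_2, stem))
```

Both listings pick the same allomorph for every stem, which is why the last allomorph in each can be left without an overt environment.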
In the next five sections, we will address five issues brought up in section 1.1. First, we deal with reduplication.
Consider the following data from Bahasa Indonesia:[34]
In examples (93)-(97) note that the entire word is reduplicated, no matter what its syllabic shape might be. This is what is often called full reduplication.[35]
In examples like (93)-(97), one cannot tell whether the reduplication morpheme is a prefix or a suffix. However, sometimes a stem will reduplicate and other affixes may be adjoined. For example, consider (98)‑(99):
Notice the ‑nya suffix which comes after the reduplicated stem. Given the way we are modeling full reduplication in FieldWorks Language Explorer, the reduplication morpheme and any additional affixes must be either all prefixes or all suffixes.[36] Thus, in modeling examples (98)‑(99), we would make the reduplication morpheme be a suffix.
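The idea of treating the copy as a suffix that precedes any other suffixes can be sketched as follows. This is an illustration only: the function names, the hyphen separator, and the sample word are assumptions and are not drawn from examples (93)-(99).

```python
def full_reduplication(stem, suffix="", separator="-"):
    """Stem, then the reduplication 'suffix' (a full copy), then other suffixes."""
    return stem + separator + stem + suffix


def analyze(word, known_suffixes=("nya",), separator="-"):
    """Return (stem, suffix) if word is stem + copy (+ a known suffix), else None."""
    for suffix in ("",) + tuple(known_suffixes):
        if suffix and not word.endswith(suffix):
            continue
        body = word[: len(word) - len(suffix)]
        left, sep, right = body.partition(separator)
        if sep and left and left == right:
            return left, suffix
    return None


print(full_reduplication("anak", "nya"))
print(analyze("anak-anaknya"))
```

Because the copy is modeled as the first suffix, the ‑nya suffix simply follows it and no prefixes are involved.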
How do we indicate full reduplication for Stage 1 of FieldWorks Language Explorer?
| (100) |
|
As we saw in the Tagalog data from (9) from section 1.1.7, it is not always the case that the entire stem is reduplicated. The Tagalog data is repeated here.
| (101) |
|
Recall that we saw that this is a case where the imperfective aspect is realized by reduplicating the first CV syllable of the stem to which it attaches.
Now consider the following Orizaba Nahuatl data:[39]
What is the reduplication pattern here? It is the initial CV of the stem followed by an h. In the Nahuatl case of reduplication, there is not only the copied material, but also some fixed segmental material.
The kind of reduplication illustrated in (101)-(103) above is often referred to as partial reduplication. How do we model such partial reduplication in Stage 1 of FieldWorks Language Explorer?
| (104) |
|
For Stage 1 of FieldWorks Language Explorer, we use a special notation to indicate a partial reduplication pattern.[40] The idea is to list a sequence of specially marked natural class names. The special marking consists of the following:
| (105) |
|
Suppose we have a natural class for consonants with an abbreviation of C and one for vowels abbreviated as V. Then the reduplication patterns for our Tagalog and Orizaba Nahuatl reduplication examples above in (101) and (102)‑(103) would be as in (106).
| (106) |
|
For the Orizaba Nahuatl case, notice the use of the h (the fixed segmental material) in the allomorph pattern. It is not included in the environment pattern for the simple reason that the h does not show up in the environment.
Note that if a language has a CVC reduplication pattern, then one would want to use a pattern of [C^1][V^1][C^2], where the distinct indices on the consonant natural classes make it clear that they can be different.
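The following sketch shows one way patterns such as [C^1][V^1] or [C^1][V^1]h could be interpreted. It is a simplification: it folds the allomorph pattern and its environment into a single step by binding each indexed class to successive stem-initial segments. The `CLASSES` table, the `reduplicant` function, and the stem `paka` are assumptions for illustration.

```python
import re

# Hypothetical natural classes.
CLASSES = {"C": set("bcdfghjklmnpqrstvwxyz"), "V": set("aeiou")}

# An indexed class such as [C^1], or a single fixed segment such as h.
TOKEN = re.compile(r"\[(\w+)\^(\d+)\]|(\w)")


def reduplicant(pattern, stem):
    """Build the reduplicated material for `stem`, or None if the stem does not fit."""
    copied = {}  # (class, index) -> segment copied from the stem
    pos = 0      # next stem segment available for binding
    out = []
    for cls, idx, literal in TOKEN.findall(pattern):
        if literal:
            out.append(literal)  # fixed segmental material
            continue
        key = (cls, idx)
        if key not in copied:
            if pos >= len(stem) or stem[pos] not in CLASSES[cls]:
                return None
            copied[key] = stem[pos]
            pos += 1
        out.append(copied[key])
    return "".join(out)


print(reduplicant("[C^1][V^1]", "sulat") + "sulat")    # CV copy
print(reduplicant("[C^1][V^1]h", "paka") + "paka")     # CV copy plus fixed h
print(reduplicant("[C^1][V^1][C^2]", "sulat"))         # CVC copy
```

A vowel-initial stem returns `None` for these patterns, since its first segment is not in the consonant class.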
We now address another issue from section 1.1: infixation. We repeat here the Tagalog data from example (8) in section 1.1.6.
| (107) |
|
Recall that there are two focus morphemes here, ‑um‑ and ‑in‑, both of which are infixes.
How does one create such infixes in Stage 1 of FieldWorks Language Explorer?
| (108) |
|
Infix environments describe the location within the sequence of characters where the infix is to go.[41] For example, in (107), it would be within sulat between the initial s and ulat. The environment would then be / # [C] _ [V] where # indicates the beginning of the sequence within the stem, [C] is the natural class of consonants and [V] is the natural class of vowels.
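A minimal sketch of what the environment / # [C] _ [V] does for an infix is given below. The consonant and vowel sets and the function name `insert_infix` are assumptions for illustration; the sketch handles only this one environment.

```python
# Hypothetical natural classes.
CONSONANTS = set("bdghklmnprstwy")
VOWELS = set("aeiou")


def insert_infix(stem, infix):
    """Place `infix` where / # [C] _ [V] holds: after a stem-initial
    consonant and before a following vowel. Return None if it does not hold."""
    if len(stem) >= 2 and stem[0] in CONSONANTS and stem[1] in VOWELS:
        return stem[0] + infix + stem[1:]
    return None


print(insert_infix("sulat", "um"))
print(insert_infix("sulat", "in"))
```

For stems that do not begin with a consonant followed by a vowel, this sketch simply reports that the environment is not met.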
In section 1.1.8 we noted the Silt'i data repeated here from (10):[6]
| (109) |
|
We noted that such Semitic languages have roots composed of three consonants, as exemplified in the Silt'i data in (109), where ‘buy’ is the root wkb. The aspect markers are composed of vowel patterns that fit between or around the root consonants, such as the a-a vowel pattern indicating the perfective aspect shown in (109).
How does one model this in Stage 1 of FieldWorks Language Explorer? The basic idea is to treat each vowel as an infix.
| (110) |
|
| (111) |
|
| (112) |
|
In section 1.1.4 we noted the Caquinte data repeated here from (6).[5]
| (113) |
|
Recall that this is an instance of epenthesis. Many languages have certain syllable well-formedness constraints that require the insertion of either a vowel or a consonant to preserve syllable structure (see Itô 1989 for an interesting discussion). In the data above it is a consonant t.
How can one model such epenthetic segments within Stage 1 of FieldWorks Language Explorer? There are at least two ways:
| (114) |
|
The advisability of the use of the first method is debatable. If the epenthetic segment is rather common, then one might want to model it as a pseudo-morpheme. Such an approach allows you to use the output of FieldWorks Language Explorer to explore where it occurs and perhaps glean some insights about its true nature. The second approach captures the fact that the epenthetic segment has no meaning whatsoever (unlike a true morpheme, which one would expect to have meaning), but it misses the generalization that the presence of the segment is due to syllabification considerations, and it adds otherwise unnecessary allomorphs to many dictionary entries. Stage 1 of FieldWorks Language Explorer does not model syllables.
Another morphophonemic issue we noted in section 1 was metathesis. We repeat the Caquinte word in (11) given in section 1.1.9.
| (115) |
|
Recall that the h and a in the final two morphemes switch positions.
How does one model such metathesis processes in Stage 1 of FieldWorks Language Explorer? Since Stage 1 does not have any way to model processes, one must use allomorphy. For data like that in Caquinte, one would do the following:
| (116) |
|
Recall that in section 1.1.10 we noted some other Caquinte data in (12) repeated here (you do not need to understand all the morpheme glosses here; just concentrate on the initial subject prefixes):
| (117) |
|
What is the issue with the subject prefixes? In (117a) we see that the first person inclusive subject marker is a‑, and in (117b) the third person feminine subject marker is o‑. Yet, in (117c), the gloss shows ambiguity between ‘we’ and ‘she’ as the subject, and both of these are represented as null. This is because both subject prefixes are vowels and the stem in (117c) is vowel-initial, yielding two vowels together. Recall from (113) that Caquinte generally does not allow vowel clusters, and therefore adds an epenthetic ‑t‑ when necessary to avoid such clusters. It turns out that epenthesis is only used in the suffixes. Within the prefixes, the initial vowel of a cluster deletes, causing the ambiguity seen in (117c).
How does one model such allomorphy in Stage 1 of FieldWorks Language Explorer?
| (118) |
|
When one uses a morphological parser, it is not unusual for the parser to sometimes return a parse that is simply incorrect. These are sometimes due to allomorphs matching in places one would not have expected them to match. When one has used all the mechanisms provided by the parser to the best of one's ability and such incorrect parses continue to surface, one may well wish for some kind of mechanism to rule them out. FieldWorks Language Explorer provides “Ad hoc Co-prohibitions” for such situations. Note that it may well be the case that later stages of FieldWorks Language Explorer will provide more well-motivated means to rule out these infelicitous parses, but for now, these ad hoc solutions may have to do.
There are two main types of ad hoc co-prohibitions: morpheme-oriented ones and allomorph-oriented ones. This section deals with allomorph-oriented ones (see section 2.4 for morpheme-oriented ones). The basic idea is to list a key allomorph and then to list one or more other allomorphs that cannot co-occur with the key one. One can constrain these other allomorphs to never occur in one of the following ways with respect to the key allomorph:
| (119) |
|
Note that when there are two or more allomorphs listed for “other allomorphs,” their relative order is significant. They should be listed in the same linear order they have in a word.
The English plurals in (120) show some cases where we might choose to use an allomorph ad hoc co-prohibition for Stage 1 of FieldWorks Language Explorer.[42]
The exceptional case, of course, is the ‑en allomorph (there are other exceptional plurals in English, but this one will do for our example here). Suppose you have these allomorphs in your dictionary and that you also have the noun molt as well as the verb molt in your dictionary. Then the word form molten would be parsed at least two ways as shown in (121).
The parse in (121b), of course, is incorrect. To rule out this incorrect parse, one could create an allomorph ad hoc co-prohibition for the ‑en allomorph of the plural with the molt allomorph of the noun molt.
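The effect of such a co-prohibition can be pictured as a filter over candidate parses. The following Python sketch is only an illustration and makes several assumptions that are not taken from FieldWorks Language Explorer itself: a parse is represented as a tuple of (morpheme, allomorph) pairs, the class name `CoProhibition` is hypothetical, the gloss `PTCP` for the verbal ‑en is assumed, and the only relation modeled is "never co-occur in the same word, with the other allomorphs in the listed order."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoProhibition:
    """Hypothetical model of an allomorph ad hoc co-prohibition."""
    key: tuple      # (morpheme, allomorph) pair for the key allomorph
    others: tuple   # (morpheme, allomorph) pairs, in their linear order

    def rules_out(self, parse):
        # The key allomorph must be present for the prohibition to apply.
        if self.key not in parse:
            return False
        # The other allomorphs must all occur, in the listed order.
        remaining = iter(parse)
        return all(other in remaining for other in self.others)


# Two candidate parses of "molten" (glosses are assumptions for illustration).
verb_parse = (("molt (verb)", "molt"), ("PTCP", "en"))
noun_parse = (("molt (noun)", "molt"), ("PL", "en"))

rule = CoProhibition(key=("PL", "en"), others=(("molt (noun)", "molt"),))

surviving = [p for p in (verb_parse, noun_parse) if not rule.rules_out(p)]
```

Because each allomorph is identified together with its morpheme, the identically shaped ‑en of the verbal parse is not affected; only the parse that uses the plural ‑en with the noun molt is filtered out.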
How does one create an allomorph-oriented ad hoc co-prohibition in FieldWorks Language Explorer?
(122)
By the way, when you are indicating the allomorph, be sure that the particular allomorph is for the correct morpheme, too. FieldWorks Language Explorer maintains a distinction between identically shaped allomorphs; only those for the particular morpheme will actually be constrained.
Occasionally one finds a situation where a set of ad hoc constraints has a common theme. Perhaps they all relate to a particular allomorph or to particular allomorphs of a certain variety. This may be a hint as to what is really happening and may lead you to discover a linguistically-motivated way to model them. Or it could be that the FieldWorks Language Explorer model just does not happen to provide the appropriate linguistic mechanism to model the phenomenon correctly.
One can group such ad hoc co-prohibitions together. How does one create such a group?
(123)
Finally, note that FieldWorks Language Explorer allows one to group both allomorph and morpheme ad hoc co-prohibitions together. Please be sure to only do so if these co-prohibitions truly do have something in common.
This section lists a few items that one should keep in mind while adding lexical entries.
There are two things to keep in mind while keying allomorphs (or forms).
Generally speaking, one wants to avoid having null allomorphs if for no other reason than that they can make the parser run rather slowly. If having a null allomorph is indeed the best analysis, then please keep the following in mind:
(124)
In Stage 1 of FieldWorks Language Explorer, the order of allomorphs is quite significant. Consider the following English data.
Under one possible analysis, we can say that the allomorphs for the English plural are:
If we have a natural class for stridents and one for voiced segments (including stridents) and create two environments (one for “after stridents” and one for “after voiced segments”), then we can order and condition the allomorphs as follows:
Because of the ordering and the fact that the first two are conditioned, the third (elsewhere) case will automatically be constrained to not occur after stridents as well as to not occur after voiced segments. The second allomorph will be conditioned to not only occur after voiced segments, but also to not occur after stridents.
Do you see how it works? For a given allomorph, FieldWorks Language Explorer applies the condition of this allomorph and, at the same time, negates the conditions of all preceding allomorphs. This is why the ordering of allomorphs is crucial.
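The ordering logic just described can be sketched in a few lines of Python. This is an illustration only, not the parser's actual implementation. The allomorph shapes (əz, z, s), the simplified phonemic transcriptions, the membership of the two natural classes, and the function names are all assumptions made for the example.

```python
# Assumed natural classes over a simplified phonemic alphabet.
STRIDENTS = set("szʃʒ")
VOICED = set("bdgvzʒmnlrwjaeiouəæɔʌʊɪ")  # includes the voiced stridents


def after_strident(stem):
    return stem[-1] in STRIDENTS


def after_voiced(stem):
    return stem[-1] in VOICED


# Ordered allomorphs of the plural; None marks the elsewhere case.
PLURAL = [("əz", after_strident), ("z", after_voiced), ("s", None)]


def may_attach(index, stem):
    """Apply this allomorph's own condition and negate all preceding ones."""
    _, condition = PLURAL[index]
    own_ok = condition is None or condition(stem)
    earlier_met = any(c(stem) for _, c in PLURAL[:index] if c is not None)
    return own_ok and not earlier_met


def plural_forms(stem):
    return [stem + form
            for i, (form, _) in enumerate(PLURAL)
            if may_attach(i, stem)]
```

For a stem ending in z, both conditioned environments are satisfied, yet only the first allomorph is licensed because the second one carries the negation of the first one's environment. Swapping the first two entries of `PLURAL` would wrongly license z after voiced stridents, which is why the order matters.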
Morpheme types are things like “root,” “prefix,” “clitic,” etc. Stage 1 of FieldWorks Language Explorer keys on certain ones of these in order to tell the parser how to handle the particular allomorph. The types in the following list are significant to the parser.[45]
One should keep this in mind when applying a type to an allomorph.
In some languages, there is a special class of affixes. The segmental material represented by these affixes appears at both ends of the stem at the same time. It is as if there are two parts of such an affix: one part is typically a prefix and the other part is a suffix. These are called circumfixes. Consider the following data from Bahasa Indonesia.
(126)
The peng‑ prefix and the ‑an suffix act together to form a single morpheme even though they are on opposite ends of the ketik stem.[46] Another way of looking at this is to say that the nominalizer (NMLZR) morpheme is realized by a circumfix whose left member is the peng‑ prefix and whose right member is the ‑an suffix.
How does one create such circumfixes in Stage 1 of FieldWorks Language Explorer?
(127)
The following is a screen shot showing how the Bahasa Indonesia circumfix in (126) above might be keyed.
Note the following about this entry:
When one keys a circumfix in this manner, the FieldWorks Language Explorer parser will require both the left and right member affixes to appear simultaneously for them to be parsed as an instance of this entry. Circumfixes may be classified as derivational, inflectional, or as unclassified. The FieldWorks Language Explorer parser handles all three varieties correctly.
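The requirement that both members be present at once can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions: the function name `match_circumfix` is hypothetical, matching is done on plain strings, and the word form pengetikan is assumed as an example built on ketik. It does not reflect how the parser is actually implemented.

```python
from typing import Optional


def match_circumfix(word: str, left: str, right: str) -> Optional[str]:
    """Return the material between the two members, or None if either
    member is missing or nothing would be left for the stem."""
    has_room = len(word) > len(left) + len(right)
    if has_room and word.startswith(left) and word.endswith(right):
        return word[len(left):len(word) - len(right)]
    return None


# Both members present: the enclosed material is returned.
inner = match_circumfix("pengetikan", "peng", "an")

# Only one member present: no circumfix analysis.
left_only = match_circumfix("pengetik", "peng", "an")
right_only = match_circumfix("ketikan", "peng", "an")
```

In this sketch the enclosed material for pengetikan is etik rather than ketik, so a full analysis would also need a stem allomorph without the initial consonant, in line with the fusion process mentioned in the footnote on these data.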
While it is possible to have two or more glosses somewhere in your lexicon spelled exactly the same way, we do not recommend that you do so on purpose. Having two or more morphemes with the same gloss will not confuse FieldWorks Language Explorer because FieldWorks Language Explorer uniquely identifies every gloss internally. You or a reader of your glossed texts, however, may well be confused as a result.
The FieldWorks Language Explorer approach has been purposely designed to allow you to incrementally build up the morphological description piece by piece (with one exception; see 5.1 below). For example, you can add inflectional templates as you discover them. It is not the case that once you start to use inflectional templates, you must define inflectional templates for all categories at once. You can define them one by one if you need to or all at once (if you happen to already know what they are).
The exception to this general case is compound rules. Once you define your first compound rule, the FieldWorks Language Explorer parser will then only allow compounds for which there are rules. In particular, this means that you may have a number of word forms that will suddenly fail to analyze once you write your first compound rule. To get them to analyze, you will need to define appropriate compound rules for them. We wish we could allow the discovery and development of compound rules to also be incremental, but we have not figured out how to do it.
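The all-or-nothing behavior of compound rules can be summarized with a small Python sketch. It is an illustration only: the function name, the representation of a rule as a pair of category labels, and the labels "N" and "V" are assumptions, not the parser's actual data structures.

```python
def compound_allowed(left_cat, right_cat, rules):
    """Sketch of the behavior described above.

    With no compound rules defined, compounds are not restricted by rules.
    Once at least one rule exists, only compounds matching a rule pass.
    """
    if not rules:
        return True
    return (left_cat, right_cat) in rules


no_rules = set()
one_rule = {("N", "N")}  # hypothetical noun-noun compound rule

before = compound_allowed("V", "N", no_rules)       # analyzed before any rule exists
after = compound_allowed("V", "N", one_rule)        # fails once the first rule is written
still_ok = compound_allowed("N", "N", one_rule)     # covered by the rule
```

A verb-noun compound that analyzed before any rule existed stops analyzing as soon as the noun-noun rule is added, until a rule covering it is defined as well.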
[1] The basic features of each stage are outlined in the following chart (note that if we have time, we might include some items listed under Stage 2 in Stage 1):
[2] Data are from Tuggy (1991). The abbreviations used in the Nahuatl data are:
[3] These are taken from Spencer (1991:9).
[4] The data are from Weber, Black, and McConnel (1988:8). See also Weber (1989). The abbreviations used in the Quechua form are:
[5] All Caquinte data are from Ken Swift, p.c. and Swift (1988). The abbreviations used in the Caquinte forms are:
[6] The data are from Gardner (1994). The abbreviations used in the Silt‘i form are:
[7] This metathesis process is actually optional. The word is from Swift (1988:133).
[8] Another thing a parser could produce would be the actual word structure which could be shown via a tree diagram. While the FieldWorks Language Explorer parser actually produces such a structure, we do not plan to make it visible in Stage 1.
[9] We use the term “allomorph” here as a cover term for any form in a lexical entry.
[10] For more on this, see Bickford (1998:113ff).
[11] Occam's Razor states “one should not increase, beyond what is necessary, the number of entities required to explain anything.” See Principia Cybernetica Web (1997) for more detail.
[12] In addition, the form timikih could parse as 2SgSubj‑to.die‑Plural. This, too, is incorrect. If we used the Morphosyntactic Glossing Assistant tool to create the glosses, then this parse would not appear: the subject number agreement feature would have a value of ‘singular’ which would conflict with the number agreement feature value of the suffix; namely ‘plural.’ Stage 1, however, does not have any way to indicate default features for a category (e.g. marking ‘singular’ as the default) in order to prevent the form timiki from parsing as 1PlSubj‑to.die.
[13] The data are from López y Newberg (1990). The abbreviations used in the Yalálag Zapotec data are:
The orthography used here is slightly different from what is used in López y Newberg (1990). In particular, fortis consonants are preceded by a colon (:). Lenis consonants are not (and use the voiceless equivalent instead of the voiced one).
[14] Also note that the difference in future allomorphy is not due to transitivity.
[15] The data are from http://www.thelatinlibrary.com/decl.html and http://www.slu.edu/colleges/AS/languages/classical/latin/tchmat/grammar/decl-c.html.
[16] Of course, one would want to model the full nominal paradigm if one were working on Latin, but this limited usage here illustrates the point about letting a given allomorph refer to more than one inflection class.
[17] We recommend using only two types: “Agreement” for agreement features and “Inflection” for all others.
[18] In beta version 0.8 of FieldWorks Language Explorer, go to the Grammar area, Features tool. Insert a new feature (either feature or complex feature - it does not matter, both call up the catalog).
[19] The data are from Inkelas (2001).[20] The abbreviations used in the Turkish data are:
[20] (I wish I had access to a more standard Turkish grammar to get examples, but this is the best I could find on the net. I also changed the glosses of two items per my Turkish Ample files which were based on Underhill's grammar.)
[21] The data are from López y Newberg (1990). The abbreviations used in the Yalálag Zapotec data are:
The orthography used here is slightly different from what is used in López y Newberg (1990). In particular, fortis consonants are preceded by a colon (:). Lenis consonants are not (and use the voiceless equivalent instead of the voiced one).
[22] Data are from Austin, Kalstrom, and Hernández (1995). The abbreviations used in the Atzingo Popoloca data are:
[23] This data is taken from Velásquez (1974:16).
[24] The way we have implemented these in FieldWorks Language Explorer is to create a separate list for these objects. Technically, these are not true features. Internally, we are calling these “productivity restrictions” because they restrict the productivity of an affix. Another way of looking at them is as restricting the distribution of an affix.
[25] Compounds involving more than one orthographic word (e.g. student film society) are not dealt with here since they are properly outside the realm of morphology.
[26] The data are taken from Tuggy (1991:76-77). The abbreviations used in the Nahuatl data are:
[27] The data are from López y Newberg (1990). The abbreviations used in the Yalálag Zapotec data are:
The orthography used here is slightly different from what is used in López y Newberg (1990). In particular, fortis consonants are preceded by a colon (:). Lenis consonants are not (and use the voiceless equivalent instead of the voiced one).
[28] Data are from Tuggy (1991:77-8). The abbreviations used in the Nahuatl data are:
[29] This data is from Rick Floyd, p.c. The gloss of 3P is for “third person possessive.”
[30] Data are from Black (1992). The abbreviations used in the Shipibo data are:
[31] One approach to this is to strive to make the tightest constraint possible (i.e. use one of the adjacency ways first if possible; if not, then try the somewhere case; if that does not work, then try the anywhere case). That way, should you encounter another case involving these particular morphemes, then you will now know more: it is now clear that you need looser constraints. You can then add some comments/annotations to document what you have learned (or put the information in the description).
[32] One main reason why Stage 1 does not allow for phonological rules is that we could then use a modified form of an existing SIL tool (AMPLE) and not have to spend any time building a special phonological processor. We have done some research and it appears that the Xerox Parser will serve our purposes for a tool that allows a more generative style approach. Time will tell. Time will also tell whether there will be a cost per seat involved in using the Xerox Parser, something else we have been looking into and wishing to avoid.
[33] This is another reason why you should use unique abbreviations for natural classes. If you have two or more natural classes with the same abbreviation, it is not clear which one you mean. FieldWorks Language Explorer will automatically select one, but it may not be the one you intended.
[34] The data are from Howard Sheldon, p.c. and Jonathan Coombs, p.c.
[35] It is also called total reduplication and sometimes general reduplication.
[36] There is a technical reason for this. The parser matches the entire rest of the word (for a prefix) or the entire beginning of the word (for a suffix). It cannot match if there is additional material.
[37] This is the same notation as used in Shoebox and Toolbox. AMPLE uses <...>.
[38] Thus, it would be keyed as -[...] where we would put the hyphen before the indicator because the hyphen would be part of the suffix. You can put anything before or after the indicator. For example, if you used t[...]- and made it be a prefix, then this would match a full reduplication morpheme in a form such as tabrak-menabrak ‘keep on running into,’ where we would model the men as an infix and the abrak would be the truncated allomorph of the stem tabrak ‘to.collide.’
[39] Data are from Tuggy (1991:41).
[40] It is the same notation as used in AMPLE.
[41] The notation used for these infix environments is the same notation as used for infixes in AMPLE.
[42] Admittedly, this is not the greatest example. One could use inflection classes for these or, perhaps better, one could merely use an environment to constrain the exceptional allomorphs for the roots to which they attach.
[43] One approach to this is to strive to make the tightest constraint possible (i.e. use one of the adjacency ways first if possible; if not, then try the somewhere case; if that does not work, then try the anywhere case). That way, should you encounter another case involving these particular allomorphs, then you will now know more: it is now clear that you need looser constraints. You can then add some comments/annotations to document what you have learned (or put the information in the description).
[44] The empty set character is Unicode hex code 2205.
[45] Another way of saying this is that the parser recognizes all morpheme types except for simulfix, suprafix, and circumfix. For circumfix, however, see section 4.3.
[46] There is a process that some call “bidirectional partial fusion” whereby the final nasal of the prefix portion assimilates to the point of articulation of the initial consonant of the stem and then this consonant deletes (or one could view it as the nasal merging with the consonant).
Aronoff, Mark. 1994. Morphology by Itself. Linguistic Inquiry Monograph Twenty-Two. The MIT Press. Cambridge, Massachusetts.
Austin Krumholz, Jeanne, Marjorie Kalstrom Dolson, and Miguel Hernández Ayuso. 1995. Diccionario popoloca de San Juan Atzingo Puebla. Instituto Lingüístico de Verano, A.C. Tucson, AZ.
Bickford, J. Albert. 1998. Tools for Analyzing the World's Languages. The Summer Institute of Linguistics. Dallas.
Black, H. Andrew. 1992. “South American Verb Second Phenomena: Evidence from Shipibo.” Syntax at Santa Cruz 1:35-63.
Gardner, Simon. 1994. “A Problem in Boundary Morphophonemics for Computer Analysis.” Notes on Computing 13.6:44-48.
Hockett, Charles. 1954. “Two models of grammatical description.” Word 10:210-231.
Inkelas, Sharon. 2001. “Derivational Morphology Handout.” (http://ist-socrates.berkeley.edu/~aclyu/ling115/handout07.pdf).
Itô, Junko. 1989. “A prosodic theory of epenthesis.” Natural Language and Linguistic Theory 7:217-259.
López L., Filemón y Ronaldo Newberg Y. 1990. La Conjugación del Verbo Zapoteco; Zapoteco de Yalálag. Instituto Lingüístico de Verano, A.C. México, D.F.
Principia Cybernetica Web. 1997. “Occam's Razor.” (http://pespmc1.vub.ac.be/OCCAMRAZ.html).
Spencer, Andrew. 1991. Morphological Theory. Basil Blackwell. Cambridge.
Swift, Kenneth. 1988. Morfología del Caquinte. Serie Lingüística Peruana, No. 25. Instituto Lingüístico de Verano. Yarinacocha, Perú.
Tuggy T., David. 1991. Curso del Nájuatl Moderno. Universidad de las Américas. Puebla, México.
Velásquez de la Cadena, Marciano, Edward Gray, Juan L. Iriba, Ida Navarro Hinojosa, Manuel Blanco-González, and Richard John Wiezell. 1974. New Revised Velásquez Spanish and English Dictionary. Follett Publishing Company. Chicago.
Weber, David John. 1989. A Grammar of Huallaga (Huánuco) Quechua. Linguistics Volume 112. University of California Press. Berkeley.
Weber, David J., H. Andrew Black, and Stephen R. McConnel. 1988. AMPLE: A Tool for Exploring Morphology. Occasional Publications in Academic Computing No. 12. Summer Institute of Linguistics. Dallas, Texas.