Personal Exploration: word-similarity identification on word spaces

Without any restrictions beyond phonotactics and morphology, gismu space is extremely large. It would be the set of all words which are exactly of the form 'CCVCV' or 'CVC/CV'. 'V' can be any of five different options (nV = 5); a single 'C' can be any of seventeen different options (nC = 17); 'CC' (permissible initial consonant pairs) can be any of forty-eight options (nCC = 48).

'C/C' (generic permissible consonant clusters) is a bit more involved to calculate. At present, there are ten voiced consonants and eleven unvoiced consonants (where the cardinality of the intersection of these two sets is four). Consonants can never be adjacent if they differ in voicing, so we must count permissible pairs within each of these subsets. No pair of adjacent consonants can consist of a single lerfu being repeated ("cc" is prohibited), so if we are drawing lerfu from a bag, then no replacement is allowed; order does matter ("sl" is distinct from "ls"); thus we need to calculate a permutation. Because the two subsets overlap, every ordered pair drawn entirely from the four shared consonants appears in both counts and must be subtracted once; there are Permute(4,2) such pairs. Specifically, the number of generic permissible consonant pairs is given by nC/C = Permute(11,2) + Permute(10,2) - Permute(4,2) - k, for some k which I will describe shortly. Therefore, nC/C = 11!/((11-2)!) + 10!/((10-2)!) - 4!/((4-2)!) - k = 11*10 + 10*9 - 4*3 - k = 110 + 90 - 12 - k = 188 - k. There is a set K of specially forbidden consonant pairs; it is defined explicitly as K = {"cs", "sc", "jz", "zj", "cx", "kx", "xc", "xk", "mz"}; we define k = |K| = 9. Therefore, nC/C = 188 - 9 = 179.

We can calculate the number of 'CCVCV' potential-gismu by evaluating the product N1 = nCC nV nC nV = 48*5*17*5 = 20400; likewise, the number of 'CVC/CV' potential-gismu is N2 = nC nV nC/C nV = 17*5*179*5 = 76075. Call the set of all strings of form 'CCVCV' or form 'CVC/CV' which is subject to only phonotactic restrictions "naïve gismu space"; it is the set of all unformed/primordial prototype potential-gismu strings (which are 'words' in the concept of this space).
Thus, naïve gismu space has N = N1 + N2 = 20400 + 76075 = 96475 distinct words in it (not all of them actualized).
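
The count of C/C pairs can also be checked by enumerating ordered pairs directly rather than by formula. The following is a minimal sketch under stated assumptions: the text gives only the sizes of the voicing sets, so the specific memberships below (six voiced obstruents, seven unvoiced obstruents, and l, m, n, r as the four consonants compatible with either voicing) are my assumption, and all names are hypothetical. The value nCC = 48 is taken from the text as a constant, since the list of initial pairs is not reproduced here. Enumeration counts each ordered pair exactly once, so pairs made of two of the shared consonants are not counted twice.

```python
from itertools import permutations

VOWELS = "aeiou"
CONSONANTS = "bcdfgjklmnprstvxz"

# Assumed memberships (the text states only the set sizes: 10, 11, overlap 4).
SONORANTS = set("lmnr")    # treated as compatible with either voicing
VOICED = set("bdgjvz")     # voiced obstruents
UNVOICED = set("cfkpstx")  # unvoiced obstruents

# The specially forbidden pairs K, as listed in the text.
FORBIDDEN = {"cs", "sc", "jz", "zj", "cx", "kx", "xc", "xk", "mz"}

N_CC = 48  # permissible initial pairs; value taken from the text


def voicing_compatible(a, b):
    """True if a and b may be adjacent as far as voicing is concerned."""
    if a in SONORANTS or b in SONORANTS:
        return True
    return (a in VOICED) == (b in VOICED)


def medial_pairs():
    """All ordered pairs of distinct consonants allowed as 'C/C'."""
    return [
        a + b
        for a, b in permutations(CONSONANTS, 2)
        if voicing_compatible(a, b) and a + b not in FORBIDDEN
    ]


n_v = len(VOWELS)
n_c = len(CONSONANTS)
n_c_c = len(medial_pairs())

n1 = N_CC * n_v * n_c * n_v   # words of form CCVCV
n2 = n_c * n_v * n_c_c * n_v  # words of form CVC/CV

print("nC/C =", n_c_c)
print("N1 =", n1, "N2 =", n2, "N =", n1 + n2)
```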

But it turns out that actual gismu space has structure additional to that of naïve gismu space; it is a subspace of the latter which is endowed with additional properties - and restrictions. The main such property is that of a word-similarity identification between its members, which shall be explored shortly.

To be clear before we begin, though: actual gismu space (which will just be called "gismu space" hereinafter whensoever is possible without confusion) is partitioned into two sets, the first being of all gismu which are actualized (documented somewhere or accepted as official by some standard) and the second being of all possible strings which belong to naïve gismu space and which are neither actualized nor in similarity conflict with actualized gismu.

The original gismu creation process involved two primary steps. The first was an explicit listing of every cultural or handpicked gismu which was to be actualized. The second was the original gismu creation algorithm, which knew of the list generated by the first step. By design, no two words generated by the end of the second step (including those from the first step) were pronounced too similarly in a specific sense. Basically, consonants were put into classes, and if two strings differed by exactly one exchange of a lerfu with another lerfu in the same class, then the words were too similar (it is actually slightly more complicated than this, but this simplification preserves the idea and is really helpful). So, for example, "b" and "p" belong to such a class; thus, since "broda" and "proda" differ only in their respective first lerfu and their first lerfu belong to the same class, these words are too similar. There was another restrictive similarity: two proposed gismu strings could not differ only in their final vowel (in a sense, unstressed vowels belonged to another mutually shared class exclusive of all other sounds; stressed vowels each got their own singleton classes, though). If the algorithm proposed a word which was too similar in any of these ways to a word which was already actualized, then the proposed string was rejected and was said to have a (gismu) similarity conflict (gimkamsmikezypro) with the actualized gismu.
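
The simplified conflict test described above can be sketched in a few lines. This is only an illustration of the simplification, not the original algorithm: the function names are hypothetical, and the class table is deliberately incomplete, containing only the one class ("b" and "p") that the paragraph names.

```python
# Illustrative and incomplete: only the class named in the text is listed.
CONSONANT_CLASSES = [{"b", "p"}]


def same_class(a, b):
    """True if two distinct lerfu share one of the listed classes."""
    return any(a in cls and b in cls for cls in CONSONANT_CLASSES)


def too_similar(w1, w2):
    """Simplified similarity conflict between two five-letter strings.

    Two words conflict if they differ in exactly one position and either
    that position is the final (unstressed) vowel, or the two differing
    lerfu belong to the same class.
    """
    if len(w1) != len(w2):
        return False
    diffs = [i for i, (a, b) in enumerate(zip(w1, w2)) if a != b]
    if len(diffs) != 1:
        return False  # identical strings or more than one difference
    i = diffs[0]
    if i == len(w1) - 1:
        return True  # the words differ only in their final vowel
    return same_class(w1[i], w2[i])


print(too_similar("broda", "proda"))  # first lerfu are in the same class
print(too_similar("broda", "brode"))  # only the final vowel differs
print(too_similar("broda", "brida"))  # stressed vowels are not shared
```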

We should now cover the complication which I just glossed over. It turns out that the set of consonants is not partitioned into classes. A candidate lerfu is compared to certain other lerfu (at most two, as it turns out), but in a nontransitive way; this nontransitive lerfu checking relation will be denoted by "R*". For example, if the proposed lerfu is "b", then it is checked against "p" and "v"; meanwhile, if the proposed lerfu is "p", then it is checked against "b" and "f". Notice that "b" and "p" are both checked against one another, but "b" is not checked against "f" (nor vice versa, as it turns out) and "p" is not checked against "v" (nor vice versa, as it turns out). In other words, a similarity in voicing, manner of articulation (plosive or fricative for most consonants), or place of articulation is enough to produce a conflict, but a difference in both voicing and place of articulation will not produce a conflict (even if two lerfu have a single common 'bridging' lerfu, which differs from one in one of these qualities and differs from the other in a different quality). The lerfu checking relation is reflexive (proposed "b" is checked against "b") and symmetric (if proposed "b" is checked against "p", then proposed "p" is checked against "b"), but it is not transitive (if it were, then since proposed "b" is checked against "p" and proposed "p" is checked against "f", proposed "b" would be checked against "f"; but this is not the case). In order to place these lerfu all into an equivalence class as I did in my simplification, the property of transitivity must be provided; in other words, we would need to produce a new relation R which is the transitive closure of the lerfu checking relation. Under R, there is an equivalence class of lerfu (for example: {b, p, v, f}); this dramatically reduces the complexity of an assessment, and R (with its generated equivalence classes) is what I assumed in the previous paragraph for the sake of simplicity.
Note that if an R*-class of lerfu has cardinality 2 or less, then it is trivially transitive and so is already an equivalence class (and is equal to the analogous R-class). Moreover, if the R*-class has cardinality 3 and two of its elements are each R*-checked against both of the other elements, then by symmetry the remaining element is also R*-checked against both of the others, and thus the R*-class is an equivalence class which is equal to the analogous R-class. In other words, if there are only three lerfu in the class and each pair of these lerfu is mutually similar in at least one (and, of course, at most two) of the three possible qualities, then every element in the class is checked against every other. We also assume that all unstressed vowels belong to a single shared R-class and that each stressed vowel belongs solely to its own singleton R-class. If we use R, then we induce equivalence classes on gismu space (in which any words which differ by a single pair of R-similar lerfu are deemed to be similar); this identification space is the one which is herein considered, but we cannot build it if we are using only R* (and not R).
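
The passage from R* to R can be illustrated with a small union-find sketch. It uses only the R* checks stated above (b against p and v; p against b and f), so the pair list is a partial stand-in for the full checking table, and the function names are hypothetical. The sketch shows that "b" and "f" are not directly R*-related, yet fall into the same class once the transitive closure is taken.

```python
# Partial R* table: only the checks stated in the text (taken symmetrically).
R_STAR_PAIRS = [("b", "p"), ("b", "v"), ("p", "f")]


def r_star_related(a, b):
    """Direct R* check: reflexive and symmetric, but not transitive."""
    return a == b or (a, b) in R_STAR_PAIRS or (b, a) in R_STAR_PAIRS


def closure_classes(pairs):
    """Equivalence classes of R, the transitive closure of the given pairs."""
    parent = {}

    def find(x):
        # Follow parent links to the class representative, compressing paths.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return sorted(sorted(cls) for cls in classes.values())


print(r_star_related("b", "f"))       # no direct R* check between b and f
print(closure_classes(R_STAR_PAIRS))  # a single R-class containing b, f, p, v
```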