Personal Exploration: word-similarity identification on word spaces

From Lojban

Revision as of 07:09, 7 February 2017

Without any restrictions beyond phonotactics and morphology, gismu space is extremely large. It would be the set of all words which are exactly of the form 'CCVCV' or 'CVC/CV'. 'V' can be any of five different options (n<sub>V</sub> = 5); a single 'C' can be any of seventeen different options (n<sub>C</sub> = 17); 'CC' (a permissible initial consonant pair) can be any of forty-eight options (n<sub>CC</sub> = 48). 'C/C' (a generic permissible consonant pair) is a bit more involved to calculate.

At present, there are ten voiced consonants and eleven unvoiced consonants, where the four sonorants "l", "m", "n" and "r" are counted in both sets (so the cardinality of the intersection is four, and 10 + 11 - 4 = 17). Consonants can never be adjacent if they differ in voicing, so we must calculate permissible pairs for each of these subsets; no pair of adjacent consonants can consist of a single lerfu being repeated ("cc" is prohibited), so if we are drawing lerfu from a bag, then no replacement would be allowed; order does matter ("sl" is distinct from "ls"); thus we need to calculate a permutation. Pairs drawn entirely from the intersection appear in both subsets, so they must be subtracted once to avoid counting them twice. Specifically, the number of generic permissible consonant pairs is given by n<sub>C/C</sub> = Permute(11,2) + Permute(10,2) - Permute(4,2) - k, for some k which I will describe shortly. Therefore, n<sub>C/C</sub> = 11!/((11-2)!) + 10!/((10-2)!) - 4!/((4-2)!) - k = 11*10 + 10*9 - 4*3 - k = 110 + 90 - 12 - k = 188 - k. There is a set K of specially forbidden consonant pairs; it is defined explicitly as K = {"cs", "sc", "jz", "zj", "cx", "kx", "xc", "xk", "mz"}; we define k = |K| = 9. Therefore, n<sub>C/C</sub> = 188 - 9 = 179.

We can calculate the number of 'CCVCV' potential-gismu by evaluating the product N<sub>1</sub> = n<sub>CC</sub> n<sub>V</sub> n<sub>C</sub> n<sub>V</sub> = 48*5*17*5 = 20400; likewise, the number of 'CVC/CV' potential-gismu is N<sub>2</sub> = n<sub>C</sub> n<sub>V</sub> n<sub>C/C</sub> n<sub>V</sub> = 17*5*179*5 = 76075.

Call the set of all strings of form 'CCVCV' or form 'CVC/CV' which is subject to only phonotactic restrictions "naïve gismu space"; it is the set of all unformed/primordial prototype potential-gismu strings (which are 'words' in the context of this space). Thus, naïve gismu space has N = N<sub>1</sub> + N<sub>2</sub> = 20400 + 76075 = 96475 distinct words in it (not all of them actualized).
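
The medial-pair count can be checked by brute-force enumeration. The following is a minimal Python sketch, not part of the original page: the variable and function names are my own, the split of the seventeen consonants into voiced, unvoiced and sonorant letters is the standard Lojban one, and the figure of 48 initial pairs is simply taken from the text rather than derived. Pairs are collected in a set, so a pair of two sonorants (which is compatible with both voicing groups) is counted only once.

```python
from itertools import permutations

VOWELS = "aeiou"
SONORANTS = set("lmnr")                  # compatible with either voicing
VOICED = set("bdgjvz") | SONORANTS       # 10 lerfu
UNVOICED = set("cfkpstx") | SONORANTS    # 11 lerfu
CONSONANTS = VOICED | UNVOICED           # 17 lerfu

# The set K of specially forbidden pairs, as listed in the text.
FORBIDDEN = {"cs", "sc", "jz", "zj", "cx", "kx", "xc", "xk", "mz"}


def medial_pairs():
    """Return all ordered pairs of distinct, voicing-compatible consonants,
    minus the specially forbidden ones."""
    pairs = set()
    for group in (VOICED, UNVOICED):
        for a, b in permutations(sorted(group), 2):
            pairs.add(a + b)             # set membership removes double counts
    return pairs - FORBIDDEN


N_CC = 48                                # initial pairs; taken from the text
n_v = len(VOWELS)
n_c = len(CONSONANTS)
n_medial = len(medial_pairs())

n1 = N_CC * n_v * n_c * n_v              # strings of form CCVCV
n2 = n_c * n_v * n_medial * n_v          # strings of form CVC/CV
total = n1 + n2

print(n_medial, n1, n2, total)
```

Under these assumptions the enumeration finds 179 medial pairs, which gives 20400 strings of form 'CCVCV', 76075 of form 'CVC/CV', and 96475 in total.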

Actual gismu space, however, has structure additional to that of naïve gismu space; it is a subset of the latter which is endowed with additional properties and restrictions. The main such property is a word-similarity identification between its members, which shall be explored shortly.

To be clear before we begin, though: actual gismu space (which will just be called "gismu space" hereinafter wherever this is possible without confusion) is partitioned into two sets. The first consists of all gismu which are actualized (documented somewhere or accepted as official by some standard). The second consists of all strings which belong to naïve gismu space and which are neither actualized nor in similarity conflict with any actualized gismu.

The original gismu creation process involved two primary steps. The first was an explicit listing of every cultural or handpicked gismu which was to be actualized. The second was the original gismu creation algorithm, which knew of the list generated by the first step. By design, no two words present at the end of the second step (including those from the first step) were pronounced too similarly in a specific sense. Basically, consonants were put into classes, and if two strings differed by exactly one exchange of a lerfu with another lerfu in the same class, then the words were too similar (it is actually slightly more complicated than this, but this simplification preserves the idea and is really helpful). So, for example, "b" and "p" belong to such a class; thus, since "broda" and "proda" differ only in their respective first lerfu and their first lerfu belong to the same class, these words are too similar. There was another restrictive similarity: two proposed gismu strings could not differ in only their final vowel (in a sense, unstressed vowels belonged to another mutually shared class exclusive of all other sounds; stressed vowels each got their own singleton classes though). If the algorithm proposed a string which was too similar in any of these ways to a word which was already actualized, then the proposed string was rejected, and it was said to have a (gismu) similarity conflict with the actualized gismu.
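
The simplified similarity test described above can be sketched in a few lines of Python. This is only an illustration of the simplification, not the original gismu creation algorithm: the function names are hypothetical, and of the consonant pairings in the table only "b"/"p" comes from the text, while "d"/"t" and "g"/"k" are placeholder assumptions added so that the table has more than one entry.

```python
# Pairs of consonants treated as belonging to the same similarity class.
# Only b/p is taken from the text; d/t and g/k are illustrative assumptions.
SIMILAR_PAIRS = {frozenset("bp"), frozenset("dt"), frozenset("gk")}


def too_similar(a, b):
    """Simplified test: two gismu-shaped strings are too similar if they
    differ in exactly one position and either that position is the final
    vowel or the two differing lerfu share a similarity class."""
    if len(a) != len(b) or a == b:
        return False                      # identical strings are not handled here
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        return False
    i = diffs[0]
    if i == len(a) - 1:
        return True                       # final lerfu of a gismu is its unstressed vowel
    return frozenset((a[i], b[i])) in SIMILAR_PAIRS


def conflicts(candidate, actualized):
    """Return the actualized gismu with which the candidate is in conflict."""
    return [g for g in actualized if too_similar(candidate, g)]


print(conflicts("proda", ["broda", "klama"]))
print(too_similar("broda", "brode"))
print(too_similar("broda", "bruda"))
```

Here "proda" conflicts with "broda" through the b/p class, "brode" conflicts with "broda" through the final-vowel rule, and "bruda" does not conflict because stressed vowels each form their own singleton class.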