Personal Exploration: word-similarity identification on word spaces

From Lojban
Revision as of 06:39, 7 February 2017 by Krtisfranks

Without any restrictions beyond phonotactics and morphology, gismu space is extremely large. It would be the set of all words which are exactly of the form 'CCVCV' or 'CVC/CV'. 'V' can be any of five options (nV = 5); a single 'C' can be any of seventeen options (nC = 17); 'CC' (a permissible initial consonant pair) can be any of forty-eight options (nCC = 48).

'C/C' (a generic permissible consonant pair) is a bit more involved to calculate. At present, there are ten voiced consonants and eleven unvoiced consonants, where the two sets share four members: "l", "m", "n", and "r" are exempt from the voicing restriction and so are counted in both. Consonants can never be adjacent if they differ in voicing, so we must count permissible pairs within each of these two sets. No pair of adjacent consonants can consist of a single lerfu repeated ("cc" is prohibited), so if we are drawing lerfu from a bag, no replacement is allowed; order matters ("sl" is distinct from "ls"); thus we need permutations. A pair drawn entirely from the four shared consonants appears in both sets, so those Permute(4,2) pairs must be subtracted once to avoid counting them twice. Specifically, the number of generic permissible consonant pairs is given by nC/C = Permute(11,2) + Permute(10,2) - Permute(4,2) - k, for some k which I will describe shortly. Therefore, nC/C = 11!/(11-2)! + 10!/(10-2)! - 4!/(4-2)! - k = 11*10 + 10*9 - 4*3 - k = 110 + 90 - 12 - k = 188 - k. There is a set K of specially forbidden consonant pairs; it is defined explicitly as K = {"cs", "sc", "jz", "zj", "cx", "kx", "xc", "xk", "mz"}; we define k = |K| = 9. Therefore, nC/C = 188 - 9 = 179.

We can calculate the number of 'CCVCV' potential-gismu by evaluating the product N1 = nCC * nV * nC * nV = 48*5*17*5 = 20400; likewise, the number of 'CVC/CV' potential-gismu is N2 = nC * nV * nC/C * nV = 17*5*179*5 = 76075. Call the set of all strings of form 'CCVCV' or 'CVC/CV' which is subject to only phonotactic restrictions "naïve gismu space"; it is the set of all unformed/primordial prototype potential-gismu strings (which are 'words' in the context of this space). Thus, naïve gismu space has N = N1 + N2 = 20400 + 76075 = 96475 distinct words in it (not all of them actualized).
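
The counting argument can be cross-checked by brute-force enumeration. What follows is a minimal Python sketch, not part of the original exploration. It assumes the standard seventeen-consonant inventory, treats "l", "m", "n", and "r" as compatible with either voicing (so a pair made of two of them is counted exactly once), and takes the figure of 48 initial pairs as given rather than enumerating them. The names `same_voicing`, `permissible_pairs`, and `naive_gismu_counts` are illustrative only.

```python
from itertools import permutations

VOWELS = "aeiou"                  # nV = 5
CONSONANTS = "bcdfgjklmnprstvxz"  # nC = 17
VOICED = set("bdgjvz")            # voiced consonants other than l, m, n, r
EXEMPT = set("lmnr")              # assumption: compatible with either voicing
# The set K of specially forbidden pairs, as listed in the text.
FORBIDDEN = {"cs", "sc", "jz", "zj", "cx", "kx", "xc", "xk", "mz"}
N_INITIAL_PAIRS = 48              # nCC, taken from the text, not enumerated here


def same_voicing(a: str, b: str) -> bool:
    """True if a and b may be adjacent as far as voicing is concerned."""
    if a in EXEMPT or b in EXEMPT:
        return True
    return (a in VOICED) == (b in VOICED)


def permissible_pairs() -> set[str]:
    """All ordered pairs of distinct consonants allowed as 'C/C'."""
    return {
        a + b
        for a, b in permutations(CONSONANTS, 2)  # ordered, no repetition
        if same_voicing(a, b) and a + b not in FORBIDDEN
    }


def naive_gismu_counts() -> tuple[int, int, int]:
    """Return (N1, N2, N) for the 'CCVCV' and 'CVC/CV' shapes."""
    n_v, n_c = len(VOWELS), len(CONSONANTS)
    n_pairs = len(permissible_pairs())
    n1 = N_INITIAL_PAIRS * n_v * n_c * n_v  # CCVCV
    n2 = n_c * n_v * n_pairs * n_v          # CVC/CV
    return n1, n2, n1 + n2


if __name__ == "__main__":
    n1, n2, total = naive_gismu_counts()
    print("nC/C =", len(permissible_pairs()))
    print("N1 =", n1, "N2 =", n2, "N =", total)
```

Enumerating the pairs directly avoids the inclusion-exclusion bookkeeping altogether, since each ordered pair is generated once and then filtered, at the cost of hard-coding the consonant classes.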