corpora: Difference between revisions

From Lojban


== Prior usage and discussion ==
Here is some info that we hope will be useful to [[jbocre: baupla fuzykamni PFK|baupla fuzykamni PFK]] commissioners, and other people doing research on the Lojban language.


MAI is postfix; this was probably decided to make it analogous to mei, moi, roi, and re'u. However, this makes the grammar of Lojban non-LALR(1), because the parser may have to look through an arbitrarily long numeral string before it can decide that the string actually belongs in a free modifier. This should not be a problem if Robin's PEG parser is made official. If Robin's PEG parser is ''not'' made official, however, extensive pre-processing will be required.
There is now a master [http://lojban.org/cgi-bin/corpus Corpus Application]!
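
The lookahead problem described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the token sets are partial (digits only for PA), and <code>classify_numeral</code> is a hypothetical helper, not part of any official parser.

```python
# Partial, illustrative token sets: selma'o PA has more members than the digits.
PA = {"no", "pa", "re", "ci", "vo", "mu", "xa", "ze", "bi", "so"}
MAI = {"mai", "mo'o"}


def classify_numeral(tokens, start):
    """Classify the numeral string beginning at tokens[start].

    Returns (kind, scanned), where scanned is how many numeral tokens had to
    be read before the decision could be made. The count is unbounded, which
    is why a fixed one-token lookahead is not enough.
    """
    i = start
    while i < len(tokens) and tokens[i] in PA:
        i += 1
    if i == start:
        raise ValueError("no numeral at the given position")
    followed_by_mai = i < len(tokens) and tokens[i] in MAI
    kind = "free modifier" if followed_by_mai else "other numeral use"
    return kind, i - start
```

For the tokens <code>pa re ci mai</code> the helper has to scan three numeral tokens before it sees the MAI that settles the question.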


=== fa'o ===
Use it, and if it's missing something, please add it.


Described as unconditional end of parsing. Evidently intended only for machine input. Sometimes used in the sense of "the end". Some erroneous uses, such as inside of tu'e -- tu'u groups. See [http://66.102.9.104/search?q=cache:MpjTKbPZY88J:www.lojban.org/twiki/pub/Files/Documents/carvi.html+%22fa%27o%22+site:.lojban.org&hl=no]. I see no reason to legalise this practice, since ''fe'o'' is available for this purpose.
* All the text on lojban.org can be searched from http://www.lojban.org/search.html
* It can also be searched by doing a Google search with "site:.lojban.org". Try, for instance, http://www.google.com/search?hl=no&q=nurma+site%3A.lojban.org
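
A site-restricted search URL like the one above can be built programmatically. The following is a minimal sketch using only the Python standard library; <code>site_search_url</code> is a hypothetical helper name, and the parameters simply mirror the example URL.

```python
from urllib.parse import urlencode


def site_search_url(term, site=".lojban.org", lang="no"):
    """Build a Google query URL restricted to one site (hypothetical helper)."""
    # urlencode escapes the colon and turns the space into a plus sign.
    query = urlencode({"hl": lang, "q": f"{term} site:{site}"})
    return "http://www.google.com/search?" + query


print(site_search_url("nurma"))
# http://www.google.com/search?hl=no&q=nurma+site%3A.lojban.org
```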


==== usage ====
The two alternatives above will often give so many false positives in English as to be useless. The main source we have for Lojban usage is the IRC logs:


There is no usage other than quotes or jokes in English. The only usage that could be deemed correct is a single fa'o at the end of the Berenstain Bears books or other books.<br />
* http://www.lojban.org/resources/irclog/
* The above texts, [http://lojban.org/resources/irclog/irclogs.zip in one big zip file].


=== ni'o ===
These are filtered line-by-line to exclude lines that have too many words that are not possible Lojban word-forms, so it is a very high-quality corpus, and consists of more than 360,000 words (as of February 12th, 2006).


Seems to be used mostly in parallel to paragraph breaks in natural languages. See [http://www.lojban.org/], [http://www.wiw.org/~jkominek/lojban/9312/msg00394.html], and [http://www.wiw.org/~jkominek/lojban/9107/msg00052.html]. On IRC, which is indicative of spoken language, this appears to have more of a meaning of changing the subject. Examples: [http://www.digitalkingdom.org/lojban/irclog/lojban/2004_06_02-02_21.txt], [http://www.digitalkingdom.org/lojban/irclog/lojban/2002_05_12--2002_11_28.txt].
Lojbab's old archives are at [http://www.lojban.org/files/texts/archives/]


ni'o implicitly cancels some assignments, depending on the number of consecutive ni'o and whether the text is spoken or written. The following table is due to CLL pp. 446--447.
We also have some contributed texts that were uploaded to the old [[jbocre: Twiki|Twiki]], and (to [[User:tsali y|tsali y]] knowledge) not available elsewhere:


||'''Number of consecutive ni'o'''|'''Written'''|'''Spoken'''
* {ATTACH(name=>birendra,showdesc=>1)}{ATTACH}
* {ATTACH(name=>bisli-viltcima,showdesc=>1)}{ATTACH}


ni'o|no effect|cancel KOhA and GOhA
* {ATTACH(name=>help-lojban.txt,showdesc=>1)}{ATTACH}
 
ni'oni'o|cancel KOhA and GOhA|cancel KOhA and GOhA and tenses
 
ni'oni'oni'o|cancel KOhA and GOhA and tenses|cancel KOhA and GOhA and tenses||
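
The table above can be read as a small lookup rule. Here is a sketch in Python, assuming that spoken text behaves like written text with one extra ni'o (which is what the rows suggest) and that counts above three behave like three; <code>nio_cancels</code> and the label strings are hypothetical names.

```python
KOHA_GOHA = "KOhA and GOhA assignments"
TENSES = "tenses"


def nio_cancels(count, medium):
    """Return the set of things cancelled by `count` consecutive ni'o."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if medium not in ("written", "spoken"):
        raise ValueError("medium must be 'written' or 'spoken'")
    # Assumption: the spoken column is the written column shifted by one row.
    level = count + 1 if medium == "spoken" else count
    cancelled = set()
    if level >= 2:
        cancelled.add(KOHA_GOHA)
    if level >= 3:
        cancelled.add(TENSES)
    return cancelled
```

For example, <code>nio_cancels(1, "written")</code> returns an empty set, while <code>nio_cancels(1, "spoken")</code> returns only the KOhA and GOhA assignments.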
 
===  Examples of ''ni'o'' Usage ===
 
A: ni'o mi ca'o kelci lo samselkei
 
B: .i .ua go'i lo samselkei no'u ma
 
A: .i go'i la'o gy. Final Fantasy .gy.
 
B: .i .io mi nelci
 
A: ni'o mi djica lo nu citka
 
B: no'i mi djica lo nu jbera fi do
 
A: .i je'e
 
===  Notes ===
 
I did not include a natural example because the usage is wide, varied, and mostly incorrect between spoken, e-mailed, IRC'd, and written Lojban. It should be used for starting new topics of discussion, which as a by-product also clears KOhA and GOhA as well as sticky tenses (IIRC). It is not a paragraph marker (whitespace can be used for that and nobody said how much whitespace is allowed), it is a topic marker.
 
===  Issues ===
 
* [https://groups.google.com/group/lojban/tree/browse_frm/month/2008-12/dc259d0ccb79a9d0?hl=en&amp;rnum=51&amp;_done=%2Fgroup%2Flojban%2Fbrowse_frm%2Fmonth%2F2008-12%3Fhl%3Den%26scoring%3Dd%26&amp;scoring=d#doc_935fcdee29329746 Parser-related comment. Not sure if currently relevant.]
 
=== i ===
 
Ubiquitous. This is used mostly in front of sentences that are not the first sentence in the text. Sometimes also the first sentence in the text is prefixed with .i. (However, this is incorrect.)
 
===  Examples of ''.i'' Usage ===
 
(see ''ni'o'')
 
===  Notes ===
 
It is used to indicate the beginning of a new jufra continuing on the topic established with ni'o.
 
=== Proposed Definition of ''no'i'' ===
 
Used to indicate the speaker is talking about a previous topic of discussion.
 
=== Examples of ''no'i'' Usage ===
 
* no'i la xrist. ba cpacu loi vanju mu'i lenu pinxe kei gi'e te preti fo ko'a felenu ko'a djica lenu la xrist. dunda dakau ko'a
** ''"Christ then took wine to drink, and asked the man what he wanted Christ to give him."'' From the translation of "Cardplayer", by Nick Nicholas. [http://www.lojban.org/files/texts/cardplayer]
 
* no'i mi pu co'a mutce kurji lo nu jmina la jbovlaste
** ''"Anyway, I take great care about additions to Jbovlaste."'' [http://www.livejournal.com/users/camgusmis/2435.html]
 
Also see example at ni'o.
 
=== tu'e - tu'u ===
 
''tu'e'' - ''tu'u'' seems to be used mainly to set off a large block of text and refer to it metalinguistically. For instance, there is a (very large) mailing list thread called [http://www.lojban.org/lists/lojban-list/msg03769.html oi preti be fi lo nincli zo'u tu'e]. Also, many poems are prefixed with titles that use ''di'e'' to refer to the body of the poem, set off with ''tu'e''.
 
* [http://lojban.org/lists/lojban-list/msg08842.html Confusion as to the fact that tu'e clauses don't fit into relative clauses]
 
===  Examples of ''tu'e-tu'u'' Usage ===
 
Usage is contended. No consistent natural examples exist. Arbitrary examples follow:
 
* ro da pa de zo'u tu'e da gerku .ije de mlatu tu'u .inaja da jersi de
* .i la robin. kakne lo nu djuno tu'e lo se pensi be da
 
* mi nelci lo nu pilno zo ka'u va'o tu'e le jboklu
 
=== zo'u ===
 
Marks the end of a prenex. A prenex can have one or more terms, which may constrain the instantiation of logical variables in the main sentence. Prenexes are also used as a topic field.
 
===  Examples ===
 
*i lo do solri nu canci zo'u do ba lifri i mi ba mi'ecpe : Your sun-like vanishing exists such that you will experience it. I will demand it.
 
== Proposed dictionary entries ==
 
;'''fa'o''' (FAhO):Unless quoted by "zo" or "lo'u" -- "le'u", turned into a quote delimiter by zoi, or acting as part of a lujvo made by a preceding "zei", marks the end of input to be parsed. Any remaining text is to be disregarded.
 
;'''i''' (I):Starts a new sentence.
 
;'''mai''' (MAI):Enumerates a point in the text. Combines with the preceding numeral to make a free modifier, which can be placed almost anywhere in a text.
 
;'''mo'o''' (MAI):Enumerates a higher-level section or chapter in the text. Combines with the preceding numeral to make a free modifier, which can be placed almost anywhere in a text.
 
;'''ni'o''' (NIhO):Marks the start of a paragraph and a change of subject. Multiple "ni'o" in a row mean higher-level section breaks. In written contexts, two or more consecutive "ni'o" cancel the assignment of pro-sumti and pro-bridi in the selma'o KOhA and GOhA, respectively, and three or more consecutive "ni'o" additionally cancel all current tenses. In spoken contexts, one or more consecutive "ni'o" cancel the assignment of pro-sumti and pro-bridi in the selma'o KOhA and GOhA, respectively, while two or more consecutive "ni'o" additionally cancel all current tenses.
 
;'''no'i''' (NIhO):Marks the start of a paragraph and change back to a previous subject. If no'i has a positive or zero subscript, it indicates the continuation of an earlier topic that was introduced with the word ni'o with the same subscript. If no'i has a negative subscript, it is a resumption of the topic of the paragraph found by counting backwards, starting with the paragraph before the one introduced with ni'o.
 
;'''tu'e''' (TUhE):Starts a text scope, which is a group of sentences. The text scope acts as a single sentence externally, for purposes such as logical operators.
 
;'''tu'u''' (TUhU):Ends a text scope. Elidable terminator for tu'e.
 
;'''zo'u''' (ZOhU):Marks the end of a prenex. A prenex can occur at the beginning of the sentence, and consists of one or more terms. A term is either a sumti or a sumti preceded by a tense or modal tag. The primary use of a prenex is for quantifying logical variables prior to their use in the sentence and/or sentences that are joined to it by a logical connective. Terms that do not quantify logical variables are instead interpreted as 'topics' of the containing sentence, and any sentences that are joined to it by a logical connective.
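
The proposed ''fa'o'' entry above can be sketched as a pre-processing step over a token list. This is a simplified Python illustration, not a description of any existing parser: it assumes the text is already split into words, treats every quoting construct naively, and uses the hypothetical helper names <code>truncate_at_faho</code> and <code>_find</code>.

```python
def _find(tokens, word, start):
    """Index of the first `word` at or after `start`, or the last index if absent."""
    try:
        return tokens.index(word, start)
    except ValueError:
        return len(tokens) - 1


def truncate_at_faho(tokens):
    """Drop everything from the first unquoted fa'o onwards."""
    kept = []
    i = 0
    n = len(tokens)
    while i < n:
        tok = tokens[i]
        if tok == "fa'o":
            break  # unquoted fa'o: the rest of the input is disregarded
        if tok in ("zo", "zei") and i + 1 < n:
            # zo quotes the next word; zei binds the next word into a lujvo.
            kept.extend(tokens[i:i + 2])
            i += 2
        elif tok == "lo'u":
            # Everything up to the closing le'u is quoted.
            end = _find(tokens, "le'u", i + 1)
            kept.extend(tokens[i:end + 1])
            i = end + 1
        elif tok == "zoi" and i + 1 < n:
            # The word after zoi is the delimiter; copy through its repeat.
            end = _find(tokens, tokens[i + 1], i + 2)
            kept.extend(tokens[i:end + 1])
            i = end + 1
        else:
            kept.append(tok)
            i += 1
    return kept
```

With the input <code>mi cusku zo fa'o .i mi klama fa'o sei pinka</code>, the quoted fa'o survives and the text is cut at the second one.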
 
== Proposed keywords ==
 
;fa'o:The End. parsing ends here. end parsing here.
 
;i:and then.
 
;mai:-stly. -ndly. -thly.
 
;mo'o:-st section. -nd section. -rd section.
 
;ni'o:continuing to the next topic.
 
;no'i:returning to the previous topic.
 
;zo'u:so that. such that.
 
== Interaction with other sections ==
 
* The wording of the definition of "fa'o" must be watched closely to prevent contradictions with [[BPFK Section: Nonce connectives]].
* The selma'o MAI probably requires either preprocessing prior to YACC, or a PEG grammar.
 
== Notes ==
 
TUhU is currently seldom elidable. I believe that currently it is only elidable at the end of text. It is the belief of .xorxes., me, and possibly others that it should never be elidable. - .aionys.
 
NIhO should *NOT* have different effects depending on the medium it is in. rlpowell agrees. (I don't like how "ni'o"*N resets various things depending on N. Can't tense be reset using KI?) - .djeims./purpleposeidon/neptunepink (+1 check out my notes by the applicable words. -Lindar)
 
== Impact ==
 
It is my belief that this section does not invalidate actual usages that were previously valid, nor does it contradict current prescription of the language.
 
* Clarification: topic resumption by label applies if no'i has a positive '''or zero''' subscript.
* Clarification: topic resumption by back-counting '''starts at section before the one currently being introduced'''.
 
* Clarification: the implication that any term in a prenex is either a bound variable or a topic (CLL p. 467) is made explicit.
 
{POLL(pollId=>16)}Text Structure cmavo Poll{POLL}

Revision as of 16:46, 4 November 2013

Here is some info that we hope will be useful to baupla fuzykamni PFK commissioners, and other people doing research on the Lojban language.

There is now a master Corpus Application!

Use it, and if it's missing something, please add it.

The two alternatives above will often give so many false positives in English as to be useless. The main source we have for Lojban usage is the IRC logs:

These are filtered line-by-line to exclude lines that have too many words that are not possible Lojban word-forms, so it is a very high-quality corpus, and consists of more than 360,000 words (as of February 12th, 2006).
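
The filtering idea described above can be sketched in Python. The page does not give the actual filter, so everything here is an assumption: a "possible word-form" is approximated as a token made only of Lojban letters, apostrophes and commas (optionally wrapped in pause dots), and the threshold <code>max_bad_ratio</code> is a made-up parameter. Real Lojban morphology is much stricter.

```python
import re

# Rough stand-in for "possible Lojban word-form": the Lojban alphabet has no
# h, q or w; apostrophe, comma and period are also allowed characters.
WORD_FORM = re.compile(r"^\.?[abcdefgijklmnoprstuvxyz',]+\.?$")


def looks_lojban(line, max_bad_ratio=0.2):
    """Return True if at most max_bad_ratio of the words fail WORD_FORM."""
    words = line.lower().split()
    if not words:
        return False
    bad = sum(1 for word in words if not WORD_FORM.match(word))
    return bad / len(words) <= max_bad_ratio


def filter_log(lines, max_bad_ratio=0.2):
    """Keep only the lines that pass the word-form heuristic."""
    return [line for line in lines if looks_lojban(line, max_bad_ratio)]
```

A line such as <code>mi ca'o kelci lo samselkei</code> passes, while an English line containing several words with h or w is dropped.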

Lojbab's old archives are at [1]

We also have some contributed texts that were uploaded to the old Twiki, and (to tsali y knowledge) not available elsewhere:

  • {ATTACH(name=>birendra,showdesc=>1)}{ATTACH}
  • {ATTACH(name=>bisli-viltcima,showdesc=>1)}{ATTACH}
  • {ATTACH(name=>help-lojban.txt,showdesc=>1)}{ATTACH}