The Biaroza file format (BFF)
last updated: Jan 21st, 2002
Sections:
1. Presenting the BFF
2. Physical description of the BFF file format
3. The line types in a BFF file
4. Special characters
5. Formatting the data
6. Fields
7. Explaining the parentheses
1. Presenting the BFF
The BFF (Biaroza File Format) is the standard format for storing the dictionary databases
of the Biaroza project.
The BFF is:
- Capable of holding complex dictionary data, covering all the types of information
found in conventional printed dictionaries.
- An extensible format.
- Directly editable with a simple text editor.
- Capable of covering virtually any language
(although the Biaroza project targets only Belarusian dictionaries).
2. Physical description of the BFF file format
The BFF is an 8-bit text file, ASCII compatible.
Formatting and data organization are done with standard ASCII strings.
The high-bit (non-ASCII) characters are not interpreted by the format itself, so the text
may be in any ISO 8859 encoding, UTF-8, or any other ASCII-compatible encoding.
The line break may be CR+LF, LF, or CR.
Note: The Biaroza project uses BFF with UTF-8 encoding and LF line breaks.
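The line-break rule above can be sketched in Python as follows. This is only an illustration, not part of the BFF standard: the function name split_bff_lines and the default encoding argument are assumptions made for this example.

```python
import re

def split_bff_lines(raw: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode raw BFF bytes and split them on CR+LF, LF, or CR.

    The encoding default is an assumption: it matches the Biaroza project,
    but any ASCII-compatible encoding could be passed instead.
    """
    text = raw.decode(encoding)
    # CR+LF is listed first so that it counts as one break, not two.
    return re.split(r"\r\n|\r|\n", text)
```

A file ending with a line break yields a final zero-length string, which is harmless because empty lines are ignored anyway.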
3. The line types in a BFF file
There are five possible line types:
- Empty lines
Lines with zero length, or lines containing only spaces (characters <= 0x20).
Empty lines are ignored.
- Comment lines
Lines starting with the ";" (semicolon) character.
Comment lines are ignored.
- Attribution lines
Lines starting with the "#" (number sign) character.
The syntax of these lines is yet to be developed, so they must be ignored for now.
- Definition HEAD
Any line which does not fit any other line type.
These lines hold the definition name.
Normally this line is followed by DATA lines, but a HEAD line with no DATA at all is also possible
(although not of much use).
- Definition DATA
Non-empty lines starting with spaces (characters <= 0x20)
The DATA lines are owned by the previously-defined HEAD, as part of that definition.
DATA lines without a previous HEAD line are invalid.
The only lines allowed between HEAD and DATA lines are empty lines and comment lines.
Note: The Biaroza project uses one space (0x20) as the DATA line identifier.
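The five line types can be told apart by looking at the first character of each line. The following Python function is a minimal sketch of that decision. The name classify_line and the returned strings are made up for this example.

```python
def classify_line(line: str) -> str:
    """Return the BFF line type of one line (line break already removed)."""
    if all(ch <= " " for ch in line):
        return "empty"        # zero length, or only characters <= 0x20
    if line[0] == ";":
        return "comment"
    if line[0] == "#":
        return "attribution"  # syntax not defined yet; callers should skip it
    if line[0] <= " ":
        return "data"         # owned by the previously defined HEAD
    return "head"             # anything that fits no other line type
```

The empty-line test comes first, so that a line made only of spaces is not mistaken for a DATA line.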
4. Special characters
These are the characters which have a special meaning in a BFF file:
- # (number) ; (semicolon) " " (space)
Used at the beginning of a line as the line type identifier (as described in section 3).
- : (colon) ; (semicolon, again) , (comma)
Divider, sub-divider and sub-sub-divider of a line.
These may theoretically appear in any HEAD or DATA line.
Notes (specific to the Biaroza project):
- These characters are not used at all in HEAD lines.
- The : (colon) is not used more than once per line.
- ( ) (parentheses)
These have two functions:
- To insert free comments (write whatever you want there).
- To assign a property to a field or word (write standard tags there).
The file parser must not interpret the data between the parentheses.
Any character is allowed between the parentheses except control characters.
Parentheses can be nested (but everything inside the outermost pair, nested parentheses included,
is ignored by the parser).
High-level interpretation of the data between parentheses is completely optional.
Note: It's a good idea not to use the , ; : (comma, semicolon, colon) characters between parentheses,
and not to nest parentheses, otherwise simpler software may have problems processing the file.
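One possible way to honor the rule that parenthesized data is not interpreted is to count the nesting depth while scanning for a divider. The Python sketch below does that. The function name split_top_level is an assumption for this example, and trimming the spaces around each part is a choice of this sketch, not something the format requires.

```python
def split_top_level(text: str, divider: str) -> list[str]:
    """Split text at each divider character that is outside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == divider and depth == 0:
            # A divider at depth 0 ends the current part.
            parts.append("".join(current).strip())
            current = []
        else:
            # Everything else, including dividers inside parentheses, is kept.
            current.append(ch)
    parts.append("".join(current).strip())
    return parts
```

Calling it with ";" and then calling it again on each part with "," gives the sub-divider and sub-sub-divider levels.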
5. Formatting the data
Here is a fictional example of BFF data:
abvinavačvać
 stress: abvinava_čvać
 meaning (v.imp.): accuse
 declesion: abvinić (v.perf.)
; this is a comment line
abvinić
 stress: abvini_ć
 see (v.perf.): abvinavačvać
achova
 stress: acho_va
; it's possible to insert comments here, for example
 meaning (f.): protection, guard
Note: The unindented lines are definition HEADs, the lines indented by one space are definition DATA,
and the lines starting with ";" are comment lines.
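As an illustration of how data like the example above could be read, here is a minimal Python sketch that groups DATA lines under their HEAD. The function name parse_bff and the returned structure are assumptions for this example. Skipping attribution lines is based on section 3, which says they must be ignored for now.

```python
import re

def parse_bff(text: str) -> list[tuple[str, list[str]]]:
    """Group DATA lines under the HEAD line that precedes them."""
    definitions = []
    for number, line in enumerate(re.split(r"\r\n|\r|\n", text), start=1):
        # Empty, comment (;) and attribution (#) lines are skipped.
        if all(ch <= " " for ch in line) or line[0] in ";#":
            continue
        if line[0] <= " ":
            # DATA line: it must be owned by an earlier HEAD.
            if not definitions:
                raise ValueError(f"line {number}: DATA line without a HEAD")
            definitions[-1][1].append(line.strip())
        else:
            # HEAD line: starts a new definition with no DATA yet.
            definitions.append((line.strip(), []))
    return definitions
```

Each returned item is a pair: the definition name and the list of its DATA lines with the leading spaces removed.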
6. Fields
Fields are the keywords used to define which data a DATA line carries.
All field types are optional (each may be present or absent) and may appear more than once
(although in practical use few fields - like the "meaning" - appear more than once).
All field types may have a property attached (the parenthesis data), although currently
such data is only relevant for the "meaning" and "see" fields.
The field types currently present in the BFF standard:
- declesion
If a definition (a verb, an adjective, etc.) varies in an irregular way,
the variants may appear here.
This often happens with the perfective/imperfective variations of Slavic verbs.
In English, for example, there's the case of "knife" (n.), which turns into "knives" (pl.).
- meaning
The meaning of the definition.
If there are different meanings for the same word, the ";" (semicolon) is used to separate
the words of one meaning from those of another.
If the same word belongs to several classes simultaneously (noun, verb, adjective, etc.), a new line with
the "meaning" field will be needed for each class. This often happens in English (e.g.
"walk" is simultaneously a verb and a noun).
- see
Works much like the "meaning" field, except that it points to another definition (or definitions)
equivalent to this one.
It's advisable not to point this field to other definitions which also have "see"
entries. This would add unnecessary complexity to the dictionary database.
- stress
Indicates the stressed syllable of the defined word.
The stressed syllable is the one just before the "_" (underscore) character.
- variation
This field is used to list variations of the definition that belong to different grammatical classes
(e.g. in the definition of "slow" (adj.) we could put in the variation entry: "slowly" (adv.)
and "slowness" (n.)).
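Judging from the examples in this document, a DATA line has the shape "field (property): value", where the property is optional. The Python sketch below splits a line that way and also locates the stress marker. The names DATA_LINE, parse_data_line and split_stress are assumptions for this example, and the pattern only handles one un-nested property.

```python
import re

# Assumed shape: field name, optional "(property)", colon, value.
DATA_LINE = re.compile(
    r"^\s*(?P<field>[^\s(:]+)\s*(?:\((?P<prop>[^()]*)\))?\s*:\s*(?P<value>.*)$"
)

def parse_data_line(line: str):
    """Return (field, property or None, value) for one DATA line."""
    match = DATA_LINE.match(line)
    if match is None:
        raise ValueError(f"not a recognizable DATA line: {line!r}")
    return match["field"], match["prop"], match["value"].strip()

def split_stress(value: str):
    """Return the word without "_" and the index where "_" was found."""
    position = value.index("_")  # raises ValueError if there is no marker
    return value.replace("_", "", 1), position
```

For example, parse_data_line(" meaning (f.): protection, guard") gives the field "meaning", the property "f." and the value "protection, guard".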
7. Explaining the parentheses
First of all, keep in mind the explanation of the parentheses in section 4.
Now look at this example:
abvinavačvać
 stress: abvinava_čvać
 meaning (v.imp.): accuse
 declesion: abvinić (v.perf.)
The first parenthesized data, "v.imp.", is an attribute of that "meaning" field.
It applies to the defined word "abvinavačvać".
The second parenthesized data relates to the word "abvinić", which is "v.perf."
(perfective verb).
Why is the first parenthesis not placed next to the word "abvinavačvać" instead?
Because some definitions may belong to more than one class simultaneously (noun, verb, etc.),
so the "meaning" and "see" fields carry this data instead, since they may appear
more than once (as explained in section 6).
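A property attached to a word, such as "abvinić (v.perf.)", could be separated from the word with a small helper like the Python sketch below. The names WORD_WITH_TAG and split_word_tag are assumptions for this example. It handles a single un-nested pair of parentheses and leaves anything else untouched, since interpreting this data is optional.

```python
import re

# Assumed shape: a word optionally followed by one un-nested "(tag)".
WORD_WITH_TAG = re.compile(r"^(?P<word>[^()]*?)\s*(?:\((?P<tag>[^()]*)\))?\s*$")

def split_word_tag(item: str):
    """Return (word, tag or None) for an item such as "abvinić (v.perf.)"."""
    match = WORD_WITH_TAG.match(item.strip())
    if match is None:
        # Nested or misplaced parentheses: leave the item uninterpreted.
        return item.strip(), None
    return match["word"], match["tag"]
```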