Project "Biaroza"

The Biaroza file format (BFF)

last updated: Jan 21st, 2002

Sections:

1. Presenting the BFF
2. Physical description of the BFF file format
3. The line types in a BFF file
4. Special characters
5. Formatting the data
6. Fields
7. Explaining the parentheses

1. Presenting the BFF

The BFF (Biaroza File Format) is the standard way of storing the dictionary databases of the Biaroza project.

The BFF is:

  • Capable of holding complex dictionary data, covering every kind of information found in conventional printed dictionaries.
  • An extensible format.
  • Directly editable with any plain text editor.
  • Capable of covering virtually any language
    (although the Biaroza project is aimed only at Belarusian dictionaries).

2. Physical description of the BFF file format

A BFF file is an 8-bit, ASCII-compatible text file.
Formatting and data organization are done with standard ASCII strings.
High-bit (non-ASCII) characters are not interpreted by the format itself, so the text may use any ISO 8859 encoding, UTF-8 or any other ASCII-compatible encoding.
Line breaks may be CR+LF, LF or CR.
Note: The Biaroza project uses BFF with UTF-8 and LF line breaks.
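
As an illustration of the rules above, here is a minimal Python sketch that splits raw file bytes into lines before decoding them. The function name split_bff_lines and its encoding parameter are assumptions made for this example, not part of the BFF standard. The split is done on bytes so that only CR, LF and CR+LF are treated as line breaks.

```python
def split_bff_lines(raw: bytes, encoding: str = "utf-8") -> list[str]:
    """Split raw BFF bytes into text lines, accepting CR+LF, LF or CR."""
    # bytes.splitlines() breaks only on CR, LF and CR+LF, which are the
    # three line breaks the format allows. Decoding happens afterwards,
    # so high-bit characters never influence the line splitting.
    return [line.decode(encoding) for line in raw.splitlines()]


# Hypothetical sample mixing all three kinds of line break.
sample = "abvinić\r\n stress: abvini_ć\rachova\n stress: acho_va\n".encode("utf-8")
lines = split_bff_lines(sample)
```

For a file in another ASCII-compatible encoding, the caller would pass that encoding's name instead of "utf-8".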

3. The line types in a BFF file

There are five possible line types:

  • Empty lines
    Lines of zero length, or lines containing only spaces (characters <= 0x20).
    Empty lines are ignored.
  • Comment lines
    Lines starting with the ";" (semicolon) character.
    Comment lines are ignored.
  • Attribution lines
    Lines starting with the "#" (number sign) character.
    The syntax of these lines is yet to be defined, so they must be ignored for now.
  • Definition HEAD
    Any line which does not fit any other line type.
    These lines hold the definition name.
    Normally this line is followed by DATA lines, but a HEAD line with no DATA at all is also possible (though not of much use).
  • Definition DATA
    Non-empty lines starting with a space (any character <= 0x20).
    DATA lines belong to the most recent HEAD line, as part of that definition.
    DATA lines without a preceding HEAD line are invalid.
    The only lines allowed between HEAD and DATA lines are empty lines and comment lines.
    Note: The Biaroza project uses one space (0x20) as the DATA line identifier.
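
The five line types can be told apart by looking at the first character of each line. The following Python sketch shows one possible way to do it. The function name classify_line and the returned labels are chosen for this example only.

```python
def classify_line(line: str) -> str:
    """Return the BFF line type of a single line (without its line break)."""
    # Zero length, or nothing but characters <= 0x20.
    # all() is True for an empty string, so both cases are covered.
    if all(ch <= "\x20" for ch in line):
        return "EMPTY"
    if line[0] == ";":
        return "COMMENT"
    if line[0] == "#":
        return "ATTRIBUTION"  # syntax undefined so far: callers ignore it
    if line[0] <= "\x20":
        return "DATA"  # non-empty line starting with a space
    return "HEAD"  # anything that fits no other type


kinds = [classify_line(s) for s in ["", "; note", "#x", " stress: acho_va", "achova"]]
```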

4. Special characters

These are the characters which have a special meaning in a BFF file:

  • # (number sign) ; (semicolon) " " (space)
    Used at the beginning of a line as the line type identifier (as described in section 3).
  • : (colon) ; (semicolon, again) , (comma)
    Divider, sub-divider and sub-sub-divider of a line.
    In theory, these may appear in any HEAD or DATA line.
    Notes (specific to the Biaroza project):
    • These characters are not used at all in HEAD lines.
    • The : (colon) is not used more than once per line.
  • ( ) (parentheses)
    These have two functions:
    • To insert free comments (write whatever you want there).
    • To assign a property to a field or word (write standard tags there).
    The file parser must not interpret the data between parentheses.
    Any character except control characters is allowed between parentheses.
    Parentheses can be nested (but everything inside the first level of parentheses is ignored by the parser).
    High-level interpretation of the data between parentheses is completely optional.
    Note: It's a good idea to avoid the , ; : (comma, semicolon, colon) characters between parentheses and to avoid nesting parentheses; otherwise simpler software may have problems processing the file.
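
A parser that wants to cope with dividers and nested parentheses inside parentheses can track the nesting depth and split only at depth zero. The Python sketch below illustrates the idea. The function name split_outside_parens and the sample text are made up for this example and are not part of the standard.

```python
def split_outside_parens(text: str, sep: str) -> list[str]:
    """Split text on sep, leaving anything between parentheses untouched."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0  # current nesting level of parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            # A divider outside any parentheses ends the current part.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


# Hypothetical value with a free comment that contains a comma.
meanings = split_outside_parens("protection, guard (old, rare); shelter", ";")
words = split_outside_parens(meanings[0], ",")
```

Simpler software may just split on the divider characters directly, which is why the note above advises against using them between parentheses.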

5. Formatting the data

Here is a fictional example of BFF data:


abvinavačvać
 stress: abvinava_čvać
 meaning (v.imp.): accuse
 declesion: abvinić (v.perf.)
; this is a comment line
abvinić
 stress: abvini_ć
 see (v.perf.): abvinavačvać
achova
 stress: acho_va
; it's possible to insert comments here, for example
 meaning (f.): protection, guard

Note: Unindented lines are definition HEADs, lines indented with one space are definition DATA, and lines starting with ";" are comment lines.
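
To show how the line types combine into definitions, here is a rough Python sketch that groups each HEAD with its DATA lines. The names parse_bff, WS and SAMPLE are assumptions for this example. The sketch assumes LF line breaks (the Biaroza convention) and simply skips attribution lines, since their syntax is not defined yet.

```python
WS = "".join(chr(c) for c in range(0x21))  # every character <= 0x20


def parse_bff(text: str) -> list[tuple[str, list[str]]]:
    """Group DATA lines under the HEAD line that precedes them."""
    entries: list[tuple[str, list[str]]] = []
    for line in text.split("\n"):
        body = line.strip(WS)
        if not body or line[0] in ";#":
            continue  # empty, comment and attribution lines are ignored
        if line[0] <= "\x20":
            if not entries:
                raise ValueError("DATA line without a previous HEAD line")
            entries[-1][1].append(body)  # DATA belongs to the latest HEAD
        else:
            entries.append((body, []))  # a new definition HEAD
    return entries


SAMPLE = """\
abvinavačvać
 stress: abvinava_čvać
 meaning (v.imp.): accuse
 declesion: abvinić (v.perf.)
; this is a comment line
abvinić
 stress: abvini_ć
 see (v.perf.): abvinavačvać
achova
 stress: acho_va
; it's possible to insert comments here, for example
 meaning (f.): protection, guard
"""

entries = parse_bff(SAMPLE)
```

The result is kept as an ordered list of (HEAD, DATA lines) pairs rather than a dictionary, so that the sketch makes no assumption about repeated HEADs.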

6. Fields

Fields are the keywords that define which kind of data a DATA line carries.
All field types are optional (each may be present or absent) and may appear more than once (although in practice few fields, such as "meaning", do).
All field types may have a property attached (the data in parentheses), although currently such data is only relevant for the "meaning" and "see" fields.

The field types currently present in the BFF standard:

  • declesion
    If a defined word (a verb, an adjective, etc.) inflects irregularly, the variants may be listed here.
    This is often the case for Slavic perfective/imperfective verb pairs.
    In English, for example, there is the case of "knife" (n.), which turns into "knives" (pl.).
  • meaning
    The meaning of the defined word.
    If the word has several different meanings, the ";" (semicolon) separates the words of one meaning from those of the next.
    If the same word belongs to several classes at once (noun, verb, adjective, etc.), a separate "meaning" line is needed for each class. This often happens in English (e.g. "walk" is both a verb and a noun).
  • see
    Works much like the "meaning" field, except that it points to another definition (or definitions) equivalent to this one.
    It's advisable not to point this field at definitions that themselves have "see" entries, as this would add unnecessary complexity to the dictionary database.
  • stress
    Indicates the stressed syllable of the defined word.
    The stressed syllable is the one just before the "_" (underscore) character.
  • variation
    This field lists related forms of the defined word that belong to different grammatical classes.
    (e.g. in the definition of "slow" (adj.), the variation entry could hold "slowly" (adv.) and "slowness" (n.))
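
As a small illustration of the "stress" field, the Python sketch below removes the "_" marker and reports where it was. The function name read_stress and its return shape are assumptions made for this example.

```python
def read_stress(marked: str) -> tuple[str, int]:
    """Return the plain word and the offset at which "_" was found.

    The stressed syllable is the one ending just before that offset.
    """
    pos = marked.find("_")
    if pos < 0:
        raise ValueError("no stress marker in " + repr(marked))
    # Remove only the first marker and keep the rest of the word intact.
    return marked.replace("_", "", 1), pos


word, offset = read_stress("abvinava_čvać")
```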

7. Explaining the parentheses

First of all, keep in mind the explanation of parentheses in section 4.

Now look at this example:

abvinavačvać
 stress: abvinava_čvać
 meaning (v.imp.): accuse
 declesion: abvinić (v.perf.)

The data in the first pair of parentheses, "v.imp.", is an attribute of that "meaning" field.
It applies to the defined word "abvinavačvać".
The data in the second pair of parentheses relates to the word "abvinić", which is "v.perf." (perfective verb).

Why isn't the first pair of parentheses placed next to the word "abvinavačvać" instead?
Because some definitions belong to more than one class at once (noun, verb, etc.), the "meaning" and "see" fields carry this data instead, since they may appear more than once (as explained in section 6).
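
The two positions of the parentheses can be read with the same helper, applied once to the field and once to each word. The Python sketch below is only an illustration. The names split_tag and parse_data_line are assumptions, and the sketch assumes a single trailing pair of parentheses with no nesting and no divider characters inside it. For brevity it flattens ";" and "," into one list of words.

```python
import re
from typing import Optional

# Text optionally followed by one trailing "(...)" without nested parentheses.
TAGGED = re.compile(r"^(.*?)\s*(?:\(([^()]*)\))?$")


def split_tag(text: str) -> tuple[str, Optional[str]]:
    """Split 'word (tag)' into ('word', 'tag'); the tag is None if absent."""
    match = TAGGED.match(text.strip())
    if match is None:  # only possible if the text spans several lines
        raise ValueError("cannot parse " + repr(text))
    return match.group(1), match.group(2)


def parse_data_line(line: str):
    """Return (field, field tag, [(word, word tag), ...]) for a DATA line."""
    key, _, value = line.partition(":")  # the colon is used once per line
    field, field_tag = split_tag(key)
    words = [split_tag(part) for part in re.split(r"[;,]", value)]
    return field, field_tag, words


meaning = parse_data_line(" meaning (v.imp.): accuse")
declesion = parse_data_line(" declesion: abvinić (v.perf.)")
```

In the first result the tag is attached to the field, and in the second it is attached to the listed word.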