Diffstat (limited to 'misc/stardict-tools/README.StarDict')
-rw-r--r-- misc/stardict-tools/README.StarDict 380
1 files changed, 380 insertions, 0 deletions
diff --git a/misc/stardict-tools/README.StarDict b/misc/stardict-tools/README.StarDict
new file mode 100644
index 0000000000..352bf18dd0
--- /dev/null
+++ b/misc/stardict-tools/README.StarDict
@@ -0,0 +1,380 @@
+
+Format for StarDict dictionary files
+------------------------------------
+
+StarDict homepage: http://stardict.sourceforge.net
+
+{0}. Number and Byte-order Conventions
+When you record the numbers that identify sizes, offsets, etc., you
+should use 32-bit numbers, such as you might represent with a guint32.
+
+In order to make StarDict work on different platforms, these numbers
+must be in network byte order. You can ensure the correct byte order
+by using the g_htonl() function when creating dictionary files.
+Conversely, you should use g_ntohl() when reading dictionary files.
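+
+As a minimal sketch of what that byte order means on disk, the two
+helpers below write and read a 32-bit number without glib. The names
+put_be32() and get_be32() are made up for this illustration; they are
+not part of StarDict or glib.

```c
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order:
 * most significant byte first. */
static void put_be32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read a 32-bit value stored in network (big-endian) byte order. */
static uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```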
+
+Strings should be encoded in UTF-8.
+
+
+
+{1}. Files
+
+Every dictionary consists of three files:
+
+(1). somedict.ifo
+(2). somedict.idx or somedict.idx.gz
+(3). somedict.dict or somedict.dict.dz
+
+You can use gzip -9 to compress the .idx file. If the .idx file is not
+compressed, it loads quickly and uses less memory. If it is compressed, the
+whole .idx file is loaded into memory, which makes querying fast.
+
+You can use dictzip to compress the .dict file.
+"dictzip" uses the same compression algorithm and file format as does gzip,
+but provides a table that can be used to randomly access compressed blocks
+in the file. The use of 50-64kB blocks for compression typically degrades
+compression by less than 10%, while maintaining acceptable random access
+capabilities for all data in the file. As an added benefit, files
+compressed with dictzip can be decompressed with gunzip.
+For more information about dictzip, see the DICT project:
+http://www.dict.org
+
+StarDict will search for the .ifo file, then open the .idx or
+.idx.gz file and the .dict.dz or .dict file that are in the same directory
+and have the same base name.
+
+
+
+{2}. The ".ifo" file's format.
+
+The .ifo file has the following format:
+
+StarDict's dict ifo file
+version=2.4.2
+[options]
+
+Note that the current "version" string must be "2.4.2". If it's not,
+then StarDict will refuse to read the file.
+
+[options]
+---------
+
+In the example above, [options] expands to any of the following lines
+specifying information about the dictionary. Each option is a keyword
+followed by an equal sign, then the value of that option, then a
+newline. The options may appear in any order.
+
+Note that the dictionary must have at least a bookname, a wordcount and an
+idxfilesize, or the load will fail. All other information is optional. All
+strings should be encoded in UTF-8.
+
+Available options:
+
+bookname= // required
+wordcount= // required
+idxfilesize= // required
+author=
+email=
+website=
+description=
+date=
+sametypesequence= // very important.
+
+
+wordcount is the number of word entries in the .idx file; it must be correct.
+
+idxfilesize is the size (in bytes) of the .idx file. Even if the .idx is
+compressed to a .idx.gz file, this entry must record the original .idx file's
+size, and it must be correct too. StarDict takes the original size from this
+entry rather than from the .gz file, since knowing it up front speeds up
+extraction into memory: the buffer does not have to be grown with repeated
+realloc() calls.
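+
+The keyword=value layout can be read with very little code. The sketch
+below is only an illustration, not StarDict's own loader: ifo_get() and
+ifo_has_required() are hypothetical names, the whole .ifo text is assumed
+to be in memory already, and lines are assumed to end in a plain '\n'.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the value of "key=" from .ifo text into out.
 * Returns 1 if a line starting with "key=" was found, 0 otherwise. */
static int ifo_get(const char *text, const char *key, char *out, size_t outsz)
{
    size_t klen = strlen(key);
    const char *line = text;

    if (outsz == 0)
        return 0;
    while (line && *line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len > klen && strncmp(line, key, klen) == 0 && line[klen] == '=') {
            size_t vlen = len - klen - 1;
            if (vlen >= outsz)
                vlen = outsz - 1;          /* truncate overly long values */
            memcpy(out, line + klen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        line = eol ? eol + 1 : NULL;       /* advance to the next line */
    }
    return 0;
}

/* Check that the three required options are present. */
static int ifo_has_required(const char *text)
{
    char v[256];

    return ifo_get(text, "bookname", v, sizeof v) &&
           ifo_get(text, "wordcount", v, sizeof v) &&
           ifo_get(text, "idxfilesize", v, sizeof v);
}
```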
+
+
+The "sametypesequence" option is described in further detail below.
+
+***
+
+sametypesequence
+
+You should first familiarize yourself with the .dict file format
+described in the next section so that you can understand what effect
+this option has on the .dict file.
+
+If the sametypesequence option is set, it tells StarDict that each
+word's data in the .dict file will have the same sequence of datatypes.
+In this case, we expect a .dict file that's been optimized in two
+ways: the type identifiers should be omitted, and the size marker for
+the last data entry of each word should be omitted.
+
+Let's consider some concrete examples of the sametypesequence option.
+
+Suppose that a dictionary records many .wav files, and so sets:
+sametypesequence=W
+In this case, each word's entry in the .dict file consists solely of a
+wav file. In the .dict file, you would leave out the 'W' character
+before each entry, and you would also omit the 32-bit integer at the
+front of each .wav entry that would normally give the entry's length.
+You can do this since the length is known from the information in the
+.idx file.
+
+As another example, suppose a dictionary contains phonetic information
+and a meaning for each word. The sametypesequence option for this
+dictionary would be:
+sametypesequence=tm
+Once again, you can omit the 't' and 'm' characters before each data
+entry in the .dict file. In addition, you should omit the terminating
+'\0' for the 'm' entry for each word in the .dict file, as the length
+of the meaning string can be inferred from the length of the phonetic
+string (still indicated by a terminating '\0') and the length of the
+entire word entry (listed in the .idx file).
+
+So for cases where the last data entry for each word normally requires
+a terminating '\0' character, you should omit this character in the
+.dict file. And for cases where the last data entry for each word
+normally requires an initial 32-bit number giving the length of the
+field (such as WAV and PNG entries), you must omit this number in the
+dictionary.
+
+Every dictionary should try to use the sametypesequence feature to
+save disk space.
+
+***
+
+
+
+{3}. The ".idx" file's format.
+
+The .idx file is just a word list.
+
+The word list is a sorted list of word entries.
+
+Each entry in the word list contains three fields, one after the other:
+
+word_str; // a utf-8 string terminated by '\0'.
+word_data_offset; // word data's offset in .dict file
+word_data_size; // word data's total size in .dict file
+
+word_str gives the string representing this word. It's the string
+that is "looked up" by StarDict.
+
+word_data_offset and word_data_size should both be 32-bit numbers in
+network byte order.
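+
+To make the entry layout concrete, here is a rough sketch of reading
+entries from an uncompressed .idx image held in memory. struct idx_entry,
+be32() and idx_next() are names invented for this example and do not
+come from the StarDict sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of one .idx entry. */
struct idx_entry {
    const char *word;   /* points into the .idx buffer, '\0'-terminated */
    uint32_t offset;    /* word_data_offset */
    uint32_t size;      /* word_data_size */
};

/* Read a 32-bit number stored in network byte order. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parse the entry starting at buf[pos]. Returns the position of the
 * next entry, or 0 if there is no complete entry at pos. */
static size_t idx_next(const unsigned char *buf, size_t len, size_t pos,
                       struct idx_entry *e)
{
    const unsigned char *nul;
    size_t wlen;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    if (nul == NULL)
        return 0;                       /* word_str is not terminated */
    wlen = (size_t)(nul - (buf + pos));
    if (len - pos - wlen - 1 < 8)
        return 0;                       /* offset/size fields are cut off */
    e->word = (const char *)(buf + pos);
    e->offset = be32(nul + 1);
    e->size = be32(nul + 5);
    return pos + wlen + 1 + 8;
}
```
+
+Calling idx_next() repeatedly, starting at position 0 and feeding each
+return value back in, visits every entry until it returns 0.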
+
+No two entries should have the same "word_str". In other words,
+(strcmp(s1, s2) != 0).
+
+The length of "word_str" should be less than 256. In other words,
+(strlen(word_str) < 256).
+
+The word list must be sorted by calling stardict_strcmp() on the "word_str"
+fields. If the word list order is wrong, StarDict will fail to function
+correctly!
+
+============
+gint stardict_strcmp(const gchar *s1, const gchar *s2)
+{
+    gint a;
+    a = g_ascii_strcasecmp(s1, s2);
+    if (a == 0)
+        return strcmp(s1, s2);
+    else
+        return a;
+}
+============
+
+g_ascii_strcasecmp() is a glib function:
+
+Unlike the BSD strcasecmp() function, this only recognizes standard
+ASCII letters and ignores the locale, treating all non-ASCII characters
+as if they are not letters.
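+
+For tools that do not link against glib, the same ordering can be
+approximated in plain C. The sketch below is an assumption-laden
+stand-in, not the glib implementation: ascii_strcasecmp(),
+stardict_strcmp_plain() and sort_words() are hypothetical names, and
+the case folding is meant to mirror the ASCII-only behaviour described
+above.

```c
#include <stdlib.h>
#include <string.h>

/* Fold only the ASCII letters 'A'..'Z'; every other byte is unchanged. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* ASCII-only, locale-independent case-insensitive comparison. */
static int ascii_strcasecmp(const char *s1, const char *s2)
{
    while (*s1 && *s2) {
        int c1 = ascii_lower((unsigned char)*s1);
        int c2 = ascii_lower((unsigned char)*s2);
        if (c1 != c2)
            return c1 - c2;
        s1++;
        s2++;
    }
    return (int)(unsigned char)*s1 - (int)(unsigned char)*s2;
}

/* Same two-step rule as stardict_strcmp(): case-insensitive first,
 * then plain strcmp() to break ties. */
static int stardict_strcmp_plain(const char *s1, const char *s2)
{
    int a = ascii_strcasecmp(s1, s2);
    return a != 0 ? a : strcmp(s1, s2);
}

/* qsort() adapter for an array of C strings. */
static int cmp_words(const void *a, const void *b)
{
    return stardict_strcmp_plain(*(const char *const *)a,
                                 *(const char *const *)b);
}

/* Sort a word list into the order the .idx file expects. */
static void sort_words(const char **words, size_t n)
{
    qsort(words, n, sizeof words[0], cmp_words);
}
```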
+
+stardict_strcmp() works fine with English characters, but it does not
+sort characters from other locales well. There should be a _strcmp
+function which handles the utf-8 string sorting better. If you know
+one, email me :)
+
+g_utf8_collate()? This is a locale-dependent function. So if you look
+up Chinese characters while in the Chinese locale, it works fine. But
+if you are in some other locale then the lookup will fail, as the
+order is not the same as in the Chinese locale (which was used when
+creating the dictionary).
+
+Convert with g_utf8_to_ucs4() and then compare? This sounds like a good solution, but..
+
+The complete solution can be found in "Unicode Technical Standard #10: Unicode
+Collation Algorithm", http://www.unicode.org/reports/tr10/
+
+I hope glib will provide a locale-independent g_utf8_collate() soon.
+http://bugzilla.gnome.org/show_bug.cgi?id=112798
+
+
+
+{4}. The ".dict" file's format.
+
+The .dict file is a pure data sequence, as the offset and size of each
+word is recorded in the corresponding .idx file.
+
+If the "sametypesequence" option is not used in the .ifo file, then
+the .dict file has fields in the following order:
+
+==============
+word_1_data_1_type; // a single char identifying the data type
+word_1_data_1_data; // the data
+word_1_data_2_type;
+word_1_data_2_data;
+...... // the number of data entries for each word is determined by
+// word_data_size in .idx file
+word_2_data_1_type;
+word_2_data_1_data;
+......
+==============
+
+It's important to note that each field in each word indicates its
+own length, as described below. The number of possible fields per
+word is also not fixed, and is determined by simply reading data until
+you've read word_data_size bytes for that word.
+
+
+Suppose the "sametypesequence" option is used in the .ifo file, and
+the option is set like this:
+
+sametypesequence=tm
+
+Then the .dict file will look like this:
+
+==============
+word_1_data_1_data
+word_1_data_2_data
+word_2_data_1_data
+word_2_data_2_data
+......
+==============
+
+The first data entry for each word will have a terminating '\0', but
+the second entry will not have a terminating '\0'. The omissions of
+the type chars and of the last field's size information are the
+optimizations required by the "sametypesequence" option described
+above.
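+
+As an illustration, one word's data under sametypesequence=tm could be
+split as sketched below. split_tm() is a hypothetical helper; data and
+size stand for the bytes selected by word_data_offset and
+word_data_size from the .idx file.

```c
#include <stddef.h>
#include <string.h>

/* Split one "tm" record. The phonetic string keeps its '\0'; the
 * meaning has no terminator and runs to the end of the record, so its
 * length is returned separately. Returns 0 on success, -1 if the
 * phonetic string is not terminated inside the record. */
static int split_tm(const char *data, size_t size,
                    const char **phonetic,
                    const char **meaning, size_t *meaning_len)
{
    const char *nul = memchr(data, '\0', size);

    if (nul == NULL)
        return -1;
    *phonetic = data;
    *meaning = nul + 1;
    *meaning_len = size - (size_t)(nul + 1 - data);
    return 0;
}
```
+
+Because the meaning is not '\0'-terminated, a caller has to copy
+meaning_len bytes and add a terminator before treating it as a C string.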
+
+
+Type identifiers
+----------------
+
+Here are the single-character type identifiers that may be used with
+the "sametypesequence" option in the .ifo file, or may appear in the
+.dict file itself if the "sametypesequence" option is not used.
+
+Lower-case characters signify that a field's size is determined by a
+terminating '\0', while upper-case characters indicate that the data
+begins with a 32-bit integer that gives the length of the data field.
+
+'m'
+Word's pure text meaning.
+The data should be a utf-8 string ending with '\0'.
+
+'l'
+Word's pure text meaning.
+The data is NOT a utf-8 string, but is instead a string in locale
+encoding, ending with '\0'. Sometimes using this type will save disk
+space, but its use is discouraged.
+
+'g'
+A utf-8 string which is marked up with the Pango text markup language.
+For more information about this markup language, see the "Pango
+Reference Manual."
+You might have it installed locally at:
+file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html
+
+'t'
+English phonetic string.
+The data should be a utf-8 string ending with '\0'.
+
+Here are some utf-8 phonetic characters:
+θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑ
+æɑɒʌәєŋvθðʃʒːɡˏˊˋ
+
+'y'
+Chinese YinBiao.
+The data should be a utf-8 string ending with '\0'.
+
+
+'W'
+wav file.
+The data begins with a 32-bit integer in network byte order giving the wav
+file's size, immediately followed by the file's content.
+
+'P'
+png file.
+The data begins with a 32-bit integer in network byte order giving the png
+file's size, immediately followed by the file's content.
+
+'X'
+This type identifier is reserved for experimental extensions.
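+
+The lower-case/upper-case rule is enough to walk a word's data when
+"sametypesequence" is not used. The following sketch only counts the
+fields; count_fields() and be32() are names made up for this example,
+and data/size again stand for the bytes given by the .idx entry.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit number stored in network byte order. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Count the typed fields in one word's data.
 * Returns the number of fields, or -1 if the data is malformed. */
static int count_fields(const unsigned char *data, size_t size)
{
    size_t pos = 0;
    int n = 0;

    while (pos < size) {
        unsigned char type = data[pos++];
        size_t flen;

        if (type >= 'A' && type <= 'Z') {
            /* Upper case: 32-bit length, then that many bytes. */
            if (size - pos < 4)
                return -1;
            flen = be32(data + pos);
            pos += 4;
            if (size - pos < flen)
                return -1;
        } else {
            /* Lower case: string ending with '\0'. */
            const unsigned char *nul = memchr(data + pos, '\0', size - pos);
            if (nul == NULL)
                return -1;
            flen = (size_t)(nul - (data + pos)) + 1;
        }
        pos += flen;
        n++;
    }
    return n;
}
```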
+
+
+
+{5}. Tree Dictionary
+
+The tree dictionary support is used for information viewing, etc.
+
+A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz
+and sometreedict.dict.dz.
+
+It is better to compress the .tdx file, as it is always loaded into memory.
+
+The .ifo file has the following format:
+
+StarDict's treedict ifo file
+version=2.4.2
+[options]
+
+Available options:
+
+bookname= // required
+tdxfilesize= // required
+wordcount=
+author=
+email=
+website=
+description=
+date=
+sametypesequence=
+
+wordcount is only shown in the info view of the dictionary management dialog,
+so it is not important in a tree dictionary.
+
+The .tdx file is just the word list.
+
+-----------
+
+The word list is a tree list of word entries.
+
+Each entry in the word list contains four fields, one after the other:
+word_str; // a utf-8 string terminated by '\0'.
+word_data_offset; // word data's offset in .dict file
+word_data_size; // word data's total size in .dict file. it can be 0.
+word_subentry_count; // how many sub-entries this entry has, 0 means none.
+
+Each entry is immediately followed by its subentries. The resulting order is
+the same as a tree list with all of its nodes expanded, read from top to
+bottom.
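+
+That ordering lends itself to a recursive walk. The sketch below counts
+one entry together with all of its descendants. It assumes that
+word_subentry_count is a 32-bit number in network byte order like the
+other two numeric fields, which this file does not state explicitly;
+tdx_walk() and be32() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit number stored in network byte order. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Visit the entry at buf[pos] and, recursively, all of its subentries.
 * *count is incremented once per entry. Returns the position just past
 * the whole subtree, or 0 if the data is truncated. */
static size_t tdx_walk(const unsigned char *buf, size_t len, size_t pos,
                       unsigned *count)
{
    const unsigned char *nul;
    uint32_t kids, i;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    /* Need the '\0' plus three 32-bit fields (assumed width). */
    if (nul == NULL || (size_t)(buf + len - nul) < 1 + 12)
        return 0;
    kids = be32(nul + 9);               /* word_subentry_count */
    pos = (size_t)(nul - buf) + 1 + 12;
    (*count)++;
    for (i = 0; i < kids; i++) {
        pos = tdx_walk(buf, len, pos, count);
        if (pos == 0)
            return 0;
    }
    return pos;
}
```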
+
+The .dict file's format is the same as the normal dictionary.
+
+
+
+{6}. More information.
+
+You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
+"src/tools/*.cpp" for more information.
+
+If you have any questions, email me. :)
+
+Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
+English.
+
+Hu Zheng <huzheng_001@163.com>
+http://forlinux.yeah.net
+
+2003.11.11
+