3.1 The General Format
3.2 Data Types
Now that the concepts have been introduced, it is time to explain how the data will be stored. The approach may appear unusual to some readers because we use neither XML nor databases. The reason is that the data should be easy to read and edit after a short introduction to the format. No heavy tool should be necessary for this, although it should be possible to let the computer turn the data into other formats for publishing.
This last point demands a strict grammar and suggests the use of XML. However, if you have ever tried to read or edit an XML file using only a text editor, you will agree that it is not easy. For this reason we go back to the roots and use one of the oldest concepts of data storage: putting data into simple text files that follow a grammar developed particularly for this project. This grammar is simple enough to understand and use with some experience.
This chapter first explains the general structure of the data in prose. We will then dig further into some important issues. The elements are explained individually in the next chapter.
It has been mentioned that all data is put into text files. This means that all the bits and bytes of the files are interpreted as characters and are processed in this form. A file format is formalized by declaring rules that describe which characters may follow each other in what order (called syntax) and what these sequences mean (semantics). There are many options for defining the rules. One may define general rules that suit pretty much all kinds of data but result in 'ugly' text files (such as XML), or one can tailor the rules to the very special needs of a certain application.
This project takes the second option. The syntax (i.e., which sequences of characters are considered legal) is inspired by the C programming language. A very prominent feature of this language is that it has very little noise: it keeps the number of characters necessary for bookkeeping very low. The advantage is that when reading such a file you have to scan fewer characters to get to the point. And when editing, the advantage of course is that you have to type less.
But enough advocacy; let's start with the explanation of the actual format.
In order to fulfill the readability requirement it is best to follow the way natural languages group their script. First of all we have letters, digits, and symbols. They all have a different purpose. Letters represent sounds, digits form numbers and thus mathematical concepts, and symbols help the reader decode the text. The period, for instance, marks the end of a sentence.
One important element of text is missing from the above list: the white space between the elements mentioned. It is very important, indeed, because it allows grouping of characters and words. Textwithoutwhitespaceisveryhardtoread.
With this we have all elements of a text and therefore of a text file. Letters and digits contain, if properly arranged, the actual information (in natural languages: words and numbers), white space separates the smallest units (again words and numbers), and symbols allow structuring the text on a higher level (phrases, sentences). If we use this concept we should be able to produce text that reads well.
Of course we will, and this is what we get. First, there are words. They form the units of information and are made of letters, digits, and the symbols dash (-), forward slash (/), vertical bar (|), plus sign (+), question mark (?), and period (.)[That the period is part of the 'letters' is somewhat strange. However, many computer languages use it to separate elements within a hierarchy, so this is rather common in computer science]. A word ends right before a white space character, i.e., space, tab, or line feed. The next word starts after the white space is over. This means that if a white space character is followed by another white space character they both form a single white space, not two white spaces.
In proper typesetting, however, a punctuation mark directly follows a word with no interrupting white space. This means that the word has to end before the punctuation mark. These are the punctuation marks of our language, called specials: colon (:), semi-colon (;), asterisk (*), curly and square brackets, and the greater-than sign (>). Their meaning will be explained below.
A really bad problem remains. Say we want to enter the name of a company. This name may contain white space, as in 'Deutsche Bahn AG'. However, within our language, the name has to be a single word because it is a single unit of information. Now, that's a contradiction. Whenever there is a contradiction you have to enhance your concept. Again we steal a principle from natural languages, in this case, quotation.
We say that a word that starts with a single or double quotation mark continues until we hit another quotation mark of the same kind, regardless of white space. 'Quotation mark of the same kind' means that if a quotation starts with a double quotation mark it has to end with one; thus you can include any number of single quotation marks, which will all be regular parts of the resulting word. (On a side note: We are talking about the so-called vertical quotation marks that you can find on your keyboard, not the typographic versions.)
That is, except for one special case, the basic principle of our language. We will describe it more thoroughly in the following sections, including the ominous special case. The formal definition, which is called grammar, can be found in chapter 5.
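The rules described so far can be sketched in a few lines of Python. This is only an illustration, not the project's actual reader: the function name tokenize and the token tuples are made up for this example, any character that is neither white space, a special, nor a quotation mark at the start of a word is simply treated as part of a word (which also lets the comma of kilometer values through), and the special case mentioned above is not handled.

```python
WHITESPACE = set(" \t\n")          # space, tab, line feed
SPECIALS = set(":;*{}[]>")         # the punctuation marks called specials
QUOTES = set("\"'")                # vertical quotation marks only
WORD_END = WHITESPACE | SPECIALS   # an unquoted word stops before any of these


def tokenize(text):
    """Split text into ('word', value) and ('special', char) tokens."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in WHITESPACE:
            # Runs of white space collapse into a single separator.
            i += 1
        elif ch in SPECIALS:
            tokens.append(("special", ch))
            i += 1
        elif ch in QUOTES:
            # A quoted word runs to the next quotation mark of the same kind.
            end = text.find(ch, i + 1)
            if end == -1:
                raise ValueError("quotation is never closed")
            tokens.append(("word", text[i + 1:end]))
            i = end + 1
        else:
            start = i
            while i < len(text) and text[i] not in WORD_END:
                i += 1
            tokens.append(("word", text[start:i]))
    return tokens
```

Calling tokenize('name - "Klein Tupfingen" ;') yields three words followed by the special ';', with the quoted name kept as one word.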
In the course of the last chapter we introduced the things we would like to be able to describe: lines, routes, operation posts, companies. One particular such thing will be called an object, e.g., the line 91200 or the station Hinterkleintupfingen.
However, the concepts introduced explain not a single object but all similar objects; they explain all lines, all operation posts, etc. The process of assigning objects to groups of similar objects is called classification and the groups thus are called classes.
This is what we want to describe: objects that are part of object classes. All information we want to state belongs to a very specific object. We thus start every statement by giving that object. We choose the form '<object class> <object name>' because this way even a rather dumb program instantly knows what is going on. We humans can easily determine that the object '92100' must be a line, but implementing such a mechanism in a program can become quite difficult.
But what is '<object class>'? It simply is a word that describes the class. Since this is a German project we use German words, e.g., 'strecke' for the class of lines. '<object name>', on the other hand, has to uniquely identify the object. We have already introduced means to do so: line number, route name, and the key names of operation posts and companies.
We will call a single bit of information in its formal representation a fact. For instance, the date of opening of a line is such a fact, as is the rank of a station. A fact itself is not yet assigned to an object; it is merely the information, like 'opened 12 Oct 1893' or 'rank station'.
If you assign the fact to an object, i.e., a line or operation post, the result is called an entry. Since there are many things to be said about a certain object, it is generally a good idea to be able to collect more than one fact in each entry. On the other hand, it can be useful to split the facts about a certain object into multiple entries. It is, for instance, a good idea to state the positions of the operation posts of a line in their real order within a file dedicated to the line. However, the other facts for the operation posts, such as name or rank, must only be given once. Thus, for a station that lies on many lines, there will be an entry in each file for those lines.
What, then, does an entry look like? Let's do a (fictitious) example of a (fictitious) station Klein Tupfingen on the (fictitious) line 92130:
betrst Klein-Tupfingen {
    lage -          92130/12,30            [Mueller02];
    rang -          Bf                     ;
    rang 1982-12-01 Hp
                    (Stilllegung Stw Tf)   [Mueller02];
    name -          "Klein Tupfingen"      ;
}
Nice. So what does it say? The first two words state the object as introduced in the last section. Here we have the operation post (German: Betriebsstelle, the class identifier is its abbreviation betrst) with the key name Klein-Tupfingen. Since we cannot use white space within object names and don't want quotation we have to replace the space in 'Klein Tupfingen' by a hyphen.
The following opening curly bracket says that the facts for this object start now. And indeed there are four of them, defining the post's position (German: Lage), rank (Rang), and name (Name). Every fact starts with a word declaring what this fact is all about. We will call this the fact type. In the example we have facts of three different types, namely lage (for position), rang (rank), and name (your guess). All facts then continue with a number of words which are called arguments, in the mathematical sense of the word. How many arguments there are and what they are supposed to mean depends on the fact type.
The three types of our example all have two arguments. The first states a date and the second the position, rank, or name. The date '-' means that the earliest possible date should be used which, in this case, probably is the opening of line 92130. The rank fact appears twice, which tells us that the post was opened as a station (German: Bahnhof, abbr. Bf) and was turned into a simple stop on 1 Dec 1982.
In addition to the words demanded by the fact type (some types have optional arguments), a comment is allowed, which should be enclosed in parentheses. The third fact has one, but again it is in German. This is a little unfortunate, but since the comment is supposed to contain something that cannot be turned into a formal fact, there is no other way to do this. We could demand that comments be in English, but this is about the German railways and thus they should be in German.
Finally, the sources of the fact are noted in square brackets. The source is given by an identifier and what exactly that is, is stated elsewhere by quelle objects.
Each fact ends with a semi-colon. Because of this, line breaks have no meaning. The comment in the third fact makes it somewhat long. It has been broken to fit into a 78-character line, which used to be standard on older displays and still is on more primitive printers. We will stick to this limit since 78 characters is a very good line length for easy reading. Additionally, you can see from the example that indentation is used to make reading even simpler: similar parts of facts start at the same column.
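To make the structure of an entry more tangible, here is a rough Python sketch that takes an entry like the one above apart. It is a simplification under several assumptions of ours: the name parse_entry and the dictionary layout are invented for this illustration, multiple source identifiers are assumed to be separated by white space, and semi-colons, parentheses, or square brackets inside quoted words are not supported. The real rules are those of the grammar in chapter 5.

```python
import re

# <object class> <object name> { facts }
ENTRY_RE = re.compile(r"\s*(\S+)\s+(\S+)\s*\{(.*)\}\s*", re.DOTALL)
# A word is either quoted (single or double) or a run of non-white-space.
WORD_RE = re.compile(r"\"[^\"]*\"|'[^']*'|\S+")
COMMENT_RE = re.compile(r"\(([^)]*)\)")
SOURCES_RE = re.compile(r"\[([^\]]*)\]")


def unquote(word):
    """Drop the surrounding quotation marks of a quoted word."""
    if len(word) >= 2 and word[0] == word[-1] and word[0] in "\"'":
        return word[1:-1]
    return word


def parse_entry(text):
    """Return the object class, object name, and list of facts of one entry."""
    match = ENTRY_RE.fullmatch(text)
    if match is None:
        raise ValueError("not an entry")
    object_class, object_name, body = match.groups()
    facts = []
    # Simplification: every semi-colon is taken to end a fact.
    for raw in body.split(";"):
        if not raw.strip():
            continue
        comment = COMMENT_RE.search(raw)
        sources = SOURCES_RE.search(raw)
        # What remains after removing comment and sources is
        # the fact type followed by its arguments.
        rest = SOURCES_RE.sub(" ", COMMENT_RE.sub(" ", raw))
        words = [unquote(w) for w in WORD_RE.findall(rest)]
        if not words:
            raise ValueError("fact without a fact type")
        facts.append({
            "type": words[0],
            "args": words[1:],
            "comment": comment.group(1) if comment else None,
            "sources": sources.group(1).split() if sources else [],
        })
    return {"class": object_class, "name": object_name, "facts": facts}
```

Fed with the Klein-Tupfingen example, this returns the class betrst, the name Klein-Tupfingen, and four facts, the third of which carries the comment 'Stilllegung Stw Tf'. Line breaks play no role because the body is split at the semi-colons only.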
We have reached quite some level of formalisation but it is still not enough. Currently, if we write down words, a computer program can take these words as what they are (e.g., the name of a post) or it can compare them to predefined words (the object classes or fact types). But if we state the position of a post we want the program to determine line and point of the post. Or if we give dates, the computer should be able to compare or sort them.
For this we need a formal definition of positions and dates. This is what we will introduce in this section. Again, we will merely explain them in prose. The rules can be found in chapter 5.
There are many ways to write down a date. We will use the form propagated by DIN and ISO, that is, the order is year, month, day each of which is connected by a hyphen. 6 Dec 1982 thus will be written as 1982-12-06. The parts always have to consist of 4 (year) or 2 (month, day) digits. The 6 in the example thus has to be converted into a 06.
If we don't know the exact day or not even the month we don't state them. 1982 and 1982-12 both form valid dates. If we are not sure about a date we precede it with a small c for circa: c1980 means around the year 1980.
Not quite enough. We want to be able to state a period and an alternative. The former will for instance appear when a line is demolished, which may take some days. The period is stated by the slash operator /, which is put between two dates, such as 1974-06-03/1974-06-18 for a period from 3 to 18 Jun 1974. Note that there must be no white space between the dates and the operator or within the dates because that would end the word and thus the date.
Alternatives are given using the or operator |. There may be more than two alternatives. Then all are separated by vertical bars.
Finally, we need an abbreviation, that is, a date that only consists of a hyphen '-'. It is supposed to mean 'as early as possible'. It will be used if we have to describe the state of the line and its stations upon first opening. Giving the date of this opening again and again is somewhat tedious and correcting the date later on would be a horrible task. The computer will happily provide us with the date should this become necessary so we just give it the hyphen.
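The date rules of this section can be summarised in a short Python sketch. The names parse_single and parse_date and the tuple layout are our own invention for this illustration. We also assume that alternatives are split first and that each alternative may itself be a period; the text above does not say how the two operators combine. Month and day ranges are not checked, and doubt markers are ignored here.

```python
import re

# Optional 'c' for circa, four-digit year, optional two-digit month and day.
DATE_RE = re.compile(r"(c?)(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")

EARLIEST = "earliest"  # stands for the lone hyphen, 'as early as possible'


def parse_single(text):
    """Parse one date into (circa, year, month, day); unknown parts are None."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a date: {text!r}")
    circa, year, month, day = match.groups()
    return (
        bool(circa),
        int(year),
        int(month) if month else None,
        int(day) if day else None,
    )


def parse_date(word):
    """Parse a date word into a list of alternatives.

    Each alternative is a tuple holding one date or, for a period,
    the two dates that delimit it.
    """
    if word == "-":
        return EARLIEST
    alternatives = []
    for alternative in word.split("|"):
        parts = tuple(parse_single(p) for p in alternative.split("/"))
        if len(parts) > 2:
            raise ValueError(f"a period has exactly two dates: {alternative!r}")
        alternatives.append(parts)
    return alternatives
```

Because every part has a fixed number of digits, a value like 1982-6 is rejected rather than silently read as June.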
Quite often we have to describe where a certain object is located within the network or where a certain event happened. We have seen in the last chapter that it is convenient to use a route and a kilometer value for this route. This allows three distinct designations: line/route and kilometers, kilometers only, and line/route only (although the latter does not describe a point). We will use the terms network point, line point, and line, respectively. This means that if we speak of lines in this context this also includes routes.
A line point is used if we are already in the context of a certain line. It would be silly to state the line number in this case. When giving a line point we have two options. We can either explicitly give a kilometer value or we give the key name of an operation post that is located on the line. In this case the location of the post is used. This is, of course, not possible when defining the location of a post.
For stating actual kilometers the German way of writing decimal fractions is used. The only difference from the English notation is that a comma is used instead of a point to delimit the integer and fractional parts. Of the fractional part, exactly as many places are given as are known. This means that if a position is known to 10 meters we use two places even if the last digit is a zero. If the last zero were omitted, we would have to assume that the value is exact only to 100 meters.
It again becomes a little more tricky if re-routing comes into play. Everything is fine if the line is shorter after construction works have finished. In this case a certain range in the line's kilometering becomes invalid. The kilometers leap at a certain point. If, on the other hand, the line becomes longer we have to insert some extra meters. This insert is given relative to the kilometer value of the point where it begins. The kilometer value of this point is suffixed by a plus sign + followed by the distance we are into the insertion in meters. If we have values that are exact only to 10 or 100 meters we replace the unknown digit by a dot.
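How the number of decimal places carries the precision can be shown with a small Python sketch. The function name parse_kilometer is hypothetical, and we assume here that at most three places (i.e., meters) are given; nothing in the text rules out finer values.

```python
import re

# Integer part, then optionally a comma and one to three decimal places.
KM_RE = re.compile(r"(\d+)(?:,(\d{1,3}))?")


def parse_kilometer(text):
    """Return (position in meters, precision in meters) of a kilometer value."""
    match = KM_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a kilometer value: {text!r}")
    whole = match.group(1)
    fraction = match.group(2) or ""
    # Pad the fraction to three digits to read it as meters.
    meters = int(whole) * 1000 + int(fraction.ljust(3, "0"))
    # Each missing decimal place makes the value ten times less exact.
    precision = 10 ** (3 - len(fraction))
    return meters, precision
```

With this, 12,30 and 12,3 both lie at 12300 meters, but the first is exact to 10 meters and the second only to 100 meters.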
To clarify this let's do an example. The picture to the left shows a line that has been re-routed. Originally the line went straight from A to D. Later something was to be built there and the line was re-routed to go in a long curve via B and C. On this new curve the kilometers continue normally from A to B. Here we reach the value of 53,120, which is the value that D originally had. We have to start an insertion. C is a post on the new line. However, we only know its location to 100 meters and thus have to state its location as '53,120+1..'. The route ends in D, where the insertion also ends. The new route is 452 m longer than the old one, which is the length of the insertion. D is the last point to get a new kilometer value assigned (its old value is already taken by B), which will be '53,120+452'. All points beyond D keep their old kilometers.
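The insertion notation of this example can be read by a program along the following lines. This Python sketch is only an illustration: the name parse_line_point and the returned tuple are assumptions of ours, and line points given as key names of operation posts are not covered.

```python
import re

# Kilometer value, optionally followed by '+', known digits, and dots
# that stand for the unknown trailing digits of the insertion distance.
POINT_RE = re.compile(r"(\d+(?:,\d+)?)(?:\+(\d+)(\.*))?")


def parse_line_point(text):
    """Return (kilometer value, insertion meters, insertion precision).

    The last two items are None if the point is not inside an insertion.
    """
    match = POINT_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a line point: {text!r}")
    base, digits, dots = match.groups()
    if digits is None:
        return base, None, None
    # Every dot replaces one unknown digit: '1..' is 100 m, exact to 100 m.
    precision = 10 ** len(dots)
    return base, int(digits) * precision, precision
```

For the points of the example, C is read as 100 meters into the insertion that starts at 53,120, exact to 100 meters, and D as 452 meters into it, exact to the meter.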
Now that we can describe line points it is easy to extend this to describe network points. All we need is a line designator, which we have already invented: line and route numbers. We combine them with the kilometer values through a forward slash. Thus if the line from the above picture were line 92130 we could describe C network-wide as '92130/53,120+1..'. If the curved route were named 92130Ax2 we could also describe C as '92130Ax2/53,120+1..'. We have to do this if we want to give events from before the route became an active part of line 92130.
Quite often we have to input sections, for instance, if we want to state the opening date of part of a line. Such a section always consists of two line points, namely the starting point and the end point. Both are separated by the > operator. Because > is an operator there is no need to insert white space between it and the line points; however, you are free to do so nonetheless.
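Taking a network point apart is then a matter of cutting at the slash. A minimal Python sketch, with the hypothetical name split_network_point and under the assumption that line and route designators never contain a slash themselves:

```python
def split_network_point(text):
    """Split a network point into (line or route designator, line point)."""
    # Cut at the first forward slash only.
    line, slash, point = text.partition("/")
    if not slash or not line or not point:
        raise ValueError(f"not a network point: {text!r}")
    return line, point
```

Both designations of C from the example yield the same line point, once under the line 92130 and once under the route 92130Ax2.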
There is one special case. Whenever the start point or end point of a line is to be given you may do so through the * operator.
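A section can be read as sketched below in Python. The name parse_section and the two marker strings are invented for this illustration. We assume that an asterisk in front of the > operator means the start point of the line and an asterisk behind it the end point; the text does not spell this out.

```python
LINE_START = "line start"  # marker for '*' given as starting point
LINE_END = "line end"      # marker for '*' given as end point


def parse_section(text):
    """Split a section into its starting point and end point."""
    start, operator, end = text.partition(">")
    # White space around the operator is allowed but not required.
    start, end = start.strip(), end.strip()
    if not operator or not start or not end or ">" in end:
        raise ValueError(f"not a section: {text!r}")
    return (
        LINE_START if start == "*" else start,
        LINE_END if end == "*" else end,
    )
```

The two line points are returned as plain words and still have to be interpreted as kilometer values or key names of operation posts.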
Occasionally we will have to express doubt, for instance about dates. There are two kinds of doubt. It may already be present in our sources (you could call this 'official' doubt) or we obtained the data in rather unusual ways and, therefore, are not really sure if they are trustworthy. For example, you can derive the position of a station by determining the distance from a known station as given in a schedule. Since there are always rounding errors we may want to show that the actual position may be off by one digit or so.
To express doubt we, of course, use question marks. A single question mark is for 'official' doubt and two for doubt raised by the editor. The question marks are always suffixed to the data. If we doubt the position of a station, this may look like this: '92130/23,1??'.
The double question mark may be used without any data. In this case we don't know the data. The most common cases are that we don't know the position of a post or that we know that something happened but not when.
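Separating the doubt from the data could look like the following Python sketch. The name split_doubt and the labels 'none', 'source', and 'editor' are our own choices for this illustration; we also assume that a single question mark without data is an error, since only the double question mark is described as standing alone.

```python
def split_doubt(word):
    """Return (data, doubt) where doubt is 'none', 'source', or 'editor'.

    The data is None if the word consists of the double question mark only.
    """
    data = word.rstrip("?")
    marks = len(word) - len(data)
    if marks > 2:
        raise ValueError(f"too many question marks: {word!r}")
    if data == "":
        # Only '??' may stand alone; it means the data is unknown.
        if marks != 2:
            raise ValueError(f"no data given: {word!r}")
        return None, "editor"
    return data, ("none", "source", "editor")[marks]
```

The position '92130/23,1??' from above thus splits into the network point 92130/23,1 and doubt raised by the editor.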
Last updated on 8 Aug 2003.
Please mail your comments, suggestions, and complaints to
Martin Hoffmann <hn@nvnc.de>.