Building Blocks: 1 – Data
Data is everything (and everything is data).
For a long time I had a fundamental misconception about programming. I believed that systems were walled off from each other, isolated, each an independent functioning whole, and that achieving communication between systems was difficult or impossible if the systems were not built on a common architecture.
The gap in my knowledge was the understanding that all systems are built of a series of message exchanges – and the content of each message is data. All that happens in any information system is the continuous exchange of data, between components, between systems, between users and between human and computer.
So the big question is then: what is data? Well, like I said at the top – everything is data and data is everything. To be more precise, the content of every message is data. In our human world we exchange data on a massive scale every day – every word, gesture, facial expression, printed or digitised word and image, taste, aroma, sound and texture becomes data that we readily exchange. In human-computer interaction, the palette of data types is much smaller – basically every message consists of text.
My misunderstanding grew out of the semantic and syntactic differences between the programming languages I learned. Knowing that, for example, I couldn’t mix .asp and .php in a webpage, how could a system built with .asp communicate with a system built with .php? But humans built computers and modelled them on their own behaviour. An English speaker can communicate with a French speaker despite their different languages – they just need to standardise the messages they exchange into a mutually intelligible format.
Messages exchanged between computers must be structured to obey expected and consistent rules. The same conventions apply to human language and structured text formats:
- messages must have a discernible beginning and end
- constituent parts of messages must be distinguished and defined
- rules must be applied consistently.
It seems incredible to be able to define communication with only three rules but that’s about it – think of something you’d say to somebody in English – you can classify it using just those three rules.
There are a handful of structured text formats in common use:
- Name/value pairs: the simplest structured data format consists of a sequence of pairs in the format Name=Value, separated by ampersands. This is the format of the query string passed in the URL of an HTTP GET request. In a URL the start of the message is signified by the ? character and the message ends at the last character being passed. In other contexts name/value (NV) pairs are sent as a distinct unit of data that starts with the first character and ends with the last. It is possible to add further conventions to introduce more structure, such as splitting a value into multiple values by use of additional delimiters, e.g.:
?name=forename_john|surname_smith
would describe a data item name comprised of two units, forename and surname. Largely due to their use in URL requests, and the limitations on data length and reserved characters that this imposes, NV pairs are most commonly found transmitting data at a smaller scale and lower complexity. NV pairs are good for sending instructions and identifying items.
- Delimited text (CSV, tab-delimited text etc.): this format uses physical position as a syntactic structure and in this respect is far removed from natural, human communication. A delimited text comprises a list of definitions or labels and then a set of values. The individual labels and values are separated by a consistent delimiting character, commonly a comma, and then each complete set of labels or values is separated by another delimiter, most often a newline character:
title,forename,surname,age
mr,john,smith,32
CSV data allows only a single level of hierarchy: it describes a two-dimensional dataset. It is compact, though – the semantic definitions are stated only once, in the initial line. Delimited text is less human-readable and, in its simplest form, imposes a serious restriction on the content by reserving two characters that must not appear except as delimiters (CSV dialects commonly work around this by wrapping such values in quotes). In this way, delimited text structures are prone to structural error due to the simplification of their data definition rules.
- XML and JSON are tightly but flexibly structured text formats. Each allows user definition of labels and user imposition of hierarchies of potentially infinite depth. Formal structural rules describe only the formats for signifying the beginning and end of data items and the delimiter between labels and values.
XML is a verbose format that defines data items between pairs of labels:
<person id="2">
  <title>mr</title>
  <forename>john</forename>
  <surname>smith</surname>
  <age>32</age>
</person>
The base units of data are described by adding child elements (title, forename, surname etc.) and by annotating labels with attributes (the id attribute above); as an extensible structure, there are no rules that define which method should be used. There are defined standards for XML that describe the characters that can be used but, considered purely as a structural convention, the only limitations are on the use of characters that also serve as delimiters. By virtue of its verbosity and its use of full text labels as delimiters, XML is a highly human-readable as well as machine-readable format.
JSON is a less verbose structure that was originally employed to construct objects in JavaScript programming. The format uses pairs of braces to define the bounds of each data item and plain text labels to define the role of values – names and values are separated by a colon, and pairs are separated by commas. In strict JSON the labels are themselves wrapped in double quotes. Complex and hierarchical data structures can be created by nesting objects.
{
  "title": "mr",
  "forename": "John",
  "surname": "Smith",
  "age": 32,
  "pet": {
    "type": "dog",
    "name": "Fido"
  }
}
As with other delimited structures, characters used as delimiters cannot be used in data values unless they are ‘escaped’ – rendered in a format that distinguishes their role as data value and not structure.
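To make the point about escaping concrete, here is a minimal sketch in Python (my choice of language for illustration, not something the formats require). It pushes one awkward value through each of the four formats using only the standard library, and returns both the escaped text and the value recovered from it. The function names and the sample value are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode

# Hypothetical value that contains delimiters reserved by every format.
VALUE = 'Smith & Jones, "Fido" <dog>'


def round_trip_query(value):
    """Name/value pairs: reserved characters are percent-encoded."""
    encoded = urlencode({"name": value})
    decoded = parse_qs(encoded)["name"][0]
    return encoded, decoded


def round_trip_csv(value):
    """CSV: the field is wrapped in quotes and inner quotes are doubled."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerow([value])
    encoded = buffer.getvalue()
    decoded = next(csv.reader(io.StringIO(encoded, newline="")))[0]
    return encoded, decoded


def round_trip_json(value):
    """JSON: double quotes inside a string are backslash-escaped."""
    encoded = json.dumps({"name": value})
    decoded = json.loads(encoded)["name"]
    return encoded, decoded


def round_trip_xml(value):
    """XML: markup characters in text become entities such as &amp;."""
    element = ET.Element("name")
    element.text = value
    encoded = ET.tostring(element, encoding="unicode")
    decoded = ET.fromstring(encoded).text
    return encoded, decoded


if __name__ == "__main__":
    for round_trip in (round_trip_query, round_trip_csv,
                       round_trip_json, round_trip_xml):
        encoded, decoded = round_trip(VALUE)
        print(round_trip.__name__, "->", encoded.strip())
        assert decoded == VALUE  # the data survives the escaping
```

Each format has its own escaping convention, but in every case the receiver gets back exactly the value the sender started with, provided both sides apply the same rules.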
So, equipped with a standardised way of structuring a message, as long as the sender and recipient apply the same rules to parse the message, the systems can communicate, whatever their architecture.
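As a rough illustration of that claim, the sketch below takes the same person record written in each of the four formats and decodes all of them into one identical Python dictionary. It is only a sketch under assumptions: the message constants and function names are made up for this example, and Python's standard library parsers stand in for whatever the receiving system would really use.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

# The same (hypothetical) record, expressed in four message formats.
NV_MESSAGE = "title=mr&forename=john&surname=smith&age=32"
CSV_MESSAGE = "title,forename,surname,age\nmr,john,smith,32\n"
XML_MESSAGE = (
    "<person><title>mr</title><forename>john</forename>"
    "<surname>smith</surname><age>32</age></person>"
)
JSON_MESSAGE = (
    '{"title": "mr", "forename": "john", "surname": "smith", "age": 32}'
)


def normalise(record):
    """Most text formats carry every value as a string, so convert age."""
    return {**record, "age": int(record["age"])}


def from_nv(message):
    # parse_qs returns a list of values per name; keep the first of each.
    pairs = parse_qs(message)
    return normalise({name: values[0] for name, values in pairs.items()})


def from_csv(message):
    # The first line supplies the labels, the second line the values.
    row = next(csv.DictReader(io.StringIO(message)))
    return normalise(dict(row))


def from_xml(message):
    # Each child element is one label/value pair.
    root = ET.fromstring(message)
    return normalise({child.tag: child.text for child in root})


def from_json(message):
    return normalise(json.loads(message))


if __name__ == "__main__":
    records = [
        from_nv(NV_MESSAGE),
        from_csv(CSV_MESSAGE),
        from_xml(XML_MESSAGE),
        from_json(JSON_MESSAGE),
    ]
    # Four different wire formats, one and the same record.
    assert all(record == records[0] for record in records)
    print(records[0])
```

The receiving code never needs to know what language or platform produced the message, only which format it is in.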
It comes almost as a disappointment to discover that really there is no magic to it, that it’s not a complex and arcane art to make an ActionScript component talk to a Java component. All we need to do is provide a common means of communication.
All information systems are built on this basic premise; the world wide web is just the post office that routes data from server to client, the cloud is just an internet of systems exchanging messages and instructions.
Three simple rules have defined human communication – that’s the magic.
