Unstructured vs semi-structured data: Order from chaos

We look at alternatives to relational databases that have emerged to help bring some structure to unstructured data and gain valuable insight by rendering it semi-structured.

Antony Adshead


Published: 02 Sep 2022

Structured vs unstructured data: it’s a standard way of categorising things. But it’s not quite that simple.

Although structured data is simple to grasp, the world of unstructured data, and its transformation into more easily understandable, usable and analysable semi-structured data, is less simple.

In this article, we look at structured data, unstructured data, and how semi-structured data brings some order from potential chaos. It also brings advantages to organisations that are looking to gain value from often large stores of documents, images, sound files, video, social media posts, and so on.

Structured data has… structure

Business information is mainly generated by systems or people. Data from systems is most likely to be structured.

In its traditional form, that is most typified by data in relational databases that use SQL (structured query language). In these, structure is everything. Columns that represent variables are created beforehand and populated by rows of data, where a value sits at the intersection of each.

It’s something we can all visualise. It’s like what we see in a spreadsheet (though whether spreadsheets are structured data is up for debate), but complex SQL database schemas are the equivalent of numerous spreadsheets (tables, in database-speak) that relate (whence relational) to one another and can be filtered, joined and manipulated in many ways because they have common elements (keys).
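To make the idea of tables related by keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and values are invented for illustration and do not come from any particular system.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)])

# Join the two tables on their common key and aggregate per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id), SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.name
""").fetchall()
print(rows)
```

The columns are fixed before any row is inserted, and the shared customer_id key is what lets the two tables be joined.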

Despite the prevalence of unstructured data and the rise of formats that are better described as semi-structured, structured databases are essential and won’t disappear any time soon.

They’re easy to use, by anything from large-scale enterprise applications to machine learning tools, but can be limited in how they’re accessed and used, and can be relatively onerous to maintain and to change once initially configured.

The mass of unstructured data

Unstructured data is often generated by people, but not solely, and includes media such as images and sound recordings, social media posts, agent notes, web pages and emails.

Unstructured data holds to no predefined data model, and files and objects come in a wide range of sizes, from a few kilobytes for a social media post, for example, to potentially terabytes for uncompressed video.

Estimates often claim that the vast majority of data is unstructured, at around 80% or 90% of data held by organisations.

If this is the case, and we can safely assume it often is, then this presents huge challenges for organisations. Unstructured data is, to a greater or lesser extent, undefined and opaque to search and classification.

That means organisations might not know what is actually there, and that can be a security and compliance risk. At the same time, it means missing out on opportunities to interrogate that data to gain insights and value from it.

No such thing as unstructured data?

But in fact, it is arguable that no data is truly unstructured. Even the most unstructured data you can think of (image and sound files, for example) includes metadata headers that offer high-level information on file contents that can be searched and queried.
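As a small illustration of header metadata, the sketch below uses Python's standard wave module to write a tiny WAV file in memory and then read back only its header fields. The file contents (one second of silence) and the dictionary key names are assumptions made for the example.

```python
import io
import wave

# Build a tiny WAV file in memory so the example is self-contained:
# 1 second of silence, mono, 16-bit samples, 8,000 samples per second.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(8000)
    writer.writeframes(b"\x00\x00" * 8000)

# Read only the header fields back; the audio payload itself stays opaque.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    metadata = {
        "channels": reader.getnchannels(),
        "sample_width_bytes": reader.getsampwidth(),
        "frame_rate_hz": reader.getframerate(),
        "frames": reader.getnframes(),
    }

# Derived field: duration follows from frame count and frame rate.
metadata["duration_seconds"] = metadata["frames"] / metadata["frame_rate_hz"]
print(metadata)
```

None of this says what the recording contains, but fields like these can already be indexed, filtered and searched.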

And it is increasingly possible to look at the contents of such files using artificial intelligence/machine learning techniques to, for example, examine and categorise the contents of sound and video files. YouTube does this to make sure copyright on music isn’t contravened when you upload a video, for example, so these kinds of data can be tagged with new metadata via algorithm-based interrogation, should an organisation want to throw compute at it.

The semi-structured data revolution

At the same time, there is a growing trend towards more use of semi-structured ways of holding data. Some types of semi-structured data have been around for quite a while, such as CSV and XML. A little later came JSON. Each of these brought with it something like a key:value format for representing variables and values.
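The key:value idea can be seen by expressing one record in all three formats and parsing each with Python's standard library. The record itself (a name and a city) is made up for the example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same illustrative record in three semi-structured formats.
csv_text = "name,city\nAda,London\n"
xml_text = "<person><name>Ada</name><city>London</city></person>"
json_text = '{"name": "Ada", "city": "London"}'

# CSV: the header row supplies the keys, each later row supplies values.
from_csv = dict(next(csv.DictReader(io.StringIO(csv_text))))

# XML: each child element's tag is the key, its text is the value.
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

# JSON: keys and values are explicit in the syntax.
from_json = json.loads(json_text)

print(from_csv == from_xml == from_json)
```

In each case the names of the variables travel with the data rather than living in a separate, predefined schema.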

Later came a range of ways of holding and analysing data that were not restricted by predefined structure. Generally speaking, these can be lumped together as so-called NoSQL databases, but there are a number of types within that catch-all.

They include column store databases like Cassandra and HBase (which runs on Hadoop), document stores like MongoDB and CouchDB, key-value stores like Riak, as well as graph databases, object databases, and so on. The list gets pretty long.

But what links these is the lack of the predefined structure (schema-on-write) by which SQL is defined. So, with these non-SQL formats, potentially any data in any existing format, ie unstructured, can be given a structure (schema-on-read) as data is queried. It is even possible to include sound and video files (the ultimate in unstructured-ability) in things that get called databases, such as with MongoDB (although there are limitations).
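A rough sketch of schema-on-read, in plain Python rather than any specific NoSQL product: documents are stored as they arrive, with differing fields, and a structure is applied only when they are read. The field names and the read_with_schema helper are hypothetical.

```python
import json

# Stored as-is: no schema was enforced when these were written,
# so each document carries a different set of fields.
raw_documents = [
    '{"user": "ada", "text": "Parcel arrived early", "likes": 3}',
    '{"user": "grace", "text": "Box was damaged", "tags": ["complaint"]}',
    '{"author": "linus", "body": "No tracking update"}',
]

def read_with_schema(line):
    """Apply a structure at query time, tolerating missing or renamed fields."""
    doc = json.loads(line)
    return {
        "user": doc.get("user") or doc.get("author"),
        "text": doc.get("text") or doc.get("body"),
        "likes": int(doc.get("likes", 0)),
        "tags": list(doc.get("tags", [])),
    }

records = [read_with_schema(line) for line in raw_documents]

# Once a structure has been imposed, the data can be queried uniformly.
complaints = [r["user"] for r in records if "complaint" in r["tags"]]
print(complaints)
```

With schema-on-write, the third document would have been rejected or forced into shape at insert time. Here the decision about what the data means is deferred to the reader.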

The big benefit of being able to put unstructured data into some kind of semi-structured format is that it enables a range of use cases to emerge, such as analytics to identify consumer behaviour and market trends, and sentiment analysis.

Arguably, analytics with this kind of data gives deeper insight into users. An SQL database might hold name, date of birth, address, etc, but analysing unstructured data by rendering it semi-structured can get closer to what consumers think.

It is also possible to impose some structure on the unstructured and make use of it. A picture of a delivered item would be unstructured data, but metadata from the image file could be combined with geo-tracking information from delivery vehicles in a business intelligence tool.
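As a hedged sketch of that kind of combination, the example below matches each photo's timestamp to the nearest GPS ping from the same vehicle. All file names, vehicle IDs, timestamps and coordinates are invented, and it assumes a vehicle ID has been recorded alongside each photo, which a real image file would not necessarily carry.

```python
from datetime import datetime

# Hypothetical metadata extracted from delivery photos.
photo_metadata = [
    {"file": "delivery_001.jpg", "vehicle": "VAN-7", "taken": "2022-09-02T10:15:30"},
    {"file": "delivery_002.jpg", "vehicle": "VAN-7", "taken": "2022-09-02T11:42:05"},
]

# Hypothetical geo-tracking pings from delivery vehicles.
gps_pings = [
    {"vehicle": "VAN-7", "time": "2022-09-02T10:15:00", "lat": 51.5072, "lon": -0.1276},
    {"vehicle": "VAN-7", "time": "2022-09-02T11:40:00", "lat": 51.4545, "lon": -2.5879},
    {"vehicle": "VAN-9", "time": "2022-09-02T10:15:20", "lat": 53.4808, "lon": -2.2426},
]

def nearest_ping(photo, pings):
    """Return the ping from the photo's vehicle closest in time to the photo."""
    taken = datetime.fromisoformat(photo["taken"])
    candidates = [p for p in pings if p["vehicle"] == photo["vehicle"]]
    # Raises ValueError if the vehicle has no pings at all.
    return min(candidates,
               key=lambda p: abs(datetime.fromisoformat(p["time"]) - taken))

combined = []
for photo in photo_metadata:
    ping = nearest_ping(photo, gps_pings)
    combined.append({"file": photo["file"], "lat": ping["lat"], "lon": ping["lon"]})

print(combined)
```

The image pixels are never inspected. Only the small amount of structure attached to each file is used to place the delivery on a map.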
