Rishika Jolly's Blog: Big Unstructured Data v/s Structured Relational Data

STRUCTURED DATA

Structured Data is data that is stored in an organized manner. It resides in a fixed field within a record or file. Normally data that is stored in relational databases or on spreadsheets in an orderly row and column format forms structured data. In order to store structured data, one must create a data model that specifies the business data that one has to deal with along with the data types such as numeric, alphanumeric, Boolean, etc., data constraints such as primary, referential integrity, check, not null, etc. and metadata information. Because it is well ordered, structured data can be easily entered, stored, queried and analyzed.

Structured Query Language (SQL) is most commonly used to manage structured data. SQL helps us perform several operations to analyze the data and fetch desired results. These operations include search, insert, update, delete and others.

UNSTRUCTURED DATA

As the name suggests, unstructured data refers to data that is unorganized and is unable to dwell inside a database or on a spreadsheet. Unstructured data comprises of text and multimedia content. A few examples are e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Despite the fact that business documents do follow a structured approach, their content is unable to fit in a database and hence, they are categorized as unstructured data.

Business Documents within an organization are unstructured data that contain a large amount of useful information important in strategic decision-making. In order to gain valuable insights into this data, organizations today resort to various technologies such as Hadoop, Business Intelligence software, data mining tools, data integration software and many other technological solutions.

The term "big data" is closely associated with unstructured data. Big data refers to extremely large datasets that are difficult to analyze with traditional tools. Big data can include both structured and unstructured data, but studies show that 90 percent of big data is unstructured data.

Fig 1: Structured data v/s Unstructured data

VOLUME OF DATA

Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data in enterprises is growing significantly -- often many times faster than structured databases are growing.

Fig 2: Unstructured v/s Structured Data Growth Rates

Because the volume of unstructured data is growing so rapidly, many enterprises also turn to technological solutions to help them better manage and store their unstructured data. These can include hardware or software solutions that enable them to make the most efficient use of their available storage space.

Fig 3: Technology Solutions for Structured and Unstructured Data

TYPES OF DATA

Spatial Data: Spatial data pertains to information that has several dimensions. Data that gives importance to location or position with respect to some entity is spatial in nature. For example, coordinates of a place, maps or images taken from space, remote sensing data, etc.

Integrated Operational Data: This data store is an easily changing collection of data that is a result of an organization’s daily business activities. In technical description, an integrated operational data store is a subject-oriented, integrated, volatile, current-valued, detailed-only collection of data.

Redundant Data: This kind of data refers to duplication of data. It means that the same information is replicated at multiple areas. Such data may be the cause of inefficiencies within a system.

Integrated Historical Data: This data forms part of the enterprise data that passes through the process of extract, transform and load. It is stored in a data warehouse and is static in nature. Data in a data warehouse is subject-oriented, integrated, time-variant and non-volatile. This data is also historical and serves the key purpose of statistical analysis. Thus, users may be able to study trends of their organizations relating to product sales, revenues, expenditures and others. An operational database fails to deliver such results. A few key characteristics of data warehouse that facilitate statistical analysis are:

Denormalization of data that simplifies and improves the query performance
Helps consolidate data from disparate sources
Use of historical data in studying trends and patterns
Large amounts of result-set retrieved as a result of data archival over long periods of time
No frequent updates makes it performance relaxed

Fig 4: Non-volatile nature of Data Warehouse

Fig 5: Subject-oriented nature of Data Warehouse

Fore Data: Fore data are the upfront data, which are used for describing data architecture’s objects and events. Their primary purpose is presentation and not input to some backend database process.

Legacy Data: Legacy data refers to data from disparate sources, which are old and outdated but still in use because they form the basis of an organization. E.g., XML data, hierarchical & network data, objects, etc.

Demographic Data: This data deals with information about the human population such as size, structure, distribution, and spatial and temporal changes.

LIMITATIONS OF DATA WAREHOUSE IN TERMS OF DATA ANALYSIS

Underestimation of resources of data loading: It might take long to extract, transform and load the historical data and hence, the time to develop the data warehouse would also significantly increase.

Inconsistent data: While loading data, care should be taken to see that the data is consistent or else it might result in performance degradation.

Required data not captured: In some cases, important information related to the business process under analysis is not captured by the source system but maybe important for strategic decision making.

Complexity of integration: Integration of data from various disparate sources is a highly complex task. Also, a different tool performs each task within a data warehouse and integration of all these tools also increases the complexity of implementing a data warehouse.

High maintenance: These are high maintenance systems. If there were a change in the business process (and hence, the data) or the source system that results in a change in the data warehouse, it would result in very high maintenance costs.

FUTURE OF DATA WAREHOUSING

Today, it takes years for a successful implementation of data warehouse for a particular business. Any integrations to the data warehouse, requires a lot of effort to see that there is consistency. As a result, in the future, an agile model of data warehouse is expected. This model would no longer increase the implementation speed but would also facilitate discovery-based analytics.

Moreover, cloud-based solutions for data warehousing and analytics might become the norm. The flexibility of the cloud-based solution offers performance enhancements and also a native understanding of the wide range of analytic support provided by the data such as BI Consulting, Data Services, Analytics and Big Data – Social Media. The cost of traditional on-premises offerings as well as management overhead costs will be significantly lowered.