Big data

Deriving business value from big data is a multiphase process that takes raw data and refines it into useful information. After the data has been acquired and stored, in data stores such as the Hadoop Distributed File System (HDFS), it is pre-processed to weed out the less useful data and to structure what is left for analysis. This phase of the process is often handled by using Apache Hadoop.

The new tools that are required to capture and organise big data do not mean that relational data stores are irrelevant. To capitalise on the big data opportunity, organisations must be able to analyse all types of data, both relational and nonrelational (text, sensor data, audio, video, customer transactions, and so on). However, it will be necessary to augment these traditional approaches with technology that will allow organisations to benefit from the big data era.

What is big data?

Big data refers to data sets whose size is beyond the ability of typical database software tools to process, store, and analyse. Big data is characterised not only by its size but also by its variety. With the rapid growth in the number of devices and applications that create and capture information, data has become more complex, because it includes not only traditional relational data, but also raw, semi-structured, and unstructured data from sources such as text documents, sensor data, audio, video, and customer transactions.

In a digitised world, therefore, consumers going about their day (communicating, browsing, buying, sharing, searching) create their own enormous trails of data.

Analysing such raw, semi-structured, and unstructured data enables organisations to gain a better understanding of their business, customers, and marketplace. But what solutions are available that can store and process all that data and find out what is valuable and worth exploiting?

Apache Hadoop

Apache Hadoop is a top-level Apache project, written in Java, that enables organisations to organise raw (often unstructured) data and transform it so it can be loaded into other applications for analysis. Hadoop contains two main components: the Hadoop Distributed File System (HDFS) for data storage, and the MapReduce programming model, which manages the processing of the data.

HDFS allows Hadoop to scale out workloads across large clusters of inexpensive machines to work on big data problems. HDFS is fault tolerant, which is a necessity as in clusters with potentially thousands of nodes, failure is the norm rather than the exception. (Detection of faults and quick, automatic recovery from them was a core architectural goal of HDFS.)
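
To give a feel for working with HDFS, the following commands create a directory in the cluster, copy a local file into it, and list the result. (This is an illustrative sketch only; the file and directory names are our own.)

$ hadoop dfs -mkdir /data
$ hadoop dfs -put weblogs.txt /data
$ hadoop dfs -ls /data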

MapReduce has emerged as a popular way to harness the power of large clusters of computers because it breaks a problem into independent pieces that can be worked on in parallel across many machines.

The term MapReduce refers to two tasks that MapReduce programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into key/value pairs. The reduce job takes the output from a map job as input and combines these key/value pairs into a smaller set of key/value pairs. As the name MapReduce implies, the reduce job is always performed after the map job.
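
As an illustration, consider the canonical word count example. Each map task takes a block of text and emits a (word, 1) pair for every word it finds; the reduce tasks then sum the counts for each word:

Input:          the quick fox jumps over the lazy dog
Map output:     (the,1) (quick,1) (fox,1) (jumps,1) (over,1) (the,1) (lazy,1) (dog,1)
Reduce output:  (dog,1) (fox,1) (jumps,1) (lazy,1) (over,1) (quick,1) (the,2)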

Big data and relational databases

Your big data solution will not live alone; it has to integrate with, and will likely enhance, your relational database.

Sqoop

Sqoop is a command-line utility for Hadoop that imports and exports data between HDFS (or Hive) and structured databases. Sqoop was created by Cloudera, and is now a top-level Apache project. The name Sqoop comes from "SQL to Hadoop". Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
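
The import and export directions are mirrored on the command line. For example, the following sketch imports a table from a relational database into HDFS and then exports a (possibly processed) result set back again; the connection details and table names are hypothetical:

$ sqoop import --connect jdbc:mysql://dbserver/sales --username myuser --table orders --target-dir /data/orders
$ sqoop export --connect jdbc:mysql://dbserver/sales --username myuser --table order_summary --export-dir /data/order_summary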

Sqoop uses a connector-based architecture to read from and write to external databases. By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2.

Sqoop uses JDBC to connect to the database. JDBC is a Java application programming interface that connects Java to relational databases. To interact with a database, a Java application uses a JDBC driver. A JDBC driver implements the JDBC API for a particular database. For example, to connect Sqoop to SQL Server, the Microsoft SQL Server JDBC driver is required.

Sqoop includes a generic JDBC connector, which can be used to connect to any database that is accessible through JDBC. By using the generic JDBC connector with a JDBC-ODBC Bridge, Sqoop can access any database for which an ODBC driver is available.
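
When no database-specific connector is available, the generic connector can be selected by naming the JDBC driver class with the --driver option. A sketch, with hypothetical connection details and driver class:

$ sqoop import --driver com.example.jdbc.Driver --connect jdbc:example://dbserver/mydb --table mytable --target-dir /data/mytable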

Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop.

SQL Server big data connectors

The Microsoft SQL Server Connector for Apache Hadoop (SQL Server-Hadoop Connector) enables data to be moved backwards and forwards between Hadoop and SQL Server.

As well as being able to transfer data more efficiently than Sqoop's built-in SQL Server connector, the SQL Server-Hadoop Connector provides support for nchar and nvarchar columns, SQL Server's Unicode data types.

Although Microsoft recommend using the SQL Server-Hadoop Connector with Cloudera's Hadoop distribution, we successfully used the connector with the Apache Hadoop distribution.

We first copied the Microsoft SQL Server JDBC driver to the $SQOOP_HOME/lib directory on the Sqoop machine:

$ gunzip sqljdbc_4.0.2206.100_enu.tar.gz
$ tar -xvf sqljdbc_4.0.2206.100_enu.tar
$ cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib

We then installed the SQL Server-Hadoop Connector under $SQOOP_HOME:

$ gunzip sqoop-sqlserver-1.0.tar.gz
$ tar -xvf sqoop-sqlserver-1.0.tar
$ cd sqoop-sqlserver-1.0
$ export SQOOP_HOME=/usr/lib/sqoop
$ export MSSQL_CONNECTOR_HOME=/tmp/sqoop-sqlserver-1.0
# ./install

We also had to add the SQL Server JDBC driver path to the entry in $SQOOP_HOME/conf/managers.d/mssqoop-sqlserver:

com.microsoft.sqoop.SqlServer.MSSQLServerManagerFactory=/usr/lib/sqoop/lib/sqljdbc4.jar

We then used the connector to import data from SQL Server to HDFS:

$ ./sqoop import --connect 'jdbc:sqlserver://mymachine;instanceName=SQLEXPRESS;username=myuser;password=mypassword;database=AdventureWorks' --table HumanResources.Department --target-dir /data/department --split-by DepartmentID

Sqoop imports the SQL Server data into HDFS, where it can be examined in the HDFS shell:

$ hadoop dfs -cat /data/department/part-m-00001
5,Purchasing,Inventory Management,1998-06-01 00:00:00.0
6,Research and Development,Research and Development,1998-06-01 00:00:00.0
7,Production,Manufacturing,1998-06-01 00:00:00.0
8,Production Control,Manufacturing,1998-06-01 00:00:00.0

Note The SQL that Sqoop generates to extract the data from SQL Server surrounds the table name with square brackets, in this case [HumanResources.Department]. SQL Server treats the bracketed, schema-qualified name as a single identifier, which it cannot resolve, so it is necessary to create a synonym whose name is the table name enclosed in square brackets. We used our SQL Server ODBC driver for Linux and UNIX platforms to do this:

$ /usr/local/easysoft/unixODBC/bin/isql.sh -v SQLSERVER_2012
SQL> USE AdventureWorks
SQL> CREATE SYNONYM [HumanResources.Department] FOR HumanResources.Department

Unstructured data storage in SQL Server

SQL Server provides binary large object (BLOB) data types to store data whose structure is not recognised by the database, such as product brochures and product images. The disadvantages of BLOB data are that its size is restricted (to 2 GB) and that, because it is stored inside the database files, it can enormously bloat a database and make backup, restore, and other preventative maintenance tasks take a very long time.
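
For example, a product brochure could be held in a varbinary(max) column, one of SQL Server's BLOB types. (The following table is our own illustration, not part of AdventureWorks.)

SQL> CREATE TABLE Production.ProductBrochure
    (
    ProductID int NOT NULL,
    Brochure varbinary(max) NULL -- stored inside the database file, 2 GB limit
    )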

FILESTREAM (introduced in SQL Server 2008) enables unstructured binary data such as documents and images to be stored externally on the Windows file system instead of in the database file. (FILESTREAM is an attribute of the varbinary(max) data type rather than a data type in its own right.) FILESTREAM data is still under the control of SQL Server, and so you retain database engine functionality such as backing up, restoring, and controlling access to the data, while taking advantage of low-cost hardware to store it.
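
To illustrate, the following sketch stores the brochure data with FILESTREAM instead. It assumes that FILESTREAM is enabled on the instance and that the database has a FILESTREAM filegroup:

SQL> CREATE TABLE Production.ProductBrochureFS
    (
    BrochureID uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(), -- a ROWGUIDCOL column is a FILESTREAM requirement
    Brochure varbinary(max) FILESTREAM NULL -- stored on the file system, not in the database file
    )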

The FileTable feature (introduced in SQL Server 2012) builds on FILESTREAM and presents a SQL Server table as a Windows folder, which can be accessed through Windows Explorer. The directory structure and the file attributes are stored in the FileTable as columns. An organisation can store files and documents in FileTables and access them from client applications as if they were stored in the file system, without making any changes to those applications.
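
Because a FileTable has a fixed schema, it is created without a column list. A minimal sketch (the table and directory names are our own), assuming FILESTREAM and non-transactional access have been enabled for the database:

SQL> CREATE TABLE DocumentStore AS FileTable
    WITH (FILETABLE_DIRECTORY = 'DocumentStore')

The table's contents then appear as files and folders under the instance's FILESTREAM share.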

Unstructured data analysis in SQL Server

Full-text search

Full-text search allows fast and efficient querying of large amounts of unstructured data. Unlike the character-based comparisons used in SQL LIKE queries, full-text search performs linguistic searches by operating on words or phrases. (A linguistic search uses information about the structure and variation of languages so that users can more easily find relevant information.) Querying large amounts of data by using full-text search can therefore provide significant gains over character-based searches, both in performance and in usability.

When run against a large amount of unstructured text data, a SQL LIKE query is much slower than an equivalent full-text query. A LIKE query against millions of rows of text data can take minutes to return; a full-text query against the same data can take seconds or less, depending on the number of rows that are returned.

Linguistic searching also improves usability in other ways. When searching on a given word, full-text search not only matches on that word, but can also match on forms of the word or on other words that have a similar meaning. For example, searching on "house" can also yield matches on "houses," "housing," or even "home."
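
Full-text predicates such as CONTAINS require a full-text index on the columns being searched. The following sketch sets one up for the query below; it assumes the standard AdventureWorks2012 primary key index name, PK_Product_ProductID:

SQL> CREATE FULLTEXT CATALOG ProductsCatalog AS DEFAULT
SQL> CREATE FULLTEXT INDEX ON Production.Product([Name])
    KEY INDEX PK_Product_ProductID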

The following query retrieves all the products whose name matches a particular search pattern:

SQL> USE AdventureWorks2012
SQL> SELECT ProductID, [Name]
FROM Production.Product
WHERE CONTAINS([Name], '"*washer*" OR "*ball*"');

+------------+-----------------------+
| ProductID  | Name                  |
+------------+-----------------------+
| 2          | Bearing Ball          |
| 3          | BB Ball Bearing       |
| 4          | Headset Ball Bearings |
| 341        | Flat Washer 1         |
+------------+-----------------------+

Statistical semantic search

Statistical semantic search, introduced in SQL Server 2012, extends SQL Server's full-text search capability by providing semantic insight into document content. While full-text search lets you query specific words in a document, statistical semantic search lets you query the meaning of a document. Typical semantic search use cases include finding the key phrases in a document and finding similar or related documents.

We used our SQL Server ODBC driver for Linux and UNIX platforms to try out these semantic search use cases.

Example: Finding the key phrases in a document

In the following example, the query locates and retrieves the key phrases in a document stored in the Production.Document table of the AdventureWorks sample database. The document is identified by its key (0x6B40, the third argument to SEMANTICKEYPHRASETABLE). A score is assigned to each key phrase based on its statistical significance, and this score determines the order in which the key phrases are reported.

SQL> USE AdventureWorks2012
SQL> SELECT TOP(10) KEYP_TBL.keyphrase, KEYP_TBL.score
FROM SEMANTICKEYPHRASETABLE
    (
    Production.Document,
    Document,
    0x6B40
    ) AS KEYP_TBL
ORDER BY KEYP_TBL.score DESC;

+------------+-----------+
| keyphrase  | score     |
+------------+-----------+
| brake      | 0.9013491 |
| oil        | 0.8636435 |
| pedals     | 0.8491791 |
| pivot      | 0.8237474 |
| bicycle    | 0.8049155 |
| caliper    | 0.7762683 |
| lubricate  | 0.750616  |
| cantilever | 0.740355  |
| lubricating| 0.7147026 |
| rims       | 0.6890502 |
+------------+-----------+

Example: Finding similar or related documents

This query locates and retrieves CVs that are similar or related to the CV of a specified candidate (here, the candidate whose JobCandidateID is 1, the third argument to SEMANTICSIMILARITYTABLE). The results are scored based on their similarity to that CV and then displayed in ranked order.

SQL> SELECT TOP(10) KEY_TBL.matched_document_key AS Candidate_ID
FROM SEMANTICSIMILARITYTABLE
    (
    HumanResources.JobCandidate,
    Resume,
    1
    ) AS KEY_TBL
ORDER BY KEY_TBL.score DESC;

+---------------------+
| Candidate_ID        |
+---------------------+
| 2                   |
| 10                  |
| 7                   |
| 4                   |
| 6                   |
| 3                   |
| 8                   |
| 9                   |
| 5                   |
+---------------------+

The following query reveals that the most similar candidate worked for the same company as the specified candidate.

SQL> SELECT JobCandidateID, 
    Resume.query('declare default element namespace
"http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/Resume";
    for $employer in /Resume/Employment
    return data($employer/Emp.OrgName)
    ') Company
FROM 
    HumanResources.JobCandidate
WHERE
    JobCandidateID <= 2;

+----------------+--------------+
| JobCandidateID | Company      |
+----------------+--------------+
| 1              | Wingtip Toys |
| 2              | Wingtip Toys |
+----------------+--------------+

Example: Finding the key phrases that make documents similar or related

The following query retrieves the five key phrases that make the two CVs (JobCandidateID 1 and 2) similar or related to one another. It presents the results in descending order of the score that ranks the weight of each key phrase.

SQL> SELECT TOP(5) KEY_TBL.keyphrase, KEY_TBL.score
FROM SEMANTICSIMILARITYDETAILSTABLE
    (
    HumanResources.JobCandidate,
    Resume, 1,
    Resume, 2
    ) AS KEY_TBL
ORDER BY KEY_TBL.score DESC;

+---------------+-----------+
| keyphrase     | score     |
+---------------+-----------+
| production    | 0.5522802 |
| bachelor      | 0.4300567 |
| manufacturing | 0.1395864 |
| wingtip       | 0.1353242 |
| com           | 0.1256598 |
+---------------+-----------+

Statistical semantic search prerequisites

To try out statistical semantic search for yourself, you need to:

- Install the semantic language statistics database (which is installed separately from the database engine), attach it to your instance, and register it with SQL Server.
- Create a full-text index, with the STATISTICAL_SEMANTICS option, on the column that you want to query.

Refer to your SQL Server documentation for assistance with these tasks.
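
A sketch of these steps; it assumes the statistics database has been attached as semanticsdb and that the standard AdventureWorks primary key index name, PK_JobCandidate_JobCandidateID, is in place:

SQL> EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = N'semanticsdb'
SQL> CREATE FULLTEXT INDEX ON HumanResources.JobCandidate
    (Resume LANGUAGE 1033 STATISTICAL_SEMANTICS)
    KEY INDEX PK_JobCandidate_JobCandidateID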