Apache Spark ODBC Driver

We're currently developing an Apache Spark ODBC driver, which lets you work with Spark data in applications such as Perl, PHP, Excel and Oracle).

Configuring an ODBC Data Source

Before the Apache Spark ODBC Driver can be used to connect an application to Apache Spark, it's necessary to configure an ODBC data source. An ODBC data source stores the connection details for the target database (e.g. Apache Spark) and the ODBC driver that is required to connect to it (e.g. the Apache Spark ODBC driver).

ODBC data sources are configured in ODBC Administrator, which is included with Windows.

In ODBC Administrator:

  1. Choose the System DSN tab, and then choose Add.
  2. In the Create New Data Source dialog box, choose Easysoft ODBC-Apache Spark Driver, and then choose Finish.
  3. Complete the fields in the Easysoft ODBC-Apache Spark Driver DSN Setup dialog box.
  4. In your application, connect to your newly configured data source and run a sample query. For example:
    select * from MyTable

Work with Apache Spark data in Perl

Strawberry Perl is a Perl distribution for Windows that includes the necessary middleware layers (Perl DBI and Perl DBD::ODBC) to enable the Apache Spark ODBC driver to connect your Perl applications to Apache Spark.

  1. Here's a Perl script that retrieves some Apache Spark data:
    #!/usr/bin/perl -w
    use strict;
    use DBI;
    my $dbh = DBI-> connect('dbi:ODBC:MyApacheSparkDataSource');
    
    my $sql = "SELECT MyCol FROM MyTable LIMIT 10";
    
    # Prepare the statement.
    my $sth = $dbh->prepare($sql)
        or die "Can't prepare statement: $DBI::errstr";
    
    # Execute the statement.
    $sth->execute();
    
    my($SparkCol);
    
    # Fetch and display the result set value.
    while(($SparkCol) = $sth->fetchrow()){
       print("$SparkCol\n");                   
    }
    
    $dbh->disconnect if ($dbh);

Work with Apache Spark data in PHP

  1. Here's a PHP script that retrieves some Apache Spark data:
    <?php
      $con = odbc_connect("MyApacheSparkDataSource", "", "");
      $err = odbc_errormsg();
      if (strlen($err) <> 0) {
        echo odbc_errormsg();
      } else {
        $rs2 = odbc_exec($con, "select MyCol from MyTable");
        odbc_result_all($rs2);
        odbc_close($con);
      }
    ?>

Connecting Excel to Apache Spark

Follow these steps to return data from Apache Spark to Microsoft Excel by using Microsoft Query:

  1. On the Data tab, choose New Query > From Other Source > From ODBC.
  2. Choose the Apache Spark ODBC data source when prompted.
  3. Choose a table from the available data sets.
  4. Choose the Load to return the Apache Spark data to the worksheet.

    Note that for large result sets, you may have to have to filter the data using Excel before the data can be returned to the worksheet.

Connect to Apache Spark from Oracle

  1. Create a DG4ODBC init file on your Oracle machine. To do this, change to the %ORACLE_HOME%\hs\admin directory. Create a copy of the file initdg4odbc.ora. Name the new file initspark.ora.

    Note In these instructions, replace %ORACLE_HOME% with the location of your Oracle HOME directory. For example, C:\oraclexe\app\oracle\product\11.2.0\server.

  2. Ensure these parameters and values are present in your init file:
    HS_FDS_CONNECT_INFO = MyApacheSparkDataSource
  3. Comment out the line that enables DG4ODBC tracing. For example:
    #HS_FDS_TRACE_LEVEL = <trace_level>
    
  4. Add an entry to %ORACLE_HOME%\network\admin\listener.ora that creates a SID_NAME for DG4ODBC. For example:
    SID_LIST_LISTENER =
     (SID_LIST =
       (SID_DESC=
         (SID_NAME=spark)
         (ORACLE_HOME=%ORACLE_HOME%)
         (PROGRAM=dg4odbc)
       )
     )
    
  5. Add a DG4ODBC entry to %ORACLE_HOME%\network\admin\tnsnames.ora that specifies the SID_NAME created in the previous step. For example:
    SPARK =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = oracle_host)(PORT = 1521))
        (CONNECT_DATA =
          (SID = spark)
        )
        (HS = OK)
      )
    

    Replace oracle_host with the host name of your Oracle machine.

  6. Start (or restart) the Oracle Listener:
    cd %ORACLE_HOME%\bin
    lsnrctl stop
    lsnrctl start
    
  7. Connect to your Oracle database in SQL*Plus.
  8. In SQL*Plus, create a database link for the target Apache Spark instance. For example:
    CREATE PUBLIC DATABASE LINK SPARKLINK CONNECT TO "mydummyuser" IDENTIFIED BY "mydummypassword" USING 'spark';
    
  9. Try querying your Apache Spark data. For example:
    SELECT * FROM "MyTable"@SPARKLINK;
    
Notes