Custom Processors for ActiveWarehouse ETL

Written on 12:45:00 AM by S. Potter

For anyone interested in extending ActiveWarehouse ETL with custom pre- or post-processors, here is a piece of code I wrote in August for a personal project. The example should give you enough detail to create your own custom processors.


# Written by Susan Potter under open source MIT license.
# August 12, 2007.

require 'net/ftp'

module ETL
  module Processor
    # Custom pre-processor to download files via FTP before beginning control process.
    class FtpDownloaderProcessor < ETL::Processor::Processor
      attr_reader :host
      attr_reader :port
      attr_reader :remote_dir
      attr_reader :files
      attr_reader :username
      attr_reader :local_dir
      
      # configuration options include:
      # * host - hostname or IP address of FTP server (required)
      # * port - port number for FTP server (default: 21)
      # * remote_dir - remote path on FTP server (default: /)
      # * files - list of files to download from FTP server (default: [])
      # * username - username for FTP server authentication (default: anonymous)
      # * password - password for FTP server authentication (default: nil)
      # * local_dir - local output directory to save downloaded files (default: '')
      # 
      # As an example, you might write something like the following in your control process file:
      #  pre_process :ftp_downloader, {
      #    :host => 'ftp.sec.gov',
      #    :remote_dir => 'edgar/Feed/2007/QTR2',
      #    :files => ['20070402.nc.tar.gz', '20070403.nc.tar.gz', '20070404.nc.tar.gz', 
      #               '20070405.nc.tar.gz', '20070406.nc.tar.gz'],
      #    :local_dir => '/data/sec/2007/04'
      #  }
      # The above example will anonymously download via FTP the first week's worth of SEC filing feed data
      # from the second quarter of 2007 and save the files to the local directory +/data/sec/2007/04+.
      def initialize(control, configuration)
        @host = configuration[:host]
        @port = configuration[:port] || 21
        @remote_dir = configuration[:remote_dir] || '/'
        @files = configuration[:files] || []
        @username = configuration[:username] || 'anonymous'
        @password = configuration[:password]
        @local_dir = configuration[:local_dir] || ''
      end
      
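      def process
        conn = Net::FTP.new
        begin
          # Connect exactly once, explicitly, so that a non-default port is honored
          conn.connect(@host, @port)
          conn.login(@username, @password)
          @files.each do |f|
            # Binary mode keeps archives such as .tar.gz intact
            conn.getbinaryfile(remote_file(f), local_file(f))
          end
        ensure
          conn.close unless conn.closed?
        end
      end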
      
      private
      attr_accessor :password
      
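      def local_file(name)
        # File.join('', name) yields "/name", so an empty local_dir means the current directory
        @local_dir.empty? ? name : File.join(@local_dir, name)
      end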
      
      def remote_file(name)
        File.join(@remote_dir, name)
      end
    end
  end
end
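The :files array in the usage comment above follows a date-based naming pattern, so it can be generated rather than typed by hand. The following is only a sketch: feed_files_for is a hypothetical helper (not part of ActiveWarehouse ETL), and it assumes both the YYYYMMDD.nc.tar.gz pattern seen in the example and that feed files exist for weekdays only.

```ruby
require 'date'

# Hypothetical helper: builds one feed file name per weekday in the given range.
# Assumptions: files are named YYYYMMDD.nc.tar.gz and are published Monday to Friday only.
def feed_files_for(first_day, last_day)
  (first_day..last_day)
    .reject { |day| day.saturday? || day.sunday? }
    .map { |day| day.strftime('%Y%m%d') + '.nc.tar.gz' }
end

# The same five files listed by hand in the usage comment
files = feed_files_for(Date.new(2007, 4, 2), Date.new(2007, 4, 6))
```

The resulting array could then be passed as the :files option of the pre-processor.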
The key things to note from the processor above are that, at present, you must:
  • define all your custom processors within the ETL::Processor module
  • name your custom processor class in the form XXXXProcessor
  • extend (or really just adhere to the message interface of) the ETL::Processor::Processor class defined in ActiveWarehouse ETL
  • define initialize to take two arguments, control and configuration (look above for guidance)
  • define a process method that does what you need before or after the control process runs (for pre- and post-processors respectively)
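To make the checklist above concrete, here is a minimal skeleton that follows each rule. It is a sketch under stated assumptions: the base class below is a stand-in so the snippet runs without the gem installed, ArchiveFilesProcessor and its :files and :archive_dir options are hypothetical, and processor_class_for only illustrates how a symbol such as :archive_files presumably maps onto a class name (the real lookup lives inside ActiveWarehouse ETL and may differ).

```ruby
require 'fileutils'
require 'tmpdir'

module ETL
  module Processor
    # Stand-in for the real base class (assumption: it also takes control and configuration)
    class Processor
      def initialize(control, configuration)
        @control = control
        @configuration = configuration
      end
    end

    # Hypothetical post-processor: moves processed files into an archive directory
    class ArchiveFilesProcessor < ETL::Processor::Processor
      attr_reader :files, :archive_dir

      def initialize(control, configuration)
        super
        @files = configuration[:files] || []
        @archive_dir = configuration[:archive_dir] || 'archive'
      end

      # Called after the control process has run
      def process
        FileUtils.mkdir_p(@archive_dir)
        @files.each { |f| FileUtils.mv(f, @archive_dir) if File.exist?(f) }
      end
    end
  end
end

# Illustrative only: turns :archive_files into ETL::Processor::ArchiveFilesProcessor
def processor_class_for(name)
  class_name = name.to_s.split('_').map(&:capitalize).join + 'Processor'
  ETL::Processor.const_get(class_name)
end
```

In a control file this would presumably be wired up with something like post_process :archive_files, { :files => [...], :archive_dir => '...' }, mirroring the pre_process usage shown earlier.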
Hope this helps someone customize ActiveWarehouse more easily, since the only bad thing I have found with ActiveWarehouse is its lack of documentation.


1 Comment

  1. B Grounds |

    I ran across this post through a Google search. I'm not a computer programmer, but I'm trying to figure out how to download a specific type of filing for a specific list of companies from the SEC's FTP website. Can you tell me how to customize the request to the SEC's FTP website so that it only brings back the specific information I'm looking for, rather than all filings? Also, is there a way to automate this process so that a program on my computer automatically accesses the FTP site each night?

    Thanks for any help you can provide!

     
