
Build a simple PDF search engine in Ruby (Part 1)

I decided to build a simple Ruby search engine to search through PDFs.

The main application was that I wanted a quick way to search through songsheets on my church's Web site. I didn't want to repeatedly look through different PDFs to find the song I was interested in.

I was mostly inspired by this example of someone who had written a search engine in 200 lines of Ruby. I knew my program would be much simpler because it didn't need to support any crawling, only indexing and querying.

The first challenge was to find a Ruby library that would parse PDFs. I ultimately settled on pdf-struct because it was easy to work with. It's basically a Ruby wrapper around pdftohtml that provides high-level access to the text objects of a PDF. I don't care about layout, graphics, and so on, so this was sufficient.

The PDF code mostly works without problems, but it assumes that the directory containing pdftohtml is in $PATH. I used MacPorts to compile pdftohtml, so it was installed in /opt/local/bin, and TextMate didn't pick up /opt/local/bin from my $PATH. After some research I found this page, which says I need to create a file called ~/.MacOSX/environment.plist and explicitly set the PATH variable:

{ 
  PATH = "/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin";
}
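
A quick way to confirm that the PATH change took effect is to look for pdftohtml from inside Ruby itself. This is only a sketch: the `which` helper below is hypothetical (it is not part of pdf-struct), and it simply walks the directories in PATH looking for an executable file with the given name.

```ruby
# Hypothetical helper (not part of pdf-struct): return the full path of an
# executable found on PATH, or nil if it is not there.
def which(cmd, path = ENV['PATH'].to_s)
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, cmd)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

location = which('pdftohtml')
if location
  puts "pdftohtml found at #{location}"
else
  puts "pdftohtml is not on PATH; check environment.plist"
end
```

Running this from within TextMate (rather than from a terminal) shows the PATH that TextMate actually sees.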

The actual indexing code is straightforward and is mostly based on the saush engine article. Rather than rehash that article, I'll just say that the search engine is built around an inverted index: a mapping from each word to the songs (and the positions within them) where it occurs. The engine saves the inverted index in a SQLite database using the DataMapper library.

There are three main "tables": Song, Word, and Location. Song and Word have a many-to-many relationship, where a song has multiple words and a word is used in multiple songs. Location is the mapping table between Song and Word.
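
To make the table layout concrete, here is a minimal in-memory sketch of the same idea. It is an illustration only, not the DataMapper code: the `build_index` name, the file names, and the lyrics are made up, and a plain Hash stands in for the database. Each word maps to a list of [song, position] pairs, which is the role the Location table plays.

```ruby
# In-memory stand-in for the Song/Word/Location tables (illustrative only).
# The index maps each word to a list of [song_title, position] pairs.
def build_index(songs)
  index = Hash.new { |hash, word| hash[word] = [] }
  songs.each do |title, lyrics|
    lyrics.downcase.split.each_with_index do |word, position|
      index[word] << [title, position]
    end
  end
  index
end

# Made-up songs for the example
songs = {
  'song_a.pdf' => 'grace flows like a river',
  'song_b.pdf' => 'a river of grace'
}
index = build_index(songs)
p index['grace']   # every song and position where "grace" occurs
p index['flows']   # only the first song contains this word
```

Looking up a word is then a single hash access, which is what makes an inverted index fast to query.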

Here is the indexing library. It uses DataMapper, so it relies on the dm-core and dm-timestamps libraries (plus dm-aggregates, which the code below also requires), as well as stemmer and pdf-struct (the PDF library mentioned earlier). The saush search engine uses dm-more, but I couldn't get it to load properly. dm-timestamps was all I needed out of dm-more anyway.

Here is the code for index.rb:

require 'rubygems'
require 'dm-core'
require 'dm-timestamps'
require 'dm-aggregates'
require 'stemmer'
require 'pdf-struct'

DBLOC = 'songdb.sqlite3'

DataMapper.setup(:default, 'sqlite3:///' + DBLOC)

class String
  # Returns the stemmed, lowercased words in the string, skipping guitar chords
  def words
    # Get rid of all non-word and non-space characters and split on whitespace
    tokens = gsub(/[^0-9A-Za-z_\s]/, "").split
    # Ignore guitar chords such as G, Am or Bb
    tokens.reject { |word| word =~ /^[A-G]+[bgm]?$/ }.map { |word| word.downcase.stem }
  end
end

class Song
  include DataMapper::Resource
  
  property :id,          Serial
  property :title,       String, :length => 255
  has n, :locations
  has n, :words, :through => :locations
  property :created_at,   DateTime
  property :updated_at,   DateTime
  
  def self.find(title)
    song = first(:title => title)
    song = new(:title => title) if song.nil?
    return song
  end
  
  def refresh
    update(:updated_at => DateTime.now)
  end
end

class Word
  include DataMapper::Resource
  
  property :id,           Serial
  property :stem,         String
  has n, :locations
  has n, :songs, :through => :locations
  
  def self.find(word)
    wrd = first(:stem => word)
    wrd = new(:stem => word) if wrd.nil?
    return wrd
  end
end

class Location
  include DataMapper::Resource
  
  property :id,           Serial
  property :position,     Integer
  
  belongs_to :word
  belongs_to :song
end

DataMapper.auto_migrate! if ARGV[0] == 'reset' # This issues the necessary Create statements and wipes out existing database
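
It helps to see what the chord filter in String#words actually lets through. The sketch below reuses the same two regular expressions but leaves out stemming so that it runs without the stemmer gem; the `lyric_tokens` name and the sample lines are made up for the example.

```ruby
CHORD = /^[A-G]+[bgm]?$/   # same pattern as in String#words

# Strip punctuation the way String#words does, then drop chord-like tokens.
# Stemming is omitted so the sketch runs without the stemmer gem.
def lyric_tokens(line)
  line.gsub(/[^0-9A-Za-z_\s]/, '').split.reject { |t| t =~ CHORD }.map(&:downcase)
end

p lyric_tokens('G Am Bb F#m D7')    # a line of chords
p lyric_tokens('Sing ABBA songs')   # a lyric line with an all-caps word
```

Two things stand out. A chord with a digit, such as D7, is not matched by the pattern and so gets indexed as a word. Going the other way, an all-caps word made up only of the letters A to G (ABBA in the example) is treated as a chord and dropped. For songsheets this is probably an acceptable trade-off, but it is worth knowing about.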

The actual indexing code goes through each PDF. It extracts the text from the song, joins it into a single space-delimited string, and splits that into stemmed words (dropping the guitar chords). Then it goes through the words, creating the Word or Song objects if necessary and creating a Location for each word, which records its position and forms the many-to-many relationship between Word and Song.

Code for pdfindex.rb:

#!/usr/bin/ruby

require 'rubygems'
require 'fileutils'
require 'logger'
require 'index'

SONGDIR = '/Users/rpark/ruby/pdfsearch/'
LOGFILE = 'songsearch.log'
LASTRUN = 'lastrun'

class SongSearch
  def process(file)   # returns an array of all stemmed words in the song, or nil if parsing fails
    array = []
    document = PDF::Extractor.open(file)
    document.elements.each do |element|
      array << element.content
    end
    return array.join(" ").words # .join creates a string separated by delimiter
  rescue => e
    #puts "Exception in parsing #{e}"
    @log.debug "Exception in parsing #{e}"
    nil
  end

  def index(words, filename)
    if words.nil?
      #puts "ERROR parsing #{filename}"
      @log.debug "ERROR parsing #{filename}"
      return
    end
    print "Indexing #{filename}: "
    logmsg = "Indexing #{filename}: "
    song = Song.find(filename)
    unless song.new?
      print "Overwriting... "
      logmsg += "Overwriting... "
      song.refresh
      song.locations.destroy!
    end
    words.each_with_index { |word, index|
      loc = Location.new(:position => index)
      loc.word, loc.song = Word.find(word), song
      loc.save
    }
    puts "#{words.size.to_i} words"
    @log.debug logmsg + "#{words.size.to_i} words"
  end

  def cycle
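    # On the first run there is no LASTRUN file yet, so treat every PDF as new
    lastrun = File.exist?(LASTRUN) ? File.mtime(LASTRUN) : Time.at(0)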
    @log = Logger.new(LOGFILE, 'monthly')
    Dir.glob(SONGDIR + "*.pdf") {
      |file|
      index(process(file), file) if File.mtime(file) > lastrun  # Only process newer songs
    }
    FileUtils.touch LASTRUN
  end
end

search = SongSearch.new
search.cycle

The digger code queries the song database. You search for a song by passing a string to the search method of a Digger instance. It returns a list of up to 20 songs that contain the search words, each with a score.

Code for digger.rb:

#!/usr/bin/ruby

require 'index'

class Digger
  SEARCH_LIMIT = 19
  
  def search(for_text)
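    @search_params = for_text.words
    return [] if @search_params.empty?  # Nothing searchable (e.g. only chords or punctuation)
    wrds = []
    @search_params.each { |param| wrds << "stem = '#{param}'" }
    word_sql = "select * from words where #{wrds.join(" or ")}"
    @search_words = repository(:default).adapter.query(word_sql)
    return [] if @search_words.empty?   # None of the search terms are in the index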
    tables, joins, ids = [], [], []
    @search_words.each_with_index { |w, index|
      tables << "locations loc#{index}"
      joins << "loc#{index}.song_id = loc#{index+1}.song_id"
      ids << "loc#{index}.word_id = #{w.id}"
    }
    joins.pop
    @common_select = "from #{tables.join(', ')} where #{(joins + ids).join(' and ')} group by loc0.song_id"
    rank[0..SEARCH_LIMIT]
  end
  
  def rank
    merge_rankings(frequency_ranking, location_ranking, distance_ranking)
  end
  
  def merge_rankings(*rankings)
    r = {}
    rankings.each { |ranking| r.merge!(ranking) { |key, oldval, newval| oldval + newval} }
    r.sort {|a,b| b[1] <=> a[1]}
  end
  
  def frequency_ranking
    freq_sql= "select loc0.song_id, count(loc0.song_id) as count #{@common_select} order by count desc"
    list = repository(:default).adapter.query(freq_sql)
    rank = {}
    list.size.times { |i| rank[list[i].song_id] = list[i].count.to_f/list[0].count.to_f }
#puts freq_sql
#puts list
#puts rank.inspect
    return rank
  end
  
  def location_ranking
    total = []
    @search_words.each_with_index { |w, index| total << "loc#{index}.position + 1" }
    loc_sql = "select loc0.song_id, (#{total.join(' + ')}) as total #{@common_select} order by total asc"
    list = repository(:default).adapter.query(loc_sql)
    rank = {}
    list.size.times { |i| rank[list[i].song_id] = list[0].total.to_f/list[i].total.to_f }
#puts loc_sql
#puts list
#puts rank.inspect
    return rank
  end

  def distance_ranking
    return {} if @search_words.size == 1
    dist, total = [], []
    @search_words.each_with_index { |w, index| total << "loc#{index}.position" }
    total.size.times { |index| dist << "abs(#{total[index]} - #{total[index + 1]})" unless index == total.size - 1 }
    dist_sql = "select loc0.song_id, (#{dist.join(' + ')}) as dist #{@common_select} order by dist asc"
    list = repository(:default).adapter.query(dist_sql)
    rank = Hash.new
    list.size.times { |i| rank[list[i].song_id] = list[0].dist.to_f/list[i].dist.to_f }
#puts dist_sql
#puts list
#puts rank.inspect
    return rank
  end
end
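
To make the scoring concrete: each of the three rankings gives every matching song a score of at most 1.0. Frequency is the song's match count divided by the highest count, location is the lowest position total divided by the song's total, and distance is the smallest word distance divided by the song's distance. The scores are then summed per song and sorted in descending order. The sketch below runs the same merging logic on made-up song ids and scores, so it needs no database.

```ruby
# Same merging logic as Digger#merge_rankings, on made-up scores.
def merge_rankings(*rankings)
  merged = {}
  rankings.each do |ranking|
    # When a song id appears in more than one ranking, add the scores together
    merged.merge!(ranking) { |_song_id, old, new| old + new }
  end
  merged.sort { |a, b| b[1] <=> a[1] }   # highest total score first
end

frequency = { 1 => 1.0,  2 => 0.5 }   # hypothetical song ids and scores
location  = { 1 => 0.25, 2 => 1.0 }
distance  = { 1 => 1.0,  2 => 0.5 }

p merge_rankings(frequency, location, distance)
```

Here song 1 totals 2.25 and song 2 totals 2.0, so song 1 is listed first even though song 2 wins on location. Because the three scores are simply added, each ranking carries equal weight.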

Note: the biggest disadvantage of this search method is that it doesn't show the search string in its context in the song. Rather than continue with this approach, I'm thinking of using a search engine such as Solr to do the search, so I can show the search string within the song.
