I decided to build a simple Ruby search engine to search through PDFs.
My main motivation was that I wanted a quick way to search through songsheets on my church's Web site. I didn't want to repeatedly open different PDFs to find the song I was interested in.
I was mostly inspired by this example of someone who had written a search engine in 200 lines of Ruby. I knew my program would be much simpler because it didn't need to support crawling, just indexing and querying.
The first challenge was to find a Ruby library that would parse PDFs. I ultimately settled on this one because it was easy to work with. It's basically a Ruby wrapper around pdftohtml that provides high-level access to the text objects of a PDF. I don't care about layout, graphics, and so on, so this was sufficient.
The PDF code mostly works without problems, but it assumes that the directory containing pdftohtml is in $PATH. I used MacPorts to compile pdftohtml, so it was installed in /opt/local/bin, and TextMate didn't have /opt/local/bin in its $PATH. After some research I discovered this page, which says to create a file called ~/.MacOSX/environment.plist and explicitly set the PATH variable:
{ PATH = "/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin"; }
The actual indexing code is straightforward. It's mostly based on the saush search engine article, so I won't rehash that material here; the short version is that the index is an inverted index, which the engine saves in a SQLite database using the DataMapper library.
There are three main "tables": Song, Word, and Location. Song and Word have a many-to-many relationship, where a song has multiple words and a word is used in multiple songs. Location is the mapping table between Song and Word.
Here is the indexing library. Note that it uses DataMapper, so it relies on the dm-core, dm-timestamps, and dm-aggregates libraries, as well as stemmer and pdf-struct (the PDF library mentioned earlier). The saush search engine uses dm-more, but I couldn't get that to load properly; fortunately, dm-timestamps was all I needed out of dm-more.
Here is the code for index.rb:
require 'rubygems'
require 'dm-core'
require 'dm-timestamps'
require 'dm-aggregates'
require 'stemmer'
require 'pdf-struct'

DBLOC = 'songdb.sqlite3'
DataMapper.setup(:default, 'sqlite3:///' + DBLOC)

class String
  # Get rid of all non-word and non-space characters and split on spaces;
  # self is the string, so no parameters are needed
  def words
    words = self.gsub(/[^0-9A-Za-z_\s]/, "").split
    d = []
    words.each { |word| d << word.downcase.stem unless word =~ /^[A-G]+[bgm]?$/ } # ignore guitar chords
    return d
  end
end

class Song
  include DataMapper::Resource
  property :id, Serial
  property :title, String, :length => 255
  has n, :locations
  has n, :words, :through => :locations
  property :created_at, DateTime
  property :updated_at, DateTime

  def self.find(title)
    song = first(:title => title)
    song = new(:title => title) if song.nil?
    return song
  end

  def refresh
    update({:updated_at => DateTime.parse(Time.now.to_s)})
  end
end

class Word
  include DataMapper::Resource
  property :id, Serial
  property :stem, String
  has n, :locations
  has n, :songs, :through => :locations

  def self.find(word)
    wrd = first(:stem => word)
    wrd = new(:stem => word) if wrd.nil?
    return wrd
  end
end

class Location
  include DataMapper::Resource
  property :id, Serial
  property :position, Integer
  belongs_to :word
  belongs_to :song
end

# This issues the necessary CREATE statements and wipes out the existing database
DataMapper.auto_migrate! if ARGV[0] == 'reset'
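The chord-filtering part of String#words can be tried in isolation. Here is a plain-Ruby sketch of the same tokenizing logic, with lowercasing standing in for the stemmer gem's stem call so it runs without any gems (the method name is mine, not part of the library):

```ruby
# Same tokenizing as String#words, minus stemming: strip punctuation,
# split on whitespace, drop guitar chords (letters A-G optionally
# followed by b, g, or m -- e.g. "G", "Am", "Bb"), lowercase the rest.
def words_without_chords(line)
  line.gsub(/[^0-9A-Za-z_\s]/, "")
      .split
      .reject { |w| w =~ /^[A-G]+[bgm]?$/ }
      .map(&:downcase)
end

words_without_chords("Amazing grace, how sweet  G  Am")
# => ["amazing", "grace", "how", "sweet"]
```

One side effect of the chord regex: a standalone word like "A" also matches and gets dropped, which is usually harmless for search since single letters make poor query terms anyway.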
The actual indexing code goes through each PDF. It extracts the words from the song (except the guitar chords) and creates a space-delimited string of words. Then it goes through the string, creating the Word or Song objects if necessary and creating the many-to-many relationship between Word and Song.
Code for pdfindex.rb:
#!/usr/bin/ruby
require 'rubygems'
require 'fileutils'
require 'logger'
require 'index'

SONGDIR = '/Users/rpark/ruby/pdfsearch/'
LOGFILE = 'songsearch.log'
LASTRUN = 'lastrun'

class SongSearch
  # Returns an array of all stemmed words in the song
  def process(file)
    array = []
    document = PDF::Extractor.open(file)
    document.elements.each do |element|
      array << element.content
    end
    return array.join(" ").words # .join creates a string separated by the delimiter
  rescue => e
    @log.debug "Exception in parsing #{e}"
    nil
  end

  def index(words, filename)
    if words.nil?
      @log.debug "ERROR parsing #{filename}"
      return
    end
    print "Indexing #{filename}: "
    logmsg = "Indexing #{filename}: "
    song = Song.find(filename)
    unless song.new?
      print "Overwriting... "
      logmsg += "Overwriting... "
      song.refresh
      song.locations.destroy!
    end
    words.each_with_index { |word, index|
      loc = Location.new(:position => index)
      loc.word, loc.song = Word.find(word), song
      loc.save
    }
    puts "#{words.size.to_i} words"
    @log.debug logmsg + "#{words.size.to_i} words"
  end

  def cycle
    lastrun = File.mtime(LASTRUN)
    @log = Logger.new(LOGFILE, 'monthly')
    Dir.glob(SONGDIR + "*.pdf") { |file|
      index(process(file), file) if File.mtime(file) > lastrun # only process newer songs
    }
    FileUtils.touch LASTRUN
  end
end

search = SongSearch.new
search.cycle
The digger code performs the actual search against the song database. A song is searched for by passing a string to Digger.search(), which returns a list of songs the string can be found in, along with a score for each.
Code for digger.rb:
#!/usr/bin/ruby
require 'index'

class Digger
  SEARCH_LIMIT = 19

  def search(for_text)
    @search_params = for_text.words
    wrds = []
    @search_params.each { |param| wrds << "stem = '#{param}'" }
    word_sql = "select * from words where #{wrds.join(" or ")}"
    @search_words = repository(:default).adapter.query(word_sql)
    tables, joins, ids = [], [], []
    @search_words.each_with_index { |w, index|
      tables << "locations loc#{index}"
      joins << "loc#{index}.song_id = loc#{index + 1}.song_id"
      ids << "loc#{index}.word_id = #{w.id}"
    }
    joins.pop # the last join clause refers to a table that doesn't exist, so drop it
    @common_select = "from #{tables.join(', ')} where #{(joins + ids).join(' and ')} group by loc0.song_id"
    rank[0..SEARCH_LIMIT]
  end

  def rank
    merge_rankings(frequency_ranking, location_ranking, distance_ranking)
  end

  def merge_rankings(*rankings)
    r = {}
    rankings.each { |ranking|
      r.merge!(ranking) { |key, oldval, newval| oldval + newval }
    }
    r.sort { |a, b| b[1] <=> a[1] }
  end

  # Score each song by how often the search words occur in it
  def frequency_ranking
    freq_sql = "select loc0.song_id, count(loc0.song_id) as count #{@common_select} order by count desc"
    list = repository(:default).adapter.query(freq_sql)
    rank = {}
    list.size.times { |i| rank[list[i].song_id] = list[i].count.to_f / list[0].count.to_f }
    return rank
  end

  # Score each song by how close to the beginning the search words occur
  def location_ranking
    total = []
    @search_words.each_with_index { |w, index| total << "loc#{index}.position + 1" }
    loc_sql = "select loc0.song_id, (#{total.join(' + ')}) as total #{@common_select} order by total asc"
    list = repository(:default).adapter.query(loc_sql)
    rank = {}
    list.size.times { |i| rank[list[i].song_id] = list[0].total.to_f / list[i].total.to_f }
    return rank
  end

  # Score each song by how close together the search words occur
  def distance_ranking
    return {} if @search_words.size == 1
    dist, total = [], []
    @search_words.each_with_index { |w, index| total << "loc#{index}.position" }
    total.size.times { |index|
      dist << "abs(#{total[index]} - #{total[index + 1]})" unless index == total.size - 1
    }
    dist_sql = "select loc0.song_id, (#{dist.join(' + ')}) as dist #{@common_select} order by dist asc"
    list = repository(:default).adapter.query(dist_sql)
    rank = Hash.new
    list.size.times { |i| rank[list[i].song_id] = list[0].dist.to_f / list[i].dist.to_f }
    return rank
  end
end
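The merge step is easy to exercise on its own. Here is the same summing-and-sorting logic as merge_rankings, fed with hypothetical pre-normalized scores for two song ids:

```ruby
# Same logic as Digger#merge_rankings: add up the score each ranking
# assigns to a song id, then sort best-first.
def merge_rankings(*rankings)
  r = {}
  rankings.each do |ranking|
    r.merge!(ranking) { |_song_id, old_score, new_score| old_score + new_score }
  end
  r.sort { |a, b| b[1] <=> a[1] }   # [[song_id, combined_score], ...]
end

# Hypothetical normalized scores (1.0 = best) for songs 1 and 2
frequency = { 1 => 1.0,  2 => 0.5 }
location  = { 1 => 0.25, 2 => 0.875 }

merge_rankings(frequency, location)
# => [[2, 1.375], [1, 1.25]]
```

Because each ranking is normalized so its best song scores 1.0, the three rankings carry equal weight when summed; a song that is merely good on all three measures can beat one that tops a single measure.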
Note: the biggest disadvantage of this search method is that it doesn't show the search string in context within the song. Rather than continue with this approach, my thinking is to switch to a search engine such as Solr, which would let me show the search string within the song.