Suppose I have a file with several millions of lines containing information on SNPs. And suppose I have a database that already contains data for those SNP. And suppose I want to update the entries in the database with the data from the input file.
Please note: this is a quick hack.
require 'rubygems'
require 'progressbar'
MAX_NR_OF_THREADS = 5
nr_of_lines = `wc -l input_file.tsv`.split(/ /)[0].to_i
pbar = ProgressBar.new('processing', nr_of_lines.to_f/MAX_NR_OF_THREADS)
File.open(input_file.tsv).each_slice(MAX_NR_OF_THREADS) do |slice|
pbar.inc
threads = Hash.new
slice.each do |line|
threads[line] = Thread.new do
# do the actual line parsing, DB lookup and DB updates
end
end
threads.values.each do |thread|
thread.join
end
end
pbar.finish
I know this is far from perfect:
- I shouldn't need to create that array.
- This way all concurrent threads wait for each other before the next slice is taken from the input file. If one of the 5 threads takes a really long time, the other ones will wait but could instead start parsing the next lines in the input file.
Don't think less of me for this code...
Just commenting about the database side. You seem to first check if the row exists followed by either update or insert. That requires always two database round trips. If merge statement is not supported, you can use the following technique.
ReplyDeleteIf most of the rows already exists, you can execute update then insert if update count was 0. If most of the rows are new, insert and if error then update. See asktom for more details.
The procedural (row-by-row) approach you are using limits the performance. I've seen very complex cases where sql procedures were upserting rows into database about 10000 rows per second in single thread. Simple cases like yours could be order of magnitude faster.