Aug 12
VCF, tab-delimited files and bioclojure
A lot of the work I do involves extracting data from VCF files ("Variant Call Format"; see http://bit.ly/apUbi8). The format is tab-delimited, but not quite: some of the columns contain structured data rather than a single value, and the layout of those columns can differ from one line to the next.
An example line (with the header):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
1 12345 . A G 249.00 0 MQ=23.66;DB;DP=89;MQ0=26;LowMQ=0.2921,0.2921,89 GT:DP:GQ 1/1:89:99.00
The INFO field is actually a list of tag/value pairs (except when it's just a tag), and the meaning of the data in the SAMPLE1 column is explained in the FORMAT column. Not only can different INFO tags be present on different lines, but the FORMAT can change line-by-line.
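To make that structure concrete, here is a minimal sketch in plain Python (not bioclojure) of how the example line could be taken apart. The function names parse_info, parse_sample and parse_line are made up for this illustration, every value is left as a string, and the sketch assumes real tab characters between columns and that all sample columns come after FORMAT.

```python
def parse_info(info):
    """Turn an INFO string into a dict; a bare tag such as DB becomes True."""
    fields = {}
    if info == ".":  # "." marks a missing INFO field
        return fields
    for item in info.split(";"):
        tag, sep, value = item.partition("=")
        if not sep:
            fields[tag] = True              # tag without a value (a flag)
        elif "," in value:
            fields[tag] = value.split(",")  # comma-separated list of values
        else:
            fields[tag] = value
    return fields


def parse_sample(fmt, sample):
    """Pair the keys in this line's FORMAT column with one sample's values."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def parse_line(header, line):
    """Parse one data line, using the #CHROM header line for column names."""
    columns = header.rstrip("\n").lstrip("#").split("\t")
    values = line.rstrip("\n").split("\t")
    record = dict(zip(columns, values))
    record["INFO"] = parse_info(record["INFO"])
    # FORMAT is read from this line, because it can change line by line.
    fmt = record["FORMAT"]
    for name in columns[columns.index("FORMAT") + 1:]:
        record[name] = parse_sample(fmt, record[name])
    return record


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1"
line = ("1\t12345\t.\tA\tG\t249.00\t0\t"
        "MQ=23.66;DB;DP=89;MQ0=26;LowMQ=0.2921,0.2921,89\t"
        "GT:DP:GQ\t1/1:89:99.00")

record = parse_line(header, line)
print(record["INFO"])
print(record["SAMPLE1"])
```

Looking up the FORMAT keys per line, instead of once per file, is what keeps the sample columns readable when the FORMAT changes between lines.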