Genscan
From genomewiki
Jump to navigationJump to search
Hard mask
Genscan works with hard masked sequence. All masked sequence is converted to N's. UCSC runs this as a cluster job, although it is a simple loop if you are willing to wait. The gensub2 template is:
#LOOP ./runOne.csh $(root1) ../hardMaskedFa/$(lastDir1)/$(file1) #ENDLOOP
The runOne.csh script:
#!/bin/csh -ef set chrom = $1 set result = $2 twoBitToFa /data/genomes/ricCom1/ricCom1.2bit:$chrom stdout \ | maskOutFa stdin hard $result
C-shell script syntax here to manually construct a directory hierarchy to contain all the resulting hard masked fasta files and the chr.list file to use with gensub2:
mkdir -p /data/genomes/ricCom1/bed/genscan/run.hardMask
cd /data/genomes/ricCom1/bed/genscan/run.hardMask
mkdir -p ../hardMaskedFa
set perDirLimit = 4000
set ctgCount = `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | wc -l`
set subDirCount = `echo $ctgCount | awk '{printf "%d", 1+$1/4000}'`
@ dirCount = 0
set dirName = `echo $dirCount | awk '{printf "%03d", $1}'`
@ perDirCount = 0
mkdir ../hardMaskedFa/$dirName
/bin/rm -f chr.list
/bin/touch chr.list
foreach chrom ( `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | cut -f1` )
if ($perDirCount < $perDirLimit) then
@ perDirCount += 1
else
@ dirCount += 1
set dirName = `echo $dirCount | awk '{printf "%03d", $1}'`
set perDirCount = 1
mkdir ../hardMaskedFa/$dirName
endif
echo $dirName/$chrom.fa >> chr.list
end
The gensub2 constructs the jobList:
gensub2 chr.list single template jobList
Typical parasol cluster run:
para make jobList para check para time > run.time cat run.time Completed: 25763 of 25763 jobs CPU time in finished jobs: 90s 1.50m 0.02h 0.00d 0.000 y IO & Wait Time: 65875s 1097.92m 18.30h 0.76d 0.002 y Average job time: 3s 0.04m 0.00h 0.00d Longest finished job: 6s 0.10m 0.00h 0.00d Submission to last job: 259s 4.32m 0.07h 0.00d