# HiSIF - HiC Significant Interacting Fragments # ---------------------------------- HiSIF was implemented by using C++/C with parallel processing being written in C. It has been compiled and run exclusively on Linux operating systems. This tool only requires the g++ compiler and a reference genome for HG19. Standalone CERN ROOT C++ framework is used to extract fit parameters of the CTS interactions. A small C tool is provided to process the initial data from NCBI. HiSIF supports all of the available three Hi-C protocols (Hi-C, TCC and in situ Hi-C). The tool is named from the fact that it is designed to find the significant interactions from a given sample of Hi-C reads. ## Citation ## Please cite our paper when you use this tool: Zhou, Y., Cheng, X., Yang, Y., Li, T., Li, J., Huang, T.H., Jin, V.X. (2020) Genome-wide chromatin interactions identify characteristic promoter-distal loops. Genome Medicine. Aug 12;12:69. https://doi.org/10.1186/s13073-020-00769-8 Thank you. ## System Requirements ## Operating System: Linux/Unix Cores: at least 32 RAM: 32GB minimum (will not be able to run in-situ) Hard Disk Space: >100GB (more as datasets get larger) # Pipeline Overview # 1) Map using bowtie2 or similar software 2) convert the BAM file format to a 6-column file format 3) use 'split -l' to split the produced file into many for parallelization 4) use the 'proc' tool to sort these text files, producing a directory of chr files 5) use HiSIF with specific parameters ## Install HiSIF ## Simply download the .zip of this project, cd into the unzipped directory, and type make. This will create 2 executables in the bin directory: 1) proc - used for preprocessing of data 2) HiSIF - binary for algorithm ## Quick Start ## After configuration file config_hisif.txt is changed as necessary, perform the following: chmod 755 runhisif.sh ./runhisif.sh config_hisif.txt Configuration file config_hisif.txt including: SAMTOOLS: path for samtools HISIF: path for HiSIF HICPROBAMPATH: the bam file path from HiC-Pro output, leave it empty when HiC-Pro output is not used. BAMPATH: the path with multiple bam files, leave it empty when not multiple bam files BAMFILE: the path and file name of one bam file OUTPUT: the path for saving the HiSIF output SAMPLE: the name of the sample PREPROCESSING: "True" need preprocessing, "False" doesn't need preprocessing REFERENCE_GENOME: the path for reference genome CUTTING_FRAGMENTS: the genome enzyme cutting sites fragments READ_LENGTH: sequencing reading length CUTTING_SITE_EXTENT: the extent of cutting site, default 500 BIN_SIZE: bin size, normally HindIII is 20000, MboI is 3000 FRAGMENT_SIZE: Output fragment size, normally is the resolution THRESHOLD_FIRST: first value of SIF threshold value, default is 1 THRESHOLD_LAST: last value of SIF threshold value, default 5 PERCENTAGE: Percentage of dataset for bootstrapping POISSON_MIXTURE1: Poisson mixture model parameter 1, default 1 POISSON_MIXTURE2: Poisson mixture model parameter 2, default 29 ITERATIONS: Bootstrapping iterations, default 2 CHILD_PROCESSES: Limit number of child processes used, default 0 (no limt) LOGFILE: the log file name, default runhisif.log ## Customized Running ## The running could be customized by user referring to the provided Linux shell file runhisif.sh Or refer to the following introduction. *There are two steps to use HiSIF: pre-processing, and running HiSIF.* ## Pre-Processing ## This method assumes mapping with bowtie2 or a similar tool has been done. Using the SAM/BAM files from the mapping, transfer them to 6-column text file: chr1 pos1 strand1 chr2 pos2 strand2 Please note strand is 1 for positive strand and 0 for negative strand. Each chromosome need only the number and chrX is 23 and chrY is 24. That means chr1/chr2 must be 1-24. Note that with very large output files, the 'proc' executble will fork for each file. It is suggested to split this file into 10 files, to speedup pre-processing. ## Creating the chr-by-chr files ## Before using the 'proc' program, it is suggested to split up the text file into many different files (at least 10), using 'split -l'. The file prefixes when finished are: chr1.tmp chr2.tmp ... chrX.tmp Usage: proc: <-r or -t> -t traditional 6-column file format -r rao format Note: for some in-situ datasets, they produce a different formatting. Please refer to the FORMATS file for more information, and look at how the python script is used. ## Running HiSIF ## Program: HiSIF - HiC Significant Interaction Fragments Version: 1.0.0 HiSIF [options] -g reference genome directory -c cutting sites map .bed file -p poisson mixture model parameters -w readLength, cuttingSiteExtent, binSize -t peakThreshold value, 1, 1.5, 2 and so on -s <0.0-1.0> percentage of dataset for bootstrapping, default is 1 -i bootstrapping iterations, default is 50 -f output fragment size, default is the same as binSize -x limit number of child processes used, default is 0 (no limit) For example: Run the following for HindIII digested Hi-C experiments bin/HiSIF -g -c -p 1 29 -w 50 500 20000 -t 1 -i 2 chrfiles Run the following for MboI digested Hi-C experiments bin/HiSIF -g -c -p 1 29 -w 50 500 3000 -t 1 -i 2 chrfiles ## Results ## All results are prefixed with (SAMPLE)_t(THRESHOLD)_, threshold could be 1, 1.5, 2, 2.5, 3 and so on. (SAMPLE)_t(THRESHOLD)_peak.txt: the significant interaction fragments with columns: chr1 start1 end1 chr2 start2 end2 counts FDR This is the major result, normally users only need this result for the subsequent analysis. (SAMPLE)_t(THRESHOLD)_BootStrapping.txt: system parameters for boot strapping (SAMPLE)_t(THRESHOLD)_maxIteration.txt: system parameters for iteration (SAMPLE)_t(THRESHOLD)_PerChr.txt: number of enzyme cutting fragment for each chromosome (SAMPLE)_t(THRESHOLD)_randomDis.txt: random distribution of sequencing reads (SAMPLE)_t(THRESHOLD)_randomDisProb.txt: random distribution probability of sequencing reads ## Troubleshooting ## 1. When proc is running, there is maybe an error: Error readInteractingRegions:: read error. >>>: Is a directory This is the system reminding of reading the current directory, which does not affect successfully performing. 2. When HiSIF is running, there are maybe some errors like: Error: could not read the first line : Is a directory This is the system reminding of reading some sub-directories in the reference genome, which does not affect successfully performing. 3. Error of ”°Segmentation fault”±: IF too many reads overburden the memory, the segmentation fault error will occur. Please run HiSIF with chromosomes individually to reduce the memory requirements.