Gene Data Processing 33: Avocado Run Log (Reference Genome)
1. Data download:
The test SAM file comes from avocado's test resources.

2. Preprocessing:
cat Homo_sapiens_assembly19.fasta | grep -i -n '>' > Homo_sapiens_assembly19Head.txt
cat Homo_sapiens_assembly19Head.txt
cat Homo_sapiens_assembly19.fasta | head -34770016 | tail -787820 > Homo_sapiens_assembly19chr20.fasta
hadoop fs -put Homo_sapiens_assembly19chr20.fasta /xubo/ref

grep -i -n '>' records the line number of every FASTA header. Based on that list, head -34770016 | tail -787820 keeps lines 33982197 through 34770016 of the full FASTA, which here is the ">20" header plus the chromosome 20 sequence up to the next header; the resulting single-chromosome FASTA is then uploaded to HDFS. The sketch below shows one way to derive these two numbers from the header list.
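The following is a minimal sketch (not from the original post) of how the head/tail arguments could be computed from the header list; it assumes the grep -n output has the form "lineNumber:>name ..." and that the target contig is "20".

import scala.io.Source

// Derive the `head`/`tail` arguments for cutting one contig out of the FASTA,
// given the header index produced by: cat Homo_sapiens_assembly19.fasta | grep -i -n '>'
object Chr20Range {
  def main(args: Array[String]): Unit = {
    val target = "20" // contig to extract (chromosome 20 in this post)
    // Each header-index line looks like "33982197:>20 ...": line number, a colon, then the header.
    val headers = Source.fromFile("Homo_sapiens_assembly19Head.txt").getLines().toVector
      .map { line =>
        val colon = line.indexOf(':')
        val lineNo = line.substring(0, colon).toLong
        val name = line.substring(colon + 1).stripPrefix(">").split("\\s+").head
        (lineNo, name)
      }
    val pos = headers.indexWhere(_._2 == target)
    val start = headers(pos)._1 // line of the ">20" header
    val end =
      if (pos + 1 < headers.length) headers(pos + 1)._1 - 1 // line just before the next header
      else sys.error("last contig: use the total line count of the FASTA instead") // not needed for chr20
    // head -N | tail -M keeps lines N-M+1 .. N of the input.
    // With this post's header list this should print: head -34770016 | tail -787820
    println(s"head -$end | tail -${end - start + 1}")
  }
}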
3. Run record:
hadoop@Master:~/xubo/data/testTools/avocado$ avocado-submit /xubo/avocado/NA12878_snp_A2G_chr20_225058.sam /xubo/ref/Homo_sapiens_assembly19chr20.fasta /xubo/avocado/test201605281620AvocadoZidai6 /home/hadoop/xubo/data/testTools/basic.properties
Using SPARK_SUBMIT=/home/hadoop/cloud/spark-1.5.2//bin/spark-submit
Loading reads in from /xubo/avocado/NA12878_snp_A2G_chr20_225058.sam
[Stage 8:> (0 + 2) / 3]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

4. Results:
The output contains two genotype calls on chromosome 20 (the second record is shown abridged). The second call is the expected A-to-G SNP at position 225058 (0-based start 225057), matching the input file NA12878_snp_A2G_chr20_225058.sam. In both calls the middle value of genotypeLikelihoods, conventionally the heterozygous genotype, is the largest, as the short sketch below illustrates.
{"variant": {"variantErrorProbability": null,"contig": {"contigName": "20","contigLength": 63025520,"contigMD5": null,"referenceURL": null,"assembly": null,"species": null,"referenceIndex": null},"start": 224970,"end": 224971,"referenceAllele": "G","alternateAllele": "A","svAllele": null,"isSomatic": false},"variantCallingAnnotations": {"variantIsPassing": null,"variantFilters": [],"downsampled": null,"baseQRankSum": null,"fisherStrandBiasPValue": 0.07825337,"rmsMapQ": 60.0,"mapq0Reads": null,"mqRankSum": 0.0,"readPositionRankSum": null,"genotypePriors": [],"genotypePosteriors": [],"vqslod": null,"culprit": null,"attributes": {}},"sampleId": "NA12878","sampleDescription": null,"processingDescription": null,"alleles": ["Ref","Alt"],"expectedAlleleDosage": null,"referenceReadDepth": 3,"alternateReadDepth": 5,"readDepth": 8,"minReadDepth": null,"genotypeQuality": 2147483647,"genotypeLikelihoods": [-32.696815,-5.5451775,-53.880547],"nonReferenceLikelihoods": [-32.696815,"strandBiasComponents": [],"splitFromMultiAllelic": false,"isPhased": false,"phaseSetId": null,"phaseQuality": null}
{"variant": {"variantErrorProbability": null,"start": 225057,"end": 225058,"referenceAllele": "A","alternateAllele": "G","fisherStrandBiasPValue": 0.79760426,"rmsMapQ": 59.228653,"mqRankSum": -0.23090047,"referenceReadDepth": 49,"alternateReadDepth": 41,"readDepth": 90,"genotypeLikelihoods": [-507.5003,-62.383247,-409.40555],"nonReferenceLikelihoods": [-507.5003,"phaseQuality": null}
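As an illustration (not part of the original post), the snippet below picks the most likely genotype from the genotypeLikelihoods array of the second call; the index-to-genotype ordering (homozygous reference, heterozygous, homozygous alternate) is an assumed convention and is not stated in the output itself.

// Pick the most likely genotype from avocado's genotypeLikelihoods array.
// The values are the log-scaled likelihoods of the second call above; the assumed
// ordering is [homozygous reference, heterozygous, homozygous alternate].
object BestGenotype {
  def main(args: Array[String]): Unit = {
    val likelihoods = Array(-507.5003, -62.383247, -409.40555)
    val labels = Array("0/0 (hom ref)", "0/1 (het)", "1/1 (hom alt)")
    val best = likelihoods.indices.maxBy(likelihoods(_))
    // Should print "0/1 (het)", consistent with the A-to-G SNP expected at position 225058.
    println(s"best genotype: ${labels(best)}, log-likelihood ${likelihoods(best)}")
  }
}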
Appendix:
package org.bdgenomics.avocado.cli

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}
import org.bdgenomics.adam.rdd.ADAMContext._

/**
 * Created by xubo on 2016/5/27.
 * Reads the avocado-called data back from HDFS.
 */
object parquetRead { ... }
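The body of parquetRead is elided above ("..."). Purely as a sketch of what such a reader might look like, and not the author's original code, the version below loads the genotype Parquet written by the avocado-submit run in section 3 with Spark SQL and prints a few columns; the column names are taken from the bdg-formats Genotype fields visible in the JSON output of section 4.

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}

// Sketch: read the genotype Parquet that avocado wrote to HDFS and show a few columns.
object parquetReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("parquetRead")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Output directory used by the avocado-submit run in section 3.
    val df = sqlContext.read.parquet("/xubo/avocado/test201605281620AvocadoZidai6")

    // Nested field names follow the Genotype schema seen in the JSON above.
    df.select("variant.contig.contigName", "variant.start", "variant.end",
        "variant.referenceAllele", "variant.alternateAllele",
        "readDepth", "genotypeLikelihoods")
      .show(10, false)

    sc.stop()
  }
}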
(2) Insufficient memory: this attempt used the chr22 reference instead; the job aborts because the reads lie on chromosome 20 and their regions do not map to any contig in that reference. A small pre-flight check for this kind of mismatch is sketched at the end of this post.
hadoop@Master:~/xubo/data/testTools/avocado$ avocado-submit /xubo/avocado/NA12878_snp_A2G_chr20_225058.sam /xubo/ref/Homo_sapiens_assembly19chr22.fasta /xubo/avocado/test201605281620AvocadoZidai5 /home/hadoop/xubo/data/testTools/basic.properties
Using SPARK_SUBMIT=/home/hadoop/cloud/spark-1.5.2//bin/spark-submit
Loading reads in from /xubo/avocado/NA12878_snp_A2G_chr20_225058.sam
16/05/28 19:18:10 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 6)
java.lang.IllegalArgumentException: requirement failed: Received key (ReferenceRegion(20,225058,225059,Independent)) that did not map to a known contig.
Contigs are: 22
    at scala.Predef$.require(Predef.scala:233)
    at org.bdgenomics.adam.rdd.GenomicPositionPartitioner.getPartition(GenomicPartitioners.scala:81)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:121)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
16/05/28 19:18:10 ERROR TaskSetManager: Task 0 in stage 7.0 failed 1 times; aborting job
Command body threw exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 6, localhost): java.lang.IllegalArgumentException: requirement failed: Received key (ReferenceRegion(20,225058,225059,Independent)) that did not map to a known contig.
Contigs are: 22
    at scala.Predef$.require(Predef.scala:233)
    at org.bdgenomics.adam.rdd.GenomicPositionPartitioner.getPartition(GenomicPartitioners.scala:81)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:121)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 6, localhost): java.lang.IllegalArgumentException: requirement failed: Received key (ReferenceRegion(20,225058,225059,Independent)) that did not map to a known contig.
Contigs are: 22
    at scala.Predef$.require(Predef.scala:233)
    at org.bdgenomics.adam.rdd.GenomicPositionPartitioner.getPartition(GenomicPartitioners.scala:81)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:121)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1055)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:998)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:938)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:930)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:930)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:930)
    at org.apache.spark.rdd.InstrumentedPairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(InstrumentedPairRDDFunctions.scala:485)
    at org.apache.spark.rdd.InstrumentedPairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(InstrumentedPairRDDFunctions.scala:485)
    at org.apache.spark.rdd.InstrumentedPairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(InstrumentedPairRDDFunctions.scala:485)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.apache.spark.rdd.InstrumentedRDD$.recordOperation(InstrumentedRDD.scala:378)
    at org.apache.spark.rdd.InstrumentedPairRDDFunctions.saveAsNewAPIHadoopFile(InstrumentedPairRDDFunctions.scala:484)
    at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$adamParquetSave$1.apply$mcV$sp(ADAMRDDFunctions.scala:75)
    at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$adamParquetSave$1.apply(ADAMRDDFunctions.scala:60)
    at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$adamParquetSave$1.apply(ADAMRDDFunctions.scala:60)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.bdgenomics.adam.rdd.ADAMRDDFunctions.adamParquetSave(ADAMRDDFunctions.scala:60)
    at org.bdgenomics.avocado.cli.Avocado$$anonfun$run$1.apply$mcV$sp(Avocado.scala:229)
    at org.bdgenomics.avocado.cli.Avocado$$anonfun$run$1.apply(Avocado.scala:229)
    at org.bdgenomics.avocado.cli.Avocado$$anonfun$run$1.apply(Avocado.scala:229)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.bdgenomics.avocado.cli.Avocado.run(Avocado.scala:228)
    at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:54)
    at org.bdgenomics.avocado.cli.Avocado.run(Avocado.scala:82)
    at org.bdgenomics.utils.cli.BDGCommandCompanion$class.main(BDGCommand.scala:32)
    at org.bdgenomics.avocado.cli.Avocado$.main(Avocado.scala:52)
    at org.bdgenomics.avocado.cli.Avocado.main(Avocado.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Received key (ReferenceRegion(20,225058,225059,Independent)) that did not map to a known contig.
Contigs are: 22
    at scala.Predef$.require(Predef.scala:233)
    at org.bdgenomics.adam.rdd.GenomicPositionPartitioner.getPartition(GenomicPartitioners.scala:81)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:121)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
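Since the failure above comes from a reference that does not contain the reads' chromosome, a quick pre-flight check can compare the contig names declared in the SAM header with those in the reference FASTA before submitting. The following is a hedged sketch using plain Spark text reads; the HDFS paths are the ones from this post, and it assumes the alignments are in text SAM format with @SQ header lines (as here), not BAM.

import org.apache.spark.{SparkConf, SparkContext}

// Pre-flight check: every contig referenced by the SAM header (@SQ SN:...) should also
// appear as a FASTA header (">name") in the reference, otherwise avocado's
// GenomicPositionPartitioner fails with "did not map to a known contig".
object ContigCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ContigCheck"))

    // Contig names declared in the reference FASTA (header text before the first whitespace).
    val refContigs = sc.textFile("/xubo/ref/Homo_sapiens_assembly19chr22.fasta")
      .filter(_.startsWith(">"))
      .map(_.stripPrefix(">").split("\\s+").head)
      .collect()
      .toSet

    // Contig names declared in the SAM header's @SQ lines (the SN:<name> field).
    val samContigs = sc.textFile("/xubo/avocado/NA12878_snp_A2G_chr20_225058.sam")
      .filter(_.startsWith("@SQ"))
      .flatMap(_.split("\t").find(_.startsWith("SN:")).map(_.stripPrefix("SN:")))
      .collect()
      .toSet

    val missing = samContigs -- refContigs
    if (missing.nonEmpty)
      println(s"Contigs in the SAM but not in the reference: ${missing.mkString(", ")}")
    else
      println("All SAM contigs are present in the reference.")

    sc.stop()
  }
}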