Genomic data - cost-effective scaling in the cloud

Tal Franji
Language: English
video in English
The presentation was given on 2021.05.02 at PyCon Israel 2021.

Genomic sequencing and processing produce many terabytes of data. We'll present how a single-cell processing pipeline requires trade-offs between strong and eventual consistency that differ from those in traditional big-data systems.

Immunai runs a complex single-cell RNA sequencing pipeline. The ecosystem of computational-biology and machine-learning tools revolves around R and Python. We use cost-effective cloud storage for the large sequencing files and combine it with strongly consistent metadata. R/Python API users can retrieve the data indexed by any application-defined set of labels/features. We will discuss the trade-offs compared to other big-data platforms such as Apache Spark, Elasticsearch, etc.
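
To make the idea of label-indexed retrieval concrete, here is a minimal sketch and not the system described in the talk. It assumes the large files stay in object storage and are referenced only by URI, while a small transactional store holds the labels. SQLite stands in for the strongly consistent metadata store, and the names `open_index`, `register`, and `find`, as well as the example bucket paths, are hypothetical.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open the (hypothetical) metadata index; SQLite is a stand-in store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS labels ("
        " uri TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL,"
        " PRIMARY KEY (uri, key))"
    )
    return conn


def register(conn, uri, labels):
    """Attach labels to a file URI in one transaction (all or nothing)."""
    rows = [(uri, key, str(value)) for key, value in labels.items()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO labels (uri, key, value) VALUES (?, ?, ?)",
            rows,
        )


def find(conn, **wanted):
    """Return the sorted URIs whose labels match every key=value given."""
    if not wanted:
        cursor = conn.execute("SELECT DISTINCT uri FROM labels ORDER BY uri")
        return [row[0] for row in cursor]
    clauses = " OR ".join("(key = ? AND value = ?)" for _ in wanted)
    params = [item for key, value in wanted.items() for item in (key, str(value))]
    params.append(len(wanted))
    # (uri, key) is unique, so a URI matches all requested labels exactly
    # when its number of matching rows equals the number of requested labels.
    cursor = conn.execute(
        f"SELECT uri FROM labels WHERE {clauses} "
        "GROUP BY uri HAVING COUNT(*) = ? ORDER BY uri",
        params,
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    index = open_index()
    register(index, "s3://example-bucket/run1/sample_a.h5",
             {"tissue": "blood", "donor": "D1"})
    register(index, "s3://example-bucket/run1/sample_b.h5",
             {"tissue": "blood", "donor": "D2"})
    print(find(index, tissue="blood", donor="D2"))
```

In this sketch only the small label table needs transactional guarantees; the sequencing files themselves are written once to cheap object storage and looked up by the URIs the query returns.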