Genetic Switches, Our Cells, and Cancer

Apr 21, 2025

Earlier this year, I was applying to some summer internships. One of them was with the Institute for Systems Biology (ISB). As I was reading up on ISB and the Baliga Lab, I stumbled upon their research on the importance of gene regulatory networks and transcription factors. The specific paper that piqued my interest was about how a mix of experimental and computational tools was used to map transcription factor activity in Mycobacterium tuberculosis. This got me wondering whether similar techniques could be applied to cancer prognosis. Since a lot of public cancer databases exist, such as The Cancer Genome Atlas, I started to wonder whether I could try answering these questions myself.

This week, I read a bunch more articles about genetic regulation via transcription factors in the context of cancer:

ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context by Margolin et al.
Functional Characterization of Somatic Mutations in Cancer Using Network-Based Inference of Protein Activity by Alvarez et al.
A Single-Cell Tumor Immune Atlas for Precision Oncology by Nieto et al.
Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics by Ding et al.
Intricate Genetic Programs Controlling Dormancy in Mycobacterium tuberculosis by Peterson et al.
Transcriptional Network Modulated by the Prognostic Signature Transcription Factors and Their Long Noncoding RNA Partners in Primary Prostate Cancer by Jiang et al.

I learned some things from reading these articles:

Firstly, I got an understanding of the way gene regulation works. Only part of our DNA actually codes for proteins. Other regions are regulatory, meaning that they control whether a given gene will be expressed. Typically, there is a network of such regulatory regions that ultimately controls the genes that code for proteins. Transcription factors are the proteins that tie this network together: they bind to sites in the regulatory regions and switch the nearby coding genes on or off.

Cancer is the result of errors in the regulation of the cell cycle that lead to uncontrolled cell division. Many, many genes (tumor suppressor genes and proto-oncogenes) are involved in cancer. Thus, transcription factor binding data can help predict exactly which genes are involved in a given cancer.

Secondly, I learned that not all transcription factors are created equal. The more binding sites a gene has, the more transcription factors can bind to it. The more transcription factors that can bind to a gene, the more strongly that gene will be expressed. Aside from binding sites, the activity of a transcription factor also affects gene expression. It doesn’t matter how many binding sites a gene has if the transcription factors are not active enough to actually reach and bind those sites. Once transcription factor activity is observed, it can be used to determine which genes are being expressed as a result. Then, those genes can be used to form hypotheses about patient risk. Basically, the harmful genes that are expressed the most are the most important ones for doctors to target. This can help researchers develop targeted treatments and help clinicians administer them.

Finally, there is a lot of precedent for using computer algorithms to map and analyze genome data. Programs like ARACNE infer which genes interact (like genetic switches) from expression data, and programs like VIPER estimate transcription factor activity from the expression of each factor's target genes.
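To make the ARACNE idea a little more concrete, here is a toy sketch I put together of its two core steps as I understand them: score every pair of genes by mutual information, then use the data processing inequality to drop the weakest edge in every fully connected triangle, since that edge is probably an indirect interaction. This is not the real ARACNE code. The function names, the three-bin discretization, the 0.1 threshold, and the expression values are all assumptions I made up for illustration, and the real tool estimates mutual information much more carefully.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, bins=3):
    """Assign each sample to an equal-frequency bin based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * bins // len(values)
    return labels


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete label sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def infer_network(expression, threshold=0.1):
    """expression maps gene name -> list of per-sample expression values."""
    genes = list(expression)
    binned = {g: discretize(v) for g, v in expression.items()}
    mi = {
        frozenset((a, b)): mutual_information(binned[a], binned[b])
        for a, b in combinations(genes, 2)
    }
    # Step 1: keep only pairs whose mutual information clears the threshold.
    edges = {pair for pair, score in mi.items() if score > threshold}
    # Step 2: data processing inequality. In every triangle of kept edges,
    # mark the weakest edge as a likely indirect interaction.
    indirect = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in edges for edge in triangle):
            indirect.add(min(triangle, key=lambda edge: mi[edge]))
    return edges - indirect


# Made-up data: TF drives G1, and G1 drives G2, so TF-G2 is only indirect.
toy_expression = {
    "TF": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "G1": [1, 2, 3, 5, 4, 6, 7, 8, 9, 10, 11, 12],
    "G2": [1, 2, 3, 5, 4, 6, 7, 9, 8, 10, 11, 12],
}
network = infer_network(toy_expression)
print(sorted(sorted(edge) for edge in network))
```

In this toy example all three pairs clear the threshold, but the TF-G2 edge is the weakest side of the triangle, so it gets pruned and only the direct links remain.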

But how does this connect to my project?

Since transcription factors can be used to predict which genes have a big impact on cancer, if I can use machine learning to predict which transcription factors are most active in breast cancer, I could predict which gene networks are most active. This data can help researchers design experiments to confirm the findings.

Machine learning is well suited to this kind of predictive task, so I’m planning to use it to analyze the available breast cancer data. I chose breast cancer for a few reasons:

There’s a large, publicly available data set for breast cancer.
Breast cancer has a lot of subtypes, and not all of them are well understood. While some kinds of breast cancer are tied to known genes (BRCA1 and BRCA2), most breast cancer cases are not identifiable in this way. Gene tests like Oncotype DX and MammaPrint can help doctors decide if a patient needs chemotherapy, but they only apply to some types of breast cancer. And even then, they help in only about 30–40% of those patients.
While 60-70% of breast cancers are detected early, 20-25% of patients with breast cancer see their cancer come back. So if I find something, it could make a meaningful difference.

My hope is to use a transcriptional regulatory network, a map of how transcription factors control genes, to train a machine learning model that can categorize breast cancer patients as high-risk or low-risk for fast onset or poor outcomes.

By focusing on transcription factors, I hope that my model will not only predict risk but also point to the biological reasons behind that risk.
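Here is a rough sketch of what the first step of that plan might look like: turning a regulatory network plus patient expression data into one activity score per transcription factor per patient, which a classifier could later use as features. To be clear, this is a simplified stand-in I made up and not VIPER. The scoring rule (the average signed z-score of a factor's targets), the function names, and the gene and factor names are all hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression across patients."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def tf_activity(expression, regulons):
    """Score each transcription factor in each patient.

    expression: gene -> list of per-patient expression values
    regulons:   factor -> {target gene: +1 if activated, -1 if repressed}
    Returns factor -> list of per-patient activity scores.
    """
    z = {gene: zscores(values) for gene, values in expression.items()}
    n_patients = len(next(iter(expression.values())))
    activity = {}
    for tf, targets in regulons.items():
        # Ignore targets that were not measured in this data set.
        measured = {g: sign for g, sign in targets.items() if g in z}
        activity[tf] = [
            mean(sign * z[g][p] for g, sign in measured.items())
            if measured else 0.0
            for p in range(n_patients)
        ]
    return activity


# Made-up example: TF_X activates GENE_A and represses GENE_B.
toy_expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
}
toy_regulons = {"TF_X": {"GENE_A": +1, "GENE_B": -1}}
scores = tf_activity(toy_expression, toy_regulons)["TF_X"]
print([round(s, 2) for s in scores])
```

The nice thing about features like these is that each one maps back to a named transcription factor, so if a model leans heavily on one of them, that factor becomes a concrete lead to look into. Training the actual high-risk versus low-risk model would still need real outcome labels for each patient, which this sketch does not include.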

Thank you for reading!

Molecules and Machines
