Hello, it has been a while. For the last few weeks, I have been busy with finals and AP testing. Now that those are finally over, I can share some of the progress I made this weekend.
First, I watched a four-hour course on data analysis using Python. The course covered Python libraries such as NumPy and Pandas. The NumPy library supports large arrays and matrices, which can be used to store numbers efficiently. The Pandas library also allows storage in a manner similar to that of arrays. One storage structure used in Pandas is called a Series. When storing in an array, the stored items are ordered by an index. These indices are integers starting from zero. In a Series, a user can either use the default integer indices or supply custom labels. For example, if a list of the number of mutations in each of four genes is stored in the Series mutationnum = [57, 12, 34, 28], then the user can either use the numerical index for each number (0 marks 57 mutations, 1 marks 12 mutations, 2 marks 34 mutations, and 3 marks 28 mutations), or the user can manually input index names so that each mutation number can be looked up using the name of its gene.
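To make the Series idea concrete, here is a minimal sketch using the mutation counts from the example above. The gene names (GENE_A through GENE_D) are placeholders I made up for illustration, not real genes from any dataset.

```python
import pandas as pd

# Mutation counts from the example above
counts = [57, 12, 34, 28]

# With no index given, Pandas labels the items 0, 1, 2, 3
mutationnum = pd.Series(counts)
print(mutationnum[0])  # look up by the default integer index

# Hypothetical gene names used as custom index labels
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
mutations_by_gene = pd.Series(counts, index=gene_names)
print(mutations_by_gene["GENE_B"])  # look up by gene name
```

The second Series holds the same numbers as the first. Only the labels used to look them up are different.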
After watching the Python course, I set to work downloading data from the Cancer Genome Atlas BRCA data collection. I downloaded two different files, one containing patient information and another containing information about mRNA expression within the patients. I loaded these into Jupyter, a browser-based Python notebook environment, as Pandas tables and did some preliminary analysis to practice using Python. Using the .info() function, I saw that the patient information file had 1101 rows, meaning that there are 1099 patients in the study, since the first two rows are table headers. This file also had 110 columns, meaning that there were 109 columns of different facts about each patient. The mRNA file showed that there were 1100 patients. This file also showed that it covers 20,530 different genes. Using the .value_counts() function on the prior diagnosis column of the patient information file, I found that of the patients for whom this information is available, 1028 have not had a prior cancer diagnosis and 68 have.
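Here is a rough sketch of what those two functions do, run on a tiny made-up table instead of the real patient file. The column names and values are invented for illustration and are not the actual TCGA column names.

```python
import pandas as pd

# Toy stand-in for the patient information file; the column names and
# values are invented and do not come from the TCGA data.
patients = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3", "P4", "P5"],
    "prior_diagnosis": ["No", "No", "Yes", None, "No"],
})

# info() prints the number of rows, the column names, how many entries
# in each column are non-missing, and each column's data type
patients.info()

# shape gives the (rows, columns) counts directly
n_rows, n_cols = patients.shape
print(n_rows, n_cols)

# value_counts() tallies each distinct value in a column and leaves out
# missing entries by default
prior_counts = patients["prior_diagnosis"].value_counts()
print(prior_counts)
```

Because missing entries are left out of the tally, the counts can add up to fewer than the total number of rows, which is why the real numbers above only cover patients for whom the information is available.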
The next order of business is probably matching the gene IDs in the mRNA expression data to the gene IDs of known transcription factor coding regions. To do this, I will first need to find a list of these known transcription factors and download it so that I can compare the gene IDs in Python. This will allow me to compare mutation numbers between genes that do and do not code for transcription factors.
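I have not done this step yet, but one possible way to do the matching in Pandas is sketched below. Everything in it is an assumption: the gene IDs are placeholders, the numbers are invented, and I am assuming the table has one row per gene and one column per patient, which may not match the layout of the real file.

```python
import pandas as pd

# Toy table with one row per gene and one column per patient.
# Gene IDs and values are invented for illustration.
expression = pd.DataFrame(
    {"patient_1": [2.0, 4.0, 6.0, 8.0], "patient_2": [1.0, 3.0, 5.0, 7.0]},
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# Hypothetical transcription factor list; GENE_X is not in the table
tf_gene_ids = ["GENE_B", "GENE_D", "GENE_X"]

# isin() marks each row True if its gene ID appears in the list
is_tf = expression.index.isin(tf_gene_ids)

# Split the table into the two groups to compare
tf_genes = expression[is_tf]
non_tf_genes = expression[~is_tf]

print(tf_genes)
print(non_tf_genes)
```

IDs that appear in the transcription factor list but not in the table, like GENE_X here, are simply ignored, so it would be worth checking how many IDs fail to match before comparing the two groups.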
Thank you so much for reading and please stay tuned for future updates!