Sara Lewis, PhD
- Sep 27, 2020

Behind the scenes of gene discovery


When scientists talk about their research, they tend to be results-oriented. This is for a good reason: the research process is long, and details seem trivial in comparison to the thrill of discovery. Scientists like to focus on WHY they do research, not HOW. As science has advanced, research methods and terminology have become increasingly specialized. Here, I explain the scientific process behind the gene discovery work that we do in the Kruer laboratory.

How do you start a gene discovery research project?

You might not realize it, but the process of getting a research project ready for clinical trials begins almost two years before patients are even recruited!

Scientific research cannot happen without funding, and it can take years to secure funding from government and foundations through competitive grants. Scientists spend months developing a research project and conducting the preliminary studies to demonstrate they have the best team for those studies. The head of the Molecular and Cellular Neurogenetics laboratory, Michael Kruer, MD, does a lot of work writing grants and securing funding, incorporating the team members’ ideas, feedback, and preliminary data. Then there is a period of several months during which grants are reviewed and funding decisions are made. The most important step here is to convince grant reviewers that the project has scientific merit and will significantly benefit society. With only 10% of grant proposals funded, scientists often revise and resubmit each grant several times, incorporating feedback from reviewers and colleagues to make their proposal stronger.

With the budget cuts of the last several years, funded grants may undergo negotiations to lower project costs, trying to avoid impacting the quality of the science. This may decrease the number of personnel, participants, or follow-up experiments. Unfortunately, it takes time to alter proposals, thus scientists also apply for many small grants to keep their laboratory funded and running.

For genomics studies, scientists also need to set up a computational infrastructure. Each exome, which is the part of the genome that encodes the protein sequences, comprises 30 million base pairs. It takes 8 gigabytes of space to store one exome on a computer, which is the equivalent of 8 standard movies. Scientists need lots of computing power to compare the genome sequence of a patient against those of their father and mother. Computers need 24 TB of space to store the raw data from 500 trios (patient, mother, and father, with backups), which is equivalent to 1.3 times the entire Library of Congress! In addition to our big data infrastructure, we add numerous layers of encryption and security to keep all our information safe. We take participant privacy very seriously.
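The storage figure above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes that "with backups" means one extra copy of every file and that 1 TB is counted as 1,000 GB. Neither assumption is stated in the text, and the function name is made up for this example.

```python
GB_PER_EXOME = 8       # from the text: one exome takes about 8 GB
PEOPLE_PER_TRIO = 3    # patient, mother, father
COPIES = 2             # assumption: the original data plus one backup copy


def storage_tb(trios, gb_per_exome=GB_PER_EXOME, copies=COPIES):
    """Estimate raw-data storage in terabytes (decimal, 1 TB = 1000 GB)."""
    total_gb = trios * PEOPLE_PER_TRIO * gb_per_exome * copies
    return total_gb / 1000


if __name__ == "__main__":
    print(storage_tb(500))  # 500 trios, as in the text
```

Under these assumptions, 500 trios come to 12 TB of raw data, or 24 TB once the backup copy is counted.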

Before beginning, each proposal needs to be approved by an ethics board. The Institutional Review Board (IRB) makes sure that the rights and welfare of patients are protected throughout the study and after it is completed. The board is composed of scientists, employees of Phoenix Children’s Hospital and other collaborating institutions, and members of the community. Among other things, the board reviews how the research project is presented to the patients and their families to make sure that they are fully aware of the project’s implications and understand what they are agreeing to. The board also guides how information is shared between team members and with outside scientists, clinicians, government databases, and the public. This is to ensure that patients’ information is always protected. Our clinical coordinator, Bethany Norton, works hard with the IRB office to make sure we protect our research participants and provide them with the information they need about the study.

Once a patient has joined the study, what happens next?

One of the Kruer lab’s goals is to uncover how cerebral palsy can be genetic and what that means for patients [https://pubmed.ncbi.nlm.nih.gov/30963790/]. We hope to bring meaningful genetic testing closer to reality, and we are unique in that we conduct both genomic sequencing studies on patients with movement disorders and study the variants and genes in our laboratory.

We work with the biobank at Phoenix Children’s Hospital to extract genomic DNA from saliva samples. Samples are then checked to make sure there is enough DNA present to sequence successfully, and that the DNA has been properly preserved to prevent degradation. Our lab manager, Helen Magee, plays a crucial role in this process by keeping our inventory of samples organized and accessible. We send the DNA samples to a large genetic facility run by our collaborators at Yale University for whole exome sequencing [https://ghr.nlm.nih.gov/primer/testing/sequencing].

Exome sequencing data is made up of many short, overlapping reads of the regions in the genome that encode proteins. The reads are lined up against the human reference genome, so that each position has around 40 reads covering it; this number is called coverage. Many bases will differ from the reference genome; these differences are known as variants. Then the entire exome is compared against the mother’s and father’s exomes to understand how each variant was inherited. This sequencing work is done by Postdoctoral Fellow Somayeh Bakhtiari, PhD.
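To make coverage and variants more concrete, here is a toy sketch of the idea. It is not the pipeline used by the lab or the sequencing facility: it assumes the reads are already aligned with no gaps, uses a made-up 8-base reference, and flags a variant with a simple, arbitrary threshold (`min_fraction`). All names are hypothetical.

```python
from collections import Counter


def pileup(reference, reads):
    """Count the bases seen at each reference position.

    reads: list of (start_index, sequence) pairs, assumed to be
    already aligned to the reference without insertions or deletions.
    """
    columns = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1
    return columns


def call_variants(reference, reads, min_fraction=0.3):
    """Report (position, reference base, observed base, coverage) wherever
    a non-reference base makes up at least min_fraction of the reads."""
    variants = []
    for pos, counts in enumerate(pileup(reference, reads)):
        depth = sum(counts.values())  # coverage at this position
        for base, n in counts.items():
            if base != reference[pos] and depth and n / depth >= min_fraction:
                variants.append((pos, reference[pos], base, depth))
    return variants


if __name__ == "__main__":
    reference = "ACGTACGT"  # toy reference, not real sequence
    reads = [(0, "ACGT"), (2, "GTAC"), (2, "GAAC"), (4, "ACGT")]
    print(call_variants(reference, reads))
```

In this toy example, one of the three reads covering position 3 carries an A where the reference has a T, so that position is reported as a variant with a coverage of 3. Real exome data has much deeper coverage and is analyzed with far more careful statistics.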

In genetic studies, scientists look for rare alleles (alternative forms of the gene) in patients, with the hypothesis that patients will have genetic differences from their unaffected parents, and that these differences could help explain their disorder. Different possibilities include:

- De novo (new) variants that are not present in the mother or father

- Two copies of recessive variants that aren’t common in the population, either the same allele (2 copies = homozygous) or different alleles (1 each of different alleles = compound heterozygous)

- X-linked variants that are present in the mother and inherited by a male patient
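The three patterns above can be sketched as a small classifier. This is a simplified illustration, not the lab's actual analysis: genotypes are written as the number of copies of the variant allele (0, 1, or 2), the function names and labels are hypothetical, and the filter for how rare a variant is in the population is left out.

```python
def classify_variant(child, mother, father, chrom="autosome", child_sex="female"):
    """Label one variant by how a child could have inherited it.

    child, mother, father: copies of the variant allele (0, 1, or 2).
    """
    if child == 0:
        return "not carried"
    # A son gets his X chromosome from his mother only.
    if chrom == "X" and child_sex == "male":
        return "X-linked (maternal)" if mother >= 1 else "de novo"
    if mother == 0 and father == 0:
        return "de novo"
    if child == 2 and mother >= 1 and father >= 1:
        return "homozygous recessive"
    return "inherited heterozygous"


def is_compound_het(gene_variants):
    """True if a gene has one variant from each parent.

    gene_variants: list of (child, mother, father) allele counts
    for the different variants found in a single gene.
    """
    from_mother = any(c == 1 and m >= 1 and f == 0 for c, m, f in gene_variants)
    from_father = any(c == 1 and f >= 1 and m == 0 for c, m, f in gene_variants)
    return from_mother and from_father


if __name__ == "__main__":
    print(classify_variant(1, 0, 0))
    print(classify_variant(2, 1, 1))
    print(classify_variant(1, 1, 0, chrom="X", child_sex="male"))
    print(is_compound_het([(1, 1, 0), (1, 0, 1)]))
```

Compound heterozygosity has to be checked per gene rather than per variant, because it depends on two different variants, one inherited from each parent.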

How do we study candidate genes?

Most people don’t realize that everyone’s genome contains many variants, but the vast majority of them are not harmful. After sequencing, we still have a long list of candidate variants for each patient. How do we make sense of our findings?

Identifying cerebral palsy (CP) genes has been a huge challenge. We don’t have a lot of information about which genes can cause this disorder, creating quite the paradox: how do we know what to look for, and how do we know if we have found it? This is one of the biggest reasons why our study of 250 patients took several years to analyze [https://www.nature.com/articles/s41588-020-0695-1]. We had a long list of candidate variants, and we had to wait until many patients had been sequenced to see whether any gene carried variants in more than one patient. In the end, only a few genes were recurrent, so we had to use alternate approaches to identify good candidate genes. Postdoctoral Fellow Sheetal Shetty, PhD, used network biology to identify genes in shared pathways. For instance, several genes regulate the cytoskeleton, which is the part of the cell that regulates shape. We then showed that mutations in these genes can cause fruit flies to have trouble walking, meaning they really are important for movement disorders.
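Looking for recurrent genes amounts to counting how many different patients carry a candidate variant in each gene. The sketch below shows the idea with invented patient IDs and placeholder gene names; it is not the statistical analysis from the published study, and the function name and `min_patients` threshold are assumptions made for this example.

```python
from collections import defaultdict


def recurrent_genes(candidates, min_patients=2):
    """Return {gene: number of patients} for genes hit in several patients.

    candidates: iterable of (patient_id, gene) pairs, one per candidate variant.
    """
    patients_by_gene = defaultdict(set)
    for patient_id, gene in candidates:
        # A set counts each patient once, even with two variants in one gene.
        patients_by_gene[gene].add(patient_id)
    return {
        gene: len(patients)
        for gene, patients in patients_by_gene.items()
        if len(patients) >= min_patients
    }


if __name__ == "__main__":
    # Made-up data for illustration only.
    candidates = [
        ("P1", "GENE_A"), ("P1", "GENE_B"),
        ("P2", "GENE_A"), ("P3", "GENE_C"),
        ("P3", "GENE_C"),
    ]
    print(recurrent_genes(candidates))
```

Here only GENE_A counts as recurrent, because it is the only gene with candidate variants in two different patients. With few recurrent genes, approaches such as grouping genes by shared pathway become necessary.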

While we were able to narrow down the genes that might be most relevant to CP, we had a much harder time determining whether a variant was the cause of a patient’s CP. Although we used the latest bioinformatics tools to predict how candidate variants could alter protein function, we cannot be certain of their role until we test it. This is a major focus of Research Scientist, Sergio Padilla Lopez, PhD, who tests these candidate variants in yeast. To call a variant “pathogenic” or “disease causing”, the change needs to be confirmed experimentally in a laboratory. At this point in time, most of our candidate variants are new, because CP patients have yet to undergo extensive genetic sequencing. Studying what variants do is one of the biggest bottlenecks for returning results to patients.

Gene discovery takes a long time because scientists investigate whether new variants are involved in the pathology. In contrast, clinical sequencing is much quicker, as it only searches for the presence of known variants or genes from a list. In our next study, we will assess twice the number of patients to find more genes and learn about different types of CP, which will significantly extend the duration of the project. Given the worldwide research efforts in gene discovery, an increasing number of variants and genes are being discovered, accelerating our knowledge of genetic pathologies.

Where does the information go? Peer-reviewed publications

In health research, the main way to share scientific breakthroughs is via peer-reviewed publications. This is the standard for advancing scientific research and communicating with research and clinical communities [https://www.nature.com/articles/s41588-020-0695-1].

Peer reviewers are established scientists who are invited to read submitted manuscripts. They provide critical feedback on the scientific merit of the study, the analysis and presentation of the data, and the impact the study will have on the field. Peer reviewers make recommendations for acceptance or rejection of the manuscript. The most prestigious journals accept fewer than 10% of submissions. Even if a manuscript is accepted, reviewers will often ask authors to do additional analysis or experiments to strengthen the paper. These recommendations are typically made to rule out alternate interpretations of the data. To understand the scrutiny of peer review, imagine spending years doing a research project and writing a paper for your class to present your results. Then three experts on your research topic, potentially your top competitors, are asked to find any mistake you might have overlooked and grade your work. This is a hard but crucial process to go through. Scientists have high expectations for integrity, and they are dedicated to keeping each other accountable. This is why there is such a big difference in quality and trust between scientific results published through a peer review process (as in academic journals) and those that are not peer reviewed (such as blogs).

The peer review process can take several months or longer to complete before a manuscript is deemed ready for publication. Much of this time is spent responding to both the requests of the peer reviewers and the journal’s scientific rigor policies. Journal editors want to see raw data and have exacting standards for reporting methods and reagents to ensure reproducibility. While it can be frustrating to wait a few years before the results of a scientific study are made available to the public, it is part of the scientific process to require that everything be double-checked by knowledgeable experts before publication.

Where does the information go? Data sharing

To promote the advancement of science, NIH and other funding agencies encourage all funded investigators to share their data in a common database under the Genomic Data Sharing Policy [https://osp.od.nih.gov/wp-content/uploads/Responsible_Use_of_Human_Genomic_Data_Informational_Resource.pdf]. This policy presents the enormous benefits of shared data, including developing new tools and methods as well as increasing statistical power and cohort diversity. Sharing data can also help answer research questions that may arise in the future, thus potentially saving time and money. For example, studies investigating subtle effects such as genetic risk factors need to include a lot more patients than studies of Mendelian inheritance (where 1 or 2 ‘bad’ copies of the gene predict developing the disease). This policy also covers the protections required to ensure patients’ privacy.

Some of the databases we work with include AnVIL [https://www.genome.gov/Funded-Programs-Projects/Computational-Genomics-and-Data-Science-Program/Genomic-Analysis-Visualization-Informatics-Lab-space-AnVIL], dbGaP, a repository of all genomes studied for disease [https://www.ncbi.nlm.nih.gov/gap/], and ClinVar [https://www.ncbi.nlm.nih.gov/clinvar/]. ClinVar will become increasingly important in helping clinicians interpret variants in the future, as it provides a place to store information about all variants of unknown significance, along with the associated patient’s medical diagnosis. The government initiatives arising from the Human Genome Project have created tools and knowledge to help interpret meaning from raw genetic code. The scientific community is built on these important resources and agreements to share all scientific information openly, while protecting patients’ privacy by de-identifying the data and applying other security measures.

How does the research benefit participants, even if they never see their results? Why might I not get results returned?

We are only at the beginning of gene discovery, and we need more information to interpret most of the genes that we are discovering. Many of our findings are still preliminary, and need to be replicated either through identifying additional patients with similar mutations and phenotypes, or through in-depth study of the individual gene and variant function. We have a skilled and hardworking team, but not enough time to pursue every variant.

Another limitation is that we are a research laboratory, not a clinically certified lab. We use novel tools and techniques to generate hypotheses that can be tested in future studies. In contrast, clinical sequencing facilities focus on verifiable results that have real-world impact for patients. They work in a team of genetic counselors and clinical geneticists to bring results and offer educational resources to patients and families.

Despite this, we still dedicate a portion of our efforts towards helping patients when we can. In a recent article, we found patients with mutations in genes with known treatments, including GNB1 (ethosuximide), CTNNB1 (levodopa), and AMPD2 (5-aminoimidazole-4-carboxamide riboside). Even if we cannot return results to everyone right now, these studies are a major step towards our goal of bringing genetic testing of cerebral palsy closer to reality.

How do you study novel genes?

We collaborate with other labs and clinicians to elucidate what these genes might be doing in the context of disease. We have around a dozen genes that we are examining in one of the ways described below. However, these studies also take time, sometimes several years, to collect enough data to publish. They are a crucial step in going from a hypothetical risk gene to a carefully characterized disease gene with information on the likely underlying pathology. We investigate gene function using model organisms, by manipulating the variant and measuring the effect on function or physiology. This sort of genetic manipulation cannot be done in humans. For example, we have studied variants in genes such as EEF1A2 to assess how they disrupt function in yeast [https://pubmed.ncbi.nlm.nih.gov/32196822/]. We investigate changes to protein levels and localization. Using patients’ cells, we can study phenotypes, such as changes to cell size and shape, as well as functions like metabolism. We study locomotion and coordination in fruit flies to determine whether a gene of interest, such as ADD3, plays a role in movement [https://pubmed.ncbi.nlm.nih.gov/23836506/].

We also focus on patient studies. In a clinical report, we describe how a specific change to the KCNMA1 ion channel affected three patients. We were also able to show that a specific treatment, identified from studies of the variant protein in human cells, actually improved symptoms [https://pubmed.ncbi.nlm.nih.gov/32633875/].

In summary, there are a lot of steps in gene discovery research, and the process takes years. Our team brings together a lot of different expertise to make this happen and is working diligently to ask the follow-up questions needed to understand the genes and variants being discovered. We’re hoping this will change what sequencing options are available for patients with cerebral palsy in the future, as well as identify new therapies and pathways to target for drug development.
