SLE111 2015_T3. Bioinformatics Assignment
Modified from © Deakin University, 2015, CELLS & GENES SLE111

SLE111, Bioinformatics Assignment
Gene Identification, an Introduction to Bioinformatics and Databases

Assessment
This assignment is marked out of 70, and is worth 7% of your final mark for the unit. It is to be submitted to the Moodle Link in Week 11. No submissions will be accepted after that time. Read this entire document, including the supplementary information at the end, before commencing the assignment.

Learning Outcomes (Aims):
1. To gain experience in using the National Center for Biotechnology Information website, which gives access to numerous nucleic acid and protein databases, and much more.
2. To identify the gene that an unknown sequence of DNA (the gene fragment) is part of.
3. To appreciate the differences between nucleotide sequence (e.g., of a gene) and amino acid sequence (e.g., of a protein), and discern properties of such sequences.
4. To comprehend and synthesise information from scientific papers and cite these papers in the proper format.
5. To manage your time effectively.

Introduction/Background:
You are a member of a laboratory that has sequenced DNA from a gene fragment derived from a mixed sample of pond organisms. It is your job to find the meaning behind the unknown DNA sequence. Your supervisor has asked you to analyse the sequence and answer some related questions.

A website you need to familiarise yourself with is that of the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) and, within that site, you will be using NCBI BLAST: http://blast.ncbi.nlm.nih.gov/

BLAST stands for Basic Local Alignment Search Tool. It is a program that compares a query sequence of nucleotides or amino acids (depending on which type of BLAST is used) against all the nucleotide and amino acid sequences that have been lodged in genetic databases worldwide.
The ‘alignment’ refers to how the comparisons are made: that is, by aligning your query sequence against all sequences in the database and looking for the best match. A BLAST search is the standard way for biologists to identify sequences of genes or proteins, or their closest relatives.

The unknown sequence
The unknown sequence represents the partial sequence of a protein-encoding gene that has been lodged in the genetic databases.

Assignment instructions and questions (Follow these steps in order)
In order to identify the unknown sequence provided, you will need to do a BLAST search.

Step 1. Go to the BLAST page on the NCBI web site. There are five search options available under ‘Basic BLAST’: nucleotide blast, protein blast, blastx, tblastn, tblastx.

Step 2. Choose which type of BLAST program you need to run (only consider the first two options). That is, what type of sequence are you submitting: is it nucleotide or protein? (Look at the unknown sequence, shown in Step 3, before you make your choice.)

Step 3. Copy and paste this sequence (the unknown sequence)

TTCAGCTAGGAAAAGAGTCAACCCAAGGATTGGGTTGCGGTGCAAACCCA
GAATCAGGGCGTCGAGCCGCGGAAGAAAGTAAAGAGGAAATTGCCAGAT
ATATTGCAGATGCTAATATGGTATTTATAACTGCCGGTATGGGTGGTGGA
ACAGGAACAGGGGCCGCGCCTGTAGTAGCCGAGGTTTGCATGGAAAAGG
ACATTCTAACGGTGGCAGTGGTCACTAAACCATTTAGCTTTGAGGGGAAG
CATCGCGCTCGCCTAGCAAACGAAGGAATAAGGTCTCTCGAAGATCGTGT
TGACACGCTAATAATAATTCCAAATCAAAATATATTCAAGCTCATTAACG
CGTCGACGTCAATGGCCGATGCGTTCGGCCTGGCTGACGACATTTTGTTGG
CCGGCGTGAAGAGCATCACGGACCTGATGGTTCGGCCGGGACTGATCAAC
TTGGACTTCGCAGATGTCCGCACGGTGATGAGCGGGATGGGCCACGCCA

into the box where it says “Enter Query Sequence” (assume that the sequence is FASTA, which just refers to a format that the program can recognise). You can give the job a title if you wish, but it is not necessary.

Step 4. You will need to “Choose Search Set”.
In making this decision, consider the source of your unknown sequence; that is, is it from human, mouse, or some other source? Leave the “Program Selection” as ‘Highly similar sequences’.

Step 5. Then, click ‘BLAST’. It may take a few minutes for your query to be answered. Consider what is happening to your sequence in these few minutes: it is being compared to every known sequence in the databases, and there are gazillions of them.

Question 1 (4 marks)
Which type of BLAST search did you choose, and why?

Step 6. Once your BLAST search has returned a result, you will see a Graphic Summary of the gene, Descriptions, and Alignments. Hold your mouse over the lower, thinner red line in the Graphic Summary and notice what it tells you above the box. If there is more than one sequence that produced a significant alignment from the BLAST search, choose the sequence that you consider has the greatest amount of similarity with your unknown sequence. Notice that the colour of the line in the Graphic Summary represents the alignment score (red being the best match, pink the next best, etc.) and that the length and position of the line represent where the query sequence (your unknown sequence) matches the retrieved sequence.

In the Descriptions section, click on the accession number (the number under ‘Accession’, on the far right of the screen) of the sequence that you think matches (aligns) best with your unknown sequence. (The accession number is a unique identifier given to a sequence when it is submitted to a sequence database.) This will take you to the GenBank entry for the gene. There are many things listed here and some of them will enable you to answer the following questions:

Question 2 (4 marks)
What is the name of the gene?
Hint.
The gene name is an acronym of 8 letters that refers to the organism the gene is from, what sort (family) of gene it is, and its particular type. Gene names are usually acronyms or abbreviations that are just a few letters in length; they may include numbers. The nomenclature for naming genes can differ among species. The entire descriptive name of the gene is written at the top of the page; this is not what we want you to write down as the answer.

Question 3 (5 marks)
3A. (3 marks) What organism has the unknown sequence come from? Provide the binomial species name (e.g., humans are Homo sapiens) in the correct scientific format. Note that GenBank does not display the correct scientific format.
3B. (2 marks) What subgroup does the organism belong to? This subgroup is a higher order taxon (the next level up from Class) described in some detail in Chapter 28 of Campbell Biology.
Hint. In the GenBank entry, there is a line that refers to the Organism that the gene comes from. This gives the species name and the various taxonomic groups, or taxa (see Campbell Fig. 1.14), that the organism belongs to, from biggest to smallest, starting with either Eukaryote or Prokaryote and ending with the genus. See the higher order name immediately following Eukaryota/Prokaryota (you will also find this taxon name in the paper that you download); this higher order name is shown in Figure 28.2 of Campbell 10e (Fig. 28.3 of Campbell 9e), and the text refers to it as a ‘subgroup’ of a clade, with a clade being a group (big or small) of related organisms.

Question 4 (8 marks)
4A. (4 marks) Write the full reference for the publication (i.e., the paper/manuscript) that first described the mRNA (transcript) and protein that the unknown sequence is part of. Use the exact standard referencing format that you would see in a reference list from a paper in the Journal of Cell Biology (see Referencing note at end of assignment). Marks will be deducted for incorrect punctuation/formatting, even for a misplaced initial or comma. Journals are very fussy about this, and so should you be.
4B. (4 marks) Download the PDF file of this publication (‘paper’, as we call it) through the library website and submit this paper with your assignment to the Dropbox. You may submit it as a separate file from the rest of your assignment. Note that the downloaded PDF will include a covering page from the journal; you may omit this page if you wish. Do not use Google to find the paper: this is an exercise in learning how to correctly source journal articles using the Deakin library, and in following instructions. See also Supplementary Information at the end of this assignment sheet.

Now, let’s consider the sequence of the transcript (cDNA) of the gene and its translation product, that is, the protein. If you haven’t already done so, read the Supplementary Information B (below) on the Klp1 gene and how cDNA sequences are derived from mRNA. The same is true for the gene you are now looking at.

Step 7. Click the blue-highlighted “CDS” (coding segment/sequence), which will shade, in brown, the sequence of nucleotides that codes for the protein.

Question 5 (4 marks)
How many nucleotides are there in the coding sequence for this protein?
Hint: use the numbers in the left-hand column to help you count, or the numbers next to CDS.

Question 6 (4 marks)
What is the STOP codon for the gene? Remember, there are 3 alternative stop codons; give the nucleotide sequence (in cDNA format) for the stop codon used by this gene.

Question 7 (6 marks)
What do we call the two sets of sequences that are downstream (3 prime) of the coding sequence in this transcript of the gene in question? Remember, you are looking at a cDNA here, which is an mRNA (transcript) in DNA format.
Hint: see textbook.

Question 8 (4 marks)
State one function of one of the sets of sequences from Q 7.

Step 8. To retrieve the GenBank entry for the protein that is encoded by this gene, click the highlighted accession number that is next to ‘/protein_id’ (under Features, and CDS). This entry looks similar to the nucleotide one, but gives us a bit more information. Use this entry, the original paper you downloaded in Q 4, and any other sources you can think of, to answer Q 9.
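
As a small aid to the Question 5 hint about "the numbers next to CDS": GenBank gives the CDS as a pair of positions, and both endpoints are counted. The Python sketch below shows that inclusive arithmetic for a simple `start..end` location. The function name `cds_length` and the coordinates are hypothetical and are not those of the assignment gene; locations that use `join(...)` or `complement(...)` are deliberately not handled.

```python
import re


def cds_length(location: str) -> int:
    """Return the number of nucleotides spanned by a simple 'start..end' location.

    Both endpoints are included, so the length is end - start + 1.
    """
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = (int(group) for group in match.groups())
    if end < start:
        raise ValueError("end position must not precede start position")
    return end - start + 1


# Made-up coordinates for illustration only (not the assignment gene).
print(cds_length("101..400"))  # prints 300
```

Read the real coordinates from the CDS line of your own GenBank entry and substitute them for the made-up ones.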
Question 9 (3 marks)
How many amino acids are there in the entire protein that the unknown sequence encodes part of?

Given that 3 nucleotides code for one amino acid, you might expect that the answer to Q 9 is exactly 1/3 of the answer to Q 5, but it is not.

Question 10 (3 marks)
State why the number of nucleotides that encode the protein is not exactly 3 times the number of amino acids that are translated.

Question 11 (25 marks)
(19 marks) In fewer than 100 of your own words, write a description of the protein that the gene in Q 2 encodes, that is, the protein that is partially encoded by the unknown sequence. Include in your description: the organelle the protein is targeted to (that is, where it is found in the cell); what the proposed function of this protein is; what other type of protein/s this protein is similar to; what types of cells (not organelles) use versions of this protein to divide themselves; and any other information you would like to provide.
(6 marks) Include at least three in-text citations of pertinent references in these 100 words. The references must be journal articles, i.e., ‘papers’, not websites etc. These references might come from your own searching on the web (Hint: use PubMed), or from the paper you have retrieved in Q 4. One of these references may be the one that you gave for Q 4A. See Referencing note, below. The 100 words do not include the in-text citations or the reference list.

Referencing note. You are required to cite references in the proper fashion. See the Referencing section at the end of prac 3 in your SLE111 practical manual for general information on how to use the proper referencing system. However, the format that is shown in the prac manual is not the format used by all journals. For this assignment, you are to use the format for an Article from the Journal of Cell Biology (the ‘JCB’).
So, go to the library website, log in, then download an article from the JCB and see how the JCB formats references. Supplementary Information A (below) will help you retrieve a JCB paper through the Deakin library. Note that the format for referencing in the JCB contains more detail (e.g., the name of the article) than does the journal containing the paper that describes the gene that is the subject of this assignment. Unlike in recent issues of the JCB, however, you are not required here to give the URL for the papers you cite; the last entry in your reference should be the page numbers.

Plagiarism note. Remember that all work that you submit must be your own. Your assignment will be run through Turnitin on submission.

SLE111 Bioinformatics Assignment: supplementary information.

A. Instructions for downloading the journal article (paper) for the Bioinformatics Assignment:
Once you have identified the name of the journal that the paper describing the gene/protein is in:
1. Go to the Deakin library web site and log in.
2. Click that you are a Deakin College Student.
3. Click on the grey Search box. Then go to the A_Z journals section by clicking on the black Options button.
4. Go to Search electronic journals, making sure that the drop-down menu states “Journal title begins with” or “Title Equals”.
5. Search for the journal. Hint: use one word (no numbers) for this search.

Use the same search method to retrieve a PDF of any paper from a 2013 issue of the Journal of Cell Biology. Once you have done this, note the format for references used by the JCB and use this exact format for all your citations/references in this assignment. That is, look at the list of references at the end of any article from 2013, and use the same format, noting where initials for authors are, where punctuation marks are, whether the journal name is in italics, etc.
You do not, however, need to provide a web link to any paper you are citing (you will note that some refs in the JCB have web links).

B. The example sequence (the Klp1 sequence) that was used in the Intro to Bioinformatics lecture had a GenBank entry described as “C. reinhardtii mRNA for kinesin-like protein”. Why does the GenBank entry show DNA sequence (it contains only A, C, G, T) and not RNA sequence (there is no U)? The reason is that mRNA from Chlamydomonas reinhardtii was isolated for the sequencing of this gene, but it was first ‘reverse transcribed’ into complementary DNA (cDNA). The sequence of cDNA is shown. This will also be relevant when you do the assignment. To know more about reverse transcription and why biologists do it, see Campbell 10e concept 20.2. You will learn more about cDNA libraries and reverse transcription in the lectures on biotechnology.

C. Submission format. Submit your typed answers to the questions, along with the required scientific article (paper), all in PDF format, to the Assignment Dropbox on Moodle. Your submission will be checked by the plagiarism detection program, Turnitin. If you wish to run your submission through Turnitin yourself ahead of submission, note that it may take 36 hours for a report to be generated.
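
Illustration for Supplementary Information B. The point that an ‘mRNA’ entry in GenBank is written with T rather than U can be shown with a short Python sketch. The sequence is invented for illustration (it is not from the Klp1 entry or the assignment gene), and the function names are hypothetical. The sketch assumes the simplified textbook picture: reverse transcription produces a first DNA strand complementary to the mRNA, and the strand complementary to that one reads like the mRNA with T in place of U.

```python
# Base-pairing rules used when copying RNA into DNA, and DNA into DNA.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Reverse complement of an mRNA, written 5'->3' in DNA letters."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


def second_strand(cdna: str) -> str:
    """Reverse complement of a DNA strand, written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(cdna.upper()))


mrna = "GCUAUGGAC"  # made-up RNA sequence, for illustration only
first = first_strand_cdna(mrna)
second = second_strand(first)

print(first)   # GTCCATAGC
print(second)  # GCTATGGAC
# The second strand carries the same information as the mRNA, with T for U.
print(second == mrna.replace("U", "T"))  # True
```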
SLE111 2015_T3. Bioinformatics Assignment Modified from © Deakin University, 2015, CELLS & GENES SLE111 1 SLE111, Bioinformatics Assignment Gene Identification, an Introduction to Bioinformatics and Databases Assessment This assignment is marked out of 70, and is worth 7% of your final mark for the unit. It is to be submitted to the Moodle Link in Week 11. No submissions will be accepted after that time. Read this entire document, including the supplementary information at the end, before commencing the assignment. Learning Outcomes (Aims): 1. To gain experience in using the National Centre for Biotechnology Information website, which gives access to numerous nucleic acid and protein databases, and much more. 2. To identify the gene that an unknown sequence of DNA (the gene fragment) is part of. 3. To appreciate the differences between nucleotide sequence (e.g., of a gene) and amino acid sequence (e.g., of a protein), and discern properties of such sequences. 4. To comprehend and synthesise information from scientific papers and cite these papers in the proper format. 5. To manage your time effectively. Introduction/Background: You are a member of a laboratory that has sequenced DNA from a gene fragment derived from a mixed sample of pond organisms. It is your job to find the meaning behind the unknown DNA sequence. Your supervisor has asked you to analyse the sequence and answer some related questions. A website you need to familiarize yourself with is at the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov)and, within that site, you will be usingNCBI BLAST: http://blast.ncbi.nlm.nih.gov/ BLAST stands for Basic Local Alignment Search Tool. It is a program that compares a query sequence of nucleotides or amino acids (depending on which type of BLAST that is used) against all the nucleotide and amino acid sequences that have been lodged in genetic databases worldwide. 
The ‘alignment’ refers to how the comparisons are made: that is, by aligning your query sequence against all sequences in the database and looking for the best match. A BLAST search is the standard way for biologists to identify sequences of genes or proteins, or their closest relatives. SLE111 2015_T3. Bioinformatics Assignment Modified from © Deakin University, 2015, CELLS & GENES SLE111 2 The unknown sequence The unknown sequence represents the partial sequence of a protein-encoding gene that has been lodged in the genetic databases. Assignment instructions and questions (Follow these steps in order) In order to identify the unknown sequence provided, you will need to do a BLAST search. Step 1. Go to the BLAST page on the NCBI web site. There are five search options available under ‘Basic BLAST’: nucleotide blast, protein blast, blastx, tblastn, tblastx. Step 2. Choose which type of BLAST program you need to run (only consider the first two options). That is, what type of sequence are you submitting – is it nucleotide or protein? (look at the unknown sequence, shown in Step 3, before you make your choice) Step 3. Copy and paste this sequence (the unknown sequence) TTCAGCTAGGAAAAGAGTCAACCCAAGGATTGGGTTGCGGTGCAAACCCA GAATCAGGGCGTCGAGCCGCGGAAGAAAGTAAAGAGGAAATTGCCAGAT ATATTGCAGATGCTAATATGGTATTTATAACTGCCGGTATGGGTGGTGGA ACAGGAACAGGGGCCGCGCCTGTAGTAGCCGAGGTTTGCATGGAAAAGG ACATTCTAACGGTGGCAGTGGTCACTAAACCATTTAGCTTTGAGGGGAAG CATCGCGCTCGCCTAGCAAACGAAGGAATAAGGTCTCTCGAAGATCGTGT TGACACGCTAATAATAATTCCAAATCAAAATATATTCAAGCTCATTAACG CGTCGACGTCAATGGCCGATGCGTTCGGCCTGGCTGACGACATTTTGTTGG CCGGCGTGAAGAGCATCACGGACCTGATGGTTCGGCCGGGACTGATCAAC TTGGACTTCGCAGATGTCCGCACGGTGATGAGCGGGATGGGCCACGCCA into the box where it says “Enter Query Sequence” (assume that the sequence is FASTA – which just refers to a format that the program can recognise).You can give the job a title if you wish, but it is not necessary. Step 4. You will need to “Choose Search Set”. 
In making this decision, consider the source of your unknown sequence; that is, is it from human, mouse, or some other source?Leave the “Program Selection” as ‘Highly similar sequences’. Step 5. Then, click ‘BLAST’. It may take a few minutes for your query to be answered. Consider what is happening to your sequence in these few minutes: it is being compared to every known sequence in the databases! – gazillions of them … Question 1 (4 marks) Which type of BLAST search did you choose, and why? Step 6. Once your BLAST search has returned a result, you will see will see a Graphic Summary of the SLE111 2015_T3. Bioinformatics Assignment Modified from © Deakin University, 2015, CELLS & GENES SLE111 3 gene, Descriptions, and Alignments. Hold your mouse over the lower, thinner red line in the Graphic Summary and notice what it tells you above the box. If there is more than one sequence that produced a significant alignment from the Blast search, choose the sequence that you consider has the greatest amount of similarity with your unknown sequence. Notice that the colour of the line in the Graphic Summary represents the alignment score (with red being the highest match, pink is next best, etc) and the length and position of the line represents where the query sequence (your unknown sequence) matches the retrieved sequence. In the Descriptions section, click on the accession number (the number under ‘Accession’, on the far right of the screen) of the sequence that you think matches (aligns) best with your unknown sequence. (The accession number is a unique identifier given to a sequence when it is submitted to a sequence database.) This will take you to the GenBank entry for the gene. There are many things listed here and some of them will enable you to answer the following questions: Question 2 (4 marks) What is the name of the gene? Hint. 
The gene name is an acronym of 8 letters that refers to the organism the gene is from, what sort (family) of gene it is, and its particular type. Gene names are usually acronyms or abbreviations that are just a few letters in length; they may include numbers. The nomenclature for naming genes can be different among different species. The entire descriptive name of the gene is written at the top of the page; this is not what we want you to write down as the answer. Question 3 (5 marks) 3A. (3 marks) What organism has the unknown sequence come from? Provide the binomial species name (e.g., humans are Homo sapiens) in the correct scientific format. Note that GenBank does not display the correct scientific format. 3B. (2 marks) What subgroup does the organism belong to? This subgroup is a higher order taxon (the next level up from Class) described in some detail in Chapter 28 of Campbell Biology. Hint. In the GenBank entry, there is a line that refers to the Organism that the gene comes from. This gives the species name and the various taxonomic groups, or taxa (see Campbell Fig. 1.14) that the organism belongs in, from biggest to smallest, starting with either Eukaryote or Prokaryote and ending with the genus. See the higher order name immediately following Eukaryota/Prokaryota (you will also find this taxon name in the paper that you download); this higher order name is shown in Figure 28.2 of Campbell 10e (Fig. 28.3 of Campbell 9e), and the text refers to it as a ‘subgroup’ of a clade, with a clade being a group (big or small) of related organisms. Question 4 (8 marks) 4A. (4 marks) Write the full reference for the publication (i.e., the paper/manuscript) that first described the mRNA (transcript) and protein that the unknown sequence is part of. Use the exact standard referencing format that you would see in a reference list from a paper in the Journal of Cell Biology (see Referencing note at end of assignment). Marks will be deducted for incorrect SLE111 2015_T3. 
Bioinformatics Assignment Modified from © Deakin University, 2015, CELLS & GENES SLE111 4 punctuation/formatting – even for a misplaced initial or comma. Journals are very fussy about this, and so should you be … 4B. (4 marks) Download the PDF file of this publication (‘paper’, as we call it) through the library website and submit this paper with your assignment to the Drop
box. You may submit as a separate file to the rest of your assignment. Note that the downloaded PDF will include a covering page from the journal; you may omit this last page if you wish. Do not use Google to find the paper: this is an exercise in learning how to correctly source journal articles using the Deakin library, and in following instructions. See also Supplementary Information at the end of this assignment sheet. Now, let’s consider the sequence of the transcript (cDNA) of the gene and its translation product, that is, the protein. If you haven’t already done so, read the Supplementary Information B (below) on the Klp1 gene and how cDNA sequences are derived from mRNA. The same is true for the gene you are now looking at. Step 7. Click the blue-highlighted “CDS” (coding segment/sequence), which will shade, in brown, the sequence of nucleotides that codes for the protein. Question 5 (4 marks) How many nucleotides are there in the coding sequence for this protein? Hint: use the numbers in the left-hand column to help you count, or the numbers next to CDS. Question 6 (4 marks) What is the STOP codon for the gene? Remember, there are 3 alternative stop codons; give the nucleotide sequence (in cDNA format) for the stop codon used by this gene. Question 7 (6 marks) What do we call the two sets of sequences that are downstream (3 prime) of the coding sequence in this transcript of the gene in question? Remember, you are looking at a cDNA here, which is an mRNA (transcript) in DNA format. Hint: see text book. Question 8 (4 marks) State one function of one of the sets of sequences from Q 7. Step 8. To retrieve the GenBank entry for the protein that is encoded by this gene, click the highlighted accession number that is next to ‘/protein_id’ (under Features, and CDS). This entry looks similar to the nucleotide one, but gives us a bit more information. Use this entry, the original paper you downloaded in Q4, and any other sources you can think of, to answer Q 9. 
Question 9 (3 marks) How many amino acids are there in the entire protein that the unknown sequence encodes part of? SLE111 2015_T3. Bioinformatics Assignment Modified from © Deakin University, 2015, CELLS & GENES SLE111 5 Given that 3 nucleotides code for one amino acid, you might expect that the answer to Q 9 is exactly 1/3 of the answer to Q 5, but it is not. Question 10 (3 marks) State why the number of nucleotides that encode the protein is not exactly 3 times the number of amino acids that are translated. Question 11 (25 marks) (19 marks) In less than 100 of your own words, write a description of the protein that the gene in Q 2 encodes, that is, the protein that is partially encoded by the unknown sequence. Include in your description: the organelle the protein is targeted to (that is, where it is found in the cell); what the proposed function of this protein is; what other type of protein/s this protein is similar to; what types of cells (not organelles) use versions of this protein to divide themselves, and any other information you would like to provide. (6 marks) Include at least three in-text citations of pertinent references in these 100 words. The references must be journal articles; i.e, ‘papers’ – not websites etc. These references might come from your own searching on the web (Hint: use Pubmed), or from the paper you have retrieved in Q 4. One of these references may be the one that you gave for Q 4A. See Referencing Note, below. The 100 words do not include the in-text citations or the reference list. Referencing note. You are required to cite references in the proper fashion. See the Referencing section at the end of prac 3 in your SLE111 practical manual for general information on how to use the proper referencing system. But, the format that is shown in the prac manual is not the format used by all journals. For this assignment, you are to use the format for an Article from the Journal of Cell Biology (the ‘JCB’). 
So… go the library website, log in, then download an article from the JCB and see how the JCB formats references. Supplementary Information A (below) will help you retrieve a JCB paper through the Deakin library. Note that the format for referencing in the JCB contains more detail (e.g., the name of the article) than does the journal containing the paper that describes the gene that is the subject of this assignment. Unlike in recent issues of the JCB, however, you are not required here to give the URL for the papers you cite; the last entry in your reference should be the page numbers. Plagiarism note. Remember that all work that you submit must be your own. Your assignment will be run through Turnitin on submission. SLE111 2015_T3. Bioinformatics Assignment Modified from © Deakin University, 2015, CELLS & GENES SLE111 6 SLE111 Bioinformatics Assignment: supplementary information. A. Instructions for downloading the journal article (paper) for the Bioinformatics Assignment: Once you have identified the name of the journal that the paper describing the gene/protein is in, 1. Go to the Deakin library web site and log in. 2. Click that you are a Deakin College Student 3. Click on grey Search box. Then go to the A_Z journals section by clicking on the black Options button. 4. Go to Search electronic journals making sure that the drop-down menu states “Journal title begins with” or “Title Equals” 5. Search for the journal. Hint: use one word (no numbers) for this search. Use the same search method to retrieve a PDF of any paper from a 2013 issue of the Journal of Cell Biology. Once you have done this, note the format for references used by the JCB and use this exact format for all your citations/references in this assignment. That is, look at the list of references at the end of any article from 2013, and use the same format – noting where initials for authors are, where punctuation marks are, if the journal name is in italics, etc etc. 
You do not, however, need to provide a web link to any paper you are citing (you will note that some refs in the JCB have web links). B. The example sequence (the Klp1 sequence) that was used in the Intro to Bioinformatics lecture had a GenBank entry described as “C. reinhardtii mRNA for kinesin-like protein”. Why does the GenBank entry show DNA sequence (it contains only A, C, G, T) and not RNA sequence (there is no U)? The reason is that mRNA from Chlamydomonas reinhardtii was isolated for the sequencing of this gene, but it was first ‘reverse transcribed’ into complementary DNA (cDNA). The sequence of cDNA is shown. This will also be relevant when you do the assignment. To know more about reverse transcription and why biologists do it, see Campbell 10e concept 20.2. You will learn more about cDNA libraries and reverse transcription in the lectures on biotechnology. C. Submission format. Submit your typed answers to the questions, along with the required scientific article (paper), all in PDF format, to the Assignment Dropbox on Moodle. Your submission will be checked by the plagiarism detection program, Turnitin. If you wish to run your submission through Turnitin yourself ahead of submission, note that it may take 36 hours for a report to be generated.