Data Types, Index, and Slice

Data types


In Python, the data type is set when we assign a value to a variable. Different data types can do different things.

The most common data types are

  • Strings (str) for text (surrounded by either single quotation marks or double quotation marks)

  • Integers (int) for whole numbers, positive or negative, without decimals, of unlimited length

  • Floating point numbers (float) for numbers, positive or negative, containing one or more decimals

  • Booleans (bool), with only two possible values: True and False (capitalized in Python)

  • Lists (list) for multiple ordered and changeable items of different data types within one variable (created using square brackets)

  • Tuples (tuple) for multiple ordered and unchangeable items of different data types within one variable (created using round brackets)

  • Dictionaries (dict) for storing a collection of ordered, changeable “key : value” pairs in which each key is unique (created using curly brackets; pairs are separated using commas, and keys and values are separated using colons)
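The difference between changeable and unchangeable data types can be tried out in a short sketch. The variable names and values below are hypothetical and chosen only for illustration.

```python
# Hypothetical example values, chosen only for illustration
bases = ["A", "C", "G"]               # list: created using square brackets
bases.append("T")                     # lists are changeable: add an item
bases[0] = "a"                        # ... or replace an item

stop_codons = ("TAA", "TAG", "TGA")   # tuple: created using round brackets
try:
    stop_codons[0] = "AAA"            # tuples are unchangeable, so this raises a TypeError
except TypeError as error:
    tuple_error = str(error)          # keep the error message instead of stopping the script

codes = {"Gly": "G", "Ala": "A"}      # dictionary: created using curly brackets
codes["Trp"] = "W"                    # a new key adds a new pair
codes["Gly"] = "g"                    # an existing key gets a new value; no duplicate key is created

print(bases, stop_codons, codes)
print("Changing the tuple failed with:", tuple_error)
```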

Use type(variable_name) to identify the data type of any variable.
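As a minimal sketch, type() can be applied to one variable of each basic data type. The variable names and values are hypothetical.

```python
# Hypothetical example values, chosen only for illustration
enzyme = "EcoRI"      # string
n_sites = 3           # integer
ph = 7.4              # floating point number
is_kinase = True      # Boolean

examples = [enzyme, n_sites, ph, is_kinase]   # a list can hold items of different data types
for value in examples:
    print(value, "has data type", type(value))

print("The list itself has data type", type(examples))
```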

Use len(sequence_name) to determine the length of a sequence (e.g. a string, list, or tuple). Use sequence_name.count(value) to count the number of items with a specified value within a sequence.
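The short sketch below applies len() and count() to a string and to a list. The sequence is a made-up example. For strings, count() also accepts a substring longer than one character and counts its non-overlapping occurrences.

```python
# Made-up example sequence, chosen only for illustration
dna = "ATGTTCTTCTAA"                      # string
codons = ["ATG", "TTC", "TTC", "TAA"]     # list with the same sequence split into codons

len_dna = len(dna)                        # number of characters in the string -> 12
len_codons = len(codons)                  # number of items in the list -> 4
T_count = dna.count("T")                  # number of times T appears in the string -> 6
TTC_count = codons.count("TTC")           # number of items equal to "TTC" in the list -> 2

print(len_dna, len_codons, T_count, TTC_count)
```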

Use sorted(list_name, key=myFunc, reverse=True) to sort a list; the key and reverse arguments are optional. By default, the list is sorted in ascending order; reverse=True gives descending order, and key sets a user-defined order. sorted returns a new list, so the original list variable is unchanged. Of note, strings are compared character by character, so they are sorted in alphabetical order (A-Z). However, every uppercase letter sorts before every lowercase letter, so words that start with an uppercase letter come before words that start with a lowercase letter. For case-insensitive sorting, set the key argument to key=str.lower, which compares the strings as if they were all lowercase.
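The effect of the key and reverse arguments is easiest to see on a list with mixed capitalization. The list below is a hypothetical example.

```python
# Hypothetical list with mixed capitalization, chosen only for illustration
names = ["trp", "Ala", "gly", "Val"]

default_order = sorted(names)                             # uppercase first: ['Ala', 'Val', 'gly', 'trp']
ignore_case = sorted(names, key=str.lower)                # case-insensitive: ['Ala', 'gly', 'trp', 'Val']
descending = sorted(names, key=str.lower, reverse=True)   # ['Val', 'trp', 'gly', 'Ala']

print(default_order)
print(ignore_case)
print(descending)
print(names)   # the original list is unchanged
```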

Use dictionary_name.get(key_name) to return the value of a specified key in a dictionary. If the key is not found, it returns None.
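A minimal sketch of get() on a small dictionary with two codons follows. The variable names and the key "XYZ" are hypothetical. get() also accepts an optional second argument that is returned instead of None when the key is not found.

```python
# Small dictionary with two codons as keys and one-letter amino acid codes as values
small_code = {"ATG": "M", "TGG": "W"}

found = small_code.get("ATG")                 # key exists -> 'M'
missing = small_code.get("XYZ")               # key does not exist -> None
fallback = small_code.get("XYZ", "unknown")   # optional second argument replaces None

print(found, missing, fallback)
```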

Index and slice for sequence-based data types


Sequence-based data types (e.g. a string, list, or tuple) are indexed, with the first item having index 0. To select the first item, use sequence_name[0]; to select the second item, use sequence_name[1].

If we have a long sequence and want to select an item towards the end, we can count backwards, starting at the index number -1.
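Counting from the start and counting from the end can be compared on a short string. The peptide below is a hypothetical example.

```python
# Hypothetical short peptide in one-letter code, chosen only for illustration
peptide = "MASGSC"

first = peptide[0]          # first item -> 'M'
second = peptide[1]         # second item -> 'A'
last = peptide[-1]          # last item -> 'C'
second_last = peptide[-2]   # second to last item -> 'S'

print(first, second, last, second_last)
print(peptide[-1] == peptide[len(peptide) - 1])   # both select the last item
```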

The syntax for selecting a subset of an existing sequence, a slice, is: sequence_name[start:end]. When we specify the end item for the slice, it goes up to but does not include that item of the sequence!

Use sequence_name[:end] to have the slice starting from the beginning of the sequence. Use sequence_name[start:] to have the slice going to the end of the sequence.
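The three slice forms can be sketched on a short string. The peptide is a hypothetical example. Because the end item is excluded, the length of sequence_name[start:end] equals end minus start.

```python
# Hypothetical short peptide in one-letter code, chosen only for illustration
peptide = "MASGSCQGCE"

res3to5 = peptide[2:5]   # residues 3 to 5 (index 2 up to but not including index 5) -> 'SGS'
first3 = peptide[:3]     # from the beginning up to but not including index 3 -> 'MAS'
from8 = peptide[7:]      # from index 7 (residue 8) to the end -> 'GCE'

print(res3to5, first3, from8)
print(len(res3to5))      # 5 - 2 = 3 items
```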

The syntax for selecting a slice with regular steps is: sequence_name[start:end:step]. A negative step goes backwards. For example, use sequence_name[::2] to select every other element from the sequence, and use sequence_name[::-1] to select the elements in reverse order.
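A short sketch of slicing with steps follows, using a hypothetical peptide string.

```python
# Hypothetical short peptide in one-letter code, chosen only for illustration
peptide = "MASGSCQGCE"

every_other = peptide[::2]    # indices 0, 2, 4, 6, 8 -> 'MSSQC'
every_third = peptide[1:8:3]  # indices 1, 4, 7 -> 'ASG'
reverse = peptide[::-1]       # all items in reverse order -> 'ECGQCSGSAM'

print(every_other, every_third, reverse)
```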

Figure: Indexing and slicing sequence-based data types

Examples


Please pay attention to the use of comments (with #) to express the units of variables or to describe the meaning of commands.

Example

Determine the data type, length, and number of tryptophan residues for this LRRK2 protein sequence, written in one-letter amino acid code. Select residue 2019. The G2019S mutation in LRRK2 is the most common genetic determinant of Parkinson’s disease identified to date. It lies in the protein’s kinase domain. Select this kinase domain, which spans residue 1879 to residue 2138.

protseqLRRK2 = "MASGSCQGCEEDEETLKKLIVRLNNVQEGKQIETLVQILEDLLVFTYSERASKLFQGKNIHVPLLIVLDSYMRVASVQQVGWSLLCKLIEVCPGTMQSLMGPQDVGNDWEVLGVHQLILKMLTVHNASVNLSVIGLKTLDLLLTSGKITLLILDEESDIFMLIFDAMHSFPANDEVQKLGCKALHVLFERVSEEQLTEFVENKDYMILLSALTNFKDEEEIVLHVLHCLHSLAIPCNNVEVLMSGNVRCYNIVVEAMKAFPMSERIQEVSCCLLHRLTLGNFFNILVLNEVHEFVVKAVQQYPENAALQISALSCLALLTETIFLNQDLEEKNENQENDDEGEEDKLFWLEACYKALTWHRKNKHVQEAACWALNNLLMYQNSLHEKIGDEDGHFPAHREVMLSMLMHSSSKEVFQASANALSTLLEQNVNFRKILLSKGIHLNVLELMQKHIHSPEVAESGCKMLNHLFEGSNTSLDIMAAVVPKILTVMKRHETSLPVQLEALRAILHFIVPGMPEESREDTEFHHKLNMVKKQCFKNDIHKLVLAALNRFIGNPGIQKCGLKVISSIVHFPDALEMLSLEGAMDSVLHTLQMYPDDQEIQCLGLSLIGYLITKKNVFIGTGHLLAKILVSSLYRFKDVAEIQTKGFQTILAILKLSASFSKLLVHHSFDLVIFHQMSSNIMEQKDQQFLNLCCKCFAKVAMDDYLKNVMLERACDQNNSIMVECLLLLGADANQAKEGSSLICQVCEKESSPKLVELLLNSGSREQDVRKALTISIGKGDSQIISLLLRRLALDVANNSICLGGFCIGKVEPSWLGPLFPDKTSNLRKQTNIASTLARMVIRYQMKSAVEEGTASGSDGNFSEDVLSKFDEWTFIPDSSMDSVFAQSDDLDSEGSEGSFLVKKKSNSISVGEFYRDAVLQRCSPNLQRHSNSLGPIFDHEDLLKRKRKILSSDDSLRSSKLQSHMRHSDSISSLASEREYITSLDLSANELRDIDALSQKCCISVHLEHLEKLELHQNALTSFPQQLCETLKSLTHLDLHSNKFTSFPSYLLKMSCIANLDVSRNDIGPSVVLDPTVKCPTLKQFNLSYNQLSFVPENLTDVVEKLEQLILEGNKISGICSPLRLKELKILNLSKNHISSLSENFLEACPKVESFSARMNFLAAMPFLPPSMTILKLSQNKFSCIPEAILNLPHLRSLDMSSNDIQYLPGPAHWKSLNLRELLFSHNQISILDLSEKAYLWSRVEKLHLSHNKLKEIPPEIGCLENLTSLDVSYNLELRSFPNEMGKLSKIWDLPLDELHLNFDFKHIGCKAKDIIRFLQQRLKKAVPYNRMKLMIVGNTGSGKTTLLQQLMKTKKSDLGMQSATVGIDVKDWPIQIRDKRKRDLVLNVWDFAGREEFYSTHPHFMTQRALYLAVYDLSKGQAEVDAMKPWLFNIKARASSSPVILVGTHLDVSDEKQRKACMSKITKELLNKRGFPAIRDYHFVNATEESDALAKLRKTIINESLNFKIRDQLVVGQLIPDCYVELEKIILSERKNVPIEFPVIDRKRLLQLVRENQLQLDENELPHAVHFLNESGVLLHFQDPALQLSDLYFVEPKWLCKIMAQILTVKVEGCPKHPKGIISRRDVEKFLSKKRKFPKNYMSQYFKLLEKFQIALPIGEEYLLVPSSLSDHRPVIELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLVGSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGEEHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSFGSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQQDKASLTRTLQHRIALHVADGLRY
LHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGIAQYCCRMGIKTSEGTPGFRAPEVARGNVIYNQQADVYSFGLLLYDILTTGGRIVEGLKFPNEFDELEIQGKLPDPVKEYGCAPWPMVEKLIKQCLKENPQERPTSAQVFDILNSAELVCLTRRILLPKNVIVECMVATHHNSRNASIWLGCGHTDRGQLSFLDLNTEGYTSEEVADSRILCLALVHLPVEKESWIVSGTQSGTLLVINTEDGKKRHTLEKMTDSVTCLYCNSFSKQSKQKNFLLVGTADGKLAIFEDKTVKLKGAAPLKILNIGNVSTPLMCLSESTNSTERNVMWGGCGTKIFSFSNDFTIQKLIETRTSQLFSYAAFSDSNIITVVVDTALYIAKQNSPVVEVWDKKTEKLCGLIDCVHFLREVMVKENKESKHKMSYSGRVKTLCLQKNTALWIGTGGGHILLLDLSTRRLIRVIYNFCNSVRVMMTAQLGSLKNVMLVLGYNRKNTEGTQKQKEIQSCLTVWDINLPHEVQNLEKHIEVRKELAEKMRRTSVE"   #create a string using double quotation marks
type_protseqLRRK2 = type(protseqLRRK2)   #determine the data type
len_protseqLRRK2 = len(protseqLRRK2)   #determine the length of the string
Wcount_protseqLRRK2 = protseqLRRK2.count("W")   #count the number of times W appears in the string
res2019_protseqLRRK2 = protseqLRRK2[2018]   #select residue 2019 (2018 as the first item has index 0)
kinase_protseqLRRK2 = protseqLRRK2[1878:2138]   #select residues 1879 to 2138: the start index is 1878 as the first item has index 0; the end index 2138 is not included, so the slice stops at index 2137, which is residue 2138

print("This variable has data type", type_protseqLRRK2, ".", 
      "The length of the sequence is", len_protseqLRRK2, "residues .",
      "Trp appears", Wcount_protseqLRRK2, "times .",
      "Amino acid 2019 is a", res2019_protseqLRRK2, ".",
      "The sequence of the LRRK2 kinase domain is", kinase_protseqLRRK2, ".")   #Print and place all values that we calculated in a readable sentence. The print function can take more than one object. By default, it separates objects with a space.
This variable has data type <class 'str'> . The length of the sequence is 2527 residues . Trp appears 26 times . Amino acid 2019 is a G . The sequence of the LRRK2 kinase domain is QAPEFLLGDGSFGSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQQDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGIAQYCCRMGIKTSEGTPGFRAPEVARGNVIYNQQADVYSFGLLLYDILTTGGRIVEGLKFPNEFDELEIQGKLPDPVKEYGCAPWPMVEKLIKQCLKENPQERPTSAQVFDILNSAELV .

Example

Determine the number of “key : value” pairs in the following dictionary with the genetic code, with key = codon and value = amino acid (one-letter code; _ marks a stop codon). Retrieve the genetic code translation for TTC.

GeneticCode = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 
len_GeneticCode = len(GeneticCode)   #determine the length of the dictionary

GeneticCode_TTC = GeneticCode.get('TTC')   #look up the amino acid stored for codon TTC

print("This dictionary contains", len_GeneticCode, "pairs .",
      "The genetic code translation stored for TTC is", GeneticCode_TTC, ".")   #Print and place all values that we calculated in a readable sentence. The print function can take more than one object. By default, it separates objects with a space.
This dictionary contains 64 pairs . The genetic code translation stored for TTC is F .

Example

Determine the data type and length for this list with amino acids. Sort the list in alphabetical order.

AA3Letter = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE", "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]
type_AA3Letter = type(AA3Letter)   #determine the data type

len_AA3Letter = len(AA3Letter)   #determine the length of the list
AA3Lettersorted = sorted(AA3Letter, key=str.lower)   #sort the list in alphabetical order, ignoring case

print("This variable has data type", type_AA3Letter, ".", 
      "This list contains", len_AA3Letter, "amino acids .",
      "After sorting the amino acids in alphabetical order, this list becomes", AA3Lettersorted, ".")   #Print and place all values that we calculated in a readable sentence. The print function can take more than one object. By default, it separates objects with a space.
This variable has data type <class 'list'> . This list contains 20 amino acids . After sorting the amino acids in alphabetical order, this list becomes ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL'] .

Exercises


Exercise

Define the EcoRI DNA recognition sequence (GAATTC) as a string. Determine the length and the number of adenine bases for this sequence.

Exercise

Calculate the melting temperature of the primer with sequence “GACTGCGTTAGGATTGGC”. The melting temperature (°C) can be computed as 64.9 + 41 × (GC - 16.4) / N, with GC the total number of G and C bases in the primer and N the primer length.

Exercise

Determine the data type and length for this list with substrate concentrations. Sort the list in ascending order.

subconc = [1, 4, 500, 0, 15, 250, 30, 125, 2, 8, 60]

Exercise

Determine the number of “key : value” pairs in the following dictionary of restriction enzymes, with key = name and value = cleavage site. Retrieve the cleavage site for NotI.

REs = {
    'EcoRI' : 'GAATTC',
    'BamHI' : 'GGATCC',
    'EarI' : 'CTCTTC',
    'ScaI' : 'AGTACT',
    'NotI' : 'GCGGCCGC',
    'TaqI' : 'TCGA',
    'FokI' : 'GGATG',
    'HindIII' : 'AAGCTT'
    }   #Create a dictionary with restriction enzyme names (keys) and cleavage sites (values). Both are strings.