python - Multiple sequence comparison of arbitrary string with oriented characters -


the core problem is:

i looking algorithm calculate maximum parsimonious distance between set of strings. distance mean similar damerau–levenshtein distance i.e. minimal number of deletions, insertions, substitution , transposition of characters or adjacent characters blocks. instead of regular strings want investigate strings oriented characters.

thus, string like:

  1. (a,1) (b,1) (c,1) (d,1)

and possible derivatives be:

  1. (a,1) (c,0) (b,0) (d,1)
  2. (a,1) (c,1) (b,1) (d,1)
  3. (a,1) (b,0) (c,0) (d,1)

where a,b,c,d character identities , 1 = forward , 0 = reverse.

here, derivative 1. have distance 2, since cut out block bc , re paste inverted (1 cut, 1 paste). derivative 2. have 2 well, since may cut out c , re paste in front of b (1 cut, 1 paste) whereas number 3. require 4 operations (2 cuts, 2 pastes) transformed. analogous, deletion or insertion of blocks yield distance 1.

if define (x,0) , (x,1) 2 different non oriented characters (x0, x1) possible x, example 3. result in distance of 2 since cut out block b1c1 , insert block b0c0 in 2 steps.

a real world example:

the genes in bacterial genome considered oriented character (a,0), (b,0)... determining sequence distance, genomic orientation of homologue genes in 2 related bacteria used evolutionary marker trace. fact bacterial genomes circular strings introduces additional border condition abc equal bca.

real genomes have unique genes no equivalent in partner giving rise place holder character @. place holders reduce information content of comparison lower bound, since e.g. (a,1)(b,1)@(c,1) can transformed (a,1)@@@(b,1)@(c,1) insertion of block @@@. orientation restores partially information content since may find (a,1)@@@(b,0)@(c,1) indicating minimum distance of 3. better algorithm compare multiple related sequences (genomes) simultaneously, since find intermediates in evolutionary history, increases resolution.

i realize, there several questions posted on text string comparison. fail expandable include orientation. in addition, there exists wealth of methods deal biological sequences, in particular multiple sequence analysis. however, limited macromolecule sequences not exist in alternate orientations , invoke specific weights particular character match.

if there exists python library allow necessary customization solve problem, fantastic. suitable orientation aware algorithm helpful.

i believe following code can out:

from itertools import permutations random import randint pprint import pprint   def generate_genes():     """     generates boilerplate list of genes     @rtype : list     """     tuple_list = []      in range(16):          binary_var = bin(i)[2:]          if len(binary_var) != 4:             binary_var = "0" * (4 - len(binary_var)) + binary_var          tuple_list.append([('a', (1 if binary_var[0] == '1' else 0)),                            ('b', (1 if binary_var[1] == '1' else 0)),                            ('c', (1 if binary_var[2] == '1' else 0)),                            ('d', (1 if binary_var[3] == '1' else 0))])      return tuple_list   def all_possible_genes():     """ generates possible combinations of abcd genes     @return: returns list of combinations     @rtype: tuple     """     gene_list = generate_genes()     all_possible_permutations = []     gene in gene_list:         all_possible_permutations.append([var var in permutations(gene)])      return all_possible_permutations   def gene_stringify(gene_tuple):     """     @type gene_tuple : tuple     @param gene_tuple: gene tuple generated     """      return "".join(str(var[0]) var in gene_tuple if var[1])   def dameraulevenshtein(seq1, seq2):     """calculate damerau-levenshtein distance between sequences.      distance number of additions, deletions, substitutions,     , transpositions needed transform first sequence     second. although used strings, sequences of     comparable objects work.      transpositions exchanges of *consecutive* characters; other     operations self-explanatory.      implementation o(n*m) time , o(m) space, n , m     lengths of 2 sequences.      >>> dameraulevenshtein('ba', 'abc')     2     >>> dameraulevenshtein('fee', 'deed')     2      works arbitrary sequences too:     >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])     2     """     # codesnippet:d0de4716-b6e6-4161-9219-2903bf8f547f     # conceptually, based on len(seq1) + 1 * len(seq2) + 1 matrix.     # however, current , 2 previous rows needed @ once,     # store those.     oneago = none     thisrow = range(1, len(seq2) + 1) + [0]     x in xrange(len(seq1)):         # python lists wrap around negative indices, put         # leftmost column @ *end* of list. matches         # zero-indexed strings , saves calculation.         twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]         y in xrange(len(seq2)):             delcost = oneago[y] + 1             addcost = thisrow[y - 1] + 1             subcost = oneago[y - 1] + (seq1[x] != seq2[y])             thisrow[y] = min(delcost, addcost, subcost)             # block deals transpositions             if (x > 0 , y > 0 , seq1[x] == seq2[y - 1]                 , seq1[x - 1] == seq2[y] , seq1[x] != seq2[y]):                 thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)     return thisrow[len(seq2) - 1]   if __name__ == '__main__':     genes = all_possible_genes()     list1 = genes[randint(0, 15)][randint(0, 23)]     list2 = genes[randint(0, 15)][randint(0, 23)]      print gene_stringify(list1)     pprint(list1)     print gene_stringify(list2)     pprint(list2)     print dameraulevenshtein(gene_stringify(list1), gene_stringify(list2)) 

credits

michael homer algorithm


Comments

Popular posts from this blog

java - Run a .jar on Heroku -

java - Jtable duplicate Rows -

validation - How to pass paramaters like unix into windows batch file -