C/C++ Implementation of Levenshtein Distance Algorithm for Approximate String Matching

C++ The Levenshtein is a measure of how costly it is to adapt a string into another one. If you assign a cost to adding a single character, switching one character for another, and removing a character then you can compute the cost between any two given strings.

Changing a character can be seen as removing a char and adding another one so when adding has cost 1 and removing has cost of one a modification has cost of 2.

The difference between two strings can also be measured in terms of the Levenshtein distance: the distance measure if you think the cost as the “distance” between two strings.

Text comparison is becoming an ever more relevant matter for many fast growing areas such as information retrieval, computational biology, online searching. Levenshtein distance can be used mostly to edit distance, explaining the problem and its relevance.

int levDistance(const std::string source, const std::string target)
{

  // Step 1

  const int n = source.length();
  const int m = target.length();
  if (n == 0) {
    return m;
  }
  if (m == 0) {
    return n;
  }

  // Good form to declare a TYPEDEF

  typedef std::vector< std::vector > Tmatrix; 

  Tmatrix matrix(n+1);

  // Size the vectors in the 2.nd dimension. Unfortunately C++ doesn't
  // allow for allocation on declaration of 2.nd dimension of vec of vec

  for (int i = 0; i <= n; i++) {
    matrix[i].resize(m+1);
  }

  // Step 2

  for (int i = 0; i <= n; i++) {
    matrix[i][0]=i;
  }

  for (int j = 0; j <= m; j++) {
    matrix[0][j]=j;
  }

  // Step 3

  for (int i = 1; i <= n; i++) {

    const char s_i = source[i-1];

    // Step 4

    for (int j = 1; j <= m; j++) {

      const char t_j = target[j-1];

      // Step 5

      int cost;
      if (s_i == t_j) {
        cost = 0;
      }
      else {
        cost = 1;
      }

      // Step 6

      const int above = matrix[i-1][j];
      const int left = matrix[i][j-1];
      const int diag = matrix[i-1][j-1];
      int cell = min( above + 1, min(left + 1, diag + cost));

      // Step 6A: Cover transposition, in addition to deletion,
      // insertion and substitution. This step is taken from:
      // Berghel, Hal ; Roach, David : "An Extension of Ukkonen's 
      // Enhanced Dynamic Programming ASM Algorithm"
      // (http://www.acm.org/~hlb/publications/asm/asm.html)

      if (i>2 && j>2) {
        int trans=matrix[i-2][j-2]+1;
        if (source[i-2]!=t_j) trans++;
        if (s_i!=target[j-2]) trans++;
        if (cell>trans) cell=trans;
      }

      matrix[i][j]=cell;
    }
  }

  // Step 7

  return matrix[n][m];
}

4 thoughts on “C/C++ Implementation of Levenshtein Distance Algorithm for Approximate String Matching”

  1. Please fix this line:

    typedef std::vector Tmatrix;

    I think it’s supposed to be:

    typedef std::vector< std::vector > Tmatrix;

  2. The commet processor is stripping out the second field in my reply
    Curly brackets should be replaced with “”

    typedef std::vector{ std::vector {int} } Tmatrix;

  3. “`
    Tmatrix matrix(n+1);

    // Size the vectors in the 2.nd dimension. Unfortunately C++ doesn’t
    // allow for allocation on declaration of 2.nd dimension of vec of vec

    for (int i = 0; i <= n; i++) {
    matrix[i].resize(m+1);
    }
    “`
    Actually, you can do this:
    Tmatrix matrix(n + 1, std::vector(m + 1));
    Also, the for loop can be replaced with std::iota:
    std::iota(matrix[0].begin(), matrix[0].end(), 0);

Leave a Reply to Marc Cancel reply

Your email address will not be published. Required fields are marked *