Title: | Text Alignment with Smith-Waterman |
---|---|
Description: | Find similarities between texts using the Smith-Waterman algorithm. The algorithm performs local sequence alignment and determines similar regions between two strings. The Smith-Waterman algorithm is explained in the paper: "Identification of common molecular subsequences" by T.F.Smith and M.S.Waterman (1981), available at <doi:10.1016/0022-2836(81)90087-5>. This package implements the same logic for sequences of words and letters instead of molecular sequences. |
Authors: | Jan Wijffels [aut, cre, cph] (Rewrite of functionalities from the textreuse R package), Vrije Universiteit Brussel - DIGI: Brussels Platform for Digital Humanities [cph], Lincoln Mullen [ctb, cph] |
Maintainer: | Jan Wijffels <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.4 |
Built: | 2024-11-15 04:51:23 UTC |
Source: | https://github.com/digi-vub/text.alignment |
Align text using the Smith-Waterman algorithm.
The Smith–Waterman algorithm performs local sequence alignment.
It finds similar regions between two strings.
Similar regions are a sequence of either characters or words which are found by matching the characters or words of 2 sequences of strings.
If the word/letter is the same in each text, the alignment score is increased with the match score, while if they are not the same the local alignment score drops by the gap score.
If one of the 2 texts contains extra words/letters, the score drops by the mismatch score.
smith_waterman( a, b, type = c("characters", "words"), match = 2L, mismatch = -1L, gap = -1L, lower = TRUE, similarity = function(x, y) ifelse(x == y, 2L, -1L), tokenizer, collapse, edit_mark = "#", implementation = c("R", "Rcpp"), ... )
smith_waterman( a, b, type = c("characters", "words"), match = 2L, mismatch = -1L, gap = -1L, lower = TRUE, similarity = function(x, y) ifelse(x == y, 2L, -1L), tokenizer, collapse, edit_mark = "#", implementation = c("R", "Rcpp"), ... )
a |
a character string of length one |
b |
a character string of length one |
type |
either 'characters' or 'words' indicating to align based on a sequence of characters or a sequence of words. Defaults to 'characters'. |
match |
integer value of a score to assign a match (a letter/word from a and b which are the same during alignment). This value should be bigger than zero. Defaults to 2. |
mismatch |
integer value of a score to assign a mismatch (leave out 1 word / 1 letter from 1 of the 2 input strings during alignment). This value should be smaller or equal to zero. Defaults to -1. |
gap |
integer value of a score to assign a gap (drop 1 word / letter from each of the 2 input strings during alignment). This value should be smaller or equal to zero. Defaults to -1. |
lower |
logical indicating to lowercase text before doing the alignment. Defaults to TRUE. |
similarity |
optionally, a function to compare 2 characters or words. This function should have 2 arguments x and y with the 2 letters / words to compare and should return 1 number indicating the similarity between x and y. See the examples. |
tokenizer |
a function to tokenise text into either a sequence of characters or a sequence of words.
Defaults to |
collapse |
separator used to combined characters / words back together in the output. Defaults to ” for type 'characters' and a space for type 'words' |
edit_mark |
separator to indicated a gap/mismatch between sequences. Defaults to the hashtag symbol. |
implementation |
either 'R' or 'Rcpp' indicating to perform the alignment in Rcpp or with plain R code. Defaults to 'R'. |
... |
extra arguments passed on to |
The code uses similar code as the textreuse::local_align
function and also allows to align character sequences next to aligning word sequences
an object of class smith_waterman which is a list with elements
type: The alignment type
edit_mark:The edit_mark
sw: The Smith-Waterman local alignment score
similarity: Score between 0 and 1, calculated as the Smith-Waterman local alignment score / (the number of letters/words in the shortest text times the match weight)
weights: The list of weights provided to the function: match, mismatch and gap
matches: The number of matches found during alignment
mismatches: The number of mismatches found during alignment
a: A list with alignment information from the text provided in a
. The list elements documented below
b: A list with alignment information from the text provided in b
. The list elements documented below
Elements a
and b
are both lists which contain
text: The provided character string of either a or b
tokens: A character vector of the tokenised texts of a or b
n: The length of tokens
similarity: The similarity to a calculated as the Smith-Waterman local alignment score / (the number of letters/words in the a or b text times the match weight)
alignment: A list with the following elements
text: The aligned text from either a or b where gaps/mismatches are filled up with the edit_mark
symbol
tokens: The character vector of tokens which form the aligned text
n: The length of the aligned text
gaps: The number of gaps during alignment
from: The starting position in the full tokenised tokens
element from either a or b where the aligned text is found. See the example.
to: The end position in the full tokenised tokens
element from either a or b where the aligned text is found. See the example.
https://en.wikipedia.org/wiki/Smith-Waterman_algorithm
## align sequence of letters smith_waterman("Joske Vermeulen", "Jiske Vermoelen") smith_waterman("Joske Vermeulen", "Ik zoek naar Jiske Versmoelen, waar is die te vinden") smith_waterman("Joske", "Jiske") smith_waterman("Joske", "Jiske", similarity = function(x, y) ifelse(x == y | (x == "o" & y == "i"), 2L, -1L)) ## align sequence of words a <- "The answer is blowin' in the wind." b <- "As the Bob Dylan song says, the answer is blowing in the wind." smith_waterman(a, b) smith_waterman(a, b, type = "characters") smith_waterman(a, b, type = "words") smith_waterman(a, b, type = "words", similarity = function(x, y) adist(x, y)) smith_waterman(a, b, type = "words", tokenizer = function(x) unlist(strsplit(x, "[[:space:]]"))) x <- smith_waterman(a, b, type = "words") x$b$tokens[x$b$alignment$from:x$b$alignment$to] # examples on aligning text files a <- system.file(package = "text.alignment", "extdata", "example1.txt") a <- readLines(a) a <- paste(a, collapse = "\n") b <- system.file(package = "text.alignment", "extdata", "example2.txt") b <- readLines(b) b <- paste(b, collapse = "\n") smith_waterman(a, b, type = "characters") smith_waterman(a, b, type = "words") smith_waterman("Gistel Hof", b, type = "characters") smith_waterman("Bailiestraat", b, type = "characters") smith_waterman("Lange rei", b, type = "characters") # examples on extracting where elements were found x <- smith_waterman("Lange rei", b) x$b$tokens[x$b$alignment$from:x$b$alignment$to] as.data.frame(x) as.data.frame(x, alignment_id = "alignment-ab") x <- lapply(c("Lange rei", "Gistel Hof", NA, "Test"), FUN = function(a){ x <- smith_waterman(a, b) x <- as.data.frame(x) x }) x <- do.call(rbind, x) x
## align sequence of letters smith_waterman("Joske Vermeulen", "Jiske Vermoelen") smith_waterman("Joske Vermeulen", "Ik zoek naar Jiske Versmoelen, waar is die te vinden") smith_waterman("Joske", "Jiske") smith_waterman("Joske", "Jiske", similarity = function(x, y) ifelse(x == y | (x == "o" & y == "i"), 2L, -1L)) ## align sequence of words a <- "The answer is blowin' in the wind." b <- "As the Bob Dylan song says, the answer is blowing in the wind." smith_waterman(a, b) smith_waterman(a, b, type = "characters") smith_waterman(a, b, type = "words") smith_waterman(a, b, type = "words", similarity = function(x, y) adist(x, y)) smith_waterman(a, b, type = "words", tokenizer = function(x) unlist(strsplit(x, "[[:space:]]"))) x <- smith_waterman(a, b, type = "words") x$b$tokens[x$b$alignment$from:x$b$alignment$to] # examples on aligning text files a <- system.file(package = "text.alignment", "extdata", "example1.txt") a <- readLines(a) a <- paste(a, collapse = "\n") b <- system.file(package = "text.alignment", "extdata", "example2.txt") b <- readLines(b) b <- paste(b, collapse = "\n") smith_waterman(a, b, type = "characters") smith_waterman(a, b, type = "words") smith_waterman("Gistel Hof", b, type = "characters") smith_waterman("Bailiestraat", b, type = "characters") smith_waterman("Lange rei", b, type = "characters") # examples on extracting where elements were found x <- smith_waterman("Lange rei", b) x$b$tokens[x$b$alignment$from:x$b$alignment$to] as.data.frame(x) as.data.frame(x, alignment_id = "alignment-ab") x <- lapply(c("Lange rei", "Gistel Hof", NA, "Test"), FUN = function(a){ x <- smith_waterman(a, b) x <- as.data.frame(x) x }) x <- do.call(rbind, x) x
Extract misaligned elements from the Smith-Waterman alignment, namely
before_alignment: Sections in a or b which were occurring before the alignment
wrong_alignment: Sections in a or b which were mismatched in the alignment
after_alignment: Sections in a or b which were occurring after the alignment
smith_waterman_misaligned(x, type = c("a", "b"))
smith_waterman_misaligned(x, type = c("a", "b"))
x |
an object of class |
type |
either 'a' or 'b' indicating to return elements misaligned from |
a list of character vectors of misaligned elements
before_alignment: Sections in a or b which were occurring before the alignment
wrong_alignment: Sections in a or b which were mismatched in the alignment
after_alignment: Sections in a or b which were occurring after the alignment
combined: The combination of before_alignment
, wrong_alignment
and after_alignment
partial: Logical, where TRUE indicates that there is at least a partial alignment and FALSE indicating no alignment between a
and b
was done (alignment score of 0)
sw <- smith_waterman("ab test xy", "cd tesst ab", type = "characters") sw misses <- smith_waterman_misaligned(sw, type = "a") str(misses) misses <- smith_waterman_misaligned(sw, type = "b") str(misses) a <- system.file(package = "text.alignment", "extdata", "example1.txt") a <- readLines(a) a <- paste(a, collapse = "\n") b <- system.file(package = "text.alignment", "extdata", "example2.txt") b <- readLines(b) b <- paste(b, collapse = "\n") sw <- smith_waterman(a, b, type = "characters") smith_waterman_misaligned(sw, type = "a") smith_waterman_misaligned(sw, type = "b")
sw <- smith_waterman("ab test xy", "cd tesst ab", type = "characters") sw misses <- smith_waterman_misaligned(sw, type = "a") str(misses) misses <- smith_waterman_misaligned(sw, type = "b") str(misses) a <- system.file(package = "text.alignment", "extdata", "example1.txt") a <- readLines(a) a <- paste(a, collapse = "\n") b <- system.file(package = "text.alignment", "extdata", "example2.txt") b <- readLines(b) b <- paste(b, collapse = "\n") sw <- smith_waterman(a, b, type = "characters") smith_waterman_misaligned(sw, type = "a") smith_waterman_misaligned(sw, type = "b")
Utility function to perform all pairwise combinations of alignments between text.
smith_waterman_pairwise(a, b, FUN = identity, ...)
smith_waterman_pairwise(a, b, FUN = identity, ...)
a |
a data.frame with columns doc_id and text. Or a character vector where the names of the character vector respresent a doc_id and the character vector corresponds to the text. |
b |
a data.frame with columns doc_id and text. Or a character vector where the names of the character vector respresent a doc_id and the character vector corresponds to the text. |
FUN |
a function to apply on an object of class |
... |
other arguments passed on to |
a list of pairwise Smith-Waterman comparisons after which the FUN argument is applied on all of these pairwise alignments.
The output of the result of FUN is enriched by adding a list element
a_doc_id and b_doc_id which correspond to the doc_id's provided in a
and b
and which can be used
in order to identify the match.
x <- data.frame(doc_id = c(1, 2), text = c("This is some text", "Another set of texts."), stringsAsFactors = FALSE) y <- data.frame(doc_id = c(1, 2, 3), text = c("were as some thing", "else, another set", NA_character_), stringsAsFactors = FALSE) alignments <- smith_waterman_pairwise(x, y) alignments alignments <- smith_waterman_pairwise(x, y, FUN = as.data.frame) do.call(rbind, alignments) alignments <- smith_waterman_pairwise(x, y, FUN = function(x) list(sim = x$similarity)) do.call(rbind, alignments) x <- c("1" = "This is some text", "2" = "Another set of texts.") y <- c("1" = "were as some thing", "2" = "else, another set", "3" = NA_character_) alignments <- smith_waterman_pairwise(x, y)
x <- data.frame(doc_id = c(1, 2), text = c("This is some text", "Another set of texts."), stringsAsFactors = FALSE) y <- data.frame(doc_id = c(1, 2, 3), text = c("were as some thing", "else, another set", NA_character_), stringsAsFactors = FALSE) alignments <- smith_waterman_pairwise(x, y) alignments alignments <- smith_waterman_pairwise(x, y, FUN = as.data.frame) do.call(rbind, alignments) alignments <- smith_waterman_pairwise(x, y, FUN = function(x) list(sim = x$similarity)) do.call(rbind, alignments) x <- c("1" = "This is some text", "2" = "Another set of texts.") y <- c("1" = "were as some thing", "2" = "else, another set", "3" = NA_character_) alignments <- smith_waterman_pairwise(x, y)
Tokenise text into a sequence of characters
tokenize_letters(x)
tokenize_letters(x)
x |
a character string of length 1 |
a character vector with the sequence of characters in x
tokenize_letters("This function just chunks up text in letters\nOK?") tokenize_letters("Joske Vermeulen")
tokenize_letters("This function just chunks up text in letters\nOK?") tokenize_letters("Joske Vermeulen")
Tokenise text into a sequence of words. The function uses strsplit
to split text into words
by using the [:space:] and [:punct:] character classes.
tokenize_spaces_punct(x)
tokenize_spaces_punct(x)
x |
a character string of length 1 |
a character vector with the sequence of words in x
tokenize_spaces_punct("This just splits. Text.alongside\nspaces right?") tokenize_spaces_punct("Also .. multiple punctuations or ??marks") tokenize_spaces_punct("Joske Vermeulen")
tokenize_spaces_punct("This just splits. Text.alongside\nspaces right?") tokenize_spaces_punct("Also .. multiple punctuations or ??marks") tokenize_spaces_punct("Joske Vermeulen")