Структура данных или алгоритм для быстрого поиска различий между строками

19

У меня есть массив из 100 000 строк, все длиной $k$ . Я хочу сравнить каждую строку с любой другой строкой, чтобы увидеть, отличаются ли любые две строки на 1 символ. Прямо сейчас, когда я добавляю каждую строку в массив, я проверяю ее по каждой строке, уже находящейся в массиве, которая имеет временную сложность $\frac{n(n-1)}{2} k$ .

Есть ли структура данных или алгоритм, который может сравнивать строки друг с другом быстрее, чем то, что я уже делаю?

Некоторая дополнительная информация:

Порядок имеет значение: abcdeи xbcdeотличается на 1 символ, а abcdeи edcbaотличается на 4 символа.
Для каждой пары строк, которые отличаются на один символ, я буду удалять одну из этих строк из массива.
Сейчас я ищу строки, которые отличаются только на 1 символ, но было бы хорошо, если бы эта разница в 1 символ могла быть увеличена, скажем, до 2, 3 или 4 символов. Тем не менее, в этом случае, я думаю, эффективность важнее, чем способность увеличивать лимит различий между персонажами.
$k$ обычно находится в диапазоне 20-40.

— JGut
источник

4

Поиск строкового словаря с 1 ошибкой является довольно известной проблемой, например, cs.nyu.edu/~adi/CGL04.pdf

— KWillets,

1

20-40mer может использовать немало места. Вы можете посмотреть на фильтр Блума ( en.wikipedia.org/wiki/Bloom_filter ), чтобы проверить, являются ли вырожденные строки - набор всех элементов из одной, двух или более замен на тестовом элементе - «возможно» или «определенно» не-в "наборе кмерс. Если вы получили «возможно», то дополнительно сравните две строки, чтобы определить, является ли это ложным срабатыванием. Случаи «определенно не в» - это настоящие негативы, которые уменьшат общее количество буквенных сравнений, которые вы должны выполнить, ограничивая сравнения только потенциальными «возможными попаданиями».

— Алекс Рейнольдс

Если вы работали с меньшим диапазоном k, вы можете использовать набор битов для хранения хеш-таблицы логических значений для всех вырожденных строк (например, github.com/alexpreynolds/kmer-boolean для игрушечного примера). Однако для k = 20-40 требования к пространству для набора битов слишком велики.

— Алекс Рейнольдс

12

Можно достичь $O(nk \log k)$ наихудшего времени работы.

Давайте начнем с простого. Если вам нужно простое в реализации решение, которое будет эффективным на многих входах, но не на всех, вот простое, прагматичное, простое в реализации решение, которого на практике достаточно для многих ситуаций. Однако в худшем случае он возвращается к квадратичному времени работы.

Возьмите каждую строку и сохраните ее в хеш-таблице с ключом в первой половине строки. Затем выполните итерации по хеш-таблицам. Для каждой пары строк в одном и том же сегменте проверьте, отличаются ли они на 1 символ (т. Е. Проверьте, отличается ли их вторая половина на 1 символ).

Затем возьмите каждую строку и сохраните ее в хеш-таблице, на этот раз с ключом во второй половине строки. Снова проверьте каждую пару строк в одном ведре.

Предполагая, что строки хорошо распределены, время выполнения, вероятно, будет около . Более того, если существует пара строк, которые отличаются на 1 символ, она будет найдена во время одного из двух проходов (поскольку они отличаются только на 1 символ, этот различающийся символ должен находиться либо в первой, либо во второй половине строки, поэтому вторая или первая половина строки должна быть одинаковой). Однако в худшем случае (например, если все строки начинаются или заканчиваются одинаковыми символами), это ухудшает время выполнения , поэтому его время выполнения в худшем случае не является улучшением грубой силы , $O(nk)$ $k/2$ $O(n^2 k)$

В качестве оптимизации производительности, если в одном сегменте слишком много строк, вы можете рекурсивно повторить один и тот же процесс, чтобы найти пару, отличающуюся одним символом. Рекурсивный вызов будет выполняться для строк длиной . $k/2$

Если вам небезразлично время работы в худшем случае:

С учетом вышеописанной оптимизации производительности, я считаю, что наихудшее время выполнения - . $O(nk \log k)$

— DW
источник

3

Если

строк имеют одну и ту же первую половину, что вполне может случиться в реальной жизни, то вы не улучшили сложность.

Ω (n)

$\Omega(n)$

— einpoklum

@einpoklum, конечно! Вот почему я написал утверждение во втором предложении о том, что в худшем случае оно прибегает к квадратичному времени выполнения, а также утверждение в моем последнем предложении, описывающее, как добиться сложности

худшем случае, если вы заботитесь о худшем случае. Но, наверное, я не очень четко выразил это, поэтому я соответственно отредактировал свой ответ. Теперь лучше?

O (n k \log k)

$O(nk \log k)$

— DW

15

Мое решение похоже на j_random_hacker's, но использует только один хэш-набор.

Я бы создал хеш-набор строк. Для каждой строки во входе добавьте в набор строк. В каждой из этих строк замените одну из букв специальным символом, которого нет ни в одной из строк. Пока вы добавляете их, убедитесь, что их нет в наборе. Если они есть, то у вас есть две строки, которые отличаются только (максимум) одним символом. $k$

Пример со строками 'abc', 'adc'

Для abc мы добавляем '* bc', 'a * c' и 'ab *'

Для adc мы добавляем '* dc', 'a * c' и 'ad *'

Когда мы добавляем 'a * c' во второй раз, мы замечаем, что он уже находится в наборе, поэтому мы знаем, что есть две строки, которые отличаются только одной буквой.

Общее время работы этого алгоритма составляет . Это потому, что мы создаем новых строк для всех строк на входе. Для каждой из этих строк нам нужно вычислить хеш, который обычно занимает времени. $O(n*k^2)$ $k$ $n$ $O(k)$

Хранение всех строк занимает пространства. $O(n*k^2)$

Дальнейшие улучшения

Мы можем улучшить алгоритм дальше, не сохраняя измененные строки напрямую, а вместо этого сохраняя объект со ссылкой на исходную строку и индекс маскируемого символа. Таким образом, нам не нужно создавать все строки, и нам нужно только место для хранения всех объектов. $O(n*k)$

Вам нужно будет реализовать пользовательскую хеш-функцию для объектов. Мы можем взять реализацию Java в качестве примера, см. Документацию по Java . Java hashCode умножает значение Юникода каждого символа на (с длиной строки и индексом, основанным на одном символе. Обратите внимание, что каждая измененная строка отличается только на один символ от оригинала. Мы можем легко вычислить вклад этого символа в хэш-код. Мы можем вычесть это и добавить вместо этого наш маскирующий символ. Для вычисления требуется . Это позволяет нам сократить общее время выполнения до $31^{k-i}$ $k$ $i$ $O(1)$ $O(n*k)$

— Саймон Принс
источник

4

@JollyJoker Да, космос - это проблема для этого метода. Вы можете уменьшить пространство, не сохраняя измененные строки, а вместо этого сохраняя объект со ссылкой на строку и замаскированный индекс. Это должно оставить вас с O (NK) место.

— Саймон Принс

Чтобы вычислить

хэшей для каждой строки за время

, я думаю, вам понадобится специальная самодельная хеш-функция (например, вычислить хеш исходной строки за время

, а затем XOR с каждым из удаленных символов в

каждый раз (хотя это, вероятно, довольно плохая хеш-функция в других отношениях). Кстати, это очень похоже на мое решение, но с одной хеш-таблицей вместо

отдельных и заменой символа на «*» вместо его удаления.

k

$k$

O (k)

$O(k)$

O (k)

$O(k)$

O (1)

$O(1)$

k

$k$

— j_random_hacker

@SimonPrins С пользовательскими equalsи hashCodeметодами, которые могут работать. Простое создание строки в стиле * в этих методах должно сделать ее пуленепробиваемой; Я подозреваю, что у некоторых других ответов здесь будут проблемы столкновения хэша.

— JollyJoker

1

@WW Я изменил свой пост, чтобы отразить тот факт, что вычисление хэшей занимает время

и добавил решение, чтобы уменьшить общее время выполнения до

.

O (k)

$O(k)$

O (n * k)

$O(n*k)$

— Саймон Принс

1

@SimonPrins Наихудший случай может быть nk ^ 2 из-за проверки равенства строк в hashset.contains, когда хэши сталкиваются. Конечно, в худшем случае, когда каждая строка имеет точно такой же хэш, который потребует довольно много ручного набора строк, особенно чтобы получить тот же хэш *bc, a*c, ab*. Интересно, можно ли это показать невозможным?

— JollyJoker

7

Я бы сделал хеш-таблиц , каждая из которых имеет строку длиной в качестве ключа и список чисел (идентификаторов строк) в качестве значения. Хеш-таблица будет содержать все строки, которые были обработаны до сих пор, но символ в позиции удален . Например, если , то будет содержать список всех рассмотренных до сих пор строк, имеющих шаблон $k$ $H_1, \dots, H_k$ $(k-1)$ $H_i$ $i$ $k=6$ $H_3[ABDEF]$ , где означает «любой символ». Затем для обработки входной строки : $AB\cdot DEF$ $\cdot$ $j$ $s_j$

Для каждого в диапазоне от 1 до k :
- Сформируйте строку , удалив символ из . $s_j'$ $i$ $s_j$
- Посмотри . Каждый идентификатор строки здесь идентифицирует исходную строку, которая либо равна , либо отличается только в позиции . Выведите их как совпадения для строки . (Если вы хотите исключить точные дубликаты, сделайте тип значения хеш-таблиц парой (идентификатор строки, удаленный символ), чтобы можно было проверить те, у которых был удален тот же символ, который мы только что удалили из .) $H_i[s_j']$ $s$ $i$ $s_j$ $s_j$
- Вставьте в для использования в будущем. $j$ $H_i$

Если мы храним каждый хэш-ключ явно, то мы должны использовать пространство и, таким образом, иметь временную сложность, по крайней мере, такую. Но, как описано Саймоном Принсом , можно представить серию модификаций строки (в его случае описанную как изменение отдельных символов на , в моем случае как удаление) неявным образом, так что все ключей хеш-функции для конкретной строки должны просто пространство, приводящее к пространству в целом, и открывающее возможность $O(nk^2)$ * $k$ $O(k)$ $O(nk)$ $O(nk)$ время тоже. Чтобы достичь этой временной сложности, нам нужен способ для вычисления хэшей для всех вариаций строки длины за время : например, это может быть сделано с использованием полиномиальных хэшей, как предлагает DW (и это скорее всего, гораздо лучше, чем просто XOR для удаленного символа с хешем для исходной строки). $k$ $k$ $O(k)$

Уловка неявного представления Саймона Принса также означает, что «удаление» каждого символа фактически не выполняется, поэтому мы можем использовать обычное представление строки на основе массива без снижения производительности (вместо связанных списков, как я изначально предлагал).

— j_random_hacker
источник

2

Хорошее решение. Примером подходящей сделанной на заказ хеш-функции может быть полиномиальный хеш.

— DW

Спасибо @DW Не могли бы вы немного уточнить, что вы подразумеваете под "полиномиальным хешем"? Погугление термина не принесло мне ничего, что казалось бы окончательным. (Пожалуйста, не стесняйтесь редактировать мой пост напрямую, если хотите.)

— j_random_hacker

1

Просто прочитайте строку как базовое число

по модулю

, где

на некоторое простое число меньше, чем размер вашей хэш-карты, и

является корнем примитива

, а

больше размера алфавита. Это называется «полиномиальным хэшем», потому что это похоже на вычисление полинома, коэффициенты которого задаются строкой в точке

. Я оставлю это в качестве упражнения, чтобы выяснить, как вычислить все нужные хеши за

время. Обратите внимание, что этот подход не застрахован от противника, если вы случайно не выберете оба

удовлетворяющие требуемым условиям.

q

$q$

p

$p$

p

$p$

q

$q$

p

$p$

q

$q$

q

$q$

O (k)

$O(k)$

p, q

$p,q$

— user21820

1

I think this solution can be further refined by observing that only one of the k hash tables needs to exist at any one time, thus reducing the memory requirement.

— Michael Kay

1

@MichaelKay: That won't work if you want to compute the

k

$k$ hashes of the possible alterations of a string in

O (k)

$O(k)$ time. You still need to store them somewhere. So if you only check one position at a time, you will take

k

$k$ times as long as if you check all positions together using

k

$k$ times as many hashtable entries.

— user21820

2

Here is a more robust hashtable approach than the polynomial-hash method. First generate $k$ random positive integers $r_{1..k}$ that are coprime to the hashtable size $M$ . Namely, $0 \le r_i < M$ . Then hash each string $x_{1..k}$ to $(\sum_{i=1}^k x_i r_i ) \bmod M$ . There is almost nothing an adversary can do to cause very uneven collisions, since you generate $r_{1..k}$ on run-time and so as $k$ increases the maximum probability of collision of any given pair of distinct strings goes quickly to $1/M$ . It is also obvious how to compute in $O(k)$ time all the possible hashes for each string with one character changed.

If you really want to guarantee uniform hashing, you can generate one random natural number $r(i,c)$ less than $M$ for each pair $(i,c)$ for $i$ from $1$ to $k$ and for each character $c$ , and then hash each string $x_{1..k}$ to $(\sum_{i=1}^k r(i,x_i) ) \bmod M$ . Then the probability of collision of any given pair of distinct strings is exactly $1/M$ . This approach is better if your character set is relatively small compared to $n$ .

— user21820
источник

2

A lot of the algorithms posted here use quite a bit of space on hash tables. Here's an $O(1)$ auxiliary storage $O((n \lg n) \cdot k^2)$ runtime simple algorithm.

The trick is to use $C_k(a, b)$ , which is a comparator between two values $a$ and $b$ that returns true if $a < b$ (lexicographically) while ignoring the $k$ th character. Then algorithm is as follows.

First, simply sort the strings regularly and do a linear scan to remove any duplicates.

Then, for each $k$ :

Sort the strings with $C_k$ as comparator.
Strings that differ only in $k$ are now adjacent and can be detected in a linear scan.

— orlp
источник

1

Two strings of length k, differing in one character, share a prefix of length l and a suffix of length m such that k=l+m+1.

The answer by Simon Prins encodes this by storing all prefix/suffix combinations explicitly, i.e. abc becomes *bc, a*c and ab*. That's k=3, l=0,1,2 and m=2,1,0.

As valarMorghulis points out, you can organize words in a prefix tree. There's also the very similar suffix tree. It's fairly easy to augment the tree with the number of leaf nodes below each prefix or suffix; this can be updated in O(k) when inserting a new word.

The reason you want these sibling counts is so you know, given a new word, whether you want to enumerate all strings with the same prefix or whether to enumerate all strings with the same suffix. E.g. for "abc" as input, the possible prefixes are "", "a" and "ab", while the corresponding suffixes are "bc", "c" and "". As it obvious, for short suffixes it's better to enumerate siblings in the prefix tree and vice versa.

As @einpoklum points out, it's certainly possible that all strings share the same k/2 prefix. That's not a problem for this approach; the prefix tree will be linear up to depth k/2 with each node up to k/2 depth being the ancestor of 100.000 leaf nodes. As a result, the suffix tree will be used up to (k/2-1) depth, which is good because the strings have to differ in their suffixes given that they share prefixes.

[edit] As an optimization, once you've determined the shortest unique prefix of a string, you know that if there's one different character, it must be the last character of the prefix, and you'd have found the near-duplicate when checking a prefix that was one shorter. So if "abcde" has a shortest unique prefix "abc", that means there are other strings that start with "ab?" but not with "abc". I.e. if they'd differ in only one character, that would be that third character. You don't need to check for "abc?e" anymore.

By the same logic, if you would find that "cde" is a unique shortest suffix, then you know you need to check only the length-2 "ab" prefix and not length 1 or 3 prefixes.

Note that this method works only for exactly one character differences and does not generalize to 2 character differences, it relies one one character being the separation between identical prefixes and identical suffixes.

— MSalters
источник

Are you suggesting that for each string

s

$s$ and each

1 \leq i \leq k

$1 \le i \le k$ , we find the node

P [s_{1}, \dots, s_{i - 1}]

$P[s_1, \dots, s_{i-1}]$ corresponding to the length-

(i - 1)

$(i-1)$ prefix in the prefix trie, and the node

S [s_{i + 1}, \dots, s_{k}]

$S[s_{i+1}, \dots, s_k]$ corresponding to the length-

(k - i - 1)

$(k-i-1)$ suffix in the suffix trie (each takes amortised

O (1)

$O(1)$ time), and compare the number of descendants of each, choosing whichever has fewer descendants, and then "probing" for the rest of the string in that trie?

— j_random_hacker

1

What is the running time of your approach? It looks to me like in the worst case it might be quadratic: consider what happens if every string starts and ends with the same

k / 4

$k/4$ characters.

— D.W.

The optimization idea is clever and interesting. Did you have in mind a particular way to do the check for mtaches? If the "abcde" has the shortest unique prefix "abc", that means we should check for some other string of the form "ab?de". Did you have in mind a particular way to do that, that will be efficient? What's the resulting running time?

— D.W.

@D.W.: The idea is that to find strings in the form "ab?de", you check the prefix tree how many leaf nodes exist below "ab" and in the suffix tree how many nodes exist under "de", then choose the smallest of the two to enumerate. When all strings begin and end with the same k/4 characters; that means the first k/4 nodes in both trees have one child each. And yes, every time you need those trees, those have to be traversed which is an O(n*k) step.

— MSalters

To check for a string of the form "ab?de" in the prefix trie, it suffices to get to the node for "ab", then for each of its children

v

$v$ , check whether the path "de" exists below

v

$v$ . That is, don't bother enumerating any other nodes in these subtries. This takes

O (a h)

$O(ah)$ time, where

a

$a$ is the alphabet size and

h

$h$ is the height of the initial node in the trie.

h

$h$ is

O (k)

$O(k)$ , so if the alphabet size is

O (n)

$O(n)$ then it is indeed

O (n k)

$O(nk)$ time overall, but smaller alphabets are common. The number of children (not descendants) is important, as well as the height.

— j_random_hacker

1

Storing strings in buckets is a good way (there are already different answers outlining this).

An alternative solution could be to store strings in a sorted list. The trick is to sort by a locality-sensitive hashing algorithm. This is a hash algorithm which yields similar results when the input is similar[1].

Each time you want to investigate a string, you could calculate its hash and lookup the position of that hash in your sorted list (taking $O(log(n))$ for arrays or $O(n)$ for linked lists). If you find that the neighbours (considering all close neighbours, not only those with an index of +/- 1) of that position are similar (off by one character) you found your match. If there are no similar strings, you can insert the new string at the position you found (which takes $O(1)$ for linked lists and $O(n)$ for arrays).

One possible locality-sensitive hashing algorithm could be Nilsimsa (with open source implementation available for example in python).

[1]: Note that often hash algorithms, like SHA1, are designed for the opposite: producing greatly differing hashes for similar, but not equal inputs.

Disclaimer: To be honest, I would personally implement one of the nested/tree-organized bucket-solutions for a production application. However, the sorted list idea struck me as an interesting alternative. Note that this algorithm highly depends on the choosen hash algorithm. Nilsimsa is one algorithm I found - there are many more though (for example TLSH, Ssdeep and Sdhash). I haven't verified that Nilsimsa works with my outlined algorithm.

— tessi
источник

1

Interesting idea, but I think we would need to have some bounds on how far apart two hash values can be when their inputs differ by just 1 character -- then scan everything within that range of hash values, instead of just neighbours. (It's impossible to have a hash function that produces adjacent hash values for all possible pairs of strings that differ by 1 character. Consider the length-2 strings in a binary alphabet: 00, 01, 10 and 11. If h(00) is adjacent to both h(10) and h(01) then it must be between them, in which case h(11) can't be adjacent to them both, and vice versa.)

— j_random_hacker

Looking at neighbors isn't sufficient. Consider the list abcd, acef, agcd. There exists a matching pair, but your procedure will not find it, as abcd is not a neighbor of agcd.

— D.W.

You both are right! With neighbours I didn't mean only "direct neighbours" but thought of "a neighbourhood" of close positions. I didn't specify how many neighbours need to be looked at since that depends on the hash algorithm. But you're right, I should probably note this down in my answer. thanks :)

— tessi

1

"LSH... similar items map to the same “buckets” with high probability" - since it's probability algorithm, result isn't guaranteed. So it depends on TS whether he needs 100% solution or 99.9% is enough.

— Bulat

1

One could achieve the solution in $O(nk+ n^2)$ time and $O(nk)$ space using enhanced suffix arrays (Suffix array along with the LCP array) that allows constant time LCP (Longest Common Prefix) query (i.e. Given two indices of a string, what is the length of the longest prefix of the suffixes starting at those indices). Here, we could take advantage of the fact that all strings are of equal length. Specifically,

Build the enhanced suffix array of all the $n$ strings concatenated together. Let $X = x_1.x_2.x_3 .... x_n$ where $x_i, \forall 1 \le i \le n$ is a string in the collection. Build the suffix array and LCP array for $X$ .

Now each $x_i$ starts at position $(i-1)k$ in the zero-based indexing. For each string $x_i$ , take LCP with each of the string $x_j$ such that $j<i$ . If LCP goes beyond the end of $x_j$ then $x_i = x_j$ . Otherwise, there is a mismatch (say $x_i[p] \ne x_j[p]$ ); in this case take another LCP starting at the corresponding positions following the mismatch. If the second LCP goes beyond the end of $x_j$ then $x_i$ and $x_j$ differ by only one character; otherwise there are more than one mismatches.

for (i=2; i<= n; ++i){
    i_pos = (i-1)k;
    for (j=1; j < i; ++j){
        j_pos = (j-1)k;
        lcp_len = LCP (i_pos, j_pos);
        if (lcp_len < k) { // mismatch
            if (lcp_len == k-1) { // mismatch at the last position
            // Output the pair (i, j)
            }
            else {
              second_lcp_len = LCP (i_pos+lcp_len+1, j_pos+lcp_len+1);
              if (lcp_len+second_lcp_len>=k-1) { // second lcp goes beyond
                // Output the pair(i, j)
              }
            }
        }
    }
}

You could use SDSL library to build the suffix array in compressed form and answer the LCP queries.

Analysis: Building the enhanced suffix array is linear in the length of $X$ i.e. $O(nk)$ . Each LCP query takes constant time. Thus, querying time is $O(n^2)$ .

Generalisation: This approach can also be generalised to more than one mismatches. In general, running time is $O(nk + qn^2)$ where $q$ is the number of allowed mismatches.

If you wish to remove a string from the collection, instead of checking every $j<i$ , you could keep a list of only 'valid' $j$ .

— Ritu Kundu
источник

Can i say that

O (k n^{2})

$O(kn^2)$ algo is trivial - just compare each string pair and count number of matches? And k in this formula practically can be omitted, since with SSE you can count matching bytes in 2 CPU cycles per 16 symbols (i.e. 6 cycles for k=40).

— Bulat

Apologies but I could not understand your query. The above approach is

O (n k + n^{2})

$O(nk + n^2)$ and not

O (k n^{2})

$O(kn^2)$ . Also, it is virtually alphabet-size independent. It could be used in conjunction with the hash-table approach -- Once two strings are found to have the same hashes, they could be tested if they contain a single mismatch in

O (1)

$O(1)$ time.

— Ritu Kundu

My point is that k=20..40 for the question author and comparing such small strings require only a few CPU cycles, so practical difference between brute force and your approach probably doesn't exist.

— Bulat

1

One improvement to all the solutions proposed. They all require $O(nk)$ memory in the worst case. You can reduce it by computing hashes of strings with * instead each character, i.e. *bcde, a*cde... and processing at each pass only variants with hash value in certain integer range. F.e. with even hash values in the first pass, and odd hash values in the second one.

You can also use this approach to split the work among multiple CPU/GPU cores.

— Bulat
источник

Clever suggestion! In this case, the original question says

n = 100, 000

$n=100,000$ and

k \approx 40

$k\approx 40$ , so

O (n k)

$O(nk)$ memory doesn't seem likely to be an issue (that might be something like 4MB). Still a good idea worth knowing if one needs to scale this up, though!

— D.W.

0

This is a short version of @SimonPrins' answer not involving hashes.

Assuming none of your strings contain an asterisk:

Create a list of size $nk$ where each of your strings occurs in $k$ variations, each having one letter replaced by an asterisk (runtime $\mathcal{O}(nk^2)$ )
Sort that list (runtime $\mathcal{O}(nk^2\log nk)$ )
Check for duplicates by comparing subsequent entries of the sorted list (runtime $\mathcal{O}(nk^2)$ )

An alternative solution with implicit usage of hashes in Python (can't resist the beauty):

def has_almost_repeats(strings,k):
    variations = [s[:i-1]+'*'+s[i+1:] for s in strings for i in range(k)]
    return len(set(variations))==k*len(strings)

— Bananach
источник

Thanks. Please also mention the

k

$k$ copies of exact duplicates, and I'll +1. (Hmm, just noticed I made the same claim about

O (n k)

$O(nk)$ time in my own answer... Better fix that...)

— j_random_hacker

@j_random_hacker I don't know what exactly the OP wants reported, so I left step 3 vague but I think it is trivial with some extra work to report either (a) a binary any duplicate/no duplicates result or (b) a list of pairs of strings that differ in at most one position, without duplicates. If we take the OP literally ("...to see if any two strings..."), then (a) seems to be desired. Also, if (b) were desired then of course simply creating a list of pairs may take

O (n^{2})

$\mathcal{O}(n^2)$ if all strings are equal

— Bananach

0

Вот мой взгляд на 2+ несоответствия искателя. Обратите внимание, что в этом посте я рассматриваю каждую строку как круговую, например подстрока длины 2 в индексе k-1состоит из символа, str[k-1]за которым следует str[0]. И подстрока длины 2 в индексе -1одинакова!

Если у нас есть Mнесоответствия между двумя строками длины k, они имеют совпадающую подстроку длиной не менее $mlen(k,M) = \lceil{k/M}\rceil-1$ поскольку в худшем случае несоответствующие символы разбивают (круговую) строку на Mсегменты одинакового размера. Fe с k=20и M=4«худший» матч может иметь образец abcd*efgh*ijkl*mnop*.

Теперь алгоритм поиска всех несоответствий до Mсимволов среди строк kсимволов:

для каждого я от 0 до к-1
- разбить все строки на группы по str[i..i+L-1], где L = mlen(k,M). Например, если у L=4вас есть алфавит только из 4 символов (из ДНК), это составит 256 групп.
- Группы меньше ~ 100 строк можно проверить с помощью алгоритма грубой силы
- Для больших групп мы должны выполнить вторичное деление:
  - Удалить из каждой строки в группе Lсимволы, которые мы уже сопоставили
  - для каждого j от i-L + 1 до kL-1
    - разбить все строки на группы по str[i..i+L1-1], где L1 = mlen(k-L,M). Fe , если k=20, M=4, alphabet of 4 symbols, так L=4и L1=3, это сделает 64 групп.
    - остальное оставлено как упражнение для читателя: D

Почему мы не начинаем jс 0? Поскольку мы уже создали эти группы с одинаковым значением i, поэтому работа с j<=i-Lбудет точно эквивалентна работе с заменой значений i и j.

Дальнейшая оптимизация:

В каждой позиции также учитывайте строки str[i..i+L-2] & str[i+L]. Это только удваивает количество созданных рабочих мест, но позволяет увеличить Lна 1 (если моя математика верна). Таким образом, например, вместо 256 групп, вы разделите данные на 1024 группы.
Если некоторые $L[i]$ становится слишком маленьким, мы всегда можем использовать *хитрость: для каждого i in in 0..k-1удалите i-й символ из каждой строки и создайте работу, ища M-1несоответствия в этих строках длины k-1.

— Bulat
источник

0

I work everyday on inventing and optimizing algos, so if you need every last bit of performance, that is the plan:

Check with * in each position independently, i.e. instead of single job processing n*k string variants - start k independent jobs each checking n strings. You can spread these k jobs among multiple CPU/GPU cores. This is especially important if you are going to check 2+ char diffs. Smaller job size will also improve cache locality, which by itself can make program 10x faster.
If you are going to use hash tables, use your own implementation employing linear probing and ~50% load factor. It's fast and pretty easy to implement. Or use an existing implementation with open addressing. STL hash tables are slow due to use of separate chaining.
You may try to prefilter data using 3-state Bloom filter (distinguishing 0/1/1+ occurrences) as proposed by @AlexReynolds.
For each i from 0 to k-1 run the following job:
- Generate 8-byte structs containing 4-5 byte hash of each string (with * at i-th position) and string index, and then either sort them or build hash table from these records.

For sorting, you may try the following combo:

first pass is MSD radix sort in 64-256 ways employing TLB trick
second pass is MSD radix sort in 256-1024 ways w/o TLB trick (64K ways total)
third pass is insertion sort to fix remaining inconsistencies

— Bulat
источник