Coding Problems

DNA Sequencing Compression

In DNA analysis systems, repeated markers often appear next to each other. Instead of storing every marker individually, we can summarize consecutive identical markers using a compact format.

V
Vikalp Veer

🧬 DNA Sequencing Compression

A DNA sequencing machine represents genetic markers as digits.

Given a DNA marker sequence, the compression system summarizes it by reading consecutive identical markers from left to right.

For each group:

count + marker

For example:

111221

is compressed as:

  • three 1s -> 31
  • two 2s -> 22
  • one 1 -> 11

So the compressed sequence becomes:

312211

Now suppose the first DNA sequence is:

1

After each compression cycle, the output becomes the input for the next cycle.

Given an integer n, return the DNA sequence after n compression cycles.


Example Input: n = 5

Output: 111221

Explanation:

Cycle 1: 1 Cycle 2: 11 Cycle 3: 21 Cycle 4: 1211 Cycle 5: 111221


🔧 Solution

The main idea is to scan the current DNA sequence and group consecutive identical markers.

For example:

111221

We scan it as:

111 | 22 | 1

Then we describe each group:

  • 111 -> 31
  • 22 -> 22
  • 1 -> 11

So the next sequence is:

312211

This is just run-length encoding applied repeatedly.

</> Python Implementation

class Solution: def countAndSay(self, n: int) -> str: sequence = "1" for _ in range(n - 1): sequence = self.compress(sequence) return sequence def compress(self, sequence: str) -> str: # Implementation similar to encoder of RLE algorithm. result = [] i = 0 while i < len(sequence): j = i while j < len(sequence) and sequence[j] == sequence[i]: j += 1 count = j - i marker = sequence[i] result.append(str(count)) result.append(marker) i = j return "".join(result)

Tags

Arraystwo-pointerMediumalgorithm