Loops

stringr


A colleague has produced a file with one DNA sequence on each line. Download the file and load it into R using read.csv(). The file has no header. Name the resulting data frame sequences.

Your colleague wants to calculate the GC content of each DNA sequence (i.e., the percentage of bases that are either G or C) and knows just a little R. They sent you the following code, which calculates the GC content of a single sequence:

library(stringr)

sequence <- "attggc"
Gs <- str_count(sequence, "g")
Cs <- str_count(sequence, "c")
gc_content <- (Gs + Cs) / str_length(sequence) * 100 

This code uses the excellent stringr package for working with the sequence data. You’ll need to install this package before using it.
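One detail worth keeping in mind is that str_count() matches case-sensitively, so the code above counts only lowercase g and c. The following is a minimal sketch, assuming stringr is already installed, of how the count changes with case. The example strings and variable names are made up for illustration and are not taken from your colleague's file.

```r
# install.packages("stringr")  # run once if the package is not installed
library(stringr)

# Hypothetical example sequences, not taken from the data file
lower_seq <- "attggc"
upper_seq <- "ATTGGC"

# str_count() is case-sensitive: the pattern "g" matches only lowercase g
lower_count <- str_count(lower_seq, "g")
upper_count <- str_count(upper_seq, "g")

# Converting to lowercase first makes the count independent of case
fixed_count <- str_count(str_to_lower(upper_seq), "g")

print(c(lower_count, upper_count, fixed_count))
```

If the sequences in the file turn out to be uppercase or mixed case, lowercasing them before counting is one way to keep the calculation correct.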

  1. Convert the last three lines of this code into a function to calculate the GC content of a DNA sequence. Name that function get_gc_content.

  2. Use a for loop and your function to calculate the GC content of each sequence and store the results in a new vector. The function should work on a single sequence at a time and the for loop should repeatedly call the function and store the output.

  3. Use a for loop and your function to calculate the GC content of each sequence and store the results in a new data frame. To do this you’ll need to use an index to loop over the rows of the data frame.

    Fill in the following for loop to complete this exercise:

# pre-allocate the memory with one row for each sequence
gc_contents <- data.frame(gc_content = numeric(nrow(_______)))

# loop over sequences using an index for the row and
# store the output in gc_contents
for (i in 1:nrow(__________)){
  ________[i,] <- get_gc_content(sequences[____])
}
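The pre-allocate-then-fill pattern used in the template can be seen in isolation on a toy example. This is only a sketch with made-up data: the words data frame, its word column, and the word_lengths result are hypothetical names, and base R's nchar() stands in for get_gc_content() so that the exercise itself is left to you.

```r
# Hypothetical toy data frame, standing in for any one-column data frame
words <- data.frame(word = c("loop", "index", "row"), stringsAsFactors = FALSE)

# Pre-allocate the result with one row per input row
word_lengths <- data.frame(n_chars = numeric(nrow(words)))

# seq_len(n) gives 1, 2, ..., n, and an empty sequence when n is 0
for (i in seq_len(nrow(words))) {
  # Compute one value per iteration and store it in row i of the result
  word_lengths[i, ] <- nchar(words$word[i])
}

print(word_lengths)
```

The sketch uses seq_len(nrow(words)) rather than 1:nrow(words) as a design choice: for a data frame with zero rows, 1:0 would still iterate over 1 and 0, whereas seq_len(0) skips the loop entirely.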