Count Rows Where Value Is 1 or 2 in R
Introduction
This is an R notebook document. It allows you to mix R code, comments and output with nice formatting and is a nice way to record and present R analyses. You can create one of these documents in newer versions of RStudio by going to File > New File > R Notebook.
I'll use this notebook to provide worked answers to all of the exercises we did in the course.
Introduction to R
Exercise 1
- Use R to calculate
  - 31 * 78
  - 697/41
These are just simple mathematical expressions entered on the console.
31 * 78
## [1] 2418
697 / 41
## [1] 17
- Assign the value of 39 to x
- Assign the value of 22 to y
- Make z the value of x-y
- Display z in the console
Here we use the arrow symbols to make the assignments. They can point left or right but they must point from the data towards the variable name in which you want to store it.
You can retrieve the data stored in a variable by just entering the variable name.
39 -> x
y <- 22
x-y -> z
z
## [1] 17
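As a quick check of the point about arrow direction, the sketch below writes the same assignment both ways. The names a and b are purely illustrative and are not part of the exercise.

```r
# Rightward and leftward arrows are equivalent as long as the arrow
# points from the data towards the variable name.
39 -> a   # data on the left, name on the right
b <- 39   # name on the left, data on the right

# Both variables now hold the same value
identical(a, b)
```

Either direction works; most R code you will meet uses the leftward form.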
- Compute the square root of 2345 and do a log2 transformation on the result
Both of these operations require the use of functions. The ideal solution is to nest the two function calls together so that the call to sqrt is inside the arguments to log2.
log2(sqrt(2345))
## [1] 5.597686
Exercise 2
- Create a vector called vec1 containing the numbers 2 5 8 12 16
There is no mathematical relationship between these numbers so we need to manually create the vector using the 'c' function.
c(2,5,8,12,16) -> vec1
- Use x:y notation to make a second vector called vec2 containing the numbers 5 to 9
Because this is an integer series we can take a shortcut and use the colon notation, which makes an integer series between two values.
5:9 -> vec2
- Subtract vec2 from vec1 and look at the result
vec1 - vec2
## [1] -3 -1 1 4 7
Because we have the same number of values in both vec1 and vec2 the equivalent positions will be paired up and we will see the subtraction results from the same positions, i.e. 2-5, 5-6, 8-7 etc.
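To make the position-by-position pairing concrete, here is a small sketch that rebuilds the result by hand and compares it with the vectorised subtraction. The names paired and by.hand are illustrative only.

```r
vec1 <- c(2, 5, 8, 12, 16)
vec2 <- 5:9

# Subtraction is applied position by position
paired <- vec1 - vec2

# The same result built manually, one position at a time
by.hand <- c(2 - 5, 5 - 6, 8 - 7, 12 - 8, 16 - 9)

# TRUE if every position matches
all(paired == by.hand)
```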
- Use seq() to make a vector of 100 values starting at 2 and increasing by 3 each time.
We need to supply the from, by and length.out parameters to seq to create this series. As we're using this in the next exercise we will save it into a new variable.
seq(from=2,by=3,length.out=100) -> number.series
number.series
##   [1]   2   5   8  11  14  17  20  23  26  29  32  35  38  41  44  47  50
##  [18]  53  56  59  62  65  68  71  74  77  80  83  86  89  92  95  98 101
##  [35] 104 107 110 113 116 119 122 125 128 131 134 137 140 143 146 149 152
##  [52] 155 158 161 164 167 170 173 176 179 182 185 188 191 194 197 200 203
##  [69] 206 209 212 215 218 221 224 227 230 233 236 239 242 245 248 251 254
##  [86] 257 260 263 266 269 272 275 278 281 284 287 290 293 296 299
- Extract the values at positions 5,10,15 and 20 in the vector of values you just created.
- Extract the values at positions 10 to 30
Both of these require making a selection in the vector using the [ ] notation. Inside the square brackets you put a vector of index positions, so the problem here is to create the vector of index positions.
The first one is small enough that you could make it manually with c(), or as it's a series you could use seq() to make it. It doesn't matter which.
The second one is an integer series so we can use the 10:30 notation to make this quickly.
number.series[c(5,10,15,20)]
## [1] 14 29 44 59
number.series[seq(from=5,to=20,by=5)]
## [1] 14 29 44 59
number.series[10:30]
## [1] 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89
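The sketch below checks that the two ways of building the index vector select the same values. manual.index and series.index are illustrative names, not part of the exercise.

```r
number.series <- seq(from = 2, by = 3, length.out = 100)

# Two ways of building the same vector of index positions
manual.index <- c(5, 10, 15, 20)
series.index <- seq(from = 5, to = 20, by = 5)

# Both selections return the same values
all(number.series[manual.index] == number.series[series.index])

# A colon series selects one value per position, so 10:30 gives 21 values
length(number.series[10:30])
```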
Exercise 3
- Enter a list of colours (supplied in the question) into a vector called mouse.colour
We will use the c() function to manually make this vector. Because the data we are entering is text we need to surround each name with quotes so that R doesn't try to treat it as a variable name.
c("purple","red","yellow","brown") -> mouse.colour
- Display the second element in the vector.
This is a simple selection with a single index position.
mouse.colour[2]
## [1] "red"
- Enter some numerical weight data (supplied in the question) into a vector called mouse.weight
c(23,21,18,26) -> mouse.weight
- Join the two vectors into a data frame called mouse.info containing 2 columns and 4 rows. Call the first column colour and the second one weight.
We will use the data.frame() function to do this. We don't need to define the number of rows / columns as these are defined by the data we use to create the data frame. The number of columns is the number of vectors we provide and the number of rows is the number of data points in each of those vectors.
If we pass the vectors to the data.frame() function as name=vector pairs then we can set the names for the columns when we create the data frame.
Remember that data.frame() will not save the result - you still need to use an arrow at the end to give it a name. You will see that the notebook formats the data frame we get as a nice looking table.
data.frame(colour=mouse.colour, weight=mouse.weight) -> mouse.info
mouse.info
##   colour weight
## 1 purple     23
## 2    red     21
## 3 yellow     18
## 4  brown     26
- Display just row 3
- Display just column 1
- Display row 4 column 1
All of these use 2 dimensional selections on the data frame. You again use the [ ] notation but this time you need to pass two vectors, the first for the rows and the second for the columns.
When we are extracting a single column we can also use the [[ ]] notation, or ideally use dataframe$column notation as we have a named column.
mouse.info[3,]
##   colour weight
## 3 yellow     18
mouse.info[,1]
## [1] purple red    yellow brown
## Levels: brown purple red yellow
mouse.info[[1]]
## [1] purple red    yellow brown
## Levels: brown purple red yellow
mouse.info$colour
## [1] purple red    yellow brown
## Levels: brown purple red yellow
mouse.info[4,1]
## [1] brown
## Levels: brown purple red yellow
The reason you see the text saying "Levels:…" after each line is because R converts character vectors to factor vectors when you create a data frame. A factor vector stores all of the different values it can hold as a separate set, and this is what is shown in the levels value.
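Whether this conversion happens by default depends on your R version: from R 4.0.0 onwards data.frame() keeps text as character data unless you ask otherwise. The sketch below sets the stringsAsFactors argument explicitly so that it behaves the same everywhere. The names as.factors and as.text are illustrative only.

```r
colours <- c("purple", "red", "yellow", "brown")

# Ask explicitly for the text-to-factor conversion
as.factors <- data.frame(colour = colours, stringsAsFactors = TRUE)

# Keep the text as plain character data instead
as.text <- data.frame(colour = colours, stringsAsFactors = FALSE)

class(as.factors$colour)    # factor
levels(as.factors$colour)   # the distinct values held by the factor
class(as.text$colour)       # character

# A factor can be turned back into plain text when needed
as.character(as.factors$colour)
```

If your own output shows no "Levels:" line, the most likely explanation is that you are running a newer version of R.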
Exercise 4
- Set your working directory to where your data is stored
We could do this through the RStudio interface using Session -> Set Working Directory -> Choose Directory but in this script I will have to call setwd() directly. Because of the way notebooks work I'll have to re-do this in every section which loads data.
setwd("O:/Training/Introduction to R/R_intro_data_files")
- Read the file "small_file.txt" into a new structure
This is a tab delimited file and so we can use read.delim for this.
setwd("O:/Training/Introduction to R/R_intro_data_files")
read.delim("small_file.txt") -> small.file
small.file
##    Sample Length Class
## 1     x_1     45     A
## 2     x_2     82     B
## 3     x_3     81     C
## 4     x_4     56     D
## 5     x_5     96     A
## 6     x_6     85     B
## 7     x_7     65     C
## 8     x_8     96     D
## 9     x_9     60     A
## 10   x_10     62     B
## 11   x_11     80     C
## 12   x_12     63     D
## 13   x_13     50     A
## 14    y_1     64     B
## 15    y_2     43     C
## 16    y_3     98     D
## 17    y_4     78     A
## 18    y_5     53     B
## 19    y_6    100     C
## 20    y_7     79     D
## 21    y_8     84     A
## 22    y_9     68     B
## 23   y_10     99     C
## 24   y_11     65     D
## 25   y_12     55     A
## 26   y_13     98     B
## 27    z_1     56     C
## 28    z_2     83     D
## 29    z_3     81     A
## 30    z_4     69     B
## 31    z_5     50     C
## 32    z_6     72     D
## 33    z_7     54     A
## 34    z_8     56     B
## 35    z_9     87     C
## 36   z_10     84     D
## 37   z_11     80     A
## 38   z_12     68     B
## 39   z_13     95     C
## 40   z_14     93     D
- Read the file "Child_Variants.csv" into a new data structure.
setwd("O:/Training/Introduction to R/R_intro_data_files")
read.csv("Child_Variants.csv") -> child.variants
head(child.variants)
##   CHR    POS      dbSNP REF ALT QUAL FILTER   GENE                  HGVS
## 1   1  69270          .   A   G   16   PASS  OR4F5         c.180A>G(p.=)
## 2   1  69511 rs75062661   A   G  200   PASS  OR4F5  c.421A>G_p.Thr141Ala
## 3   1  69761          .   A   T  200   PASS  OR4F5  c.671A>T_p.Asp224Val
## 4   1  69897 rs75758884   T   C   59   PASS  OR4F5         c.807T>C(p.=)
## 5   1 877831  rs6672356   T   C  200   PASS SAMD11 c.1027T>C_p.Trp343Arg
## 6   1 881627  rs2272757   G   A  200   PASS  NOC2L        c.1843C>T(p.=)
##    EXON            ENST MutantReads COVERAGE MutantReadPercent CLASS PTV
## 1   1/1 ENST00000335137           3        4                75    SC   0
## 2   1/1 ENST00000335137          24       27                88   NSC   0
## 3   1/1 ENST00000335137           8        8               100   NSC   0
## 4   1/1 ENST00000335137           3        3               100    SC   0
## 5 10/14 ENST00000342066          10       11                90   NSC   0
## 6 16/19 ENST00000327044          52       56                92    SC   0
##   SAMPLE  FAMILY
## 1 D88849 COG0215
## 2 D88849 COG0215
## 3 D88849 COG0215
## 4 D88849 COG0215
## 5 D88849 COG0215
## 6 D88849 COG0215
- Display row 11
This is a simple selection. Because it's a row we're selecting there is no shortcut so we need to specify both rows and columns (even though the column selection is blank).
child.variants[11,]
##    CHR    POS      dbSNP REF ALT QUAL FILTER  GENE       HGVS EXON
## 11   1 889159 rs13302945   A   C  200   PASS NOC2L c.888+3T>G    .
##               ENST MutantReads COVERAGE MutantReadPercent CLASS PTV SAMPLE
## 11 ENST00000327044          26       29                89    SS   0 D88849
##     FAMILY
## 11 COG0215
- Calculate the mean of the column named MutantReadPercent.
Think of this in two steps. First we need to get at the data in that column. We can do that using $ as we know the name of the column. Then we can pass that into the mean function.
mean(child.variants$MutantReadPercent)
## [1] 59.88219
Exercise 5
- Find out how many rows in small_file have a length which is < 65.
For all filtering think of the problem in a structured way.
- Extract the vector of values you want to filter against (in this case small.file$Length)
- Apply the logical test (in this case < 65)
- Either use the logical vector produced to filter the data using a selection, or use sum() to get the count of the hits.
small.file$Length
##  [1]  45  82  81  56  96  85  65  96  60  62  80  63  50  64  43  98  78
## [18]  53 100  79  84  68  99  65  55  98  56  83  81  69  50  72  54  56
## [35]  87  84  80  68  95  93
small.file$Length < 65
##  [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [12]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [34]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
sum(small.file$Length < 65)
## [1] 14
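The three steps can be tried without the data file by typing in the Length values printed above. This is only a sketch: length.values and is.short are illustrative names rather than objects from the course.

```r
# Step 1: the vector to filter against (values copied from the output above)
length.values <- c(45, 82, 81, 56, 96, 85, 65, 96, 60, 62, 80, 63, 50, 64,
                   43, 98, 78, 53, 100, 79, 84, 68, 99, 65, 55, 98, 56, 83,
                   81, 69, 50, 72, 54, 56, 87, 84, 80, 68, 95, 93)

# Step 2: the logical test gives one TRUE/FALSE per value
is.short <- length.values < 65

# Step 3a: sum() counts the hits because TRUE is treated as 1
sum(is.short)

# Step 3b: or use the logical vector as a selection and count what is left
length(length.values[is.short])
```

Both routes give the same count of 14, so choose whichever reads more clearly for the question being asked.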
- Create a filtered version of the child.variants data which includes only rows where MutantReadPercent >= 70.
This is the same 3 step problem.
- Extract the MutantReadPercent column using the $ notation
- Do the logical test on this vector (>=70)
- Use this logical vector to define which rows to keep in a standard 2 dimension selection on the original data frame.
To keep the output from the first two commands a sensible length I'm going to put them inside a call to the head() function. I'm increasing the number of values shown from the default (which is only 6) so you get a better feeling for what the data looks like.
head(child.variants$MutantReadPercent, n=100)
##   [1]  75  88 100 100  90  92  97  95  80  89  89  81  33  90 100  90 100
##  [18] 100 100  89  82 100  90 100 100  92  93  90  85  93  94  57  31  90
##  [35]  34  51  39  44  42  56  90  26  28  93  93  95  89  77 100  54  37
##  [52]  50  38  30  42  76  47  42  27  50  45  22  35  36  38  45  34  47
##  [69]  54  59  28  91  22  66  57 100  35  34  46 100  26  60 100  44  33
##  [86]  82  25  21  83 100  85  36  84  75  52  94  88  78  83  86
head(child.variants$MutantReadPercent >= 70, n=100)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [12]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
##  [34]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [45]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
##  [78] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
##  [89]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [100]  TRUE
child.variants[child.variants$MutantReadPercent >= 70,] -> child.variants.filtered
nrow(child.variants)
## [1] 25822
nrow(child.variants.filtered)
## [1] 9976
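The same row filter can be tried on a tiny data frame. The sketch below uses made-up values; toy, keep and toy.filtered are hypothetical names that stand in for the course data.

```r
# Hypothetical data standing in for child.variants
toy <- data.frame(
  sample  = c("s1", "s2", "s3", "s4"),
  percent = c(75, 33, 100, 57)
)

# Logical vector with one entry per row
keep <- toy$percent >= 70

# Rows go before the comma; leaving the column part blank keeps every column
toy.filtered <- toy[keep, ]

nrow(toy)            # rows before filtering
nrow(toy.filtered)   # rows that passed the test
```

Forgetting the comma is a common slip: toy[keep] would try to select columns rather than rows.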
- From the filtered data frame we want to find the frequency with which G, A, T and C bases were mutated.
This filtering is going to use the REF column in the child.variants.filtered dataset. We're going to need to do an equality (==) test to compare this to G, A, T and C. In this instance we only care about how many of the values are true so we can use sum() on the logical vector to get that value.
We can start with G.
head(child.variants.filtered$REF, n=100)
##   [1] A A A T T G A T T G A G G A C G T T A A C T A T G A G T T C A A C C G
##  [36] T T G T C A C T C T A T G T T A C T G T C T T C A C T T T G A A A C C
##  [71] C C G A G C A A T G T C A C T T T G C G G C G C C G A T A G
## 268 Levels: A AAAAAC AAAAT AAAATAAATAAATAAATAAATAAAT AAAG ... TTTTTTTGTC
head(child.variants.filtered$REF == "G", n=100)
##   [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
##  [12]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
##  [78] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [89] FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
## [100]  TRUE
sum(child.variants.filtered$REF == "G")
## [1] 2347
We can now repeat this with A, T and C. We can also store the values in a vector by putting the code inside the c() function. We'll also name the slots in the vector with the letter they represent.
c(
  sum(child.variants.filtered$REF == "G"),
  sum(child.variants.filtered$REF == "A"),
  sum(child.variants.filtered$REF == "T"),
  sum(child.variants.filtered$REF == "C")
) -> mutation.counts
names(mutation.counts) <- c('G','A','T','C')
mutation.counts
##    G    A    T    C
## 2347 2584 2616 2288
In the advanced course we will show you how you could have achieved this in a single operation rather than using multiple filters by using the tapply function.
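The tapply approach is not worked through in these answers. As a rough sketch of the idea, assuming a small made-up vector of bases (ref.bases is hypothetical and is not the course data), counting every category in one step might look like the following. table() is included as a common base R alternative that gives the same counts.

```r
# Hypothetical reference bases standing in for child.variants.filtered$REF
ref.bases <- c("A", "A", "A", "T", "T", "G", "A", "T", "T", "G")

# Group the vector by its own values and count the members of each group
counts.tapply <- tapply(ref.bases, ref.bases, length)

# table() produces the same counts directly
counts.table <- table(ref.bases)

# Pull out the categories of interest by name
counts.tapply[c("G", "A", "T")]
```

Only categories that actually occur in the vector appear in the result, which is why "C" is not requested in this small example.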
Exercise 6
- In the original child variants dataset draw a histogram of the MutantReadPercent column, try increasing the number of categories (breaks) to 50.
For this we pass the vector of MutantReadPercent values to the hist() function, which draws histograms.
hist(child.variants$MutantReadPercent,breaks=50)
- Plot a boxplot of the MutantReadPercent values from both the original child variants and the same column from the filtered dataset you made in Exercise 5 (MutantReadPercent>=70). Check that the distributions look different.
The boxplot function can take multiple vectors as input so we can just pass the two vectors separately. If we do this, however, the datasets will not be named (since the function doesn't know what they are called). To fix this we can explicitly set the names using the 'names=' parameter to the function. Alternatively we could put the two vectors into a list, which would allow us to set the slot names for the list, and these will be picked up by boxplot automatically. Here we'll do it both ways so you can see the options for how to do this.
boxplot(child.variants$MutantReadPercent, child.variants.filtered$MutantReadPercent, names=c("Original","Filtered"))
boxplot(
  list(
    Original=child.variants$MutantReadPercent,
    Filtered=child.variants.filtered$MutantReadPercent
  )
)
- Plot the results of the vector created in Exercise 5 ('mutation.counts') as a barplot. Use the names.arg parameter to show which mutation is which and use the rainbow() function to give the bars different colours
In our case, because we put slot names onto our mutation.counts vector we don't need to use names.arg as barplot will use the slot names by default. We'll change the names later to show how you would do this anyway.
The rainbow() function needs to know how many colours to generate. We could just hard-code this since we know there are 4 values, but it's generally nicer to calculate the value we need from the data. That way if we later want to change the number of mutations we counted the code will still work.
barplot(mutation.counts, col=rainbow(length(mutation.counts)))
If we did want to explicitly set the names then we could do that.
barplot(
  mutation.counts,
  col=rainbow(length(mutation.counts)),
  names.arg=c("Guanine","Adenine","Thymine","Cytosine")
)
Exercise 7
- Read in the file 'neutrophils.csv'. This is a comma-delimited file so use read.csv().
setwd("O:/Training/Introduction to R/R_intro_data_files")
read.csv("neutrophils.csv") -> neutrophils
head(neutrophils)
##        DMSO   TGX.221    PI103     Akt1
## 1 144.43930  99.61073 41.95241 111.8013
## 2 135.71670 115.35760 57.46430 124.1805
## 3  57.88828 106.44840 41.01954 126.7738
## 4  66.71269 115.89830 63.12587 130.9577
## 5  73.36981  75.96729       NA  88.6273
## 6  83.43180        NA       NA 147.8813
- Create a boxplot of the 4 samples and put a suitable title on the plot.
Since our data set contains only the data we want to plot, and boxplot can take a list (which a data frame is) as input, we can just pass the whole thing to the boxplot function.
boxplot(neutrophils, main="Range of values for different samples")
- Use the colMeans() function to calculate the mean of each dataset. Plot the means as a barplot
If we try this on the data as it stands then we won't get the answer we want, since there are missing (NA) values in our dataset.
colMeans(neutrophils)
##     DMSO  TGX.221    PI103     Akt1
## 100.8013       NA       NA       NA
To get colMeans to ignore the NA values when calculating the mean we can pass the na.rm=TRUE option.
colMeans(neutrophils, na.rm = TRUE) -> mean.values
mean.values
##      DMSO   TGX.221     PI103      Akt1
## 100.80126 102.65646  50.89053 103.91649
Finally we can plot these.
barplot(mean.values)
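The effect of na.rm is easy to see on a tiny data frame. The values below are made up for illustration and toy is a hypothetical name.

```r
# Hypothetical measurements with one missing value in column b
toy <- data.frame(a = c(1, 2, 3), b = c(4, NA, 6))

# A single NA makes the mean of the whole column NA
colMeans(toy)

# With na.rm = TRUE the missing value is dropped before averaging,
# so column b is averaged over its two remaining values
colMeans(toy, na.rm = TRUE)
```

Note that the columns are then averaged over different numbers of values, which is worth bearing in mind when comparing the bars.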
- Read in the 'brain_bodyweight.txt' file. The first column contains species names, not data, so use row.names=1 to set these up correctly for your data frame.
This is a tab-delimited text file, so we're using read.delim to read this in. In addition to the row.names parameter I'm also going to set check.names=FALSE so R doesn't modify the column names (to remove spaces).
setwd("O:/Training/Introduction to R/R_intro_data_files")
read.delim("brain_bodyweight.txt", row.names=1, check.names = FALSE) -> brain.bodyweight
head(brain.bodyweight)
##                Body weight (kg) Brain weight (g)
## Cow                      465.00            423.0
## Grey Wolf                 36.33            119.5
## Goat                      27.66            115.0
## Guinea Pig                 1.04              5.5
## Diplodocus             11700.00             50.0
## Asian Elephant          2547.00           4603.0
You can see that the species names have been correctly set as row names and that the column names are intact.
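To see what check.names changes without needing the file, the sketch below reads two of the rows shown above from a text string. It assumes the first column header is "Species", which is a guess since the real header is not shown in the output; raw.text, kept and altered are illustrative names.

```r
# Tab-delimited text in the same layout as the file
# (the "Species" header is an assumption)
raw.text <- "Species\tBody weight (kg)\tBrain weight (g)\nCow\t465\t423\nGoat\t27.66\t115\n"

# Column names left exactly as written
kept <- read.delim(text = raw.text, row.names = 1, check.names = FALSE)

# Column names converted into syntactically valid R names
altered <- read.delim(text = raw.text, row.names = 1, check.names = TRUE)

colnames(kept)      # spaces and brackets preserved
colnames(altered)   # spaces and brackets replaced with dots
rownames(kept)      # species names taken from the first column
```

The trade-off is that names containing spaces cannot be used directly after $ and need backticks, for example kept$`Body weight (kg)`.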
- Log transform the data (base 2)
Since all of the data frame is numeric data we can simply pass the whole structure to the log2 function. Since we're not going to use the untransformed data again we'll overwrite the original data frame.
log2(brain.bodyweight) -> brain.bodyweight
head(brain.bodyweight)
##                Body weight (kg) Brain weight (g)
## Cow                  8.86108691         8.724514
## Grey Wolf            5.18308946         6.900867
## Goat                 4.78972925         6.845490
## Guinea Pig           0.05658353         2.459432
## Diplodocus          13.51422091         5.643856
## Asian Elephant      11.31458324        12.168359
- Make a scatterplot with default parameters with the log transformed data.
We create a scatterplot with the plot() function. We need to pass the two vectors as the x and y values. Since they are in the correct order in our data frame we can, in this case, just pass the whole data frame. This also has the advantage of setting the correct labels for the axes as they will be taken from the column names.
plot(brain.bodyweight)
- Create the same scatterplot but experiment with some parameters.
plot(
  brain.bodyweight,
  pch=19,
  col="red",
  cex=1.5,
  main="Brainweight vs Bodyweight",
  xlab="Bodyweight (log2 kg)",
  ylab="Brainweight (log2 g)"
)
- Read in the file chr_data.txt
setwd("O:/Training/Introduction to R/R_intro_data_files")
read.delim("chr_data.txt") -> chr.data
head(chr.data)
##   chr position GM06990_ABL1 GM06990_MLLT3
## 1   8        1  -0.04661052    -0.1816292
## 2   8   100001   0.25605089     0.4625507
## 3   8   200001   0.25605089    -0.1816292
## 4   8   300001   0.86137348    -0.5037192
## 5   8   400001   0.86137348    -0.5037192
## 6   8   500001   0.86137348    -0.5037192
- Remove the first column
This will involve a selection for the columns we want to keep, then overwriting the original data with the selected subset. We could simply do a selection for columns 2, 3 and 4.
head(chr.data[,2:4])
##   position GM06990_ABL1 GM06990_MLLT3
## 1        1  -0.04661052    -0.1816292
## 2   100001   0.25605089     0.4625507
## 3   200001   0.25605089    -0.1816292
## 4   300001   0.86137348    -0.5037192
## 5   400001   0.86137348    -0.5037192
## 6   500001   0.86137348    -0.5037192
We could also use a little trick, which is that if you use negative values for the index positions, you can say which positions you don't want, instead of which ones you do want.
chr.data[,-1] -> chr.data
head(chr.data)
##   position GM06990_ABL1 GM06990_MLLT3
## 1        1  -0.04661052    -0.1816292
## 2   100001   0.25605089     0.4625507
## 3   200001   0.25605089    -0.1816292
## 4   300001   0.86137348    -0.5037192
## 5   400001   0.86137348    -0.5037192
## 6   500001   0.86137348    -0.5037192
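The two ways of dropping the first column can be compared directly on a small made-up data frame. toy, keep.by.position and drop.by.position are hypothetical names used only for this sketch.

```r
# Hypothetical data with the same kind of layout as chr.data
toy <- data.frame(
  chr      = c(8, 8, 8),
  position = c(1, 100001, 200001),
  score    = c(-0.05, 0.26, 0.26)
)

# Say which columns to keep
keep.by.position <- toy[, 2:3]

# Say which column to drop
drop.by.position <- toy[, -1]

# Both selections give exactly the same data frame
identical(keep.by.position, drop.by.position)
colnames(drop.by.position)
```

The negative form is handy when there are many columns, because it does not need to know how many columns the data frame has.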
- Draw a line graph of the position against the ABL1 dataset
For this we can use the plot() function with type="l" (that's the letter l rather than the number 1).
plot(
  chr.data$position,
  chr.data$GM06990_ABL1,
  type="l",
  col="blue",
  xlab="Chromosome Position (bp)",
  ylab="z-score",
  main="Z scores vs Genomic Position",
  lwd=2
)
legend(
  "topright",
  "ABL1",
  fill="blue"
)
Source: https://www.bioinformatics.babraham.ac.uk/training/Introduction_to_R/r_intro_answers.html