1. Explain data import in R language.
Data can be imported in R language through the R Commander GUI. To start R Commander, the user loads the Rcmdr package by typing library(Rcmdr) into the console. There are three ways in which data can be imported:
Users can select the data set in the dialog box or enter the name of the data set, if they know it.
Data can also be entered directly using the R Commander editor via Data -> New Data Set. However, this works well only when the data set is not too large.
Data can also be imported from a URL, from a plain text (ASCII) file, from another statistical package or from the clipboard.
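Plain text can also be imported from the console without the GUI. The following is a minimal sketch using base R's read.table(); the inline text and the names raw_text and scores are made up for illustration and stand in for a real file.

```r
# Hypothetical whitespace-delimited data standing in for a plain text (ASCII) file.
raw_text <- "id score
1 3.5
2 4.0
3 2.5"

# read.table() parses plain text; header = TRUE treats the first row as column names.
scores <- read.table(text = raw_text, header = TRUE)
print(scores)

# The same function accepts a file path or a URL as its first argument.
# The call below is commented out because scores.txt is a hypothetical file.
# scores <- read.table("scores.txt", header = TRUE)
```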
2. Two vectors X and Y are defined as follows: X <- c(3, 2, 4) and Y <- c(1, 2). What will be the output of vector Z, defined as Z <- X*Y?
When the vectors have different lengths, R recycles the elements of the shorter vector until every element of the longer vector has been multiplied. The output of the above code will be Z = (3, 4, 4), along with a warning that the longer object length is not a multiple of the shorter object length.
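The multiplication can be checked directly on the console. This short sketch uses only the vectors from the question:

```r
X <- c(3, 2, 4)
Y <- c(1, 2)

# Y is recycled to c(1, 2, 1). R also emits a warning here because
# length 3 is not a multiple of length 2.
Z <- X * Y
print(Z)  # 3 4 4
```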
3. How are missing values and impossible values represented in R language?
NaN (Not a Number) represents impossible values, whereas NA (Not Available) represents missing values. The best way to answer this question is to mention that deleting missing values is not a good idea, because the probable cause of a missing value could be a problem with data collection, programming or the query. It is better to find the root cause of the missing values and then take the necessary steps to handle them.
4. R language has several packages for solving a particular problem. How do you decide which one is the best to use?
The CRAN package ecosystem has more than 6000 packages. The best way for beginners to answer this question is to mention that they would look for a package that follows good software development principles. The next thing would be to look for user reviews and find out whether other data scientists or analysts have been able to solve a similar problem with it.
5. What is the best way to communicate the results of data analysis using R language?
The best way to communicate the results of data analysis in R language is to combine the data, code and analysis outcome in a single document using knitr, for reproducible research. This helps others to verify the findings, add to them and engage in discussions. Reproducible research makes it easy to redo the experiments with new data and to apply the analysis to a different problem.
6. How many data structures does R language have?
R language has two kinds of data structures:
Homogeneous data structures hold objects of the same type: vector, matrix and array.
Heterogeneous data structures can hold objects of different types: data frame and list.
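The distinction shows up as soon as values of mixed types are combined. A minimal sketch with made-up values (all variable names are illustrative):

```r
# Homogeneous: every element shares one type, so mixed input is coerced.
v <- c(1, "a", TRUE)                  # becomes a character vector
m <- matrix(1:6, nrow = 2)            # integer matrix with 2 rows and 3 columns
a <- array(1:24, dim = c(2, 3, 4))    # integer array with three dimensions

# Heterogeneous: each element or column keeps its own type.
l  <- list(1, "a", TRUE)
df <- data.frame(id = 1:2, name = c("x", "y"))

print(class(v))           # "character"
print(sapply(l, class))   # "numeric" "character" "logical"
print(sapply(df, class))  # one class per column
```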
7. Explain the significance of transpose in R language.
Transpose, t(), is the easiest method for reshaping the data before analysis.
8. What are the with() and by() functions used for?
The with() function evaluates an expression within a given data set, and the by() function applies a function to each level of a factor (or combination of factors).
9. What are the different types of sorting algorithms available in R language?
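Base R's sort() selects its algorithm through the method argument. The following is a minimal sketch, assuming the method values "shell", "quick" and "radix" described in the sort() help page and a made-up numeric vector:

```r
x <- c(7, 3, 10, 1, 5)

# Each call sorts the same vector with a different algorithm.
by_shell <- sort(x, method = "shell")
by_quick <- sort(x, method = "quick")   # intended for numeric input
by_radix <- sort(x, method = "radix")

print(by_radix)  # 1 3 5 7 10
```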
10. What is the best way to use Hadoop and R together for analysis?
HDFS can be used for long-term storage of the data. MapReduce jobs submitted from Oozie, Pig or Hive can be used to encode, improve and sample the data sets from HDFS into R. This supports complex analysis tasks on the subset of data prepared in R.
11. What will be the output of log(-5.8) when executed on the R console?
Executing log(-5.8) on the R console returns NaN (Not a Number) together with a warning that NaNs were produced, because it is not possible to take the log of a negative number.
12. What is the difference between a data frame and a matrix in R?
A data frame can contain heterogeneous inputs, while a matrix cannot. A matrix can store only a single data type, whereas a data frame can hold columns of various data types, such as characters, integers or even other data frames.
13. What are factor variables in R language?
Factor variables are categorical variables that hold either string or numeric values. Factor variables are used in various types of graphics and particularly in statistical modelling, where the correct number of degrees of freedom is assigned to them.
14. What is meant by K-nearest neighbour?
K-Nearest Neighbour is one of the simplest machine learning classification algorithms. It is a supervised learning method based on lazy learning: the function is approximated locally and all computation is deferred until classification.
15. Suppose you want to know all the values in c(1, 3, 5, 7, 10) that are not in c(1, 5, 10, 12, 14). Which built-in function in R can be used to do this? Also, how can this be achieved without using the built-in function?
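One way to approach this is with base R's setdiff(), and without it by negating a membership test with %in%. A minimal sketch using the two vectors from the question (the names a and b are illustrative):

```r
a <- c(1, 3, 5, 7, 10)
b <- c(1, 5, 10, 12, 14)

# Built-in: setdiff() returns the elements of a that are not in b.
with_setdiff <- setdiff(a, b)
print(with_setdiff)     # 3 7

# Without setdiff(): keep the elements of a that fail the membership test.
without_setdiff <- a[!(a %in% b)]
print(without_setdiff)  # 3 7
```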
16. Differentiate between lapply and sapply.
If the programmer wants the result to be a vector or a matrix, sapply is used, whereas if the programmer wants the output to be a list, lapply is used. There is one more function, vapply, which is often chosen over sapply because it lets the programmer specify the output type. The drawback of vapply is that it is more verbose and harder to write.
17. How will you read a .csv file in R language?
The read.csv() function is used to read a .csv file in R language. Below is a simple example:
    filecontent <- read.csv("sample.csv")
    print(filecontent)
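A self-contained variant of this example is sketched below. It assumes a made-up data frame written to a temporary file, so the call can be run without an existing sample.csv; csv_path and the column names are illustrative.

```r
# Write a small, made-up data frame to a temporary file.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(3.5, 4.0, 2.5)),
          csv_path, row.names = FALSE)

# read.csv() expects the file name as a character string.
filecontent <- read.csv(csv_path)
print(filecontent)

unlink(csv_path)  # remove the temporary file
```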
18. What do you understand by element recycling in R?
If an operation is performed on two vectors with different lengths, the elements of the shorter vector are re-used to complete the operation. This is referred to as element recycling.
Example: if A <- c(1, 2, 0, 4) and B <- c(3, 6), then the result of A*B will be (3, 12, 0, 24). Here 3 and 6 of vector B are repeated when computing the result.
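The recycling step can be made explicit with rep_len(), which repeats a vector to a requested length. A brief sketch with the same A and B (B_recycled is an illustrative name):

```r
A <- c(1, 2, 0, 4)
B <- c(3, 6)

# rep_len() spells out the recycling that A * B performs implicitly.
B_recycled <- rep_len(B, length(A))   # 3 6 3 6

implicit <- A * B
explicit <- A * B_recycled
print(implicit)  # 3 12 0 24
print(explicit)  # 3 12 0 24
```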
19. What is the use of sample and subset functions in R programming language?
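As a sketch of the two functions named in the question: sample() draws random elements from a vector, and subset() returns the rows and columns of a data frame that meet a condition. The data frame scores, its columns and the seed are made up for illustration.

```r
set.seed(42)  # fixed seed so the random draw is repeatable

# sample(): draw 3 values from 1:10 without replacement.
picked <- sample(1:10, size = 3)
print(picked)

# subset(): keep rows meeting a condition and only the selected columns.
scores <- data.frame(id = 1:5, score = c(55, 80, 62, 91, 47))
passed <- subset(scores, score >= 60, select = c(id, score))
print(passed)
```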
20. How will you create scatterplot matrices in R language?
A matrix of scatterplots can be produced using pairs(). The pairs() function takes various parameters such as formula, data, subset and labels. The two key parameters required to build a scatterplot matrix are:
Formula: a formula such as ~a+b+c. Each term gives a separate variable in the pairs plot, so the terms should be numeric vectors. It represents the set of variables used in pairs().
Data: the data set from which the variables are taken for building the scatterplot matrix.
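A minimal sketch of such a call, assuming the built-in mtcars data set and three of its numeric columns as the variables (the choice of columns and the title are illustrative):

```r
# mtcars is a data set shipped with R; mpg, disp and hp are numeric columns.
# formula: one term per variable; data: where those variables are looked up.
pairs(~ mpg + disp + hp, data = mtcars,
      main = "Scatterplot matrix of mtcars")
```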