What is subset in R
subset : Subsetting Vectors, Matrices and Data Frames
Description
Return subsets of vectors, matrices or data frames which meet conditions.
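Usage
subset(x, ...)
## Default S3 method:
subset(x, subset, ...)
## S3 method for class 'matrix'
subset(x, subset, select, drop = FALSE, ...)
## S3 method for class 'data.frame'
subset(x, subset, select, drop = FALSE, ...)
Arguments
x: object to be subsetted.
subset: logical expression indicating elements or rows to keep: missing values are taken as false.
select: expression, indicating columns to select from a data frame.
drop: passed on to [ indexing operator.
...: further arguments to be passed to or from other methods.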
Value
An object similar to x containing just the selected elements (for a vector), rows and columns (for a matrix or data frame), and so on.
Warning
Details
This is a generic function, with methods supplied for matrices, data frames and vectors (including lists). Packages and users can add further methods.
For data frames, the subset argument works on the rows. Note that subset will be evaluated in the data frame, so columns can be referred to (by name) as variables in the expression (see the examples).
The select argument exists only for the methods for data frames and matrices. It works by first replacing column names in the selection expression with the corresponding column numbers in the data frame and then using the resulting integer vector to index the columns. This allows the use of the standard indexing conventions so that for example ranges of columns can be specified easily, or single columns can be dropped (see the examples).
The drop argument is passed on to the indexing method for matrices and data frames: note that the default for matrices is different from that for indexing.
Factors may have empty levels after subsetting; unused levels are not automatically removed. See droplevels for a way to drop all unused levels from a data frame.
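The Details above refer to examples that are not reproduced here. The following is a minimal sketch using the airquality data set that ships with base R; the object names (hot, no_temp, range_cols, high_ozone, df, kept) are illustrative only.

```r
# Rows where Temp exceeds 90, keeping two columns referred to by name
hot <- subset(airquality, Temp > 90, select = c(Ozone, Temp))

# select follows the usual indexing conventions: drop one column, or take a range
no_temp    <- subset(airquality, Day == 1, select = -Temp)
range_cols <- subset(airquality, Day == 1, select = Ozone:Wind)

# Missing values in the condition are taken as FALSE, so no NA rows come back
high_ozone <- subset(airquality, Ozone > 100)

# Unused factor levels are kept until droplevels() is called
df   <- data.frame(g = factor(c("a", "b", "c")), v = 1:3)
kept <- subset(df, g != "c")
nlevels(kept$g)              # 3
nlevels(droplevels(kept)$g)  # 2
```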
Subset : Subset the Values of One or More Variables
Description
Based directly on the standard R subset function, Subset includes or excludes specified rows and columns of data. The output provides feedback and guidance regarding the specified subset operations. Rows of data may also be extracted at random, and code is provided to create a hold-out validation sample. The hold-out sample is created from the original data frame, usually named mydata, so either assign the subset to a data frame with a new name or re-read the data before constructing the hold-out sample. Any existing variable labels are retained in the subset data frame.
Usage
Arguments
rows: Specify the rows, i.e., observations, to be included or deleted, such as with a logical expression or by direct specification of the numbers of the corresponding rows of data.
columns: Specify the columns, i.e., variables, to be included or deleted.
data: The name of the data frame from which to create the subset, which is mydata by default.
holdout: Create a hold out sample for validation if rows is a proportion or an integer to indicate random extraction of rows of data.
random: If an integer or proportion, specifies the number of rows of data to randomly extract.
Value
The subset of the data frame is returned, usually assigned the name of mydata as in the examples below. This is the default name for the data frame input into the lessR data analysis functions.
Details
Subset creates a subset data frame based on one or more rows of data and one or more variables in the input data frame, and lists the first five rows of the revised data frame. Guidance and feedback regarding the subsets are provided by default. The first five lines of the input data frame are listed before the subset operation, followed by the first five lines of the output data frame.
To indicate retaining an observation, specify at least one variable name and the value of the variable for which to retain the corresponding observations, using two equal signs to indicate the logical equality. If no rows are specified, all rows are retained. Use the row.names function to identify rows by their row names, as illustrated in the examples below.
Subsetting Data
R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.
Selecting (Keeping) Variables
# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
Excluding (Dropping) Variables
# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]
Selecting Observations
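# first 5 observations
newdata <- mydata[1:5, ]
# based on variable values (assumes a numeric column named age)
newdata <- mydata[which(mydata$age > 65), ]
# or
attach(mydata)
newdata <- mydata[which(age > 65), ]
detach(mydata)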
Selection using the Subset Function
The subset( ) function is the easiest way to select variables and observations. In the following example, we select all rows that have a value of age greater than or equal to 20 or age less than 10. We keep the ID and Weight columns.
# using subset function
newdata <- subset(mydata, age >= 20 | age < 10,
select=c(ID, Weight))
In the next example, we select all men over the age of 25 and we keep variables weight through income (weight, income and all columns between them).
# using subset function (part 2)
# assumes a column named sex in which men are coded "m"
newdata <- subset(mydata, sex == "m" & age > 25,
select=weight:income)
Random Samples
Use the sample( ) function to take a random sample of size n from a dataset.
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE), ]
Subset in R
Subsetting data consists of extracting part of the original data, typically the elements that meet some condition. In this tutorial you will learn in detail how to make a subset in R in the most common scenarios, explained with several examples.
How to subset data in R?
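Subsetting data in R can be achieved in different ways, depending on the data you are working with. In general, you can subset with square brackets, with the dollar sign when the elements are named, or with functions such as subset.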
Single and double square brackets in R
Before going through each case, it is worth mentioning the difference between single and double square brackets when subsetting data in R, so that it does not have to be repeated for every case. Suppose you have a named numeric vector.
As explained in more detail in the corresponding section, you can access the first element of the vector with either single or double square brackets by specifying the index of the element.
The difference is that single square brackets will maintain the original input structure but the double will simplify it as much as possible. This can be verified with the following example:
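A minimal sketch of that comparison, assuming a small named vector x made up for illustration:

```r
# Illustrative named numeric vector
x <- c(one = 1, two = 2, three = 3)

single <- x[1]    # keeps the structure: a named numeric vector of length 1
double <- x[[1]]  # simplifies: a bare number with the name dropped

names(single)  # "one"
names(double)  # NULL
```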
Another interesting difference appears when you try to access observations out of the bounds of the vector. In this case, single square brackets return an NA value, whereas double brackets raise an error.
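The out-of-bounds behaviour can be checked with a short sketch like the one below. The vector x is illustrative, and tryCatch is used only so that the error does not stop the script.

```r
x <- c(one = 1, two = 2, three = 3)

# Single brackets: an index past the end yields NA
out_single <- x[5]

# Double brackets: the same index raises a "subscript out of bounds" error
out_double <- tryCatch(x[[5]], error = function(e) "error")
```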
However, sometimes it is not possible to use double brackets, for example in several cases when working with data frames and matrices, as pointed out in the corresponding sections.
Note that when a subset returns no observations, the condition you specified is never met.
Subset function in R
The subset function allows conditional subsetting in R for vector-like objects, matrices and data frames.
In the following sections we will use both this function and the bracket operators in most of the examples. Note that this function allows you to subset by one or multiple conditions.
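As a quick sketch of subset with one and with several conditions, assuming an illustrative numeric vector v that contains a missing value:

```r
v <- c(3, 8, NA, 12, 5)

one_cond <- subset(v, v > 4)           # 8 12 5 (the NA is dropped)
two_cond <- subset(v, v > 4 & v < 10)  # 8 5

# For comparison, plain bracket indexing keeps the NA
bracket <- v[v > 4]                    # 8 NA 12 5
```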
Subset vector in R
A variable stored in a vector can be subset in several ways: by position, by excluding positions with negative indices, with a logical vector, or with a condition on its values.
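The common ways of subsetting a vector can be sketched as follows; my_vector and the result names are illustrative.

```r
my_vector <- c(15, 21, 17, 33, 12)

second   <- my_vector[2]                 # by position: 21
some     <- my_vector[c(1, 3)]           # several positions: 15 17
no_first <- my_vector[-1]                # negative index drops an element
middle   <- my_vector[2:4]               # a range of positions
by_cond  <- my_vector[my_vector > 16]    # logical condition: 21 17 33
by_mask  <- my_vector[c(TRUE, FALSE, TRUE, FALSE, TRUE)]  # 15 17 12
pos      <- which(my_vector > 16)        # positions meeting the condition: 2 3 4
```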
my_vector[] is useful when you want to assign the same value to all the elements of an already created vector. As an example, my_vector[] <- 1 will replace all the values of the vector with 1, but my_vector <- 1 will overwrite the vector with the single number 1.
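The difference between the two assignments can be seen in this small sketch (variable names are illustrative):

```r
my_vector <- c(15, 21, 17)

filled <- my_vector
filled[] <- 1        # every element replaced, length preserved: 1 1 1

overwritten <- my_vector
overwritten <- 1     # the whole object replaced by a single number
```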
In addition, if your vector is named, you can use all of the previous methods and also select elements by name, passing the names as character strings.
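A brief sketch of subsetting by name, using an illustrative named vector:

```r
named <- c(one = 1, two = 2, three = 3)

by_name  <- named["two"]              # named vector of length 1
by_names <- named[c("one", "three")]  # several names at once
value    <- named[["two"]]            # bare value without the name
```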
Note that vectors can be of any data type.
Subsetting a list in R
Consider the following sample list:
You can use single or double brackets to access the elements and the subelements of the list.
If the list has names, you can access its elements by specifying the element name or with the dollar sign.
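The sketch below illustrates these options on a made-up list; the element names x, y and z are assumptions chosen for the example.

```r
my_list <- list(x = c(45, 12, 56),
                y = c("Car", "Bike"),
                z = list(a = 1, b = "two"))

as_list   <- my_list[1]       # single brackets: a list of length 1
as_vector <- my_list[[1]]     # double brackets: the numeric vector itself
element   <- my_list[[1]][2]  # a subelement: 12

by_dollar <- my_list$y        # access by name with the dollar sign
by_string <- my_list[["y"]]   # access by name as a character string
nested    <- my_list$z$b      # "two"
```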
In addition, it is also possible to apply logical subsetting to lists. For example, you could replace the first element of the list with a subset of it in the following way:
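One possible way to do this, assuming an illustrative list whose first element is numeric:

```r
my_list <- list(x = c(45, 12, 56, 14, 16), y = c("Car", "Bike"))

# Keep only the values of the first element that are greater than 15
my_list[[1]] <- my_list[[1]][my_list[[1]] > 15]

my_list$x  # 45 56 16
```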
Subset R data frame
Subsetting a data frame consists of obtaining some rows or columns of the full data frame, or those that meet one or several conditions. It is very common to subset a data frame in R for analysis purposes. Consider, for instance, the following sample data frame:
Columns subset in R
You can subset a column in R in different ways:
The following block of code shows some examples:
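A sketch of the main options follows. The data frame df and its columns x, y, z and w are made up for illustration; z and w are chosen to match the column names used later in the text.

```r
df <- data.frame(x = 1:6,
                 y = letters[1:6],
                 z = c(2, 4, 6, 8, 10, 12),
                 w = rep(c("Group 1", "Group 2"), each = 3),
                 stringsAsFactors = FALSE)

# These return the column as a vector
v1 <- df$x
v2 <- df[["x"]]
v3 <- df[, "x"]

# These return a one-column data frame
d1 <- df["x"]
d2 <- df[, "x", drop = FALSE]

# Several columns, by name or by position
d3 <- df[c("x", "z")]
d4 <- df[, c(1, 3)]
```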
When you specify rows and columns with a comma inside single brackets and the result is a single column, R simplifies it to a vector; set drop = FALSE to maintain the original structure of the object. Double square brackets cannot be used to select several columns at once.
Subset dataframe by column name
Subset dataframe by column value
You can also subset a data frame depending on the values of the columns. As an example, you may want to make a subset with all values of the data frame where the corresponding value of the column z is greater than 5, or where the group of the w column is Group 1.
Note that when subsetting a data frame by column value you have to put the condition in the first position inside the brackets (before the comma), because the output is a subset of the rows of the data frame.
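The conditions described above can be sketched as follows, assuming an illustrative data frame df with a numeric column z and a grouping column w:

```r
df <- data.frame(x = 1:6,
                 z = c(2, 4, 6, 8, 10, 12),
                 w = rep(c("Group 1", "Group 2"), each = 3),
                 stringsAsFactors = FALSE)

big_z  <- df[df$z > 5, ]                      # rows where z is greater than 5
group1 <- df[df$w == "Group 1", ]             # rows belonging to Group 1
both   <- df[df$z > 5 & df$w == "Group 2", ]  # both conditions at once
```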
You can also apply a conditional subset by column values with the subset function as follows. Note that when using this function you can use the variable names directly.
When using the subset function with a data frame you can also specify the columns you want to be returned, indicating them in the select argument.
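A minimal sketch of the same kind of filter written with subset, using the select argument to choose the returned columns. The data frame df is illustrative.

```r
df <- data.frame(x = 1:6,
                 y = letters[1:6],
                 z = c(2, 4, 6, 8, 10, 12),
                 w = rep(c("Group 1", "Group 2"), each = 3),
                 stringsAsFactors = FALSE)

# Column names are used directly inside subset
rows_only <- subset(df, z > 5)

# Condition on two columns, returning only x and z
rows_cols <- subset(df, z > 5 & w == "Group 2", select = c(x, z))

# A negative selection drops a column instead
without_y <- subset(df, w == "Group 1", select = -y)
```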
Subset rows in R
As with columns, you can subset the rows of a data frame by giving the row indices as the first argument inside the square brackets.
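For instance, with an illustrative data frame df of six rows:

```r
df <- data.frame(x = 1:6, z = c(2, 4, 6, 8, 10, 12))

first_three <- df[1:3, ]      # rows 1 to 3
picked      <- df[c(1, 4), ]  # rows 1 and 4
no_first    <- df[-1, ]       # every row except the first
```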
Subset rows by list of values
In case you want to subset rows based on a vector you can use the %in% operator or the is.element function as follows:
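Both approaches can be sketched as below. The data frame df and the vector keep are illustrative, and keep deliberately contains a value ("z") that does not occur in the data.

```r
df <- data.frame(x = 1:6, y = letters[1:6], stringsAsFactors = FALSE)
keep <- c("a", "c", "z")

with_in      <- df[df$y %in% keep, ]
with_element <- df[is.element(df$y, keep), ]
with_subset  <- subset(df, y %in% keep)
```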
Subset by date
Many data frames have a column of dates. In this case, each row represents a date and each column an event registered on those dates. To subset by date, first convert that column to the Date class with the as.Date function.
As an example, you can subset the values corresponding to dates later than January 5, 2011 with the following code:
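A sketch of this, assuming a made-up data frame events whose date column is initially stored as text in ISO format (YYYY-MM-DD):

```r
events <- data.frame(date  = c("2011-01-03", "2011-01-05", "2011-01-07",
                               "2011-01-07", "2011-01-10"),
                     sales = c(10, 20, 30, 40, 50),
                     stringsAsFactors = FALSE)

# Convert the text column to the Date class before comparing
events$date <- as.Date(events$date)

after      <- subset(events, date > as.Date("2011-01-05"))
after_also <- events[events$date > as.Date("2011-01-05"), ]
```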
Subsetting in R by unique date
Note that in case your date column contains the same date several times and you want to select all the rows that correspond to that date, you can use the == logical operator with the subset function as follows:
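A short sketch of selecting every row for one repeated date, again with an illustrative events data frame:

```r
events <- data.frame(date  = as.Date(c("2011-01-05", "2011-01-07",
                                       "2011-01-07", "2011-01-10")),
                     sales = c(20, 30, 40, 50))

# All rows recorded on January 7, 2011
same_day <- subset(events, date == as.Date("2011-01-07"))
```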
Subset a matrix in R
Subsetting a matrix in R is very similar to subsetting a data frame. Consider the following sample matrix:
You can subset the rows and columns by specifying the indices of the rows and then of the columns. You can also use logical vectors.
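The following sketch uses an illustrative 3 x 4 matrix m; the row and column names are assumptions made for the example.

```r
m <- matrix(1:12, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"),
                            c("one", "two", "three", "four")))

first_row  <- m[1, ]                     # simplified to a named vector
second_col <- m[, 2]                     # values 4 5 6
block      <- m[1:2, c(1, 3)]            # rows 1-2 of columns 1 and 3
by_logical <- m[c(TRUE, FALSE, TRUE), ]  # rows 1 and 3
kept_dim   <- m[1, , drop = FALSE]       # stays a 1 x 4 matrix
```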
Subset matrix by column and row names
In case your matrix contains row or column names, you can use them instead of the index to subset the matrix. In the following example we selected the columns named ‘two’ and ‘three’.
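A sketch of that selection, assuming an illustrative matrix m whose columns are named one to four:

```r
m <- matrix(1:12, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"),
                            c("one", "two", "three", "four")))

two_three <- m[, c("two", "three")]  # columns by name
one_cell  <- m["r2", "three"]        # a single value by row and column name: 8
```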
Subset matrix by column values
As with data frames, you can subset a matrix by the values of its columns. In this case, we subset based on a condition on the values of the third column.
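A possible version of this, with an illustrative matrix m whose third column holds the values 7, 8 and 9:

```r
m <- matrix(1:12, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"),
                            c("one", "two", "three", "four")))

# Rows whose third column is greater than 7
with_brackets <- m[m[, 3] > 7, ]

# The matrix method of subset keeps the matrix structure by default
with_subset <- subset(m, m[, 3] > 7)
```

If only one row met the condition, the bracket form would return a vector unless drop = FALSE were added.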
Subset time series
Time series are a type of R object with which you can create subsets of data based on time. We will use, for instance, the nottem time series.
The window function allows you to create subsets of time series, as shown in the following example:
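A minimal sketch with the built-in nottem series (monthly temperatures); the choice of the year 1930 is arbitrary.

```r
# Keep only the observations from January to December 1930
year_1930 <- window(nottem, start = c(1930, 1), end = c(1930, 12))

# Everything from 1935 onwards
from_1935 <- window(nottem, start = c(1935, 1))
```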