Ghana Finance

Jun 25 2017

Translating between R and SQL: the basics – Burns Statistics



R was made especially for data analysis and graphics. SQL was made especially for databases. They are allies.

The data structure in R that most closely matches a SQL table is a data frame. The terms rows and columns are used in both.

A mashup

There is an R package called sqldf that allows you to use SQL commands to extract data from an R data frame. We will use this package in the examples. There are two basic steps to using an R package:

  • it must be installed on your machine (once)
  • it must be available in each R session where it is used

You can use sqldf by doing (one time only, and assuming an internet connection):

Then in each R session where you want to use it:
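The two steps might look like this. It is a minimal sketch that assumes the package is installed from CRAN under its usual name; the guard around the install is just a convenience so the command is harmless when run again.

```r
# One time only (needs an internet connection); skipped if sqldf is already installed
if (!requireNamespace("sqldf", quietly = TRUE)) install.packages("sqldf")

# In every R session where sqldf is used
library(sqldf)
```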

To simplify the examples, we'll slightly modify one of the inbuilt data frames:
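One possible modification is sketched here. The name myCO2 comes from the text; rebuilding the object with data.frame() is an assumption about how to get a plain copy, not necessarily the exact commands originally used.

```r
# CO2 ships with R; its class vector has several entries besides "data.frame"
class(CO2)

# Assumed approach: data.frame() rebuilds the object with class "data.frame" only
myCO2 <- data.frame(CO2)
class(myCO2)
```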

Note that the character between C and 2 is a capital-O and not a zero. The CO2 object has a complicated value for its class but the result of the sqldf function has only “data.frame” in its class. We want to cleanly see if two objects are the same, and hence we want the classes to match.

Column names

In R the colnames function returns the names of the columns:
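For example, assuming myCO2 is a plain data-frame copy of the inbuilt CO2 data:

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

colnames(myCO2)
```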

The result is a vector of character strings.

Subsetting columns

Columns in SQL are also called fields. In R it is rather common for columns to be called variables.

In SQL the subset of columns is determined by select. Here we want to get the Type and conc columns:
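A sketch of such a query run through sqldf. The object name s01 is the one the text uses later; the setup line is an assumption.

```r
library(sqldf)
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# The SQL statement is passed to sqldf() as a character string
s01 <- sqldf("select Type, conc from myCO2")
head(s01)
```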

Subsetting in R (commonly called subscripting) is done with square brackets. When subscripting a data frame there will be two places inside the square brackets separated by a comma. The R equivalent of the command above is:
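A sketch of that subscripting, with r01 as the object name used in the text and the setup line assumed:

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# Nothing before the comma (all rows), two column names after it
r01 <- myCO2[, c("Type", "conc")]
head(r01)
```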

The first part inside the square brackets (corresponding to rows) is empty. The second part (corresponding to columns) has a character vector with the names of the two columns we want.

We can test that r01 and s01 are the same:
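One way to run such a test is with all.equal. This is a sketch in which both objects are rebuilt first so that it runs on its own:

```r
library(sqldf)
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

s01 <- sqldf("select Type, conc from myCO2")
r01 <- myCO2[, c("Type", "conc")]

# TRUE if the objects match, otherwise a description of the differences
all.equal(s01, r01)
```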

In R the vector of column names could be created as an object and then used in the subscripting:
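For instance (the object name wanted is made up for this sketch; r01b is the name used in the text):

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

wanted <- c("Type", "conc")  # hypothetical object holding the column names
r01b <- myCO2[, wanted]

# Same result as writing the names directly inside the brackets
identical(r01b, myCO2[, c("Type", "conc")])
```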

The r01 and r01b objects are the same.

All columns

An asterisk is used in SQL to indicate that you want all columns:

When you want all items in a dimension in R, you leave it blank:
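Both forms side by side, as a sketch; the object names s02 and r02 are assumptions that follow the numbering pattern of the other examples.

```r
library(sqldf)
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

s02 <- sqldf("select * from myCO2")  # SQL: asterisk means all columns
r02 <- myCO2[ , ]                    # R: both positions left blank
```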

You might have been able to guess that because we've seen that done for rows already. Note that spaces almost never matter in R: the command above has spaces either side of the comma, but would be exactly the same with no spaces.

Only one column

How to select only a single column is no surprise in either language:

But there is a surprise when you test if these two objects are equal:
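A sketch of the two selections and the comparison. The names s03 and r03 are from the text; picking the uptake column is an assumption.

```r
library(sqldf)
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

s03 <- sqldf("select uptake from myCO2")
r03 <- myCO2[, "uptake"]

# Not TRUE here: the result lists how the two objects differ
all.equal(s03, r03)

class(s03)
class(r03)
```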

The command above results in a bunch of stuff, indicating they are quite different.

The r03 object is not a data frame, it is an object of the type of the column. While surprising to those used to SQL, this is quite natural for R's purposes. For example, we give the mean function a vector of numbers, not a data frame:
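For example, again assuming the uptake column:

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# The subscript yields a numeric vector, which is what mean() wants
mean(myCO2[, "uptake"])
```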

You can get a one-column data frame by slightly modifying the command:
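The usual modification is the drop argument of the subscripting operator. The sketch below assumes that is what is meant and again uses the uptake column.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# drop = FALSE keeps the result as a data frame even with a single column
r03d <- myCO2[, "uptake", drop = FALSE]
class(r03d)
```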

The s03 and r03d objects are the same.

Data frames are not natural inputs to some functions:
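A small illustration, assuming mean is the function in question:

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data
r03d <- myCO2[, "uptake", drop = FALSE]

mean(r03d)         # warns that the argument is not numeric or logical; gives NA
mean(r03d$uptake)  # works: the column itself is a numeric vector
```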

Case sensitivity

SQL is not case-sensitive:
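A sketch with the SQL keywords in upper case. The data frame and column names are left exactly as they are in R, which is a deliberately cautious choice for this example.

```r
library(sqldf)
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# Same query as "select Type, conc from myCO2", keywords in capitals
s04 <- sqldf("SELECT Type, conc FROM myCO2")
head(s04)
```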

s04 is the same as s01.

On the other hand, R is case-sensitive:
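For example, asking for a column called "type" rather than "Type" fails. The tryCatch wrapper is only there so that this sketch runs to the end instead of stopping at the error.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# "type" (lower-case t) is not a column name, so the subscripting is an error;
# the error message is captured and returned as a character string
res <- tryCatch(myCO2[, c("type", "conc")],
                error = function(e) conditionMessage(e))
res
```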

R extensions

We've seen how to select columns of an R data frame with the names of the columns. There are other ways of selecting columns as well.

The order of the columns in an R data frame is of significance. You can select columns by number. For example, you can select column 5 and then column 2:
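A sketch of selecting by position:

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# Column 5 first, then column 2
head(myCO2[, c(5, 2)])
```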

You can use negative numbers to exclude columns. Here you are asking for all columns except the first and the fourth:
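A sketch of excluding by position:

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# Everything except columns 1 and 4
head(myCO2[, c(-1, -4)])
```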

Column selection in R can also be done with logical values:

Those logical values can be created by a command:
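Both variants as a sketch. The object names keep and is_num are made up here, and testing for numeric columns is just one plausible command for creating the logical values.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# Logical values typed by hand: one per column, TRUE means keep
keep <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
head(myCO2[, keep])

# Logical values created by a command: TRUE for each numeric column
is_num <- sapply(myCO2, is.numeric)
head(myCO2[, is_num])
```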

Subsetting rows

In SQL a common synonym for row is record. In R a common synonym is observation.


The common way of getting a subset of rows in SQL is with the where command:
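A sketch of a where clause on the uptake column. The name s05 is from the text; the threshold of 10 is an assumption chosen for illustration.

```r
library(sqldf)
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# Keep only the rows whose uptake is below the (assumed) threshold
s05 <- sqldf("select * from myCO2 where uptake < 10")
s05
```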

In R the equivalent of the where is put in the first position inside the square brackets:
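The same selection in R might look like this, with the same assumed threshold of 10:

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# A logical vector before the comma picks the rows; nothing after it means all columns
r05 <- myCO2[myCO2$uptake < 10, ]
r05

# The row names are carried over from myCO2
rownames(r05)
```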

s05 and r05 are in most respects the same. The difference is that the row names are different. r05 has the row names from the original data frame while s05 has new ones that are sequential from 1.


The command that created r05 is a little convoluted (but logical once you stare at it long enough). The with function allows a command that is more in the spirit of what is done in SQL:
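A sketch of the with version. The name r05b is made up and the threshold of 10 is the same assumed value as before.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# with() evaluates the comparison using the columns of myCO2 directly
r05b <- myCO2[with(myCO2, uptake < 10), ]

# Same rows as the version that spells out myCO2$uptake
identical(r05b, myCO2[myCO2$uptake < 10, ])
```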

Inside the with call the columns of the data frame named in the first argument can be used as objects. In this example uptake is used directly instead of pulling that column out of the data frame.

Logical operators

Logical comparisons in SQL are combined with AND and OR:
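A sketch of a combined condition. The name s06 is from the text; the two conditions themselves are assumptions chosen for illustration.

```r
library(sqldf)
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# Two (assumed) conditions joined with "and"; the string value is in single quotes
s06 <- sqldf("select * from myCO2 where uptake < 20 and Type = 'Quebec'")
s06
```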

Also note that testing equality is with =.

In R this type of and operation is done with & and the or is |:
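A sketch in R using the same assumed conditions (uptake below 20 and Type equal to Quebec); r06 is the name used in the text.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

# & combines the two logical vectors element by element
r06 <- myCO2[myCO2$uptake < 20 & myCO2$Type == "Quebec", ]
r06
```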

A possible trouble spot is that equality in R is tested with == (while = is an assignment operator).

The s06 and r06 objects are the same except for their row names.

First few

The limit command in SQL limits the number of rows that are given:

One way to see just the column names is to limit the number of rows to zero.
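Both uses of limit, as a sketch; the object names s07 and s08 and the choice of five rows are assumptions.

```r
library(sqldf)
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

s07 <- sqldf("select * from myCO2 limit 5")  # at most five rows
s07

s08 <- sqldf("select * from myCO2 limit 0")  # no rows, so only the column names show
s08
```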

You can get the first few rows in R with head:
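For example (the row counts are arbitrary choices for this sketch):

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

head(myCO2, 5)  # first five rows
tail(myCO2, 3)  # last three rows
```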

The tail function gives you the last few rows, and the corner function is a logical extension of head and tail.

Row names versus numbers

A source of possible confusion is that row names are character even though they are, by default, representations of numbers. Let's experiment with r06:

Select the first three rows:

Now let's select the characters one through three:
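The two selections can be sketched as follows. Here r06 is rebuilt with assumed conditions (uptake below 20 and Type equal to Quebec), so the exact rows you see depend on that assumption.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data
r06 <- myCO2[myCO2$uptake < 20 & myCO2$Type == "Quebec", ]  # assumed conditions

rownames(r06)            # character strings inherited from myCO2

r06[1:3, ]               # numeric subscript: first three rows by position
r06[c("1", "2", "3"), ]  # character subscript: rows looked up by row name
```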

What happened? The first row is correct: the first row name is "1". In the second row it looked for a row name called "2" and didn't find one, so it put in missing values. The third row is even weirder: it looked for a row name called "3"; there was a single row name starting with "3" so it did a partial match and gave us that row.

Trying to give numbers instead of the actual names doesn't necessarily work either:
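A sketch of what goes wrong, again with r06 rebuilt under assumed conditions. The numbers are treated as positions, not names, so any number larger than the number of rows produces a row of missing values.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data
r06 <- myCO2[myCO2$uptake < 20 & myCO2$Type == "Quebec", ]  # assumed conditions

# Turn the row names into numbers and use them as a subscript
nums <- as.numeric(rownames(r06))
out <- r06[nums, ]
out
```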

Additional details


In SQL NULL means missing value. Confusingly R also has NULL but the equivalent of SQL NULL is NA in R.

Let's create some data to play with:
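One way to build such data is sketched here. The name myNA, the use of only the first six rows, and the positions of the missing values are all made up for illustration.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

myNA <- head(myCO2)        # hypothetical object: first six rows only
myNA$Plant[c(2, 5)] <- NA  # make two of the Plant values missing
myNA
```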

This looks like:

Get the rows where Plant is not missing:

We can also get the rows where Plant is missing:
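Both selections in SQL and in R, as a sketch. It rebuilds the hypothetical myNA data (six rows, two missing Plant values), and the other object names are made up as well.

```r
library(sqldf)
myCO2 <- data.frame(CO2)   # assumed setup: plain copy of the inbuilt CO2 data
myNA <- head(myCO2)        # hypothetical data with two missing Plant values
myNA$Plant[c(2, 5)] <- NA

# Plant is not missing
s_present <- sqldf("select * from myNA where Plant is not null")
r_present <- myNA[!is.na(myNA$Plant), ]

# Plant is missing
s_missing <- sqldf("select * from myNA where Plant is null")
r_missing <- myNA[is.na(myNA$Plant), ]
```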

To get the rows that have no missing values in R, you can do:
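Two common ways, shown on the same hypothetical myNA data; which one was originally meant is an assumption.

```r
myCO2 <- data.frame(CO2)   # assumed setup: plain copy of the inbuilt CO2 data
myNA <- head(myCO2)        # hypothetical data with two missing Plant values
myNA$Plant[c(2, 5)] <- NA

na.omit(myNA)                  # drops every row that has any NA
myNA[complete.cases(myNA), ]   # same rows, selected by subscripting
```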


In SQL single quotes are used to delimit character strings. A single quote inside a string is given with two single quotes in a row. Some implementations allow you to specify the delimiter.

In R either single quotes or double quotes can be used. You can use whichever you find more convenient but R always prints using double quotes. The backslash is used to escape a quote character that is the same as the delimiting quote:
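A few illustrative strings (the object names and the text inside them are made up):

```r
s1 <- 'single quotes'
s2 <- "double quotes"
s3 <- "She said \"hello\""  # escaped double quote inside double quotes
s4 <- 'It\'s here'          # escaped single quote inside single quotes
s5 <- "It's here"           # no escape needed when the delimiter differs

identical(s4, s5)
```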


Semicolons are sometimes used at the end of statements in both SQL and R.

Some SQL implementations require a semicolon at the end of a statement.

Semicolons are used to separate R commands on the same line. They can be used after all R commands, but probably shouldn't be.

Single subscript

Both myCO2[1:3, ] and myCO2[1:3] are legal R commands, but they do different things. The first gives the first three rows and all of the columns; the second gives all of the rows and the first three columns (for a reason you need not be concerned about initially).
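A quick way to see the difference between subscripting with and without the comma is to look at the dimensions of each result. This sketch assumes myCO2 is a plain copy of the inbuilt CO2 data.

```r
myCO2 <- data.frame(CO2)  # assumed setup: plain copy of the inbuilt CO2 data

dim(myCO2[1:3, ])  # with the comma: 3 rows, all 5 columns
dim(myCO2[1:3])    # single subscript: all 84 rows, first 3 columns
```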


If a data frame is large and the manipulation is complex, then R can be inefficient. Why should someone with access to a database put up with such inefficiency? One reason is the flexibility that R gives you. It might surprise some people that not all data naturally fit into a structure of rows and columns. Besides, it usually doesn't matter. As has been asked facetiously by a certain someone: "What are you going to do with that extra millisecond?"

But if there are millions of those milliseconds, then you might start to care. The data.table package provides an alternative form of data frames that is highly efficient.

Resources to learn R

“Impatient R” is a minimal set of things to learn about R.

Resources to learn SQL

I'm certainly no expert at learning materials for SQL, so please make suggestions. But here are things I've found that seem at least okay:

SQLZoo provides quite a nice interactive set of exercises.

Tutorialspoint has information easily arranged for learning and refreshing.


Written by admin
