5 Writing to a Database

Contents

  1. Writing to a Database in Julia
  2. Writing to a Database in SAS
  3. Writing to a Database in R
  4. Writing to a Database in Python

Once you are connected to a database, you can create tables with data! If you’re only using a database to access data, and aren’t writing to it, feel free to continue to the next chapter on querying a database.

In the code blocks below, we assume you’ve already connected to your database and have named the connection as shown in the previous chapter on connecting to a database.

5.1 Julia

Depending on your ODBC driver, you could write a table with a simple line of code: ODBC.load(database, "name of sql table", julia_table)

MS SQL Server, along with some other ODBC drivers and database systems, doesn't support this, so we must use an alternative method.

Imports

using ODBC, CSV, JuliaDB, Query, DataFrames

The Julia Language Community has a suggested structure and set of instructions for working with tabular data, known as the Tables.jl interface. Both the indexed table type from JuliaDB and the more familiar DataFrame implement these standards, and integrate well with the ODBC package. Any of these queries can be converted into a DataFrame for a more familiar interface, or into a JuliaDB IndexedTable if you need to run online algorithms or want easier access to parallel features.
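For instance, a query result can be pulled into a DataFrame and then converted to a JuliaDB table. The snippet below is an untested sketch assuming the same pre-1.0 ODBC.jl API used throughout this chapter, an existing connection named dsn, and an iris table already in the database; exact function names vary between ODBC.jl versions.

```julia
# assumption: dsn is connected and an iris table exists in the database
df = ODBC.query(dsn, "select * from iris")   # materialize the result as a DataFrame

# convert to a JuliaDB IndexedTable for parallel features / online algorithms
t = table(df)
```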

Clear Up the Workspace

ODBC.execute!(
    dsn,
    """
    drop table if exists goods; 
    drop table if exists items;
    drop table if exists receipts;
    drop table if exists customers;
    drop table if exists iris;
    drop table if exists iris2;
    """,
)

Create tables

ODBC.execute!(
    dsn,
    """
create table customers
(Id int Primary Key, LastName VARCHAR(30), FirstName VARCHAR(30))
""",
)

ODBC.execute!(
    dsn,
    "create table goods
    (Id VARCHAR(30) Primary Key, Flavor VARCHAR(30), Food VARCHAR(30), Price FLOAT)",
)

ODBC.execute!(
    dsn,
    """
create table receipts
(ReceiptNumber int PRIMARY KEY, Date VARCHAR(30), CustomerId int, CONSTRAINT CustomerID FOREIGN KEY (CustomerId) references customers(Id) )
""",
)

ODBC.execute!(
    dsn,
    "create table items
    (Receipt INT, Ordinal INT, Item VARCHAR(30),  PRIMARY KEY (Receipt,Ordinal), CONSTRAINT ReceiptNumber FOREIGN KEY (Receipt) REFERENCES receipts (ReceiptNumber) 
    , CONSTRAINT ItemId FOREIGN KEY(Item) References goods (Id) )
    ",
)

Insert the Data

First, prepare a SQL statement with placeholder ("blank") values that we'll later fill with actual values:

insertstmt = ODBC.prepare(dsn, "insert into customers values(?,?,?)")

Now, for each row of each CSV file, run the prepared insert statement, substituting that row's values for the "blanks", i.e. those question marks:

for row in rows(customers)
    ODBC.execute!(insertstmt, row)
end

## Now do this for the remaining three tables
insertstmt = ODBC.prepare(dsn, "insert into goods values(?,?,?,?)")
for row in rows(goods)
    ODBC.execute!(insertstmt, row)
end

insertstmt = ODBC.prepare(dsn, "insert into receipts values(?,?,?)")
for row in rows(receipts)
    ODBC.execute!(insertstmt, row)
end


insertstmt = ODBC.prepare(dsn, "insert into items values(?,?,?)")
for row in rows(items)
    ODBC.execute!(insertstmt, row)
end

5.1.1 Alternative method for big data

The above method requires you to have the full dataset loaded into memory. But sometimes a data file is so large that it is inefficient to load everything at once and iterate through it to insert each row into a table.

The method used here instead loads one row of data into memory at a time, and writes it to the SQL table before loading the next row.

ODBC.execute!(
  dsn,
  """
    create table iris
    (Sepal_length FLOAT, Sepal_width FLOAT,
    Petal_length FLOAT, Petal_width FLOAT, Species VARCHAR(20))
  """
)

insert_statement = ODBC.prepare(dsn,"insert into iris values(?,?,?,?,?)")

open("iris.csv") do f
  readline(f) # skip the header row (remove this line if no header)

  # read one line at a time, write to SQL table, and repeat
  for line in eachline(f)
    rawline = map(String, split(line, ","))
    ODBC.execute!(insert_statement, rawline)
  end
end

The file can also be read using the CSV package. The code below would replace open("iris.csv") do f and everything following it in the code chunk above.

for row in CSV.Rows("iris.csv", skipto=2)
    ODBC.execute!(insert_statement, row)
end

5.1.2 Bulk Insert

If you have bulk insert permissions you can write data without reading it into memory. For more info, see bulk insert documentation.

TODO: Test this. This code is not tested, since we do not have bulk load permissions. TODO: Explain the difference between these two options.

ODBC.execute!(
  dsn,
  """
    BULK INSERT iris
    FROM 'iris.csv';
  """
)

ODBC.execute!(
  dsn, 
  """
    INSERT INTO iris (Sepal_length, Sepal_width, Petal_length, Petal_width, Species)
    SELECT *  
    FROM OPENROWSET (  
      BULK 'iris.csv') AS b;
  """
)

5.2 SAS

Downloading Data

Here we use a filename statement with the URL access method to download the iris CSV file from the internet, then preprocess it using a data step.

filename download url "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv";

data iris;
  infile download delimiter = "," firstobs=2;
  input var1-var4 species $;
run;

Writing Data

Finally, we write this data to the database.

proc sql;
  create table conn2.iris as (select * from work.iris);
quit;

5.3 R

5.3.1 Simple Table: Iris

The first example is a basic one to illustrate how to create and fill a table.

# drop table from database, if it exists
dbExecute(conn, 'drop table if exists iris;')

# make column names SQL-compatible; names can't have "." in them
names(iris) <- c(
  'SepalLength','SepalWidth',
  'PetalLength','PetalWidth',
  'Species'
)

# create new table from iris dataset
dbWriteTable(conn, name = "iris",  value = iris)

# make sure table exists in database
dbListTables(conn, table_name = "iris")

# list fields in the new table
dbListFields(conn, name = "iris")

5.3.2 Alternative method for big data

The above method requires you to have the full dataset loaded into memory. But sometimes a data file is so large that it is inefficient to load everything at once and iterate through it to insert each row into a table.

The method used here instead loads one row of data into memory at a time, and writes it to the SQL table before loading the next row.

It’s a bit more complex because we can’t use the simple dbWriteTable anymore. Instead, we define the table with a create table statement with dbExecute. To keep the code readable, we suggest using the paste function and using the same indentation you would in a sql file.

For simplicity and lack of repetition, this example doesn’t use the four related datasets again. For information on defining primary and foreign keys in a create table statement, see the w3schools example for SQL primary key on SQL Server and for foreign keys.

dbExecute(conn, 'drop table if exists iris;')

dbExecute(conn, paste(
  "create table iris(",
    "SepalLength decimal(5,2),",
    "SepalWidth decimal(5,2),",
    "PetalLength decimal(5,2),",
    "PetalWidth decimal(5,2),",
    "Species varchar(50)",
  ");"
))

f <- file("iris.csv", open = "r")
first = TRUE

# while the next line exists
while (length(oneLine <- readLines(f, n = 1)) > 0) {
  
  # this skips the header row. delete this if there's no header
  if (first) {
    first = FALSE
    next
  }
  
  # separate values
  myLine <- unlist(strsplit(oneLine, ","))
  
  # insert values (str_replace_all comes from the stringr package)
  insert_statement <- paste(
    "insert into iris(",
      "SepalLength,",
      "SepalWidth,",
      "PetalLength,",
      "PetalWidth,",
      "Species",
    ") values (",
      as.numeric(myLine[1]), ",",
      as.numeric(myLine[2]), ",",
      as.numeric(myLine[3]), ",",
      as.numeric(myLine[4]),",'",
      str_replace_all(myLine[5], '\"',''), "');"
  )
  dbExecute(conn, insert_statement)
}
close(f)

5.4 Python

In Python, there's no pretty function to create a table from a file or pandas DataFrame like there is in R. Here, we rely exclusively on the cursor's execute method and raw SQL strings. To keep the code readable, we suggest using the .join method on a space (" ") and using the same indentation you would in a sql file.

Note that if you are using pandas, you can write data in the same way as shown here by iterating through the rows of the DataFrame as opposed to lines of a file.
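As a sketch of that pandas approach, the snippet below uses Python's built-in sqlite3 module as a stand-in for the ODBC connection so it is self-contained; with pyodbc you would use your own cursor, and the ? placeholders work the same way. The hard-coded DataFrame is only illustrative.

```python
import sqlite3
import pandas as pd

# stand-in connection; with pyodbc this would be your database connection
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "create table iris("
    "SepalLength real, SepalWidth real, "
    "PetalLength real, PetalWidth real, Species text)"
)

# illustrative stand-in for data you would have read from a file
df = pd.DataFrame({
    "SepalLength": [5.1, 4.9], "SepalWidth": [3.5, 3.0],
    "PetalLength": [1.4, 1.4], "PetalWidth": [0.2, 0.2],
    "Species": ["setosa", "setosa"],
})

# iterate through the rows of the DataFrame instead of lines of a file
insert = "insert into iris values (?, ?, ?, ?, ?)"
for row in df.itertuples(index=False):
    cursor.execute(insert, tuple(row))

print(cursor.execute("select count(*) from iris").fetchone()[0])  # 2
```

If you use SQLAlchemy, pandas also offers DataFrame.to_sql, but the row-by-row pattern above matches the rest of this chapter.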

5.4.1 Simple Table: Iris

The first example is a basic one to illustrate how to create and fill a table.

Load data

# load in data
with open('iris.csv') as file:
    iris_lines = file.readlines()[1:]

Drop table if it exists, and create a new table

# drop table if it already exists
cursor.execute("drop table if exists iris;")

# create new table
cursor.execute(" ".join([
    "create table iris(",
        "SepalLength decimal(5,2),",
        "SepalWidth decimal(5,2),",
        "PetalLength decimal(5,2),",
        "PetalWidth decimal(5,2),",
        "Species varchar(50)"
    ");"
]))

Insert values

To insert values into a SQL table, we have two options.

This first strategy is beneficial because it doesn't require you to load the full file into memory, so Python can handle even very large datasets. Here we only show it entering the first line of data, but it could be put inside a for loop to iterate over all rows.

# single input: write one item to table
line = iris_lines[0].split(',')
cursor.execute(
    " ".join([
        "insert into iris(",
            "SepalLength,",
            "SepalWidth,",
            "PetalLength,",
            "PetalWidth,",
            "Species"
        ")"
        "values(?, ?, ?, ?, ?)"]),
    line[0],
    line[1],
    line[2],
    line[3],
    line[4].strip()
)
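That loop might look like the following sketch, again with the built-in sqlite3 module standing in for the ODBC connection and a few hard-coded lines standing in for iris_lines, so it runs on its own:

```python
import sqlite3

# stand-in for the ODBC connection used elsewhere in this chapter
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "create table iris("
    "SepalLength real, SepalWidth real, "
    "PetalLength real, PetalWidth real, Species text)"
)

# stand-in for iris_lines as read from iris.csv above
iris_lines = [
    "5.1,3.5,1.4,0.2,setosa\n",
    "4.9,3.0,1.4,0.2,setosa\n",
    "6.3,3.3,6.0,2.5,virginica\n",
]

# option 1 in a loop: insert every line, one row at a time
for raw in iris_lines:
    line = raw.split(",")
    cursor.execute(
        "insert into iris values (?, ?, ?, ?, ?)",
        (line[0], line[1], line[2], line[3], line[4].strip()),
    )

print(cursor.execute("select count(*) from iris").fetchone()[0])  # 3
```

Passing the parameters as one tuple works with both sqlite3 and pyodbc; the unpacked-arguments form shown above is pyodbc-specific.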

This second option inserts many rows of data into the table in one command. Using cursor.executemany, we can pass the same insert statement as above, followed by a list of lists of values, where each inner list is one row of data. This is nice because it is less code. However, we prefer the first option because it works consistently for data tables of any size.

# bulk input: write many items to table
cursor.executemany(
    " ".join([
            "insert into iris(",
                "SepalLength,",
                "SepalWidth,",
                "PetalLength,",
                "PetalWidth,",
                "Species"
            ")"
            "values(?, ?, ?, ?, ?)"]),
    [line.split(',') for line in iris_lines[1:]] # [1:] because observation 0 was inserted above
)