5 Writing to a Database

Contents

  1. Writing to a Database in Julia
  2. Writing to a Database in SAS
  3. Writing to a Database in R
  4. Writing to a Database in Python

Once you are connected to a database, you can create tables with data! If you’re only using a database to access data, and aren’t writing to it, feel free to continue to the next chapter on querying a database.

In the code blocks below, we assume you’ve already connected to your database and have named the connection as shown in the previous chapter on connecting to a database.

5.1 Julia

Depending on your ODBC driver, you could write a table with a simple line of code: ODBC.load(database, "name of sql table", julia_table)

MS SQL Server, along with some other ODBC drivers and database systems, doesn't support this, so we must use an alternative method.

Imports

using ODBC, CSV, JuliaDB, Query, DataFrames

The Julia Language Community has a suggested structure and set of instructions for working with tabular data, known as the Tables.jl interface. Both the indexed table type from JuliaDB and the more familiar DataFrame implement these standards, and integrate well with the ODBC package. Any of these queries can be converted into a DataFrame for a more familiar interface, or into a JuliaDB IndexedTable if you need to run online algorithms or want easier access to parallel features.
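For instance, a query result can be pulled into a DataFrame and then converted to a JuliaDB table. The snippet below is an untested sketch assuming the same pre-1.0 ODBC.jl API used throughout this chapter, an existing connection named dsn, and an iris table already in the database; exact function names vary between ODBC.jl versions.

```julia
# assumption: dsn is connected and an iris table exists in the database
df = ODBC.query(dsn, "select * from iris")   # materialize the result as a DataFrame

# convert to a JuliaDB IndexedTable for parallel features / online algorithms
t = table(df)
```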

Clear Up the Workspace

ODBC.execute!(
    dsn,
    """
    drop table if exists goods; 
    drop table if exists items;
    drop table if exists receipts;
    drop table if exists customers;
    drop table if exists iris;
    drop table if exists iris2;
    """,
)

Create tables

ODBC.execute!(
    dsn,
    """
create table customers
(Id int Primary Key, LastName VARCHAR(30), FirstName VARCHAR(30))
""",
)

ODBC.execute!(
    dsn,
    "create table goods
    (Id VARCHAR(30) Primary Key, Flavor VARCHAR(30), Food VARCHAR(30), Price FLOAT)",
)

ODBC.execute!(
    dsn,
    """
create table receipts
(ReceiptNumber int PRIMARY KEY, Date VARCHAR(30), CustomerId int, CONSTRAINT CustomerID FOREIGN KEY (CustomerId) references customers(Id) )
""",
)

ODBC.execute!(
    dsn,
    "create table items
    (Receipt INT, Ordinal INT, Item VARCHAR(30),  PRIMARY KEY (Receipt,Ordinal), CONSTRAINT ReceiptNumber FOREIGN KEY (Receipt) REFERENCES receipts (ReceiptNumber) 
    , CONSTRAINT ItemId FOREIGN KEY(Item) References goods (Id) )
    ",
)

Insert the Data

First, prepare a SQL statement with placeholder ("blank") values that we'll later fill with actual values:

insertstmt = ODBC.prepare(dsn, "insert into customers values(?,?,?)")

Now, for each row of each CSV file, run the prepared insert statement, substituting that row's values for the "blanks", i.e. those question marks:

for row in rows(customers)
    ODBC.execute!(insertstmt, row)
end

## Now do this for the remaining three tables
insertstmt = ODBC.prepare(dsn, "insert into goods values(?,?,?,?)")
for row in rows(goods)
    ODBC.execute!(insertstmt, row)
end

insertstmt = ODBC.prepare(dsn, "insert into receipts values(?,?,?)")
for row in rows(receipts)
    ODBC.execute!(insertstmt, row)
end


insertstmt = ODBC.prepare(dsn, "insert into items values(?,?,?)")
for row in rows(items)
    ODBC.execute!(insertstmt, row)
end

5.1.1 Alternative method for big data

The above method requires you to have the full dataset loaded into memory. But sometimes a data file is so large that it is inefficient to load everything at once and iterate through it to insert each row into a table.

The method used here instead loads one row of data into memory at a time, and writes it to the SQL table before loading the next row.

ODBC.execute!(
  dsn,
  """
    create table iris
    (Sepal_length FLOAT, Sepal_width FLOAT,
    Petal_length FLOAT, Petal_width FLOAT, Species VARCHAR(20))
  """
)

insert_statement = ODBC.prepare(dsn,"insert into iris values(?,?,?,?,?)")

open("iris.csv") do f
  readline(f) # skip the header row (remove this line if no header)

  # read one line at a time, write to SQL table, and repeat
  for line in eachline(f)
    rawline = map(String, split(line, ","))
    ODBC.execute!(insert_statement, rawline)
  end
end

The file can also be read using the CSV package. The code below would replace open("iris.csv") do f and everything following it in the code chunk above.

for row in CSV.Rows("iris.csv", skipto=2)
    ODBC.execute!(insert_statement, row)
end

5.1.2 Bulk Insert

If you have bulk insert permissions you can write data without reading it into memory. For more info, see bulk insert documentation.

TODO: Test this. This code is not tested, since we do not have bulk load permissions. TODO: Explain the difference between these two options.

ODBC.execute!(
  dsn,
  """
    BULK INSERT iris
    FROM 'iris.csv';
  """
)

ODBC.execute!(
  dsn, 
  """
    INSERT INTO iris (Sepal_length, Sepal_width, Petal_length, Petal_width, Species)
    SELECT *  
    FROM OPENROWSET (  
      BULK 'iris.csv') AS b;
  """
)

5.2 SAS

Downloading Data

Here we use a filename statement with the URL access method to download the iris CSV file from the internet, then preprocess it using a data step.

filename download url "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv";

data iris;
  infile download delimiter = "," firstobs=2;
  input var1-var4 species $;
run;

Writing Data

Finally, we write this data to the database.

proc sql;
  create table conn2.iris as (select * from work.iris);
quit;

5.3 R

5.3.1 Simple Table: Iris

The first example is a basic one to illustrate how to create and fill a table.

# drop table from database, if it exists
dbExecute(conn, 'drop table if exists iris;')

# make column names SQL-compatible; names can't have "." in them
names(iris) <- c(
  'SepalLength','SepalWidth',
  'PetalLength','PetalWidth',
  'Species'
)

# create new table from iris dataset
dbWriteTable(conn, name = "iris",  value = iris)

# make sure table exists in database
dbListTables(conn, table_name = "iris")

# list fields in the new table
dbListFields(conn, name = "iris")

5.3.2 Alternative method for big data

The above method requires you to have the full dataset loaded into memory. But sometimes a data file is so large that it is inefficient to load everything at once and iterate through it to insert each row into a table.

The method used here instead loads one row of data into memory at a time, and writes it to the SQL table before loading the next row.

It’s a bit more complex because we can’t use the simple dbWriteTable anymore. Instead, we define the table with a create table statement with dbExecute. To keep the code readable, we suggest using the paste function and using the same indentation you would in a sql file.

For simplicity and lack of repetition, this example doesn’t use the four related datasets again. For information on defining primary and foreign keys in a create table statement, see the w3schools example for SQL primary key on SQL Server and for foreign keys.

dbExecute(conn, 'drop table if exists iris;')

dbExecute(conn, paste(
  "create table iris(",
    "SepalLength decimal(5,2),",
    "SepalWidth decimal(5,2),",
    "PetalLength decimal(5,2),",
    "PetalWidth decimal(5,2),",
    "Species varchar(50)",
  ");"
))

f <- file("iris.csv", open = "r")
first = TRUE

# while the next line exists
while (length(oneLine <- readLines(f, n = 1)) > 0) {
  
  # this skips the header row. delete this if there's no header
  if (first) {
    first = FALSE
    next
  }
  
  # separate values
  myLine <- unlist(strsplit(oneLine, ","))
  
  # insert values (str_replace_all comes from the stringr package)
  insert_statement <- paste(
    "insert into iris(",
      "SepalLength,",
      "SepalWidth,",
      "PetalLength,",
      "PetalWidth,",
      "Species",
    ") values (",
      as.numeric(myLine[1]), ",",
      as.numeric(myLine[2]), ",",
      as.numeric(myLine[3]), ",",
      as.numeric(myLine[4]),",'",
      str_replace_all(myLine[5], '\"',''), "');"
  )
  dbExecute(conn, insert_statement)
}
close(f)

5.4 Python

In Python, there's no pretty function to create a table from a file or pandas DataFrame like there is in R. Here, we rely exclusively on the cursor's execute method and raw SQL strings. To keep the code readable, we suggest using the .join method on a space (" ") and using the same indentation you would in a sql file.

Note that if you are using pandas, you can write data in the same way as shown here by iterating through the rows of the DataFrame as opposed to lines of a file.
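As a sketch of that pandas approach, the snippet below uses Python's built-in sqlite3 module as a stand-in for the ODBC connection so it is self-contained; with pyodbc you would use your own cursor, and the ? placeholders work the same way. The hard-coded DataFrame is only illustrative.

```python
import sqlite3
import pandas as pd

# stand-in connection; with pyodbc this would be your database connection
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "create table iris("
    "SepalLength real, SepalWidth real, "
    "PetalLength real, PetalWidth real, Species text)"
)

# illustrative stand-in for data you would have read from a file
df = pd.DataFrame({
    "SepalLength": [5.1, 4.9], "SepalWidth": [3.5, 3.0],
    "PetalLength": [1.4, 1.4], "PetalWidth": [0.2, 0.2],
    "Species": ["setosa", "setosa"],
})

# iterate through the rows of the DataFrame instead of lines of a file
insert = "insert into iris values (?, ?, ?, ?, ?)"
for row in df.itertuples(index=False):
    cursor.execute(insert, tuple(row))

print(cursor.execute("select count(*) from iris").fetchone()[0])  # 2
```

If you use SQLAlchemy, pandas also offers DataFrame.to_sql, but the row-by-row pattern above matches the rest of this chapter.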

5.4.1 Simple Table: Iris

The first example is a basic one to illustrate how to create and fill a table.

Load data

# load in data
with open('iris.csv') as file:
    iris_lines = file.readlines()[1:]

Drop table if it exists, and create a new table

# drop table if it already exists
cursor.execute("drop table if exists iris;")

# create new table
cursor.execute(" ".join([
    "create table iris(",
        "SepalLength decimal(5,2),",
        "SepalWidth decimal(5,2),",
        "PetalLength decimal(5,2),",
        "PetalWidth decimal(5,2),",
        "Species varchar(50)"
    ");"
]))

Insert values

To insert values into a SQL table, we have two options.

This first strategy is beneficial because it doesn't require you to load the full file into memory, so Python can handle even very large datasets. Here we only show it entering the first line of data, but it could be put inside a for loop to iterate over all rows.

# single input: write one item to table
line = iris_lines[0].split(',')
cursor.execute(
    " ".join([
        "insert into iris(",
            "SepalLength,",
            "SepalWidth,",
            "PetalLength,",
            "PetalWidth,",
            "Species"
        ")"
        "values(?, ?, ?, ?, ?)"]),
    line[0],
    line[1],
    line[2],
    line[3],
    line[4].strip()
)
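That loop might look like the following sketch, again with the built-in sqlite3 module standing in for the ODBC connection and a few hard-coded lines standing in for iris_lines, so it runs on its own:

```python
import sqlite3

# stand-in for the ODBC connection used elsewhere in this chapter
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "create table iris("
    "SepalLength real, SepalWidth real, "
    "PetalLength real, PetalWidth real, Species text)"
)

# stand-in for iris_lines as read from iris.csv above
iris_lines = [
    "5.1,3.5,1.4,0.2,setosa\n",
    "4.9,3.0,1.4,0.2,setosa\n",
    "6.3,3.3,6.0,2.5,virginica\n",
]

# option 1 in a loop: insert every line, one row at a time
for raw in iris_lines:
    line = raw.split(",")
    cursor.execute(
        "insert into iris values (?, ?, ?, ?, ?)",
        (line[0], line[1], line[2], line[3], line[4].strip()),
    )

print(cursor.execute("select count(*) from iris").fetchone()[0])  # 3
```

Passing the parameters as one tuple works with both sqlite3 and pyodbc; the unpacked-arguments form shown above is pyodbc-specific.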

This second option inserts many rows of data into the table in one command. Using cursor.executemany, we can pass the same insert statement as above, followed by a list of lists of values, where each inner list is one row of data. This is nice because it is less code. However, we prefer the first option because it works consistently for data tables of any size.

# bulk input: write many items to table
cursor.executemany(
    " ".join([
            "insert into iris(",
                "SepalLength,",
                "SepalWidth,",
                "PetalLength,",
                "PetalWidth,",
                "Species"
            ")"
            "values(?, ?, ?, ?, ?)"]),
    [line.split(',') for line in iris_lines[1:]] # [1:] because observation 0 was inserted above
)