In this blog, we will finally answer THE question: R, Python, Scala, Spark, Tensorflow, etc. Which one is best for answering data science questions? The question itself is somewhat absurd, but so many people keep asking it on social networks that we find it worth answering with a scientific methodology. At the end of this blog, you will find a quantitative answer comparing the computing time of each language/library for fitting the exact same Generalized Linear Model (GLM). Many features matter in the choice of a language/library; among them, computing time and development time are certainly very important criteria.
The methodology that we adopt to answer this famous question therefore considers both the performance of the tool in terms of computing time and its ease of use for data exploration, model definition, and so on. The quality of the tool's documentation is also an important factor that contributes to faster development cycles. Finally, the scalability of the tool with respect to the size of the dataset is an important factor in the big data era.
Model and Dataset
The dataset that we are using in these comparisons is the airline dataset, which contains flight details since 1987. The dataset is publicly available on the website of the American Statistical Association. It is made of 29 columns and 7,009,728 rows, and weighs 658 MB on disk. We will use 1M flight records from the year 2008 as our benchmark dataset. In this post, we will use each tool to put together a model to predict whether a flight will arrive on time. The model prediction is based on 9 columns of the input dataset:
- Departure date of the flight (Year, Month, DayOfMonth, and DayOfWeek),
- Departure time
- Air time
- Distance
- Origin airport
- Destination airport
We will use a Generalized Linear Model (GLM), also known as logistic regression or logit regression, to make the prediction. This is quite a basic model which is extensively used in everyday business and which is available in all the tools. It therefore offers a nice point of comparison for the types of models that matter today.
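Concretely, the model estimates the probability that a flight is delayed as P(delayed | X) = sigmoid(X*W + B), with sigmoid(z) = 1/(1+exp(-z)), where X are the input features and where the coefficients W and the bias B are the free parameters learned from the data (this decomposition is restated in the Tensorflow section below).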
It is worth stressing that the objective of this blog is not to define the most accurate model for predicting whether a flight will arrive on time. The main objective is to offer a quantitative comparison of data science tools in terms of computing performance and coding ease. In other words, it is on purpose that we use a simple model with a limited number of inputs.
Comparison Procedure
All the comparisons are made in Jupyter notebooks using the exact same analysis flow. The notebooks are available on GitHub and can be viewed on NBViewer. Note: if you would like to add comparisons with other tools or datasets, feel free to push your results to the GitHub repository. The analysis flow that we repeat for all tools is the following:
- Import the library
- Load and explore the dataset
- Prepare the dataset for training
- Define and fit the model
- Test the model and measure the accuracy
Tools (Library/Languages)
R is the established language in the academic and data science communities; SAS is its main commercial competitor. R is considered to be good for pure statistical analysis, and it is open source. However, languages like Python and Scala are eating into R's market share. A recent survey by Burtch Works shows how Python is gaining steam in the analytics community, at the expense of R and of proprietary packages like SAS, IBM's SPSS, and Mathworks' Matlab. (Figure: survey results; source: Burtch Works.) At the time of writing this blog, we have analyzed the following languages/libraries:
- R
- Python3 + Scikit-learn
- Python3 + Tensorflow
- Python3 + Keras
- Scala + Spark
Data manipulations in Python and Scala are done using the Pandas and Spark DataFrame libraries, respectively. Both are inspired by R's data frames and are therefore very similar. The Spark library can also be used from Python, but we prefer to test it in Scala, which is the native language of the Spark library. We should also highlight that although Spark can be used to analyze a small dataset like this one, it was initially designed for the analysis of datasets that are so big that they cannot be analyzed efficiently on a single computer.
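For reference, loading the same CSV from Python through PySpark (not benchmarked in this blog) would look roughly like the sketch below; it assumes an already created SparkSession named spark:
#Rough PySpark equivalent of the Scala loading code shown later
df = (spark.read
          .option("header", True)        #first CSV line holds the column names
          .option("inferSchema", True)   #let Spark guess the column types
          .csv("2008.csv"))
df.count()  #number of rows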
Data Loading and Exploration
In this section, we compare the code complexity for the following tasks:
- Load a dataset from a CSV file
- Print the total number of rows in the dataset
- Trim the dataset to the first million rows
- List the columns in the dataset
- Display the first 5 rows in the Jupyter output cell.
From the code snippets below, we can see that R and Python have almost the same syntax. Python (and its libraries Pandas and NumPy) is a bit more object-oriented, which makes its usage easier. The equivalent Spark code is a bit more complicated, but offers the advantage of handling distributed data loading from the Hadoop Distributed File System (HDFS).
In R:
df = read.csv("2008.csv") #Load data
nrow(df) #Get number of rows
df = df[1:1000000,] #Keep the first 1M rows
names(df) #List columns
df[1:5, ] #Show the first 5 rows
In Python3:
import pandas as pd #Pandas library for data frames
df = pd.read_csv("2008.csv") #Load data
df.shape[0] #Get number of rows
df = df[0:1000000] #Keep the first 1M rows
df.columns #List columns
df[0:5] #Show the first 5 rows
In Scala:
val dffull = sqlContext.read.format("csv")
.option("header", true)
.option("inferSchema", true)
.load("2008.csv") //Load data
val count = dffull.count //Get number of rows
val df = dffull.sample(false, 1000000.toFloat/count) //Randomly sample ~1M rows
df.printSchema() //List columns
df.show(5) //Show the first 5 rows
Data Preparation for Model Training
In this section, we compare the code complexity for selecting the columns of interest for our model, encoding the categorical variables, and splitting the dataset into a training sample and a testing sample.
From the code snippets below, we can see that R and Python again have almost the same syntax. Spark data processing is a bit different: it relies on the Spark ML Pipelines mechanism, which allows better optimization of the distributed computation.
In R:
#drop rows where delay column is na
df = df[is.na(df$ArrDelay)==0,]
#turn label to numeric
df["IsArrDelayed" ] <- as.numeric(df["ArrDelay"]>0)
#encode the categorical variables (replace by their design-matrix/dummy columns)
df["Origin" ] <- model.matrix(~Origin , data=df)
df["Dest"   ] <- model.matrix(~Dest   , data=df)
#split the dataset in two parts
trainIndex = sample(1:nrow(df), size = round(0.8*nrow(df)), replace=FALSE)
train = df[ trainIndex, ]
test = df[-trainIndex, ]
In Python3:
#drop rows where delay column is na
df = df.dropna(subset=["ArrDelay"])
#turn label to numeric
df["IsArrDelayed" ] = (df["ArrDelay"]>0).astype(int)
#Mark as categorical (replace by one hot encoded version)
df = pd.concat([df, pd.get_dummies(df["Origin"], prefix="Origin")], axis=1);
df = pd.concat([df, pd.get_dummies(df["De, axis=1);
#split the dataset in two parts
train = df.sample(frac=0.8)
test = df.drop(train.index)
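As a side illustration (a toy example, not part of the benchmark), here is what pd.get_dummies produces, namely one 0/1 column per category:
import pandas as pd
toy = pd.DataFrame({"Origin": ["JFK", "SFO", "JFK"]})
print(pd.get_dummies(toy["Origin"], prefix="Origin"))
#   Origin_JFK  Origin_SFO
#0           1           0
#1           0           1
#2           1           0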
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}
//build a pipeline to turn categorical variables into their one-hot encoded version
//and to build a feature vector concatenating all training columns into a vector
val OriginIndexer = new StringIndexer()
.setInputCol("Origin")
.setOutputCol("OriginIndex")
val OriginEncoder = new OneHotEncoder()
.setInputCol("OriginIndex")
.setOutputCol("OriginVec")
val DestIndexer = new StringIndexer()
.setInputCol("Dest")
.setOutputCol("DestIndex")
val DestEncoder = new OneHotEncoder()
.setInputCol("DestIndex")
.setOutputCol("DestVec")
val Assembler = new VectorAssembler()
.setInputCols(Array("Year","Month", "DayofMonth" ,"DayOfWeek", "DepTime", "AirTime", "Distance", "OriginVec", "DestVec"))
.setOutputCol("Features")
val pipeline = new Pipeline()
.setStages(Array(OriginIndexer, OriginEncoder, DestIndexer, DestEncoder, Assembler))
//Transform the dataset using the above pipeline
val Preparator = pipeline.fit(df)
val dfPrepared = Preparator.transform(df).cache()
//Split the dataset in two parts
val Array(train, test) = dfPrepared.randomSplit(Array(0.8,0.2))
Model Building and Training
In this section, we jump into the heart of the statistical part of the code. Although the code needed to define and use a GLM is quite short in most of the tools, the time needed to execute these few lines can be quite long, depending on the library (and its underlying optimizations), on the model complexity, and on the size of the dataset itself. Fitting the free parameters of a GLM requires iterating over the dataset until all free parameters converge to stable values. This procedure is referred to as the fitting of the model.
There are generally several library options available for fitting a model in a given language. This is particularly true in Python, for which new data science and deep learning libraries are developed every day. Among those, scikit-learn has been the reference for data science algorithms for many years. Deep learning libraries like Google's Tensorflow and Keras are also gaining in popularity and offer the possibility of exploiting the Graphical Processing Unit (GPU) for faster model fitting. Keras uses either Tensorflow or Theano as a back-end for the model fitting, but it makes the programming a bit easier for common statistical models and algorithms.
Note: for Tensorflow, we need to decompose the GLM into simple matrix operations, so the code is a bit lengthier. For those who need a reminder: the logits of the GLM are linear with respect to the model features X, Logits = (X*W) + B = (Features * Coefficients) + Bias. The model predictions are defined as the sigmoid of the logits. The loss function (used to optimize the model parameters W and B) is the logistic (cross-entropy) loss.
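As a concrete illustration (a minimal NumPy sketch, not part of the benchmark), the prediction and loss just described can be written as:
import numpy as np

def glm_predict(X, W, B):
    #sigmoid of the linear logits: P = 1 / (1 + exp(-(X*W + B)))
    logits = X.dot(W) + B
    return 1.0 / (1.0 + np.exp(-logits))

def logistic_loss(P, Y):
    #logistic (cross-entropy) loss averaged over the samples
    return -np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P))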
In the following, you should pay attention to the code complexity, but also to the time it takes to fit the model. The predictive power of the models will be discussed in the next section.
All tests were run on a training dataset of approximately 800K rows (80% of the 1M-row sample) and on the same computer, powered by an Intel I7-6700K @ 4.0GHz (8 cores) and a GTX970 GPU with 4GB of memory. The GPU is only used in the Tensorflow/Keras benchmarks. Spark was used in local mode, so it uses the 8 cores of the processor, but nothing more.
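The quoted times are the wall-clock times of the fitting cell in each notebook. As an illustration (an assumption, not necessarily how the notebooks measured it), such a timing can be reproduced in Python with:
import time
start = time.time()
model.fit(features, labels)  #the scikit-learn fit call from the section below
print("fitting took %.1f sec" % (time.time() - start))
In a Jupyter notebook, the %%time cell magic gives the same information with less code.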
In R: The model fitting took ~19min
#define the model and fit it
model <- glm(IsArrDelayed ~ Year + Month + DayofMonth + DayOfWeek + DepTime + AirTime + Origin + Dest + Distance,data=train,family = binomial)
In Python3 with Scikit-learn: The model fitting took ~13sec
from sklearn.linear_model import LogisticRegression
#define the model feature columns and the label column
OriginFeatCols = [col for col in df.columns if ("Origin_" in col)]
DestFeatCols = [col for col in df.columns if ("Dest_" in col)]
features = train[["Year","Month", "DayofMonth" ,"DayOfWeek", "DepTime", "AirTime", "Distance"] + OriginFeatCols + DestFeatCols ]
labels = train["IsArrDelayed"]
#define the model itself (C is the inverse of the L2 regularization strength)
model = LogisticRegression(C=1E5, max_iter=10000)
#fit the model
model.fit(features, labels)
In Python3 with Tensorflow: The model fitting took ~11sec
import tensorflow as tf

featureSize = features.shape[1]
labelSize = 1
training_epochs = 25
batch_size = 2500
#Convert the pandas data to numpy matrices once, before the training loop
featuresMatrix = features.as_matrix()
labelsMatrix = labels.as_matrix().reshape(-1,1)
#Define the model computation graph
graph = tf.Graph()
with graph.as_default():
    # tf Graph input
    LR = tf.placeholder(tf.float32, name='LearningRate')
    X = tf.placeholder(tf.float32, [None, featureSize], name="features") # features
    Y = tf.placeholder(tf.float32, [None, labelSize], name="labels") # training label
    # Set model weights
    W = tf.Variable(tf.random_normal([featureSize, labelSize], stddev=0.001), name="coefficients")
    B = tf.Variable(tf.random_normal([labelSize], stddev=0.001), name="bias")
    # Construct model
    logits = tf.matmul(X, W) + B
    with tf.name_scope("prediction") as scope:
        P = tf.nn.sigmoid(logits)
    # Cost function and optimizer (minimize error using cross entropy, plus L2 regularization)
    L2 = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(targets=Y, logits=logits)) + 1E-5*L2
    optimizer = tf.train.AdamOptimizer(LR).minimize(cost)
    # Initializing the variables
    init = tf.initialize_all_variables()
#Fit the model (using a training cycle with early stopping)
sess = tf.Session(graph=graph)
sess.run(init)
avg_cost_prev = -1
for epoch in range(training_epochs):
    avg_cost = 0.
    total_batch = int(features.shape[0]/batch_size)
    # Loop over all batches
    for i in range(total_batch):
        batch_xs = featuresMatrix[i*batch_size:(i+1)*batch_size]
        batch_ys = labelsMatrix[i*batch_size:(i+1)*batch_size]
        #decay the learning rate over the epochs
        learning_rate = 0.1 * pow(0.2, (epoch + float(i)/total_batch))
        # Fit training using batch data
        _, c = sess.run([optimizer, cost], feed_dict={X: batch_xs, Y: batch_ys, LR: learning_rate})
        # Compute average loss
        avg_cost += c / total_batch
    #check for early stopping
    if avg_cost_prev >= 0 and abs(avg_cost - avg_cost_prev) < 1e-4:
        break
    else:
        avg_cost_prev = avg_cost
In Python3 with Keras: The model fitting took ~55sec
featureSize = features.shape[1]
labelSize = 1
training_epochs = 25
batch_size = 2500
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.regularizers import l2, activity_l2
from sklearn.metrics import roc_auc_score
from keras.callbacks import Callback
from keras.callbacks import EarlyStopping
#DEFINE A CUSTOM CALLBACK TO PRINT THE LOSS AT THE END OF EACH EPOCH
class IntervalEvaluation(Callback):
    def __init__(self): super(Callback, self).__init__()
    def on_epoch_end(self, epoch, logs={}): print("interval evaluation - epoch: %03d - loss:%8.6f" % (epoch, logs['loss']))
#DEFINE AN EARLY STOPPING FOR THE MODEL
earlyStopping = EarlyStopping(monitor='loss', patience=1, verbose=0, mode='auto')
#DEFINE THE MODEL
model = Sequential()
model.add(Dense(labelSize, input_dim=featureSize, activation='sigmoid', W_regularizer=l2(1e-5)))
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
#FIT THE MODEL
model.fit(featuresMatrix, labelsMatrix, batch_size=batch_size, nb_epoch=training_epochs,verbose=0,callbacks=[IntervalEvaluation(),earlyStopping]);
In Scala with Spark: The model fitting took ~44sec
import org.apache.spark.ml.classification.LogisticRegression
//Define the model
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
.setLabelCol("IsArrDelayed")
.setFeaturesCol("Features")
//Fit the model
val lrModel = lr.fit(train)
As for the other parts, R and Python with the scikit-learn library are very similar in terms of code complexity. What is quite astonishing is the huge difference in computing time in favor of Python: the 13-second Python training translates to a 19-minute training in R. R is roughly 90 times slower than Python with scikit-learn and roughly 25 times slower than Spark.
The fact that Spark is a bit slower than Python is expected here, because communication among the Spark computing nodes slows down the process a bit. Here, the computing nodes correspond to the 8 cores of the same processor, but the communication layer of Spark still eats up some of the resources. Spark would start to show an advantage for much larger datasets (not fitting entirely in the computer RAM) or when used on a computing cluster made of dozens of computing nodes or more.
Similarly, Keras is slower than Tensorflow alone. We imagine the cause is overhead in the communication between the Keras front-end and the (Tensorflow) back-end. This overhead is probably negligible for complex deep learning models with a dozen layers or more, which are very different from the extremely simple GLM that we are using for this test. Moreover, we limited our Tensorflow GLM implementation to its most minimal form, so it is hard for Keras to do better.
Tensorflow's computing time is almost identical to what we obtained with scikit-learn, despite the use of a GPU with 1664 CUDA cores. This is again due to the simplicity of the model: the time spent optimizing the model during the training iterations is actually very short, and the vast majority of the time is spent transferring the training data to the graphics card and gathering the results back from the GPU. More complex models would therefore be trained in basically the same amount of time, given that the bottleneck here is not the computing power. There are better Tensorflow implementations that could address this issue and reduce the training time; one of them is to use input queues, but we have not explored this solution for this blog.
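For the curious, here is a rough sketch of such a queue-based input (an untested illustration using the TF 1.x queue API; the resulting tensors would replace the X and Y placeholders of the graph above):
#Untested sketch: feed the in-memory matrices through an input queue so that
#host-to-GPU transfers overlap with the computation (TF 1.x queue API)
feature_slice, label_slice = tf.train.slice_input_producer(
    [tf.constant(featuresMatrix, dtype=tf.float32),
     tf.constant(labelsMatrix, dtype=tf.float32)],
    num_epochs=training_epochs, shuffle=True)
X_batch, Y_batch = tf.train.batch([feature_slice, label_slice], batch_size=batch_size)
#X_batch and Y_batch then replace the X and Y placeholders in the model graph,
#and sess.run(optimizer) no longer needs a feed_dict for the training data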
Model Testing and Accuracy
In this section, we compare the code snippets used to get the model predictions on the testing dataset, to draw the model's ROC curve, and to get the model's Area Under the Curve (AUC), which is a good indicator of classification performance: the higher the AUC, the better. Since all the models use the same input features and the same dataset, we expect the AUC of all the implementations to be similar. There might be some small differences due to the randomness of the optimization process itself, or due to slightly different stopping conditions.
In R: AUC=0.706
library(ROCR) #provides the prediction() and performance() functions
#Get the predictions
test["IsArrDelayedPred"] <- predict(model, newdata=test, type="response")
#Compare prediction with truth and draw the ROC curve
pred <- prediction(test$IsArrDelayedPred, test$IsArrDelayed)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
#Get the AUC
AUC = performance(pred, measure = "auc")@y.values
In Python3:
with Scikit-learn: AUC=0.702
with Tensorflow: AUC=0.699
with Keras: AUC=0.689
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
#Get the predictions
testFeature = test[["Year","Month", "DayofMonth" ,"DayOfWeek", "DepTime", "AirTime", "Distance"] + OriginFeatCols + DestFeatCols ]
#Note: with scikit-learn, model.predict_proba(testFeature.as_matrix())[:,1] gives the
#probability scores (predict returns hard 0/1 classes); Tensorflow and Keras output probabilities
test["IsArrDelayedPred"] = model.predict( testFeature.as_matrix() )
#Compare prediction with truth and draw the ROC curve
fpr, tpr, _ = roc_curve(test["IsArrDelayed"], test["IsArrDelayedPred"])
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=4, label='ROC curve')
#Get the AUC
AUC = auc(fpr, tpr)
In Scala with Spark: AUC=0.645
//Get the predictions
val testWithPred = lrModel.transform(test.select("IsArrDelayed","Features"))
//Compare prediction with truth and draw the ROC curve
val trainingSummary = lrModel.evaluate(test)
val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]
val roc = binarySummary.roc
plotly.JupyterScala.init()
val fpr = roc.select("FPR").rdd.map(_.getDouble(0)).collect.toSeq;
val tpr = roc.select("TPR").rdd.map(_.getDouble(0)).collect.toSeq;
plotly.Scatter(fpr, tpr, name = "ROC").plot(title = "ROC Curve")
//Get the AUC
println(s"areaUnderROC: ${binarySummary.areaUnderROC}")
All implementations reach quite similar AUC values. Scala with Spark is a bit behind, but the model parameters have not been tuned at all; we could certainly improve this result by tuning the convergence, regularization, and early stopping parameters of the model training. As the score remains relatively close to the other implementations, we consider this result satisfactory for this comparison blog.
Summary
In summary, we have shown that although code complexity is very similar between the R and Python implementations of the GLM, the computing time necessary for training the model is significantly higher for the R implementation, while the accuracy of the model itself is about the same. It is clear that R paved the way to modern statistical and data science toolkits, but unfortunately it seems to be left behind compared to more modern frameworks like Python or Spark. We would therefore recommend moving your data science framework to the Python language, which offers much better performance than R with a coding style very similar to what you are used to with R data frames. You also get the extra advantage of benefiting from the deep learning libraries that are developed daily in Python. These bring marginal improvements for simple models like the GLM, but seriously boost your performance for more complex models (i.e. with several computation layers).
| Tool | Code Complexity | Computing Time (sec) | AUC | Documentation Quality | Additional Remarks |
|------|-----------------|----------------------|-----|-----------------------|--------------------|
| R | * | 1140 | 0.71 | GOOD | Much slower |
| Python3 + Scikit-learn | * | 13 | 0.70 | EXCELLENT | The winner for GLM |
| Python3 + Tensorflow | ** | 11 | 0.70 | GOOD | Exploits GPU |
| Python3 + Keras | * | 55 | 0.69 | GOOD | Exploits GPU; good for complex models |
| Scala + Spark | ** | 44 | 0.65 | GOOD | Good for large datasets |
Note: we would like to extend this comparison table (as well as the GitHub repository) with more frameworks. We would, in particular, be interested in comparisons with commercial software (SAS, SPSS, etc.). Contact us if you want to help.
Have you already faced similar issues? Feel free to contact us, we'd love to talk to you…
If you enjoyed reading this post, please like it. It doesn't cost you anything, but it matters to us!