In an earlier post, I shared how to write a basic machine learning application using ML.NET in C#. In this tutorial, we learn how to write a regression analysis in C# machine learning.
Regression analysis is a set of statistical processes for estimating the relationship between a dependent variable and one or more independent variables. Learn more about Regression analysis.
In our example, we will predict taxi fares based on the previous year's data.
You can download the sample data as the taxi-fare-train.csv and taxi-fare-test.csv datasets.
First, we need to set up our C# console development environment by installing the ML.NET libraries.
Next, you need to create your dataset. The data can come from SQL Server or any other RDBMS, from Excel, or from a CSV file. If you are a SQL developer, you will probably find it easiest to create the dataset as a SQL view; you can then connect directly to the SQL database or copy the data into an Excel file.
At this stage, you load the dataset through the MLContext object so you can work with the data. How you load it depends on the data source you are working with; in my example, I load the data from a CSV file.
You may need to explore and reshape the data by reordering it, removing columns, adding columns, grouping rows, and so on, to get it ready for training and testing algorithms.
You may also want to see what the data looks like visually by plotting or charting it. You can save the visual representation in PDF format for future reference or reporting.
Try different algorithms to see which one produces the closest result.
Finally, make predictions with real data.
Here is a complete example of regression analysis using C#: predicting taxi fares based on previous data.
Open VS2019, select C# console application, and click Next.
Right-click your project => NuGet Package => search for the ML.NET package (Microsoft.ML) and install it.
You need to add the following namespaces.
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
Create an MLContext object; this is the gateway to the machine learning API.
// 1. Create new instance of MLContext object.
MLContext mlContext = new MLContext();
Load the data from a CSV file. We can also load data from an RDBMS such as SQL Server; we will see that example later.
(Note: In this example, we have downloaded the CSV data from the taxi-fare links above.)
Create a class that matches the structure of the CSV file or other data source.
Notice how each property is mapped to a column using the LoadColumn(column-index) attribute.
class Taxifare
{
    // Note: in the sample taxi-fare CSV this column appears to hold text codes
    // (for example "CMT"); if that is the case for your data, a string type may fit better.
    [LoadColumn(0)]
    public float vendor_id { get; set; }

    [LoadColumn(1)]
    public float rate_code { get; set; }

    [LoadColumn(2)]
    public float passenger_count { get; set; }

    [LoadColumn(3)]
    public float trip_time_in_secs { get; set; }

    [LoadColumn(4)]
    public float trip_distance { get; set; }

    [LoadColumn(5)]
    public string payment_type { get; set; }

    [LoadColumn(6)]
    public float fare_amount { get; set; }
}

public class TaxiTripFarePrediction
{
    // The regression trainer writes its prediction to the "Score" column.
    [ColumnName("Score")]
    public float FareAmount;
}
We need to load two separate IDataView objects, one for the training data and one for the test data. If you are trying this with different data, make sure your class structure matches the data source.
string CSV_TestData = @"G:\RND\MLApp1\testdata\taxi-fare-test.csv";
string CSV_TrainData = @"G:\RND\MLApp1\testdata\taxi-fare-train.csv";

// 2. Load data from CSV
IDataView trainingDataVIew = mlContext.Data.LoadFromTextFile<Taxifare>(CSV_TrainData, separatorChar: ',', hasHeader: true);
IDataView testDataVIew = mlContext.Data.LoadFromTextFile<Taxifare>(CSV_TestData, separatorChar: ',', hasHeader: true);
At this point, you will probably want to check whether the data from the CSV files has been loaded into the IDataView objects.
There is a built-in method called GetRowCount(), which I use in the code below to check how many rows are in each dataset.
Unfortunately, this method did not show anything. GetRowCount() returns a nullable long and gives null when the count is not known without reading the data, which is the case for a lazily loaded text file, so the empty output does not mean the load failed.
Console.WriteLine($"Training dataView (trainingDataVIew) {trainingDataVIew.GetRowCount()}");
Console.WriteLine($"Test dataView (testDataVIew) {testDataVIew.GetRowCount()} ");
Note: There is still a way! If you want to check that the data has been loaded properly into the IDataView object, you can iterate over the rows with a cursor, as shown here. You can skip this part.
DataViewSchema columns = trainingDataVIew.Schema;

// Create DataViewCursor
using (DataViewRowCursor cursor = trainingDataVIew.GetRowCursor(columns))
{
    // variables to hold extracted values
    float _vendorId = default;
    float _ratecode = default;
    float _passengerCount = default;

    // Define delegates for extracting values from columns
    ValueGetter<float> vendorIdDelegate = cursor.GetGetter<float>(columns[0]);
    ValueGetter<float> ratecodeDelegate = cursor.GetGetter<float>(columns[1]);
    ValueGetter<float> passengerCountDelegate = cursor.GetGetter<float>(columns[2]);

    // Iterate over each row
    while (cursor.MoveNext())
    {
        // Get values from respective columns
        vendorIdDelegate.Invoke(ref _vendorId);
        ratecodeDelegate.Invoke(ref _ratecode);
        passengerCountDelegate.Invoke(ref _passengerCount);
    }
}
Machine learning algorithms can't directly use the raw data we have in our CSV file, so you need to use data transformations to pre-process the raw data into a format that the algorithm can accept.
There are different types of transforms for converting data; as you can see in the code below, each column is appended to the pipeline using a different transform:
Categorical | NormalizeMeanVariance | Concatenate
In Python, this part is much more straightforward; I hope future ML.NET versions give us an easier way to transform raw data into a machine-compatible format.
// 3. Add data transformations
var dataProcessPipeline = mlContext.Transforms.CopyColumns(outputColumnName: "FareAmount", inputColumnName: nameof(Taxifare.fare_amount))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "VendorIdEncoded", inputColumnName: nameof(Taxifare.vendor_id)))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "RateCodeEncoded", inputColumnName: nameof(Taxifare.rate_code)))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "PaymentTypeEncoded", inputColumnName: nameof(Taxifare.payment_type)))
    .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(Taxifare.passenger_count)))
    .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(Taxifare.trip_time_in_secs)))
    .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(Taxifare.trip_distance)))
    .Append(mlContext.Transforms.Concatenate("Features",
        "VendorIdEncoded", "RateCodeEncoded", "PaymentTypeEncoded",
        nameof(Taxifare.passenger_count), nameof(Taxifare.trip_time_in_secs), nameof(Taxifare.trip_distance)));
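To make the idea behind these transforms more concrete, here is a minimal sketch in plain C# (no ML.NET) of what one-hot encoding and mean-variance scaling do to raw values. The class name TransformSketch and the category values "CRD" and "CSH" are hypothetical, and the scaling shown is the textbook formula; ML.NET's own NormalizeMeanVariance has options (such as fixZero) that can change the exact computation, so treat this as an illustration rather than a description of the library's internals.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only (not part of ML.NET).
public static class TransformSketch
{
    // One-hot encoding: one slot per known category,
    // 1 in the slot that matches the value and 0 everywhere else.
    public static float[] OneHot(string value, string[] categories)
    {
        var vector = new float[categories.Length];
        int index = Array.IndexOf(categories, value);
        if (index >= 0)
        {
            vector[index] = 1f; // unknown values leave the vector all zeros
        }
        return vector;
    }

    // Textbook mean-variance scaling: (x - mean) / standard deviation.
    public static double[] Standardize(double[] values)
    {
        double mean = values.Average();
        double variance = values.Select(v => (v - mean) * (v - mean)).Average();
        double std = Math.Sqrt(variance);
        if (std == 0)
        {
            // A constant column carries no information; map it to zeros.
            return values.Select(v => 0.0).ToArray();
        }
        return values.Select(v => (v - mean) / std).ToArray();
    }
}
```

For example, TransformSketch.OneHot("CSH", new[] { "CRD", "CSH" }) puts the 1 in the second slot, and TransformSketch.Standardize centers a column such as trip distance around zero so that columns with large raw values do not dominate the others.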
Now we set the algorithm for training the model. Here you can try different algorithms to check which one produces the most accurate result. In the example below, I have tested Sdca (Stochastic Dual Coordinate Ascent), and I will also try the LbfgsPoissonRegression method to see how the results differ.
var trainer = mlContext.Regression.Trainers.Sdca(labelColumnName: "FareAmount", featureColumnName: "Features");
var trainingPipeline = dataProcessPipeline.Append(trainer);
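Switching to the other algorithm mentioned above only requires a different trainer. The sketch below wraps the call in a small helper; the class and method names (TrainerOptions, CreatePoissonTrainer) are hypothetical, and it assumes the same "FareAmount" label and "Features" column names used by the Sdca trainer.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Hypothetical helper, for illustration only.
public static class TrainerOptions
{
    // Builds an L-BFGS Poisson regression trainer that reads the same
    // label and feature columns as the Sdca trainer.
    public static LbfgsPoissonRegressionTrainer CreatePoissonTrainer(MLContext mlContext)
    {
        return mlContext.Regression.Trainers.LbfgsPoissonRegression(
            labelColumnName: "FareAmount",
            featureColumnName: "Features");
    }
}
```

To use it, append the result to the data pipeline in place of the Sdca trainer, for example dataProcessPipeline.Append(TrainerOptions.CreatePoissonTrainer(mlContext)). The rest of the steps (Fit, Transform, Evaluate) stay the same, which makes it easy to compare algorithms.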
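// Optional: split the training data 80/20.
// The steps below train on the full training file and evaluate on the separate test file,
// so this split is not used further in this example.
DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(trainingDataVIew, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;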
Now train the model. Remember the Fit method: it trains the model using the previous data.
var trainedModel = trainingPipeline.Fit(trainingDataVIew);
Here we evaluate the model to check the quality of the regression before we start testing with actual data.
IDataView transformTestDataVIew = trainedModel.Transform(testDataVIew);

// type: Microsoft.ML.Data.RegressionMetrics
var metrics = mlContext.Regression.Evaluate(transformTestDataVIew, labelColumnName: "FareAmount", scoreColumnName: "Score");
PrintRegressionMetrics(trainer.ToString(), metrics);
Just print the output to the console.
public static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
{
    Console.WriteLine($"*********************************");
    Console.WriteLine($"*Metrics for {name} regression model ");
    Console.WriteLine($"*-----------------------------------");
    Console.WriteLine($"* LossFn: {metrics.LossFunction:0.##}");
    Console.WriteLine($"* R2 Score: {metrics.RSquared:0.##}");
    Console.WriteLine($"* Absolute loss: {metrics.MeanAbsoluteError:#.##}");
    Console.WriteLine($"* Squared loss: {metrics.MeanSquaredError:#.##}");
    Console.WriteLine($"* RMS loss: {metrics.RootMeanSquaredError:#.##}");
    Console.WriteLine($"**************************************");
}
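If you are wondering what these numbers mean, here is a minimal sketch in plain C# (no ML.NET) that computes the same kinds of metrics by hand using their standard definitions. The class name RegressionMetricsSketch is hypothetical, and this is not ML.NET's own implementation; it is only meant to show how the values relate to each other. For instance, the RMS loss is the square root of the squared loss, and an R2 score closer to 1 means the predictions explain more of the variation in the actual fares.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only (not part of ML.NET).
public static class RegressionMetricsSketch
{
    // Absolute loss: average of |actual - predicted|.
    public static double MeanAbsoluteError(double[] actual, double[] predicted) =>
        actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Average();

    // Squared loss: average of (actual - predicted)^2.
    public static double MeanSquaredError(double[] actual, double[] predicted) =>
        actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();

    // RMS loss: square root of the squared loss.
    public static double RootMeanSquaredError(double[] actual, double[] predicted) =>
        Math.Sqrt(MeanSquaredError(actual, predicted));

    // R2 score: 1 - (sum of squared residuals / total sum of squares).
    public static double RSquared(double[] actual, double[] predicted)
    {
        double mean = actual.Average();
        double residual = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double total = actual.Select(a => (a - mean) * (a - mean)).Sum();
        return 1.0 - residual / total;
    }
}
```

You can call these methods with two arrays of equal length, one holding the actual fares and one holding the predicted fares, to sanity-check a handful of predictions yourself.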
Here are the two different outputs, using two different algorithms on the same training data.
Method SdcaRegressionTrainer
* Metrics for Microsoft.ML.Trainers.SdcaRegressionTrainer regression model
*--------------------------
* LossFn: 35.36
* R2 Score: 0.7
* Absolute loss: .76
* Squared loss: 35.36
* RMS loss: 5.95
**************************
Method LbfgsPoissonRegressionTrainer
* Metrics for Microsoft.ML.Trainers.LbfgsPoissonRegressionTrainer regression model
*--------------------------
* LossFn: 83.85
* R2 Score: 0.28
* Absolute loss: 2.96
* Squared loss: 83.85
* RMS loss: 9.16
**************************
in progress