SciPy For Data Fitting#

Models and parameters#


The goal of (non)linear regression is to fit a model to our data. A model is usually a mathematical function, describing the relationship between different physical quantities. Fitting a model produces estimates for the model parameters, which often have a biological meaning and are thus of interest.

One example is a standard curve in a colorimetric assay, which describes the absorbance at a specific wavelength at a given standard protein, e.g. BSA, concentration. Here, \([BSA]\) is the independent variable, the absorbance at a specific wavelength the dependent variable, and a and b the model parameters.

Another example is the Michaelis-Menten equation, which describes the initial reaction rate, \(V_{0}\), at a given substrate concentration, \([S_{0}]\), for a system that follows \(E + S\) \(\rightleftharpoons\) \(ES\) \(\rightarrow\) \(E + P\) (with E = enzyme, S = substrate, and P = product). Here, \([S_{0}]\) is the independent variable, \(V_{0}\) the dependent variable, and \(V_{max}\) and \(K_m\) the model parameters.

Example models

Of note, a model is said to be linear when the dependent variable is linear with each model parameter. If the model is not linear, then it is nonlinear. Our first example, the standard curve in a colorimetric assay, is a linear model. Our second example, the Michaelis-Menten equation, is a nonlinear model.

Least squares fitting#


For curve fitting, we use the scipy.optimize.curve_fit command. It can be used for multiple relationships - not just straight lines and polynomials! - between two sets of data. It uses least squares to fit a function, which we define, to data. It simply minimises the sum of the squares between the data point \(y_{i,observed}\) and the fit point \(f(x_{i}) = y_{i,predicted}\):

\[ \chi^2 = \Sigma_{i=1}^n (y_{i,observed} - y_{i,predicted})^2 \]
Least squares fitting

Unweighted and weighted fits#


It can be used for data with and without errors. For a normal or unweighted least squares fit, it assumes the errors on the data points are all the same and therefore play no role in the fitting procedure. A weighted fit does depend on the uncertainties on each point. To take weights, \(w_{i}\) (very often non-uniform sample standard deviations, \(\sigma_{i}\)) into account we use:

\[ \chi^2 = \Sigma_{i=1}^n w_{i} * (\overline {y_{i,observed}} - y_{i,predicted})^2 = \Sigma_{i=1}^n \biggl(\frac{\overline {y_{i,observed}} - y_{i,predicted}}{\sigma_{i}}\biggr)^2 \]

Remember, the mean, \(\overline {y_{i,observed}}\) is:

\[ \overline {y_{i,observed}} = \frac{\Sigma_{j=1}^n y_{j,observed}}{n} \]

The standard deviation, \(\sigma\), is the square root of the sample variance:

\[ \Sigma = \sqrt{\frac{\Sigma_{j=1}^n (y_{j,observed} - \overline {y_{i,observed}})^2}{n-1}} \]

Where

  • \(j\) indexes the data set,

  • \(y_{j,observed}\) are the individual results,

  • \(n\) is the number of data sets,

  • \(n-1\) is called the degrees of freedom.

Please see AMC TB 27-2007, Why are we weighting?, available here for more information on weighted fits.

The scipy.optimize.curve_fit() function, fit parameters, and standard errors of the fit parameters#


The scipy.optimize.curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma) function has many arguments. Important are f to define the model function, xdata and ydata for the independent and dependent variables, and p0 for the initial guesses for the model parameters.

The argument sigma, by default None, can be used to define the sample standard deviations. When working with uncertainties, we need to change the argument absolute_sigma from its default False to True. It relates to all values in sigma being the standard deviations and not just relative weights for the data points.

The function returns the optimal values for the model parameters (params, the order as defined by the function) and the covariance matrix (params_covariance, the order as defined by the function).

This covariance matrix can be used to calculate standard errors of the fit parameters. To compute one standard deviation errors on the fit parameters use the numpy.sqrt and numpy.diag functions from NumPy.

The covariance matrix of the estimated parameters

Steps#


The steps for curve fitting are:

  1. Import the optimize module with the curve_fit function from the SciPy library.

  2. Create the (x,y) data, either as NumPy arrays or a Pandas DataFrame.

  3. Define a function for the model we want to fit. The function should accept as inputs the independent variable(s) and all the model parameters to be fit.

  4. Use the scipy.optimize.curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma) function to fit the data.

  5. Extract the fit parameters and standard errors on the fit parameters from the output.

  6. Inspect the x,y data and fit on a graph.

  7. Calculate and inspect the residuals, i.e. \(y_{i,observed} - y_{i,predicted}\), on a graph.

Please, look at the “Curve fitting in Python with curve_fit” from Brant Carlson YouTube video about curve fitting in Python for more information and an example. It includes:

  • computing \(χ^2\) when having uncertainties (at ~ 6 minutes),

  • using curve_fit (at ~ 14 minutes),

  • plotting the results and residuals (at ~ 19 minutes), and

  • interpreting curve_fit’s covariance matrix (at ~ 30 minutes).

Exercise 35

Import the optimize module with the curve_fit function from the SciPy library. Use convenient naming.

Exercise 36

The following (x,y) data represent a BCA colorimetric assay standard curve. BSA concentrations are given in mg/ml and absorbances at 562 nm in AU.

BSAconc = np.array([0, 0.125, 0.250, 0.500, 0.750, 1.00, 1.50, 2.00])
A562nm = np.array([-0.056333, 0.081000, 0.239667, 0.512333, 0.731667, 1.047000, 1.565000, 2.016667])

We first define the function to fit the data using

def funcline(x, a, b):   #create the function to return a line
    y = a * x + b
    return y

We now fit the data using

paramsSC, params_covarianceSC = curve_fit(funcline,   #the function we try to fit to the data
                                        BSAconc,   #the x values, the concentrations
                                        df0['BSA-mean'],   #the y values, the measured absorbances
                                        (1, 0.1))   #the starting parameters for a (=the slope) and b (=the intercept)

We now extract the fit parameters and calculate the standard error of the fit parameters using

print("Slope, a = ", paramsSC[0], "±", np.sqrt(np.diag(params_covarianceSC))[0])   #print the slope and standard error on the slope
print("Intercept, b = ", paramsSC[1], "±", np.sqrt(np.diag(params_covarianceSC))[1])   #print the intercept and standard error on the intercept

It is up to you to

  • inspect the x,y data and fit on a graph, and

  • calculate and inspect the residuals on a graph.

Exercise 37

For the data and data analysis from the previous exercise, produce a combined figure showing the residuals plot underneath the main plot. Make sure they are aligned and have the same X-axis so we can see which residual corresponds to which data point.

Tip: Many functions can be used. Have a look at matplotlib.pyplot.subplots, matplotlib.figure.figure.add_subplot, or matplotlib.figure.figure.add_axes.

Linear regression#


Of note, other functions for linear regression exist:

  • The scipy.stats.linregress function calculates the analytical solution. This function is more computationally efficient and the result doesn’t depend on e.g. starting estimates.

  • The numpy.polyfit function is used for a least squares polynomial fit. The newer numpy.polynomial.polynomial.polyfit is preferred though. These functions facilitates the coding part (no need to define the function). Also, no starting estimates are needed.