Working with Data¶

This notebook will work through the process of importing data from a csv spreadsheet file, graphing the data, fitting a function to the data, and reporting an experimental measurement (with uncertainty) of the acceleration due to gravity.

  • A. Import and view data
  • B. Selecting the Relevant Data Slice
  • C. Fitting a function
  • D. Reporting fit data with uncertainty

The pandas package will be used to read and write data in csv spreadsheet files. Pandas is a powerful data analysis library -- you can learn more about it here.

To begin, we import pandas, as well as numpy for numerical libraries and matplotlib for making graphs.

In [1]:
import pandas as pd
from numpy import *
import matplotlib.pyplot as plt

A. Import and view data¶

We will need a csv data file before we can proceed. For this example we will use the file data_freefall.csv, which must be saved in the same folder as this notebook. You can download the spreadsheet from the Computational Physics page, where this notebook is found.

We imported pandas above as pd, so you can use its code with pd.<function name>. The command on the next line will read the csv data file. The data will be stored in the variable called mydata, but you can name it whatever you like.

In [2]:
mydata = pd.read_csv('data_freefall.csv')

Show your data in a table:

In [3]:
mydata
Out[3]:
Time (s) Position (m) Velocity (m/s) Acceleration (m/s²)
0 0.05 1.478452 0.014330 0.126563
1 0.10 1.478452 0.028660 -0.360565
2 0.15 1.480098 0.033751 -1.584863
3 0.20 1.486410 -0.058996 -4.102331
4 0.25 1.481745 -0.339341 -7.380276
... ... ... ... ...
88 4.45 0.991392 -0.000152 -0.002337
89 4.50 0.991666 -0.000457 -0.007605
90 4.55 0.991392 -0.001555 -0.001118
91 4.60 0.991392 -0.000610 0.006462
92 4.65 0.991392 -0.000305 0.007385

93 rows × 4 columns

This happens to be data collected in a Physics lab experiment studying the free-fall motion of a ball. We will use this data to experimentally determine the acceleration due to gravity, $g$.

We want to get the data values out of the table so we can work with the numbers. We need to know the column names, called the "keys".

In [4]:
mydata.keys()
Out[4]:
Index(['Time (s)', 'Position (m)', 'Velocity (m/s)', 'Acceleration (m/s²)'], dtype='object')

Using these keys, we put the columns of data into variables t, y, v and a (for time, position, velocity and acceleration).

In [5]:
t = mydata['Time (s)'].values
y = mydata['Position (m)'].values
v = mydata['Velocity (m/s)'].values
a = mydata['Acceleration (m/s²)'].values

Let's make a quick plot of the position data. You can do this with the libraries we imported above as plt.

In [6]:
plt.plot(t,y)
Out[6]:
[<matplotlib.lines.Line2D at 0x7fdb37a5ba60>]
No description has been provided for this image

Think about the shape of this plot to be sure it makes sense. Note that the ball is in free-fall for only the middle portion of this data set, the smooth concave-down parabola from about $t=1$ s to $t=2$ s.

A good plot should have a title and axis labels. Also, we'll save this figure so we can share it or include it in a report, as needed. The .png extension is a good format for this type of image.

In [7]:
plt.plot(t,y)
plt.title('free-fall motion of a ball')
plt.xlabel('time (s)')
plt.ylabel('position (m)')
plt.savefig('free-fall_position_graph.png')
No description has been provided for this image

B. Selecting the Relevant Data Slice¶

Before we continue, we need to remove the portions of data that we don't want. We will find the indicies of the data set that contain just the free-fall (parabolic) data. This is called getting a "slice" of the data.

I will first illustrate this with a simple example array. Array A has the following data:

In [8]:
A = array((42,44,47,50,54,62,77))
print(A)
[42 44 47 50 54 62 77]

Now define a new array called B that has just a slice of the A data from index 1 to index 4:

In [9]:
B = A[1:4]
print(B)
[44 47 50]

Plotting a data slice¶

We want to find an initial and final index (call these i,f) for our data set that starts and ends just around the free-fall (parabolic) portion of the data.

We will need to play around with the first line in the following code block:

i,f = 18,40

Change the initial and final indicies, then plot the data. Change again until you find the perfect data slice.

The code after the data slicing plots $y$ versus $t$ as well as $\Delta y$ versus $t$. A look at both will help get the best slice possible. (See the matplotlib documentation to learn about the poltting functions.)

In [10]:
i,f = 18,40          # you'll need to change these to get the right slice
tt = t[i:f]                # tt is a slice of t
yy = y[i:f]                # yy is a slice of y

# make plots:
plt.subplot(121)      # 121 means "on a 1x2 grid, plot number 1"
plt.plot(tt,yy,'o',label='')
plt.xlabel('time (s)')            # label time axis 
plt.ylabel('$y$ position (m)')    # label position axis 
plt.title('height vs time')
plt.subplot(122)         # 122 means "on a 1x2 grid, plot number 2"
Dy = yy[1:]-yy[:-1]      # define Dy to be delta-y values
plt.plot( Dy, 'o',label='')
plt.xlabel('index')            # label time axis 
plt.title('$\Delta$ y')
Out[10]:
Text(0.5, 1.0, '$\\Delta$ y')
No description has been provided for this image

After some experimentation, it appears that the best values of i,f are 16,36.

In [11]:
i,f = 16,36 
tt = t[i:f]      # tt is a slice of t
yy = y[i:f]      # yy is a slice of y

C. Fitting a function¶

The best way to find the value of the free-fall acceleration is to fit a function to our data. This is a very useful experimental technique.

Suppose we tried to find the second-degree polynomial

$$ p(t) = A t^2 + B t + C \approx y(t) $$

that best matches the parabola in our data set. Numerical Python (numpy) can do this with the function polyfit. You can learn the details in the polyfit documentation.

We expect our $y$-versus-$t$ data to follow a polynomial because (in free-fall) the equation of motion is

$$ y(t) = y_0 + v_0 t - \tfrac{1}{2} g t^2 $$

where $g$ is the acceleration due to gravity.

We give our data to the polyfit function and tell it to fit a polynomial of degree 2 -- a parabola. The function returns the coefficients $A,B,C$ that best fit our data.

In [12]:
A,B,C = polyfit(tt,yy,2)       # fit a degree-2 polynomial to the t,y data
print('best fit parabola has coefficients')
print('     A=',A)
print('     B=',B)
print('     C=',C)
best fit parabola has coefficients
     A= -4.90062519936204
     B= 12.995800694098895
     C= -5.96120029435409

Let's see how well this polynomial fits our data by plotting them together.

To plot the parabola, we'll use the linspace function to make an array of many time values, from tt[0] to the last time, tt[-1]. Then plug these times into the polynomial with the coefficients A,B,C we found above.

In [13]:
t_axis = linspace(tt[0],tt[-1],100)   # t_axis is an array of 100 values from initial to final
pp = A*t_axis**2 + B*t_axis + C       # evaluate the polynomial at all t_axis values
plt.plot(t,y,'o',label='raw data')    # plot the data with dots ('o')
plt.plot(t_axis,pp,label='best fit parabola')  # plot the parabola at the t_axis times
plt.xlabel('time (s)')
plt.ylabel('position (m)')
plt.legend(loc='upper right')        # show graph legend
Out[13]:
<matplotlib.legend.Legend at 0x7fdb358c9540>
No description has been provided for this image

The fit is the line passing accurately through our data.

The process used by the polyfit function is called a "least squares fit". You can learn more about it on Wikipedia, or in a proper treatment of error analysis.

D. Reporting fit data with uncertainty¶

We now have coefficient $A$ from the fit, but what is the uncertainty in the fit coefficients? We want to know $A \pm \delta A$, where $\delta A$ is the absolute uncertainty, so we can report our best measure of $g \pm \delta g$.

When the polyfit function is called with an additional parameter:

polyfit(tt,yy,2,cov=True)

it returns the A,B,C coefficients as before and also a "covariance matrix" which gives the variance in each of the fit coefficients. The variance is the square of the standard deviation, and the standard deviation is the uncertainty in the fit coefficient. The variance of each coefficient appears on the diagonal of the covariance matrix.

This may all sound foreign if you haven't had any linear algebra or statistics. Don't worry about that now -- here's the bottom line. We find the uncertainty in each fit parameter (dA,dB,dC) as follows:

In [14]:
(A,B,C),covariance = polyfit(tt,yy,2,cov=True)        # fit with covariance matrix
dA,dB,dC = sqrt(diag(covariance))
print('fit coefficients are')
print('   A = ',A,'±',dA,'m/s^2')
print('   B = ',B,'±',dB,'m/s')
print('   C = ',C,'±',dC,'m')
fit coefficients are
   A =  -4.90062519936204 ± 0.010819955279437353 m/s^2
   B =  12.995800694098895 ± 0.028807305453603355 m/s
   C =  -5.96120029435409 ± 0.01848475836721139 m

We can now report out experiment's measure for the value of the acceleration due to free-fall, $g$, along with its uncertainty. Since the coefficient A in the polynomial corresponds to $- \tfrac{1}{2} g$ in the equation of motion (see part C above), we have $g = -2A$.

So we finally can report our experimental finding:

In [15]:
g = -2*A
dg = 2*dA
print(f'measured value of g = {g:0.3f} ± {dg:0.3f} m/s²')
measured value of g = 9.801 ± 0.022 m/s²