Handling Data#

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import os.path
import subprocess

Load Course Data Files#

def wget_data(url: str):
    local_path = './tmp_data'
    p = subprocess.Popen(["wget", "-nc", "-P", local_path, url], stderr=subprocess.PIPE, encoding='UTF-8')
    rc = None
    # Echo wget's progress messages (written to stderr) until the process exits.
    while rc is None:
        line = p.stderr.readline().strip('\n')
        if len(line) > 0:
            print(line)
        rc = p.poll()
wget_data('https://raw.githubusercontent.com/illinois-ipaml/MachineLearningForPhysics/main/data/pong_data.hf5')
File ‘./tmp_data/pong_data.hf5’ already there; not retrieving.
def locate_data(name, check_exists=True):
    local_path='./tmp_data'
    path = os.path.join(local_path, name)
    if check_exists and not os.path.exists(path):
        raise RuntimeError('No such data file: {}'.format(path))
    return path
locate_data('pong_data.hf5')
'./tmp_data/pong_data.hf5'

Data files are stored in the industry-standard binary HDF5 and text CSV formats, with extensions .hf5 and .csv, respectively. HDF5 is more efficient for larger files but requires specialized software to read. CSV files are just plain text:

wget_data('https://raw.githubusercontent.com/illinois-ipaml/MachineLearningForPhysics/main/data/line_data.csv')
File ‘./tmp_data/line_data.csv’ already there; not retrieving.
with open(locate_data('line_data.csv')) as f:
    # Print the first 5 lines of the file.
    for lineno in range(5):
        print(f.readline(), end='')
x,y,dy
0.3929383711957233,0.08540861657452603,0.3831920560881885
-0.42772133009924107,-0.5198803411067978,0.38522044793317467
-0.5462970928715938,-0.8124804852644906,
0.10262953816578246,0.10527828529558633,0.38556680974439583

The first line specifies the names of each column (“feature”) in the data file. Subsequent lines are the rows (“samples”) of the data file, with values for each column separated by commas. Note that values might be missing (for example, at the end of the third row).
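
Since the format is plain text, you can also parse it by hand. Here is a minimal sketch using Python's built-in csv module; note that the missing value in the third sample arrives as an empty string:

import csv

with open(locate_data('line_data.csv')) as f:
    reader = csv.reader(f)
    header = next(reader)                    # first line: the column names
    rows = [next(reader) for _ in range(4)]  # the next four samples, as lists of strings
print(header)
print(rows[2])                               # third sample: its dy field is an empty string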

Read Files with Pandas#

We will use the Pandas package to read data files into DataFrame objects in memory. This will only be a quick introduction. For a deeper dive, start with Data Manipulation with Pandas in the Python Data Science Handbook.

pong_data = pd.read_hdf(locate_data('pong_data.hf5'))
line_data = pd.read_csv(locate_data('line_data.csv'))
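
Unlike CSV, an HDF5 file is an opaque binary container, but pandas can list its contents without loading everything into memory. A quick aside (this assumes the file was written with pandas/PyTables, as the course files are):

with pd.HDFStore(locate_data('pong_data.hf5'), mode='r') as store:
    print(store.keys())   # names of the datasets stored in the file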

You can think of a DataFrame as an enhanced 2D numpy array, with most of the same capabilities:

line_data.shape
(2000, 3)

Individual columns also behave like enhanced 1D numpy arrays:

line_data['y'].shape
(2000,)
line_data['x'].shape
(2000,)
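
When you need a plain numpy array, for example to pass to a library that does not accept DataFrames, the conversion is direct. A small sketch using the data loaded above:

xy = line_data[['x', 'y']].to_numpy()   # plain 2D numpy array of shape (2000, 2)
print(xy.shape)
print(np.mean(line_data['x']))          # numpy functions also work on columns directly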

For a first look at some unknown data, start with some basic summary statistics:

line_data.describe()
x y dy
count 2000.000000 2000.000000 1850.000000
mean -0.000509 -0.086233 0.479347
std 0.585281 0.782878 0.228198
min -0.999836 -2.390646 0.151793
25% -0.513685 -0.648045 0.302540
50% -0.006021 -0.068052 0.431361
75% 0.501449 0.473741 0.610809
max 0.999289 2.365710 1.506188

Jot down a few things you notice about this data from this summary.

  • The values of x and y are symmetric about zero.

  • The values of x look uniformly distributed on [-1, +1], judging by the percentiles.

  • The value of dy is always > 0, as you might expect if it represents the “error on y”.

  • The dy column is missing 150 entries (its count is 1850 out of 2000 rows), as checked below.
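
As a quick cross-check of that last point, pandas can tally the missing entries in each column directly:

print(line_data.isna().sum())   # number of missing (NaN) entries per column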

Summarize pong_data the same way. Does anything stick out?

pong_data.describe()
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 y0 y1 y2 y3 y4 y5 y6 y7 y8 y9
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.0 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.049004 0.132093 0.212905 0.291504 0.367950 0.442301 0.514615 0.584949 0.653355 0.719888 0.0 0.125206 0.217122 0.276658 0.304702 0.302116 0.269740 0.208390 0.118860 0.001921
std 0.062998 0.067380 0.075805 0.086806 0.099285 0.112547 0.126175 0.139919 0.153624 0.167196 0.0 0.010876 0.021454 0.031742 0.041748 0.051481 0.060946 0.070153 0.079107 0.087815
min -0.161553 -0.089041 -0.018516 0.050077 0.116790 0.181677 0.244785 0.306165 0.365863 0.415850 0.0 0.093722 0.155016 0.184769 0.183846 0.153088 0.093310 0.005310 -0.110141 -0.252291
25% -0.001755 0.079435 0.157023 0.229517 0.293469 0.353604 0.414068 0.473338 0.532280 0.590583 0.0 0.115816 0.198597 0.249250 0.268654 0.257665 0.217116 0.147817 0.050555 -0.073903
50% 0.076534 0.148675 0.205846 0.270214 0.338380 0.406922 0.476322 0.542847 0.608249 0.673589 0.0 0.127098 0.220852 0.282177 0.311961 0.311068 0.280338 0.220589 0.132616 0.017191
75% 0.100177 0.187800 0.286463 0.383127 0.475724 0.565217 0.651398 0.734418 0.816378 0.896600 0.0 0.132847 0.232193 0.298956 0.334029 0.338281 0.312554 0.257672 0.174431 0.063610
max 0.151118 0.261095 0.370325 0.476563 0.579891 0.684321 0.787124 0.887111 0.984358 1.078941 0.0 0.144799 0.255769 0.333838 0.379908 0.394854 0.379530 0.334764 0.261364 0.160113

Some things that stick out from this summary are:

  • Mean, median values in the xn columns are increasing left to right.

  • Column y0 is always zero, so not very informative (verified in the quick check below).

  • Mean, median values in the yn columns increase from y0 to y4 then decrease through y9.
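
For example, the second observation can be confirmed with a one-line check:

assert (pong_data['y0'] == 0).all()   # y0 is identically zero in every row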

Work with Subsets of Data#

A subset is specified by limiting the rows and/or columns. We have already seen how to pick out a single column, e.g. with line_data['x'].

We can also pick out specific rows (for details on why we use iloc see here):

line_data.iloc[:4]
x y dy
0 0.392938 0.085409 0.383192
1 -0.427721 -0.519880 0.385220
2 -0.546297 -0.812480 NaN
3 0.102630 0.105278 0.385567

Note how the missing value in the CSV file is represented as NaN = “not a number”. This is generally how Pandas handles any data that is missing / invalid or otherwise not available (NA).

We may not want to use any rows with missing data. Select the subset of useful data with:

line_data_valid = line_data.dropna()
line_data_valid[:4]
x y dy
0 0.392938 0.085409 0.383192
1 -0.427721 -0.519880 0.385220
3 0.102630 0.105278 0.385567
4 0.438938 0.582137 0.509960
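
By default, dropna removes a row if any column is missing. You can restrict the test to specific columns, or keep every row and fill in the gaps instead. A brief sketch of both options (equivalent here, since only dy has missing values):

only_dy = line_data.dropna(subset=['dy'])                    # drop a row only when dy is missing
filled = line_data.fillna({'dy': line_data['dy'].median()})  # or fill the gaps, e.g. with the median
print(len(only_dy), len(filled))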

You can also select rows using any logical test on their column values. For example, to select all rows with dy > 0.5 and y < 0:

xpos = line_data[(line_data['dy'] > 0.5) & (line_data['y'] < 0)]
xpos[:4]
x y dy
13 -0.880644 -1.482074 0.698284
16 -0.635017 -1.192232 0.619905
30 -0.815790 -0.172324 0.643215
35 -0.375478 -1.320013 0.574198
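
The same selection can also be written with DataFrame.query, which takes the logical test as a string. An equivalent alternative, not used further in this notebook:

xpos_alt = line_data.query('dy > 0.5 and y < 0')
assert len(xpos_alt) == len(xpos)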

Use describe to compare the summary statistics for rows with x < 0 and x >= 0. Do they make sense?

line_data[line_data['x'] < 0].describe()
x y dy
count 1006.000000 1006.000000 938.000000
mean -0.507065 -0.689012 0.472889
std 0.288074 0.498581 0.227474
min -0.999836 -2.390646 0.159862
25% -0.758180 -1.005357 0.294420
50% -0.511167 -0.643512 0.419482
75% -0.264287 -0.338449 0.611192
max -0.000128 0.757903 1.506188
line_data[line_data['x'] >= 0].describe()
x y dy
count 994.000000 994.000000 912.000000
mean 0.512162 0.523822 0.485989
std 0.287312 0.491520 0.228875
min 0.001123 -1.154558 0.151793
25% 0.266587 0.163363 0.312799
50% 0.502736 0.471419 0.436676
75% 0.761346 0.821626 0.607731
max 0.999289 2.365710 1.378183

Extend Data with New Columns#

You can easily add new columns derived from existing columns, for example:

line_data['yprediction'] = 1.2 * line_data['x'] - 0.1

The new column is only in memory, and not automatically written back to the original file.
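
Any vectorized expression of existing columns works on the right-hand side. If you prefer not to modify the DataFrame in place, DataFrame.assign returns a new copy with the extra column. A sketch (the yresidual name is just for illustration):

extended = line_data.assign(yresidual=line_data['y'] - line_data['yprediction'])
print(extended.columns.values)   # line_data itself is unchanged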

EXERCISE: Add a new column for the “pull”, defined as: $y_{pull} \equiv \frac{y - y_{prediction}}{\delta y} \;.$ What would you expect the mean and standard deviation (std) of this new column to be if the prediction is accurate? What do the actual mean, std values indicate?

line_data['ypull'] = (line_data['y'] - line_data['yprediction']) / line_data['dy']

The mean should be close to zero if the prediction is unbiased. The std should be close to one if the prediction is unbiased and the errors are accurate. The actual values (mean ≈ 0.04, std ≈ 0.66) indicate that the prediction is unbiased, but the errors are overestimated.
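
You can check these two numbers directly; they also appear in the describe() summary below:

print(line_data['ypull'].mean(), line_data['ypull'].std())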

line_data.describe()
x y dy yprediction ypull
count 2000.000000 2000.000000 1850.000000 2000.000000 1850.000000
mean -0.000509 -0.086233 0.479347 -0.100611 0.036367
std 0.585281 0.782878 0.228198 0.702338 0.661659
min -0.999836 -2.390646 0.151793 -1.299803 -2.162585
25% -0.513685 -0.648045 0.302540 -0.716422 -0.429185
50% -0.006021 -0.068052 0.431361 -0.107225 0.033875
75% 0.501449 0.473741 0.610809 0.501739 0.484257
max 0.999289 2.365710 1.506188 1.099146 2.033837

Combine Data from Different Sources#

Most of the data files for this course come in data/targets pairs (for reasons that will become clear soon).

Verify that the files pong_data.hf5 and pong_targets.hf5 have the same number of rows but different column names.

wget_data('https://raw.githubusercontent.com/illinois-ipaml/MachineLearningForPhysics/main/data/pong_targets.hf5')
File ‘./tmp_data/pong_targets.hf5’ already there; not retrieving.
pong_data = pd.read_hdf(locate_data('pong_data.hf5'))
pong_targets = pd.read_hdf(locate_data('pong_targets.hf5'))

print('#rows: {}, {}.'.format(len(pong_data), len(pong_targets)))
assert len(pong_data) == len(pong_targets)

print('data columns: {}.'.format(pong_data.columns.values))
print('targets columns: {}.'.format(pong_targets.columns.values))
#rows: 1000, 1000.
data columns: ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'y0' 'y1' 'y2' 'y3'
 'y4' 'y5' 'y6' 'y7' 'y8' 'y9'].
targets columns: ['th0' 'hit' 'grp'].

Use pd.concat to combine the (different) columns, matching row by row. Verify that your combined data has the expected number of rows and column names.

pong_both = pd.concat([pong_data, pong_targets], axis='columns')
print('#rows: {}'.format(len(pong_both)))
print('columns: {}.'.format(pong_both.columns.values))
#rows: 1000
columns: ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'y0' 'y1' 'y2' 'y3'
 'y4' 'y5' 'y6' 'y7' 'y8' 'y9' 'th0' 'hit' 'grp'].
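
Note that pd.concat matches rows by index, not by position. A toy sketch of what happens when the indices do not line up (these small frames are purely illustrative):

left = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
right = pd.DataFrame({'b': [3.0, 4.0]}, index=[1, 2])
print(pd.concat([left, right], axis='columns'))   # unmatched rows are padded with NaN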

Prepare Data from an External Source#

Finally, here is an example of taking data from an external source and adapting it to the standard format we are using. The data is from the 2014 ATLAS Higgs Challenge which is now documented and archived here. More details about the challenge are in this writeup.

EXERCISE:

  1. Download the compressed CSV file (~62 MB) atlas-higgs-challenge-2014-v2.csv.gz using the link at the bottom of this page.

  2. You can uncompress (gunzip) the file on-the-fly.

  3. Skim the description of the columns here. The details are not important, but the main points are that:

  • There are two types of input “features”: 17 primary + 13 derived.

  • The goal is to predict the “Label” from the input features.

  4. Examine the function defined below and determine what it does. Look up the documentation of any functions you are unfamiliar with.

  5. Run the function below, which should create two new files in your course data directory:

  • higgs_data.hf5: Input data with 31 columns (the 30 input features plus Weight), ~100 MB in size.

  • higgs_targets.hf5: Output targets with 1 column, ~8.8 MB in size.

wget_data('http://opendata.cern.ch/record/328/files/atlas-higgs-challenge-2014-v2.csv.gz')
File ‘./tmp_data/atlas-higgs-challenge-2014-v2.csv.gz’ already there; not retrieving.
def prepare_higgs(filename='atlas-higgs-challenge-2014-v2.csv.gz'):
    # Read the input file, uncompressing on the fly.
    df = pd.read_csv(locate_data(filename), index_col='EventId', na_values='-999.0')
    # Prepare and save the data output file.
    higgs_data = df.drop(columns=['Label', 'KaggleSet', 'KaggleWeight']).astype('float32')
    higgs_data.to_hdf(locate_data('higgs_data.hf5', check_exists=False), 'data', mode='w')
    # Prepare and save the targets output file.
    higgs_targets = df[['Label']]
    higgs_targets.to_hdf(locate_data('higgs_targets.hf5', check_exists=False), 'targets', mode='w')
prepare_higgs()

Check that locate_data can find the new files:

locate_data('higgs_data.hf5')
'./tmp_data/higgs_data.hf5'
locate_data('higgs_targets.hf5')
'./tmp_data/higgs_targets.hf5'

Now you can load these data files and explore the data:

higgs_data = pd.read_hdf(locate_data('higgs_data.hf5'))
higgs_data.describe()
DER_mass_MMC DER_mass_transverse_met_lep DER_mass_vis DER_pt_h DER_deltaeta_jet_jet DER_mass_jet_jet DER_prodeta_jet_jet DER_deltar_tau_lep DER_pt_tot DER_sum_pt ... PRI_met_sumet PRI_jet_num PRI_jet_leading_pt PRI_jet_leading_eta PRI_jet_leading_phi PRI_jet_subleading_pt PRI_jet_subleading_eta PRI_jet_subleading_phi PRI_jet_all_pt Weight
count 693636.000000 818238.000000 818238.000000 818238.000000 237985.000000 237985.000000 237985.000000 818238.000000 818238.000000 818238.000000 ... 818238.000000 818238.000000 490867.000000 490867.000000 490867.000000 237985.000000 237985.000000 237985.000000 818238.000000 818238.000000
mean 121.867699 49.252712 81.140556 57.849522 2.404626 372.181000 -0.828740 2.373871 18.969620 158.596161 ... 209.908752 0.979923 84.904297 -0.001248 -0.018856 57.810280 -0.006669 -0.010470 73.205574 0.503144
std 56.942486 35.378609 40.582706 63.411938 1.742546 398.235413 3.583769 0.780875 21.918491 116.089806 ... 126.816689 0.978793 60.649529 1.779419 1.815363 32.455467 2.031695 1.816033 98.331757 0.572200
min 9.044000 0.000000 6.329000 0.000000 0.000000 13.602000 -19.011999 0.208000 0.000000 46.103001 ... 13.678000 0.000000 30.000000 -4.500000 -3.142000 30.000000 -4.500000 -3.142000 -0.000000 0.000461
25% 91.862000 19.304001 59.414001 14.164250 0.886000 112.627998 -2.626000 1.814000 2.839000 77.491997 ... 122.985001 0.000000 44.487999 -1.337000 -1.591000 37.325001 -1.604000 -1.589000 0.000000 0.005721
50% 112.500999 46.484001 73.744003 38.470001 2.101000 226.192993 -0.244000 2.492000 12.383000 120.665001 ... 179.882996 1.000000 65.760002 0.000000 -0.044000 47.993000 -0.011000 -0.020000 40.506001 0.357223
75% 135.576004 73.620003 92.188004 79.226748 3.691000 478.782013 0.945000 2.962000 27.634001 200.884495 ... 263.828003 2.000000 103.569504 1.336000 1.557000 66.695000 1.593000 1.568000 110.387001 0.733462
max 1949.260986 968.669006 1349.350952 2834.999023 8.724000 4974.979004 17.650000 5.751000 2834.999023 2079.162109 ... 2190.274902 3.000000 1163.438965 4.500000 3.142000 817.801025 4.500000 3.142000 1860.175049 2.386316

8 rows × 31 columns

higgs_targets = pd.read_hdf(locate_data('higgs_targets.hf5'))
higgs_targets.describe()
Label
count 818238
unique 2
top b
freq 538678

You can now safely remove the tmp_data directory if you like. The line below is an example of a shell command: uncomment it if you want to delete the directory. On Colab, this cleanup happens automatically when you end the session.

#!rm -rf ./tmp_data

Acknowledgments#

  • Initial version: Mark Neubauer

© Copyright 2024