On this page, you will find various code samples used in the example projects. Click the “Show code” toggles to expand the sections.
- Fake News and Dirty Data
- Mortality Rates in Alberta
- “How Does a Bike-Share Navigate Speedy Success?”
Fake News and Dirty Data
Some example code snippets from the “fake news” project:
Python: Text Cleaning
To clean the raw data of some “leaky” content, the following function is applied to every article (using the re standard library package for regular expressions):
Show Python code
import re

def clean_text(string):
    """
    Clean text by removing leaky features, newlines, and tabs.
    """
    # replace tabs and newlines with spaces (we'll use tabs as column
    # separators for storage later)
    string = re.sub(r"\s", r" ", string)
    # remove "(Reuters)" preambles ("true" articles)
    string = re.sub(
        r".{0,100}"     # up to 100 characters
        r"\(Reuters\)"  # match "(Reuters)"
        r"[ -]*",       # separator dash surrounded by spaces
        r"", string, flags=re.IGNORECASE)
    # remove image/video attributions at end of "fake" articles
    string = re.sub(
        r"(Featured|Image|Photo|Video)"  # match one of these words
        r".{0,200}$",  # allow up to 200 arbitrary characters to end of string
        r" ", string)
    # remove "21st Century Wire" preambles ("fake" articles)
    string = re.sub(r"21st Century Wire( says| asks)?", r"", string,
                    flags=re.IGNORECASE)
    # remove pic.twitter.com urls
    string = re.sub(r"pic\.twitter\.com"  # domain
                    r"/[\S]+",            # 1 or more non-whitespace characters
                    r" ",
                    string)
    # remove any other urls
    string = re.sub(r"(http(s)?://)"  # match http:// or https://
                    r"\S+",           # all following non-whitespace characters
                    r" ", string)
    # remove embed codes
    string = re.sub(r"(// < !\[CDATA)"  # match "// < ![CDATA"
                    r".*"               # any number of characters
                    r"(\>)",            # match ">" at end
                    " ", string)
    # add missing spaces after punctuation
    string = re.sub(r"(?P<word1>[\w\d]+)"    # match first word
                    r"(?P<punct>[\.,:;!?])"  # match punctuation
                    r"(?P<word2>[\w\d]+)",   # match second word
                    r"\g<word1>\g<punct> \g<word2>",
                    string)
    return string.strip()
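For illustration, here is how the function behaves on a small made-up snippet (the input string is purely hypothetical):

raw = "WASHINGTON (Reuters) - Lawmakers met today.Details at pic.twitter.com/abc123"
print(clean_text(raw))
# -> "Lawmakers met today. Details at"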
Python: Lemmatization
Processing of the article bodies (word splitting, tagging and lemmatization) is done using the following function:
Show Python code
import re

import nltk

# note: STOPWORDS (the stop-word set) and the _lemmatize wrapper are
# defined elsewhere in the project
def lemmatize(string):
    """
    Clean up and lemmatize string
    """
    # remove twitter handles and hashtags first (stripping non-word
    # characters below would remove the "@"/"#" markers and leave the
    # handle text behind)
    string_clean = re.sub(r"(@|#)[\w\d_]+", r" ", string.lower())
    # remove all non-word characters
    string_clean = re.sub(r"[\d\W_]+", r" ", string_clean)
    # split string and remove stop words
    string_split = [word for word in string_clean.split()
                    if word not in STOPWORDS]
    # tag remaining words
    string_tagged = nltk.pos_tag(string_split)
    # lemmatize words using a wrapper function (leaves words alone that
    # are not nouns, verbs or adjectives)
    string_lem = [_lemmatize(word, tag) for (word, tag) in string_tagged]
    # reassemble string
    lemmatized = " ".join(string_lem)
    return lemmatized
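The _lemmatize wrapper referenced above is not shown on this page; a minimal sketch of what it could look like, assuming NLTK's WordNetLemmatizer and the usual Penn Treebank to WordNet tag mapping:

from nltk.stem import WordNetLemmatizer

_LEMMATIZER = WordNetLemmatizer()
_TAG_MAP = {"N": "n", "V": "v", "J": "a"}  # nouns, verbs, adjectives

def _lemmatize(word, tag):
    """Lemmatize nouns, verbs and adjectives; leave other words alone."""
    wn_tag = _TAG_MAP.get(tag[0])  # Penn Treebank tags start with N/V/J
    return _LEMMATIZER.lemmatize(word, wn_tag) if wn_tag else word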
Mortality Rates in Alberta
Some example code snippets from the “mortality rates” project:
Excel: Month Name Lookup
To convert the month numbers to their names (or abbreviations), we can follow one of two strategies. The first is to use the CHOOSE function, which selects one entry out of a number of given options based on a supplied numeric input:

CHOOSE(<month number>, "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

A more elaborate approach is to create a separate lookup table and use the VLOOKUP function:

| Number | Name | Abbrev. |
|---|---|---|
| 1 | January | Jan |
| 2 | February | Feb |
| 3 | March | Mar |
| … | … | … |

VLOOKUP(<month number>, <lookup table range>, 3)

Here, <lookup table range> is the cell range of the lookup table, starting with the “Number” column and the first data row (“1, January, Jan”), and <month number> is a reference to a cell holding the number of the requested month. The third function parameter, 3, indicates that we want to return the value from the third column of the lookup table, i.e., the abbreviated month name; we could return the full name instead by setting it to 2.
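For readers more comfortable in Python than Excel, the standard library's calendar module provides the same lookup; a minimal sketch (the month number is an arbitrary example):

import calendar

month_number = 3                          # e.g., read from a data column
print(calendar.month_name[month_number])  # "March"
print(calendar.month_abbr[month_number])  # "Mar"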
Python/Plotly Visualizations
Once the necessary Pivot Tables have been created, we can use pandas to load and transform the data, and then visualize it using plotly.
data_loader.py
For convenience and reusability, loading the data from the Excel files is outsourced to a separate Python script.
Show Python code
import pandas as pd

DATA_PATH = "../_Data/01 Processed/"

with pd.ExcelFile(DATA_PATH + "2022-01-06-V01 Mortality Data.xlsx") \
        as xlsx_mort:
    # all age groups and M,F combined
    avg_mort = xlsx_mort.parse("Averaged Mortality Rates", header=0,
                               index_col=0)

months = avg_mort.index.tolist()
plot_mortality.py
The code below will create a graph with the averaged 2010–2019 data and the grey shaded “error band”.
Show Python code
import plotly.graph_objects as go

from data_loader import avg_mort, months

# create trace for averaged mortality data
avg_trace = go.Scatter(x=months, y=avg_mort["Average Mortality"],
                       name="2010–2019<br>"
                            "<sup>(Avg. ±3 Std.Dev.)</sup>",
                       mode="lines+markers", line_color="darkgrey")

# forward-backward strategy for error band,
# as per https://plotly.com/python/continuous-error-bars/
_avg_plus = (avg_mort["Average Mortality"]
             + 3 * avg_mort["StdDev of Mortality"])
_avg_minus = (avg_mort["Average Mortality"]
              - 3 * avg_mort["StdDev of Mortality"])
stdev_trace = go.Scatter(x=months + months[::-1],
                         y=_avg_plus.tolist() + _avg_minus.tolist()[::-1],
                         mode="lines",
                         fill="toself",
                         showlegend=False,
                         hoverinfo="none")
stdev_trace.line.width = 0
stdev_trace.fillcolor = "rgba(0,0,0,0.1)"  # slight transparency

# create figure object
fig = go.Figure(data=[avg_trace, stdev_trace])

# update legend to be horizontal and on top of the graph
fig.layout.legend.update({"orientation": "h",
                          "yanchor": "bottom",
                          "y": 1.0,
                          "valign": "top"})

# adjust axis titles
fig.layout.yaxis.title = "Mortality per 100,000"
fig.layout.xaxis.title = "Month"

# plot
fig.show()
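fig.show() displays the figure in a browser or notebook; if you would rather keep a copy, plotly figures can also be written out as interactive HTML (the output filename below is just an example):

fig.write_html("mortality_rates.html")  # hypothetical output path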
“How Does a Bike-Share Navigate Speedy Success?”
Some example code snippets from the “bike-share” project:
SQL: Station Locations
The following code was used to build a table named Stations_V02
which contains the name and geographic location for each bike-sharing station in Chicago, ignoring some testing and maintenance stations specified in _Ignore_Stations
. The location information is calculated from the average of the latitude/longitude data stored with each single trip record in the _RideData_Raw
data. However, this data is slightly “jittery” and occasionally includes data points with very low precision. To gauge the impact of these inaccuracies, two additional columns with the standard deviations of all averaged location data are included.
Show SQL code
/*
Get all distinct and valid station names,
calculate average latitude and longitude

yields table 'Stations_V02':
703 distinct rows, no nulls in name, lat, lon,
some nulls in standard deviations
*/
drop table if exists #coords;
drop table if exists #coords_avg;
drop table if exists Stations_V02;

create table Stations_V02 (
    station_name varchar(100) not null primary key,
    station_lat float not null,
    station_lon float not null,
    station_lat_stdev float,
    station_lon_stdev float
);

------------------------------------------------------------
-- collect all coordinate values by station names (start and end)
select
    trim(replace(start_station_name, '(*)', '')) as station_name,
    cast(start_lat as float) as station_lat,
    cast(start_lng as float) as station_lon
into #coords
from _RideData_Raw
union all
select
    trim(replace(end_station_name, '(*)', '')),
    cast(end_lat as float),
    cast(end_lng as float)
from _RideData_Raw;

------------------------------------------------------------
-- average all locations and store as table Stations_V02
insert into Stations_V02 (station_name, station_lat, station_lon,
                          station_lat_stdev, station_lon_stdev)
select
    station_name,
    avg(station_lat) as station_lat,
    avg(station_lon) as station_lon,
    stdev(station_lat) as station_lat_stdev,
    stdev(station_lon) as station_lon_stdev
from #coords
where (
    station_lat is not null
    and station_lon is not null
    and station_name is not null
    and station_name not in (select * from _Ignore_Stations)
)
group by station_name;
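For comparison, here is roughly what the same aggregation could look like in pandas (illustrative only; the CSV filename is hypothetical, and the actual pipeline runs in SQL):

import pandas as pd

# hypothetical CSV export of the _RideData_Raw table
rides = pd.read_csv("ride_data_raw.csv")

# stack start and end coordinates under common column names
start_cols = {"start_station_name": "station_name",
              "start_lat": "lat", "start_lng": "lon"}
end_cols = {"end_station_name": "station_name",
            "end_lat": "lat", "end_lng": "lon"}
coords = pd.concat([rides[list(start_cols)].rename(columns=start_cols),
                    rides[list(end_cols)].rename(columns=end_cols)])

# average per station, with standard deviations to gauge the jitter
stations = (coords.dropna()
            .groupby("station_name")[["lat", "lon"]]
            .agg(["mean", "std"]))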
R: Log-Normal Fit of Ride Durations
The following code was used to fit a lognormal distribution to the frequency of recorded ride durations (in minutes). The references to the knime.in and knime.out data frames interface this code snippet with the KNIME workflow in which it was included.
Show R code
suppressMessages(library(MASS))
suppressMessages(library(dplyr))
# get x data: all ride duration observations
x = filter(knime.in, ride_duration <= 500)$ride_duration
# lognormal fit
fit = fitdistr(x, "lognormal")
meanlog = coef(fit)[["meanlog"]]
sdlog = coef(fit)[["sdlog"]]
# make model curve for all observed ride duration points
x_unique = unique(x)
curve = dlnorm(x_unique, meanlog, sdlog)
# assign output (data and flow variables)
knime.out = data.frame(x_unique, curve)
knime.flow.out = list(meanlog = meanlog, sdlog = sdlog)
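Outside of KNIME, the same fit could be reproduced in Python with scipy; a minimal sketch (the durations array below is synthetic stand-in data, not the project's actual rides):

import numpy as np
from scipy import stats

# stand-in data: replace with the real ride durations (minutes, <= 500)
durations = stats.lognorm.rvs(s=0.8, scale=15, size=1000, random_state=0)

# fix the location at 0 so the fit matches R's two-parameter lognormal
shape, loc, scale = stats.lognorm.fit(durations, floc=0)
meanlog, sdlog = np.log(scale), shape  # map back to fitdistr's parameters
curve = stats.lognorm.pdf(np.unique(durations), shape, loc, scale)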
(Yes, I write lowercase SQL code and use = for assignments in R. Fight me!)