grace teng: Blog

Scraping With Nokogiri

Dated: Dec 2, 2021

Thinking about little things that I’ve done or played with recently that I am at liberty to share, here’s a fun little one using Nokogiri to solve an entirely self-inflicted problem.

I purchased Adrian Cantrill’s AWS Certified Solutions Architect Associate course and wanted to break it down into smaller sections so I could plan my learning. The course is built on Teachable and looks like this:

Image: Teachable course page, with list of lecture videos from Adrian Cantrill’s AWS Certified Solutions Architect Associate course

The lecture titles are there and so is the duration of each lecture. I wanted to get the title and duration of each into a spreadsheet, but there is no obvious way to do it. You can use an app like TextSniper which can extract text from screenshots and visual data, but since all the information I needed was actually contained in the HTML source of the Teachable course page, there was a cheaper and more interesting solution (at least more interesting to me): scrape the page.

To do this, I chose to use Ruby and Nokogiri, a Ruby gem for parsing XML and HTML strings. I could have used Python or JavaScript to do this instead, but I’m more comfortable with Nokogiri and still a Rubyist at heart.

How Nokogiri Works

The title of this section is a misnomer. I don’t know how Nokogiri works or the full extent of what it can do; I just know how to use it to find things in HTML files.

If you want to play along, you can install nokogiri using gem install nokogiri. (If you don’t have a Ruby environment set up, you may not have permission to install gems without sudo. I’m personally not a fan of running sudo gem install, but you do you.)

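The workhorse of XML parsing in Nokogiri is the Nokogiri::XML module, and the classes in the Nokogiri::HTML module inherit from their XML counterparts (Nokogiri::HTML::Document, for example, is a subclass of Nokogiri::XML::Document). To parse HTML with Nokogiri, you simply do this: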

require 'nokogiri'

document = Nokogiri::HTML('<html><head><title>Nokogiri</title></head><body class="select-me">Hello World</body></html>')

This parses the string into an instance of Nokogiri::HTML4::Document.

File.read returns a string and URI.open (from the open-uri library) returns an IO-like object. Nokogiri::HTML accepts either, so both can be used as sources for parsing as well:

require 'nokogiri'

file = 'parse_me.html'
document = Nokogiri::HTML(File.read(file))

require 'nokogiri'
require 'open-uri'

url = 'https://example.com/parse_me.html'
document = Nokogiri::HTML(URI.open(url))

Now that you have your Nokogiri document, you can traverse it like a graph, if that’s your thing:

puts document.root.name #=> prints "html"

document.root.children.each { |child| puts child.name }
#=> prints "head", "body"

If document is an instance of Nokogiri::HTML4::Document, what is document.root an instance of?

puts document.root.class
#=> prints "Nokogiri::XML::Element"

What about root’s children?

puts document.root.children.class
#=> prints "Nokogiri::XML::NodeSet"

Nokogiri::XML::Element is a subclass of Nokogiri::XML::Node, and Node includes the Nokogiri::XML::Searchable module. Searchable gives us the #css instance method, which will search this node and all its descendants and return a NodeSet of all the elements that match a given CSS selector.

Now we have the ability to extract HTML elements from the document based on their CSS selectors:

document.css('.select-me').each do |element|
    puts element.text
end

#=> Prints "Hello World"

Extracting Video Titles

Back to the problem at hand.

I saved the HTML of the Teachable page locally and studied it. The lecture titles, it turns out, are really easy to extract:

Image: HTML code from Teachable’s course page

<span class="lecture-name"> AWS Accounts - The basics (11:33) </span>

All we need is to target the .lecture-name CSS selector.

I included the whole chunk of code in the image because it reveals something interesting: Teachable uses Turbolinks. That doesn’t definitively imply that Teachable is a Rails app… but it basically implies that Teachable is a Rails app.

Great: now we can get a NodeSet of all the elements containing the video titles and runtimes, and #map it to get just the text of the element.

require 'nokogiri'

my_file = 'my_file.html'
document = Nokogiri::HTML(File.read(my_file))
lecture_names = document.css('.lecture-name')

lecture_names = lecture_names.map do |name|
    name.text.strip.split.join(' ')
end

Now lecture_names is simply an array of strings, each containing the video title and runtime:

Image: a list of video lecture titles from Adrian Cantrill’s course, printed in the console

Separating Title And Runtime

The next step is to identify which portion of the text is the title, and which is the runtime. For this, there is a powerful tool, loved by some and feared by most:

Image: playing with regular expressions on regexr.com

This is the view from regexr, my favourite tool for writing regular expressions. It breaks down what exactly the regular expression is parsing, highlights where the matches are, and allows you to write tests to check the regex against.

Two related regexes are needed: one regex identifies whether there is a runtime at all, and the other parses the string into a title and a runtime. I’ll spare you the part where we put the regex together, and simply give you the two regexes:

time_regex = /\(([:\d]+)\)$/
title_time_regex = /(?<title>.+)\s\((?<time>[:\d]+)\)$/

They’re not semantically perfect. time_regex will match Reserved Instances (:::::) or Reserved Instances (12345), for example, but since the input data is clean, we don’t need to worry about that.

Now, given a string, we can determine if there is a video runtime listed at the end of the string:

time_regex = /\(([:\d]+)\)$/
lecture_name = "Serverless Architecture (12:55)"
reading_name = "IMPORTANT, READ ME !!"

lecture_name =~ time_regex  #=> returns 24 (index where match begins)
reading_name =~ time_regex  #=> returns nil

Once we know which strings contain video titles and which ones contain titles of readings, we can perform the match:

title_time_regex = /(?<title>.+)\s\((?<time>[:\d]+)\)$/
lecture_name = "Serverless Architecture (12:55)"
matchdata = lecture_name.match(title_time_regex)

puts matchdata[:title]  #=> prints "Serverless Architecture"
puts matchdata[:time]   #=> prints "12:55"

Putting It All Together

# nokogiri.rb
require 'nokogiri'

my_file = 'my_file.html'
document = Nokogiri::HTML(File.read(my_file))
lecture_names = document.css('.lecture-name')

# matches strings ending with (xx:xx), where x is a digit
time_regex = /\(([:\d]+)\)$/

# captures title and time from a string
title_time_regex = /(?<title>.+)\s\((?<time>[:\d]+)\)$/

lecture_names.each do |lecture_name|
    # strips out excess whitespace
    formatted_text = lecture_name.text.strip.split.join(' ')

    if formatted_text =~ time_regex
        # string contains video runtime
        matchdata = formatted_text.match(title_time_regex)
        puts "#{matchdata[:title]}\t#{matchdata[:time]}"
    else
        # string does not contain video runtime
        puts formatted_text
    end
end

Note the use of \t to separate the lecture title from the lecture time. Essentially, what this does is produce tab-separated output. Given that the goal is to import the result into a spreadsheet, TSV makes a lot of sense. CSV could work too, but we’d need to account for commas in lecture titles. TSV works just fine.

$ ruby nokogiri.rb > lecture_list.tsv

Voilà, a file that I can import into Excel or Google Sheets, and use to make a study plan.

Image: a list of video lecture titles and runtimes in Google Sheets

Handling date strings and timezones in JavaScript

Dated: Feb 16, 2021

Here is a simple scenario: you’re writing client-side JavaScript. You query an API with a city name, and it returns with a bunch of useful information about the city, including the timezone, in this format:

{
    "city": "New York",
    "timezone": -18000
}

You want to use this information to display the local date and time in that city, as a string that looks something like this:

Thu, 11 Feb 2021, 1:08 am

How do you do this?

New York City’s timezone is Eastern Standard Time, or UTC-05:00. The API gives us this information in the form -18000, or 18000 seconds behind Coordinated Universal Time (UTC). (We’re not going to worry about Daylight Savings Time, and we’ll simply assume that the API is going to give us the correct time offset in seconds when DST kicks in.)

We need to find some way of representing time in seconds, or in something that can be converted to and from seconds. Then we need to turn that number into a string.

This is a good time to explain how computers represent dates and times.

Timestamps explained

We take timekeeping for granted in 2021, but there’s a lot of maths and science and engineering that goes into keeping track of time. (Don’t believe me? Check out Jack Forster’s amazing article on horology’s Easter problem, or his equally riveting read on modern chronometer watches.) Fortunately, for our purposes, we can operate at a fairly high level of abstraction, and we don’t have to get into the nitty-gritty of timekeeping.

Broadly speaking, we need two things to keep track of time: first, we need a point in time that we can use as a reference, and secondly, we need to be able to count the time elapsed since that reference time.

The good news is that where computers are concerned, counting elapsed time isn’t a concern. Computers can count milliseconds to a high degree of accuracy: after all, your computer has a clock generator in it that oscillates billions of times a second (i.e. at a frequency of several gigahertz, or whatever your processor’s clock speed is). All that remains is to know what time computers are counting from.

Enter the Unix epoch: 1 January 1970, 0:00:00:000 UTC.

In most programming languages, objects and classes that handle date and time manipulation will represent the time as a timestamp, or the number of seconds that have passed since the Unix epoch. (Because of leap seconds, that’s not strictly true, but we can ignore that corner case.)

For example, as I write this on Thursday, 11 Feb 2021 at 2:08 pm at UTC+08:00, the current Unix timestamp is 1613023705.

Human-readable dates and times

Computers are very good at counting large numbers, but humans are not. We need to turn the timestamp into something meaningful for humans. This is where JavaScript’s Date object comes in useful: we can create a new Date object by passing in a timestamp, then we’ll have access to a whole host of Date methods, including toString() and toLocaleString(), which help us convert the Date object to human-readable strings.

Let’s try it:

const date = new Date(1613023705);
console.log(date.toString());

On my machine, this prints Mon Jan 19 1970 23:33:43 GMT+0730 (Singapore Standard Time). On yours, it might print a different time, depending on what timezone you are in. In any case, it’s clearly the incorrect time. What happened?

As it turns out, JavaScript timestamps are not calculated as the number of seconds since 1 Jan 1970 0:00:00:000 UTC, but as the number of milliseconds since that time. We’ll need to multiply our timestamp by 1000:

const correctDate = new Date(1613023705 * 1000);
console.log(correctDate.toString());

This prints out Thu Feb 11 2021 14:08:25 GMT+0800 (Singapore Standard Time) on my machine.
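Since the seconds-versus-milliseconds mix-up is so easy to make, here is a minimal sketch of how the conversion could be wrapped up. The helper name dateFromUnixSeconds is my own invention for illustration, not something built into JavaScript. It prints with toISOString(), which always reports UTC, so the result does not depend on the machine's timezone.

```javascript
// Hypothetical helper (illustrative name, not a built-in):
// converts a Unix timestamp in seconds into a JavaScript Date.
function dateFromUnixSeconds(seconds) {
    // Date expects milliseconds since the epoch, so scale up by 1000
    return new Date(seconds * 1000);
}

const fromSeconds = dateFromUnixSeconds(1613023705);

// toISOString() always formats in UTC, regardless of the local timezone
console.log(fromSeconds.toISOString()); // prints "2021-02-11T06:08:25.000Z"

// Going the other way: getTime() returns milliseconds, so divide by 1000
console.log(Math.floor(fromSeconds.getTime() / 1000)); // prints 1613023705
```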

Timestamps and timezones

All right, we’re one step closer, but we need to somehow convert the time from GMT+08:00 to UTC-05:00.

Computer timestamps based on an epoch are always in UTC, with no timezone offset. This makes sense: the number of seconds that have passed at the Greenwich Meridian since midnight of 1 January 1970 is the same no matter where in the world you are.

Manipulating time is as gnarly in JavaScript as it is in actual time travel, however. MDN’s documentation for Date gives us the following innocuous warning:

Note: It’s important to keep in mind that while the time value at the heart of a Date object is UTC, the basic methods to fetch the date and time or its components all work in the local (i.e. host system) time zone and offset.

So your date is stored in UTC, but most of the Date methods will return results in the runtime’s local time zone (UTC+08:00 in my case), and you need to output date and time in a third time zone (UTC-05:00, in our example). Great.

Looking down the list of Date methods, what do we have at our disposal? toString() will always return a date and time string based on the runtime’s time zone, so that’s out. toUTCString() will always return a date and time string based on UTC, so that’s out too.

toLocaleString() accepts an options argument that lets you set the timeZone property — this could be useful to us. How do we specify the timezone we need? MDN helpfully points us to the documentation for the Intl.DateTimeFormat() constructor, which has a list of all the options that we can give to toLocaleString() for date and time formatting. Scroll down to timeZone, and let’s see what we have:

timeZone: The time zone to use. The only value implementations must recognize is “UTC”; the default is the runtime’s default time zone. Implementations may also recognize the time zone names of the IANA time zone database, such as “Asia/Shanghai”, “Asia/Kolkata”, “America/New_York”.

How can we convert { city: "New York", timezone: -18000 } into America/New_York to pass as an argument to toLocaleString()? We could use a Map and map the number of seconds offset to one of the IANA time zones…

const timeZones = new Map();
timeZones.set(-39600, 'Pacific/Niue');
timeZones.set(-36000, 'Pacific/Honolulu');
// etc...

You still need to be careful when choosing the timezones, because the IANA time zones take daylight savings into account. For example, imagine you had a Map that looked like this:

// etc...
timeZones.set(-21600, 'America/Chicago');
timeZones.set(-18000, 'America/New_York');
// etc...

If you queried the city of Chicago during daylight savings time, the API might respond with a time offset of -18000. That corresponds to the timezone of America/New_York in your Map. toLocaleString() then applies New York’s daylight savings time offset, which is -14400 instead of -18000.
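To make the pitfall concrete, here is a rough sketch. It assumes a runtime with IANA time zone data (recent Node.js and browsers have it), and the summer date and the offset "returned by the API" are made-up values for illustration.

```javascript
const offsetToZone = new Map([
    [-21600, 'America/Chicago'],
    [-18000, 'America/New_York'],
]);

// 1 Jul 2021, 17:00:00 UTC: daylight saving time is in effect in the US
const summer = new Date(Date.UTC(2021, 6, 1, 17, 0, 0));

// Assumed API response for Chicago in summer (CDT is UTC-05:00)
const chicagoSummerOffset = -18000;
const mappedZone = offsetToZone.get(chicagoSummerOffset);
console.log(mappedZone); // prints "America/New_York"

const timeOnly = { hour: '2-digit', minute: '2-digit', hourCycle: 'h23' };

// What the Map lookup gives us: New York's summer time, one hour too late
const viaMap = summer.toLocaleString('en-GB', { ...timeOnly, timeZone: mappedZone });

// What we actually wanted: Chicago's local time
const actual = summer.toLocaleString('en-GB', { ...timeOnly, timeZone: 'America/Chicago' });

console.log(viaMap); // 13:00 in New York
console.log(actual); // 12:00 in Chicago
```

The lookup itself succeeds, which is what makes this bug easy to miss: nothing fails, the clock is simply an hour out for half the year.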

Or we could avoid IANA time zones altogether, and just math instead.

Detour: Do Not Do This

At this point, you might spot the getUTCHours() and setUTCHours() methods. getUTCHours() returns an integer between 0 and 23, representing the hour in UTC time in your Date object. For the correctDate Date object that we’ve been playing with, getUTCHours() returns 6, since it is 6 am in Greenwich, London when it is 2 pm in Singapore.

setUTCHours() takes the hour as its first argument, normally an integer between 0 and 23, and updates the hour in UTC in your Date object. (It also accepts optional minutes, seconds and milliseconds.)

“Aha!” you might think. “Let’s calculate how many hours we need to add or subtract, and use setUTCHours() to manually offset the time! Then let’s print using toUTCString(), so we don’t have to worry about the user’s timezone!”

const correctDate = new Date(1613023705 * 1000);
const targetTimezone = -18000; // or whatever number the API returns
const offsetHours = targetTimezone / 3600;
const utcHours = correctDate.getUTCHours();
correctDate.setUTCHours(utcHours + offsetHours);
correctDate.toUTCString();

This gives us Thu, 11 Feb 2021 01:08:25 GMT. Of course, GMT is the incorrect timezone, but we’ll have to live with it. We can easily truncate the GMT timezone out of the string if we don’t need it, or replace it with the correct timezone. Perfect solution!

Wrong.

Leaving aside UTC offsets that are not a full hour (e.g. Iran at UTC+03:30, India at UTC+05:30, Nepal at UTC+05:45), the first thing you might worry about is utcHours + offsetHours being less than 0 or more than 23. (Left alone, setUTCHours() rolls an out-of-range hour over into the previous or next day, but that is easy to miss.) No big deal, we can just check for those cases, right?

let localHours = utcHours + offsetHours;
if (localHours < 0) localHours += 24;
if (localHours > 23) localHours -= 24;
correctDate.setUTCHours(localHours);

Now you’re in trouble, because we don’t just want to display the local time, we also want to display the local date, and now your Date object is one day ahead of or one day behind the actual local date.

Go ahead, try it with an offset that’s big enough to trigger this problem:

const correctDate = new Date(1613023705 * 1000);
const targetTimezone = -36000; // or whatever number the API returns
const offsetHours = targetTimezone / 3600;
const utcHours = correctDate.getUTCHours();
let localHours = utcHours + offsetHours;
if (localHours < 0) localHours += 24;
if (localHours > 23) localHours -= 24;
correctDate.setUTCHours(localHours);
correctDate.toUTCString();

This returns Thu, 11 Feb 2021 20:08:25 GMT, which is one day ahead of the actual date in Honolulu based on the timestamp that we provided to the Date object.
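Two details of setUTCHours() are worth checking for yourself at this point. As far as I can tell from MDN and the language specification, an out-of-range hour is rolled over into the neighbouring day (so the manual wrap-around above is what actually loses the date), while a fractional hour is truncated to a whole number (so half-hour offsets still break). The sketch below tries both; the variable names are mine.

```javascript
const timestampMs = 1613023705 * 1000; // Thu, 11 Feb 2021 06:08:25 UTC

// Honolulu (UTC-10:00): pass the out-of-range hour straight through
const honoluluDate = new Date(timestampMs);
honoluluDate.setUTCHours(honoluluDate.getUTCHours() - 10); // 6 - 10 = -4
console.log(honoluluDate.toUTCString());
// prints "Wed, 10 Feb 2021 20:08:25 GMT": the date rolled back a day

// India (UTC+05:30): the half hour is silently dropped
const indiaDate = new Date(timestampMs);
indiaDate.setUTCHours(indiaDate.getUTCHours() + 5.5); // 11.5 is treated as 11
console.log(indiaDate.toUTCString());
// prints "Thu, 11 Feb 2021 11:08:25 GMT": 30 minutes short of 11:38:25
```

Either way, poking at individual date components is fragile, which is a good argument for moving the whole timestamp instead.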

There’s a better solution along these lines, which is to apply the offset directly to the timestamp.

Moving through time instead of space

Let’s start over, but this time instead of trying to manipulate a Date object’s timezone, let’s just give the Date object a different timestamp altogether:

const date = new Date((1613023705 - 18000) * 1000);
console.log(date.toUTCString());

Here, we’re applying the offset of -18000 directly to the timestamp, then creating a Date object out of it. Calling toUTCString() on this Date object gives us Thu, 11 Feb 2021 01:08:25 GMT. Now we have the correct date and time, but the incorrect timezone. If you don’t need to display the timezone and you’re okay with the format of toUTCString(), you can simply truncate the trailing GMT; otherwise, use toLocaleString('en-GB', { timeZone: 'UTC' }) and specify your own set of formatting options.
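Here is one way the toLocaleString() route might look when aiming for the Thu, 11 Feb 2021, 1:08 am format from the start of this post. The function name formatCityTime and the particular set of formatting options are my own choices for this sketch. The exact punctuation and spacing of the output can vary between JavaScript runtimes, so treat the comments as the local date and time being represented rather than a guaranteed string.

```javascript
// Hypothetical helper (illustrative name): formats the local date and time
// for a city, given a Unix timestamp and the city's UTC offset, both in seconds.
function formatCityTime(unixSeconds, offsetSeconds) {
    // Shift the timestamp by the offset, then read it back "as if" it were UTC
    const shifted = new Date((unixSeconds + offsetSeconds) * 1000);
    return shifted.toLocaleString('en-GB', {
        timeZone: 'UTC', // ignore the runtime's own timezone
        weekday: 'short',
        day: 'numeric',
        month: 'short',
        year: 'numeric',
        hour: 'numeric',
        minute: '2-digit',
        hour12: true,
    });
}

const newYorkTime = formatCityTime(1613023705, -18000);  // Thu 11 Feb 2021, 1:08 am
const honoluluTime = formatCityTime(1613023705, -36000); // Wed 10 Feb 2021, 8:08 pm
const indiaTime = formatCityTime(1613023705, 19800);     // Thu 11 Feb 2021, 11:38 am

console.log(newYorkTime);
console.log(honoluluTime);
console.log(indiaTime);
```

Because the offset is applied in seconds before the Date is created, offsets that are not a full hour (like India's 19800) and offsets that cross midnight (like Honolulu's -36000) need no special handling.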

I must admit that this solution is not very satisfying, because conceptually, this is not the “correct” use of the UNIX timestamp or of timezones. We aren’t moving through timezones, we’re actually moving through UTC time itself. Ideally, we would be able to store both the timestamp and our desired time offset in a single Date object, or we would be able to give toLocaleString() the time offset that we want in an alternative format (like… oh, I don’t know, the number of milliseconds?) instead of in the form of IANA time zones.

Nonetheless, if your goal is to display the time in a specific time zone that isn’t dependent on the user’s local time, this is a workable solution. I’d love to hear of any alternatives in vanilla JS.

Python cheatsheet for Ruby devs

Dated: Jan 19, 2021

I cut my programming teeth at Le Wagon, where the bulk of coding time is spent on Ruby. I’ve also started in the Masters in Computer and Information Technology program at Penn, where the teaching languages are Python and Java. Naturally, it’s been a trip taking my Ruby cap off and putting a Python hat on. The Python hat isn’t too comfortable yet, but I’m sure it’ll break in as I get more Python under my fingers.

As I work more and more with Python, I’ve been putting together a mental cheatsheet for “translating” Ruby to Python, and it’s time for me to take the cheatsheet out of my brain and put it into writing. I know I’m not the only programmer who has moved from Ruby to Python, so I hope others will find this useful. (But hey, even if nobody else finds it useful, it’s helpful for me to put this in writing!)

Some basic stuff first

Integer and float division in Ruby:

5 / 2 #=> returns 2
5 / 2.0 #=> returns 2.5

Integer and float division in Python:

5 / 2 #=> returns 2.5
5 // 2 #=> returns 2

String interpolation in Ruby:

planet = 'world'
puts "Hello #{planet}!" #=> prints "Hello world!"

Formatted strings in Python:

planet = 'world'
print(f'Hello {planet}!') #=> prints "Hello world!"

String manipulation in Ruby and Python

Split a string into an array using a separator

Also known in PHP as explode(), still my favourite name for this operation.

Ruby:

'abracadabra'.split('a')
# returns ['', 'br', 'c', 'd', 'br']

Python:

'abracadabra'.split('a')
# returns ['', 'br', 'c', 'd', 'br', '']

Split a string on whitespace

Ruby:

'the quick brown fox'.split
# returns ['the', 'quick', 'brown', 'fox']

Python:

'the quick brown fox'.split()
# returns ['the', 'quick', 'brown', 'fox']

So far, so good.

Split a string using a regular expression

Ruby:

'a1b12c123d1234'.split(/\d+/)
# returns ['a', 'b', 'c', 'd']

You’ll probably be using Ruby’s built-in Regexp methods for complex operations involving regular expressions, but for splits on a simple regex, the String#split method works just fine.

In Python, regular expression operations require the re module:

import re
re.split(r'\d+', 'a1b12c123d1234')
# returns ['a', 'b', 'c', 'd', '']

Join array/list elements into a string

Also known in PHP as implode(), still my favourite name for this operation.

Ruby:

['the', 'quick', 'brown', 'fox'].join(' ')
# returns 'the quick brown fox'

Python:

' '.join(['the', 'quick', 'brown', 'fox'])
# returns 'the quick brown fox'

🤯

In Ruby, join is a method called on an array taking a string as an argument. In Python, join is a method called on a string taking a list (array) as an argument.

Enumerable patterns and list comprehension

Quickly generate an array from a range

Ruby:

array = (1..5).to_a
# array = [1, 2, 3, 4, 5]

The literal “translation” of this in Python is:

array = list(range(1, 6))
# array = [1, 2, 3, 4, 5]

However, to be a true Pythonista, you must use list comprehension wherever list comprehension can be used:

array = [i for i in range(1, 6)]
# array = [1, 2, 3, 4, 5]

Apply the same operation to all elements of an array

Ruby:

array = (1..5).to_a
squares = array.map { |i| i**2 }
# squares = [1, 4, 9, 16, 25]

Again, you can do this using a combination of list() and map() in Python:

array = [i for i in range(1, 6)]
squares = list(map(lambda i: i**2, array))
# squares = [1, 4, 9, 16, 25]

Yuck. Instead, use list comprehension:

array = [i for i in range(1, 6)]
squares = [i**2 for i in array]
# squares = [1, 4, 9, 16, 25]

Iterate over an array/a list with indices

Ruby:

array = ['Alice', 'Bob', 'Charlie']
array.each_with_index do |element, index|
    puts "Index #{index}: #{element}"
end
# prints the following:
# Index 0: Alice
# Index 1: Bob
# Index 2: Charlie

Python:

names = ['Alice', 'Bob', 'Charlie']  # avoid shadowing the built-in list()
for index, element in enumerate(names):
    print(f'Index {index}: {element}')
# prints the following:
# Index 0: Alice
# Index 1: Bob
# Index 2: Charlie

Iterate over a hash/a dictionary

Ruby:

hash = { 'Alice': 9, 'Bob': 11, 'Charlie': 14 }
hash.each { |key, value| puts "#{key}: #{value}" }
# prints the following:
# Alice: 9
# Bob: 11
# Charlie: 14

Python:

dict = { 'Alice': 9, 'Bob': 11, 'Charlie': 14 }
for key, value in dict.items():
    print(f'{key}: {value}')
# prints the following:
# Alice: 9
# Bob: 11
# Charlie: 14

Let’s take a step back and see what happens if you iterate over dict instead of dict.items(). In that case, the for loop will iterate over the keys only:

dict = { 'Alice': 9, 'Bob': 11, 'Charlie': 14 }
for i in dict:
    print(i)
# prints the following:
# Alice
# Bob
# Charlie

You can still access the values using the key, of course:

dict = { 'Alice': 9, 'Bob': 11, 'Charlie': 14 }
for i in dict:
    print(f'{i}: {dict[i]}')
# prints the following:
# Alice: 9
# Bob: 11
# Charlie: 14

Iterating over a hash in Ruby, on the other hand, always yields an array of two elements per iteration, with the first element being the key and the second element being the value:

hash = { 'Alice': 9, 'Bob': 11, 'Charlie': 14 }
hash.each do |i| 
    pp i
    puts "#{i[0]}: #{i[1]}"
end
# prints the following:
# [:Alice, 9]
# Alice: 9
# [:Bob, 11]
# Bob: 11
# [:Charlie, 14]
# Charlie: 14

With dict.items() in Python, what’s really happening is that dict.items() is returning a tuple of two elements per iteration, with the first element being the key and the second element being the value:

dict = { 'Alice': 9, 'Bob': 11, 'Charlie': 14 }
for i in dict.items():
    print(i)
    print(f'{i[0]}: {i[1]}')
# prints the following:
# ('Alice', 9)
# Alice: 9
# ('Bob', 11)
# Bob: 11
# ('Charlie', 14)
# Charlie: 14

Keyword arguments / last argument hash

I’m not touching that hot potato.
