Blog-2018-07-03

I mentioned a while back that I’d moved to using my NUC as a backup storage device, and that continues to be a core use case after I repaved and moved the thing back over to Ubuntu.

Fortunately, as a file server, Linux is definitely more capable and compatible than macOS (which is why, back when it was a Hackintosh, I used a Linux VM as the SMB implementation on my LAN), and so I’ve already got backups re-enabled and working beautifully.

But the next step is enabling offsite copies.

Previously, I achieved this with Google Drive for macOS, backing up the backup directory to the cloud, a solution which worked pretty well overall! Unfortunately, Google provides no client for Linux, which left me in a bit of a jam.

Until I discovered the magic that is rclone.

rclone is, plain and simple, a command-line interface to cloud storage platforms. And it’s an incredibly capable one! It supports one-way folder synchronization (it doesn’t support two-way sync, but fortunately I don’t need that capability), which makes it the perfect solution for syncing a local backup folder up to an offsite, cloud-stored copy.
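Day-to-day use is a one-liner. Roughly like this (the remote name and paths here are just for illustration, and the remote itself gets set up interactively via rclone config):

  # one-time, interactive setup of a "gdrive" remote
  rclone config

  # one-way sync: make the remote copy match the local backup folder
  rclone sync /srv/backups gdrive:backups --progress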

But wait, there’s more!

rclone also supports encryption. And that means that (assuming I don’t lose the keys… they’re safely stored in my KeePass database, which is itself cloned to multiple locations using my other favourite tool, Syncthing) I can protect those offsite backups from prying eyes, something which Google’s Drive sync tool does not offer.
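The way this works is that you layer a “crypt” remote on top of the real one (again via rclone config) and then sync to the crypt remote instead. A sketch, with made-up names:

  # "secret" is a crypt remote wrapping gdrive:backups; file contents and
  # file names get encrypted with keys derived from the passwords you supply
  rclone sync /srv/backups secret: --progress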

I can also decide when I want the synchronization to occur! I don’t need offsites done daily. Weekly would be sufficient, and that’s a simple crontab entry away.
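Something along these lines, for example (the paths, the remote name from the sketch above, and the schedule are all just illustrative):

  # crontab entry: every Sunday at 03:00, push the backups offsite
  0 3 * * 0  /usr/bin/rclone sync /srv/backups secret: --log-file /home/me/rclone-offsite.log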

Now, to be clear, rclone would have worked just as well on the Hackintosh, so if you’re a Mac user who’d like to take advantage of rclone’s capabilities, you can absolutely do so! But for this Linux user, it was a pleasant surprise!

Posted on 2018-07-03


Blog-2018-07-02

Well, it’s finally happened. After a year of running semi-successfully, I’ve decided the trouble wasn’t worth it and it was time to retire the Hackintosh.

As a project it was certainly a lot of fun, and macOS definitely has its attraction. In the end, though, I found my NUC was serving a few core functions for which macOS wasn’t uniquely or especially suited:

  1. Torrent download server
  2. Storage target for laptop backups
  3. Media playback of local content as well as Netflix

Of course, the original plan was to also use the machine as a recording workstation, but so far it hasn’t worked out that way. Yet, anyway.

Prior to a recent security update, there were a couple of issues that I generally worked around:

  1. Onboard Bluetooth and the SD card reader don’t work.
  2. The onboard Ethernet adapter stopped working after heavy utilization.
  3. Built-in macOS SMB support is broken when used as a target for Windows file-based backups.

These issues were resolved by:

  1. Avoiding unsupported hardware.
  2. Using an external Ethernet dongle.
  3. Performing all SMB file serving via a Linux VM running on VirtualBox.

Meanwhile, the threat of an OS update breaking the system always weighed on me.

Unfortunately, it didn’t weigh on me enough: the 2018-001 macOS security update broke things pretty profoundly, as the Lilu kernel extension started crashing the system on boot.

A bit of research led me to update Lilu, plus a couple of related kexts while I was at it, which brought the system back to a basically functioning state, except that now:

  1. HDMI audio no longer worked.
  2. The USB Ethernet dongle stopped working.

The first issue rendered the machine unusable as a video playback device, a use case which is surprisingly common (my office is a very cozy place to watch Star Trek or MST3K!).

The second left me with a flaky file/torrent server.

In short, all the major use cases I had for the Hackintosh no longer worked reliably, or at all.

Meanwhile, nothing I was doing uniquely relied on macOS. I’ll probably never get into iOS development, and the only piece of software I’d love to have access to is OmniGraffle (I never got far enough into recording tools to get attached to them).

So, no major benefits and a whole lot of pain meant that, if I’m being pragmatic, the Hackintosh was no longer serving a useful function.

So what did I replace it with?

Ubuntu 18.04, of course!

So far it’s been a very nice experience, with the exception of systemd-resolved, which makes me want to weep silently (it was refusing to resolve local LAN domain names for reasons I never figured out). Fortunately, that was easily worked around, and I’m now typing this on a stable, capable, compatible Linux server/desktop.
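For anyone who hits the same thing, the usual workaround is to hand systemd-resolved your LAN’s DNS server and search domain explicitly and restart the service (the address and domain below are examples; adjust them for your network):

  # /etc/systemd/resolved.conf
  [Resolve]
  DNS=192.168.1.1
  Domains=lan

  # then restart the resolver
  sudo systemctl restart systemd-resolved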

When I do finally get back to recording, I’ll install a low-latency kernel, JACK, and Ardour, and then move on with my life!

Posted on 2018-07-02


Blog-2018-01-02

My Google Groups web scraping exercise left me with an archive of over 2400 messages, of which 336 were written by yours truly. These messages were laid down in a set of files, each containing JSON payloads of messages and associated metadata.

But… what do I do with it now?

Obviously the goal is to be able to explore the messages easily, but that requires a user interface of some kind.

Well, the obvious user interface for a large blob of JSON-encoded data is, of course, HTML, and so began my next mini-project.

First, I took the individual message group files and concatenated them into a single large JSON structure containing all the messages. Total file size: 4.88MB.
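Conceptually, the merge step is just a few lines of Node; a sketch of the idea (file and variable names here are illustrative, not the actual ones):

  const fs = require('fs');
  const path = require('path');

  // Read every per-thread JSON file and collect the parsed objects into one array.
  const dir = 'messages';
  const threads = fs.readdirSync(dir)
    .filter(name => name.endsWith('.json'))
    .map(name => JSON.parse(fs.readFileSync(path.join(dir, name), 'utf8')));

  // Emit the combined archive as a script payload the HTML page can load directly.
  fs.writeFileSync('archive.js', 'var archive = ' + JSON.stringify(threads) + ';');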

Next, I created an empty shell HTML file, loaded in jQuery and the JSON data as individual scripts, and then wrote some code to walk through the messages and build up a DOM representation that I could format with CSS. The result is simple but effective! Feel free to take a look at my Usenet Archive here. But be warned, a lot of this is stuff I posted when I was as young as 14 years old…
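Structurally, the shell is about as minimal as it sounds; something like this (file and element names are illustrative):

  <!DOCTYPE html>
  <html>
  <head>
    <meta charset="utf-8">
    <title>Usenet Archive</title>
  </head>
  <body>
    <div id="messages"></div>
    <!-- jQuery, then the data payload, then the rendering code -->
    <script src="jquery.min.js"></script>
    <script src="archive.js"></script>
    <script src="render.js"></script>
  </body>
  </html>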

Usage is explained in the document, so hopefully it should be pretty self-explanatory.

Anyway, this got me thinking about the possibilities of JSON as an archival format for data, and HTML as the front-end rendering interface. The fact that I can ship a data payload and an interactive UI in a single package is very interesting!

Update: I also used this project as an opportunity to experiment with ES6 generators as a method for browser timeslicing. If you look at the code, it makes use of a combination of setTimeout and a generator to populate the page while keeping the browser responsive. This, in effect, provides re-entrant, cooperative multitasking by allowing you to pause the computation and hand control back to the browser periodically. Handy! Of course, it requires a semi-modern browser, but lucky for me, I don’t much care about backward compatibility for this little experiment!
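The pattern looks roughly like this (a sketch, not the actual code; it assumes the archive variable and #messages container from the earlier sketches):

  // Generator that appends message DOM nodes in batches, yielding between batches.
  function* renderMessages(threads) {
    var count = 0;
    for (var thread of threads) {
      for (var msg of thread.messages) {
        $("#messages").append(
          $("<div>").addClass("message")
                    .append($("<h3>").text(msg.username + " - " + msg.date))
                    .append($("<pre>").text(msg.message)));
        if (++count % 50 === 0) {
          yield;  // pause here and let the browser catch up
        }
      }
    }
  }

  // Drive the generator with setTimeout so each batch runs in its own task,
  // giving the browser a chance to repaint and handle input in between.
  function pump(gen) {
    if (!gen.next().done) {
      setTimeout(function() { pump(gen); }, 0);
    }
  }

  pump(renderMessages(archive));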

Posted on 2018-01-02


Blog-2018-01-01

In the past, web scraping involved a lot of offline scripting and parsing of HTML, either through a library or, for quick-and-dirty work, manual string transformations. The work was always painful, and as the web has become more dynamic, this offline approach has gone from painful to essentially impossible… you simply cannot scrape the contents of a modern website without a Javascript engine and a DOM implementation.

The next generation of web scraping came in the form of tools like Selenium. Selenium uses a scripting language, along with a browser-side driver, to automate browser interactions. The primary use case for this particular stack is actually web testing, but because it drives a full browser that loads dynamic content and can simulate human interactions with a site, it also enables scraping of even the most dynamic sites out there.

Then came PhantomJS. PhantomJS took browser automation to the next level by wrapping a headless browser engine in a Javascript API. Using Javascript, you could then instantiate a browser, load a site, and interact with the page using standard DOM APIs. No longer did you need a secondary scripting language or a browser driver… in fact, you didn’t even need a GUI! Again, one of the primary use cases for this kind of technology is testing, but site automation in general, and scraping in particular, are excellent use cases for Phantom.

And then the Chrome guys came along and gave us Puppeteer.

Puppeteer is essentially PhantomJS, but built on the Chromium browser engine and delivered as an npm package you run atop Node. Current benchmarks indicate Puppeteer is faster and uses less memory, all while using a more up-to-date browser engine.

You might wonder why I started playing with Puppeteer.

Well, it turns out Google Groups is sitting on a pretty extensive archive of old Usenet posts, some of which I’ve written, dating back as early as ’94. I wanted to archive those posts for myself, but discovered Groups provides no mechanism or API for pulling bulk content from its archive.

For shame!

Fortunately, Puppeteer made this a pretty easy nut to crack: just challenging enough to be fun, but easy enough to be done in a day. And thus I had the perfect one-day project during my holiday! The resulting script is roughly 100 lines of Javascript and is mostly reliable (unless Groups takes an unusually long time loading some of its content):

const puppeteer = require('puppeteer')
const fs = require('fs')

// The list of Google Groups thread URLs to scrape; populate this before running.
const urls = []

async function run() {
  var browser = await puppeteer.launch({ headless: true });

  async function processPage(url) {
    const page = await browser.newPage();

    // The obfuscated class names below are the CSS classes Google Groups
    // generated at the time of writing.
    await page.goto(url);
    await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
    await page.waitForFunction('$(".F0XO1GC-nb-Y").find("[dir=\'ltr\']").length > 0');
    await page.waitForFunction('$(".F0XO1GC-nb-Y").find("._username").text().length > 0');

    await page.exposeFunction('escape', async () => {
      page.keyboard.press('Escape');
    });

    await page.exposeFunction('log', async (message) => {
      console.log(message);
    });

    var messages = await page.evaluate(async () => {
      function sleep(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
      }

      var res = []

      await sleep(5000);

      var messages = $(".F0XO1GC-nb-Y");
      var texts = messages.find("[dir='ltr']").filter("div");

      for (let msg of messages.get()) {
        // Open the message menu
        $(msg).find(".F0XO1GC-k-b").first().click();

        await sleep(100);

        // Find the link button
        $(":contains('Link')").filter("span").click();

        await sleep(100);

        // Grab the URL and rewrite it to point at the raw message contents
        var msgurl = $(".F0XO1GC-Cc-b").filter("input").val().replace(
          "https://groups.google.com/d/",
          "https://groups.google.com/forum/message/raw?"
        ).replace("msg/", "msg=");

        await sleep(100);

        // Now close the thing
        window.escape();

        var text;

        await $.get(msgurl, (data) => text = data);

        res.push({
          'username': $(msg).find("._username").text(),
          'date': $(msg).find(".F0XO1GC-nb-Q").text(),
          'url': msgurl,
          'message': text
        });

        window.log("Message: " + res.length);
      }

      return JSON.stringify({
        'group': $(".F0XO1GC-mb-x").find("a").first().text(),
        'count': res.length,
        'subject': $(".F0XO1GC-mb-Y").text(),
        'messages': res
      }, null, 4);
    });

    await page.close();

    return messages;
  }

  // Write each thread out as messages/<id>.json (the directory must already exist).
  for (let url of urls) {
    var parts = url.split("/");
    var id = parts[parts.length - 1];

    console.log("Loading URL: " + url);

    fs.writeFile("messages/" + id + ".json", await processPage(url), function(err) {
      if (err) {
        return console.log(err);
      }

      console.log("Done");
    });
  }

  await browser.close();
}

run()

The interactions here are actually fairly complex. Each Google Groups message has a drop-down menu that you can use to get a link to the message itself. Some minor transformations to that URL then get you a link to the raw message contents. So this script loads the URL containing the thread, and then, one by one, opens each message’s menu, activates the popup to get the link, performs an Ajax call to get the message content, scrapes out some relevant metadata, and adds the result to a collection. The collection is then serialized out to JSON.

It works remarkably well for a complete hack job!

Posted on 2018-01-01


Blog-2017-11-22

My switch to Vim for journalling can be described as nothing less than a rousing success!

As of this writing I’ve written over 25,000 words using my Vim-based setup and it has been, to put it mildly, an absolute joy.

The switch to using my instrument of choice--the computer--as my preferred method of journalling has freed my inner dialog from the restrictions of my sluggish, illegible printing. And the sheer portability of my laptop means that I never feel as though the choice to go digital has been an albatross. Quite the contrary, in fact, since I’m rarely without my laptop, but frequently don’t have my notebook with me.

For anyone with a bit of a technical bent, I strongly recommend Vim + Goyo + Limelight as a writing stack. The tooling gives me everything I’d want from a distraction-free writing experience, without costing me a penny, with all the power of my favourite editor.

Ironically, the biggest downside is that I’m back into the habit of pressing Escape every time I’m done writing something… even if I’m writing in a web form. And that means I frequently accidentally back out changes to JIRA tickets at work… damn it…

And speaking of work, I’ve also moved to this same stack for taking my own work notes and tracking my work-related tasks. Turning a bulleted list into a set of checkboxes in Vimwiki is a Ctrl+Space away, so I can quickly and easily write out the day’s plan, accomplishments, and misses. Synchronizing with OneDrive means I can get the same set of notes on any of my work environments. I highly recommend it!

Posted on 2017-11-22

