Random bits and pieces
1.0 Formatting a non-OS disk with ext4
2.0 Initialising a Drive
3.0 Query To Spot Composer Name Typos

1.0 Formatting a non-OS disk with ext4

When you format a partition as ext4, around 5% of it is reserved by default for the root user, so that a full disk doesn't bring the system down with catastrophic (non-bootable) consequences. If you are formatting an external USB drive for backup purposes, you never boot from it, so there's little point in reserving any space at all (especially as 5% of a 6TB drive is around 300GB!). Therefore, un-reserve that space with the command:

sudo tune2fs -m 0 /dev/sdX1

(where sdX1 stands for sda1, sdb1 or whatever else your actual physical disk partition is). That sets the reserved-blocks percentage to zero, meaning you get to use all your hard disk space. Whilst (for example) the BSD man pages for tune2fs seem to indicate that setting -m to less than 5% is going to hammer performance, that's not what an ext4 developer says. Obviously, this assumes your backup hard disk is using ext4!
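
To get a feel for how much space the default reservation ties up on drives of different sizes, the arithmetic can be sketched as below. The function name reserved_gb is only an illustration, and the sketch assumes decimal units (1TB = 1000GB) and ignores any other filesystem overhead.

```python
def reserved_gb(disk_tb, reserved_percent=5):
    """Approximate space (in GB) held back by the ext4 reserved-blocks setting.

    Assumes decimal units (1 TB = 1000 GB) and ignores filesystem overhead.
    """
    return disk_tb * 1000 * reserved_percent / 100


for size_tb in (1, 6, 12):
    print(f"{size_tb}TB drive: about {reserved_gb(size_tb):.0f}GB reserved at 5%")
```

For the 6TB example above, that works out to about 300GB, all of which becomes usable once the percentage is set to 0.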

2.0 Initialising a Drive

You've got a hard disk whose contents are in a more-or-less unknown state, and you want to wipe the entire drive as quickly as possible so you can start with a clean slate for a new series of backups? The following commands will help:

sudo wipefs -a /dev/sdX
echo "label: gpt" | sudo sfdisk /dev/sdX && echo ",," | sudo sfdisk /dev/sdX
sudo mkfs.ext4 -F /dev/sdX1
lsblk -no UUID /dev/sdX1

Replace the “X” with the correct drive letter (so that /dev/sdX becomes, say, /dev/sdb and /dev/sdX1 becomes /dev/sdb1). Wipefs doesn't laboriously clean a disk: it simply wipes the filesystem and partition-table signatures from the disk, so it's a quick operation and effectively renders the disk blank for nearly all known tools. The first sfdisk command then creates a new GPT partition table to replace the wiped signatures, the second creates a single partition spanning the whole disk, and mkfs.ext4 formats that new partition. The final lsblk command prints the UUID of the new filesystem. Very dangerous, very destructive… but also very efficient and very quick for 'starting from a blank slate'.
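
Because a wrong drive letter here destroys the wrong disk, it can be worth generating the commands for review before typing any of them. The following is a minimal sketch of that idea, not part of any existing tool: plan_wipe is a hypothetical helper, and it assumes the drive uses the /dev/sdX naming shown above (so NVMe-style names would be rejected). It only prints the commands and never runs them.

```python
import re
import shlex


def plan_wipe(device):
    """Return the shell commands for wiping `device`, without running them.

    Only whole-disk names of the form /dev/sdX are accepted (an assumption
    of this sketch), so a partition such as /dev/sda1 is rejected.
    """
    if not re.fullmatch(r"/dev/sd[a-z]+", device):
        raise ValueError(f"not a whole-disk /dev/sdX name: {device!r}")
    dev = shlex.quote(device)
    part = shlex.quote(device + "1")
    return [
        f"sudo wipefs -a {dev}",
        f'echo "label: gpt" | sudo sfdisk {dev} && echo ",," | sudo sfdisk {dev}',
        f"sudo mkfs.ext4 -F {part}",
        f"lsblk -no UUID {part}",
    ]


# Print the plan for review; nothing is executed here.
for command in plan_wipe("/dev/sdb"):
    print(command)
```

Checking the printed device name against the output of lsblk before pasting the commands is a cheap safeguard.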

3.0 Query To Spot Composer Name Typos

If I mis-catalogue a new recording, it will sometimes be because I've said its composer is “Arvo Part” rather than “Arvo Pärt”. The missing umlaut on the 'a' is a tiny typo, but it means Giocoso and Niente will report one more composer in my music collection than I really have. This Niente query helps spot near-misses like this:

-- numbers: the integers 1 to 100, used as character positions
WITH RECURSIVE numbers(n) AS (
  SELECT 1
  UNION ALL
  SELECT n+1 FROM numbers WHERE n<100
),
-- name_values: per composer name, the sum of the code points of its
-- characters, added up across every track row carrying that name
name_values AS (
  SELECT
    composername,
    SUM(unicode(substr(composername, n, 1))) AS value
  FROM tracks
  JOIN numbers ON n<=length(composername)
  GROUP BY composername
)
-- compare each name with its alphabetical predecessor
SELECT
  composername,
  LAG(composername) OVER (ORDER BY composername) AS prev_name,
  LAG(value) OVER (ORDER BY composername) AS prev_value,
  value - LAG(value) OVER (ORDER BY composername) AS diff,
  CASE WHEN abs(value - LAG(value) OVER (ORDER BY composername)) < 100 THEN 'NEAR MATCH' ELSE '' END AS near_match
FROM name_values
ORDER BY composername;

This groups the tracks by composer name, orders the names alphabetically, then compares each row with the row before it. For each composer name, a numeric value is computed by adding up the Unicode code points of its characters (the first 100 of them, at most). Note that the sum runs over every row in tracks that carries the name, so a name credited on many tracks scores proportionally higher than the same name credited on just one. If row n was “Benjamin Britten” and row n+1 was “Richard Wagner”, you'd expect the two numbers to be wildly different from each other. If row n was “Benjamin Britten” and row n+1 was “Benjmin Britten”, however, you'd expect the two numbers to be very close to each other. If the two numbers are within 100 of each other, then they're flagged as a 'near match'. You can then investigate whether that's just coincidence or an accident of catalogue mis-typing!
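
To see what the per-name number looks like, here is a minimal Python sketch of the same idea for a single occurrence of each name. The helper name_value is hypothetical; it uses ord() where the query uses SQLite's unicode(), and it assumes names shorter than the query's 100-character limit.

```python
def name_value(name):
    """Sum of the Unicode code points of the characters in `name`."""
    return sum(ord(ch) for ch in name)


for name in ("Benjamin Britten", "Benjmin Britten", "Richard Wagner"):
    print(name, name_value(name))

# The typo drops a single 'a', so the two Britten spellings differ by ord('a').
print(name_value("Benjamin Britten") - name_value("Benjmin Britten"))  # 97
```

A dropped or substituted letter only shifts the total by a small amount, which is what the 'near match' threshold is trying to catch.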

It's not a perfect way of doing it: missed letters or badly typed letters might mean a row's composer name is being compared to quite the wrong composer name. For example, “Aaron Copland”, “Aarre Merikanto” and “Aaton Copland” would mean that “Aaton Copland” would be compared to “Aarre Merikanto” not “Aaron Copland”, because the 't' makes it sort after Merikanto's 'r' in the same spot, even though it clearly involves a mis-typing of Copland's first name. Nevertheless, that might still be helpful:

composername     prev_name        prev_value  diff    near_match
Aaron Copland
Aarre Merikanto  Aaron Copland    104890      -80053
Aaton Copland    Aarre Merikanto  24837       -23601
Adolph Weiss     Aaton Copland    1236        -81     NEAR MATCH
Adolphe Adam     Adolph Weiss     1155        4365
Adrian Willaert  Adolphe Adam     5520        4693

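Even though “Aaton Copland” is being compared to the wrong composer name, a 'Near Match' flag is still raised one row later, where “Adolph Weiss” is compared with “Aaton Copland”. That flag is a coincidence, but seeing the misspelled name on a flagged row is enough of a pointer to make one realise what has gone on.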
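
The values in that output are much bigger than a single name's letters would add up to, because the query's JOIN sums the character codes once for every track row carrying the name. The sketch below illustrates this with Python's built-in sqlite3 module and an in-memory tracks table. The track counts (85, 17 and 1) are my own guesses, picked only because they happen to reproduce the figures shown above; the real catalogue may well differ.

```python
import sqlite3

# Hypothetical data: the track counts are guesses chosen for illustration.
rows = (
    [("Aaron Copland",)] * 85
    + [("Aarre Merikanto",)] * 17
    + [("Aaton Copland",)]
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (composername TEXT)")
conn.executemany("INSERT INTO tracks VALUES (?)", rows)

# Same numbers/name_values logic as the query above, without the LAG columns.
sql = """
WITH RECURSIVE numbers(n) AS (
  SELECT 1
  UNION ALL
  SELECT n+1 FROM numbers WHERE n<100
),
name_values AS (
  SELECT composername,
         SUM(unicode(substr(composername, n, 1))) AS value
  FROM tracks
  JOIN numbers ON n<=length(composername)
  GROUP BY composername
)
SELECT composername, value FROM name_values ORDER BY composername
"""
values = dict(conn.execute(sql).fetchall())
for name, value in values.items():
    print(name, value)
```

One consequence is that a correctly spelled name with many tracks and its misspelled twin with a single track can end up far apart in value, which is part of why a typo does not always raise the flag.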

You will need to run the report multiple times, too, with different values for the 'near match' threshold. It's not a sure thing that a typo of “Part” for “Pärt” will trigger the near match flag, even with the threshold set into the thousands, for example. But by increasing the threshold significantly, though you'll get plenty of false positives, you do improve your chances of spotting the actual mis-catalogues, too.
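
The effect of the threshold can be sketched in Python as follows. This is an illustration under simplifying assumptions rather than a replacement for the Niente query: name_value and near_matches are hypothetical helpers, the catalogue is made up, and each name is counted once instead of once per track.

```python
def name_value(name):
    """Sum of the Unicode code points of the characters in `name`."""
    return sum(ord(ch) for ch in name)


def near_matches(names, threshold):
    """Pair each name with its alphabetical predecessor when their values
    differ by less than `threshold`."""
    ordered = sorted(set(names))
    return [
        (current, previous)
        for previous, current in zip(ordered, ordered[1:])
        if abs(name_value(current) - name_value(previous)) < threshold
    ]


# Hypothetical catalogue containing one typo ("\u00e4" is the letter a-umlaut).
names = ["Arvo P\u00e4rt", "Aaron Copland", "Benjamin Britten", "Arvo Part"]

for threshold in (100, 200):
    print(threshold, near_matches(names, threshold))
```

The code point of 'ä' is 228 and that of 'a' is 97, so the two spellings differ by 131: a threshold of 100 misses the pair, whilst 200 catches it. In the real query the gap also grows with the number of tracks filed under each spelling, which is why much larger thresholds can still miss it.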
