Build A Small Text-Based Training Dataset

At the time of writing, my TIL repo has 1749 TILs that I've written by hand over the course of about 11 years. As I started writing my own naive Byte Pair Encoding implementation, I realized I needed a sizeable and interesting chunk of text to run through it. The aggregate of all my TILs seemed like a great candidate, but I needed a way to bundle them up into a single file.

Here is a one-liner I can run from the root of the TIL repo to do just that:

{ cat README.md; find */ -name '*.md' -print0 | sort -z | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; } > combined.md

And here is a formatted version of it so that we can see what is going on:

{
  cat README.md; \
  find */ -name '*.md' -print0 \
  | sort -z \
  | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; \
} > combined.md

Everything inside the curlies ({ ... }) gets evaluated and the results are redirected (>) to combined.md.
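To see the grouping behavior in isolation, here is a minimal sketch. The scratch directory and the file name grouped.txt are made up for illustration and are not part of the one-liner above.

```shell
# Use a scratch directory so nothing in a real repo is touched.
tmpdir=$(mktemp -d)

# Both commands share the one redirected stdout, so their output
# lands in the same file, in order.
{ echo "first"; echo "second"; } > "$tmpdir/grouped.txt"

cat "$tmpdir/grouped.txt"
```

Without the curlies, the redirect would only apply to the last command and the first command's output would go to the terminal.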

Inside the curlies, I first output the contents of README.md with cat. I'm not sure how much value the README.md is adding to my corpus, but I'll keep it for now.

Then I find all markdown files that are nested in a categorical directory. The -print0 is a nice trick that terminates each filename with a null byte (\0) instead of a newline, so filenames containing spaces or newlines pass through the pipeline intact. This isn't strictly necessary here since all of my filenames are well-formed with alphanumeric characters and dashes.
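Here is a small sketch of what the null byte protects against. The notes directory and both filenames are hypothetical, and the comparison assumes the temporary directory path itself contains no whitespace.

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/notes"
printf 'a\n' > "$tmpdir/notes/plain.md"
printf 'b\n' > "$tmpdir/notes/has space.md"

# Whitespace-delimited: xargs splits "has space.md" into two arguments.
plain_count=$(find "$tmpdir/notes" -name '*.md' | xargs -n1 echo | wc -l | tr -d ' ')

# Null-delimited: each filename arrives as exactly one argument.
null_count=$(find "$tmpdir/notes" -name '*.md' -print0 | xargs -0 -n1 echo | wc -l | tr -d ' ')

echo "without -print0: $plain_count arguments"
echo "with -print0:    $null_count arguments"
```

There are only two files, so the null-delimited count is the correct one.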

Those filenames get piped to sort -z which sorts them alphabetically. The -z option maintains the null byte as the separator.
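A quick way to watch sort -z at work is to feed it a few null-terminated names by hand. The filenames below are placeholders, and tr is only there to make the result readable in a terminal.

```shell
# Three null-terminated names, deliberately out of order.
sorted=$(printf 'b.md\0a.md\0c.md\0' | sort -z | tr '\0' '\n')

printf '%s\n' "$sorted"
# a.md
# b.md
# c.md
```

The exact ordering of mixed-case or punctuated names depends on the locale. Prefixing the command with LC_ALL=C is one way to get plain byte order if that matters for reproducibility.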

The sorted filenames are then piped to xargs -0, which processes the filenames one by one, recognizing the null byte separator. The -I{} is how I tell xargs to replace the occurrence of {} at the end of the command with the current filename. The _ before it fills $0, so the filename becomes the first positional argument ($1) of the inlined shell script. That shell script does two things:

- It prints out the file separator that I want to use (<|endoftext|>).
- It outputs the contents of the current file. Remember, I'm outputting everything so that it can all get redirected into the combined.md file.
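The xargs step can be tried on its own with a couple of throwaway files. This is only a sketch: a.md, b.md, and example.md are made-up names, and the output is captured in a variable rather than redirected to combined.md.

```shell
# The placeholder "_" lands in $0, so the next argument is $1.
sh -c 'printf "%s %s\n" "$0" "$1"' _ example.md
# _ example.md

tmpdir=$(mktemp -d)
printf 'alpha\n' > "$tmpdir/a.md"
printf 'beta\n' > "$tmpdir/b.md"

# Same find | sort | xargs pipeline as above, run against the scratch files.
out=$(find "$tmpdir" -name '*.md' -print0 \
  | sort -z \
  | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {})

printf '%s\n' "$out"
```

Passing the filename as an argument, rather than splicing {} directly into the script string, keeps odd characters in a filename from being interpreted as shell code.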

When I run it, I get a 2MB file that contains a ton of words and code samples I've written over the years.
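One sanity check I can sketch for the result: the number of separators should equal the number of markdown files under the category directories. The example below builds a tiny stand-in repo in a temporary directory (the git and vim directories and their files are invented for illustration) so it can run anywhere.

```shell
repo=$(mktemp -d)
mkdir "$repo/git" "$repo/vim"
printf '# TIL\n' > "$repo/README.md"
printf 'git tip\n' > "$repo/git/one.md"
printf 'vim tip\n' > "$repo/vim/two.md"
printf 'another vim tip\n' > "$repo/vim/three.md"

# Run the one-liner from the root of the stand-in repo.
(
  cd "$repo" || exit 1
  { cat README.md; find */ -name '*.md' -print0 | sort -z | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; } > combined.md
)

# Every file under a category directory should have exactly one separator.
file_count=$(find "$repo"/*/ -name '*.md' | wc -l | tr -d ' ')
separator_count=$(grep -c -F '<|endoftext|>' "$repo/combined.md")

echo "files: $file_count, separators: $separator_count"
wc -c < "$repo/combined.md"
```

The check would overcount if a TIL happened to contain the literal separator on a line of its own text, so a mismatch is a prompt to look closer rather than proof of a bug.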
