This is a collection of writing that I have enjoyed this week. All the headings are clickable links to the writing.
A few others at non-trivial and I referenced this paper for the Metaculus forecasting contest. Really fun to implement, an easy one-day build, and with Claude + prompting, the fine-tuning aspect is mostly unnecessary. I don't have a Brier score for our model yet.
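For reference, the Brier score is just the mean squared error between forecast probabilities and binary outcomes, so lower is better. A minimal sketch (the numbers are made-up placeholders, not our contest results):

```python
import numpy as np

def brier_score(forecasts: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between probability forecasts and 0/1 outcomes.

    0.0 is perfect; a constant 0.5 forecast scores 0.25.
    """
    return float(np.mean((forecasts - outcomes) ** 2))

# Hypothetical example: three questions, model probabilities vs. resolutions.
forecasts = np.array([0.8, 0.3, 0.6])
outcomes = np.array([1, 0, 1])
print(brier_score(forecasts, outcomes))  # (0.04 + 0.09 + 0.16) / 3 ≈ 0.0967
```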
Just how weak are LLM protections? We have known for a while that they can be fine-tuned away, but it turns out that refusal of harmful instructions is mediated by a single, easily interpretable direction in activation space. I know this shows that our current safety techniques are weak, but I'm not sure what it says about how LLMs learn when to refuse.
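To make the claim concrete: if refusal really is one direction, you can project it out of the residual stream and the model stops refusing. A minimal sketch of that ablation, assuming you already have the direction (names and shapes are illustrative, not the paper's code):

```python
import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `acts` along `direction`.

    acts:      (..., d_model) residual-stream activations
    direction: (d_model,) hypothesized refusal direction
    """
    d_hat = direction / direction.norm()         # unit vector
    proj = (acts @ d_hat).unsqueeze(-1) * d_hat  # component along d_hat
    return acts - proj                           # activations minus the refusal component
```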
It's a Lilian Weng article! Do I need to say more? Great read, but rather long.
Cool read for a scientific computing project I'm working on. Fun blog in general.
A more extreme version of PEFT, where the number of learned parameters can be reduced significantly.
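For a sense of the baseline being improved on, vanilla LoRA freezes the pretrained weights and trains only a low-rank update, which is already a big reduction; the linked method cuts the trainable count further. A rough LoRA sketch, not the linked paper's method:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable vs. ~16.8M in the frozen base layer
```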
Same scientific computing project, but this is pre-implemented. Very cool results using the solver!
Nanda's great introductory blog. I come back to this really often; there's so much low-hanging fruit here.
Large language models cannot convert between two complex representations of data. However, I have seen papers showing that they do transform small chunks of data into more favorable representations. This seems like a good topic to investigate: is there a set of features and neurons that drives these transformations, and can you merge in more of these features to get more representation transformation?
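One concrete way to hunt for those features and neurons is activation patching: cache activations from a run where the transformation happens, splice them into a run where it doesn't, and see which components move the output. A rough sketch assuming a TransformerLens-style model with `run_with_cache` and `run_with_hooks` (the hook name is a placeholder):

```python
import torch

def patch_component(model, clean_tokens, transform_tokens, hook_name: str):
    """Splice one component's activations from the transforming run into the clean run."""
    _, transform_cache = model.run_with_cache(transform_tokens)

    def hook(activation, hook):
        return transform_cache[hook.name]  # overwrite with the cached activations

    # If the patched component carries the transformation, these logits shift toward it.
    return model.run_with_hooks(clean_tokens, fwd_hooks=[(hook_name, hook)])
```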
One of the problems from the aforementioned Nanda blog, where they explore phase changes in addition in transformers. Cool read, though I think there is still more to be milked here.
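If you want to reproduce that kind of result, the standard task in the grokking literature is addition mod a prime over all input pairs; a minimal dataset sketch (the linked post's exact setup may differ):

```python
import torch

def modular_addition_dataset(p: int = 113):
    """All pairs (a, b) in [0, p) with label (a + b) mod p."""
    a, b = torch.cartesian_prod(torch.arange(p), torch.arange(p)).T
    return torch.stack([a, b], dim=1), (a + b) % p

inputs, labels = modular_addition_dataset()
print(inputs.shape, labels.shape)  # torch.Size([12769, 2]) torch.Size([12769])
```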
SAEs on images! There is a gorilla feature here, so you can make images more or less gorilla-y.
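The steering itself is simple once you have the trained SAE: encode the activations, scale the latent for the feature you care about, and decode. A sketch with made-up objects and indices (`sae` is assumed to expose `encode`/`decode`):

```python
import torch

@torch.no_grad()
def steer_feature(sae, acts: torch.Tensor, feature_idx: int, factor: float) -> torch.Tensor:
    """Scale one SAE latent and reconstruct; factor > 1 turns the feature up."""
    latents = sae.encode(acts)           # (..., n_features) sparse codes
    latents[..., feature_idx] *= factor  # e.g. the gorilla feature's index
    return sae.decode(latents)           # steered activations
```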
This was my first introduction to autoencoders months ago, and I still use it to experiment.
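For anyone else starting out, the core idea fits in a few lines: squeeze the input through a small bottleneck and train to reconstruct it. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Compress inputs to a small bottleneck, then reconstruct them."""
    def __init__(self, input_dim: int = 784, hidden_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(16, 784)                      # e.g. a batch of flattened 28x28 images
loss = nn.functional.mse_loss(model(x), x)   # train by minimizing reconstruction error
```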