Machine Learning - Training | Fine Tuning

"Initials" by "Florian Körner", licensed under "CC0 1.0". / Remix of the original. - Created with dicebear.comInitialsFlorian Körnerhttps://github.com/dicebear/dicebearTR
Google open sources tools to support AI model development
techcrunch.com
Sometimes The Best AI Accelerator Is The 37,888 GPUs You Already Have - The Next Platform
www.nextplatform.com
Fine-tune LLMs on your own consumer hardware using tools from the PyTorch and Hugging Face ecosystems
pytorch.org
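As a rough illustration of what such a setup looks like in practice, here is a minimal LoRA fine-tuning sketch using `transformers` and `peft`; the model id and hyperparameters are illustrative assumptions, not taken from the linked article.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM id works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
# From here, train with transformers.Trainer or trl's SFTTrainer as usual.
```

Because only the adapters carry gradients and optimizer state, memory use drops enough to fit consumer GPUs, which is the point of the linked article.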
Stock analysis data pipeline Episode 1
www.youtube.com
Training Cluster as a service: Train your LLM at scale on our infrastructure
huggingface.co
Hamel’s Blog - Estimating vRAM
hamel.dev
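As a rough, back-of-the-envelope version of the kind of estimate the post covers (the post's own accounting may differ), full fine-tuning with Adam costs roughly 2 bytes/parameter for fp16 weights, 2 for gradients, and 8 for the fp32 Adam moments, plus activation overhead:

```python
def estimate_finetune_vram_gb(params_billion: float,
                              bytes_per_param: float = 12.0,
                              activation_overhead: float = 1.2) -> float:
    # 12 bytes/param = 2 (fp16 weights) + 2 (fp16 grads) + 8 (fp32 Adam m and v);
    # activation_overhead is a crude multiplier for activations and buffers.
    return params_billion * 1e9 * bytes_per_param * activation_overhead / 1e9

print(f"{estimate_finetune_vram_gb(7):.0f} GB")  # roughly 100 GB for a 7B model, before LoRA/quantization
```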
How to find garbage in your text dataset? [Checklist]

Tweet: https://twitter.com/Yampeleg/status/1690641029295169536
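The checklist itself lives in the tweet; as an illustration of the kinds of checks such lists typically include, here are a few common heuristics (these specific rules are assumptions, not the tweet's content):

```python
import re
from collections import Counter

def garbage_flags(text: str) -> dict:
    # Flag records that are suspiciously short, mostly non-ASCII,
    # dominated by one repeated token, or still contain HTML remnants.
    total = max(len(text), 1)
    non_ascii_ratio = sum(ord(c) > 127 for c in text) / total
    tokens = text.split()
    repeated_ratio = Counter(tokens).most_common(1)[0][1] / len(tokens) if tokens else 0.0
    return {
        "too_short": len(tokens) < 5,
        "high_non_ascii_ratio": non_ascii_ratio > 0.3,
        "mostly_repeated_token": repeated_ratio > 0.5,
        "html_remnants": bool(re.search(r"<[a-z]+[^>]*>", text, re.I)),
    }

print(garbage_flags("<div>buy buy buy buy buy buy</div>"))
```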

LZ77 Is All You Need? Why Gzip + KNN Works for Text Classification
codeconfessions.substack.com

cross-posted from: https://lemmy.world/post/2706141

> Decoding the Success of Gzip + KNN: The Central Role of LZ77
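For readers who want the mechanics behind the title, here is a minimal sketch of the gzip + KNN classifier (normalized compression distance plus majority vote); helper names are illustrative, and this is not code from the linked post:

```python
import gzip
import numpy as np

def ncd(a: str, b: str) -> float:
    # Normalized compression distance: if a and b share structure (LZ77 matches),
    # compressing them together costs little more than compressing the larger one.
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + " " + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_classify(query: str, train_texts: list, train_labels: list, k: int = 3):
    # Label the query by majority vote among its k nearest training texts under NCD.
    dists = [ncd(query, t) for t in train_texts]
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

texts = ["the cat sat on the mat", "dogs bark at the mailman", "stocks fell sharply today"]
labels = ["animals", "animals", "finance"]
print(knn_classify("my cat chased a dog", texts, labels, k=1))
```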

GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines
github.com
New trick scales LLM context even longer! - GitHub - jquesnelle/scaled-rope

- [github repo](https://github.com/jquesnelle/scaled-rope)
- [related paper](https://lemmy.intai.tech/post/29142)

![twitter announcement archive](https://lemmy.intai.tech/pictrs/image/3d06a669-95f2-407d-9126-a5a3e84e4c68.png)

## NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation

[Twitter](https://twitter.com/Yampeleg/status/1674430869828956161) | [Reddit](https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/)

I've seen the posts about SuperHOT and, just recently, the paper from Meta which uses RoPE interpolation, and I've noticed an immediate improvement that can be brought to this method. Basically, if you apply Neural Tangent Kernel (NTK) theory to this problem, it becomes clear that simply interpolating the RoPE's Fourier space "linearly" is very sub-optimal, as it prevents the network from distinguishing the order and positions of tokens that are very close by. Borrowing from the NTK literature, scaling down the Fourier features too much will eventually even prevent successful finetunes (this is corroborated by the recent Meta paper, which suggests an upper bound of ~600x).

Instead of the simple linear interpolation scheme, I've tried to design a nonlinear interpolation scheme using tools from the NTK literature. Basically, this interpolation scheme changes the base of the RoPE instead of the scale, which intuitively changes the "spinning" speed of each of the RoPE's dimension vectors relative to the next. Because it does not scale the Fourier features directly, all the positions remain perfectly distinguishable from each other, even when taken to the extreme (e.g. stretched 1 million times, which is effectively a context size of 2 billion).

To my surprise, this method works extremely well, so much so that you don't even need to fine-tune the LLaMA 7B model for a 4096 context size! The perplexity degradation is minimal. I'm sure it would become even better with fine-tuning.

Enough explanations, here are some empirical results. All the perplexity measurements are done on LLaMA 7B with the tau/scrolls dataset on Hugging Face (I only used a subset of gov_report). Here's a graph showing the average perplexity of LLaMA 7B on a set of 40 very long prompts (12k+ context size). Compared to changing the scale (from SuperHOT, which was set to 4), we change a factor alpha, which when equal to 8 provides the same context-size increase but with much less perplexity degradation. All without any fine-tuning!

Code can be found in a Colab notebook with a test example: NTKAwareScaledRotaryEmbedding.ipynb - Colaboratory (google.com). Again, the changes to the RoPE code are only 3 lines.

Also, just for fun, I've pasted the entirety of the Meta paper, [2306.15595] Extending Context Window of Large Language Models via Positional Interpolation (arxiv.org), with the abstract and conclusions removed, and asked OASST-sft-6-llama-30b some questions. I think it fared pretty well given it hasn't been fine-tuned at all for 6000+ context sizes. I used an alpha of 8 for this prompt. The entire paper is about 6200 tokens. These prompts can be found in the Colab notebook linked above. The following questions are simply appended at the end of the prompt, which contains the entire research paper.

**QUESTION:** Please give me a brief summary of this research paper in a few bullet points.
**ANSWER:**

* Large language models (LLMs) often have a pre-defined context window size, but sometimes they need to operate on longer contexts, e.g. in conversation systems or summarization tasks.
* Extending the context window of an existing pre-trained LLM is challenging because direct extrapolation of positional encodings can lead to poor performance.
* Researchers propose "Position Interpolation" to extend the context window of LLMs by downscaling the positional encodings.
* They prove that interpolated positional encodings have a much smaller upper bound than extrapolated ones, allowing the model to adapt to longer contexts more efficiently.
* They experimentally validate the effectiveness of Position Interpolation on LLaMA models, reducing the perplexity gap between the original and extended contexts.
* They compare Position Interpolation with direct extrapolation and find that Position Interpolation works better.
* They apply Position Interpolation to LLaMA models and fine-tune them on the Pile dataset, finding that the models can effectively utilize longer contexts.
* They also evaluate the models' abilities to attend to longer contexts and perform well on benchmarks within the original context size.
* They apply Position Interpolation to long document summarization and obtain competitive results.
* They provide insights into the interpolation/extrapolation tradeoff and discuss potential improvements.
* They plan to explore more advanced methods like data augmentation and multi-task learning to improve the interpolation.

**QUESTION:** Please write me the abstract for this paper.

**ANSWER:** This paper presents a technique called "Position Interpolation" for extending the context window size of pre-trained large language models (LLMs) without modifying their architecture. Previous methods like ALiBi and LeX were unable to efficiently extend the context window size of LLMs, leading to poor performance. The authors propose Position Interpolation, which involves downscaling the positional encoding of the LLM's attention mechanism to match the desired longer context window size. They prove mathematically that interpolated attention scores are more stable than extrapolated ones, allowing the model to adapt to longer contexts with just a few hundred fine-tuning steps. Experiments showed that Position Interpolation enables LLMs to perform well on longer contexts, reducing perplexity by up to 0.8 points on average and generating strong summarizations. Extending the context window size to 32768 led to slight degradation on the original benchmarks, but the models adapted to longer texts. The authors conclude that interpolating positional encodings allows the model to attend to relevant evidence in the longer contexts, and the stability of the interpolated attention scores helps the model avoid catastrophic behaviors.

**QUESTION:** How many steps was the model fine-tuned for in the final results? Give a short answer.

**ANSWER:** The authors of the paper "Extending Context Windows of Large Language Models via Position Interpolation" fine-tuned the model for 1000 batches for the experiments presented in the paper.

**QUESTION:** How big is the interpolation bound compared to the extrapolation bound? Give a short answer.

**ANSWER:** The interpolation bound is at least 2 - 29.7 * 10^(-6) times smaller than the extrapolation bound.

Sorry for the large dump of information on Reddit, I'm too lazy to write a blog post for this.
I might give a more detailed explanation of how I derived the formula used to calculate the base if enough people are interested. I did not test fine-tuning performance, as I do not have the resources or the time to fine-tune an LLM; I just derived this formula during lunch and experimented with it. However, I think this method will do even better with fine-tuning.

Also, thanks to the people behind the SuperHOT blog post; it was their hard work that inspired me and allowed me to make this contribution for everyone! Finally, I really hope this post will inspire others to start experimenting on ways to improve LLMs. There's so much to learn and so much left to discover! What a time to be alive!
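The post says the change amounts to only a few lines of RoPE code. Here is a minimal sketch of the base-scaling idea in PyTorch, assuming a "half" rotary channel layout; `alpha = 8` mirrors the value quoted above, everything else (names, shapes) is illustrative:

```python
import torch

def ntk_scaled_inv_freq(dim: int, base: float = 10000.0, alpha: float = 8.0) -> torch.Tensor:
    # Instead of shrinking position indices (linear interpolation), enlarge the RoPE
    # base so low-frequency dimensions "spin" slower while the highest frequencies
    # stay nearly unchanged; the dim/(dim-2) exponent achieves that balance.
    scaled_base = base * alpha ** (dim / (dim - 2))
    return 1.0 / (scaled_base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

def apply_rope(x: torch.Tensor, inv_freq: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, dim). Rotate channel pairs by position-dependent angles.
    seq_len, dim = x.shape
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, : dim // 2], x[:, dim // 2 :]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

x = torch.randn(4096, 128)                      # e.g. one attention head's queries
x_rot = apply_rope(x, ntk_scaled_inv_freq(128))
```

Unlike linear position interpolation, the position indices are left untouched here, so nearby tokens keep distinct rotations even at large stretch factors, which is the property the post emphasizes.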

GitHub - teknium1/GPTeacher: A collection of modular datasets generated by GPT-4, General-Instruct - Roleplay-Instruct - Code-Instruct - and Toolformer
github.com
GitHub - OpenAccess-AI-Collective/axolotl: Go ahead and axolotl questions

https://github.com/OpenAccess-AI-Collective/axolotl

Preemo | Fine Tune Foundational Models
www.preemo.io
GitHub - google-research/google-research: Google Research
github.com

All datasets in this repository are released under the CC BY 4.0 International license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.

GitHub - princeton-vl/CoqGym: A Learning Environment for Theorem Proving with the Coq proof assistant

https://github.com/princeton-vl/CoqGym

GitHub - microsoft/DeepSpeed: DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

https://github.com/microsoft/DeepSpeed
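As a rough sketch of what a DeepSpeed training step looks like (the toy model, batch size, and config values are illustrative; scripts are normally launched with the `deepspeed` launcher, e.g. `deepspeed train.py`):

```python
import torch
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},        # partition optimizer state and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(32, 1024).to(model_engine.device).half()
loss = model_engine(x).float().pow(2).mean()   # dummy loss for illustration
model_engine.backward(loss)                    # DeepSpeed handles loss scaling
model_engine.step()                            # and the (partitioned) optimizer step
```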

Relationship between LLM model size and emergent abilities
zhuanlan-zhihu-com.translate.goog
Mathematical Foundations of Machine Learning

https://skim.math.msstate.edu/LectureNotes/Machine_Learning_Lecture.pdf

Foundations of Machine Learning Introduction to ML

- [Original](https://cs.nyu.edu/~mohri/mls/ml_introduction.pdf)
- [Archive](https://web.archive.org/web/20210423231106/https://cs.nyu.edu/~mohri/mls/ml_introduction.pdf)

GitHub - ggerganov/ggml: Tensor library for machine learning
github.com
GitHub - microsoft/LMOps: General technology for enabling AI capabilities w/ LLMs and MLLMs
github.com
GitHub - dadukhankevin/Finch: A Keras style GA genetic algorithm library
github.com
GitHub - philipturner/metal-flash-attention: Faster alternative to Metal Performance Shaders
github.com
GitHub - DylanAlloy/hygiene: hygiene (🪥) is a data preprocessing toolkit that makes it easy to create common LLM-related data structures; from training data to chain payloads!
github.com
Things I’m Learning While Training SuperHOT

### Things I'm Learning While Training SuperHOT

I have been working on SuperHOT for some time now. It is a fiction-focused finetune of LLaMA with extra focus on NSFW outputs while also being capable of general-use instructions. The main reason I'm making the model is that it is fun and serves as a good way to learn the inner workings of Transformers, dataset creation and techniques, and probing the capabilities of LLMs. There are also a lot of people who want NSFW-capable models, and they provide useful, honest feedback, especially when they don't get what they want. Besides, it's a fun model to use. I'm making this page to share some of my findings in the hope that others might find it useful. I will update this page as time goes on with any information that might be important and whenever I have the time.

### Background

Originally, I was working on a Langchain extension for oobabooga's Text-Generation-WebUI. It was not a very fun project. Gradio is not fun to work with when it comes to stateful UI updates, or even encapsulation of UI and logic. I ended up hand-rolling my own UI state management system just to give a nice user experience when modifying templates, chains, etc. After some time of using Langchain, I realized I was only using a subset of the features (because they were the most useful to me), and even those features could be replicated outside of the framework very easily, saving me from the unnecessary bloat. I was also displeased with the quality of the chained outputs, so I looked for alternate ways to improve the generation quality.

I ended up making SuperCOT at this time, by combining parts of datasets from Sahil Chaudhary's Code Alpaca, CMU NeuLab's CoNaLa, Google's FLAN (QED and Aqua), and Peng et al.'s Alpaca GPT-4, mostly sourced from Qingsi Yi et al.'s Alpaca-CoT project dataset. The resulting model worked much better with chained prompting for me, but the quality was still a far cry from the demonstrations in Langchain's documentation. I stopped working on the extension; instead, I started playing with parts of the framework in isolation, such as vector databases, and making my own lightweight chained-prompting wrapper library.

In the meantime, apparently some users found the model was very good at producing NSFW content. I had made it a point to filter the refusals and bias from the dataset as best as I could, so that was not unexpected. Soon after, the idea was floated to make models based solely on online roleplay logs, with the idea that such models would be much better at making chatbot outputs and would require no filtering of refusals. So I started working on SuperHOT, and in the meantime others also worked on their own models, such as Bluemoon 13/30B.

Continued: [ORIGINAL](https://kaiokendev.github.io/til) | [ARCHIVE](https://web.archive.org/web/20230622110406/https://kaiokendev.github.io/til)

Introducing *SequenceMatch*, training LLMs with an imitation learning loss

- [Tweet](https://twitter.com/ChrisCundy/status/1671554622207709184)
- [Paper](https://arxiv.org/pdf/2306.05426.pdf)

Data is a key ingredient, something I talk about often. I think many know this (many probably knew this long ago as well).

https://twitter.com/abacaj/status/1671569594006347776

GitHub - openai/triton: Development repository for the Triton language and compiler

[github](https://github.com/openai/triton)

Release 4-bit QLoRA, Paged Optimizers, and 8-bit Memory Leak Bugfix · TimDettmers/bitsandbytes
github.com
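This release is typically consumed through `transformers`, which wires bitsandbytes 4-bit (NF4) quantization into model loading; a minimal sketch with an illustrative model id:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the QLoRA NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
# Combine with peft LoRA adapters for QLoRA-style fine-tuning.
```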
UpTrain | Open-Source ML Observability and Refinement
https://uptrain.ai/
Are autoencoders and auto-associative neural networks the same thing?

cross-posted from: https://sh.itjust.works/post/223572

> This question is being reposted to preserve technical content removed from elsewhere. Feel free to add your own answers/discussion.
>
> Original question: Autoencoders and auto-associative memory seem to be closely related. It appears the terminology changed; is there a difference between the two, or did the wording simply change over time?

Should I scale values before using them to train autoencoder?

cross-posted from: https://sh.itjust.works/post/116346

> Not OP. This question is being reposted to preserve technical content removed from elsewhere. Feel free to add your own answers/discussion.
>
> Original question:
>
> I have a dataset that contains vectors of shape 1xN, where N is the number of features. Each value is a float between -4 and 5. For my project I need to build an autoencoder; however, activation functions like ReLU or tanh will either only allow positive values through the layers or constrain them to within -1 and 1. My concern is that upon decoding from the latent space the data will not be represented in the same way: I will either get vectors with positive values only or constrained negative values, while I want the output to be close to the original.
>
> Should I apply some kind of transformation, like adding a positive constant, exp(), or raising the data to the power 2, train the VAE, and then if I want the original representation just log() or log2() the output? Or am I missing some configuration with activation functions that can give me an output similar to the original input?
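One common answer, sketched below under my own assumptions (not from the original thread): min-max scale the features into [0, 1] before training, invert the scaling after decoding, and give the decoder a linear output layer so its range is not constrained by an activation function.

```python
import torch
import torch.nn as nn

def fit_minmax(x: torch.Tensor):
    # Per-feature min/max over the training set; x has shape (samples, features).
    return x.min(dim=0).values, x.max(dim=0).values

def scale(x, lo, hi):
    return (x - lo) / (hi - lo + 1e-8)   # maps roughly into [0, 1]

def unscale(x, lo, hi):
    return x * (hi - lo + 1e-8) + lo     # back to the original range

class Autoencoder(nn.Module):
    def __init__(self, n_features: int, latent: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, latent))
        # No activation on the final layer: the decoder can emit any real value.
        self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Usage: lo, hi = fit_minmax(train_x); train on scale(train_x, lo, hi);
# reconstructions are compared or reported after unscale(...).
```

In practice a simple affine rescaling like this is usually enough, and unlike exp/log or squaring tricks it is exactly invertible for negative values.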

GitHub - chavinlo/musicgen_trainer: simple trainer for musicgen/audiocraft

- https://github.com/chavinlo/musicgen_trainer
- https://twitter.com/femboylover03/status/1669252447674531840

Get some!
GitHub - facebookresearch/ijepa: Official codebase for I-JEPA, the Image-based Joint-Embedding Predictive Architecture. First outlined in the CVPR paper, "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture".
github.com

https://twitter.com/MetaAI/status/1668652154897453056

Training Autoencoder on time series with repeating pattern

cross-posted from: https://sh.itjust.works/post/67956

> Not OP. This question is being reposted to preserve technical content removed from elsewhere. Feel free to add your own answers/discussion.
>
> Original question:
>
> I'm training an autoencoder on a time series that consists of repeating patterns (because the same process is repeated again and again). If I then use this autoencoder to reconstruct another one of these patterns, I expect the reconstruction to be worse if the pattern is different from the ones it has been trained on.
>
> Is the fact that the time series consists of repeating patterns something that needs to be considered in any way for training or data preprocessing? I am currently using this on raw channels.
>
> Thank you.
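One common preprocessing choice, sketched below under my own assumptions (not from the original thread): slice the series into fixed-length windows roughly aligned with the repeating pattern and standardize per channel, so each training sample covers one repetition and reconstruction error can be compared across windows.

```python
import numpy as np

def make_windows(series: np.ndarray, window: int, stride: int) -> np.ndarray:
    # series: (timesteps, channels) -> (num_windows, window, channels)
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[i:i + window] for i in starts])

def standardize(windows: np.ndarray) -> np.ndarray:
    # Per-channel zero mean / unit variance, computed over all windows and timesteps.
    mean = windows.mean(axis=(0, 1), keepdims=True)
    std = windows.std(axis=(0, 1), keepdims=True) + 1e-8
    return (windows - mean) / std

series = np.sin(np.linspace(0, 40 * np.pi, 4000))[:, None]  # toy repeating signal, 1 channel
x = standardize(make_windows(series, window=200, stride=100))
print(x.shape)  # (39, 200, 1)
```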

StanfordHCI/ModelSketchBook
github.com

PAPER: https://hci.stanford.edu/publications/2023/Lam_ModelSketching_CHI23.pdf

CUDA full GPU acceleration, KV cache in VRAM
github.com