Dynamic Sparsity in Machine Learning
Routing Information through Neural Pathways
NeurIPS 2024 Tutorial

Bibliography

  • [1] Amos, B. and Kolter, J. Z. (2017). OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136–145.
  • [2] Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432.
  • [3] Correia, G. M., Niculae, V., and Martins, A. F. T. (2019). Adaptively sparse transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2174–2184.
  • [4] Farinhas, A., Aziz, W., Niculae, V., and Martins, A. F. T. (2022). Sparse communication via mixed distributions. In International Conference on Learning Representations.
  • [5] Fedus, W., Zoph, B., and Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv:2101.03961.
  • [6] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323.
  • [7] Goyal, A., Lamb, A., Hoffmann, J., Sodhani, S., Levine, S., Bengio, Y., and Schölkopf, B. (2021). Recurrent independent mechanisms. In International Conference on Learning Representations.
  • [8] Graves, A. (2016). Adaptive computation time for recurrent neural networks. arXiv:1603.08983.
  • [9] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182.
  • [10] Hu, J. Y.-C., Yang, D., Wu, D., Xu, C., Chen, B.-Y., and Liu, H. (2023). On sparse modern Hopfield model. In Advances in Neural Information Processing Systems 36.
  • [11] Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. (2024). Mixtral of experts. arXiv:2401.04088.
  • [12] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. arXiv:2001.08361.
  • [13] Lee, K., Choi, S., and Oh, S. (2018). Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473.
  • [14] Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations.
  • [15] Martins, A. F. T. and Astudillo, R. (2016). From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning.
  • [16] Martins, P. H., Marinho, Z., and Martins, A. F. T. (2022). ∞-former: Infinite memory transformer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 5468–5485.
  • [17] Muqeeth, M., Liu, H., Liu, Y., and Raffel, C. (2024). Learning to route among specialized experts for zero-shot generalization. arXiv:2402.05859.
  • [18] Nawrot, P., Łańcucki, A., Chochowski, M., Tarjan, D., and Ponti, E. M. (2024). Dynamic memory compression: Retrofitting LLMs for accelerated inference. In International Conference on Machine Learning.
  • [19] Niculae, V. and Blondel, M. (2017). A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems.
  • [20] Niculae, V. and Martins, A. F. T. (2020). LP-SparseMAP: Differentiable relaxed optimization for sparse structured prediction. In International Conference on Machine Learning.
  • [21] Oren, M., Hassid, M., Adi, Y., and Schwartz, R. (2024). Transformers are multi-state RNNs. arXiv:2401.06104.
  • [22] Pfeiffer, J., Ruder, S., Vulić, I., and Ponti, E. M. (2023). Modular deep learning. Transactions on Machine Learning Research.
  • [23] Ponti, E. M., Sordoni, A., Bengio, Y., and Reddy, S. (2023). Combining parameter-efficient modules for task-level generalisation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 687–702.
  • [24] Raposo, D., Ritter, S., Richards, B., Lillicrap, T., Humphreys, P. C., and Santoro, A. (2024). Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv:2404.02258.
  • [25] Rish, I., Cecchi, G. A., Lozano, A., and Niculescu-Mizil, A. (2014). Practical applications of sparse modeling. MIT Press.
  • [26] Rosenbaum, C., Cases, I., Riemer, M., and Klinger, T. (2019). Routing networks and the challenges of modular and compositional computation. arXiv:1904.12774.
  • [27] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv:1701.06538.
  • [28] Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. (2019). Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808.
  • [29] Yang, W., Li, X., and Zhang, Z. (2019). A regularized approach to sparse optimal policy in reinforcement learning. Advances in Neural Information Processing Systems, 32.
  • [30] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z. A., and Chen, B. (2023). H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems 36.