Blackbox Model Provenance via Palimpsestic Membership Inference
Rohith Kuditipudi*, Jing Huang*, Sally Zhu*, Diyi Yang†, Christopher Potts†, Percy Liang†
NeurIPS, 2025, Spotlight 🌟

Demystifying Verbatim Memorization in Large Language Models
Jing Huang, Diyi Yang*, Christopher Potts*
EMNLP, 2024
Featured on Stanford AI Lab Blog, NNSight Mini Paper Tutorials / Project Page

Causal Abstraction and Generalization

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang*, Junyi Tao*, Thomas Icard, Diyi Yang, Christopher Potts
ICML, 2025
Actionable Interpretability Workshop @ ICML, 2025, Oral Presentation 🌟
Talk / Project Page

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard
JMLR, 2025

Automating and Evaluating Interpretability Tools

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
ACL, 2024
Featured on Anthropic Transformer Circuits Thread / Project Page

Rigorously Assessing Natural Language Explanations of Neurons
Jing Huang, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, Christopher Potts
BlackboxNLP, 2023, Best Paper Award 🏆
Project Page

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Zhengxuan Wu*, Aryaman Arora*, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts
ICML, 2025, Spotlight 🌟
Project Page

HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar*, Atticus Geiger*
ICLR, 2025
Project Page

Misc

I like doing puzzle hunts. My first PhD project was building a cryptic crossword solver. It turns out that we need to teach these subword-based language models about characters first!

I am not on any social media. You can find me via email or Slack.