Notes from CNI Fall 2023

Clifford Anderson
Dec 14, 2023

The Coalition for Networked Information (CNI) held its fall meeting on December 11–12 in Washington, DC. CNI is a membership organization “dedicated to supporting the transformative promise of digital information technology for the advancement of scholarly communication and the enrichment of intellectual productivity.” While most attendees are librarians, CNI brings together innovators across the scholarly communications sector for two days of rich conversations.

The pace of the talks is fast, making it easy to miss things (or misconstrue them). What follows are my notes from the meeting; if anything looks amiss, please let me know, and I’d be glad to amend this report.

A notepad filled with indecipherable notes about artificial intelligence, cloud computing, and research. The notepad’s pages are densely packed with diagrams, mathematical equations, and short phrases, offering a glimpse into complex thoughts and ideas. (DALL·E 3)

Opening Plenary

Clifford Lynch, executive director of CNI, opened the fall meeting by sharing thoughts on innovations he sees in technology and information policy. Lynch’s reflections about current and future trends are always a highlight of these meetings.

Lynch pointed to the potential of CloudLab at Carnegie Mellon University to advance digital science. By providing researchers with 24/7 online access to scientific instruments, CloudLab widens collaboration possibilities and fosters reproducibility. CloudLab builds on the pioneering work of Emerald Cloud Lab, a startup in Austin. In 2023, Emerald Cloud Lab released Symbolic Lab Language (SLL), its template language for orchestrating experiments, under an open-source license.

On the topic of infrastructure, Lynch also noted that cybersecurity is “looking really scary.” Hackers are attacking everything from hospitals to cultural heritage institutions, including the British Library. In the question and answer period, an attendee remarked that the Toronto Public Library and the University of Michigan have also experienced devastating cybersecurity attacks. On a more positive note, Lynch mentioned the notion of “digital twins,” which corporations already use to model everything from jet engines to building sites.

Lynch highlighted the uncertainty around the world of artificial intelligence and machine learning. Everyone has an opinion, but few understand how these models work. The emergence of public interest in artificial intelligence has fostered even greater confusion. CNI has partnered with the ARL on a task force to chart the waters and to provide possible paths forward; a first draft of their projected scenarios will be presented at the CNI Spring Meeting.

Lynch shared a series of key issues that he is tracking around AI. First, will AI become increasingly centralized, or will infrastructure for AI follow a distributed model? The technological demands of large language models (LLMs) and regulation in the European Union and, potentially, the United States favor large organizations. A second question is how significant generative AI will look in retrospect; Lynch suspects that generative AI will not be a key driver of scientific discovery in the long run. A third question relates to training data. Lynch commented that concerns go beyond intellectual property to touch on transparency, stability, and quality. How does the choice of training data influence what the models can do? For example, the Aurora GenAI LLM, a collaboration between Intel and Argonne National Laboratory, is being trained on enormous quantities of scientific data. How might that model advance the frontiers of science?

What about scholarly communications? Research data management is gearing up on university campuses, driven by demand from federal and private funders. Universities are finding it difficult to develop support models; researchers want libraries to assist directly in data management, but how can libraries scale their services to meet this need? Lynch suggested that we are seeing success at convincing open-access proponents that code should be considered a first-class research output alongside publications and datasets. Lynch also commented on a loss of consensus about how to finance the transition to open access.

Finally, he called attention to the emergence of so-called ‘prediction databases’ such as GNoME (Graph Networks for Materials Exploration) from DeepMind at Google, which can predict future discoveries in materials science.

Lynch concluded his talk with “obvious and less obvious” speculative comments. The “obvious” point is how we “recalibrate truth” in a world of deepfakes and synthetic media. Technological solutions exist to detect anomalies or to establish provenance, but Lynch worries that hackers will learn how to circumvent these tools. We must become more serious about digital literacy initiatives, as there is no easy technical patch for the deepfake challenge.
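To make the provenance idea concrete, here is a minimal, hypothetical Python sketch of binding media bytes to a signing key so that later alterations can be detected. Real provenance systems, such as the C2PA Content Credentials standard, rely on public-key certificates and embedded manifests rather than a shared secret; this sketch only illustrates the underlying idea.

```python
import hashlib
import hmac

# Hypothetical signing key held by the content creator (illustration only).
SECRET_KEY = b"creator-signing-key"

def sign_media(data: bytes) -> str:
    """Bind the media bytes to the signer's key with an HMAC over their hash."""
    digest = hashlib.sha256(data).digest()
    return hmac.new(SECRET_KEY, digest, hashlib.sha256).hexdigest()

def verify_media(data: bytes, tag: str) -> bool:
    """Check whether the media bytes still match the provenance tag."""
    return hmac.compare_digest(sign_media(data), tag)

original = b"...image bytes..."
tag = sign_media(original)
print(verify_media(original, tag))         # True: untouched media verifies
print(verify_media(original + b"!", tag))  # False: any alteration breaks the tag
```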

A “slightly less obvious” comment concerns the legal standing of AI under copyright law. Lynch suspects that we need to reassess our understanding of creators’ rights. How do we respond to the rise of texts and images “in the style of” famous authors and artists? This reassessment “may have broader implications than we might think.” As an academic community, we may underestimate “how deeply upset” these creators are about the potential loss of their livelihood; but how do academics feel about these tools? Are we moving toward a sharper distinction between art and scholarship?

Other issues surfaced during the question and answer period. For example, Lynch spoke about the rise of quantum computing and the associated field of quantum-resistant cryptography. There are exciting initiatives to foster an understanding of quantum computing, he conceded. But it’s unclear when we will get “genuine leverage” on actual problems. Another attendee asked about privacy issues connected with large language models. Lynch observed that leading institutions are developing local models to serve institutional and scholarly goals.

Capacity Building for Librarians

The first plenary session I attended was “To Increase or Decrease Capacity: The What, How, and Why of 21st Century Library Skill Development.” Tony Zanders, CEO of Skilltype, introduced the panel. He emphasized that librarianship has experienced a series of disruptions. Given these periodic disruptions, how do we decide on the skills librarians need to learn without “getting distracted by the trend du jour”? At Skilltype, he is developing software to track organizational needs, individual competencies, and where gaps exist. Keith Webster of Carnegie Mellon University noted that the emergence of generative AI makes technology less a tool and more a co-worker. Are our skills becoming devalued by generative AI? He suggested we think “bottom-up” about tasks that AI may replace rather than “top-down” about job loss. He argued that developing “talent and technology” should be the critical tasks of library leaders; as leaders, we should “award and encourage advanced levels of ‘digital dexterity.’” For his part, Karim Boughida of Stony Brook University pointed out a divergence between leaders’ and staff members’ priorities for professional development. Leaders must learn to communicate priorities to their staff members. During the Q&A, some asked about the meaning of the word “decrease” in the panel’s title. Zanders replied that libraries and iSchools must “decrease,” or deprecate, teaching certain skill areas to focus on emerging needs. Webster contended that libraries are shifting from collections-based to service-based environments, with a lower headcount of more highly skilled professionals.

Data Science Consulting

The next panel was about collaboration between two groups at North Carolina State University. Data & Visualization Services draws on professional faculty librarians and graduate students (along with a coordinator) for 30-minute consultation appointments; it does not offer workshops or teaching support but relies on other campus units for those. The Data Science Academy provides longer-term engagements (up to twenty hours) that include coding services. With support from the Alfred P. Sloan Foundation, the two groups hosted a workshop at NCSU to discuss how best to provision consulting support for data science. Among other subjects, they talked about how to pitch data science consulting services effectively to university administrators.

In the question and answer period, someone asked who funds the Data Science Academy and which disciplines it targets. The provost has made a five- or six-year commitment. The primary users come from the sciences and the social sciences, which reflects the university’s emphasis. Another attendee questioned whether reliance on graduate student instructors was sustainable. The speakers remarked that retaining graduate instructors can be challenging because they are in high demand both at the university and in industry; even so, they have seen strong interest in these roles because they prepare students well for professional positions outside the academy, which require strong communication skills.

The Federal Year of Open Science

Maryam Zaringhalam from the Office of Science & Technology Policy (OSTP) introduced the panel’s themes. Her work focuses on public access to research publications and data. OSTP and seventeen federal agencies signed on to the effort (see open.science.gov). The year began with the publication of an official definition of open science: “Open Science is the principle and practice of making research products and processes available to all, while respecting diverse cultures, maintaining security and privacy, and fostering collaborations, reproducibility, and equity.” Among the products are, for instance, NASA’s Open Science 101 curriculum and version 1.5 of NIST’s Research Data Framework (RDaF). Martin Halbert of the National Science Foundation (NSF) called attention to the newly redesigned Science.gov website. Brett Bobley from the National Endowment for the Humanities (NEH) commented that public access has been essential to the NEH, but a formal plan to provide public access was lacking. The NEH now has a plan in draft, which will be similar to those of other federal agencies. The NEH will develop a designated repository of funded peer-reviewed journal articles and will also require a data management plan for any scholarly datasets its grantees generate. Ashley Sands then spoke about the Institute of Museum and Library Services (IMLS) and its public access plan, which will be shared in draft in 2024 and come into effect in 2025.

A question arose about monographs, which are central to research in the humanities: why does the proposed NEH policy not cover these publications? Bobley answered that the NEH has developed funding programs to make NEH-funded monographs available under Creative Commons licenses; the public access policy itself, however, parallels other agencies’ policies for data and peer-reviewed articles.

Lightning Round

A round of seven lightning talks concluded the first day at CNI. Among the talks, Erik Mitchell (UCSD) discussed the LIS Education and Data Science-Integrated Network Group (LEADING), an IMLS-funded initiative to support early-career information professionals. As they wrap up their grant project, they want to discuss the question, “How might we help libraries collaborate to innovate professional education that impacts recruitment, growth, and retention?” Sayeed Choudhury of Carnegie Mellon University announced CMU’s “Open Forum for AI” initiative, which builds on the university’s expertise in developing responsible AI. Finally, Rob Sanderson (Yale) spoke about LUX, the new cross-collection discovery system for Yale’s libraries, museums, and archives. He reported that contributing data to LUX helped the Yale University Art Gallery and the Yale Peabody Museum review and improve their metadata. He is exploring whether LUX could be used to provide augmented AI services that do not suffer from hallucinations. The code will soon be released under an open-source license on GitHub.
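As an illustration of what a hallucination-resistant, collection-grounded service might look like (a hypothetical Python sketch, not LUX’s actual architecture, data model, or API), the idea is to answer questions only from retrieved records and to decline when nothing matches:

```python
# Hypothetical records standing in for cross-collection catalog entries (not real LUX data).
RECORDS = [
    {"id": "obj-1", "title": "Brass Astrolabe", "holding": "Yale Peabody Museum"},
    {"id": "obj-2", "title": "Illuminated Psalter", "holding": "Yale University Art Gallery"},
]

def retrieve(query: str) -> list[dict]:
    """Naive keyword retrieval over the catalog records."""
    terms = query.lower().split()
    return [r for r in RECORDS if any(term in r["title"].lower() for term in terms)]

def answer(query: str) -> str:
    """Compose an answer only from retrieved records; decline rather than guess."""
    hits = retrieve(query)
    if not hits:
        return "No matching records found."  # refuse instead of fabricating an answer
    return "; ".join(f'{r["title"]} ({r["holding"]})' for r in hits)

print(answer("astrolabe"))    # Brass Astrolabe (Yale Peabody Museum)
print(answer("sarcophagus"))  # No matching records found.
```

The point of the sketch is simply that grounding responses in retrieved records, and refusing to answer otherwise, is one way a discovery system’s structured data could constrain a generative model.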

New AI Tools and Initiatives in Libraries

As new artificial intelligence tools come onto the market, how will libraries select and put them into production? A challenge with artificial intelligence is helping patrons (and staff) to imagine the possibilities and, in some cases, to overcome apprehensions about AI.

Leo Lo described a program to give librarians experience using AI in their daily work. His program provided GPT-4 access to staff members so they could explore AI. The program helped participants regard AI less as a “threat” and more as a “collaborator.” His administrative assistant was delighted to have GPT-4 as an aide. Among the challenges he noted were concerns about data privacy, the demands of prompt engineering, and participants’ lack of subject-matter expertise in AI.

Other panelists talked about AI products specialized for research. For example, Elias Tzoc at Clemson University described a trial of Scite.ai, and Joelen Pastva of Carnegie Mellon University discussed an experiment using Keenious.

Open Book Collective

How do we advance open access for monographs? Lidia Uziel (UC Santa Barbara) described the history of the Open Book Collective (OBC), which emerged from Community-led Open Publication Infrastructures for Monographs (COPIM), an Arcadia-funded project that aims to foster the ecosystem for open-access books. Livy Snider of Punctum Books explained that the OBC provides a centralized platform for discovering and supporting OA book packages. She commented that 90% of the income from these packages goes to the publishers; the rest supports, among other things, grants to small publishers to help them develop open ebook publishing workflows. In 2023, Arcadia provided an additional grant to fund Open Book Futures, which will scale this infrastructure and ensure equitable participation in the Global South.

Rethinking Data Citation

The next panel I attended focused on the significance of data citation. A stark disparity exists between citations to articles and citations to datasets. The traditional “carrot and stick” (incentives/compliance) approach is not advancing citations to data effectively. What would a different approach look like? How would machine learning approaches differ from attempts to influence user behavior?

MakeDataCount.org aspires to improve metrics for assessing the impact of datasets. The ecosystem has become crowded with different approaches to tracking and measuring data, which makes coordinating efforts in this space complex.

The Global Data Citation Corpus (supported by the Wellcome Trust) is building a prototype knowledge graph of dataset citations using machine learning, in collaboration with Crossref, DataCite, and the Chan Zuckerberg Initiative. The prototype will be launched in January 2024.

Librarians connect researchers to data management systems and encourage them to make their datasets citable, discoverable, and measurable. But their role could still expand. For example, could libraries work with medical researchers to include clinical trial datasets in the project?

Bringing Special Collections into JSTOR

Kevin Guthrie remarked that JSTOR was present at the beginning of the digitization revolution in libraries in 1995 and has since developed an infrastructure for publishing digitized materials. How could that existing infrastructure serve the goal of making digitized special collections available online? Not every library wants to build or buy a digital publishing system; JSTOR offers a “borrowed” system at lower cost and with greater reach. The combination of JSTOR for publishing and Portico for preservation provides excellent value while also getting materials into systems that students already use. Bruce Heterick discussed the process of working with librarians to think through digital publishing and preservation. (Full disclosure: Vanderbilt University was among the early adopters in what became a pilot project called Open Community Collections.) JSTOR has added many services to the platform in the intervening years, including cataloging tools and AI/ML processes for enhancing metadata. For several schools, including CUNY and Skidmore, JSTOR/Portico has become their library-wide platform for publishing and preserving digital collections. The big question Guthrie and Heterick are wrestling with is how to tie all these services together to build an active infrastructure for library digital publishing programs while giving libraries greater choice over individual services.

Advancing Research Support at Duke

My final panel at CNI was about the evolution of information technology research support at Duke University. Beyond high-performance computing users, how could Duke provide better computing resources to other disciplines, including the digital humanities? And how could it do so without any grant funding? A working group composed of members from information technology, the libraries, and research got together to address these questions. After holding listening sessions, they discovered that the actual bottleneck was the lack of subject matter experts who could navigate both the scholarly discourses and the technological realm. Beyond GPUs, storage, software, and training, the number one need was for these specialists. Tim McGeary described the strategic planning process used to determine where gaps existed and how to fill them most cost-effectively. The university has made a financial commitment to build capacity in data security, research data management, and additional technical personnel. Rebecca Brower of Duke’s Office of Research Initiatives then spoke about implementing these plans: prioritizing hiring in these essential areas, determining governance, and finding ways to measure success.

Closing Plenary

The theme of the closing plenary was “Open Access, Open Scholarship, and Machine Learning.” Lynch opened the panel by asking how the open access movement, which has always embraced computational uses like text and data mining, should respond to generative artificial intelligence. The rise of generative artificial intelligence has made some question openness as a goal. Is training AI models the kind of openness we had in mind? If not, where does the open-access movement go from here?

Rachael Samberg (UC Berkeley) argued that training AI models with copyrighted literature is fair use and contended that it must remain so if we want to maintain our computational rights to transform scholarly literature. She cautioned, however, that generative AI models could create outputs substantially similar to copyrighted works, which might not qualify as fair use. In licensing negotiations, content providers may try to contract around otherwise fair uses of corpora, including text and data mining rights. As a cultural heritage sector, we should push back on such contractual encroachments on fair use rights.

Heather Sardis (MIT) and Richard Sever (Cold Spring Harbor Laboratory) also took optimistic perspectives. Sardis argued that open access advocates should not seek to control downstream uses of open content but should support human-centric regulation of its applications. Sever concurred, preferring that AI models be trained on scholarly literature rather than the New York Post. He also called attention to infrastructural questions: how can we avoid critical dependence on only a small set of wealthy companies?

Wrap Up

The fall 2023 meeting of CNI demonstrated the intellectual verve and vitality of the library community. Librarians and allied professionals are collaborating creatively to address the disruptions brought about by artificial intelligence, cloud computing, and the increasing need for data management, among other developments. The emphasis on helping existing and new staff meet these challenges through ongoing professional development struck a welcome chord.

On the humorous side, many presenters used generative AI to create the images for their slide decks. Those who turned to DALL·E 3 ended up with misspelled keywords in their graphics. Diffusion models like DALL·E do not yet handle syntax the way transformer-based LLMs do; in these models, words are just another graphic element. As far as AI has progressed, we are still not at the point of having fully effective multimodal models. We’ll see what next year brings!
