I wrote up this paper because it is quite similar to research I had wanted to do myself.
*The implementation is available at https://github.com/Qianyue-Wang/DOME
†All the datasets are available at https://github.com/Qianyue-Wang/DOME_dataset
Paper Summary
In long-form story generation, the existing approaches (fixed outlines, human-interactive methods) each had drawbacks. To combine the two approaches while keeping the process automatic, the authors introduce a TKG to store the preceding context, and at each generation step (rough outline -> detailed outline -> partial story) relevant content is retrieved from the TKG and reflected in the prompt.
- Question 1: Does converting to KG form really reduce the storage cost effectively?
- How would this differ from using retrieval over plain natural-language sentences instead of quadruples?
- Question 2: When only the subject, action, and object are stored in quadruple form, doesn't the resulting information loss affect the story's development?
- How could passages without characters - settings, backgrounds, etc. - be handled, given that their information is not captured?
Abstract
- Long-form story generation
- e.g., novel writing or interactive storytelling
- aims to generate coherent text of sufficient length
- However, existing methods, especially LLM-based approaches, rely on rigid outlines or lack macro-level planning, making it hard to achieve both contextual consistency and coherent plot development in long-form story generation.
- rely on rigid outlines or lack macro-level planning
- both contextual consistency and coherent plot development in long-form story generation
-> To address this problem, the paper proposes DOME, a long-form story generation method based on Dynamic Hierarchical Outlining with Memory-Enhancement.
- to generate the long-form story with coherent content and plot
- DOME aims to generate stories with coherent content and plot.
- Dynamic Hierarchical Outline (DHO)
- introduces novel-writing theory into the outline design process, and by interleaving outline planning with the body-writing stage, adapts to the uncertainty that can arise during story generation, structuring the plot to a high degree of completeness and improving story coherence.
- Memory-Enhancement Module (MEM)
- stores and accesses generated content based on temporal knowledge graphs, reducing contextual conflicts and improving story consistency.
- Temporal Conflict Analyzer
- automatically evaluates the contextual consistency of long-form stories using the temporal knowledge graph.
- Experiments show that DOME significantly improves the fluency, coherence, and overall quality of generated long-form stories compared to state-of-the-art methods.
1 Introduction
- the automatic generation of a long-form story -> requires creativity and long-term planning skills.
- such as novel writing and interactive storytelling (Riedl and Young, 2010).
- LLMs are rapidly developing -> generating a long-form story that further increases dramatically in length, complexity, and fluency (Yang et al., 2022).
- Unfortunately, it is difficult for LLMs to generate a long-form story that maintains semantic contextual consistency and coherent plot development, for the following reasons.
- 1) Memory limitations:
- The black-box self-attention mechanism still suffers from the long-range dependency issue (Vaswani et al., 2017), making it hard for LLMs to recall earlier content precisely and unambiguously, thereby leading to contextual incoherence.
- 2) Planning difficulties:
- LLMs cannot effectively apply knowledge of coherent plot planning -> because they do not inherently understand or apply the principles of storytelling -> this hinders the generation of engaging stories with complete and fluent plots (Xie et al., 2023)
- Existing long-form story generation methods often leverage higher-level attributes
- such as plots or commonsense knowledge (Xie et al., 2023 : "The Next Chapter: A Study of Large Language Models in Storytelling") aiming to enhance story development fluency,
- and can be divided into two kinds based on whether humans are involved
- (a) emulates the human writing process
- by adopting a plan-and-write framework (Yang et al., 2022) in generating the long-form story
- separates the generation process into the planning and writing stages,
- utilizing detailed and plot-fluent outlines to guide the writing phase.
- the inflexibility of a fixed outline can impede its adaptability to uncertainty in the writing stage, often leading to plot incoherence, such as contextual repetition or conflicts.
- (b) generating content progressively without outlining based on human interaction and relevant preceding content (Zhou et al., 2023; Brahman et al., 2020)
- can address the influence of local plot development caused by uncertainty in generation.
- However, the generated story lacks macro-level rational planning, affecting plot completeness.
- To address the above limitations, we propose the Dynamic Hierarchical Outlining with Memory-Enhancement long-form story generation method (DOME), to generate the long-form story with coherent content and plot.
- DOME = Dynamic Hierarchical Outline (DHO) + Memory-Enhancement Module (MEM) = to generate long-form story.
- DHO mechanism (Dynamic Hierarchical Outline)
- to guide long-form story generation based on the plan-write writing framework + the novel-writing theory (Campbell, 1949).
- DHO mechanism requires the generation of a rough outline to ensure plot completeness,
- and the dynamic planning of a detailed outline based on the rough outline during the writing process -> so that it can adapt to the uncertainty of the generation & improve the plot fluency.
- MEM (Memory-Enhancement Module)
- stores and accesses generated stories through temporal knowledge graphs
- and provides contextual content for outline planning and story writing to reduce content conflicts in long story texts.
- Temporal Conflict Analyzer
- To automatically evaluate the contextual consistency,
- we propose a conflict detection matrix (named Temporal Conflict Analyzer)
- based on the information representation rules of the temporal knowledge graph and LLM.
- Our main contributions are as follows:
- 1) A new paradigm for long-form story generation.
- a dynamic hierarchical outline (DHO) mechanism -> planning + writing stages,
- making it adaptable to the generation uncertainty and improving plot coherence.
- Experiments show that DHO improves 6.87% on the Ent-2 metric -> which metric is this? (-> Ent-2 is 2-gram entropy; see Metrics in Section 5)
- 2) A new approach to contextual conflict resolution.
- a Memory-enhancement module (MEM) based on the temporal knowledge graph to store and access the generated story.
- Applying LLM to perform semantic filtering on KG retrieval
- -> the results keep the conciseness of the historical information
- Experiments show that MEM reduces conflicts by 87.61%. -> how was conflict measured?
- 3) A new evaluation for conflict detection.
- To evaluate the degree of contextual consistency automatically,
- a potential conflict detection method
- based on the information representation rules of the temporal knowledge graph
- and apply LLM to further determine whether a conflict exists.
- Experiments show that the judgment results of this method are consistent with human preferences.
2 Related Work
2.1 Long-form Story Generation
- Recent works for generating long-form stories can be classified into two categories based on whether humans participate
- 1) No human involvement: a plan-and-write framework (Yao et al., 2018)
- These techniques lengthen stories and improve the fluency of plot development, but the rigid nature of a predetermined outline can hinder flexibility in the writing phase,
- resulting in plot inconsistencies, such as repeated or contradictory contexts.
- Fan et al. (2018) proposes a hierarchical story generation method,
- which first generates a story premise and then generates the story based on it.
- Yao et al. (2018) introduces a "plan-and-write" framework for open-domain story generation,
- which divides the writing process into planning and generating.
- Yang et al. (2022) further subdivides writing into plan, draft, rewrite, and edit modules.
- Yang et al. (2023) makes efforts on outline control to generate a more detailed and hierarchical story.
- 2) Human involvement: Other works (Zhou et al., 2023; Brahman et al., 2020) explore using human interaction to replace pre-generated outlines.
- Brahman et al. (2020) introduce a content-inducing approach,
- which involves humans using cue phrases.
- Zhou et al. (2023) propose a language simulacrum of RNNs’ recurrence,
- enabling human-guided story planning.
- This reduces plot inconsistencies but results in less fluent, readable narratives due to the lack of micro-level control.
2.2 KG-enhanced LLM Inference
- Knowledge Graphs (KGs) can effectively model the key information in text by triples in the form of <subject, action, object> (Ji et al., 2020).
- The integration of LLMs and KGs has attracted widespread attention due to the potential improvement KGs can bring to LLMs (Pan et al., 2023a).
- KG can be used as a knowledge base to provide references when LLM generates content.
- There are two kinds of works to inject the knowledge in KG into LLM
- 1) fuse the knowledge in the KG with the input by directly concatenating them at the language level or the token level
- can easily fuse the knowledge between them
- but with little interaction between knowledge and input.
- 2) applies an additional knowledge-fusion module
- in which the knowledge is updated and simplified
-> I think I need to study KGs further.
3 Problem definition and motivations
3.1 Problem Definition
- I : given story premise from a writer,
- F* : a framework to generate a long-form story S
- C_Plot : score of story S on the plot coherence
- C_Context : score of story S on the context coherence
-> We aim to generate a story S with improved C_Plot and C_Context
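The definitions above can be put into one line. This formalization is my own reading of the problem statement (the sum is my simplification of the bi-objective goal), not an equation taken from the paper:

```latex
S = F^{*}(I), \qquad
F^{*} \in \arg\max_{F}\ \bigl( C_{\text{Plot}}(F(I)) + C_{\text{Context}}(F(I)) \bigr)
```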
3.2 Motivation
- Existing long-form story generation methods
- -> 1) separate the planning and writing stages (based on the plan-write framework)
- impossible to adapt to the uncertainty of the writing stage.
- Besides, stories tend to stop abruptly at the beginning, resulting in missing plots. (my note: only the early part of the story tends to get generated - in practice, when you ask an LLM to write a plot, it tends to cover only the setup and rising action while omitting the second half and the ending)
- -> 2) human-involved detailed outline preferences, improving the flexibility of outlines
- but making the overall storyline uncontrollable or incomplete.
- An intuitive idea arises: what if we could further improve story coherence in terms of both plot fluency and completeness?
- We follow the idea of the hierarchical outline (Yang et al., 2023)
- which is composed of the rough outline and the detailed outline.
- To ensure the plot completeness,
- we consider the rough outline as the macro plot guidance and apply the novel writing theory to guide its generation to ensure it contains all the story stages of the theory.
- (my note: it seems this was done via the prompt, e.g. instructing "generate it so that it covers all the stages of the story arc")
- The detailed outlines are expanded gradually based on the corresponding rough outline
- referring to the relevant content in the preceding story
- In addition, with the increasing length of the generated story, the contextual inconsistency becomes obvious
- Therefore, we design a memory module to store stories and access concise relevant content. (-> how was this implemented? Using a KG)
- Since KG can model information and retrieve the relevant information (Pan et al., 2024),
- -> apply KG to store and access the semantically relevant content with a filter based on LLM.
4 Proposed Method
- a dynamic hierarchical outlining with memory-enhancement long-form story generation method (DOME),
- aiming to improve story coherence in terms of plot and contextual description.
- DOME is mainly composed of two parts:
- the dynamic hierarchical outline and writing module (DHO)
- and the KG-based memory-enhancement module (MEM).
- Rough outlines (R)
- Given user input I containing the story setting, character introductions, and plot requirements, the rough outlines are generated in one pass (under the guidance of the writing theory).
- DHO: how the (partial) stories are generated
- Then, the detailed outline (D) and the corresponding part of the story are generated alternately.
- -> detailed outlines are not expanded from the rough outline incrementally until their preceding stories are generated (my note: detailed outline i -> partial story i -> detailed outline i+1 -> partial story i+1 -> ...)
- and then the detailed outlines guide the generation of sequential stories.
- MEM: stores the generated story as quadruples & retrieves relevant content and injects it when generating detailed outlines and partial stories
- MEM stores user input at the very beginning and the generated story during the writing process.
- The module provides relevant content in natural language for the generation stage, ensuring contextual consistency.
4.1 Dynamic Hierarchical Outline Mechanism
- The hierarchical outline H = {R, D} is composed of a rough outline R and a detailed outline D.
- By incorporating the novel writing theory (Campbell, 1949) (see Fig. 3) into the rough outline planning stage, we improve the overall structure. This is done by stating the theory in the rough outline generation prompt.
- Thus, the rough outline R is generated based on user input I:
- R = LLM(P_rough_outline(I, WT)) (Eq. 1)
- WT: novel writing theory (Fig. 3)
- LLM(·): LLM inference operation
- P_rough_outline: the prompt to generate a rough outline
- The detailed outline D
- is generated step by step
- according to the corresponding part of the rough outline r_i and its relevant content RInfo_i:
- {do_i^t}_{t=1}^{M} = LLM(P_detailed_outline(r_i, RInfo_i)) (Eq. 2)
- -> why does i run from 1 to 5? Generating five detailed outlines seems intended to follow a five-act structure.
- P_detailed_outline: the prompt to generate a detailed outline
- {do_i^t}_{t=1}^{M}: detailed outlines, expanded from the rough outline
- M: the number of detailed outlines expanded from the given rough outline r_i,
- which should be set at the beginning.
- The story content s_i^t
- is generated step by step based on the detailed outline do_i^t and the relevant generated content DInfo_i^t of the sub-detailed outline:
- s_i^t = LLM(P_gen_story(do_i^t, DInfo_i^t)) (Eq. 3)
- P_gen_story: the prompt to generate a partial story
- Algorithm 1 summary:
- Initialize the KG
- -> Generate the rough outline (Eq. 1)
- -> [(loop) for the i-th rough outline, retrieve the relevant content from the KG, and feed the rough outline together with the relevant content into the prompt to generate the i-th detailed outline (Eq. 2)
- -> [(loop) for each sentence do_i^t composing d_i, find the content relevant to do_i^t in the KG
- -> feed the relevant content and do_i^t together into the prompt to generate the partial story s_i^t (Eq. 3)
- -> insert s_i^t into the KG]]
- -> return R, D, S
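The alternating loop above can be sketched in a few lines of Python. This is my own minimal paraphrase, not the authors' code: the prompt templates, the `llm` callable, and the `mem` store/retrieve interface are all hypothetical stand-ins.

```python
# Hypothetical prompt builders; the paper's actual prompts are in its appendix.
def prompt_rough_outline(user_input, theory):
    return f"ROUGH: premise={user_input}; theory={theory}"

def prompt_detailed_outline(rough_part, relevant):
    return f"DETAILED: rough={rough_part}; context={relevant}"

def prompt_gen_story(detailed_item, relevant):
    return f"STORY: outline={detailed_item}; context={relevant}"

def generate_story(llm, mem, user_input, writing_theory):
    """Sketch of Algorithm 1: rough outline -> (detailed outline -> partial story) alternation."""
    mem.store(user_input, chapter=0)                      # MEM stores the premise up front
    rough = llm(prompt_rough_outline(user_input, writing_theory))  # Eq. 1
    detailed, story = [], []
    for i, r_i in enumerate(rough, start=1):
        r_info = mem.retrieve(r_i)                        # RInfo_i from the TKG
        d_i = llm(prompt_detailed_outline(r_i, r_info))   # Eq. 2: expand into M items
        detailed.append(d_i)
        for do_t in d_i:
            d_info = mem.retrieve(do_t)                   # DInfo_i^t from the TKG
            s_t = llm(prompt_gen_story(do_t, d_info))     # Eq. 3: write the partial story
            story.append(s_t)
            mem.store(s_t, chapter=i)                     # new text goes back into the TKG
    return rough, detailed, story
```

The key point the sketch makes concrete is that each detailed outline is only expanded after the stories for the previous outlines already exist in memory.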
4.2 Memory-Enhancement Module
- LLMs tend to ignore the information in the middle of the long inputs.(Liu et al., 2024 "Lost in the middle: How language models use long contexts")
- As the story generation process progresses, the increasing story length affects the LLM’s attention on the relevant content when generating new content,
- increasing computing overhead (Choromanski et al., 2020) and even leading to contextual conflict (Yang et al., 2022).
- my note: that is, as generated content accumulates, the model becomes worse at locating the relevant content (if the previous context were given in plain natural-language form)
- -> propose an additional memory-enhancement module(MEM), to store the generated story and provide concise relevant content.
- KGs and their variants are well-established tools for content storage and query (Wen et al., 2023).
- KGs extract the key information for the content
- like subject, action, and object and ignore unimportant information like adverbs and modifiers.
- -> So which information is extracted from the story, and in what form? Per sentence/paragraph? Literally organizing each sentence into subject/action/object?
- MEM applies temporal knowledge graph (TKG) (Roddick and Spiliopoulou, 2002) to store the generated story.
- TKG is stored by quadruples in the form of < subject, action, object, index >.
- The index means the chapter number of the information.
- The module stores the input story premise at the very beginning and the generated content
- and provides query-relevant content based on entity retrieval in TKG each time the DHO generates content.
- To store entities,
- we use the LLM to extract triples for every sentence and then add the current chapter number to form quadruples.
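The storage step can be sketched as below; `extract_triples` is a hypothetical callable standing in for the paper's LLM extraction prompt, and the sentence splitter is a naive placeholder since the paper does not specify one.

```python
def split_sentences(text):
    # Naive splitter for the sketch; the paper does not say how sentences are segmented.
    return [s.strip() for s in text.split(".") if s.strip()]

def store_chapter(tkg, extract_triples, chapter_text, chapter_idx):
    """Append <subject, action, object, index> quadruples for one chapter.

    tkg:             list of quadruples (the TKG's backing store in this sketch)
    extract_triples: callable(sentence) -> list of (subject, action, object),
                     standing in for the LLM extraction step
    chapter_idx:     chapter number recorded as the temporal index
    """
    for sentence in split_sentences(chapter_text):
        for subj, act, obj in extract_triples(sentence):
            tkg.append((subj, act, obj, chapter_idx))
    return tkg
```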
- To access the relevant content (see Fig. 4),
- we first conduct entity-based quadruple retrieval (Reinanda et al., 2020) on TKG
- myQ) When fetching relevant content, does it examine every quadruple one by one, or search only a certain number using similarity?
- When querying relevant content from the TKG, the subjects and objects of the quadruples extracted from the query are used to retrieve matching quadruples from the TKG.
- Based on a prompt (Fig. 11), the LLM evaluates the relevance between the query and the retrieved quadruples and assigns a relevance score.
- myQ) And how does the "temporal" in TKG differ from an ordinary KG?
- and then apply LLM to filter the retrieved results through semantics correlation by evaluating relevance based on rules (see Fig. 5).
Semantic relevance scoring rules
- Add 1 point if the knowledge contains a subject that is semantically identical or similar to one shown in the partial outline; add 0 if the given knowledge and outline do not meet this criterion. E.g., Helen Keller and Helen are the same subject.
- Add 1 point if the knowledge contains an object that is semantically identical or similar to one shown in the partial outline; otherwise add 0.
- Add 1 point if the knowledge contains an action that is semantically identical or similar to one shown in the partial outline; otherwise add 0. E.g., "remove" and "erase" denote the same action.
- Add 1 point if the knowledge and the outline describe a semantically identical or similar event; otherwise add 0.
- Add 1 point if the knowledge could potentially be used in writing because it can add relevant details or important information to the given outline; otherwise add 0.
The final score is the integer sum of all awarded points, so the score ranges over 1, 2, 3, 4, 5.
MEM can provide the top-k most relevant information as query-relevant content, making it concise and semantically relevant with input (more details in Appendix B.2).
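The retrieve-then-filter pipeline might look like the sketch below. `score_fn` is a stand-in for the rule-based LLM judge (Fig. 11 in the paper), and entity matching is simplified to exact string comparison (the paper uses embedding similarity).

```python
def query_relevant(tkg, query_entities, score_fn, k=5):
    """Entity-based quadruple retrieval on the TKG, then LLM relevance filtering.

    tkg:            list of (subject, action, object, index) quadruples
    query_entities: subjects/objects extracted from the query outline
    score_fn:       callable(quadruple) -> int in 1..5, standing in for the
                    rule-based LLM relevance judge
    k:              number of most relevant quadruples to return
    """
    # Step 1: entity-based retrieval - keep quadruples whose subject or object
    # appears among the query's entities.
    candidates = [q for q in tkg if q[0] in query_entities or q[2] in query_entities]
    # Step 2: rank candidates by the LLM-assigned relevance score, keep the top-k.
    return sorted(candidates, key=score_fn, reverse=True)[:k]
```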
4.3 Temporal Conflict Analyzer
- The evaluation of long-form stories -> used to rely on human evaluators
- MEM applies TKG to store story and it is easy for TKG to associate context (Wen et al., 2023).
- Thus we propose an automatic metric that measures the contextual consistency of the story
- by calculating the ratio of conflict quadruples to total quadruples.
- myQ) What kind of conflict does "conflict" refer to here?
- To detect the conflict quadruples,
- given the quadruples in the TKG of the generated story, Q = {q_i}_{i=1}^{N} (the quadruples stored in the KG),
- they are grouped by rules (Fig. 26) sequentially and without repetition,
- and the potential conflict information is aggregated.
Rules for grouping quadruples in the TKG
Rule 1
- Quadruples of the form <subject (sub), action (act), object (obj), index (idx)> have the same subject and action.
- This means the subject performs the same action on different objects, occurring at different chapter indices as the story progresses.
Rule 2
- Quadruples have the same subject and object.
- This means the subject performs different actions on the same object, occurring at different chapter indices as the story progresses.
Rule 3
- Quadruples have the same action and object.
- This means different subjects perform the same action on the same object, occurring at different chapter indices as the story progresses.
Rule 4
- Quadruples differ only in the chapter index.
- This means the subject performs the same action on the same object, but at different chapter indices as the story progresses.
Rule 5
- Quadruples containing the same entity (subject, object, etc.) have different attributes.
- This means the entity's state changes as the story progresses.
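Rules 1-4 each fix all but one field of the quadruple and require the members to come from different chapters; a sketch of that grouping is below. Rule 5 is omitted (it needs entity attributes beyond the bare quadruple), and the paper's "sequentially and without repetition" deduplication across rules is left out, so this is only an illustration of the rule structure.

```python
from collections import defaultdict

def group_for_conflict_check(quadruples):
    """Group <subject, action, object, index> quadruples by Rules 1-4.

    Returns only groups whose members span more than one chapter index;
    these are the candidate groups passed on to the LLM conflict judge.
    """
    groups = defaultdict(list)
    for q in quadruples:
        subj, act, obj, idx = q
        groups[("R1", subj, act)].append(q)        # Rule 1: same subject + action
        groups[("R2", subj, obj)].append(q)        # Rule 2: same subject + object
        groups[("R3", act, obj)].append(q)         # Rule 3: same action + object
        groups[("R4", subj, act, obj)].append(q)   # Rule 4: identical except chapter index
    return {key: qs for key, qs in groups.items()
            if len({q[3] for q in qs}) > 1}
```

A group under Rule 4, for instance, means the very same fact is restated in different chapters, which the LLM then checks for temporal contradiction.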
- Since LLM-as-a-judge has been verified by experiments (Zheng et al., 2024) and it can evaluate some features of a text within a certain length according to rules (Achiam et al., 2023; Liu et al., 2021),
- -> apply LLM to further detect the conflict quadruples based on time order.(More details in Appendix E.)
- In this way, we can find the quadruples Q^conflict = {q_i^conflict}_{i=1}^{m} containing conflict information.
- The conflict rate is expressed as follows:
- CR = m / N
- N: the number of quadruples in the TKG
- m: the number of conflict quadruples in the TKG.
5 Experiment
Implementation details.
- DOME contains DHO and MEM.
- In DHO,
- every rough outline is expanded into 3 detailed outlines which means the M mentioned in Section 4.1 is set to 3.
- The embedding model we applied is bge-large-en-v1.5 (Chen et al., 2024).
- In MEM,
- we apply cosine similarity (Lahitani et al., 2016) and set the filter threshold to 0.75 to implement the entity-based query.
- Besides, we switch off the default historical content input of LLM where MEM is applied.
- The temperature of LLM is set to 0.5 and the max token of LLM is 1000 while other parameters are kept by default.
- All the experiments are conducted on 2×A800 GPU with CUDA version 11.3.
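The entity-based query with the 0.75 threshold presumably compares entity embeddings by cosine similarity; here is a toy sketch with hand-made vectors standing in for bge-large-en-v1.5 embeddings (the function names and dictionary layout are my own assumptions).

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def entity_matches(query_vec, entity_vecs, threshold=0.75):
    """Return TKG entities whose embedding clears the similarity threshold.

    query_vec:   embedding of the query entity (the paper uses bge-large-en-v1.5)
    entity_vecs: {entity_name: embedding} for entities stored in the TKG
    """
    return [name for name, vec in entity_vecs.items()
            if cosine(query_vec, vec) >= threshold]
```

This avoids exact string matching, so surface variants of the same entity (e.g. "Helen" vs. "Helen Keller") can still be retrieved.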
Datasets and baselines.
- Following the settings of DOC (Yang et al., 2023),
- use its story premises as the input and generate 20 long stories for evaluation.
- The details of the dataset are in Appendix A.
- We use Qwen1.5-72B-chat (Bai et al., 2023) for long-form story generation
- We compare DOME with two types of methods.
- 1) prompting LLM to generate stories as long as possible with input.
- 2) the state-of-the-art (SOTA) methods including DOC (Yang et al., 2023) and Re3 (Yang et al., 2022).
- To validate the scalability of DOME,
- we report the improvement when applying DOME on Llama-3-70b-Instruct (AI@Meta, 2024) and Yi1.5-34B-chat (Young et al., 2024).
Metrics.
- content repetition is an important aspect reflecting the quality of stories in terms of coherence (Yang et al., 2022),
- We apply automatic evaluation metrics including n-gram entropy (Zhang et al., 2018) with n = 2 (Ent-2), which measures lexical diversity and thus the degree of content repetition.
- Besides, we apply our proposed conflict rate (CR.) to evaluate the contextual consistency of the generated stories.
- To evaluate the alignment of the stories with human preferences in terms of contextual consistency and coherence and other basic story quality,
- conduct a human evaluation to evaluate the story quality of plot completeness, coherence, relevance, interest level, and expression coherence.
- We compare all methods according to each indicator and calculate their average rank.
- The description of every metric is as follows:
- 1) Plot Completeness (P Co.): the extent to which the generated story covers all the stages mentioned in the story theory.
- 2) Plot Coherence (P Coh.): the fluency of the development of the generated story.
- 3) Relevance (Rel.): the consistency between the generated story and the input premise. It demonstrates the actual usability of the method.
- 4) Interest level (Int.): the reading interest for the user, since reading interest can reflect people's preferences on the completeness and coherence of stories (Wang et al., 2023b).
- 5) Expression Coherence (ECoh.): the contextual consistency of the story.
- More details of human evaluation are shown in Appendix C.
5.1 Comparison experiments
DOME is consistently better in auto evaluation.
- From Table 1, the proposed DOME achieves superior performance in both CR. and Ent-2.
- The results indicate that our DOME delivers diverse content with fewer conflicts.
- We attribute the improvement to the following facts.
- Firstly, our TKG is capable of fine-grained modeling of generated stories. It provides accurate relevant content through semantic filtering during the generation stage, which ensures consistency and reduces contextual conflicts in the generated content.
- Besides, our DHO dynamically adjusts the detailed outlines during the story generation process, thereby increasing the space for plot development and enhancing content diversity.
DOME is consistently better in human evaluation.
- We conduct a human evaluation to assess how well DOME aligns with human preferences.
- In Table 1, our DOME ranks first across all five metrics. The results demonstrate that MEM can provide relevant content during the generation stage, reducing conflicts due to the LLM's own unclear memory (Zhang et al., 2023) and thus ensuring contextual consistency and enhancing relevance and expression coherence.
5.2 Ablation studies
Effectiveness of DHO.
We remove the DHO so that the outline is fixed and consistent with the input storyline. As shown in Table 2, this variant exhibits higher content conflicts and reduced content variety. Additionally, it ranks lowest across all human evaluation metrics. It supports the notion that dynamic adjustment of the detailed outline helps to control content generation, thereby ensuring plot coherence. Furthermore, the application of novel writing theory enhances plot completeness and thus improves reading interest, because a complete story encourages the reader to read to the end. See the example of DHO in Appendix F.
Effectiveness of MEM.
Similarly, we remove the MEM to eliminate the knowledge graph for reference and resort to the multi-round chatting capability of LLM to generate content. Limited by computing cost, we set the maximum chat rounds to 2. As shown in Table 2, the conflict rate significantly increases from 0.56 to 4.52 and Ent-2 reduces from 12.29 to 10.00 without MEM. Additionally, all human evaluation metrics deteriorate to varying degrees. These results indicate that MEM ensures contextual coherence by providing relevant content during generation, which also improves plot coherence and alignment with human preferences. See the example of MEM in Appendix E.
5.4 Computation and storage cost of DOME
- Table 3 : the computation and storage cost for constructing the knowledge graph dynamically
- The proposed method improves the quality of long-form story generation while incurring acceptable additional computation and storage overhead.
- Specifically, as shown in Tables 1 and 3, the proposed DOME adds only 791 stored nodes and 16 LLM API calls when constructing the KG dynamically, yet outperforms SOTA methods on all performance metrics.
- This is because the structured storage of the KG and the advantages of KG access help the LLM obtain accurate and concise relevant memories, reducing contextual inconsistency.
6 Conclusion
- In this paper, we propose a dynamic hierarchical outlining with memory-enhancement, named DOME, to generate a coherent long-form story.
- Specifically, we propose a dynamic hierarchical outline (DHO) mechanism for long-form story generation, based on the plan-write framework and novel-writing theory.
- The DHO mechanism creates a rough outline that aligns with novel-writing theory and dynamically generates it during the writing process, enhancing plot fluency.
- Additionally, the Memory-Enhancement Module (MEM) uses temporal knowledge graphs to store and access generated stories, providing contextual content for both detailed outline planning and story writing, and reducing contextual conflicts.
- Lastly, the temporal conflict analyzer detects potential conflicts using temporal knowledge graphs and integrates with LLMs to automatically assess the contextual consistency of the generated text.
- Experiments demonstrate that DOME significantly improves the coherence and overall quality of generated longform stories from plot and expression compared to SOTA methods.
Limitations
- The lack of automatic evaluation metrics constrains our experiments.
- Specifically, we have to rely on human evaluation to evaluate the quality of the generated stories which is time-consuming and costly.
- Besides, the amount of experimental data is limited since there are no specific datasets for long-form story generation and we follow the experiment setting of (Yang et al., 2023).
- In addition, our framework requires massive LLM API calls, which are time-consuming. On average, generating a story requires about 200 API calls. Massive API calls result in a limited story generation speed, taking about 4 hours to complete a 7,000-word story. Thus the long text generation and the large number of API calls limit our usage of paid closed-source LLMs, such as ChatGPT.
- Although we report the result about the extensibility of our framework, many steps in our framework are realized based on custom-designed prompts and it is better to re-design these prompts to achieve better performance.
- Although experiments demonstrate the effectiveness and scalability of our proposed DOME, the prompts in DOME are designed based on human experience. Therefore, the use of automatic prompts may further exploit the capabilities of DOME. However, automatic prompting methods often struggle with capturing the nuanced context required for a specific task (Si et al., 2023). Applying them to long-form story generation tasks may not ensure the completeness of long-form stories and further affects the effectiveness and scalability of DOME.