3

Context: I am trying to query Llama-2 7B, taken from HuggingFace (meta-llama/Llama-2-7b-hf). I give it a question and context (I would guess anywhere from 200-1000 tokens), and ask it to answer the question based on the context (context is retrieved from a vectorstore using similarity search). Here are my two problems:

  1. The answer ends, and the rest of the tokens up to max_new_tokens are all newlines; or it generates no text at all and the entire response is newlines. Adding a repetition_penalty of 1.1 or greater has stopped the infinite newline generation, but still does not get me full answers.
  2. The answers that do get generated are copied word for word from the given context. This remains the same with repetition_penalty=1.1, and making the repetition penalty too high turns the answer into nonsense.

I have only tried temperature=0.4 and temperature=0.8, but in my experiments, tuning temperature and repetition_penalty both result in either the context being copied verbatim or a nonsensical answer.

Note about the "context": I am using a document stored in a Chroma vector store, and similarity search retrieves the relevant information before I pass it to Llama.
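For reference, my pipeline looks roughly like this (a simplified placeholder, not my exact code; the Chroma embedding setup is omitted for brevity):

from langchain.vectorstores import Chroma
from transformers import AutoModelForCausalLM, AutoTokenizer

# vectordb is an existing Chroma store built from my document
# (embedding function omitted here for brevity)
vectordb = Chroma(persist_directory="db")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

query = "Summarize Topic X"
docs = vectordb.similarity_search(query, k=3)          # retrieve 3 sources
context = "\n".join(doc.page_content for doc in docs)  # joined with newlines

prompt = f"Answer the question based on the context.\nContext: {context}\nQuestion: {query}\nAnswer:"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
output = model.generate(
    input_ids,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.4,
    repetition_penalty=1.1,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))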

Example Problem: My query is to summarize a certain Topic X.

query = "Summarize Topic X"

The retrieved context from the vectorstore has 3 sources that look something like this (I format the sources in my query to the LLM, separated by newlines):

context = """When talking about Topic X, Scenario Y is always referred to. This is due to the relation of
Topic X is a broad topic which covers many aspects of life.
No one knows when Topic X became a thing, its origin is unknown even to this day."""

Then the response from Llama-2 directly mirrors one piece of context, and includes no information from the others. Furthermore, it produces many newlines after the answer. If the answer is 100 tokens, and max_new_tokens is 150, I have 50 newlines.

response = "When talking about Topic X, Scenario Y is always referred to. This is due to the relation of \n\n\n\n"

One of my biggest issues is that, in addition to copying a single piece of context, the response also ends mid-sentence whenever that piece of context ends mid-sentence.


Is anyone else experiencing anything like this (newline issue or copying part of your input prompt)? Has anyone found a solution?

2 Answers

2

This is a common issue with pre-trained base models like Llama.

My first thought would be to select a model that has had some sort of instruction tuning, e.g. https://huggingface.co/meta-llama/Llama-2-7b-chat. Instruction tuning improves the model's ability to solve tasks reliably, as opposed to the base model, which is only trained to predict the next token (which is often why the cutoff happens).

The second thing that, in my experience, has helped is using the same prompt format that was used during training. You can see the prompt format Meta used for training and generation in the source code. Here is a thread about it.
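For the chat model, that means wrapping a system prompt and the user message in the [INST]/<<SYS>> tags it was fine-tuned with. A rough sketch (double-check the exact template against Meta's code; the strings below are placeholders):

context = "...retrieved context..."   # placeholder for your retrieved sources
query = "Summarize Topic X"           # placeholder for your question

system_prompt = "Answer the question using only the provided context."
user_message = f"Context:\n{context}\n\nQuestion: {query}"

# Llama-2-chat template; the tokenizer adds the <s> BOS token itself,
# so it is left out of the string here.
prompt = f"""[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]"""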

Finally, using a LogitsProcessor at generation time has been helpful for reducing repetition.
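A minimal sketch with the built-in processors in transformers (assuming model and input_ids are set up as in the question; the values are just starting points to tune):

from transformers import (
    LogitsProcessorList,
    NoRepeatNGramLogitsProcessor,
    RepetitionPenaltyLogitsProcessor,
)

# Block exact 4-gram repeats and apply a mild repetition penalty
# instead of relying on the repetition_penalty argument alone.
processors = LogitsProcessorList([
    NoRepeatNGramLogitsProcessor(4),
    RepetitionPenaltyLogitsProcessor(penalty=1.1),
])

output = model.generate(
    input_ids,
    max_new_tokens=150,
    logits_processor=processors,
)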

Jamie
  • Interesting, thanks for the resources! Using a tuned model helped: I tried TheBloke/Nous-Hermes-Llama2-GPTQ and it solved my problem. It had a clearer prompt format that was used in training (it was actually included in the model card, unlike with Llama-7B). And the new model still worked great even without the prompt format. – user22284215 Aug 03 '23 at 13:14
  • Even for the original non-chat models, I still don't understand why it would repeat the long context. – Saddle Point Aug 22 '23 at 02:05
  • @SaddlePoint Which models are you talking about? My point was that chat models are better at not repeating, since the non-chat base models are trained on a next-word objective, not to follow instructions. – Jamie Aug 24 '23 at 18:59
  • @Jamie I agree with you. I just cannot understand why the issue the OP is facing happens: the model repeats exactly the same context before giving the response. – Saddle Point Aug 25 '23 at 11:51
0

The HuggingFace Transformers generate function returns the prompt token ids concatenated with the newly generated token ids. This is done in this line of code:

https://github.com/huggingface/transformers/blob/021887682224daf29264f98c759a45e88c82e244/src/transformers/generation/utils.py#L2487

I suggest you encode the prompt using the Llama tokenizer, take the length of the prompt token ids, and drop that many tokens from the start of the model output:

prompt = "Who was the third president of the United States?"
prompt_tokens = tokenizer(prompt, return_tensors="pt")["input_ids"]
start_index = prompt_tokens.shape[-1]

output = model.generate(prompt_tokens, num_return_sequences=1)

generation_output = output[0][start_index:]
generation_text = self.tokenizer.decode(generation_output, skip_special_tokens=True)