I'm relatively new to python, so please bear with me if this is a stupid question.
I need to open a CSV in excel, but said CSV file is formatted poorly -in my understanding and gives me errors when opened in excel. Through out the file it contains multiple rows that look like this:
21904,103909,"8481109920","80","Electromagnetic pressure reducing|valve
-|with a plunger,
-|with an operating pressure of not more than 325MPa,
-|with a plastic connector with 2 silver or tin, or silver-plated, or tin-plated, or silver and tin plated pins","2018-07-01","#NA",49247,"8481109900","80"
This whole part is supposed to be 1 row and there seems to be an error with the formatting of the product description. This happens more than 1000 times, so fixing this by hand is no option for me.
There must be a way to write a script to remove all line breaks within double quotes right?
The dataset is officially provided by the UK government and is available to the public, here is where you can find the link under commodities (commodities table):
Or directly download said CSV file:
I need to do this for work so any help is greatly appreciated!
I tried to solve the issue with chatgpt and google and got the following script. Unfortunately it only removes the empty lines within the file and not the line breaks within the product descriptions.
# import csv
def remove_line_breaks_within_quotes(input_file, output_file, encoding='utf-8'):
# Read the CSV file and store its contents in a list
with open(input_file, 'r', encoding=encoding) as file:
lines = file.readlines()
# Process each line to remove line breaks within double quotes
inside_quotes = False
processed_lines = []
current_line = ''
for line in lines:
line = line.strip()
if not line:
continue
if inside_quotes:
current_line += line.replace('\n', ' ')
if line.endswith('"'):
inside_quotes = False
processed_lines.append(current_line)
current_line = ''
else:
if line.startswith('"') and not line.endswith('"'):
inside_quotes = True
current_line = line
else:
processed_lines.append(line)
# Write the processed content back to the CSV file
with open(output_file, 'w', newline='', encoding=encoding) as file:
for line in processed_lines:
file.write(line)
file.write('\n')
if __name__ == "__main__":
input_filename = "" # Replace this with the name of your input CSV file
output_filename = "" # Replace this with the desired name for the output CSV file
remove_line_breaks_within_quotes(input_filename, output_filename)
print("All done")