0

I'm relatively new to python, so please bear with me if this is a stupid question.

I need to open a CSV in excel, but said CSV file is formatted poorly -in my understanding and gives me errors when opened in excel. Through out the file it contains multiple rows that look like this:

21904,103909,"8481109920","80","Electromagnetic pressure reducing|valve

-|with a plunger,

-|with an operating pressure of not more than 325MPa,

-|with a plastic connector with 2 silver or tin, or silver-plated, or tin-plated, or silver and tin plated pins","2018-07-01","#NA",49247,"8481109900","80"

This whole part is supposed to be 1 row and there seems to be an error with the formatting of the product description. This happens more than 1000 times, so fixing this by hand is no option for me.

There must be a way to write a script to remove all line breaks within double quotes right?

The dataset is officially provided by the UK government and is available to the public, here is where you can find the link under commodities (commodities table):

https://data.api.trade.gov.uk/v1/datasets/uk-tariff-2021-01-01/versions/v3.0.215/metadata?format=html#table--commodities-report

Or directly download said CSV file:

https://data.api.trade.gov.uk/v1/datasets/uk-tariff-2021-01-01/versions/v3.0.215/tables/commodities-report/data?format=csv

I need to do this for work so any help is greatly appreciated!

I tried to solve the issue with chatgpt and google and got the following script. Unfortunately it only removes the empty lines within the file and not the line breaks within the product descriptions.

# import csv

def remove_line_breaks_within_quotes(input_file, output_file, encoding='utf-8'):
    # Read the CSV file and store its contents in a list
    with open(input_file, 'r', encoding=encoding) as file:
        lines = file.readlines()

# Process each line to remove line breaks within double quotes
inside_quotes = False
processed_lines = []
current_line = ''
for line in lines:
    line = line.strip()
    if not line:
        continue
    if inside_quotes:
        current_line += line.replace('\n', ' ')
        if line.endswith('"'):
            inside_quotes = False
            processed_lines.append(current_line)
            current_line = ''
    else:
        if line.startswith('"') and not line.endswith('"'):
            inside_quotes = True
            current_line = line
        else:
            processed_lines.append(line)

# Write the processed content back to the CSV file
with open(output_file, 'w', newline='', encoding=encoding) as file:
    for line in processed_lines:
        file.write(line)
        file.write('\n')


if __name__ == "__main__":
    input_filename = ""  # Replace this with the name of your input CSV file
    output_filename = ""  # Replace this with the desired name for the output CSV file
    remove_line_breaks_within_quotes(input_filename, output_filename)

print("All done")
InSync
  • 4,851
  • 4
  • 8
  • 30
Falk
  • 11
  • 2
    That entire thing _is_ one row: See [this tio.run](https://tio.run/##VVBNb4MwDL3zKyx2AKRQBcpUqLTDDjvssA9tvyAC00aCJDKBalr325lZkdrlkNjv2c/xc1/@aM12nnXvLHnQNlijepiCoIUHhjafnrQ5PL/FURTlWSULkcltJSsRlkWZZbKqchlyslxPHdaebK8OBr2uwREOw0gIhM1Ys8x5Ut2EQZCeT9ofQYHrRnNAElfIgHVIynP1td@2YKyH3nLsj1yzze9f3pX4p6SGZWZtjeFfWII/IodB80gCBlhTLO8FSbnDYyNW5jZdW5RpFgYuDDhtBl4yl1mZyl0qM07uXh9DUVR5sbv6IVc/2LCEbWQ9LYDsCVgKzdgv22HMHm8IVYMUt0myD4CPY6t93IYfXPytf/ZhcgOzQjLPvw), where `csv.reader` parses it as a single row. Notice the quotes around the multi-line string. This is a valid CSV format. – Pranav Hosangadi Aug 17 '23 at 13:22
  • Removing newlines from a string is a different question which you should be able to find already answered on Stack Overflow. – Pranav Hosangadi Aug 17 '23 at 13:23
  • 1
    Does this answer your question? [What's a quick one-liner to remove empty lines from a python string?](https://stackoverflow.com/questions/1140958/whats-a-quick-one-liner-to-remove-empty-lines-from-a-python-string) – Pranav Hosangadi Aug 17 '23 at 13:24

1 Answers1

1

The newlines appear because they are in the data. The example you posted is newlines that are part of the "commodity__description" for that row. The python csv package will handle them correctly, but if you really wanted to strip them out you can via the re package.

Assuming you download the data from that url and thus have the file "data.csv" then you can try:

import csv
import re

with open("data.csv", "r", encoding="utf-8") as file_in:
    for row in csv.DictReader(file_in):
        if "plunger" in row["commodity__description"]:
            print("-"* 32)
            print("the new lines are just part of the data")
            print("-"* 32)
            print(row)
            print("-"* 32)
            print()

            print("-"* 32)
            print("you can see that here")
            print("-"* 32)
            print(row["commodity__description"])
            print("-"* 32)
            print()

            print("-"* 32)
            print("you can get rid of them though")
            print("-"* 32)
            print(re.sub(r"\s+", " ", row["commodity__description"]))
            print("-"* 32)

            break

That should give you:

--------------------------------
the new lines are just part of the data
--------------------------------
{'id': '20564', 'commodity__sid': '105629', 'commodity__code': '8413302040', 'commodity__suffix': '80', 'commodity__description': 'High-pressure plunger pump for direct diesel injection, with:\n \n-an operating pressure of not more than 275\xa0MPa,\n \n-a camshaft,\n \n-a fluid discharging 
of 15\xa0cm³ per minute or more, but not more than 1\xa0800\xa0cm³ per minute,\n \n-an electric pressure regulating valve', 'commodity__validity_start': '2021-01-01', 'commodity__validity_end': 
'#NA', 'parent__sid': '82932', 'parent__code': '8413302000', 'parent__suffix': '80'}
--------------------------------
--------------------------------
you can see that here
--------------------------------
High-pressure plunger pump for direct diesel injection, with:

-an operating pressure of not more than 275 MPa,

-a camshaft,

-a fluid discharging of 15 cm³ per minute or more, but not more than 1 800 cm³ per minute,       

-an electric pressure regulating valve
--------------------------------
--------------------------------
you can get rid of them though
--------------------------------
High-pressure plunger pump for direct diesel injection, with: -an operating pressure of not more than 275 MPa, -a camshaft, -a fluid discharging of 15 cm³ per minute or more, but not more than 1 800 cm³ per minute, -an electric pressure regulating valve
--------------------------------

Please note that if you just want Excel to sanely open the file, you can do that by setting to proper delimiters prior to opening. Here is a screenshot of me opening the file in LibreOffice with delimiters of a comma and option quotes. I'm sure Excel will provide the same facilities.

File Opened Via LibreOffice

JonSG
  • 10,542
  • 2
  • 25
  • 36
  • Thank you for your response! Yes, the new lines are part of the data. But when I open the CSV in excel each newline is seen as a new ID. Even if it is totally empty. That's why im trying to format the file. Thanks again. I will try to implement your solution! – Falk Aug 18 '23 at 09:03
  • Please note that if you just seek to open the file in Excel have have it behave properly, you can do that by setting the proper delimiters prior to opening. – JonSG Aug 18 '23 at 12:59