Why does importing a txt file with a "description" variable consisting of multiple words and white spaces in Python not work correctly?

Question

I have a txt-file, and I am importing it using the read method. The goal is to turn the columns into a df. This works fine for the most part. However, the description column consists of multiple words with white spaces. When I try to turn the description variable into a column, only the first word is grabbed, which makes sense, but I want the whole string. For example: PACK OF 20 NAPKINS PANTRY DESIGN only Pack is grabbed.

I tried different workarounds with the index and re.sub, but nothing works really.

I know if I use

ttt = pd.read_csv("csv/???")

it is not a problem and the multiword column is imported correctly.

How can I put the complete description string into one column using a read method?

This maybe a duplicate. Anyway, did not find an answer to this problem here or online. Maybe I do express the question wrongly.

Anybody knows a solution? Thanks.

My code:

    fin_2 = open('csv/ecom_sales_full_cat.txt', 'rt')
    fout_2 = open('csv/ecom_sales_full_cat_2.txt', 'wt')

    for line in fin_2:
    fout_2.write(line.replace('United Kingdom', 'UnitedKingdom').\
               replace('Hong Kong', 'HongKong').replace('Major Category ', 'MajorCategory ').\
               replace('Minor Category', 'MinorCategory'))

fin_2.close()
fout_2.close()enter code here

with open('csv/ecom_sales_full_cat_2.txt' ,'r') as file:
    text= file.read()
f = open('csv/ecom_sales_full_cat_2.txt', 'r')
ecom_sales_full_cat_2_data = f.read()
ecom_sales_full_cat_2_rows = ecom_sales_full_cat_2_data.split('\n')
ecom_sales_full_cat_2_rows = ecom_sales_full_cat_2_rows[0:700]

index_col_5 = []
invoice_no_5 = []
unit_price_5 = []
order_value_5 = []
quantity_5 = []
country_5 = []
invoice_date_5 = []
invoice_time_5 = []
year_month_5 = []
major_category_5 = []
minor_category_5 = []
description_5 = []

for i in ecom_sales_full_cat_2_rows:
    values_full_cat = i.split()
    
    
    index_col_5_value = values_full_cat[0]
    invoice_no_5_value = values_full_cat[1]
    unit_price_5_value = values_full_cat[2]
    order_value_5_value = values_full_cat[3]
    quantity_5_value = values_full_cat[4]
    country_5_value = values_full_cat[5]
    invoice_date_5_value = values_full_cat[6]
    invoice_time_5_value = values_full_cat[7]
    year_month_5_value = values_full_cat[8]
    major_category_5_value = values_full_cat[9]
    minor_category_5_value = values_full_cat[10]
    description_5_value = values_full_cat[11]

    
    index_col_5.append(index_col_5_value)
    invoice_no_5.append(invoice_no_5_value)
    unit_price_5.append(unit_price_5_value)
    order_value_5.append(order_value_5_value)
    quantity_5.append(quantity_5_value)
    country_5.append(country_5_value)
    invoice_date_5.append(invoice_date_5_value)
    invoice_time_5.append(invoice_time_5_value)
    year_month_5.append(year_month_5_value)
    major_category_5.append(major_category_5_value)
    minor_category_5.append(minor_category_5_value)
    description_5.append(description_5_value)

print(invoice_no_5[:10])
print(major_category_5[:10])
print(minor_category_5[:10])
print(description_5[:10])

The txt:

OrderValue Quantity Country InvoiceDate InvoiceTime Year-Month Major Category Minor Category Description 0 549185 0.85 10.20 12 United Kingdom 7/4/11 09:35 2011-04 Clothes Tops PACK OF 20 NAPKINS PANTRY DESIGN

Output

['InvoiceNo', '549185', '576381', '551192', '573553', '539436', '580893', '568321', '552949', '567461']
['MajorCategory', 'Clothes', 'Clothes', 'Kitchen', 'Garden', 'Garden', 'Garden', 'Clothes', 'Clothes', 'Household']
['MinorCategory', 'Tops', 'Shoes', 'Cutlery', 'Turf', 'Hoses', 'Hoses', 'Hats', 'Tops', 'Curtains']
['Description', 'PACK', 'NATURAL', '36', 'SET', 'FINE', 'SEWING', '3', 'BULL', 'STRAWBERRY']

Ormetrom2354 · Answer 1 · 2022-05-27T08:28:52.607

I found a workaround. This works.

df = pd.read_fwf('csv/ecom_sales_full_cat_23.txt')

pd.read_fwf

This does not work in my case:

ecom = pd.read_csv('csv/ecom_sales_full_cat_23.txt', lineterminator='\n', sep=" ", error_bad_lines=False)

The pd.read_csv results in:

ParserError: Error tokenizing data. C error: Expected 79 fields in line 4, saw 83 Sep or delimiter makes no difference. It must have something to do with the white spaces.

However, this is only a workaround, but not the solution, when using a read method.

There is a discussion on stack considering: How to efficiently parse fixed width files?

Not sure if this points in the right direction.

Why does importing a txt file with a "description" variable consisting of multiple words and white spaces in Python not work correctly?

1 Answers1