
I have a small piece of code that uses regex. I'm trying to convert my records to lowercase and strip all punctuation from them, but when I run it I get this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5387: character maps to <undefined>

I want to extract the Record ID and Title for the records whose Languages value is English.

import csv
import re
import numpy

filename = ('records.csv')

def reg_test(name):

    reg_result = ''

    with open(name, 'r') as csvfile:
        reader = csv.DictReader(csvfile)

        for row in reader:
            row = re.sub('[^A-Za-z0-9]+', '', str(row))
            reg_result += row + ','

            if (row['Languages'] == 'English'):
                return reg_result

print(reg_test(filename).lower())
Bob Gilmore
  • Can you add `records.csv` content? – Alderven Aug 02 '19 at 12:54
  • post a testable fragment from your csv – RomanPerekhrest Aug 02 '19 at 13:00
  • Seems to be a problem with your CSV file at position 5387, maybe there is some symbol not present in the current character map (maybe Cyrillic, Chinese, or something like that). You could print the line before the error and then check your CSV file for some strange characters. – Albo Aug 02 '19 at 13:06
  • https://i.postimg.cc/Gm37GGjW/Capture.jpg – Kiril Vodenicharov Aug 02 '19 at 13:06
  • You might need to set the encoding on your streams. Assuming the file is in utf-8, this might help: https://stackoverflow.com/a/844443/3216427 . Or you might need to add the `encoding` parameter to the `open()` call, as in the examples at https://docs.python.org/3/library/csv.html – joanis Aug 02 '19 at 13:09
  • Actually, if I write the code without `regex` and just print the current columns, there is no problem. The problem appears when I try to use `regex` to make it lowercase and remove the punctuation. – Kiril Vodenicharov Aug 02 '19 at 13:09
  • @KirilVodenicharov, instead of guessing you could just post a testable fragment and expected result - so that you will increase your chances to get a quick and workable answer – RomanPerekhrest Aug 02 '19 at 13:12
  • What are your locale settings, by the way? – joanis Aug 02 '19 at 13:14
  • In any case, as others have said, if you could post a small csv file that exhibits the problem, maybe with just a line or two in it, we'll be able to test on our machines. Right now, all we can do is guess and it's unlikely anyone will be able to solve this for you without a test file. – joanis Aug 02 '19 at 13:16
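
A minimal sketch of the `encoding` suggestion from the comments above, assuming the file is UTF-8 (the thread never confirms the real encoding, so treat that as a guess). `read_rows` and `show_bytes_near` are hypothetical helper names, not part of the original code; the second one only helps to look at the raw bytes near the offset reported in the error.

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of dicts using an explicit encoding.

    The utf-8 default is an assumption; pass the file's real encoding
    (for example "latin-1") if it is known to be something else.
    """
    # newline="" is what the csv module documentation recommends for files.
    with open(path, "r", encoding=encoding, newline="") as csvfile:
        return list(csv.DictReader(csvfile))


def show_bytes_near(path, position, width=20):
    """Return the raw bytes around a byte offset, for manual inspection."""
    with open(path, "rb") as f:
        f.seek(max(0, position - width))
        return f.read(2 * width)
```

Something like `show_bytes_near('records.csv', 5387)` should show roughly which character the default `charmap` codec could not decode, which may hint at the encoding to pass to `read_rows`.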

1 Answer

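import re, csv

# sample.csv - contains some samples from original csv file.
# Python 3: open in text mode with an explicit encoding. utf-8 is an
# assumption here; use the real encoding of your file.
with open('sample.csv', 'r', encoding='utf-8', newline='') as f:
    # '-' is escaped so it is a literal hyphen and not the range '?' to '_'
    patt = r'[:;\'".`~!@#$?\-_*()=\[\]\/]+'
    # findall returns runs of punctuation; flatten them into single characters
    puncs = set(''.join(re.findall(patt, f.read())))

with open('sample.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    next(reader)    # skip the header row of the csv file
    data = list(reader)

new_data = []

for row in data:
    d = ','.join(row)
    # drop every punctuation character found above, keep the commas
    nop = ''.join(c for c in d if c not in puncs)

    new_data.append(nop.split(','))

print(new_data)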

output:

[['UkEN000561198', 'article', 'text', '00310182', '', 'QE500', '56045', 'Mesozoic radiolarian biostratigraphy of Japan and collage tectonics along the eastern continental margin of Asia', '', 'Kojima', ' S  Mizutani', ' S', '', 'Netherlands', 'PALAEOGEOGRAPHY PALAEOCLIMATOLOGY PALAEOECOLOGY', 'monthly', '1992', '96', '2Jan', '', '', '', '367', '', 'PALAEOGEOGRAPHY PALAEOCLIMATOLOGY PALAEOECOLOGY 9612', ' 367 1992', '634345'],
['UkEN001027396', 'article', 'text', '03778398', '', 'QE719', '560', 'Late Pliocene climate in the Southeast Atlantic Preliminary results from a multidisciplinary study of DSDP Site 532', '', 'Hall', ' M A  Heusser', ' L  Sancetta', ' C', '', 'Netherlands', 'MARINE MICROPALAEONTOLOGY', '4 issues per year', '1992', '20', '1', '', '', '', '59', '', 'MARINE MICROPALAEONTOLOGY 201', ' 59 1992', '53764']]

Hope this helps.

Bubai
  • Pass the `row` to `.findall()` as a string: `str(row)`. – Bubai Aug 02 '19 at 13:26
  • [![Capture.jpg](https://i.postimg.cc/qqqnpcHT/Capture.jpg)](https://postimg.cc/dZMDBy0H) take a look... – Kiril Vodenicharov Aug 02 '19 at 13:29
  • Still not working; now I get `TypeError: can only concatenate str (not "list") to str`. – Kiril Vodenicharov Aug 02 '19 at 13:32
  • Where do you get the TypeError? There's no need to use `reader = csv.DictReader(csvfile)`, since the file can be opened with the `open()` function and it works fine that way. It would help a lot if you provided a content sample of your `.csv` file that we can test on. – Bubai Aug 02 '19 at 13:35
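
Putting the thread together, here is a minimal sketch of what the question asks for: Record ID and Title of the English records, lowercased and without punctuation. It assumes the CSV has columns named exactly `Record ID`, `Title` and `Languages` (taken from the question's wording, not from a confirmed sample of the file) and that the file is UTF-8. `clean` and `english_records` are hypothetical names.

```python
import csv
import re

# Anything that is not an ASCII letter, digit or whitespace is removed.
# Note that this also drops accented letters, like the question's own regex.
NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]+")


def clean(text):
    """Lowercase the text and strip punctuation, keeping spaces between words."""
    return NON_ALNUM.sub("", text).lower().strip()


def english_records(path, encoding="utf-8"):
    """Return (record id, title) pairs for rows whose Languages is English."""
    results = []
    with open(path, "r", encoding=encoding, newline="") as csvfile:
        for row in csv.DictReader(csvfile):
            # Check the language on the original dict, before any cleaning,
            # so the row is still a mapping and not a flattened string.
            if (row.get("Languages") or "").strip() == "English":
                results.append((clean(row["Record ID"]), clean(row["Title"])))
    return results
```

The main differences from the code in the question are that each field is cleaned separately instead of running `re.sub` over `str(row)`, and that the `Languages` test happens while `row` is still a dict. That also avoids the `TypeError` mentioned in the comments, since nothing concatenates a list with a string.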