
I am trying to parse multiline blocks of text from a very large .txt file (300,000+ lines) and write these blocks of text into a new file. Each block of text I need is 42 lines, and the first line of each 42-line block begins with a unique language name.

I have created a text file listing each unique language name I require on a separate line. I have built a list from this text file and want to loop over it to locate each unique language name in the master file, then copy the 42 lines belonging to that name and write this 42-line block of text to a new file.

I am new to programming/Python, and similar questions I could locate were not sufficient to solve my problem, so please excuse any ignorance. I am stuck at the commented section #pseudocode.

from sys import argv

script, from_file, to_file = argv

# Opens input/output files
infile = open(from_file).read()
outfile = open(to_file, 'w')


# Appends all unique language names into a list
langList = []
with open('language-list.txt') as file:
    for language in file:
        name = language.strip()
        langList.append(name)

# Pseudo code of what I want to do
for l in langList:
    find l in infile
    copy 42 lines beginning with match l
    write to outfile

Here is an example of the language-list.txt file:

CENTRAL_GOJAL_WAKHI
TURKMEN

The master text file can be downloaded here: http://email.eva.mpg.de/~wichmann/listss16.zip

The two example languages above can be located within this text file. Although I am interested in parsing about 1000 languages, any suggestion of how to accomplish this for two languages would be sufficient to nudge me in the right direction.

Thank you in advance!

acd
  • I don't entirely understand the problem, but typically any question involving the phrases "unique" and "related value" has the answer "use a [dictionary](https://docs.python.org/2/tutorial/datastructures.html#dictionaries)". – Kevin Sep 16 '14 at 15:24

3 Answers

1

I normally don't do that, but as I have a weakness for linguistics and linguists, here you go:

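import re

swadesh = {}   # language name -> list of 100 word forms
lang = None    # name of the block currently being read

with open('/tmp/listss16.txt') as fp:
    for line in fp:
        # Word line: number, gloss, then the forms up to the first "/"
        m = re.match(r'(\d+)\s+\w+\s+([^/]+)', line)
        if m:
            if lang:
                swadesh[lang][int(m.group(1)) - 1] = m.group(2).strip()
            continue
        # Header line: an upper-case language name followed by "{"
        m = re.match(r'([A-Z]\w+)\{', line)
        if m:
            lang = m.group(1)
            swadesh[lang] = [''] * 100
            continue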

This creates a dict mapping each language name to a list of 100 word forms (empty strings where the file has no entry), e.g.

LITHUANIAN : ['aS', 'yus, tu', 'mes', 'Sitas', 'anas, tas', 'kas, kuris', 'kas', 'ne', 'visas, visi', 'daugelis', 'vienas', 'du, div', 'didelis, platus', 'ilgas, letas', 'maZas, maZutis', 'moteris', 'Zmogus', 'asmuo, Zmogus', 'Zuvis', 'paukStis', 'Suo', 'utele', 'medis', 'sekla, grudas', 'lapas', 'Saknis', 'Zieve', 'oda, kailis', 'mesa, kunas', 'krauyas, giminy~ste', 'kaulas', 'taukai, riebalai', 'kiauSinis', 'ragas', 'uodega, eile', 'plunksna', 'palukas', 'galva', 'ausis', 'akis', 'nosis, nuyautimas', 'burna', 'dantis', 'lieZuvis', 'nagas', 'koya, peda', 'kelis', 'ranka', 'pilvas, skramdis', 'kaklas', 'krutine', 'Sirdis', 'kepeny~s', 'ger, girtau', 'valgy~, iSes', 'kas, gel', 'maty~, Ziure', 'girde, iSklausy~ti', 'Zino, paZin', 'miegas', 'mir', 'uZmuS, nuZudy~', 'plaukio', 'skris, pralek', 'ei, vaikSCio', 'atei, atvy~k', 'gule, bu', 'sede, posedZiau', 'stove, atsisto', 'duo, dovano', 'nuomone, Zodis', 'saule', 'menulis', 'ZvaigZde, Sviesuly~s', 'vanduo', 'lietus', 'akmuo', 'smely~s', 'Zeme', 'debesy~s', 'dumai', 'ugnis, liepsna', 'pelenai', 'deg', 'takas, kelelis', 'kalnas', 'raudonas, paraudes', 'Zalias, nesubrendes', 'geltonas', 'baltas', 'yuodas, tamsus', 'naktis', 'karStas', 'Saltas, abeyingas', 'pilnas, visas', 'nauyas, SvieZias', 'geras, malonus', 'apskritas, apvalus', 'sausas', 'vardas']
KHWARSHI : ['do', 'mo', 'ilo', '', '', '', '', '', '', '', 'hos', 'qw"X$inE', '', '', '', '', '', '%Xadam', 'CuXa', '', 'XXw$E', 'noc"o', 'Xon', '', 'tL~ib', '', '', 'qX~al', '', 'e*q"X~o', 'tL~ozol', '', '', 'SEly~u', '', '', '', '', 'a*hX~a', 'Ezol', 'ma*ni', '', 's3l', 'muc', '', '', 'gurtu', 'litL"a', '', '', 'koko', '', 'Zubu', 'c"oda', '', '', 'aka', 'tuqX~a', '', '', 'uha', '', '', '', '', 'ok"a', '', '', '', '', '', 'buqXX$', '', 'ca', 'Lo', '', 'Xur', '', '', '', '', 'c"o', '', '', 'hu*nE', 'hu*n', '', '', '', '', '', 'rELa', '', '', 'lec"u', 'uc"nu', '', '', '', 'co']
MANDARIN_2 : ['wo', 'ni', 'women', '', '', '', '', '', '', '', 'yi', 'er', '', '', '', '', '', 'ren', 'yu', '', '%gow', 'Sizi, towSi', 'Su', '', 'yezi', '', '', 'pi', '', 'Sie, Swe', 'gu tow', '', '', 'jiao', '', '', '', '', 'erduo', 'yanjiN', 'bizi', '', 'ya, yaCi', 'Setow', '', '', 'Si, Sigai', 'Sow', '', '', 'rufaN', '', 'gan, ganzaN', 'he', '', '', 'jian', 'tiN', '', '', 'si', '', '', '', '', 'lai', '', '', '', '', '', 'taiyaN', '', 'SiNSiN', 'Sui', '', 'Sitow', '', '', '', '', 'huo', '', '', 'xiaolu', 'Ciu, CiuliN', '', '', '', '', '', 'ye', '', '', 'man', 'Sin', '', '', '', 'miNzi, SiNmiN']
WARAO : ['ine', 'zatu', 'oko', 'tamaha', 'tai', 'sina', 'bitu', 'XXX', 'kokotuka', 'era', 'hishaka', 'manamu', 'irija', 'bumija', 'sanuka', 'tija', 'nibora', 'warau', 'homakaba', 'domu', 'beroro', 'ami', 'dau', 'amu', 'dau aroko', 'XXX', 'ahoro', 'horo', 'toma', 'hotu', 'muhu', 'toi', 'ahi', 'akw~ahoi', 'ahu', 'huhi', 'hio', 'kw~a', 'kohoko', 'mu', 'hikoto', 'doko', 'i', 'hono', 'kahanobo, mohusi', 'omu', 'mukuru', 'moho', 'obono', 'do', 'ami', 'kobe', 'mahi', 'takatakaza', 'nahoroza', 'basia', 'mia', 'nokoza', 'naminaza', 'ubaza', 'wabaza', 'naza', 'soitia', 'nebiria', 'naria, za*tia', 'nauza', 'zahia', 'duhuya', 'kanamuya', 'muaza', 'dibu', 'za', 'waniku', 'kura', 'ho', 'naha', 'hozo', 'huhu', 'hobahi', 'nahamutu', 'hehuku', 'hekunu', 'hekohuhu', 'ehohuya', 'hohisi', 'hotakw~ai', 'simo*', 'hebura', 'simosimo, johene', 'hokera', 'anera', 'ima', 'ihija', 'daitera, dehorohera', 'kw~atai', 'hijo', 'zakera', 'kobera', 'nauwaha', 'wai']
KAREN_GEBA : ['ya', 'na', 'pwa', '', '', '', '', '', '', '', 't3bw~a', 'Ci bw~a', '', '', '', '', '', 'by~a', 'ta ph~o', '', 'thw$i*7', 'tr~o7', 'tr~o7', '', 'La*7', '', '', '3ph~3i7', '', 'tr~wi*7', 'khw$i*7', '', '', 'ta 73no', '', '', '', '', 'k3ni7 ku', 'ka ka dr~u', 'k3ni kh~3de', '', 'kotr~o', 'k3pli', '', '', 'kh~a lo ma*7', 's3 kh~o', '', '', 'XXX', '', 'k3to tr~a*7', 'o', '', '', '3sa Ci', 'k3tr~a ha', '', '', 'tr~i', '', '', '', '', 'ke ba*7', '', '', '', '', '', 'lu mu', '', 'sya', 'Ci', '', 'lo7', '', '', '', '', 'mi7', '', '', 'kla*7 do7', 'kh~o la7', '', '', '', '', '', 'lu mu na kh~a, na kh~a', '', '', 'pw~a th~a*7', '3tr~a', '', '', '', 'k3sh~o7 mi']

From this dict it's easy to extract required language(s) or word(s).
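
As a minimal sketch of that extraction step, assuming a dict shaped like `swadesh` above: the helper name `write_languages` and the tab-separated output layout are illustrative assumptions, not part of the answer, and the sample values are copied from the KHWARSHI and WARAO entries shown above.

```python
import io

def write_languages(swadesh, wanted, out):
    """Write one line per (language, word number, form) for each wanted language.

    Returns the names from `wanted` that are not keys of `swadesh`.
    """
    missing = []
    for name in wanted:
        words = swadesh.get(name)
        if words is None:
            missing.append(name)
            continue
        for number, form in enumerate(words, start=1):
            if form:  # skip word slots that are empty in the dict
                out.write('%s\t%d\t%s\n' % (name, number, form))
    return missing

# Shortened sample standing in for the full dict (hypothetical subset)
sample = {'KHWARSHI': ['do', 'mo', 'ilo', ''], 'WARAO': ['ine', 'zatu', 'oko', 'tamaha']}
buffer = io.StringIO()
not_found = write_languages(sample, ['KHWARSHI', 'CENTRAL_GOJAL_WAKHI'], buffer)
```

With the real data, `out` would be a file opened for writing and `wanted` the names read from language-list.txt; the returned list shows which requested names never appeared as a header.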

georg
  • Works great! I am stepping outside of my comfort zone here so I greatly appreciate the assistance. – acd Sep 16 '14 at 16:58
0
  1. 300,000 lines of text is not a lot and fits easily into a few hundred MB of memory. Unless this is time-sensitive, do it the stupid way and load it all into a list.

  2. Collect all your language names in a dictionary instead of a list.

Then iterate over the huge text-file:

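for i, current_line in enumerate(input_list):
    # A block header looks like NAME{...}, so compare only the part before "{"
    name = current_line.split('{', 1)[0].strip()
    if name in languages:
        language_block = input_list[i:i + 42]  # header plus the 41 lines after it
        # do something with it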

Once this works, and if you want to minimize memory consumption, don't load the whole file. Iterate over it line by line instead, using continue to skip lines that are not a wanted header, and handle the 42 lines of a block as soon as its header matches.
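
A minimal sketch of that streaming variant, assuming header lines of the form `NAME{...}` as in the master file. The function name `copy_blocks`, the `block_size` parameter and the tiny demo input are hypothetical, and `itertools.islice` is swapped in for the manual `continue` counting to consume the rest of each block.

```python
import io
from itertools import islice

BLOCK_SIZE = 42  # header line plus the 41 lines that follow it

def copy_blocks(infile, outfile, languages, block_size=BLOCK_SIZE):
    """Stream infile once, copying each block whose header names a wanted language."""
    copied = 0
    for line in infile:
        # Compare only the part before "{" so TURKMEN does not match TURKMEN_2
        name = line.split('{', 1)[0].strip()
        if '{' in line and name in languages:
            outfile.write(line)
            # islice pulls the next block_size - 1 lines from the same iterator,
            # so the outer loop resumes right after the block
            outfile.writelines(islice(infile, block_size - 1))
            copied += 1
    return copied

# Demo on made-up three-line blocks instead of the real 42-line ones
demo = io.StringIO('A{x}\na1\na2\nB{y}\nb1\nb2\nA_2{z}\nc1\nc2\n')
out = io.StringIO()
count = copy_blocks(demo, out, {'A', 'A_2'}, block_size=3)
```

With the script from the question, this would be called roughly as `copy_blocks(open(from_file), outfile, set(langList))`; only one block is held in flight at a time, so memory use stays small regardless of file size.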

Georg Schölly
0

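With readlines() on your opened file you get a list of lines that you can index, so:

infile = open('listss16.txt')
lines = infile.readlines()  # a list of strings, one per line

for line in range(len(lines)):
    for l in langList:
        if l in lines[line]:
            # the slice is a list of strings, so use writelines(), not write()
            outfile.writelines(lines[line:line+42])

Or a one-liner that collects all the matching blocks (each one a list of 42 lines) so they can be written afterwards:

out = [lines[line:line+42] for l in langList for line in range(len(lines)) if l in lines[line]]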
Hrabal
  • When running the initial suggestion I received the error "TypeError: expected a character buffer object". My understanding of this error message is that I'm trying to write a list to outfile, not a string. Shouldn't readlines() return the text as a string? The one liner ran without error but produced a blank document. – acd Sep 16 '14 at 17:52
  • Yeah sorry, readlines() returns a list of strings, so `lines[line:line+42]` is a list, so to write in the file you have to iterate over the list ( `[outfile.write(part) for part in lines[line:line+42]]` ) -- The one liner returns a list of the selected lines in the infile... to write this to the file you have to iterate over this list and write each element to the file. – Hrabal Sep 16 '14 at 17:58
  • p.s: I played a little with the files you linked when at work.. beware that if you search "TURKMEN" in the file you are gonna find "TURKMEN_2" as well.. – Hrabal Sep 16 '14 at 18:04
  • Ah yes makes sense. Thank you, the script works now that I'm iterating over the list. I also noted the occurrence of TURKMEN_2 while writing up that example, but thanks! – acd Sep 16 '14 at 18:20
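
The TURKMEN / TURKMEN_2 pitfall from the comments can be illustrated with a short sketch. The helper `header_name` and the shortened header string are hypothetical; the only assumption about the file is that a block header starts with the language name followed by `{`.

```python
def header_name(line):
    """Return the language name from a header line of the form NAME{...}."""
    return line.split('{', 1)[0].strip()

header = 'TURKMEN_2{...}\n'  # shortened, hypothetical header line
wanted = {'TURKMEN', 'CENTRAL_GOJAL_WAKHI'}

# Substring test, as in "if l in lines[line]": TURKMEN is found inside TURKMEN_2
substring_hit = any(name in header for name in wanted)

# Whole-name comparison against a set: TURKMEN_2 is not one of the wanted names
exact_hit = header_name(header) in wanted
```

Here `substring_hit` is true while `exact_hit` is false, which is why comparing the whole name avoids copying the unwanted TURKMEN_2 block.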