0

I am trying to extract information from a text file and store each "paragraph". By paragraph I mean the date (always the first thing on the line) together with whatever description is associated with that date (the text right after that date, up to the next date). The .txt file looks like this:

September 2013. **I NEED THE DATA THAT WOULD BE WRITTEN HERE STORED WITH ITS DATE HOWEVER 
WHEN ANOTHER DATE SHOWS UP IT NEEDS TO BE SEPARATED
September 2013. blah blah balh this is an example blah blaha blah I need the information hereblah blah balh this is an example blah blaha blah I need the information here
blah blah balh this is an example blah blaha blah I need the information here
August 2013. blah blah balh this is an example blah blaha blah I need the information here
August 2013.blah blah balh this is an example blah blaha blah I need the information here
blah blah balh this is an example blah blaha blah I need the information hereblah blah balh this is an example blah blaha blah I need the information hereblah blah balh this is an example blah blaha blah I need the information here
June 2013. blah blah balh this is an example blah blaha blah I need the information hereeeeee

The number of lines that follow a date is not fixed.

I am able to find every line starting with a date using

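months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

with open("test.txt", encoding="utf8") as f:
    for line in f:
        for month in months:
            if month in line:
                print(line)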

but this outputs

May 2014. only the first line is taken in and not the rest of the paragraph

April 2013. only the first line is taken in and not the rest of the paragraph

December 2013. only the first line is taken in and not the rest of the paragraph

November 2012. only the first line is taken in and not the rest of the paragraph
Hugo
  • And your attempts? Check out [ask] and [mre] if you haven't yet. – Filip Müller Jul 30 '22 at 22:05
  • Test to see if the first two words of a line can be parsed to a date, then are immediately followed by a period/full stop. Grab the rest of that line, and the next lines until you find another date. – MattDMo Jul 30 '22 at 22:05
  • You could use Python's regular expression module [`re`](https://docs.python.org/3/library/re.html#module-re) to match the date format pattern. – martineau Jul 30 '22 at 22:08
  • @MattDMo that is exactly what I am trying to do, and I am able to grab when it finds a date with a period/fullstop. But I am unable to read the next lines "until it finds another date" – Hugo Jul 30 '22 at 22:20
  • Please [edit] your question and post the code you have so far, along with its output and the [*full text* of any errors or tracebacks](https://meta.stackoverflow.com/q/359146). Please don't post images of text. – MattDMo Jul 30 '22 at 22:23

2 Answers

1

If the file you read fits in memory, the best option is usually to read the complete file and then operate on it.

If you might have huge files (100 MB or more), you might want to read in chunks:

https://stackoverflow.com/a/519653/562769

However, this means that you need to write more complex logic to deal with those chunks.
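As a rough illustration of that extra logic, here is a minimal sketch of chunked reading that still hands back whole lines. The helper name `iter_lines_in_chunks` and the chunk size are assumptions made for this example, not something from the linked answer. The main complication is that a line can straddle two chunks, so the incomplete tail of each chunk has to be carried over.

```python
import io


def iter_lines_in_chunks(fp, chunk_size=64 * 1024):
    """Yield complete lines from fp, reading chunk_size characters at a time."""
    leftover = ""
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        parts = (leftover + chunk).split("\n")
        # The last piece may be an incomplete line; keep it for the next chunk.
        leftover = parts.pop()
        yield from parts
    if leftover:
        # Final line of a file that does not end with a newline.
        yield leftover


# In-memory stand-in for a real file, read in deliberately tiny chunks.
sample = io.StringIO("June 2013. first\nsecond line\nMay 2013. third")
lines = list(iter_lines_in_chunks(sample, chunk_size=8))
```

With a real file, `fp` would be the object returned by `open(...)` in text mode.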

Reading by lines doesn't make sense if your lines can become arbitrarily large. For the OS/file system, a line is not a meaningful unit. A newline character is just one character in a bigger file, like any other character.

Regarding the line matching, you could do something like this:

with open("file.txt") as fp:
    data = fp.read()

# matches() and operate() are placeholders, described below
for line in data.split("\n"):
    if matches(line):
        operate(line)

Here, matches is a function that checks whether your date condition is met, and operate does whatever you want to do with the line.

The matches function could use several if/elif statements or regular expressions (the re module). Using split, startswith, or "pattern" in "haystack" might also be useful.
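For example, a regex-based matches could look roughly like this. It is only a sketch: it assumes the date format from the question's sample (an English month name, a space, a four-digit year, then a period at the very start of the line), and the names `MONTHS` and `DATE_RE` are made up for the example.

```python
import re

# Assumed date format: "<Month> <YYYY>." at the start of the line.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)
DATE_RE = re.compile(r"^(?:%s) \d{4}\." % "|".join(MONTHS))


def matches(line):
    """Return True if the line starts with something like 'August 2013.'."""
    return DATE_RE.match(line) is not None
```

Anchoring the pattern at the start of the line avoids false positives from a month name that merely appears somewhere inside a description.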

Martin Thoma
  • What question are you answering? – martineau Jul 30 '22 at 22:11
  • 100MB is not "huge" by any stretch of the imagination, unless you're using a computer from the 1990s. And I agree with martineau - how does this relate to the OP's problem at all? – MattDMo Jul 30 '22 at 22:13
  • He asked about reading specific lines from a file. I told him that this doesn't make sense in most cases. I agree that 100 MB is easy to handle, but I disagree about the notion "100 MB isn't huge". 100MB of natural language text is enormous. If you receive an email with text only which is 100 MB, you would need many hours to read – Martin Thoma Jul 30 '22 at 22:16
  • I'm not comparing it with my TB of data I handle at work. I see that this is a beginner question with likely some simple automation. I'm saying if things get in that order of magnitude, you start thinking about buffers / performance of reading. Not before as it's just not worth the effort – Martin Thoma Jul 30 '22 at 22:18
  • As a side note: Smartphones still don't have a lot of memory and there are a lot of things like Raspberry Pi floating around. I'm also uncertain how much memory chromebooks / ebook readers and similar have. It always depends on the platform you're using – Martin Thoma Jul 30 '22 at 22:21
0

This will work, assuming every line begins with a month and year separated by a space. However, your example text has a line that does not begin with a month/year, which makes me wonder whether you expect such lines to be rejected.

with open('filename.txt', 'r') as f:
    data = f.readlines()

for line in data:
    words = line.strip().split(' ')
    date = ' '.join(words[0:2])   # first two words, e.g. "September 2013."
    desc = ' '.join(words[2:])    # everything after the date
    print(f'{date} | {desc}\n')
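If, instead, lines that do not begin with a month/year are meant to be continuations of the previous entry, something along these lines might be closer to what you want. This is a sketch under assumptions: the date format is taken from the sample in the question ("Month YYYY." at the start of a line, with or without a space after the period), and `DATE_RE` and `group_entries` are names invented for the example.

```python
import re

# Assumed format: "<Month> <YYYY>." at the start of a line, optional space after.
DATE_RE = re.compile(
    r"^((?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4})\.\s*(.*)$"
)


def group_entries(lines):
    """Group lines into (date, description) pairs, one pair per date line."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        m = DATE_RE.match(line)
        if m:
            # A new date starts a new entry.
            entries.append((m.group(1), [m.group(2)]))
        elif entries and line.strip():
            # Non-date, non-blank line: continuation of the current entry.
            # Lines that appear before the first date are dropped.
            entries[-1][1].append(line)
    return [(date, "\n".join(desc)) for date, desc in entries]


sample = [
    "September 2013. first\n",
    "continued\n",
    "August 2013.second\n",
]
result = group_entries(sample)
```

A file object can be passed directly, since iterating over it yields lines: `with open('filename.txt', encoding='utf8') as f: entries = group_entries(f)`. Repeated dates stay as separate entries, which is why the result is a list rather than a dict.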