Question: How to read a text file in reverse with an iterator in C#


I need to process a large file, around 400K lines and 200 MB. But sometimes I have to process it from the bottom up. How can I use an iterator (yield return) here? Basically I don't want to load everything into memory. I know that it is more efficient to use an iterator in .NET.


75
2018-01-17 06:27




Answers:


Reading text files backwards is really tricky unless you're using a fixed-size encoding (e.g. ASCII). With a variable-size encoding (such as UTF-8), you have to keep checking whether you're in the middle of a character whenever you fetch data.

There's nothing built into the framework, and I suspect you'd have to write separate, hard-coded handling for each variable-width encoding.

EDIT: This has been somewhat tested - but that's not to say it doesn't still have some subtle bugs. It uses StreamUtil from MiscUtil, but I've included just the necessary (new) method from there at the bottom. Oh, and it needs refactoring - there's one pretty hefty method, as you'll see:
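To illustrate the "middle of a character" check for UTF-8, here is a minimal sketch. The class and method names (Utf8Boundary, IsCharacterStart, FindCharacterStart) are made up for this illustration and are not part of the framework or of MiscUtil. It relies only on the UTF-8 rule that continuation bytes have the bit pattern 10xxxxxx, and it does not validate the data.

```csharp
using System;
using System.Text;

public static class Utf8Boundary
{
    // A byte can begin a UTF-8 character unless it is a continuation
    // byte, i.e. unless its top two bits are "10".
    public static bool IsCharacterStart(byte b)
    {
        return (b & 0xC0) != 0x80;
    }

    // Walks backwards from index until it reaches a byte that can begin
    // a character. Hypothetical helper: no validation of the sequence.
    public static int FindCharacterStart(byte[] data, int index)
    {
        while (index > 0 && !IsCharacterStart(data[index]))
        {
            index--;
        }
        return index;
    }
}
```

For the bytes of "a€b" (61 E2 82 AC 62 in hex), FindCharacterStart(bytes, 3) lands on index 1, the lead byte of the euro sign, while index 4 is already a character start and is returned unchanged.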

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace MiscUtil.IO
{
    /// <summary>
    /// Takes an encoding (defaulting to UTF-8) and a function which produces a seekable stream
    /// (or a filename for convenience) and yields lines from the end of the stream backwards.
    /// Only single byte encodings, and UTF-8 and Unicode, are supported. The stream
    /// returned by the function must be seekable.
    /// </summary>
    public sealed class ReverseLineReader : IEnumerable<string>
    {
        /// <summary>
        /// Buffer size to use by default. Classes with internal access can specify
        /// a different buffer size - this is useful for testing.
        /// </summary>
        private const int DefaultBufferSize = 4096;

        /// <summary>
        /// Means of creating a Stream to read from.
        /// </summary>
        private readonly Func<Stream> streamSource;

        /// <summary>
        /// Encoding to use when converting bytes to text
        /// </summary>
        private readonly Encoding encoding;

        /// <summary>
        /// Size of buffer (in bytes) to read each time we read from the
        /// stream. This must be at least as big as the maximum number of
        /// bytes for a single character.
        /// </summary>
        private readonly int bufferSize;

        /// <summary>
        /// Function which, when given a position within a file and a byte, states whether
        /// or not the byte represents the start of a character.
        /// </summary>
        private Func<long,byte,bool> characterStartDetector;

        /// <summary>
        /// Creates a LineReader from a stream source. The delegate is only
        /// called when the enumerator is fetched. UTF-8 is used to decode
        /// the stream into text.
        /// </summary>
        /// <param name="streamSource">Data source</param>
        public ReverseLineReader(Func<Stream> streamSource)
            : this(streamSource, Encoding.UTF8)
        {
        }

        /// <summary>
        /// Creates a LineReader from a filename. The file is only opened
        /// (or even checked for existence) when the enumerator is fetched.
        /// UTF8 is used to decode the file into text.
        /// </summary>
        /// <param name="filename">File to read from</param>
        public ReverseLineReader(string filename)
            : this(filename, Encoding.UTF8)
        {
        }

        /// <summary>
        /// Creates a LineReader from a filename. The file is only opened
        /// (or even checked for existence) when the enumerator is fetched.
        /// </summary>
        /// <param name="filename">File to read from</param>
        /// <param name="encoding">Encoding to use to decode the file into text</param>
        public ReverseLineReader(string filename, Encoding encoding)
            : this(() => File.OpenRead(filename), encoding)
        {
        }

        /// <summary>
        /// Creates a LineReader from a stream source. The delegate is only
        /// called when the enumerator is fetched.
        /// </summary>
        /// <param name="streamSource">Data source</param>
        /// <param name="encoding">Encoding to use to decode the stream into text</param>
        public ReverseLineReader(Func<Stream> streamSource, Encoding encoding)
            : this(streamSource, encoding, DefaultBufferSize)
        {
        }

        internal ReverseLineReader(Func<Stream> streamSource, Encoding encoding, int bufferSize)
        {
            this.streamSource = streamSource;
            this.encoding = encoding;
            this.bufferSize = bufferSize;
            if (encoding.IsSingleByte)
            {
                // For a single byte encoding, every byte is the start (and end) of a character
                characterStartDetector = (pos, data) => true;
            }
            else if (encoding is UnicodeEncoding)
            {
                // For UTF-16, even-numbered positions are the start of a character.
                // TODO: This assumes no surrogate pairs. More work required
                // to handle that.
                characterStartDetector = (pos, data) => (pos & 1) == 0;
            }
            else if (encoding is UTF8Encoding)
            {
                // For UTF-8, bytes with the top bit clear or the second bit set are the start of a character
                // See http://www.cl.cam.ac.uk/~mgk25/unicode.html
                characterStartDetector = (pos, data) => (data & 0x80) == 0 || (data & 0x40) != 0;
            }
            else
            {
                throw new ArgumentException("Only single byte, UTF-8 and Unicode encodings are permitted");
            }
        }

        /// <summary>
        /// Returns the enumerator reading strings backwards. If this method discovers that
        /// the returned stream is either unreadable or unseekable, a NotSupportedException is thrown.
        /// </summary>
        public IEnumerator<string> GetEnumerator()
        {
            Stream stream = streamSource();
            if (!stream.CanSeek)
            {
                stream.Dispose();
                throw new NotSupportedException("Unable to seek within stream");
            }
            if (!stream.CanRead)
            {
                stream.Dispose();
                throw new NotSupportedException("Unable to read within stream");
            }
            return GetEnumeratorImpl(stream);
        }

        private IEnumerator<string> GetEnumeratorImpl(Stream stream)
        {
            try
            {
                long position = stream.Length;

                if (encoding is UnicodeEncoding && (position & 1) != 0)
                {
                    throw new InvalidDataException("UTF-16 encoding provided, but stream has odd length.");
                }

                // Allow up to two bytes for data from the start of the previous
                // read which didn't quite make it as full characters
                byte[] buffer = new byte[bufferSize + 2];
                char[] charBuffer = new char[encoding.GetMaxCharCount(buffer.Length)];
                int leftOverData = 0;
                String previousEnd = null;
                // TextReader doesn't return an empty string if there's line break at the end
                // of the data. Therefore we don't return an empty string if it's our *first*
                // return.
                bool firstYield = true;

                // A line-feed at the start of the previous buffer means we need to swallow
                // the carriage-return at the end of this buffer - hence this needs declaring
                // way up here!
                bool swallowCarriageReturn = false;

                while (position > 0)
                {
                    int bytesToRead = Math.Min(position > int.MaxValue ? bufferSize : (int)position, bufferSize);

                    position -= bytesToRead;
                    stream.Position = position;
                    StreamUtil.ReadExactly(stream, buffer, bytesToRead);
                    // If we haven't read a full buffer, but we had bytes left
                    // over from before, copy them to the end of the buffer
                    if (leftOverData > 0 && bytesToRead != bufferSize)
                    {
                        // Buffer.BlockCopy doesn't document its behaviour with respect
                        // to overlapping data: we *might* just have read 7 bytes instead of
                        // 8, and have two bytes to copy...
                        Array.Copy(buffer, bufferSize, buffer, bytesToRead, leftOverData);
                    }
                    // We've now *effectively* read this much data.
                    bytesToRead += leftOverData;

                    int firstCharPosition = 0;
                    while (!characterStartDetector(position + firstCharPosition, buffer[firstCharPosition]))
                    {
                        firstCharPosition++;
                        // Bad UTF-8 sequences could trigger this. For UTF-8 we should always
                        // see a valid character start in every 3 bytes, and if this is the start of the file
                        // so we've done a short read, we should have the character start
                        // somewhere in the usable buffer.
                        if (firstCharPosition == 3 || firstCharPosition == bytesToRead)
                        {
                            throw new InvalidDataException("Invalid UTF-8 data");
                        }
                    }
                    leftOverData = firstCharPosition;

                    int charsRead = encoding.GetChars(buffer, firstCharPosition, bytesToRead - firstCharPosition, charBuffer, 0);
                    int endExclusive = charsRead;

                    for (int i = charsRead - 1; i >= 0; i--)
                    {
                        char lookingAt = charBuffer[i];
                        if (swallowCarriageReturn)
                        {
                            swallowCarriageReturn = false;
                            if (lookingAt == '\r')
                            {
                                endExclusive--;
                                continue;
                            }
                        }
                        // Anything non-line-breaking, just keep looking backwards
                        if (lookingAt != '\n' && lookingAt != '\r')
                        {
                            continue;
                        }
                        // End of CRLF? Swallow the preceding CR
                        if (lookingAt == '\n')
                        {
                            swallowCarriageReturn = true;
                        }
                        int start = i + 1;
                        string bufferContents = new string(charBuffer, start, endExclusive - start);
                        endExclusive = i;
                        string stringToYield = previousEnd == null ? bufferContents : bufferContents + previousEnd;
                        if (!firstYield || stringToYield.Length != 0)
                        {
                            yield return stringToYield;
                        }
                        firstYield = false;
                        previousEnd = null;
                    }

                    previousEnd = endExclusive == 0 ? null : (new string(charBuffer, 0, endExclusive) + previousEnd);

                    // If we didn't decode the start of the array, put it at the end for next time
                    if (leftOverData != 0)
                    {
                        Buffer.BlockCopy(buffer, 0, buffer, bufferSize, leftOverData);
                    }
                }
                if (leftOverData != 0)
                {
                    // At the start of the final buffer, we had the end of another character.
                    throw new InvalidDataException("Invalid UTF-8 data at start of stream");
                }
                if (firstYield && string.IsNullOrEmpty(previousEnd))
                {
                    yield break;
                }
                yield return previousEnd ?? "";
            }
            finally
            {
                stream.Dispose();
            }
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }
}


// StreamUtil.cs:
using System;
using System.IO;

public static class StreamUtil
{
    public static void ReadExactly(Stream input, byte[] buffer, int bytesToRead)
    {
        int index = 0;
        while (index < bytesToRead)
        {
            int read = input.Read(buffer, index, bytesToRead - index);
            if (read == 0)
            {
                throw new EndOfStreamException
                    (String.Format("End of stream reached with {0} byte{1} left to read.",
                                   bytesToRead - index,
                                   bytesToRead - index == 1 ? "" : "s"));
            }
            index += read;
        }
    }
}

Feedback very welcome. This was fun :)


114
2018-01-17 07:35



You could use File.ReadLines to get a line iterator:

foreach (var line in File.ReadLines(@"C:\temp\ReverseRead.txt").Reverse())
{
    if (noNeedToReadFurther)
        break;

    // process line here
    Console.WriteLine(line);
}

EDIT:

After reading applejacks01's comment, I ran some tests, and it does look like .Reverse() actually loads the whole file.

I used File.ReadLines() to print the first line of a 40 MB file - memory usage of the console app was 5 MB. Then I used File.ReadLines().Reverse() to print the last line of the same file - memory usage was 95 MB.

Conclusion

Whatever Reverse() is doing, it is not a good choice for reading the bottom of a big file.
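The measurement above fits the way LINQ's Reverse() works: it has to consume the whole source sequence before it can hand out the first (that is, last) element. The following sketch makes that visible without a large file. ReverseBufferingDemo, Lines and ProducedBeforeFirstItem are hypothetical names used only for this illustration, and the counter stands in for lines read from disk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReverseBufferingDemo
{
    // Counts how many items the lazy source has produced so far.
    public static int Produced;

    // Stand-in for File.ReadLines: a lazy sequence of "lines".
    public static IEnumerable<string> Lines(int count)
    {
        for (int i = 1; i <= count; i++)
        {
            Produced++;
            yield return "line " + i;
        }
    }

    // Returns how many items the source had produced by the time
    // the foreach loop received its first element.
    public static int ProducedBeforeFirstItem(IEnumerable<string> sequence)
    {
        Produced = 0;
        foreach (string line in sequence)
        {
            break;
        }
        return Produced;
    }

    public static void Run()
    {
        // Plain iteration pulls a single item: prints 1.
        Console.WriteLine(ProducedBeforeFirstItem(Lines(1000)));
        // Reverse() drains the source first: prints 1000.
        Console.WriteLine(ProducedBeforeFirstItem(Lines(1000).Reverse()));
    }
}
```

Breaking out of the loop early therefore saves nothing once Reverse() is in the chain, because every line has already been read by then.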


5
2018-06-09 10:18



I put the file into a list line by line, then used List.Reverse();

        StreamReader objReader = new StreamReader(filename);
        string sLine = "";
        ArrayList arrText = new ArrayList();

        while (sLine != null)
        {
            sLine = objReader.ReadLine();
            if (sLine != null)
                arrText.Add(sLine);
        }
        objReader.Close();


        arrText.Reverse();

        foreach (string sOutput in arrText)
        {

...


2
2017-12-26 14:35



To create a file iterator you can do this:

EDIT:

This is my fixed version of a fixed-width reverse file reader:

public static IEnumerable<string> readFile()
{
    using (FileStream reader = new FileStream(@"c:\test.txt",FileMode.Open,FileAccess.Read))
    {
        int i=0;
        StringBuilder lineBuffer = new StringBuilder();
        int byteRead;
        while (-i < reader.Length)
        {
            reader.Seek(--i, SeekOrigin.End);
            byteRead = reader.ReadByte();
            if (byteRead == 10 && lineBuffer.Length > 0)
            {
                yield return Reverse(lineBuffer.ToString());
                lineBuffer.Remove(0, lineBuffer.Length);
            }
            lineBuffer.Append((char)byteRead);
        }
        yield return Reverse(lineBuffer.ToString());
        reader.Close();
    }
}

public static string Reverse(string str)
{
    char[] arr = new char[str.Length];
    for (int i = 0; i < str.Length; i++)
        arr[i] = str[str.Length - 1 - i];
    return new string(arr);
}

1
2018-01-17 07:27



You can read the file backwards one character at a time and cache all characters until you reach a carriage return and/or line feed.

You then reverse the collected string and yield it as a line.
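The approach described above could be sketched roughly as follows. This is only an illustration under loud assumptions: the stream is seekable, the text uses a single-byte encoding such as ASCII, and lines end in LF or CRLF (every CR byte is simply dropped). BackwardLines and its methods are hypothetical names, not an existing API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BackwardLines
{
    // Yields the lines of a seekable, single-byte-encoded stream,
    // last line first, reading one byte at a time.
    public static IEnumerable<string> Read(Stream stream)
    {
        var reversed = new StringBuilder();
        long position = stream.Length;
        while (position > 0)
        {
            position--;
            stream.Position = position;
            int b = stream.ReadByte();
            if (b == '\n')
            {
                // A line feed as the very last byte only terminates the
                // final line; it does not start an extra empty one.
                if (position != stream.Length - 1)
                {
                    yield return Flush(reversed);
                }
                continue;
            }
            if (b == '\r')
            {
                continue; // assumption: CR only occurs as part of CRLF
            }
            reversed.Append((char)b);
        }
        if (stream.Length > 0)
        {
            yield return Flush(reversed); // first line of the file
        }
    }

    // The characters were collected back to front, so reverse them.
    private static string Flush(StringBuilder collected)
    {
        char[] chars = collected.ToString().ToCharArray();
        Array.Reverse(chars);
        collected.Clear();
        return new string(chars);
    }
}
```

Seeking before every single byte is slow on real files; reading in blocks, as the buffer-based answer does, avoids that cost at the price of more bookkeeping.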


1
2018-01-17 07:40



I know this post is very old, but since I couldn't work out how to use the most-voted solution, I finally found this: here is the best answer I found with a low memory cost, in VB and C#.

http://www.blakepell.com/2010-11-29-backward-file-reader-vb-csharp-source

Hope this helps others, because it took me hours to finally find this post!

[Edit]

Here is the C# code:

//*********************************************************************************************************************************
//
//             Class:  BackwardReader
//      Initial Date:  11/29/2010
//     Last Modified:  11/29/2010
//     Programmer(s):  Original C# Source - the_real_herminator
//                     http://social.msdn.microsoft.com/forums/en-US/csharpgeneral/thread/9acdde1a-03cd-4018-9f87-6e201d8f5d09
//                     VB Conversion - Blake Pell
//
//*********************************************************************************************************************************

using System.Text;
using System.IO;
public class BackwardReader
{
    private string path;
    private FileStream fs = null;
    public BackwardReader(string path)
    {
        this.path = path;
        fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        fs.Seek(0, SeekOrigin.End);
    }
    public string Readline()
    {
        byte[] line;
        byte[] text = new byte[1];
        long position = 0;
        int count;
        fs.Seek(0, SeekOrigin.Current);
        position = fs.Position;
        //do we have trailing \r\n?
        if (fs.Length > 1)
        {
            byte[] vagnretur = new byte[2];
            fs.Seek(-2, SeekOrigin.Current);
            fs.Read(vagnretur, 0, 2);
            if (ASCIIEncoding.ASCII.GetString(vagnretur).Equals("\r\n"))
            {
                //move it back
                fs.Seek(-2, SeekOrigin.Current);
                position = fs.Position;
            }
        }
        while (fs.Position > 0)
        {
            text.Initialize();
            //read one char
            fs.Read(text, 0, 1);
            string asciiText = ASCIIEncoding.ASCII.GetString(text);
            //move back to the character before
            fs.Seek(-2, SeekOrigin.Current);
            if (asciiText.Equals("\n"))
            {
                fs.Read(text, 0, 1);
                asciiText = ASCIIEncoding.ASCII.GetString(text);
                if (asciiText.Equals("\r"))
                {
                    fs.Seek(1, SeekOrigin.Current);
                    break;
                }
            }
        }
        count = int.Parse((position - fs.Position).ToString());
        line = new byte[count];
        fs.Read(line, 0, count);
        fs.Seek(-count, SeekOrigin.Current);
        return ASCIIEncoding.ASCII.GetString(line);
    }
    public bool SOF
    {
        get
        {
            return fs.Position == 0;
        }
    }
    public void Close()
    {
        fs.Close();
    }
}

1
2018-04-09 19:52



I wanted to do the same thing. Here is my code. This class creates temporary files that hold chunks of the big file, which avoids memory bloat. The user can specify whether the file should be reversed; if so, the content is returned in reverse order.

This class can also be used to write large amounts of data into a single file without bloating memory.

Please provide feedback.

        using System;
        using System.Collections.Generic;
        using System.Diagnostics;
        using System.IO;
        using System.Linq;
        using System.Text;
        using System.Threading.Tasks;

        namespace BigFileService
        {    
            public class BigFileDumper
            {
                /// <summary>
                /// Buffer that will store the lines until it is full.
                /// Then it will dump it to temp files.
                /// </summary>
                public int CHUNK_SIZE = 1000;
                public bool ReverseIt { get; set; }
                public long TotalLineCount { get { return totalLineCount; } }
                private long totalLineCount;
                private int BufferCount = 0;
                private StreamWriter Writer;
                /// <summary>
                /// List of files that would store the chunks.
                /// </summary>
                private List<string> LstTempFiles;
                private string ParentDirectory;
                private char[] trimchars = { '/', '\\'};


                public BigFileDumper(string FolderPathToWrite)
                {
                    this.LstTempFiles = new List<string>();
                    this.ParentDirectory = FolderPathToWrite.TrimEnd(trimchars) + "\\" + "BIG_FILE_DUMP";
                    this.totalLineCount = 0;
                    this.BufferCount = 0;
                    this.Initialize();
                }

                private void Initialize()
                {
                    // Delete existing directory.
                    if (Directory.Exists(this.ParentDirectory))
                    {
                        Directory.Delete(this.ParentDirectory, true);
                    }

                    // Create a new directory.
                    Directory.CreateDirectory(this.ParentDirectory);
                }

                public void WriteLine(string line)
                {
                    if (this.BufferCount == 0)
                    {
                        string newFile = "DumpFile_" + LstTempFiles.Count();
                        LstTempFiles.Add(newFile);
                        Writer = new StreamWriter(this.ParentDirectory + "\\" + newFile);
                    }
                    // Keep on adding in the buffer as long as size is okay.
                    if (this.BufferCount < this.CHUNK_SIZE)
                    {
                        this.totalLineCount++; // main count
                        this.BufferCount++; // Chunk count.
                        Writer.WriteLine(line);
                    }
                    else
                    {
                        // Buffer is full, time to create a new file.
                        // Close the existing file first.
                        Writer.Close();
                        // Make buffer count 0 again.
                        this.BufferCount = 0;
                        this.WriteLine(line);
                    }
                }

                public void Close()
                {
                    if (Writer != null)
                        Writer.Close();
                }

                public string GetFullFile()
                {
                    if (LstTempFiles.Count <= 0)
                    {
                        Debug.Assert(false, "There are no files created.");
                        return "";
                    }
                    string returnFilename = this.ParentDirectory + "\\" + "FullFile";
                    if (File.Exists(returnFilename) == false)
                    {
                        // Create a consolidated file from the existing small dump files.
                        // Now this is interesting. We will open the small dump files one by one.
                        // Depending on whether the user require inverted file, we will read them in descending order & reverted, 
                        // or ascending order in normal way.

                        if (this.ReverseIt)
                            this.LstTempFiles.Reverse();

                        foreach (var fileName in LstTempFiles)
                        {
                            string fullFileName = this.ParentDirectory + "\\" + fileName;
                            // FileLines will use small memory depending on size of CHUNK. User has control.
                            var fileLines = File.ReadAllLines(fullFileName);

                            // Time to write in the writer.
                            if (this.ReverseIt)
                                fileLines = fileLines.Reverse().ToArray();

                            // Write the lines 
                            File.AppendAllLines(returnFilename, fileLines);
                        }
                    }

                    return returnFilename;
                }
            }
        }

This service can be used as follows -

void TestBigFileDump_File(string BIG_FILE, string FOLDER_PATH_FOR_CHUNK_FILES)
        {
            // Start processing the input Big file.
            StreamReader reader = new StreamReader(BIG_FILE);
            // Create a dump file class object to handle efficient memory management.
            var bigFileDumper = new BigFileDumper(FOLDER_PATH_FOR_CHUNK_FILES);
            // Set to reverse the output file.
            bigFileDumper.ReverseIt = true;
            bigFileDumper.CHUNK_SIZE = 100; // How much at a time to keep in RAM before dumping to local file.

            while (reader.EndOfStream == false)
            {
                string line = reader.ReadLine();
                bigFileDumper.WriteLine(line);
            }
            bigFileDumper.Close();
            reader.Close();

            // Get back full reversed file.
            var reversedFilename = bigFileDumper.GetFullFile();
            Console.WriteLine("Check output file - " + reversedFilename);
        }

0
2017-10-14 10:51