textgrid_convert package¶
Submodules¶
textgrid_convert.ArgParser module¶
Functionality to parse CLI arguments.
-
textgrid_convert.ArgParser.arg_parser= ArgumentParser(prog='sphinx-build', usage=None, description='convert srt and sbv files to Praat textgrid', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True)¶ Set up the read, write, and convert arguments.
textgrid_convert.ParserABC module¶
Abstract Base class for implementing transcription parsers
-
class
textgrid_convert.ParserABC.ParserABC¶ Bases: object
Abstract base class for Parsers to feed textgrid conversion
-
from_file()¶ Read file from disk
-
parse_timestamp(timestamp)¶ Convert timestamp to datetime.time
Parameters: timestamp (str) – Returns: timestamp in milliseconds
-
parse_transcription(transcription)¶ Convert transcription input to transcription dictionary
Parameters: transcription (str) –
-
to_file()¶ Write file to disk
-
to_textgrid(input_dict=None, output_file=None, speaker_name='Speaker1', adapt_endstamps=0.001)¶ FIXME: add output_file. Convert internal dict to Praat TextGrid format. “Specs” here: http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html Time needs to be in secs.millisecs, rounded to 2 decimal places.
Parameters: - speaker_name (str) –
- adapt_endstamps (float) – if given, will adjust each end stamp so that it falls before the following start stamp
Returns: TextGrid compatible string
-
transcription= None¶
-
transcription_dict= None¶
-
unique_id= None¶
-
textgrid_convert.iotools module¶
Collect read and write functions.
-
textgrid_convert.iotools.filewriter(filename, outstring, strict=True)¶ Parameters: - filename (str) –
- outstring (str) –
- strict (Bool) – if True, will not overwrite
Returns: True, False
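The overwrite guard described above can be sketched as follows. This is an illustration only, not the package's code: `filewriter_sketch` is a hypothetical name, and returning False when a strict write is refused is an assumption based on the documented "True, False" return values.

```python
import os
import tempfile


def filewriter_sketch(filename, outstring, strict=True):
    """Write outstring to filename; refuse to overwrite when strict is True."""
    # Assumption: a refused write is reported as False rather than raised.
    if strict and os.path.exists(filename):
        return False
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write(outstring)
    return True


# Usage: the second strict write is refused; a non-strict write replaces the file.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "out_TEXTGRID.txt")
    first = filewriter_sketch(target, "first")
    second = filewriter_sketch(target, "second")
    forced = filewriter_sketch(target, "third", strict=False)
    with open(target, encoding="utf-8") as handle:
        final_content = handle.read()
```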
textgrid_convert.preproctools module¶
Data preprocessing tools.
-
textgrid_convert.preproctools.adapt_timestamps(input_dict, gap=0.1)¶ Adapt time end stamps to not overlap with following start time stamp.
Parameters: - input_dict (dict) – dictionary with timestamps, e.g. self.transcription_dict in a Parser
- gap (float) – gap to introduce between end and start index after adapt
Returns: dict
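One way to read the behaviour described above is shown below. This is a sketch under stated assumptions: the function name is hypothetical, each chunk is assumed to carry "start" and "end" keys in the same unit as gap, and chunk ids are assumed to sort in playback order. The package's actual rules may differ.

```python
import copy


def adapt_timestamps_sketch(input_dict, gap=0.1):
    """Return a copy in which no end stamp reaches into the next chunk's start."""
    result = copy.deepcopy(input_dict)  # leave the caller's dict untouched
    chunk_ids = sorted(result)
    for current_id, next_id in zip(chunk_ids, chunk_ids[1:]):
        latest_end = result[next_id]["start"] - gap
        # Only shorten a chunk; never extend one that already ends early enough.
        if result[current_id]["end"] > latest_end:
            result[current_id]["end"] = latest_end
    return result


chunks = {1: {"start": 0, "end": 1500}, 2: {"start": 1000, "end": 2000}}
adapted = adapt_timestamps_sketch(chunks, gap=100)
```

Integer milliseconds are used in the usage lines to keep the arithmetic exact; with float seconds the same logic applies but results are subject to floating-point rounding.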
textgrid_convert.revParser module¶
-
textgrid_convert.revParser.parse_revstamp(timestamp)¶ Convert timestamp from rev format (00:00:20,000) to ms
Parameters: timestamp (str) – Returns: int
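For illustration, a conversion from the HH:MM:SS,mmm layout shown above to integer milliseconds might look like this. The helper name is hypothetical and the sketch assumes exactly one comma and three colon-separated clock fields; it is not taken from the package source.

```python
def comma_timestamp_to_ms(timestamp):
    """Convert 'HH:MM:SS,mmm' to integer milliseconds."""
    clock, millis = timestamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    total_seconds = (hours * 60 + minutes) * 60 + seconds
    return total_seconds * 1000 + int(millis)
```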
-
class
textgrid_convert.revParser.revParser(transcription)¶ Bases: textgrid_convert.ParserABC.ParserABC
# transcription dict is formatted like so: {chunk_id(int): {“speaker_name”: “”, “text”: “”, “start”: float, “end”: float}}
-
parse_timestamp(timestamp)¶ Convert from rev timestamps to ms
-
parse_transcription(speaker=None)¶ Specs are here: https://www.rev.com/api/attachmentsgetcontent
-
speakers= ()¶
-
to_darla_textgrid(speaker_id=None, alias='sentence')¶ Change TextGrid to the format DARLA understands: only “sentence” grids
Parameters: speaker_id (int) – ID of the speaker to keep, will default to first found Returns: str to be fed into DARLA
-
textgrid_convert.sbvParser module¶
-
class
textgrid_convert.sbvParser.sbvParser(transcription)¶ Bases: textgrid_convert.ParserABC.ParserABC
Read and parse an sbv formatted file. Unofficial specs here: GGL
-
file_name¶ Type: optional
-
sbv_text¶ Type: str
-
parse_timestamp(timestamp)¶ Convert timestamps from sbv format 0:00:00.599 to ms
-
parse_transcription(transcription, time_stamp_sep=', ')¶ Pull the stuff from sbv into a dictionary of format {chunk_id: { “speaker”: str, “text”: str, “start”: int, “end”: int}}
Parameters: - transcription (str) –
- time_stamp_sep (str) –
Returns: dict as described above
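The dictionary layout described above can be produced along these lines. Everything here is a sketch: the function names are hypothetical, records are assumed to be separated by blank lines, chunk ids are assumed to start at 1, and the separator defaults to a bare comma to match the sample text rather than the documented default of ', '.

```python
def _dot_timestamp_to_ms(timestamp):
    """Convert 'H:MM:SS.mmm' to integer milliseconds."""
    hours, minutes, seconds = timestamp.strip().split(":")
    whole, _, fraction = seconds.partition(".")
    millis = int((fraction + "000")[:3])  # pad or cut to three digits
    return ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 1000 + millis


def parse_sbv_sketch(transcription, time_stamp_sep=",", speaker="Speaker 1"):
    """Build {chunk_id: {"speaker", "text", "start", "end"}} from sbv text."""
    chunks = {}
    records = [rec for rec in transcription.strip().split("\n\n") if rec.strip()]
    for chunk_id, record in enumerate(records, start=1):
        lines = record.strip().splitlines()
        start, end = lines[0].split(time_stamp_sep)
        chunks[chunk_id] = {
            "speaker": speaker,
            "text": " ".join(line.strip() for line in lines[1:]),
            "start": _dot_timestamp_to_ms(start),
            "end": _dot_timestamp_to_ms(end),
        }
    return chunks


sample = (
    "0:00:00.599,0:00:04.160\nHello there\n\n"
    "0:00:04.160,0:00:06.770\nSecond line\ncontinued\n"
)
parsed = parse_sbv_sketch(sample)
```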
-
sbv_generator(filein, separator='')¶ Parameters: - filein (file read object or other iterable) –
- separator (str) – separator between records
Returns: generator over chunk_id, timestamp, text FIXME: deque here
-
sbv_textparse(speaker_and_text, speaker='Speaker 1', speaker_regex=re.compile('[A-Z]+:'))¶ Parameters: speaker_and_text (str) – Returns: tuple (SPEAKER(str), text(str))
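A minimal sketch of splitting a leading speaker label from the text, using the default pattern from the signature above. The function name is hypothetical, and two details are assumptions: the label is only recognised at the very start of the string, and the trailing colon is dropped from the returned speaker.

```python
import re


def split_speaker_sketch(speaker_and_text, speaker="Speaker 1",
                         speaker_regex=re.compile("[A-Z]+:")):
    """Return (speaker, text); fall back to the default speaker if no label."""
    match = speaker_regex.match(speaker_and_text)
    if match is None:
        return speaker, speaker_and_text.strip()
    label = match.group(0).rstrip(":")
    return label, speaker_and_text[match.end():].strip()
```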
-
to_darla_textgrid(speaker_id=None, speaker_name=None, alias='sentence')¶ Change TextGrid to the format DARLA understands: only “sentence” grids
Parameters: - speaker_id (int) – NA for sbvs
- speaker_name (str) – name of the speaker to extract
- alias – the name to use for texttier – DARLA wants ‘sentence’
Returns: str to be fed into DARLA
-
textgrid_convert.srtParser module¶
-
class
textgrid_convert.srtParser.srtParser(transcription)¶ Bases: textgrid_convert.ParserABC.ParserABC
Read and parse an srt formatted file. Unofficial specs here: http://forum.doom9.org/showthread.php?p=470941#post470941
-
file_name¶ Type: optional
-
srt_text¶ Type: str
-
parse_timestamp(timestamp)¶ Convert from srt style timestamp 00:59:58,89 to ms
Parameters: timestamp (str) – Returns: int
-
parse_transcription(srt_text=None, speaker_name='Speaker 1', time_stamp_sep=' --> ')¶ Pull the stuff from srt into a dictionary of format {chunk_id: {“text”: “”, “start”: int, “end”: int}}
Parameters: - srt_text (str) –
- speaker_name (str) –
- time_stamp_sep (str) – placeholder between start and end time stamp
Returns: dict as described above
-
srt_generator(filein, separator='\n')¶ Parameters: - filein (file read object or other iterable) –
- separator (str) – separator between records
Returns: generator over chunk_id, timestamp, text
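A generator of this shape could be sketched as below. The names are hypothetical, and the sketch assumes the common srt layout: an index line, a timestamp line, one or more text lines, then a blank line that closes the record.

```python
def _split_record(record):
    """Split the collected lines of one record into (chunk_id, timestamp, text)."""
    chunk_id = int(record[0])
    timestamp = record[1]
    text = " ".join(record[2:])
    return chunk_id, timestamp, text


def srt_records_sketch(lines):
    """Yield (chunk_id, timestamp, text) for each blank-line-separated record."""
    record = []
    for raw_line in lines:
        line = raw_line.strip()
        if line:
            record.append(line)
        elif record:
            yield _split_record(record)
            record = []
    if record:  # the last record may not be followed by a blank line
        yield _split_record(record)


sample = (
    "1\n00:00:01,000 --> 00:00:04,000\nHello there\n\n"
    "2\n00:00:04,500 --> 00:00:06,000\nSecond line\ncontinued\n"
)
records = list(srt_records_sketch(sample.splitlines()))
```

Because the function only iterates over its argument, an open file object works as well as a list of lines.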
-
to_darla_textgrid(speaker_id=None, speaker_name=None, alias='sentence')¶ Change TextGrid to the format DARLA understands: only “sentence” grids
Parameters: - speaker_id (int) – NA for sbvs
- speaker_name (str) – name of the speaker to extract
- alias – the name to use for texttier – DARLA wants ‘sentence’
Returns: str to be fed into DARLA
-
textgrid_convert.textgridtools module¶
Collect TextGrid related functionality here
-
textgrid_convert.textgridtools.collect_chunk_values(input_dict, key, strict=True)¶ Collect all values associated with chunks in input_dict.
Parameters: - input_dict (dict) – {chunk_id: {key: value}}
- key (str) –
- strict (Bool) – if True, will error out if key not present
Returns: list of results
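The collection step can be illustrated as follows. The function name is hypothetical; raising KeyError in strict mode and silently skipping chunks without the key otherwise are both assumptions, since the exception type is not documented above.

```python
def collect_chunk_values_sketch(input_dict, key, strict=True):
    """Gather input_dict[chunk_id][key] for every chunk, in dict order."""
    results = []
    for chunk_id, chunk in input_dict.items():
        if key in chunk:
            results.append(chunk[key])
        elif strict:
            raise KeyError(f"key {key!r} missing from chunk {chunk_id!r}")
    return results


chunks = {1: {"text": "a"}, 2: {"text": "b", "speaker": "X"}}
texts = collect_chunk_values_sketch(chunks, "text")
speakers = collect_chunk_values_sketch(chunks, "speaker", strict=False)
```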
-
textgrid_convert.textgridtools.ms_to_textgrid(milliseconds, strict=True)¶ Convert milliseconds to textgrid appropriate format, e.g. 12.88
Parameters: - milliseconds (int) –
- strict (Bool) – if True, will error out if no int given
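As an illustration, the conversion to a seconds value with two decimal places might be written like this. The name is hypothetical, and returning a float (rather than a string) and raising TypeError in strict mode are assumptions.

```python
def ms_to_seconds_sketch(milliseconds, strict=True):
    """Convert integer milliseconds to seconds, rounded to two decimals."""
    if strict and not isinstance(milliseconds, int):
        raise TypeError(f"expected int, got {type(milliseconds).__name__}")
    return round(int(milliseconds) / 1000, 2)
```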
-
textgrid_convert.textgridtools.to_long_textgrid(tier_dict, tier_key='speaker_name', tier_class='IntervalTier')¶ Create long form TextGrid, cf. specs here: http://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html.
Parameters: - tier_dict (dict) –
- tier_key (str) –
- tier_class (str) – tier class to use for TextGrid
Returns: str of TextGrid
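To show what a long form TextGrid looks like, the sketch below writes a single tier from a list of (start, end, text) tuples in seconds. It does not use the package's tier_dict structure, which is not documented here; the function name and input shape are hypothetical. It also does not fill gaps between intervals, so the input is assumed to be contiguous and ordered.

```python
def long_textgrid_sketch(intervals, tier_name="Speaker 1", tier_class="IntervalTier"):
    """Render one tier as a long-format Praat TextGrid string."""
    xmin = intervals[0][0]
    xmax = intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        f'        class = "{tier_class}"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for index, (start, end, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # Praat doubles quotes inside strings
        lines.extend([
            f"        intervals [{index}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ])
    return "\n".join(lines) + "\n"


grid = long_textgrid_sketch([(0, 1.5, "hello"), (1.5, 2.3, 'say "hi"')])
```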
-
textgrid_convert.textgridtools.to_short_textgrid(tier_dict)¶
-
textgrid_convert.textgridtools.to_textgrid_time(timestamp, split_char='.')¶ FIXME: deprecate. Output needs to be in milliseconds, round to 2
Parameters: timestamp (str) – Returns
textgrid_convert.ttextgrid_convert module¶
-
textgrid_convert.ttextgrid_convert.convert_to_darla(input_file, source_format, speaker_name='Speaker 1')¶ Convert from source_format in input_file to DARLA-compatible TextGrid
Parameters: - input_file (str) – path to input srt or sbv file to read from
- source_format (str) – either sbv or srt
- speaker_name (str) – optional speaker name
Returns: TextGrid formatted string
-
textgrid_convert.ttextgrid_convert.convert_to_txtgrid(input_file, source_format, speaker_name='Speaker 1')¶ Convert from source_format in input_file to TextGrid.
Parameters: - input_file (str) – path to input srt or sbv file to read from
- source_format (str) – either sbv or srt
- speaker_name (str) – optional speaker name
Returns: TextGrid formatted string
-
textgrid_convert.ttextgrid_convert.folder_source_format(input_folder, file_types=['.srt', '.sbv', '.json', '.rev'])¶ Check whether files in input_folder have sbv or srt endings
Parameters: - input_folder (str) –
- file_types (iterable of str) – file endings to consider
Returns: str srt or sbv
Raises: ValueError if mix of extensions
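The folder check can be sketched as follows. The function name is hypothetical, and several details are assumptions: files with other endings are ignored, extensions are compared case-insensitively, and None is returned when no matching file is found.

```python
import tempfile
from pathlib import Path


def folder_source_format_sketch(input_folder,
                                file_types=(".srt", ".sbv", ".json", ".rev")):
    """Return the single transcription extension used in input_folder."""
    found = {
        path.suffix.lower()
        for path in Path(input_folder).iterdir()
        if path.is_file() and path.suffix.lower() in file_types
    }
    if len(found) > 1:
        raise ValueError(f"mixed transcription formats: {sorted(found)}")
    if not found:
        return None  # assumption: an empty result is not treated as an error
    return found.pop().lstrip(".")


# Usage: a uniform folder yields its format; adding a second format raises.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.srt", "b.srt", "notes.txt"):
        Path(folder, name).write_text("")
    uniform = folder_source_format_sketch(folder)
    Path(folder, "c.sbv").write_text("")
    try:
        folder_source_format_sketch(folder)
        mixed_raised = False
    except ValueError:
        mixed_raised = True
```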
-
textgrid_convert.ttextgrid_convert.guess_source_format(input_path, extension_map={'json': 'rev', 'sbv': 'sbv', 'srt': 'srt'})¶ Based on file extension of input_path, guess the format of transcription file.
Parameters: - input_path (str) – file name
- extension_map (dict) – dictionary {file_ext: format}, e.g. {“srt”: “srt”}
Returns: format string, None if not found in extension_map
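The extension lookup described above amounts to a dictionary lookup on the file suffix. A minimal sketch follows; the function name is hypothetical and lower-casing the extension is an assumption.

```python
from pathlib import Path


def guess_source_format_sketch(input_path, extension_map=None):
    """Map the file extension of input_path to a format name, or None."""
    if extension_map is None:
        extension_map = {"json": "rev", "sbv": "sbv", "srt": "srt"}
    extension = Path(input_path).suffix.lstrip(".").lower()
    return extension_map.get(extension)
```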
-
textgrid_convert.ttextgrid_convert.main(source_format, to, input_path, output_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textgrid-convert/checkouts/stable/docs'), suffix='_TEXTGRID.txt', strict=True)¶ Convert file(s) at input_path from source_format to TextGrid. Optionally, write to output_path. Example: convert from=sbv to=TextGrid and write to output_path=“home/patrick/output”
Parameters: - source_format (str) – file ending, currently accepts sbv and srt
- to (str) – file ending, only accepts TextGrid atm
- input_path (str) –
- output_path (str) –
- suffix (str) – string to append to file name for writing out TextGrid
- strict (Bool) – if True, will not overwrite files